How to set up PySpark for your Jupyter notebook

New configuration (How To)

Situation

PySpark allows Python programmers to interface with the Spark framework to manipulate data at scale and work with objects over a distributed filesystem.

Backup

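Python 3.4 or later is required for the PySpark release used in this guide (Spark 2.3.1), so make sure it is installed before continuing; earlier Python 3 versions will not work. Newer Spark releases have higher minimums, so check the documentation for the release you download. Check the installed version: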

python3 --version
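
The same check can be made from inside Python, which is handy later in a notebook cell. This is a minimal sketch, assuming the 3.4 minimum stated above; the helper name meets_minimum is made up for illustration.

```python
import sys

def meets_minimum(version_info, minimum=(3, 4)):
    """Return True if the (major, minor, ...) tuple is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)

# Check the interpreter that is running this code.
if meets_minimum(sys.version_info):
    print("Python version is new enough:", sys.version.split()[0])
else:
    print("Python is too old for this guide:", sys.version.split()[0])
```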

Install the pip3 tool.

sudo apt install python3-pip

Install Jupyter for Python 3.

pip3 install jupyter

Augment the PATH variable to launch Jupyter Notebook easily from anywhere.

export PATH=$PATH:~/.local/bin
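
To confirm the change took effect, you can check whether the directory is on PATH and whether the jupyter launcher resolves. This is a small sketch; dir_on_path is a hypothetical helper, and ~/.local/bin is the directory added by the export above.

```python
import os
import shutil

def dir_on_path(directory, path_value):
    """Return True if `directory` is one of the entries in a PATH-style string."""
    target = os.path.normpath(os.path.expanduser(directory))
    entries = [os.path.normpath(os.path.expanduser(p))
               for p in path_value.split(os.pathsep) if p]
    return target in entries

print("~/.local/bin on PATH:", dir_on_path("~/.local/bin", os.environ.get("PATH", "")))
# shutil.which returns the full path of the executable, or None if it is not found.
print("jupyter resolves to:", shutil.which("jupyter"))
```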

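Install Java. This guide uses Oracle Java 8, installed from the webupd8team PPA. (That PPA has since been discontinued; if the packages are unavailable, an OpenJDK 8 package such as openjdk-8-jdk is a common substitute, and the JAVA_HOME path set later must then be adjusted to match.)

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
sudo apt-get install oracle-java8-set-default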

Check the installation.

java -version
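
This guide installs Java 8, so it can help to read the major version out of that output programmatically. This is a rough sketch: java -version prints to standard error, and Java 8 reports itself as 1.8.x while later releases report 9, 11, and so on. The helper name java_major is illustrative.

```python
import re
import subprocess

def java_major(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Java 8 and earlier use the legacy "1.x" scheme, e.g. "1.8.0_181".
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first

try:
    # `java -version` writes its report to stderr, not stdout.
    result = subprocess.run(["java", "-version"], stderr=subprocess.PIPE,
                            stdout=subprocess.PIPE, universal_newlines=True)
    print("Detected Java major version:", java_major(result.stderr))
except OSError:
    print("java could not be run; is it on PATH?")
```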

Set the Java-related environment variables.

export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export JRE_HOME=/usr/lib/jvm/java-8-oracle/jre
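
A typo in either path tends to surface only later as an obscure Spark startup error, so it may be worth verifying them now. This sketch assumes a Java home is recognisable by a bin/java file inside it; looks_like_java_home is a hypothetical helper name.

```python
import os

def looks_like_java_home(path):
    """Return True if `path` contains a bin/java file, as a JDK or JRE home would."""
    return bool(path) and os.path.isfile(os.path.join(path, "bin", "java"))

for name in ("JAVA_HOME", "JRE_HOME"):
    value = os.environ.get(name)
    status = "ok" if looks_like_java_home(value) else "not set or no bin/java"
    print(name, "=", value, "->", status)
```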

Install Scala.

sudo apt-get install scala

Check the Scala installation.

scala -version

Install py4j for the Python-Java integration.

pip3 install py4j
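
At this point both pip3 installs should be importable from Python 3. This is a minimal sketch for checking that; missing_modules is an illustrative helper, and "notebook" is assumed to be the module name provided by the Jupyter install.

```python
import importlib.util

def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# "notebook" is assumed to be the module provided by the Jupyter install above.
print("Missing modules:", missing_modules(["notebook", "py4j"]))
```

An empty list means both packages were found by the interpreter that ran the check.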

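Install Apache Spark. Go to the Spark download page and choose a release; the page defaults to the latest one. This guide uses Spark 2.3.1 pre-built for Hadoop 2.7, so adjust the file and directory names below if you pick a different release. After downloading, unpack the archive in the location where you want to keep Spark.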

sudo tar -zxvf spark-2.3.1-bin-hadoop2.7.tgz
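
After unpacking, a quick look at the directory layout can confirm the archive extracted cleanly. This is a minimal sketch, assuming the usual layout of a Spark binary distribution (a bin/pyspark launcher and a python/pyspark package). The helper name is hypothetical, and the path is a placeholder to replace with your own location.

```python
import os

def missing_spark_parts(spark_dir):
    """Return the expected relative paths that are absent under `spark_dir`."""
    expected = [
        os.path.join("bin", "pyspark"),      # launcher script
        os.path.join("python", "pyspark"),   # Python package directory
    ]
    return [rel for rel in expected
            if not os.path.exists(os.path.join(spark_dir, rel))]

# Placeholder path: point this at the directory created by the tar command.
spark_dir = os.path.abspath("spark-2.3.1-bin-hadoop2.7")
print("Missing under", spark_dir, ":", missing_spark_parts(spark_dir))
```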

Now, add a long set of commands to your .bashrc shell script. These will set environment variables to launch PySpark with Python 3 and enable it to be called from Jupyter Notebook. Take a backup of .bashrc before proceeding.

Open .bashrc using any editor you like, such as gedit .bashrc. Add the following lines at the end:

export SPARK_HOME='/{YOUR_SPARK_DIRECTORY}/spark-2.3.1-bin-hadoop2.7'
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
export PATH=$SPARK_HOME/bin:$PATH:~/.local/bin:$JAVA_HOME/bin:$JAVA_HOME/jre/bin

Remember to replace {YOUR_SPARK_DIRECTORY} with the directory where you unpacked Spark above.
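
Once the shell has reloaded .bashrc (for example in a new terminal), the settings can be sanity-checked from Python. The sketch below compares the environment against the values from the lines above and defines a small smoke test that starts a local Spark session. check_pyspark_env and smoke_test are illustrative names, and local[*] is an assumed master setting for a single-machine run.

```python
import os

EXPECTED = {
    "PYSPARK_DRIVER_PYTHON": "jupyter",
    "PYSPARK_DRIVER_PYTHON_OPTS": "notebook",
    "PYSPARK_PYTHON": "python3",
}

def check_pyspark_env(env):
    """Return a list of human-readable problems found in the mapping `env`."""
    problems = []
    if not env.get("SPARK_HOME"):
        problems.append("SPARK_HOME is not set")
    for name, wanted in EXPECTED.items():
        if env.get(name) != wanted:
            problems.append("%s should be %r, found %r" % (name, wanted, env.get(name)))
    return problems

def smoke_test():
    """Start a local Spark session, count a 5-row range, and stop the session."""
    # Imported here so the environment check above runs even without Spark.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.master("local[*]").appName("setup-check").getOrCreate()
    try:
        return spark.range(5).count()
    finally:
        spark.stop()

for problem in check_pyspark_env(os.environ):
    print("Problem:", problem)
```

In a notebook cell, calling smoke_test() should return 5 if Spark, Java, and py4j are wired together correctly.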

Solution

Solution type

Permanent