PySpark allows Python programmers to interface with the Spark framework to manipulate data at scale and work with objects over a distributed filesystem.
The version of PySpark used here requires Python 3.4 or later when run under Python 3, so make sure a suitable version is installed before continuing. (Earlier Python 3 releases will not work.) Check your version:
python3 --version
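The same check can be done from inside Python, which is handy at the top of a setup script. This is a minimal sketch: the 3.4 minimum comes from the text above, and the helper name python_is_supported is made up for illustration.

```python
import sys

# Minimum version taken from the text above (Python 3.4+).
MIN_PYTHON = (3, 4)


def python_is_supported(version_info=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = "{}.{}".format(*sys.version_info[:2])
    if python_is_supported():
        print("Python", current, "is recent enough")
    else:
        print("Python", current, "is too old; 3.4+ is needed")
```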
Install the pip3 tool.
sudo apt install python3-pip
Install Jupyter for Python 3.
pip3 install jupyter
Add ~/.local/bin to the PATH variable so that Jupyter Notebook can be launched from anywhere.
export PATH=$PATH:~/.local/bin
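Install Java 8, which Spark 2.3.1 runs on. The commands below use the Oracle installer from the webupd8team PPA. That PPA has since been discontinued, so on a current system an OpenJDK 8 package (for example openjdk-8-jdk) is the likely substitute, in which case the JAVA_HOME paths used later will differ.
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
sudo apt-get install oracle-java8-set-default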
Check the installation.
java -version
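If you would rather confirm the Java major version from Python, something like the following should work. It is only a sketch: the helper names are hypothetical, and it relies on two things worth knowing, namely that java -version writes its banner to stderr and that releases up to Java 8 report themselves as 1.x (for example 1.8.0_181).

```python
import re
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if not match:
        return None
    first = int(match.group(1))
    # Releases up to Java 8 use the 1.x scheme, so the second number is the major version.
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def installed_java_major():
    """Run `java -version` and return the major version, or None if unavailable."""
    try:
        # The banner goes to stderr, so fold stderr into the captured output.
        output = subprocess.check_output(
            ["java", "-version"],
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_java_major(output)


if __name__ == "__main__":
    print("Detected Java major version:", installed_java_major())
```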
Set the Java-related environment variables.
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export JRE_HOME=/usr/lib/jvm/java-8-oracle/jre
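A wrong JAVA_HOME is a common cause of Spark failing to start, so it can be worth checking that the variable points at a real JDK directory. The sketch below assumes the usual layout with a java executable under bin; java_home_problems is a hypothetical helper name.

```python
import os


def java_home_problems(environ=None):
    """Return a list of problems found with JAVA_HOME (empty if none)."""
    if environ is None:
        environ = os.environ
    java_home = environ.get("JAVA_HOME")
    if not java_home:
        return ["JAVA_HOME is not set"]
    if not os.path.isdir(java_home):
        return ["JAVA_HOME is not a directory: " + java_home]
    # Assumption: a usable installation has an executable at bin/java.
    java_bin = os.path.join(java_home, "bin", "java")
    if not os.path.isfile(java_bin):
        return ["no java executable at " + java_bin]
    return []


if __name__ == "__main__":
    for problem in java_home_problems() or ["JAVA_HOME looks fine"]:
        print(problem)
```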
Install Scala.
sudo apt-get install scala
Check the Scala installation.
scala -version
Install py4j for the Python-Java integration.
pip3 install py4j
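To confirm that py4j landed in the Python 3 environment you intend to use, you can ask Python whether it can locate the module. This is a small sketch using the standard importlib machinery; is_importable is an illustrative name, not part of py4j.

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_importable("py4j"):
        print("py4j is available")
    else:
        print("py4j was not found; re-run pip3 install py4j")
```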
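Install Apache Spark: go to the Spark download page and choose the latest (default) version. I am using Spark 2.3.1 pre-built for Hadoop 2.7, so adjust the file name in the commands below if you download a different release. After downloading, unpack the archive in the directory where you want Spark to live (sudo is only needed if that directory is not writable by your user).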
sudo tar -zxvf spark-2.3.1-bin-hadoop2.7.tgz
Now add a few lines to your .bashrc file. They set the environment variables needed to launch PySpark with Python 3 and to use it from Jupyter Notebook. Back up .bashrc before proceeding.
Open .bashrc in any editor you like, for example with gedit ~/.bashrc, and add the following lines at the end:
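export SPARK_HOME='/{YOUR_SPARK_DIRECTORY}/spark-2.3.1-bin-hadoop2.7'
# Start the PySpark driver through Jupyter, opening the notebook interface.
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
# The pyspark launcher lives in $SPARK_HOME/bin.
export PATH=$SPARK_HOME/bin:$PATH:~/.local/bin:$JAVA_HOME/bin:$JAVA_HOME/jre/bin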
Remember to replace {YOUR_SPARK_DIRECTORY} with the directory where you unpacked Spark above, then run source ~/.bashrc (or open a new terminal) so the changes take effect.
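Once the variables are in place, a quick sanity check from Python can catch the most common slips, such as a leftover placeholder in SPARK_HOME. The following is a sketch under stated assumptions: the helper names are hypothetical, and spark_smoke_test only works in a session where pyspark is importable (for example a notebook started through the pyspark launcher).

```python
import os

# Variables set by the .bashrc lines above.
EXPECTED_VARS = ("SPARK_HOME", "PYSPARK_DRIVER_PYTHON", "PYSPARK_PYTHON")
PLACEHOLDER = "{YOUR_SPARK_DIRECTORY}"


def spark_env_problems(environ=None):
    """Return a list of problems found in the Spark-related variables."""
    if environ is None:
        environ = os.environ
    problems = [
        "{} is not set".format(name)
        for name in EXPECTED_VARS
        if not environ.get(name)
    ]
    spark_home = environ.get("SPARK_HOME", "")
    if PLACEHOLDER in spark_home:
        problems.append("SPARK_HOME still contains " + PLACEHOLDER)
    elif spark_home and not os.path.isdir(spark_home):
        problems.append("SPARK_HOME is not a directory: " + spark_home)
    return problems


def spark_smoke_test():
    """Obtain a Spark session and count the rows of a five-row DataFrame."""
    # Imported lazily so the environment check above works without pyspark.
    from pyspark.sql import SparkSession

    # getOrCreate() reuses a session if one already exists in this process.
    spark = SparkSession.builder.appName("smoke-test").getOrCreate()
    return spark.range(5).count()


if __name__ == "__main__":
    for problem in spark_env_problems() or ["Spark variables look fine"]:
        print(problem)
```

If the environment check reports no problems, calling spark_smoke_test() from a notebook cell should return 5 when Spark starts correctly.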