Situation
PySpark is an interface for Apache Spark in Python. It not only allows you to write Spark applications using Python APIs, but also provides the PySpark shell for interactively analyzing your data in a distributed environment. PySpark supports most of Spark’s features such as Spark SQL, DataFrame, Streaming, MLlib (Machine Learning) and Spark Core.
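For example, once PySpark is importable, a few lines are enough to exercise the DataFrame and Spark SQL APIs mentioned above. A minimal sketch; the app name and sample data are illustrative:

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session; "local[*]" uses all local cores.
spark = SparkSession.builder.master("local[*]").appName("example").getOrCreate()

# Build a small DataFrame and query it with Spark SQL.
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()

spark.stop()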
Solution
First, install PySpark from the PyCharm interface by following these steps:
- Go to File -> Settings -> Project Interpreter
- Click the install button and search for PySpark
- Click the Install Package button.
- Alternatively, skip the package install and use a user-provided Spark installation, configured manually via the environment variables below.
Now, create a Run configuration:
- Go to Run -> Edit Configurations
- Add a new Python configuration
- Set the Script path so that it points to the script you want to execute
- Edit the Environment variables field so that it contains at least the following (a sketch of the in-script alternative follows this list):
SPARK_HOME – should point to the directory with the Spark installation. It should contain directories such as bin (with spark-submit, spark-shell, etc.) and conf (with spark-defaults.conf, spark-env.sh, etc.)
PYTHONPATH – should contain $SPARK_HOME/python and, if not available otherwise, $SPARK_HOME/python/lib/py4j-some-version-src.zip, where some-version matches the Py4J version bundled with the given Spark installation (Py4J 0.8.2.1 for Spark 1.5, 0.9 for 1.6, 0.10.3 for 2.0, 0.10.4 for 2.1 and 2.2, 0.10.6 for 2.3)
- Apply the settings
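If you would rather resolve these paths inside the script itself (instead of, or in addition to, the Run configuration), a common pattern is to extend sys.path before importing pyspark. A minimal sketch; /opt/spark is a placeholder for your actual Spark installation directory:

import glob
import os
import sys

# Placeholder: adjust to your actual Spark installation directory.
os.environ.setdefault("SPARK_HOME", "/opt/spark")
spark_home = os.environ["SPARK_HOME"]

# Make the PySpark sources and the bundled Py4J zip visible to this interpreter.
sys.path.append(os.path.join(spark_home, "python"))
sys.path.extend(glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")))

import pyspark  # should now import cleanly
print(pyspark.__version__)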
Add the PySpark library to the interpreter path (required for code completion):
- Go to File -> Settings -> Project Interpreter
- Open the settings for the interpreter you want to use with Spark
- Edit the interpreter paths so they include $SPARK_HOME/python (and the Py4J zip if required)
- Save the settings
Finally, use the newly created configuration to run your script.
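A small smoke-test script makes a good first target for the new configuration. A sketch, with an arbitrary app name:

from pyspark.sql import SparkSession

# Arbitrary app name; any name works for a local smoke test.
spark = SparkSession.builder.appName("pycharm-smoke-test").getOrCreate()
print("Spark version:", spark.version)

# A trivial job confirming that executors can run Python code.
count = spark.sparkContext.parallelize(range(100)).filter(lambda x: x % 2 == 0).count()
print("Even numbers counted:", count)

spark.stop()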