Setting up Apache (Py)Spark with Jupyter Notebook in Arch Linux

This blog post details the process I took to install Apache Spark
on Arch Linux and its subsequent integration with Jupyter Notebook.
In what follows, I describe an Arch Linux adaptation of
rocket-ron's Spark installation guide for Mac OS X[5].

What Is Apache Spark?

“Apache Spark is a fast and general engine for large-scale data processing”
(Apache Spark Website[2]).

As the website states, Spark is an open-source platform for
processing data on a large scale, with support for cluster computing.
It offers APIs in Java, Python, R, and Scala, but (as you can
probably guess from this blog post's title) I will focus on the
Python API – PySpark.

Some features of Spark:

  • MLlib – Machine Learning Library
  • Spark SQL Context – Interface for executing SQL queries on a dataset (see the example below)
  • Support for Multiple Data Formats – HDFS, SQL, compressed GZIP text files, plain text files, and more…
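
As a quick taste of the SQL interface, here is a minimal sketch in PySpark (my own example, not from the Spark docs – the table name, column names, and data are made up, and it assumes the PySpark setup described later in this post):

import pyspark
from pyspark.sql import SQLContext

# Assumes no SparkContext is already running in this session
sc = pyspark.SparkContext()
sql_context = SQLContext(sc)

# Build a tiny DataFrame from an in-memory list and register it as a temp table
people = sql_context.createDataFrame([("Alice", 34), ("Bob", 23)], ["name", "age"])
people.registerTempTable("people")

# Query it with plain SQL (Spark 1.6-era API)
adults = sql_context.sql("SELECT name FROM people WHERE age > 30")
print(adults.collect())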

More information can be found on the Apache Spark Website.

Installing Apache Spark

Arch and OS X use different package managers. On Arch I used the AUR helper
[4] packer, which allows the installation of Spark and its dependencies
(e.g. Scala, Hadoop, Python, etc.) with a single command:

packer -S apache-spark  

Setup Jupyter PySpark Kernel

Even though the packer command installs Spark and all its dependencies,
Spark still requires the environment to be set up correctly.
Following rocket-ron’s example, I set up a Jupyter Notebook kernel to
provide an interactive coding environment. This process involves two
tasks: installing the JSON Kernel Configuration file, and setting up
the startup script for the PySpark profile.

Of course, this requires Jupyter Notebook to be installed on the system.
This can be done by running:

sudo pacman -S jupyter-notebook  

and entering your user password if prompted.

1. Kernel Configuration

The content of the kernel.json file, which is used to install the
PySpark kernel, is:

{ "language": "python", "argv": [  "python",  "-m",  "ipykernel",  "--profile=pyspark",  "-f",  "{connection_file}" ], "display_name": "pySpark (Spark 1 .6.1)"}

I put this kernel.json in its own directory (e.g. ~/pyspark/) so that I
could install it via the Jupyter command line:

sudo jupyter kernelspec install ~/pyspark/
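
Optionally, you can check that the file parses as valid JSON before installing it (this check is my own addition; the path simply mirrors the directory used above):

import json
import os

# Path assumed to match the ~/pyspark/ directory used above
kernel_file = os.path.expanduser("~/pyspark/kernel.json")
with open(kernel_file) as f:
    spec = json.load(f)   # raises ValueError if the JSON is malformed
print(spec["display_name"])
print(spec["argv"])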

2. PySpark Profile Startup Script

To set up the PySpark profile's startup script, we need to create the file
~/.ipython/profile_pyspark/startup/00-pyspark-setup.py with the following content:

# Configure the necessary Spark environment
import os
import sys

pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
if "pyspark-shell" not in pyspark_submit_args:
    pyspark_submit_args += " pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

spark_home = os.environ.get("SPARK_HOME", "/opt/apache-spark")
sys.path.insert(0, spark_home + "/python")

# Add py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.9-src.zip"))

# Initialize PySpark
major_version = sys.version_info.major
pyspark_shell_file = os.path.join(spark_home, "python/pyspark/shell.py")
if major_version == 2:
    execfile(pyspark_shell_file)
elif major_version == 3:
    with open(pyspark_shell_file) as f:
        code = compile(f.read(), pyspark_shell_file, "exec")
        exec(code)
else:
    error_str = "Unrecognised Python Version: {0}".format(major_version)
    raise EnvironmentError(1, error_str, "00-pyspark-setup.py")

This script runs the initial code required to set up a SparkContext object
ready for use. This object tells Spark how to access a cluster
[8] and acts as the interface for initial
Spark operations, such as loading a dataset into the Spark environment.
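
For reference, here is a small sketch (mine, not part of the startup script) showing how a SparkContext can also be configured explicitly with SparkConf; the master URL local[*] simply means "use all local cores" and would be replaced by a cluster URL in a real deployment:

import pyspark

# Describe how to reach the cluster: local[*] = run locally on all available cores
conf = (pyspark.SparkConf()
        .setMaster("local[*]")
        .setAppName("pyspark-demo"))   # the app name is arbitrary

# Assumes no SparkContext is already running (e.g. stop the kernel's sc first)
sc = pyspark.SparkContext(conf=conf)
print(sc.version)   # e.g. 1.6.1
sc.stop()           # release the context when done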

Note: At the time of writing, the apache-spark AUR package installed Spark
in /opt/apache-spark. If this changes, or you manually install Spark
elsewhere, be sure to set the SPARK_HOME environment variable to point to
the installation, e.g. add export SPARK_HOME=/path/to/spark to your .bashrc file.
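
A quick way to confirm that the variable is picked up correctly (again, my own addition; the fallback path is just the AUR default mentioned above):

import os

# Fall back to the AUR package's default location, as the startup script does
spark_home = os.environ.get("SPARK_HOME", "/opt/apache-spark")
shell_py = os.path.join(spark_home, "python", "pyspark", "shell.py")

if os.path.isfile(shell_py):
    print("Found Spark at: " + spark_home)
else:
    print("SPARK_HOME does not look like a Spark install: " + spark_home)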

Checking It Works

Now, with Jupyter Notebook running, you should be able to create a
new notebook with the pySpark (Spark 1.6.1) kernel, which can be tested
with the word count example below in a cell:

import pyspark

spark_context = pyspark.SparkContext()
lines = spark_context.textFile("filename")
words = lines.flatMap(lambda line: line.split())
count = words.count()
print("Word Count: " + str(count))

Where filename is the path of a text file.
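
The example above only counts the total number of words. A more traditional word count – occurrences per distinct word – can be sketched like this, reusing the lines RDD from the previous cell (my own extension, not from the original guide):

# Count occurrences of each distinct word with map + reduceByKey
word_counts = (lines.flatMap(lambda line: line.split())
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))

# Print the ten most frequent words
for word, n in word_counts.takeOrdered(10, key=lambda pair: -pair[1]):
    print(word, n)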

ALAS, You Could Use Vagrant

In search of more power during an assignment, I decided to set Spark up
on my gaming PC, which runs Windows. Predicting that installing Spark on
Windows would undoubtedly have been a headache, I prepared a Vagrantfile
and provisioning scripts for an Arch VM. These files have been pushed to the
ALAS (Arch Linux Apache Spark) GitHub repository[1].

Debian/Ubuntu Based Systems

For those using a derivative of the Debian Linux distribution (e.g. Ubuntu),
an in-depth guide is provided by Kristian Holsheimer on
GitHub[7].

Sources:

  1. ALAS – Arch Linux Apache Spark Repository
  2. Apache Spark Website
  3. AUR – Apache Spark Package
  4. AUR Helpers
  5. Configuring Spark 1.6.1 to work with Jupyter 4.x Notebooks on Mac OS X with Homebrew
  6. Jupyter Website
  7. Spark + PySpark setup guide
  8. Spark Programming Guide – Initialising Spark section
  9. Vagrant Website