Setting up a Local Spark Development Environment using Docker

apache spark distributed computing docker python Mar 02, 2019

Every time I want to get started with a new technology, I try to get a stack up and running that resembles a real-world production instance as closely as possible.

This is a get-up-and-running post. It does not get into the nitty-gritty details of developing with Spark, since I am only just getting comfortable with Spark myself. Mostly I wanted to get up and running, and to write about some of the issues that came up along the way.

 

What is Spark?

Spark is a distributed computing framework with support for Java, Scala, Python, and R. It's what I refer to as a world-domination technology: you want to do lots of computations, and you want to do them fast. It handles everything from embarrassingly parallel workloads, such as parallelizing a for loop, to complex workflows, with support for distributed machine learning as well. You can transparently scale your computations out not only to multiple cores, but to multiple machines by creating a Spark cluster. How cool is that?
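To make that concrete, here is a minimal sketch, assuming you have pyspark installed locally, of parallelizing a simple for-loop style computation across whatever cores are available:

# A minimal sketch, assuming a local pyspark install.
# Square the numbers 0-99 in parallel and sum the results.
import pyspark

with pyspark.SparkContext("local[*]", "SquareSum") as sc:
    total = sc.parallelize(range(100)).map(lambda x: x * x).sum()
    print(total)  # 328350
Python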

My favorite introduction to Spark and the Spark ecosystem is here at the MapR blog.

 

Why should I learn Spark?

Well, I'm not sure why you should learn Spark! I can tell you why I am interested in Spark and distributed computing in general.

A big part of my day job (you know, when I'm not hanging out with my kids at the beach, drawing, or playing video games) is speeding up Research and Data Scientists' computational workflows. This is one of my favorite aspects of my job. I love making stuff fast!

Most of the time I'm able to break the analysis itself into smaller pieces, and run each of the smaller pieces in parallel using traditional High Performance Computing (HPC) technologies. Ok, mostly I abuse the HPC scheduler, and use it to set up a graph of the analysis workflow, which it then executes.

All of this is done without ever touching the code itself. More recently I have become interested in speeding computations up IN the code. Mostly I am looking for libraries that can be dropped into an existing codebase reasonably easily, without having to write my own thread queues, write any MPI code, or hand-vectorize anything.

 

Cool Stuff! How can I learn Spark?

Spark is a pretty big framework and ecosystem. There are a ton of getting-started tutorials on the web, which is basically what I'll go over now. If you're looking for something more machine-learning based, there is a course by the awesome Jose Portilla on Udemy. I haven't taken that course, but I took his Python Basics on YouTube, and I'm slowly moving along in his TensorFlow course. I haven't decided whether I want to dig deep into Spark or Dask, another distributed computing technology, but if I decide on Spark I will definitely take his course.

Install Spark and Get the Code

As usual, I'm not going to actually install Spark. I'm going to find myself a nice Dockerfile. Then I'm not going to be entirely happy with it and will install conda on top. That is one of the many perks of using Docker: you can take other people's Dockerfiles and either modify them or use them as your base layer.

I wandered the internet in my "I don't want to install this" quest, and found this great GitHub repo from gettyimages. They have plenty of other helpful images and docker-compose configurations too.

I forked the getty GitHub repo and added my own flavor to their already awesome Dockerfile to add Miniconda and PySpark. Get my project template here, then bring the cluster up:


docker-compose up -d
Bash

I just want to add that if you're using a cloud provider, it will already have its own bootstrap method for Spark that lives outside of Docker. I tend to be either working on internal infrastructure or playing with things from my laptop, and then Docker is the way to go!

Since this is running with docker-compose magic, we have the Spark web UI all ready to go too.

There's nothing especially interesting happening here, because we haven't run anything. So let's find some examples!

Run some examples

My favorite way to learn anything is to start tinkering with simple hello-world examples. There are tons of examples in the official Spark GitHub repo, available here.

 

Run the sort example

Now, I want to run the sort example available from the Spark repo.

Something important to note is how the SparkContext is initialized. Some of the examples only run in the local context, which is fine for playing around, but the whole point of using docker-compose is to simulate a cluster and get our nice web UI.

This example takes an input file of numbers and sorts it. If you're following along with the GitHub repo, you'll have the file available; mine has the numbers 1-10, with the odd numbers listed first and the even numbers listed second. Otherwise you will need to supply your own. Just make sure it's only numbers.
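If you need to generate the input file yourself, a quick sketch like this will do; the path is just an assumption, so point it at whatever directory gets mounted into the container:

# Hypothetical helper: write the numbers 1-10, odds first then evens,
# one per line, to the data directory that gets bound into the container.
with open("data/sort_this", "w") as f:
    for n in list(range(1, 11, 2)) + list(range(2, 11, 2)):
        f.write(f"{n}\n")
Python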

If you take a look at the compose file, you will see that the data dir in the GitHub repo is bound as /tmp/data in the docker container. If you change this, you will also need to change the file mappings below.
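For reference, the job itself boils down to something like this. This is a rough sketch of what the sort example does, not the upstream file, and it assumes one number per line in the input and the cluster master URL discussed later in this post:

# A rough sketch of the sort example; assumes one number per line in
# the input file, passed as the first argument. The real example ships
# with Spark, so treat this as illustrative only.
import sys
import pyspark

with pyspark.SparkContext("spark://master:7077", "PySparkSort") as sc:
    lines = sc.textFile(sys.argv[1])
    sorted_numbers = lines.map(lambda x: (int(x), 1)).sortByKey()
    for number, _ in sorted_numbers.collect():
        print(number)
Python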

# Run one of the examples that ships with Spark. This is a great way to make sure your cluster initialized correctly.
docker-compose exec master bin/run-example SparkPi 10
# Run the sort job from the spark examples
docker-compose exec master python /tmp/data/sort.py /tmp/data/sort_this
Bash

Check it out again in the Spark UI at http://localhost:8090/.

You should see your applications under 'Completed', unless you opened your browser particularly quickly, in which case they may still be under 'Running'.

If you click on the application URL you get some more detailed information, including stdout and stderr logs.

 

Run the word count example

Somewhere on the internet I found a word count example that had me stumped: running with the defaults, it didn't show up in the web UI. This led me to investigate the SparkContext, which controls where the job is submitted. Many of the examples you see online use a 'local' context. This is fine for dev work, and possibly even for some debugging, but it won't scale out to your cluster. You need to ensure the SparkContext master is set to spark://master:7077. The initial spark:// will always be the same. The 'master' part is the hostname, which in our case maps to the docker-compose service name. If you renamed your service to 'spark-master', you would need to update this accordingly.

For this particular example I needed to make a small change to get it working on my cluster and showing up in the UI.

## Local context, good for debugging
with pyspark.SparkContext("local", "PySparkWordCount") as sc:

## Deploy to our docker cluster, as things should be!
with pyspark.SparkContext("spark://master:7077", "PySparkWordCount") as sc:
Python
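Putting it together, a word count script along these lines should run against the cluster and show up in the UI. This is a sketch rather than the exact file I found, and the input path is a placeholder:

import pyspark

# A sketch of a word count job wired for the docker-compose cluster.
# The input path is a placeholder; point it at any text file bound
# into the container.
with pyspark.SparkContext("spark://master:7077", "PySparkWordCount") as sc:
    counts = (
        sc.textFile("/tmp/data/some_text_file.txt")
        .flatMap(lambda line: line.split())
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
    )
    for word, count in counts.collect():
        print(word, count)
Python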

You can submit it to the cluster, and view the job in the web UI like so:

docker-compose exec master python /tmp/data/word_count.py
Bash

Wrap Up

Hopefully this gave you a good idea of how to scour the internet for useful Docker images and how to submit jobs to a Spark cluster. In the future I hope to get more into the details of developing with Spark and the various APIs for making my code run super fast. I also want to put up a full-scale "deploy a Spark cluster with Docker" tutorial at some point, for those of us who aren't lucky enough to be on the cloud full time yet!
