Docker Compose for Apache Spark

Use Apache Spark to showcase building a Docker Compose stack

In this article, I present a way to build a clustered Apache Spark setup using Docker Compose.

Motivation

In a typical development setup for an Apache Spark application, one is generally limited to running a single-node Spark application on a local machine (like a laptop).

Also, Docker is generally already installed on most developers' machines. The better-case scenario is a Linux host operating system (instead of Windows or macOS, as the latter two tend to be a bit slower with Docker).

The need for a multi-node Apache Spark instance arises when spark-submit based features are being built into a service under development.

Setup

Docker (most commonly installed as Docker CE) needs to be installed along with docker-compose.

Once installed, make sure the Docker service is running.
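For example, a quick way to confirm that both the daemon and docker-compose are available (the output will vary by machine):

$ docker info --format '{{.ServerVersion}}'   # prints the daemon version only if the service is up
$ docker-compose --version                    # confirms docker-compose is on the PATH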

On a side note, if you would prefer to clone the code, feel free to clone https://github.com/babloo80/docker-spark-cluster

Base Image

In order to get started, we'll need to set up a base image containing Apache Spark 2.4.4 (please substitute the version that fits your requirements).

$ cat docker/base/Dockerfile

containing instructions to grab the Spark binary from Apache on top of a container image providing Java 8.

FROM openjdk:8u222-stretch

ENV DAEMON_RUN=true
ENV SPARK_VERSION=2.4.4
ENV HADOOP_VERSION=2.7
ENV SPARK_HOME=/spark

RUN apt-get update && apt-get install -y curl vim wget software-properties-common ssh net-tools ca-certificates jq tar

RUN wget --no-verbose http://apache.mirror.iphh.net/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
 && tar -xvzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
 && mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} spark \
 && rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
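One caveat (my own note, not from the original repo): Apache mirrors such as apache.mirror.iphh.net generally host only the most recent releases, so the wget above may fail for older Spark versions. The long-lived archive should work as a drop-in replacement for that download URL:

# Assumed fallback; archive.apache.org keeps all past Spark releases
wget --no-verbose https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz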

Using this as our base image, we can build the next two images.

Master Image

$ cat docker/spark-master/Dockerfile

containing a launcher script to start the master node.

FROM spark-base:latest

COPY start-master.sh /

ENV SPARK_MASTER_PORT 7077
ENV SPARK_MASTER_WEBUI_PORT 8080
ENV SPARK_MASTER_LOG /spark/logs

CMD ["/bin/bash", "/start-master.sh"]

And start-master.sh

export SPARK_MASTER_HOST=`hostname`
. "/spark/sbin/spark-config.sh"
. "/spark/bin/load-spark-env.sh"
mkdir -p $SPARK_MASTER_LOG
export SPARK_HOME=/spark
ln -sf /dev/stdout $SPARK_MASTER_LOG/spark-master.out
cd /spark/bin && /spark/sbin/../bin/spark-class org.apache.spark.deploy.master.Master --ip $SPARK_MASTER_HOST --port $SPARK_MASTER_PORT --webui-port $SPARK_MASTER_WEBUI_PORT >> $SPARK_MASTER_LOG/spark-master.out

Worker Image

$ cat docker/spark-worker/Dockerfile

containing a launcher script for a worker (similar to the master), except that we'll have one or more workers to generalize with.

FROM spark-base:latest

COPY start-worker.sh /

ENV SPARK_WORKER_WEBUI_PORT 8081
ENV SPARK_WORKER_LOG /spark/logs

CMD ["/bin/bash", "/start-worker.sh"]

The next step is to build the images in preparation for docker-compose.yml. In this step, we build all the Dockerfiles we just created by executing the script below.

$ cat build-images.sh
#!/bin/bash
set -e

docker build -t spark-base:latest ./docker/base
docker build -t spark-master:latest ./docker/spark-master
docker build -t spark-worker:latest ./docker/spark-worker
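From the repository root, the script can be run directly (adding the executable bit first if needed):

$ chmod +x build-images.sh
$ ./build-images.sh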

The last piece is docker-compose.yml. Here, we assign an easy-to-remember IP address, 10.5.0.2, to the master node so that one can hardcode the Spark master URL as spark://10.5.0.2:7077.

We also set up two worker instances with 4 cores and 2 GB of memory each. For more information on the possible environment variables for the master/worker, please refer to http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts

version: "3.7"
networks:
  spark-network:
    ipam:
      config:
        - subnet: 10.5.0.0/16
services:
  spark-master:
    image: spark-master:latest
    ports:
      - "9090:8080"
      - "7077:7077"
    volumes:
      - ./apps:/opt/spark-apps
      - ./data:/opt/spark-data
    environment:
      - "SPARK_LOCAL_IP=spark-master"
    networks:
      spark-network:
        ipv4_address: 10.5.0.2
  spark-worker-1:
    image: spark-worker:latest
    depends_on:
      - spark-master
    environment:
      - SPARK_MASTER=spark://spark-master:7077
      - SPARK_WORKER_CORES=4
      - SPARK_WORKER_MEMORY=2G
    volumes:
      - ./apps:/opt/spark-apps
      - ./data:/opt/spark-data
    networks:
      spark-network:
        ipv4_address: 10.5.0.3
  spark-worker-2:
    image: spark-worker:latest
    depends_on:
      - spark-master
    environment:
      - SPARK_MASTER=spark://spark-master:7077
      - SPARK_WORKER_CORES=4
      - SPARK_WORKER_MEMORY=2G
    volumes:
      - ./apps:/opt/spark-apps
      - ./data:/opt/spark-data
    networks:
      spark-network:
        ipv4_address: 10.5.0.4

Once all the configs are in place, we can start the stack using docker-compose up, as shown below.

docker-compose starting
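Concretely, from the directory containing docker-compose.yml (the -d flag simply detaches the stack from the terminal and is optional):

$ docker-compose up -d
$ docker-compose ps                      # the master and both workers should report "Up"
$ docker-compose logs -f spark-master    # follow the master's logs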

Once up, I can access the Spark master web UI at http://10.5.0.2:8080/.

In order to verify that the cluster started correctly, I attempted to run spark-shell against it, as shown below.

As a POC, I am running spark-shell from my local copy of the Spark distribution using spark-shell --master spark://10.5.0.2:7077 (spark-shell is shipped with the standard Spark binaries).
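As an extra sanity check (my own example, not from the original post), a tiny job run inside that shell should get distributed across the workers:

$ spark-shell --master spark://10.5.0.2:7077
scala> sc.parallelize(1 to 1000).sum()   // expect 500500.0, computed on the worker containers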

And here is the corresponding screenshot from the master node, pairing up with the application started above.

Spark Master Monitoring Screen

Bonus

I have updated the configuration repo https://github.com/babloo80/docker-spark-cluster to support deployment into Docker Swarm.

I shall also add steps to build Apache Spark from source. This is generally useful for building custom versions of Spark or add-ons for it.

An example would be Apache Spark 2.4.x with Scala 2.12 and Hadoop 2.7; the pre-built binary used above ships with Scala 2.11 only.
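For reference, here is a hedged sketch of what such a source build might look like, based on the standard Spark build tooling rather than anything already in the repo (the profiles shown apply to the 2.4.x line):

$ git clone https://github.com/apache/spark.git && cd spark
$ git checkout v2.4.4
$ ./dev/change-scala-version.sh 2.12      # switch the build to Scala 2.12
$ ./dev/make-distribution.sh --name custom-scala2.12 --tgz -Pscala-2.12 -Phadoop-2.7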

Credits

I felt the resources available were either tricky to set up or didn't work as desired. Hence, this is my attempt to have them work. In any case, I still need to give credit :).

https://github.com/big-data-europe/docker-spark

⛑ Suggestions / Feedback ! 😃
