Docker Compose for Apache Spark

Use Apache Spark to showcase building a Docker Compose stack

In this article, I present a way to build a clustered Apache Spark setup using Docker Compose.

Motivation

In a typical development setup for an Apache Spark application, one is generally limited to running a single-node Spark application on a local machine (like a laptop).

Also, Docker is generally already installed on most developers' machines. The better-case scenario is a Linux host operating system (instead of Windows or macOS, as the latter two tend to be a bit slower with Docker).

The need for a multi-node Apache Spark instance arises when spark-submit based features are being built into a service under development.

Setup

Docker (most commonly installed as Docker CE) needs to be installed along with docker-compose.

Once installed, make sure the Docker service is running.
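For example, a quick way to confirm that both the daemon and docker-compose are available (the output will vary by machine):

$ docker info --format '{{.ServerVersion}}'   # prints the daemon version only if the service is up
$ docker-compose --version                    # confirms docker-compose is on the PATH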

On a side note, if you would prefer to clone the code, feel free to clone https://github.com/babloo80/docker-spark-cluster

Base Image

In order to get started, we'll need to set up a base image containing Apache Spark 2.4.4 (please substitute the version that fits your requirements).

$ cat docker/base/Dockerfile

containing instructions to grab the Spark binary from Apache on top of a container image providing Java 8.

FROM openjdk:8u222-stretch

ENV DAEMON_RUN=true
ENV SPARK_VERSION=2.4.4
ENV HADOOP_VERSION=2.7
ENV SPARK_HOME=/spark

RUN apt-get update && apt-get install -y curl vim wget software-properties-common ssh net-tools ca-certificates jq tar

RUN wget --no-verbose http://apache.mirror.iphh.net/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
 && tar -xvzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
 && mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} spark \
 && rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
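One caveat (my own note, not from the original repo): Apache mirrors such as apache.mirror.iphh.net generally host only the most recent releases, so the wget above may fail for older Spark versions. The long-lived archive should work as a drop-in replacement for that download URL:

# Assumed fallback; archive.apache.org keeps all past Spark releases
wget --no-verbose https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz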

Using this as our base image, we can build the next two images.

Master Image

$ cat docker/spark-master/Dockerfile

containing a launcher script to start the master node.

FROM spark-base:latest

COPY start-master.sh /

ENV SPARK_MASTER_PORT 7077
ENV SPARK_MASTER_WEBUI_PORT 8080
ENV SPARK_MASTER_LOG /spark/logs

CMD ["/bin/bash", "/start-master.sh"]

And start-master.sh

export SPARK_MASTER_HOST=`hostname`
. "/spark/sbin/spark-config.sh"
. "/spark/bin/load-spark-env.sh"
mkdir -p $SPARK_MASTER_LOG
export SPARK_HOME=/spark
ln -sf /dev/stdout $SPARK_MASTER_LOG/spark-master.out
cd /spark/bin && /spark/sbin/../bin/spark-class org.apache.spark.deploy.master.Master --ip $SPARK_MASTER_HOST --port $SPARK_MASTER_PORT --webui-port $SPARK_MASTER_WEBUI_PORT >> $SPARK_MASTER_LOG/spark-master.out

Worker Image

$ cat docker/spark-worker/Dockerfile

containing a launcher script for a worker (similar to the master), except that we'll have one or more workers to generalize with.

FROM spark-base:latest

COPY start-worker.sh /

ENV SPARK_WORKER_WEBUI_PORT 8081
ENV SPARK_WORKER_LOG /spark/logs

CMD ["/bin/bash", "/start-worker.sh"]

The next step is to build the images in preparation for docker-compose.yml. In this step, we build all the Dockerfiles we just created by executing the script below.

$ cat build-images.sh
#!/bin/bash
set -e

docker build -t spark-base:latest ./docker/base
docker build -t spark-master:latest ./docker/spark-master
docker build -t spark-worker:latest ./docker/spark-worker
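From the repository root, the script can be run directly (adding the executable bit first if needed):

$ chmod +x build-images.sh
$ ./build-images.sh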

The last piece is docker-compose.yml. Here, we assign an easy-to-remember IP address, 10.5.0.2, to the master node so that one can hardcode the Spark master URL as spark://10.5.0.2:7077.

We also set up two worker instances with 4 cores and 2 GB of memory each. For more information on the possible environment variables for the master/worker, please refer to http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts

version: "3.7"
networks:
  spark-network:
    ipam:
      config:
        - subnet: 10.5.0.0/16
services:
  spark-master:
    image: spark-master:latest
    ports:
      - "9090:8080"
      - "7077:7077"
    volumes:
      - ./apps:/opt/spark-apps
      - ./data:/opt/spark-data
    environment:
      - "SPARK_LOCAL_IP=spark-master"
    networks:
      spark-network:
        ipv4_address: 10.5.0.2
  spark-worker-1:
    image: spark-worker:latest
    depends_on:
      - spark-master
    environment:
      - SPARK_MASTER=spark://spark-master:7077
      - SPARK_WORKER_CORES=4
      - SPARK_WORKER_MEMORY=2G
    volumes:
      - ./apps:/opt/spark-apps
      - ./data:/opt/spark-data
    networks:
      spark-network:
        ipv4_address: 10.5.0.3
  spark-worker-2:
    image: spark-worker:latest
    depends_on:
      - spark-master
    environment:
      - SPARK_MASTER=spark://spark-master:7077
      - SPARK_WORKER_CORES=4
      - SPARK_WORKER_MEMORY=2G
    volumes:
      - ./apps:/opt/spark-apps
      - ./data:/opt/spark-data
    networks:
      spark-network:
        ipv4_address: 10.5.0.4

Once all the configs are in place, we can start the stack using docker-compose up, as shown below.

docker-compose starting
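Concretely, from the directory containing docker-compose.yml (the -d flag simply detaches the stack from the terminal and is optional):

$ docker-compose up -d
$ docker-compose ps                      # the master and both workers should report "Up"
$ docker-compose logs -f spark-master    # follow the master's logs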

Once up, I can access the Spark master web UI at http://10.5.0.2:8080/.

In order to verify that the cluster started correctly, I attempted to run spark-shell against it, as shown below.

As a POC, I am running spark-shell from my local copy of the Spark distribution using spark-shell --master spark://10.5.0.2:7077 (spark-shell is shipped with the standard Spark binaries).
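As an extra sanity check (my own example, not from the original post), a tiny job run inside that shell should get distributed across the workers:

$ spark-shell --master spark://10.5.0.2:7077
scala> sc.parallelize(1 to 1000).sum()   // expect 500500.0, computed on the worker containers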

And here is the corresponding screenshot from the master node, pairing up with the application started above.

Spark Master Monitoring Screen

Bonus

I have updated the configuration repo https://github.com/babloo80/docker-spark-cluster to support deployment into Docker Swarm.

I shall also add steps to build Apache Spark from source. This is generally useful for building custom versions of Spark or add-ons for it.

An example would be Apache Spark 2.4.x with Scala 2.12 and Hadoop 2.7; the pre-built binary used above ships with Scala 2.11 only.
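For reference, here is a hedged sketch of what such a source build might look like, based on the standard Spark build tooling rather than anything already in the repo (the profiles shown apply to the 2.4.x line):

$ git clone https://github.com/apache/spark.git && cd spark
$ git checkout v2.4.4
$ ./dev/change-scala-version.sh 2.12      # switch the build to Scala 2.12
$ ./dev/make-distribution.sh --name custom-scala2.12 --tgz -Pscala-2.12 -Phadoop-2.7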

Credits

I felt the resources available were either tricky to set up or didn't work as desired. Hence, this is my attempt to have them work. In any case, I still need to give credit :).

https://github.com/big-data-europe/docker-spark

⛑ Suggestions / Feedback ! 😃
