Apache Spark + Kubernetes — v1

Configure Apache Spark with Kubernetes

Many people like to use k8 for the clustering and scaling capabilities. And many other people like to use Apache Spark for big data processing in a cluster.

In order to be able to get best of both worlds, a new experimental resource-manager has been added to apache-spark (https://github.com/apache/spark/tree/master/resource-managers/kubernetes). Its scheduler looks as simple as spark-standalone, yet it provides resiliency at the executor level (to reschedule it onto another pod during failure).

Other alternatives from certain cloud providers exists. And these are not free and are at times loaded with features that are never used.

This article provides a step-by-step guide in getting spark to run onto a k8 cluster.

In this step, we attempt to build spark from source with Scala 2.12 and Hadoop 2.7. This step can be skipped if the pre-released binary from http://spark.apache.org/downloads.html is sufficient.

git clone git@github.com:apache/spark.git
git checkout v2.4.5
./dev/change-scala-version.sh 2.12
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g"
./dev/make-distribution.sh --tgz -Phadoop-2.7 -Phive -Pkubernetes -Pscala-2.12

This should generate a package named spark-2.4.5-bin-2.7.3.tgz.

mkdir -p /tmp/work/spark
tar zxvf spark-2.4.5-bin-2.7.3.tgz -C /tmp/work/spark
mv /tmp/work/spark/spark-2.4.5-bin-2.7.3/* /tmp/work/spark
rm -rf /tmp/work/spark/spark-2.4.5-bin-2.7.3

Optionally add any custom jars into the spark/jars location. In my case, I copied in gcs-connector and db drivers like postgres.

cp mssql-jdbc-7.4.1.jre8.jar gcs-connector-hadoop2-latest.jar postgresql-42.2.9.jar /tmp/work/spark/jars

Note: google binaries are prone to version conflict with guava and Spark 2.4.x uses 14.0.1.

This step would generate three docker images (JVM based job, PySpark job and SparkR job).

>pwd
/tmp/work/spark
bin/docker-image-tool.sh -r gcr.io/project-name -t v2.4.5 build
bin/docker-image-tool.sh -r gcr.io/project-name -t v2.4.5 push

Note: This step can fail as Step 1 does not package with SparkR related files. Hence, the workaround is the publish JVM based job only. It can be resolved by also including R and/or using OOTB binaries from Step 1.

The docker image from this step would be used by k8 to create new executor pods.

k8 pods require a Cluser-IP network to be able to communicate with each-other and to the driver. Hence, we build an image for driver-pod that we would run as another pod inside k8.

This jump-box can be used to submit a job in client-mode and/or in clustered-mode.

Along with Apache Spark, we would also require kubectl to be available in the driver-pod.

The Dockerfile for build step would look like this. Note, that the spark binary used for this step is an extract from Step 1.

FROM ubuntu:latest
COPY spark-2.4.5-bin-hadoop2.7 /spark
RUN apt update && apt install -y apt-transport-https apt-utils openjdk-8-jdk
COPY --from=lachlanevenson/k8s-kubectl:v1.10.3 /usr/local/bin/kubectl /usr/local/bin/kubectl
RUN ln -s /spark /opt/spark
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/jre/
ENV SPARK_HOME /opt/spark
ENV K8S_CACERT /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
ENV K8S_TOKEN /var/run/secrets/kubernetes.io/serviceaccount/token

The build step of the Dockerfile is

docker build . -t gcr.io/project/spark-seed:v2.4.5
docker push gcr.io/project/spark-seed:v2.4.5

Now that the image for the driver-pod is ready, let’s configure

  • create a k8 namespace called spark-app for our application.
  • create a service-account called driver-sa to be used by the driver-pod pod during its creation.
  • k8 rbac to allow creation of pods from with-in the jump-boxusing the service-account driver-sa.
kubectl create namespace spark-appkubectl create serviceaccount driver-sa -n spark-appkubectl create rolebinding jumppod-rb --clusterrole=admin --serviceaccount=spark-app:driver-sa -n spark-app

Here, we create a new pod called driver-pod. The step below would drop us into a linux shell. This pod would act as a master-node and a driver to run a spark-job.

kubectl run --generator=run-pod/v1 driver-pod -it --rm=true -n spark-app --image=gcr.io/project-name/spark-seed:v2.4.5 --serviceaccount='driver-sa' --image-pull-policy Always

Note, this node would be used in the role of spark-master and spark-driver.

In order to be able to have spark spawn executors, we’ll have to create a service-account (capable of creating pods for executors)

kubectl create serviceaccount spark-sa -n spark-appkubectl create rolebinding spark-sa-rb --clusterrole=edit --serviceaccount=spark-app:spark-sa -n spark-app

Note, this step can be executed on the build host or on the driver-pod.

For more info refer to https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac

In this step, we create the required env variables for the next subsequent steps.

export NAMESPACE=spark-app
export SPARK_SA=spark-sa
export DOCKER_IMAGE=gcr.io/project/spark:v2.4.5
export DRIVER_NAME=jump-1
export DRIVER_PORT=20009

This is an important step that allows the executors to access the driver-pod IP and Port as its master node.

kubectl expose pod $DRIVER_NAME --port=$DRIVER_PORT --type=ClusterIP --cluster-ip=None -n=spark-app

Final step is to see if a spark application can be created. The following step would create a spark-shell with 3x7 cores cluster with 16 GB of memory per executor. And this needs to be executed from driver-pod container.

bin/spark-shell \
--master k8s://https://kubernetes.default.svc.cluster.local:443 \
--deploy-mode client \
--name spark-sh-1 \
--conf spark.kubernetes.authenticate.caCertFile=$K8S_CACERT \
--conf spark.kubernetes.authenticate.oauthTokenFile=$K8S_TOKEN \
--conf spark.kubernetes.authenticate.serviceAccountName=$SPARK_SA \
--conf spark.kubernetes.namespace=$NAMESPACE \
--conf spark.driver.pod.name=$DRIVER_NAME \
--conf spark.executor.memory=16G \
--conf spark.executor.instances=3 \
--conf spark.kubernetes.executor.request.cores=7 \
--conf spark.executor.cores=7 \
--conf spark.kubernetes.container.image=$DOCKER_IMAGE \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.driver.host=$DRIVER_NAME.$NAMESPACE.svc.cluster.local \
--conf spark.kubernetes.node.selector.pool=worker \
--conf spark.driver.port=$DRIVER_PORT

For more configuration tweaks on this step, please refer to

http://spark.apache.org/docs/latest/running-on-kubernetes.html#spark-propertieshttp://spark.apache.org/docs/latest/configuration.html

In many cases, one may want to access the web-ui to see the progress of the running job.

This can be achieved by using port-forward from kubectl

kubectl port-forward pods/driver-pod 4040:4040 -n spark-app

Note: The port-forward pattern is host-port:container-port.

I shall be writing a v2 of this same setup, which would have a declarative configuration on the k8 steps.

Hope this experiment is useful for someone. Comments / Feedback welcome!

Listener and reader