Configure Apache Spark with Kubernetes
Many people like to use k8 for the clustering and scaling capabilities. And many other people like to use Apache Spark for big data processing in a cluster.
In order to be able to get best of both worlds, a new experimental
resource-manager has been added to
apache-spark (https://github.com/apache/spark/tree/master/resource-managers/kubernetes). Its scheduler looks as simple as spark-standalone, yet it provides resiliency at the executor level (to reschedule it onto another pod during failure).
Other alternatives from certain cloud providers exists. And these are not free and are at times loaded with features that are never used.
This article provides a step-by-step guide in getting spark to run onto a k8 cluster.
Step 1: Build spark binary from source
In this step, we attempt to build spark from source with Scala 2.12 and Hadoop 2.7. This step can be skipped if the pre-released binary from http://spark.apache.org/downloads.html is sufficient.
git clone email@example.com:apache/spark.git
git checkout v2.4.5
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g"
./dev/make-distribution.sh --tgz -Phadoop-2.7 -Phive -Pkubernetes -Pscala-2.12
This should generate a package named
Step 2: Extract this tar into a staging location.
mkdir -p /tmp/work/spark
tar zxvf spark-2.4.5-bin-2.7.3.tgz -C /tmp/work/spark
mv /tmp/work/spark/spark-2.4.5-bin-2.7.3/* /tmp/work/spark
rm -rf /tmp/work/spark/spark-2.4.5-bin-2.7.3
Optionally add any custom jars into the
spark/jars location. In my case, I copied in
gcs-connector and db drivers like
cp mssql-jdbc-7.4.1.jre8.jar gcs-connector-hadoop2-latest.jar postgresql-42.2.9.jar /tmp/work/spark/jars
Note: google binaries are prone to version conflict with
guava and Spark 2.4.x uses 14.0.1.
Step 3: Build and push image to be usable as executors.
This step would generate three docker images (JVM based job, PySpark job and SparkR job).
/tmp/work/sparkbin/docker-image-tool.sh -r gcr.io/project-name -t v2.4.5 build
bin/docker-image-tool.sh -r gcr.io/project-name -t v2.4.5 push
Note: This step can fail as Step 1 does not package with
SparkR related files. Hence, the workaround is the publish
JVM based job only. It can be resolved by also including
R and/or using OOTB binaries from Step 1.
The docker image from this step would be used by k8 to create new executor pods.
Step 4: Build driver image
k8 pods require a
Cluser-IP network to be able to communicate with each-other and to the driver. Hence, we build an image for
driver-pod that we would run as another pod inside k8.
This jump-box can be used to submit a job in
client-mode and/or in
Apache Spark, we would also require
kubectl to be available in the
Dockerfile for build step would look like this. Note, that the spark binary used for this step is an extract from Step 1.
COPY spark-2.4.5-bin-hadoop2.7 /spark
RUN apt update && apt install -y apt-transport-https apt-utils openjdk-8-jdk
COPY --from=lachlanevenson/k8s-kubectl:v1.10.3 /usr/local/bin/kubectl /usr/local/bin/kubectl
RUN ln -s /spark /opt/spark
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/jre/
ENV SPARK_HOME /opt/spark
ENV K8S_CACERT /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
ENV K8S_TOKEN /var/run/secrets/kubernetes.io/serviceaccount/token
The build step of the
docker build . -t gcr.io/project/spark-seed:v2.4.5
docker push gcr.io/project/spark-seed:v2.4.5
Step 5: Prepare the k8 cluster
Now that the image for the
driver-pod is ready, let’s configure
- create a k8 namespace called
spark-appfor our application.
- create a service-account called
driver-sato be used by the
driver-podpod during its creation.
- k8 rbac to allow creation of pods from with-in the
jump-boxusing the service-account
kubectl create namespace spark-appkubectl create serviceaccount driver-sa -n spark-appkubectl create rolebinding jumppod-rb --clusterrole=admin --serviceaccount=spark-app:driver-sa -n spark-app
Step 6: Start a jumpbox (using spark-seed:v2.4.5) as a pod.
Here, we create a new pod called
driver-pod. The step below would drop us into a linux shell. This pod would act as a master-node and a driver to run a spark-job.
kubectl run --generator=run-pod/v1 driver-pod -it --rm=true -n spark-app --image=gcr.io/project-name/spark-seed:v2.4.5 --serviceaccount='driver-sa' --image-pull-policy Always
Note, this node would be used in the role of
Step 7: Create another service-account for spark-k8
In order to be able to have spark spawn executors, we’ll have to create a service-account (capable of creating pods for executors)
kubectl create serviceaccount spark-sa -n spark-appkubectl create rolebinding spark-sa-rb --clusterrole=edit --serviceaccount=spark-app:spark-sa -n spark-app
Note, this step can be executed on the build host or on the
For more info refer to https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac
Step 8: Prepare env vars
In this step, we create the required env variables for the next subsequent steps.
Step 9: Enable the driver-pod to be accessible from the executors
This is an important step that allows the executors to access the
driver-pod IP and Port as its master node.
kubectl expose pod $DRIVER_NAME --port=$DRIVER_PORT --type=ClusterIP --cluster-ip=None -n=spark-app
Step 10: Create a spark-shell
Final step is to see if a spark application can be created. The following step would create a
spark-shell with 3x7 cores cluster with 16 GB of memory per executor. And this needs to be executed from
--master k8s://https://kubernetes.default.svc.cluster.local:443 \
--deploy-mode client \
--name spark-sh-1 \
--conf spark.kubernetes.authenticate.caCertFile=$K8S_CACERT \
--conf spark.kubernetes.authenticate.oauthTokenFile=$K8S_TOKEN \
--conf spark.kubernetes.authenticate.serviceAccountName=$SPARK_SA \
--conf spark.kubernetes.namespace=$NAMESPACE \
--conf spark.driver.pod.name=$DRIVER_NAME \
--conf spark.executor.memory=16G \
--conf spark.executor.instances=3 \
--conf spark.kubernetes.executor.request.cores=7 \
--conf spark.executor.cores=7 \
--conf spark.kubernetes.container.image=$DOCKER_IMAGE \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.driver.host=$DRIVER_NAME.$NAMESPACE.svc.cluster.local \
--conf spark.kubernetes.node.selector.pool=worker \
For more configuration tweaks on this step, please refer to
Step 11: Port-forward access to Web Console
In many cases, one may want to access the web-ui to see the progress of the running job.
This can be achieved by using
kubectl port-forward pods/driver-pod 4040:4040 -n spark-app
port-forward pattern is
I shall be writing a v2 of this same setup, which would have a declarative configuration on the k8 steps.
Hope this experiment is useful for someone. Comments / Feedback welcome!