Apache Spark + Kubernetes — v1
Configure Apache Spark with Kubernetes
Many people like to use k8 for the clustering and scaling capabilities. And many other people like to use Apache Spark for big data processing in a cluster.
In order to be able to get best of both worlds, a new experimental resource-manager
has been added to apache-spark
(https://github.com/apache/spark/tree/master/resource-managers/kubernetes). Its scheduler looks as simple as spark-standalone, yet it provides resiliency at the executor level (to reschedule it onto another pod during failure).
Other alternatives from certain cloud providers exists. And these are not free and are at times loaded with features that are never used.
This article provides a step-by-step guide in getting spark to run onto a k8 cluster.
Step 1: Build spark binary from source
In this step, we attempt to build spark from source with Scala 2.12 and Hadoop 2.7. This step can be skipped if the pre-released binary from http://spark.apache.org/downloads.html is sufficient.
git clone git@github.com:apache/spark.git
git checkout v2.4.5
./dev/change-scala-version.sh 2.12
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g"
./dev/make-distribution.sh --tgz -Phadoop-2.7 -Phive -Pkubernetes -Pscala-2.12
This should generate a package named spark-2.4.5-bin-2.7.3.tgz
.
Step 2: Extract this tar into a staging location.
mkdir -p /tmp/work/spark
tar zxvf spark-2.4.5-bin-2.7.3.tgz -C /tmp/work/spark
mv /tmp/work/spark/spark-2.4.5-bin-2.7.3/* /tmp/work/spark
rm -rf /tmp/work/spark/spark-2.4.5-bin-2.7.3
Optionally add any custom jars into the spark/jars
location. In my case, I copied in gcs-connector
and db drivers like postgres
.
cp mssql-jdbc-7.4.1.jre8.jar gcs-connector-hadoop2-latest.jar postgresql-42.2.9.jar /tmp/work/spark/jars
Note: google binaries are prone to version conflict with guava
and Spark 2.4.x uses 14.0.1.
Step 3: Build and push image to be usable as executors.
This step would generate three docker images (JVM based job, PySpark job and SparkR job).
>pwd
/tmp/work/sparkbin/docker-image-tool.sh -r gcr.io/project-name -t v2.4.5 build
bin/docker-image-tool.sh -r gcr.io/project-name -t v2.4.5 push
Note: This step can fail as Step 1 does not package with SparkR
related files. Hence, the workaround is the publish JVM based job
only. It can be resolved by also including R
and/or using OOTB binaries from Step 1.
The docker image from this step would be used by k8 to create new executor pods.
Step 4: Build driver image
k8 pods require a Cluser-IP
network to be able to communicate with each-other and to the driver. Hence, we build an image for driver-pod
that we would run as another pod inside k8.
This jump-box can be used to submit a job in client-mode
and/or in clustered-mode
.
Along with Apache Spark
, we would also require kubectl
to be available in the driver-pod
.
The Dockerfile
for build step would look like this. Note, that the spark binary used for this step is an extract from Step 1.
FROM ubuntu:latest
COPY spark-2.4.5-bin-hadoop2.7 /spark
RUN apt update && apt install -y apt-transport-https apt-utils openjdk-8-jdk
COPY --from=lachlanevenson/k8s-kubectl:v1.10.3 /usr/local/bin/kubectl /usr/local/bin/kubectl
RUN ln -s /spark /opt/spark
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/jre/
ENV SPARK_HOME /opt/spark
ENV K8S_CACERT /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
ENV K8S_TOKEN /var/run/secrets/kubernetes.io/serviceaccount/token
The build step of the Dockerfile
is
docker build . -t gcr.io/project/spark-seed:v2.4.5
docker push gcr.io/project/spark-seed:v2.4.5
Step 5: Prepare the k8 cluster
Now that the image for the driver-pod
is ready, let’s configure
- create a k8 namespace called
spark-app
for our application. - create a service-account called
driver-sa
to be used by thedriver-pod
pod during its creation. - k8 rbac to allow creation of pods from with-in the
jump-box
using the service-accountdriver-sa
.
kubectl create namespace spark-appkubectl create serviceaccount driver-sa -n spark-appkubectl create rolebinding jumppod-rb --clusterrole=admin --serviceaccount=spark-app:driver-sa -n spark-app
Step 6: Start a jumpbox (using spark-seed:v2.4.5) as a pod.
Here, we create a new pod called driver-pod
. The step below would drop us into a linux shell. This pod would act as a master-node and a driver to run a spark-job.
kubectl run --generator=run-pod/v1 driver-pod -it --rm=true -n spark-app --image=gcr.io/project-name/spark-seed:v2.4.5 --serviceaccount='driver-sa' --image-pull-policy Always
Note, this node would be used in the role of spark-master
and spark-driver
.
Step 7: Create another service-account for spark-k8
In order to be able to have spark spawn executors, we’ll have to create a service-account (capable of creating pods for executors)
kubectl create serviceaccount spark-sa -n spark-appkubectl create rolebinding spark-sa-rb --clusterrole=edit --serviceaccount=spark-app:spark-sa -n spark-app
Note, this step can be executed on the build host or on the driver-pod
.
For more info refer to https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac
Step 8: Prepare env vars
In this step, we create the required env variables for the next subsequent steps.
export NAMESPACE=spark-app
export SPARK_SA=spark-sa
export DOCKER_IMAGE=gcr.io/project/spark:v2.4.5
export DRIVER_NAME=jump-1
export DRIVER_PORT=20009
Step 9: Enable the driver-pod to be accessible from the executors
This is an important step that allows the executors to access the driver-pod
IP and Port as its master node.
kubectl expose pod $DRIVER_NAME --port=$DRIVER_PORT --type=ClusterIP --cluster-ip=None -n=spark-app
Step 10: Create a spark-shell
Final step is to see if a spark application can be created. The following step would create a spark-shell
with 3x7 cores cluster with 16 GB of memory per executor. And this needs to be executed from driver-pod
container.
bin/spark-shell \
--master k8s://https://kubernetes.default.svc.cluster.local:443 \
--deploy-mode client \
--name spark-sh-1 \
--conf spark.kubernetes.authenticate.caCertFile=$K8S_CACERT \
--conf spark.kubernetes.authenticate.oauthTokenFile=$K8S_TOKEN \
--conf spark.kubernetes.authenticate.serviceAccountName=$SPARK_SA \
--conf spark.kubernetes.namespace=$NAMESPACE \
--conf spark.driver.pod.name=$DRIVER_NAME \
--conf spark.executor.memory=16G \
--conf spark.executor.instances=3 \
--conf spark.kubernetes.executor.request.cores=7 \
--conf spark.executor.cores=7 \
--conf spark.kubernetes.container.image=$DOCKER_IMAGE \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.driver.host=$DRIVER_NAME.$NAMESPACE.svc.cluster.local \
--conf spark.kubernetes.node.selector.pool=worker \
--conf spark.driver.port=$DRIVER_PORT
For more configuration tweaks on this step, please refer to
http://spark.apache.org/docs/latest/running-on-kubernetes.html#spark-propertieshttp://spark.apache.org/docs/latest/configuration.html
Step 11: Port-forward access to Web Console
In many cases, one may want to access the web-ui to see the progress of the running job.
This can be achieved by using port-forward
from kubectl
kubectl port-forward pods/driver-pod 4040:4040 -n spark-app
Note: The port-forward
pattern is host-port
:container-port
.
I shall be writing a v2 of this same setup, which would have a declarative configuration on the k8 steps.
Hope this experiment is useful for someone. Comments / Feedback welcome!