Apache Spark + Kubernetes — v1

Step 1: Build spark binary from source

In this step, we build Spark from source with Scala 2.12 and Hadoop 2.7. This step can be skipped if the pre-built binary from http://spark.apache.org/downloads.html is sufficient.

git clone git@github.com:apache/spark.git
cd spark
git checkout v2.4.5
./dev/change-scala-version.sh 2.12
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g"
./dev/make-distribution.sh --tgz -Phadoop-2.7 -Phive -Pkubernetes -Pscala-2.12

Step 2: Extract the tarball into a staging location

mkdir -p /tmp/work/spark
tar zxvf spark-2.4.5-bin-2.7.3.tgz -C /tmp/work/spark
mv /tmp/work/spark/spark-2.4.5-bin-2.7.3/* /tmp/work/spark
rm -rf /tmp/work/spark/spark-2.4.5-bin-2.7.3
cp mssql-jdbc-7.4.1.jre8.jar gcs-connector-hadoop2-latest.jar postgresql-42.2.9.jar /tmp/work/spark/jars
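
The extract, move, and remove sequence above can also be collapsed into a single tar call. This is a minimal sketch, assuming a tar that supports --strip-components (GNU tar and bsdtar both do); the function name extract_dist is illustrative and not part of Spark.

```shell
# extract_dist TARBALL DEST
# Unpack TARBALL into DEST and drop the single top-level directory
# (for example spark-2.4.5-bin-2.7.3/), so that bin/, jars/ and the
# rest land directly in DEST.
extract_dist() {
  tarball=$1
  dest=$2
  mkdir -p "$dest" &&
    tar -xzf "$tarball" -C "$dest" --strip-components=1
}

# Usage (not executed here):
#   extract_dist spark-2.4.5-bin-2.7.3.tgz /tmp/work/spark
```

The extra jars can then be copied into /tmp/work/spark/jars exactly as before.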

Step 3: Build and push the executor images

This step generates three Docker images (for JVM-based, PySpark, and SparkR jobs). Run it from the staging directory /tmp/work/spark so that the extra jars are included.

bin/docker-image-tool.sh -r gcr.io/project-name -t v2.4.5 build
bin/docker-image-tool.sh -r gcr.io/project-name -t v2.4.5 push
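
It helps to know the image names in advance, because the executor image is referenced again when the spark-shell is started. The sketch below assumes the naming used by the Spark 2.4 docker-image-tool.sh (spark, spark-py, and spark-r under the given repository and tag); the helper spark_image_refs is hypothetical, and the result is worth confirming with `docker images`.

```shell
# spark_image_refs REPO TAG
# Print the three image references that docker-image-tool.sh is expected
# to produce for the given -r (repository) and -t (tag) values.
spark_image_refs() {
  repo=$1
  tag=$2
  for name in spark spark-py spark-r; do
    printf '%s/%s:%s\n' "$repo" "$name" "$tag"
  done
}

spark_image_refs gcr.io/project-name v2.4.5
```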

Step 4: Build driver image

k8 pods need ClusterIP networking to communicate with each other and with the driver. Hence, we build an image for the driver pod, which we will run as another pod inside k8.

FROM ubuntu:latest
COPY spark-2.4.5-bin-hadoop2.7 /spark
RUN apt update && apt install -y apt-transport-https apt-utils openjdk-8-jdk
COPY --from=lachlanevenson/k8s-kubectl:v1.10.3 /usr/local/bin/kubectl /usr/local/bin/kubectl
RUN ln -s /spark /opt/spark
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/jre/
ENV SPARK_HOME /opt/spark
ENV K8S_CACERT /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
ENV K8S_TOKEN /var/run/secrets/kubernetes.io/serviceaccount/token

With the Dockerfile above saved in the current directory, build and push the image:

docker build . -t gcr.io/project-name/spark-seed:v2.4.5
docker push gcr.io/project-name/spark-seed:v2.4.5

Step 5: Prepare the k8 cluster

Now that the image for the driver pod is ready, let's configure the cluster:

  • Create a k8 namespace called spark-app for our application.
  • Create a service account called driver-sa to be used by driver-pod during its creation.
  • Add a k8 RBAC role binding that allows pods to be created from within the jump box using the service account driver-sa.

kubectl create namespace spark-app
kubectl create serviceaccount driver-sa -n spark-app
kubectl create rolebinding jumppod-rb --clusterrole=admin --serviceaccount=spark-app:driver-sa -n spark-app

Step 6: Start a jumpbox (using spark-seed:v2.4.5) as a pod.

Here, we create a new pod called driver-pod. The command below drops us into a Linux shell inside it. This pod acts as both the master node and the driver for the Spark job.

kubectl run --generator=run-pod/v1 driver-pod -it --rm=true -n spark-app --image=gcr.io/project-name/spark-seed:v2.4.5 --serviceaccount='driver-sa' --image-pull-policy Always

Step 7: Create another service-account for spark-k8

For Spark to spawn executors, we need another service account, one that is allowed to create the executor pods.

kubectl create serviceaccount spark-sa -n spark-app
kubectl create rolebinding spark-sa-rb --clusterrole=edit --serviceaccount=spark-app:spark-sa -n spark-app
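
A quick way to confirm that a role binding took effect is to ask the API server with `kubectl auth can-i` while impersonating the service account. This is a sketch only: the function names are illustrative, and it assumes the user running kubectl is allowed to impersonate service accounts.

```shell
# sa_subject NAMESPACE SERVICE_ACCOUNT
# Print the user name under which Kubernetes identifies a service account.
sa_subject() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# can_create_pods NAMESPACE SERVICE_ACCOUNT
# Ask the API server whether the service account may create pods in its
# own namespace. Prints "yes" or "no". Requires kubectl and cluster access.
can_create_pods() {
  kubectl auth can-i create pods \
    --as="$(sa_subject "$1" "$2")" \
    -n "$1"
}

# Usage (not executed here):
#   can_create_pods spark-app spark-sa
#   can_create_pods spark-app driver-sa
```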

Step 8: Prepare env vars

In this step, we export the environment variables required by the subsequent steps.

export NAMESPACE=spark-app
export SPARK_SA=spark-sa
export DOCKER_IMAGE=gcr.io/project-name/spark:v2.4.5
export DRIVER_NAME=driver-pod
export DRIVER_PORT=20009
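
Because the later commands silently expand an unset variable to an empty string, a small guard can save some debugging. The helper below is a hypothetical sketch in POSIX shell, not something Spark or kubectl provides.

```shell
# require_vars NAME...
# Return non-zero and name the first variable that is unset or empty.
require_vars() {
  for name in "$@"; do
    # Indirect lookup: read the value of the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      return 1
    fi
  done
}

# Usage (not executed here):
#   require_vars NAMESPACE SPARK_SA DOCKER_IMAGE DRIVER_NAME DRIVER_PORT || exit 1
```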

Step 9: Enable the driver-pod to be accessible from the executors

This step is important: it creates a headless service so that the executors can reach the driver pod's IP and port, which they treat as their master node.

kubectl expose pod $DRIVER_NAME --port=$DRIVER_PORT --type=ClusterIP --cluster-ip=None -n=spark-app
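
The service created above is named after the pod, so the executors address the driver by the service's DNS name. The sketch below builds that name and resolves it from inside the cluster. It assumes the default cluster domain cluster.local and that getent is available in the image; both function names are illustrative.

```shell
# driver_fqdn DRIVER_NAME NAMESPACE
# Print the in-cluster DNS name of the service that fronts the driver pod.
driver_fqdn() {
  printf '%s.%s.svc.cluster.local\n' "$1" "$2"
}

# check_driver_dns DRIVER_NAME NAMESPACE
# Resolve the name; a non-zero exit status means it does not resolve.
check_driver_dns() {
  getent hosts "$(driver_fqdn "$1" "$2")"
}

# Usage from inside the driver pod (not executed here):
#   check_driver_dns "$DRIVER_NAME" "$NAMESPACE"
```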

Step 10: Create a spark-shell

The final step is to check that a Spark application can be created. The following command starts a spark-shell backed by 3 executors, each with 7 cores and 16 GB of memory. It must be executed from inside the driver-pod container.

bin/spark-shell \
--master k8s://https://kubernetes.default.svc.cluster.local:443 \
--deploy-mode client \
--name spark-sh-1 \
--conf spark.kubernetes.authenticate.caCertFile=$K8S_CACERT \
--conf spark.kubernetes.authenticate.oauthTokenFile=$K8S_TOKEN \
--conf spark.kubernetes.authenticate.serviceAccountName=$SPARK_SA \
--conf spark.kubernetes.namespace=$NAMESPACE \
--conf spark.driver.pod.name=$DRIVER_NAME \
--conf spark.executor.memory=16G \
--conf spark.executor.instances=3 \
--conf spark.kubernetes.executor.request.cores=7 \
--conf spark.executor.cores=7 \
--conf spark.kubernetes.container.image=$DOCKER_IMAGE \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.driver.host=$DRIVER_NAME.$NAMESPACE.svc.cluster.local \
--conf spark.kubernetes.node.selector.pool=worker \
--conf spark.driver.port=$DRIVER_PORT
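
Before running the command, it is worth checking that the node pool can actually hold the executors. The arithmetic below is a rough sketch: it assumes Spark's default memory overhead for JVM jobs (the larger of 10% of executor memory and 384 MiB), which should be verified against the documentation for the Spark version in use. The variable and function names are made up for this example.

```shell
# Values taken from the spark-shell command above.
EXECUTORS=3
CORES_PER_EXECUTOR=7
EXECUTOR_MEMORY_MIB=$((16 * 1024))

# overhead_mib MEMORY_MIB
# Assumed default overhead: max(10% of executor memory, 384 MiB).
overhead_mib() {
  tenth=$(( $1 / 10 ))
  if [ "$tenth" -gt 384 ]; then echo "$tenth"; else echo 384; fi
}

POD_MEMORY_MIB=$(( EXECUTOR_MEMORY_MIB + $(overhead_mib "$EXECUTOR_MEMORY_MIB") ))
TOTAL_CORES=$(( EXECUTORS * CORES_PER_EXECUTOR ))
TOTAL_MEMORY_MIB=$(( EXECUTORS * POD_MEMORY_MIB ))

echo "total executor cores: $TOTAL_CORES"
echo "memory per executor pod (MiB): $POD_MEMORY_MIB"
echo "total executor memory (MiB): $TOTAL_MEMORY_MIB"
```

Under these assumptions the cluster needs room for 21 cores and about 18022 MiB per executor pod, on nodes that match the pool=worker node selector.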

Step 11: Port-forward access to Web Console

In many cases, one may want to access the web-ui to see the progress of the running job.

kubectl port-forward pods/driver-pod 4040:4040 -n spark-app


