Apache Spark + Kubernetes — v1

Step 1: Build the Spark binaries from source

git clone git@github.com:apache/spark.git
git checkout v2.4.5
./dev/change-scala-version.sh 2.12
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g"
./dev/make-distribution.sh --tgz -Phadoop-2.7 -Phive -Pkubernetes -Pscala-2.12
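make-distribution.sh drops the tarball in the repo root; its name follows spark-&lt;version&gt;-bin-&lt;name&gt;.tgz, where the exact suffix depends on the profiles used. A quick (illustrative) glob to locate it:

```shell
# Find the freshly built distribution tarball in the current directory.
# Prints a fallback message if the build has not produced one yet.
TARBALL=$(ls spark-*-bin-*.tgz 2>/dev/null | head -n 1)
echo "${TARBALL:-no distribution tarball found in $(pwd)}"
```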

Step 2: Extract this tar into a staging location.

mkdir -p /tmp/work/spark
tar zxvf spark-2.4.5-bin-2.7.3.tgz -C /tmp/work/spark
mv /tmp/work/spark/spark-2.4.5-bin-2.7.3/* /tmp/work/spark
rm -rf /tmp/work/spark/spark-2.4.5-bin-2.7.3
cp mssql-jdbc-7.4.1.jre8.jar gcs-connector-hadoop2-latest.jar postgresql-42.2.9.jar /tmp/work/spark/jars
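Before building images it is worth confirming the extra JDBC/GCS jars actually landed in the jars directory; a missing jar only surfaces later as a ClassNotFoundException at runtime. A small sketch (the staging path matches the commands above):

```shell
# Sanity check: verify the extra jars were staged into the Spark jars dir.
SPARK_STAGE=${SPARK_STAGE:-/tmp/work/spark}
missing=0
for jar in mssql-jdbc-7.4.1.jre8.jar gcs-connector-hadoop2-latest.jar postgresql-42.2.9.jar; do
  if [ -f "$SPARK_STAGE/jars/$jar" ]; then
    echo "ok: $jar"
  else
    echo "missing: $jar"
    missing=$((missing + 1))
  fi
done
echo "$missing jar(s) missing"
```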

Step 3: Build and push the Spark image to be used for the executors.

bin/docker-image-tool.sh -r gcr.io/project-name -t v2.4.5 build
bin/docker-image-tool.sh -r gcr.io/project-name -t v2.4.5 push

Step 4: Build driver image

FROM ubuntu:latest
COPY spark-2.4.5-bin-hadoop2.7 /spark
RUN apt update && apt install -y apt-transport-https apt-utils openjdk-8-jdk
COPY --from=lachlanevenson/k8s-kubectl:v1.10.3 /usr/local/bin/kubectl /usr/local/bin/kubectl
RUN ln -s /spark /opt/spark
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/jre/
ENV SPARK_HOME /opt/spark
ENV K8S_CACERT /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
ENV K8S_TOKEN /var/run/secrets/kubernetes.io/serviceaccount/token

docker build . -t gcr.io/project-name/spark-seed:v2.4.5
docker push gcr.io/project-name/spark-seed:v2.4.5

Step 5: Prepare the k8s cluster

  • Create a k8s namespace called spark-app for our application.
  • Create a service-account called driver-sa to be used by the driver pod during its creation.
  • Add k8s RBAC to allow creation of pods from within the jump-box using the service-account driver-sa.

kubectl create namespace spark-app
kubectl create serviceaccount driver-sa -n spark-app
kubectl create rolebinding jumppod-rb --clusterrole=admin --serviceaccount=spark-app:driver-sa -n spark-app
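Note that the --serviceaccount flag in the rolebinding takes its subject as "&lt;namespace&gt;:&lt;name&gt;", not just the account name. A tiny (hypothetical) helper illustrating the format:

```shell
# Build the "<namespace>:<serviceaccount>" subject string expected by
# kubectl create rolebinding --serviceaccount.
sa_subject() {
  printf '%s:%s\n' "$1" "$2"
}
sa_subject spark-app driver-sa   # the subject used in the rolebinding above
```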

Step 6: Start a jumpbox (using spark-seed:v2.4.5) as a pod.

kubectl run --generator=run-pod/v1 driver-pod -it --rm=true -n spark-app --image=gcr.io/project-name/spark-seed:v2.4.5 --serviceaccount='driver-sa' --image-pull-policy Always
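Because the pod runs under driver-sa, Kubernetes mounts that account's credentials at the standard path, which is exactly what the image's K8S_CACERT and K8S_TOKEN env vars point at. A quick check to run inside the jump-box (outside a pod it simply reports the files as missing):

```shell
# Verify the service-account credentials are mounted where K8S_CACERT and
# K8S_TOKEN expect them.
SA_DIR=/var/run/secrets/kubernetes.io/serviceaccount
for f in "$SA_DIR/ca.crt" "$SA_DIR/token"; do
  if [ -r "$f" ]; then echo "found: $f"; else echo "missing: $f"; fi
done
```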

Step 7: Create another service-account for Spark on k8s

kubectl create serviceaccount spark-sa -n spark-app
kubectl create rolebinding spark-sa-rb --clusterrole=edit --serviceaccount=spark-app:spark-sa -n spark-app

Step 8: Prepare env vars

export NAMESPACE=spark-app
export SPARK_SA=spark-sa
export DOCKER_IMAGE=gcr.io/project-name/spark:v2.4.5
export DRIVER_NAME=driver-pod
export DRIVER_PORT=20009
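These variables combine into the driver endpoint that the executors dial back to in Step 10; the headless service created in Step 9 is what makes that DNS name resolvable. A self-contained sketch, assuming the pod name from Step 6 (driver-pod):

```shell
# Derive the driver endpoint used by spark.driver.host/spark.driver.port.
# Defaults repeated here so the snippet runs standalone.
NAMESPACE=${NAMESPACE:-spark-app}
DRIVER_NAME=${DRIVER_NAME:-driver-pod}
DRIVER_PORT=${DRIVER_PORT:-20009}
DRIVER_HOST="$DRIVER_NAME.$NAMESPACE.svc.cluster.local"
echo "driver endpoint: $DRIVER_HOST:$DRIVER_PORT"
```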

Step 9: Expose the driver pod so the executors can reach it

kubectl expose pod $DRIVER_NAME --port=$DRIVER_PORT --type=ClusterIP --cluster-ip=None -n spark-app

Step 10: Start a spark-shell

bin/spark-shell \
--master k8s://https://kubernetes.default.svc.cluster.local:443 \
--deploy-mode client \
--name spark-sh-1 \
--conf spark.kubernetes.authenticate.caCertFile=$K8S_CACERT \
--conf spark.kubernetes.authenticate.oauthTokenFile=$K8S_TOKEN \
--conf spark.kubernetes.authenticate.serviceAccountName=$SPARK_SA \
--conf spark.kubernetes.namespace=$NAMESPACE \
--conf spark.driver.pod.name=$DRIVER_NAME \
--conf spark.executor.memory=16G \
--conf spark.executor.instances=3 \
--conf spark.kubernetes.executor.request.cores=7 \
--conf spark.executor.cores=7 \
--conf spark.kubernetes.container.image=$DOCKER_IMAGE \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.driver.host=$DRIVER_NAME.$NAMESPACE.svc.cluster.local \
--conf spark.kubernetes.node.selector.pool=worker \
--conf spark.driver.port=$DRIVER_PORT
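The executor settings above ask for 3 instances of 7 cores and 16G each, so the "worker" node pool must have at least that much schedulable capacity or the executor pods will sit in Pending. A back-of-envelope check (purely local arithmetic):

```shell
# Total up the cluster capacity the spark-shell settings will request.
EXEC_INSTANCES=3; EXEC_CORES=7; EXEC_MEM_G=16
echo "total executor cores:  $((EXEC_INSTANCES * EXEC_CORES))"
echo "total executor memory: $((EXEC_INSTANCES * EXEC_MEM_G))G"
```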

Step 11: Port-forward access to Web Console

kubectl port-forward pods/driver-pod 4040:4040 -n spark-app



