Apache Spark + Kubernetes — v1

Step 1: Build the Spark distribution from source

git clone git@github.com:apache/spark.git
git checkout v2.4.5
./dev/change-scala-version.sh 2.12
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g"
./dev/make-distribution.sh --tgz -Phadoop-2.7 -Phive -Pkubernetes -Pscala-2.12

Step 2: Extract the tarball into a staging location.

mkdir -p /tmp/work/spark
tar zxvf spark-2.4.5-bin-2.7.3.tgz -C /tmp/work/spark
mv /tmp/work/spark/spark-2.4.5-bin-2.7.3/* /tmp/work/spark
rm -rf /tmp/work/spark/spark-2.4.5-bin-2.7.3
cp mssql-jdbc-7.4.1.jre8.jar gcs-connector-hadoop2-latest.jar postgresql-42.2.9.jar /tmp/work/spark/jars

Step 3: Build and push the image to be used by the executors.

$ pwd
/tmp/work/spark
bin/docker-image-tool.sh -r gcr.io/project-name -t v2.4.5 build
bin/docker-image-tool.sh -r gcr.io/project-name -t v2.4.5 push

Step 4: Build driver image

FROM ubuntu:latest
COPY spark-2.4.5-bin-hadoop2.7 /spark
RUN apt update && apt install -y apt-transport-https apt-utils openjdk-8-jdk
COPY --from=lachlanevenson/k8s-kubectl:v1.10.3 /usr/local/bin/kubectl /usr/local/bin/kubectl
RUN ln -s /spark /opt/spark
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/jre/
ENV SPARK_HOME /opt/spark
ENV K8S_CACERT /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
ENV K8S_TOKEN /var/run/secrets/kubernetes.io/serviceaccount/token

docker build . -t gcr.io/project-name/spark-seed:v2.4.5
docker push gcr.io/project-name/spark-seed:v2.4.5

Step 5: Prepare the k8s cluster

  • Create a k8s namespace called spark-app for our application.
  • Create a service account called driver-sa, to be used by the driver pod during its creation.
  • Add k8s RBAC to allow creation of pods from within the jumpbox using the service account driver-sa.
kubectl create namespace spark-app
kubectl create serviceaccount driver-sa -n spark-app
kubectl create rolebinding jumppod-rb --clusterrole=admin --serviceaccount=spark-app:driver-sa -n spark-app
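Before moving on, you can sanity-check the binding with `kubectl auth can-i` (a quick check against the live cluster, assuming the rolebinding above was created):

```
kubectl auth can-i create pods \
  --as=system:serviceaccount:spark-app:driver-sa -n spark-app
```

If the RBAC is in place, this should print "yes".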

Step 6: Start a jumpbox (using spark-seed:v2.4.5) as a pod.

kubectl run --generator=run-pod/v1 driver-pod -it --rm=true -n spark-app --image=gcr.io/project-name/spark-seed:v2.4.5 --serviceaccount='driver-sa' --image-pull-policy Always

Step 7: Create another service-account for Spark on k8s

kubectl create serviceaccount spark-sa -n spark-app
kubectl create rolebinding spark-sa-rb --clusterrole=edit --serviceaccount=spark-app:spark-sa -n spark-app

Step 8: Prepare env vars

export NAMESPACE=spark-app
export SPARK_SA=spark-sa
export DOCKER_IMAGE=gcr.io/project-name/spark:v2.4.5
export DRIVER_NAME=driver-pod
export DRIVER_PORT=20009
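The driver host that Step 10 passes to Spark is simply the headless-service DNS name assembled from these variables. A minimal sketch of that derivation (the `cluster.local` cluster-domain suffix is the common default, but your cluster may use a different one; `driver-pod` is the jumpbox pod from Step 6):

```shell
# Sketch: how spark.driver.host in Step 10 is assembled from these variables.
# Assumes the default "cluster.local" cluster domain.
NAMESPACE=spark-app
DRIVER_NAME=driver-pod
echo "$DRIVER_NAME.$NAMESPACE.svc.cluster.local"
# prints: driver-pod.spark-app.svc.cluster.local
```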

Step 9: Expose the driver pod (as a headless service) so the executors can reach it

kubectl expose pod $DRIVER_NAME --port=$DRIVER_PORT --type=ClusterIP --cluster-ip=None -n=spark-app

Step 10: Start a spark-shell

bin/spark-shell \
--master k8s://https://kubernetes.default.svc.cluster.local:443 \
--deploy-mode client \
--name spark-sh-1 \
--conf spark.kubernetes.authenticate.caCertFile=$K8S_CACERT \
--conf spark.kubernetes.authenticate.oauthTokenFile=$K8S_TOKEN \
--conf spark.kubernetes.authenticate.serviceAccountName=$SPARK_SA \
--conf spark.kubernetes.namespace=$NAMESPACE \
--conf spark.kubernetes.driver.pod.name=$DRIVER_NAME \
--conf spark.executor.memory=16G \
--conf spark.executor.instances=3 \
--conf spark.kubernetes.executor.request.cores=7 \
--conf spark.executor.cores=7 \
--conf spark.kubernetes.container.image=$DOCKER_IMAGE \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.driver.host=$DRIVER_NAME.$NAMESPACE.svc.cluster.local \
--conf spark.kubernetes.node.selector.pool=worker \
--conf spark.driver.port=$DRIVER_PORT
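Once the shell comes up, a quick way to confirm that executors actually registered is a trivial distributed job, typed at the `scala>` prompt (a smoke-test sketch; it needs the live cluster from the steps above):

```
scala> sc.parallelize(1 to 1000).sum
res0: Double = 500500.0

scala> sc.getExecutorMemoryStatus.size   // counts driver + registered executors
```

If the count job hangs, the executors likely cannot reach the driver — re-check the headless service from Step 9 and the spark.driver.host/port settings.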
References:
http://spark.apache.org/docs/latest/running-on-kubernetes.html#spark-properties
http://spark.apache.org/docs/latest/configuration.html

Step 11: Port-forward to access the Spark Web UI

kubectl port-forward pods/driver-pod 4040:4040 -n spark-app
