Building Apache Spark 3.x to support S3 / GCS

Step 1: Building Spark from source

Here, we build Spark from source. This step can take 20+ minutes to run. The output is a tgz file containing a deployable Spark distribution.

git clone https://github.com/apache/spark.git
cd spark
git checkout v3.1.1
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g"
./dev/make-distribution.sh --name 3.2.0 --tgz -Phadoop-3.2 -Phive -Pkubernetes -Pscala-2.12

Step 2: Prepare the newly created binary to enable AWS / GCS

We are going to stage all required binaries into a temp location and then build a Docker image from it.

mkdir -p /tmp/work/spark
tar zxvf spark-3.1.1-bin-3.2.0.tgz -C /tmp/work/spark
mv /tmp/work/spark/spark-3.1.1-bin-3.2.0/* /tmp/work/spark
rm -rf /tmp/work/spark/spark-3.1.1-bin-3.2.0

Step 3: Customizing Spark to support AWS

We’ll need the most recent versions of hadoop-aws (the 3.2.x line) and aws-java-sdk. I use Maven Central as the starting point to look for the latest AWS binaries, then copy the jars into the staged distribution’s jars directory (/tmp/work/spark/jars). With those in place, S3 credentials can be set on the Hadoop configuration:
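Staging the jars can be sketched as follows. The exact versions are assumptions — check Maven Central for the current 3.2.x release; the aws-java-sdk-bundle version should match what that hadoop-aws release declares as its dependency (1.11.375 for hadoop-aws 3.2.0).

```shell
# Stage the AWS jars into the unpacked distribution.
# Versions below are examples, not necessarily the latest.
cd /tmp/work/spark/jars
curl -O https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.0/hadoop-aws-3.2.0.jar
curl -O https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.375/aws-java-sdk-bundle-1.11.375.jar
```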

spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "key")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "secret")

Step 4: Customizing Spark to support GCP

We’ll need the most recent version of the Google Cloud Storage Hadoop connector in shaded format, available from Maven Central (com.google.cloud.bigdataoss:gcs-connector); copy the shaded jar into /tmp/work/spark/jars as well.
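With the shaded connector jar staged, the GCS filesystem and service-account credentials can be wired up on the same Hadoop configuration as above. A minimal sketch — the keyfile path is a placeholder:

```scala
val hc = spark.sparkContext.hadoopConfiguration
// Register the GCS filesystem implementations from the shaded connector jar
hc.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hc.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
// Authenticate with a GCP service-account key file (path is a placeholder)
hc.set("google.cloud.auth.service.account.enable", "true")
hc.set("google.cloud.auth.service.account.json.keyfile", "/path/to/keyfile.json")
```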

Step 5: Kubernetes

One can take this a step further by building Docker images that bundle the customized Spark, ready to run inside Kubernetes.

./bin/docker-image-tool.sh -r <repo> -t 3.1.1 build
./bin/docker-image-tool.sh -r <repo> -t 3.1.1 push
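Once pushed, the image can be referenced from spark-submit. A sketch of a cluster-mode submission against a Kubernetes API server — the API server address, repo, and example jar are placeholders:

```shell
./bin/spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=<repo>/spark:3.1.1 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar
```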


I found some further optimizations that are useful to add on top of this setup.


