Building Apache Spark 3.x to support S3 / GCS

Step 1: Building Spark from source

git clone https://github.com/apache/spark.git
cd spark
git checkout v3.1.1
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g"
./dev/make-distribution.sh --name 3.2.0 --tgz -Phadoop-3.2 -Phive -Pkubernetes -Pscala-2.12
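When the build finishes, make-distribution.sh drops the tarball in the repository root. A quick sanity check might look like this (the file name assumes the Spark 3.1.1 / Hadoop 3.2 build above):

```shell
# Confirm the distribution tarball exists and peek at its layout
ls -lh spark-3.1.1-bin-3.2.0.tgz
tar ztf spark-3.1.1-bin-3.2.0.tgz | head
```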

Step 2: Prepare the newly created binary for AWS / GCS support

mkdir -p /tmp/work/spark
tar zxvf spark-3.1.1-bin-3.2.0.tgz -C /tmp/work/spark
mv /tmp/work/spark/spark-3.1.1-bin-3.2.0/* /tmp/work/spark
rm -rf /tmp/work/spark/spark-3.1.1-bin-3.2.0
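The same unpack-and-flatten can be done in one step with GNU tar's --strip-components, if preferred:

```shell
# Extract directly into /tmp/work/spark, dropping the top-level
# spark-3.1.1-bin-3.2.0/ directory from every entry's path (GNU tar)
mkdir -p /tmp/work/spark
tar zxf spark-3.1.1-bin-3.2.0.tgz -C /tmp/work/spark --strip-components=1
```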

Step 3: Customizing Spark to support AWS

spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "key")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "secret")
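Hardcoding keys is fine for a quick test, but the same settings can be supplied when the session is built. A minimal sketch, assuming the credentials are in the standard AWS environment variables and that hadoop-aws plus the matching aws-java-sdk-bundle jar are on Spark's classpath (the bucket path is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

// spark.hadoop.* configs are forwarded into the Hadoop configuration.
// Prefer environment variables or an instance profile over literal keys.
val spark = SparkSession.builder()
  .appName("s3a-smoke-test")
  .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
  .getOrCreate()

// Read something back to confirm the S3A connector is wired up
spark.read.text("s3a://my-bucket/some/path/").show(5)
```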

Step 4: Customizing Spark to support GCP
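A sketch of what the GCS wiring typically looks like, assuming the gcs-connector shaded jar has been copied into the distribution's jars/ directory and a service-account JSON key is available (the keyfile path is a placeholder):

```scala
// Register the GCS filesystem implementations from the
// Google Cloud Hadoop connector and point it at a service account.
spark.sparkContext.hadoopConfiguration.set(
  "fs.gs.impl", "com.google.cloud.hadoop.fs.GoogleHadoopFileSystem")
spark.sparkContext.hadoopConfiguration.set(
  "fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.GoogleHadoopFS")
spark.sparkContext.hadoopConfiguration.set(
  "google.cloud.auth.service.account.enable", "true")
spark.sparkContext.hadoopConfiguration.set(
  "google.cloud.auth.service.account.json.keyfile", "/path/to/key.json")
```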

Step 5: Kubernetes

./bin/docker-image-tool.sh -r <repo> -t 3.1.1 build
./bin/docker-image-tool.sh -r <repo> -t 3.1.1 push

Here <repo> is a placeholder for your container registry or Docker Hub namespace.
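With the image pushed, submitting the bundled SparkPi example against the cluster looks roughly like this (the API-server address and image repository are placeholders):

```shell
# Cluster-mode submission to Kubernetes; the jar path is inside the image
./bin/spark-submit \
  --master k8s://https://<apiserver-host>:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=<repo>/spark:3.1.1 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar
```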





Listener and reader
