Building Apache Spark 3.x to support S3 / GCS

Step 1: Building Spark from source

Here we build Spark from source. This step can take 20+ minutes to run. Its output is a .tgz file containing a deployable Spark binary distribution.

git clone git@github.com:apache/spark.git
cd spark
git checkout v3.1.1
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g"
./dev/make-distribution.sh --tgz -Phadoop-3.2 -Phive -Pkubernetes -Pscala-2.12
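Before moving on, it can help to confirm that the build actually produced the tarball. The helper below is a minimal sketch; `find_dist_tarball` is a hypothetical name, and it assumes the tarball lands in the Spark source root with the `spark-<version>-bin-<name>.tgz` naming used in the next step.

```shell
# Print the first Spark distribution tarball found in a directory
# (default: current directory). Returns non-zero if none is present.
find_dist_tarball() {
  dir="${1:-.}"
  for f in "$dir"/spark-*-bin-*.tgz; do
    # An unmatched glob stays literal in sh/bash, so check the file exists.
    if [ -f "$f" ]; then
      echo "$f"
      return 0
    fi
  done
  echo "no Spark distribution tarball found in $dir" >&2
  return 1
}
```

Running `find_dist_tarball .` from the Spark checkout should print the tarball path if the build succeeded.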

Step 2: Prepare the newly created binary to enable AWS / GCS

We stage the required binaries in a temporary location and later build a Docker image from it.

mkdir -p /tmp/work/spark
tar zxvf spark-3.1.1-bin-3.2.0.tgz -C /tmp/work/spark
mv /tmp/work/spark/spark-3.1.1-bin-3.2.0/* /tmp/work/spark
rm -rf /tmp/work/spark/spark-3.1.1-bin-3.2.0
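As an alternative to the extract, move, and remove sequence above, the top-level folder can be dropped during extraction. This is a sketch assuming GNU tar or bsdtar, both of which accept `--strip-components`; `stage_spark` is a hypothetical helper name.

```shell
# Unpack a Spark distribution tarball into a work directory, dropping the
# top-level spark-<version>-bin-<name>/ folder while extracting.
stage_spark() {
  tarball="$1"
  work_dir="$2"
  mkdir -p "$work_dir"
  tar -xzf "$tarball" -C "$work_dir" --strip-components=1
}
```

With this in place, `stage_spark spark-3.1.1-bin-3.2.0.tgz /tmp/work/spark` should leave `bin/`, `jars/` and the rest directly under /tmp/work/spark.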

Step 3: Customizing Spark to support AWS

We’ll need hadoop-aws at a 3.2.x version that matches the Hadoop version Spark was built against (3.2.0 here), plus a compatible aws-java-sdk-bundle. Both jars go into the staged jars directory (/tmp/work/spark/jars). I use mvnrepository.com as the starting point to look for the AWS binaries: https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws and https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-bundle.
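Fetching the two jars could look roughly like the sketch below. The helper names are hypothetical, and the default versions (hadoop-aws 3.2.0 with aws-java-sdk-bundle 1.11.375) are my assumption about a matching pair, so check the hadoop-aws page for the SDK version it lists as a dependency.

```shell
# Base URL of Maven Central's standard repository layout.
MAVEN_BASE="https://repo1.maven.org/maven2"

# Build the jar download URL for a group, artifact and version.
maven_jar_url() {
  group_path=$(echo "$1" | tr '.' '/')
  echo "$MAVEN_BASE/$group_path/$2/$3/$2-$3.jar"
}

# Download hadoop-aws and aws-java-sdk-bundle into the staged jars directory.
# The default versions are assumptions, not taken from the original post.
fetch_aws_jars() {
  jars_dir="${1:-/tmp/work/spark/jars}"
  hadoop_aws_version="${2:-3.2.0}"
  aws_sdk_version="${3:-1.11.375}"
  curl -fSL -o "$jars_dir/hadoop-aws-$hadoop_aws_version.jar" \
    "$(maven_jar_url org.apache.hadoop hadoop-aws "$hadoop_aws_version")"
  curl -fSL -o "$jars_dir/aws-java-sdk-bundle-$aws_sdk_version.jar" \
    "$(maven_jar_url com.amazonaws aws-java-sdk-bundle "$aws_sdk_version")"
}
```

Calling `fetch_aws_jars` with no arguments would place both jars under /tmp/work/spark/jars.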

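// With the jars in place, set S3A credentials on the active SparkSession (`spark`).
// "key" and "secret" are placeholders for real credentials.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "key")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "secret")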

Step 3: Customizing Spark to support GCP

We’ll need the most recent version of the Google Cloud Storage Hadoop connector in its shaded format, available from https://search.maven.org/search?q=g:com.google.cloud.bigdataoss%20AND%20a:gcs-connector%20AND%20v:hadoop3-* . As with the AWS jars, it goes into the staged jars directory (/tmp/work/spark/jars).
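A sketch of fetching the connector and registering it for gs:// paths follows. The helper names are hypothetical and the default version `hadoop3-2.2.0` is an assumption, so substitute whichever hadoop3-* release the search above shows as current. The two `fs.gs` properties are the connector's standard filesystem bindings. Setting them explicitly may be redundant with some connector versions.

```shell
# Build the Maven Central URL of the shaded gcs-connector jar for a version
# string such as hadoop3-2.2.0.
gcs_connector_url() {
  version="$1"
  echo "https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/$version/gcs-connector-$version-shaded.jar"
}

# Download the shaded connector into the staged jars directory.
# The default version is an assumption, not taken from the original post.
fetch_gcs_connector() {
  jars_dir="${1:-/tmp/work/spark/jars}"
  version="${2:-hadoop3-2.2.0}"
  curl -fSL -o "$jars_dir/gcs-connector-$version-shaded.jar" \
    "$(gcs_connector_url "$version")"
}

# Print spark-defaults.conf entries that bind gs:// to the connector classes.
gcs_spark_conf() {
  cat <<'EOF'
spark.hadoop.fs.gs.impl com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
spark.hadoop.fs.AbstractFileSystem.gs.impl com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
EOF
}
```

For example, `gcs_spark_conf >> /tmp/work/spark/conf/spark-defaults.conf` would add the bindings to the staged configuration.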

Step 4: Kubernetes

One can take this a step further by building Docker images, usable inside Kubernetes, that carry the above customizations. Run these from the staged directory (/tmp/work/spark) so that the added jars end up in the image.

./bin/docker-image-tool.sh -r gcr.io/my-proj -t 3.1.1 build
./bin/docker-image-tool.sh -r gcr.io/my-proj -t 3.1.1 push
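Once pushed, the image can be referenced from spark-submit. The sketch below assumes the tool's default image name (`<repo>/spark:<tag>`) and uses the SparkPi example jar that ships inside the image. The API server URL is a placeholder you would supply, and both helper names are hypothetical.

```shell
# Image reference that docker-image-tool.sh produces for a given -r and -t.
spark_image_ref() {
  echo "$1/spark:$2"
}

# Submit the bundled SparkPi example in cluster mode on Kubernetes.
# $1: Kubernetes API server URL (placeholder, e.g. https://<host>:<port>)
# $2: container image reference
submit_spark_pi() {
  api_server="$1"
  image="$2"
  ./bin/spark-submit \
    --master "k8s://$api_server" \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=2 \
    --conf "spark.kubernetes.container.image=$image" \
    local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar
}
```

A call would then look like `submit_spark_pi "$API_SERVER" "$(spark_image_ref gcr.io/my-proj 3.1.1)"`, run from /tmp/work/spark. Depending on the cluster, a service account with permission to create pods may also need to be configured.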

Conclusion

I found that https://spark.apache.org/docs/latest/cloud-integration.html describes some useful optimizations worth adding.
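As one illustration, that guide suggests a handful of Parquet-related settings when working against object stores. The sketch below appends them to a spark-defaults.conf. `append_cloud_defaults` is a hypothetical helper, and the default path assumes jobs are submitted from the staged directory.

```shell
# Append Parquet settings suggested for object stores to a spark-defaults.conf.
# $1: path to the conf file (default: the staged Spark conf directory)
append_cloud_defaults() {
  conf_file="${1:-/tmp/work/spark/conf/spark-defaults.conf}"
  mkdir -p "$(dirname "$conf_file")"
  cat >> "$conf_file" <<'EOF'
spark.hadoop.parquet.enable.summary-metadata false
spark.sql.parquet.mergeSchema false
spark.sql.parquet.filterPushdown true
spark.sql.hive.metastorePartitionPruning true
EOF
}
```

Whether each setting helps depends on the workload, so treat these as a starting point to test rather than fixed recommendations.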
