Building Apache Spark 3.x to support S3 / GCS

Customize Apache Spark 3.1.1 to work with S3 / GCS

Apache Spark 3.1.1 can be built from source together with:
(1) AWS-specific binaries, to enable reading from and writing to S3
(2) GCP-specific binaries, to enable reading from and writing to GCS
(3) Azure: at the time of writing, hadoop-azure (https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-azure/3.2.1) does not provide a shaded jar out of the box. TBD (add steps to generate one); until then, stay tuned 😊

Step 1: Building Spark from source

Here, we build Spark from source. This step can take 20+ minutes to run. The output is a tgz file, written to the root of the checkout, containing a deployable Spark distribution.

git clone git@github.com:apache/spark.git
cd spark
git checkout v3.1.1
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g"
./dev/make-distribution.sh --tgz -Phadoop-3.2 -Phive -Pkubernetes -Pscala-2.12

Step 2: Prepare the newly created binary to enable AWS / GCS

We stage the distribution in a temporary location, add the required jars there, and later build a Docker image from it.

mkdir -p /tmp/work/spark
tar zxvf spark-3.1.1-bin-3.2.0.tgz -C /tmp/work/spark
mv /tmp/work/spark/spark-3.1.1-bin-3.2.0/* /tmp/work/spark
rm -rf /tmp/work/spark/spark-3.1.1-bin-3.2.0
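
The extract, move and cleanup commands above can also be collapsed into one, because tar can drop the top-level directory while extracting. A minimal sketch, assuming a tar that supports --strip-components (GNU and BSD tar both do); the function name stage_spark is hypothetical:

```shell
#!/usr/bin/env bash
# stage_spark TARBALL DEST
# Extracts a Spark distribution tarball into DEST, dropping the
# top-level spark-<version>-bin-<name>/ directory on the way.
stage_spark() {
  local tarball="$1"
  local dest="$2"
  if [ ! -f "$tarball" ]; then
    echo "no such tarball: $tarball" >&2
    return 1
  fi
  mkdir -p "$dest"
  tar -xzf "$tarball" -C "$dest" --strip-components=1
}

# Usage (same paths as in this step):
#   stage_spark spark-3.1.1-bin-3.2.0.tgz /tmp/work/spark
```

This avoids the intermediate directory, so there is no version-specific path left to clean up afterwards.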

Step 3: Customizing Spark to support AWS

We’ll need hadoop-aws and aws-java-sdk-bundle. The hadoop-aws version has to match the Hadoop version Spark was built against (3.2.0 here, as the tarball name shows). I use mvnrepository.com as the starting point to look for the AWS binaries: https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws and https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-bundle.

Notes:

(1) aws-java-sdk-bundle is shaded (it has no compile-time dependencies), so it should be safe to use alongside other dependencies.

(2) The hadoop-aws version must match the Hadoop version in use.

(3) I have tried a newer version of aws-java-sdk-bundle and haven’t had compatibility issues with hadoop-aws. But one can use the exact compile-time dependency version to be on the safer side.

Copy the hadoop-aws jar and the aws-java-sdk-bundle jar into the Spark classpath (i.e., ./jars).
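
Fetching the jars can be scripted, since Maven Central uses a predictable path layout. The sketch below is only an illustration: maven_jar_url and fetch_jar are hypothetical helper names, and the versions in the usage comment are assumptions (3.2.0 matches the Hadoop version in the tarball name; 1.11.375 is, to my knowledge, the bundle that hadoop-aws 3.2.0 was compiled against, so check its POM before relying on it).

```shell
#!/usr/bin/env bash
# maven_jar_url GROUP ARTIFACT VERSION [CLASSIFIER]
# Prints the Maven Central download URL for a jar.
maven_jar_url() {
  local group_path
  group_path="$(printf '%s' "$1" | tr . /)"   # org.apache.hadoop -> org/apache/hadoop
  local artifact="$2"
  local version="$3"
  local classifier="${4:-}"
  local file="${artifact}-${version}${classifier:+-$classifier}.jar"
  echo "https://repo1.maven.org/maven2/${group_path}/${artifact}/${version}/${file}"
}

# fetch_jar URL DEST_DIR
# Downloads the jar into DEST_DIR and fails on HTTP errors.
fetch_jar() {
  curl -fsSL --create-dirs -o "$2/$(basename "$1")" "$1"
}

# Usage (versions are assumptions, see the note above):
#   fetch_jar "$(maven_jar_url org.apache.hadoop hadoop-aws 3.2.0)" /tmp/work/spark/jars
#   fetch_jar "$(maven_jar_url com.amazonaws aws-java-sdk-bundle 1.11.375)" /tmp/work/spark/jars
```

The optional fourth argument covers classified artifacts, such as a jar published with a "shaded" classifier.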

Once done, this can be verified from spark-shell by reading files over s3a. To access S3 with the right credentials, set the AWS access key and secret key in the Hadoop configuration:

spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "key")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "secret")
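
Typing keys into the shell session is fine for a quick test. As an alternative, Spark copies every spark.hadoop.* property into the Hadoop configuration, so the same two settings can be supplied through a properties file. A sketch, assuming the credentials live in the usual AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables; write_s3a_props and the file path are hypothetical:

```shell
#!/usr/bin/env bash
# write_s3a_props FILE
# Writes the S3A credentials from the environment into a Spark
# properties file that only the current user can read.
write_s3a_props() {
  if [ -z "${AWS_ACCESS_KEY_ID:-}" ] || [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    echo "AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must be set" >&2
    return 1
  fi
  (
    umask 077   # applies only inside this subshell
    printf 'spark.hadoop.fs.s3a.access.key %s\nspark.hadoop.fs.s3a.secret.key %s\n' \
      "$AWS_ACCESS_KEY_ID" "$AWS_SECRET_ACCESS_KEY" > "$1"
  )
}

# Usage:
#   write_s3a_props /tmp/work/s3a.conf
#   /tmp/work/spark/bin/spark-shell --properties-file /tmp/work/s3a.conf
```

Note that --properties-file is read instead of conf/spark-defaults.conf, so any defaults you rely on need to go into the same file.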

Step 3 (continued): Customizing Spark to support GCP

We’ll need the most recent version of the Google Cloud Storage Hadoop connector in shaded format, available from https://search.maven.org/search?q=g:com.google.cloud.bigdataoss%20AND%20a:gcs-connector%20AND%20v:hadoop3-*

Note: Copy the latest connector built for Hadoop 3 (the hadoop3-* versions). The official documentation is available at https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage#other_sparkhadoop_clusters.

Copy gcs-connector-hadoop3-2.1.6-shaded.jar into the Spark classpath (i.e., ./jars).

Once done, this can be verified from spark-shell by reading files from a gs:// path. To access GCS with the right credentials, set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of a service account key file.
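
A missing or mistyped key path tends to surface only as an authentication error deep inside a Spark stack trace, so it can help to check the variable before starting spark-shell. A small sketch; check_gcs_credentials is a hypothetical helper, and the key path and bucket in the usage comment are placeholders:

```shell
#!/usr/bin/env bash
# check_gcs_credentials
# Succeeds only if GOOGLE_APPLICATION_CREDENTIALS is set and points
# at a readable, non-empty file.
check_gcs_credentials() {
  if [ -z "${GOOGLE_APPLICATION_CREDENTIALS:-}" ]; then
    echo "GOOGLE_APPLICATION_CREDENTIALS is not set" >&2
    return 1
  fi
  if [ ! -r "$GOOGLE_APPLICATION_CREDENTIALS" ] || [ ! -s "$GOOGLE_APPLICATION_CREDENTIALS" ]; then
    echo "cannot read key file: $GOOGLE_APPLICATION_CREDENTIALS" >&2
    return 1
  fi
}

# Usage (placeholder path and bucket):
#   export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/my-service-account.json"
#   check_gcs_credentials && /tmp/work/spark/bin/spark-shell
#   scala> spark.read.text("gs://my-bucket/some/file.txt").show()
```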

Step 4: Kubernetes

One can take this a step further by building Docker images that include the above customizations and are usable inside Kubernetes. Run the following from the staged distribution (/tmp/work/spark):

./bin/docker-image-tool.sh -r gcr.io/my-proj -t 3.1.1 build

and

./bin/docker-image-tool.sh -r gcr.io/my-proj -t 3.1.1 push

In the example above, I am publishing the image to Google Container Registry (gcr.io).
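
The two commands can be wrapped so that the push only happens after a successful build. The sketch below also prints the resulting image name, which docker-image-tool.sh composes as <repo>/spark:<tag> for the JVM image; build_and_push and spark_image_ref are hypothetical helper names:

```shell
#!/usr/bin/env bash
# spark_image_ref REPO TAG
# Prints the name docker-image-tool.sh gives to the JVM Spark image.
spark_image_ref() {
  echo "$1/spark:$2"
}

# build_and_push REPO TAG
# Must be run from the root of the Spark distribution.
# Pushes only if the build succeeded.
build_and_push() {
  ./bin/docker-image-tool.sh -r "$1" -t "$2" build &&
    ./bin/docker-image-tool.sh -r "$1" -t "$2" push
}

# Usage:
#   cd /tmp/work/spark && build_and_push gcr.io/my-proj 3.1.1
#   echo "spark.kubernetes.container.image=$(spark_image_ref gcr.io/my-proj 3.1.1)"
```

The printed property is the one to pass to spark-submit when running this image on Kubernetes.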

Note: The Docker base image is openjdk:11-jre-slim.

I have some additional notes jotted down here to be able to have a working kubernetes cluster.

Conclusion

I found that https://spark.apache.org/docs/latest/cloud-integration.html describes some useful optimizations worth adding.
