Docker Compose for Apache Hadoop HDFS
HDFS deployed locally as a distributed file system for development and testing.
The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. It employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.
This article focuses on running a NameNode and a DataNode together as a single-node cluster.
Ideally, the official “Single Node Setup Instructions” at https://hadoop.apache.org/docs/r2.7.7/hadoop-project-dist/hadoop-common/SingleCluster.html would be enough to deploy an HDFS cluster.
In practice, they don’t work well with Docker. I hope to bridge some of the nuances of deploying into Docker.
If you would rather jump straight to a Compose setup and start using it, please feel free to look at https://github.com/babloo80/docker-hadoop.
Nuances of HDFS+Docker
- Most of the deployments I have looked at bundle HDFS with the rest of Hadoop. Honestly, with Apache Spark around, there is almost no need for Hadoop’s compute layer, hence a few tweaks to run HDFS on its own.
- The HDFS / Hadoop scripts depend heavily on ssh, and this caused a fair bit of trouble during my experiment.
- JAVA_HOME needs to be manually configured.
- Formatting the NameNode has to be a manual step, since it should happen only the first time the cluster starts.
Step 1/3: HDFS Configurations
core-site.xml is a config file the NameNode requires in order to advertise the port on which it serves data. Here, it’s set to
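As a rough illustration, a minimal core-site.xml can be generated from a small shell function. `fs.defaultFS` is the standard Hadoop property for the NameNode address; the function name `write_core_site`, the host name `namenode` and the port `8020` are assumptions made for this sketch, not values taken from the original file.

```shell
#!/bin/sh
# Write a minimal core-site.xml into the given config directory.
# usage: write_core_site <conf_dir> <namenode_host> <rpc_port>
# Host and port are placeholders; use the values your setup advertises.
write_core_site() {
  conf_dir="$1"; host="$2"; port="$3"
  cat > "$conf_dir/core-site.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <!-- URI that clients and DataNodes use to reach the NameNode -->
    <name>fs.defaultFS</name>
    <value>hdfs://${host}:${port}</value>
  </property>
</configuration>
EOF
}
```

For example, `write_core_site . namenode 8020` would produce a file that the Dockerfile can then copy into /etc/hadoop/.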
hadoop-env.sh: I pulled this file from the Hadoop distribution and added a line for
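Since JAVA_HOME has to be configured manually, an explicit JAVA_HOME export is the natural candidate for that extra line. Daemons launched over ssh do not inherit the environment set by the Docker image, so the value needs to live in hadoop-env.sh itself. The sketch below derives the value from the location of the `java` binary instead of hard-coding it; both function names are hypothetical helpers, and the exact JDK path inside the image is an assumption.

```shell
#!/bin/sh
# Derive JAVA_HOME from the resolved path of a java binary, e.g.
#   /usr/local/openjdk-8/bin/java -> /usr/local/openjdk-8
java_home_from_bin() {
  java_bin="$1"
  home="${java_bin%/bin/java}"
  # Some JDK 8 layouts resolve to <home>/jre/bin/java; strip the jre part too.
  home="${home%/jre}"
  printf '%s\n' "$home"
}

# Append an explicit export to hadoop-env.sh.
# usage: append_java_home /etc/hadoop/hadoop-env.sh
append_java_home() {
  env_file="$1"
  java_bin="$(readlink -f "$(command -v java)")"
  printf 'export JAVA_HOME=%s\n' "$(java_home_from_bin "$java_bin")" >> "$env_file"
}
```

Running `append_java_home` once against the copied hadoop-env.sh, before it goes into the image, keeps the Dockerfile itself unchanged.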
hdfs-site.xml is required to override the default directory settings.
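A possible hdfs-site.xml for a single-node setup is sketched below, again generated from a shell function. `dfs.replication`, `dfs.namenode.name.dir` and `dfs.datanode.data.dir` are standard HDFS properties. The function name `write_hdfs_site` and the `dfs/name` and `dfs/data` sub-directories are assumptions; the only path the Dockerfile in this article actually creates is /hadoop-data.

```shell
#!/bin/sh
# Write a single-node hdfs-site.xml into the given config directory.
# usage: write_hdfs_site <conf_dir> <data_root>
write_hdfs_site() {
  conf_dir="$1"; data_root="$2"
  cat > "$conf_dir/hdfs-site.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <!-- Only one DataNode exists, so keep a single copy of each block -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <!-- Where the NameNode stores the fsimage and edit log -->
    <name>dfs.namenode.name.dir</name>
    <value>file://${data_root}/dfs/name</value>
  </property>
  <property>
    <!-- Where the DataNode stores block data -->
    <name>dfs.datanode.data.dir</name>
    <value>file://${data_root}/dfs/data</value>
  </property>
</configuration>
EOF
}
```

Calling `write_hdfs_site . /hadoop-data` keeps all HDFS state under one directory, which makes it easy to mount as a volume later.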
Step 2/3: Dockerfile
The Dockerfile below starts from a Java 8 base image and sets up an ssh key so that ssh localhost works without a password. It also downloads Hadoop 2.7.7 from an Apache mirror, which is used to build the base image.
FROM openjdk:8u222-stretch
RUN apt-get update && apt-get install -y software-properties-common ssh net-tools ca-certificates tar rsync sed
# passwordless ssh
RUN ssh-keygen -q -N "" -t rsa -f /root/.ssh/id_rsa
RUN mkdir -p /root/.ssh
COPY config /root/.ssh/
#RUN sed -i 's/PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
RUN cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
RUN chmod 0600 ~/.ssh/authorized_keys
RUN wget --no-verbose http://apache-mirror.8birdsvideo.com/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz && tar -xvzf hadoop-2.7.7.tar.gz && mv hadoop-2.7.7 hadoop && rm hadoop-2.7.7.tar.gz && rm -rf hadoop/share/doc/*
RUN ln -s /hadoop/etc/hadoop /etc/hadoop
RUN mkdir -p /logs
RUN mkdir /hadoop-data
ENV HADOOP_CONF_DIR=/etc/hadoop
ENV PATH /hadoop/bin/:$PATH
COPY core-site.xml /etc/hadoop/
COPY hdfs-site.xml /etc/hadoop/
COPY hadoop-env.sh /etc/hadoop/
On top of the base image above, another Dockerfile is required for the namenode. It formats the NameNode to generate the fsimage and then starts the DFS daemons.
COPY run.sh /run.sh
RUN chmod a+x run.sh
CMD ["/bin/bash", "/run.sh"]
ENTRYPOINT ["sh", "-c", "service ssh restart; /hadoop/sbin/start-dfs.sh; sleep infinity"]
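The contents of run.sh are not reproduced here. Because formatting must happen only once, a script of this kind could guard the format step as sketched below. This is an illustration under assumptions, not the author's script: `NAME_DIR` must match whatever `dfs.namenode.name.dir` points to, and `needs_format` and `format_if_needed` are hypothetical helper names. The check relies on the `current/VERSION` file that a formatted NameNode directory contains, and `-nonInteractive` makes the format command abort instead of prompting if the directory already exists.

```shell
#!/bin/sh
# Assumed location of the NameNode metadata; keep in sync with hdfs-site.xml.
NAME_DIR="${NAME_DIR:-/hadoop-data/dfs/name}"

# Succeeds (exit 0) when the given name directory has never been formatted.
needs_format() {
  [ ! -f "$1/current/VERSION" ]
}

# Format the NameNode on first start only; later starts keep existing data.
format_if_needed() {
  if needs_format "$NAME_DIR"; then
    echo "Formatting NameNode in $NAME_DIR"
    hdfs namenode -format -nonInteractive
  else
    echo "NameNode already formatted, skipping"
  fi
}
```

Calling `format_if_needed` before the DFS daemons start means a restarted container with a persisted /hadoop-data volume keeps its files.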
Step 3/3: Docker Compose
The docker-compose.yml configuration makes sure the required environment variable is advertised and the necessary ports are opened.
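For orientation, a compose file along these lines might look like the sketch below, written out by a shell function. The port numbers are the Hadoop 2.x defaults (50070 for the NameNode web UI, 50010 for DataNode data transfer); 8020 must match the port in `fs.defaultFS`. Everything else is an assumption for illustration: the service name `namenode`, the build path `./namenode`, the volume name `hadoop-data`, and the function name `write_compose_file`. Environment variables are left out because their names depend on your images.

```shell
#!/bin/sh
# Write an illustrative docker-compose.yml into the given directory.
# usage: write_compose_file <dir>
write_compose_file() {
  cat > "$1/docker-compose.yml" <<'EOF'
version: "3"
services:
  namenode:
    build: ./namenode          # hypothetical path to the namenode Dockerfile
    hostname: namenode
    ports:
      - "50070:50070"          # NameNode web UI (Hadoop 2.x default)
      - "8020:8020"            # NameNode RPC, must match fs.defaultFS
      - "50010:50010"          # DataNode data transfer (Hadoop 2.x default)
    volumes:
      - hadoop-data:/hadoop-data
volumes:
  hadoop-data:
EOF
}
```

Mounting /hadoop-data as a named volume is a design choice that lets the formatted NameNode state survive container rebuilds.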
Once the image is started using docker-compose up, the resulting screen should confirm that the namenode has started correctly.
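Beyond eyeballing the startup output, a quick smoke test from the host can confirm that HDFS is actually serving requests. The sketch below assumes the Compose service is called `namenode` (a hypothetical name) and that `hdfs` is on the container's PATH, as the base Dockerfile arranges. `hdfs dfsadmin -report` and `hdfs dfs` are standard Hadoop commands; `hdfs_in_container` and `smoke_test` are helper names invented for this example.

```shell
#!/bin/sh
# Command and service name are overridable; "namenode" is an assumed service name.
COMPOSE="${COMPOSE:-docker-compose}"
SERVICE="${SERVICE:-namenode}"

# Run an hdfs sub-command inside the running container (-T disables the pseudo-TTY).
hdfs_in_container() {
  $COMPOSE exec -T "$SERVICE" hdfs "$@"
}

# Report cluster state, then create and list a directory to prove writes work.
smoke_test() {
  hdfs_in_container dfsadmin -report &&
  hdfs_in_container dfs -mkdir -p /tmp/smoke &&
  hdfs_in_container dfs -ls /tmp
}
```

Running `smoke_test` after `docker-compose up -d` should list one live DataNode in the report if the single-node cluster came up as intended.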
I shall keep this article updated with tweaks to ports, config directories and so on.
I hope this article saved you some development time today.
⛑ Suggestions / Feedback ! 😃