Docker Compose for Apache Hadoop HDFS

Programmer · Sep 24, 2019

HDFS deployed locally for development and testing.

Introduction

The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. It employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.

In this article, we focus on a single-node cluster setup with a NameNode and a DataNode.

In an ideal world, we would just follow the “Single Node Setup” instructions at https://hadoop.apache.org/docs/r2.7.7/hadoop-project-dist/hadoop-common/SingleCluster.html and have an HDFS cluster up and running.

Except that it doesn’t work well with Docker. I hope to bridge some of the nuances of deploying it into Docker.

If you would like to jump straight into a Compose setup and start using one, please feel free to look at https://github.com/babloo80/docker-hadoop.
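If you just want to try it out, a rough quick start looks like the following (this assumes the repository layout at the time of writing; you may still need to build the images locally as described below):

git clone https://github.com/babloo80/docker-hadoop.git
cd docker-hadoop
docker-compose up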

Nuances of HDFS+Docker

  1. Most of the deployments I have looked at bundle HDFS together with the rest of Hadoop. And honestly, with Apache Spark around, there is almost no need for Hadoop’s compute layer. Hence, a few tweaks for this.
  2. The HDFS/Hadoop scripts are pretty rigid about ssh, and this has been a bit of a troublemaker during my experiments.
  3. JAVA_HOME needs to be configured manually.
  4. Formatting the NameNode needs to be a manual step, as it only has to be done the first time (see the command sketch after this list).
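For point 4, the one-time format boils down to running something like the following inside the namenode container (the service name namenode is an assumption matching the Compose file in Step 3/3):

# Only needed the very first time, while the metadata directory is still empty
docker-compose exec namenode hdfs namenode -format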

Step 1/3: HDFS Configurations

(a) core-site.xml is a config file required by the NameNode to advertise the host and port on which it serves requests. Here, it is set to hdfs://localhost:9000.

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop</value>
  </property>
</configuration>

(b) Add JAVA_HOME to hadoop-env.sh. I have pulled the file from the Hadoop distribution and added a line for JAVA_HOME.
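For reference, the added line looks roughly like this; the exact path is an assumption for the openjdk:8u222-stretch base image used in Step 2/3, so verify it with echo $JAVA_HOME inside the container:

# hadoop-env.sh: point Hadoop at the JDK explicitly
# (path below is an assumption for the openjdk:8u222-stretch image; adjust for your base image)
export JAVA_HOME=/usr/local/openjdk-8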

(c) hdfs-site.xml is required to override the default storage directories and the replication factor.

<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/hadoop/data/nn</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/hadoop/data/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
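Once the container from Step 3/3 is up, hdfs getconf is a quick way to confirm these overrides were picked up (the service name namenode is an assumption matching the Compose file):

docker-compose exec namenode hdfs getconf -confKey dfs.replication          # expect 1
docker-compose exec namenode hdfs getconf -confKey dfs.namenode.name.dir    # expect /hadoop/data/nn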

Step 2/3: Dockerfile

The Dockerfile below uses a Java 8 base image, with tweaks to set up an SSH key so the container can ssh to localhost. It also grabs Hadoop 2.7.7 from Apache to build the base image.

FROM openjdk:8u222-stretch
RUN apt-get update && apt-get install -y software-properties-common ssh net-tools ca-certificates tar rsync sed
# passwordless ssh
RUN mkdir -p /root/.ssh
RUN ssh-keygen -q -N "" -t rsa -f /root/.ssh/id_rsa
COPY config /root/.ssh/
#RUN sed -i 's/PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
RUN cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
RUN chmod 0600 ~/.ssh/authorized_keys
RUN wget --no-verbose http://apache-mirror.8birdsvideo.com/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz && tar -xvzf hadoop-2.7.7.tar.gz && mv hadoop-2.7.7 hadoop && rm hadoop-2.7.7.tar.gz && rm -rf hadoop/share/doc/*
RUN ln -s /hadoop/etc/hadoop /etc/hadoop
RUN mkdir -p /logs
RUN mkdir /hadoop-data
ENV HADOOP_CONF_DIR=/etc/hadoop
ENV PATH /hadoop/bin/:$PATH
COPY core-site.xml /etc/hadoop/
COPY hdfs-site.xml /etc/hadoop/
COPY hadoop-env.sh /etc/hadoop/
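The namenode Dockerfile in the next block starts with FROM hdfs-base, so this base image needs to be built under that tag; something along these lines, from the directory holding the Dockerfile and the three config files:

docker build -t hdfs-base .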

With the above base image, another Dockerfile for the namenode is required. It runs a script that formats the NameNode (generating the fsimage) and starts up the DFS daemons.

FROM hdfs-base
ENV HADOOP_CONF_DIR=/etc/hadoop
COPY run.sh /run.sh
RUN chmod a+x run.sh
CMD ["/bin/bash", "/run.sh"]
ENTRYPOINT ["sh", "-c", "service ssh restart; /hadoop/sbin/start-dfs.sh; sleep infinity"]
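run.sh itself is not listed here; a minimal sketch of what it could contain, assuming the format-on-first-start behaviour described above and the directories from hdfs-site.xml:

#!/bin/bash
# Sketch only: format the NameNode once, while its metadata directory is still empty
if [ ! -d /hadoop/data/nn/current ]; then
  /hadoop/bin/hdfs namenode -format -nonInteractive
fi
service ssh restart
/hadoop/sbin/start-dfs.sh
sleep infinity

Since the Compose file below references the image as hdfs-namenode, build this Dockerfile with docker build -t hdfs-namenode . from the directory containing it and run.sh.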

Step 3/3: Docker Compose

The docker-compose.yml configuration makes sure the required environment variables are in place, along with publishing the necessary ports.

version: '3'
services:
  namenode:
    image: hdfs-namenode
    ports:
      - "9000:9000"
      - "2222:22"
      - "50070:50070"
    volumes:
      - ./data:/hadoop/data
    environment:
      - "HADOOP_CONF_DIR=/etc/hadoop"
      - "USER=root"

Once the container is started using docker-compose up, the following screen (the NameNode web UI on port 50070) confirms that the NameNode started correctly.

Namenode DFS-Health Monitor
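A quick smoke test from the host, assuming the service name namenode from the Compose file above:

curl -s http://localhost:50070/ | head -n 5                     # NameNode web UI responds
docker-compose exec namenode hdfs dfs -mkdir -p /tmp/smoke      # create a directory in HDFS
docker-compose exec namenode hdfs dfs -ls /                     # list the root of the filesystem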

I shall keep this article updated with tweaks to ports, config directories, and so on.

Hope this article saved you some time in development today.

⛑ Suggestions / Feedback ! 😃
