In this tutorial we will show you how to install Apache Hadoop on Ubuntu 18.04 LTS. For those of you who didn’t know, Apache Hadoop is an open-source framework used for distributed storage as well as distributed processing of big data on clusters of commodity hardware. Hadoop stores data in the Hadoop Distributed File System (HDFS), and this data is processed using MapReduce. YARN provides an API for requesting and allocating resources in the Hadoop cluster.
This article assumes you have at least basic knowledge of Linux, know how to use the shell, and, most importantly, host your site on your own VPS. The installation is quite simple and assumes you are running as the root account; if not, you may need to add ‘sudo’ to the commands to get root privileges. I will walk you through the step-by-step installation of Apache Hadoop on an Ubuntu 18.04 (Bionic Beaver) server.
Install Apache Hadoop on Ubuntu 18.04 LTS Bionic Beaver
Step 1. First, make sure that all your system packages are up-to-date by running the following apt-get commands in the terminal.
sudo apt-get update
sudo apt-get upgrade
Step 2. Installing Java (OpenJDK).
Since Hadoop is based on Java, make sure you have the Java JDK installed on the system. You can verify the installed version as follows; if Java is not installed yet, install it first (a sample install command follows the version check below).
root@idroot.net ~# java -version
java version "1.8.0_192"
Java(TM) SE Runtime Environment (build 1.8.0_192-b02)
Java HotSpot(TM) 64-Bit Server VM (build 25.74-b02, mixed mode)
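If the command above reports that Java is not found, you can install OpenJDK 8 first. This is a minimal sketch, assuming the stock Ubuntu 18.04 package repositories are available:
sudo apt-get install openjdk-8-jdk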
Step 3. Installing Apache Hadoop on Ubuntu 18.04.
To avoid security issues, we recommend setting up a new Hadoop group and user account to handle all Hadoop-related activities, using the following commands:
sudo addgroup hadoopgroup
sudo adduser --ingroup hadoopgroup hadoopuser
After creating the user, you also need to set up key-based SSH for that account. To do this, execute the following commands:
su - hadoopuser
ssh-keygen -t rsa -P ""
cat /home/hadoopuser/.ssh/id_rsa.pub >> /home/hadoopuser/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh-copy-id -i ~/.ssh/id_rsa.pub slave-1
ssh slave-1
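Before continuing, you may also want to confirm that key-based SSH works for the local machine, since the single-node setup below connects to localhost. It should log you in without prompting for a password:
ssh localhost
exit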
Download the latest stable version of Apache Hadoop. At the time of writing this article it is version 3.1.1:
wget http://www-us.apache.org/dist/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz
tar xzf hadoop-3.1.1.tar.gz
mv hadoop-3.1.1 hadoop
Step 4. Configure Apache Hadoop.
Set up the environment variables. Edit the ~/.bashrc file and append the following values at the end of the file:
export HADOOP_HOME=/home/hadoopuser/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Apply the environment variables to the currently running session:
source ~/.bashrc
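You can quickly confirm that the variables were loaded, for example by echoing HADOOP_HOME:
echo $HADOOP_HOME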
Now edit the $HADOOP_HOME/etc/hadoop/hadoop-env.sh file and set the JAVA_HOME environment variable (adjust the path to match your own Java installation):
export JAVA_HOME=/usr/jdk1.8.0_192/
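If you are unsure where Java is installed on your system, you can look the path up first. One possible way on Ubuntu, assuming Java is managed through update-alternatives:
update-alternatives --list java
readlink -f /usr/bin/java
JAVA_HOME should point to the JDK directory, i.e. the path reported above without the trailing /bin/java (for OpenJDK 8 this is typically /usr/lib/jvm/java-8-openjdk-amd64).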
Hadoop has many configuration files, which need to be configured according to the requirements of your Hadoop infrastructure. Let’s start with the configuration for a basic Hadoop single-node cluster setup:
cd $HADOOP_HOME/etc/hadoop
Edit core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
Edit hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>file:///home/hadoopuser/hadoopdata/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>file:///home/hadoopuser/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>
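The configuration above points at directories under the Hadoop user’s home directory. You may want to create them up front so the NameNode and DataNode can write to them without permission issues; a small sketch, assuming the same paths used in hdfs-site.xml:
mkdir -p /home/hadoopuser/hadoopdata/hdfs/namenode
mkdir -p /home/hadoopuser/hadoopdata/hdfs/datanode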
Edit mapred-site.xml:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
Edit yarn-site.xml:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
Now format the NameNode using the following command; do not forget to check the storage directory:
hdfs namenode -format
Start all Hadoop services using the following commands:
cd $HADOOP_HOME/sbin/
start-dfs.sh
start-yarn.sh
You should observe the output to make sure the DataNode daemons are started (on a multi-node setup, on each slave node one by one). To check whether all services started successfully, use the ‘jps‘ command:
jps
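On a working single-node cluster you would typically expect output similar to the following; the process IDs shown here are only placeholders and will differ on your system:
12034 NameNode
12187 DataNode
12398 SecondaryNameNode
12541 ResourceManager
12689 NodeManager
12814 Jps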
Step 5. Accessing Apache Hadoop.
Apache Hadoop’s ResourceManager web UI will be available on HTTP port 8088 by default, and the NameNode web UI on port 9870 (in Hadoop 3.x this replaced the older port 50070). Open your favorite browser and navigate to http://yourdomain.com:9870 or http://server-ip:9870. If you are using a firewall, please open ports 8088 and 9870 to enable access to the web interfaces.
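For example, if you use UFW (Ubuntu’s default firewall front end), the ports could be opened like this; a minimal sketch, assuming UFW is the firewall in use:
sudo ufw allow 8088/tcp
sudo ufw allow 9870/tcp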
Congratulations! You have successfully installed Hadoop. Thanks for using this tutorial to install Apache Hadoop on your Ubuntu 18.04 LTS system. For additional help or useful information, we recommend you check the official Apache Hadoop website.