Wednesday, April 2, 2014

Hadoop 2.2 Multi Node Cluster Setup

 

In this tutorial you will learn how to set up a Hadoop multi-node cluster.



If you are using PuTTY to access your Linux box remotely, please install openssh-server by running the command below; this also makes it easier to configure SSH access later in the installation:
sudo apt-get install openssh-server


Prerequisites:

  1. Installing Java v1.7
  2. Adding dedicated Hadoop system user.
  3. Configuring SSH access.
  4. Disabling IPv6.

Before installing any applications or software, please make sure your list of packages from all repositories and PPAs is up to date; if it is not, update it with this command:
sudo apt-get update

Installing Java v1.7:

Hadoop requires Java v1.7 or later.
Download the latest Oracle Java (Linux x64) from the Oracle website using this command:
wget https://edelivery.oracle.com/otn-pub/java/jdk/7u45-b18/jdk-7u45-linux-x64.tar.gz
If the download fails, try the command below, which passes the required cookie so you do not have to provide a username and password:
wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com" "https://edelivery.oracle.com/otn-pub/java/jdk/7u45-b18/jdk-7u45-linux-x64.tar.gz"
Unpack the compressed Java binaries in the directory where you downloaded them:
sudo tar xvzf jdk-7u45-linux-x64.tar.gz
Create a Java directory under /usr/local/ by using this command:
sudo mkdir -p /usr/local/Java
Copy the unpacked Oracle Java binaries into the /usr/local/Java directory and change into it:
sudo cp -r jdk1.7.0_45 /usr/local/Java
cd /usr/local/Java
Edit the system PATH file /etc/profile and add the following system variables to your system path
sudo nano /etc/profile    or  sudo gedit /etc/profile
Scroll down to the end of the file using your arrow keys and add the following lines below to the end of your /etc/profile file:
JAVA_HOME=/usr/local/Java/jdk1.7.0_45
PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
export JAVA_HOME
export PATH
Inform your Ubuntu Linux system where your Oracle Java JDK/JRE is located. This will tell the system that the new Oracle Java version is available for use.
sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/local/Java/jdk1.7.0_45/bin/javac" 1
sudo update-alternatives --set javac /usr/local/Java/jdk1.7.0_45/bin/javac
This command notifies the system that Oracle Java JDK is available for use
Reload your system wide PATH /etc/profile by typing the following command:
. /etc/profile
Test to see if Oracle Java was installed correctly on your system.
java -version
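If the JDK is on your PATH, the command should report the Oracle JDK; the exact build strings may differ, but the output should look roughly like this:
java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)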

Adding dedicated Hadoop system user.

We will use a dedicated Hadoop user account for running Hadoop. While this is not required, it is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine.
a. Adding group:
sudo addgroup hadoop
b. Creating a user and adding the user to a group:
sudo adduser --ingroup hadoop hduser
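To confirm that the account and group were created correctly (a quick sanity check, not part of the original steps), list the user's groups; the output should show hduser as a member of hadoop:
id hduser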

Configuring SSH access:

SSH key-based authentication is needed so that the master node can log in to the slave nodes (and the secondary namenode) to start and stop their daemons, and also to the local machine if you want to run Hadoop on it. We therefore need to configure SSH access to localhost for the hduser user we created in the previous section.
Before this step, make sure that SSH is up and running on your machine and is configured to allow SSH public key authentication.
Generating an SSH key for the hduser user.
a. Log in as the hduser user (e.g. sudo su - hduser)
b. Run this Key generation command:
ssh-keygen -t rsa -P ""
It will ask for the file name in which to save the key; just press Enter so that the key is generated at '/home/hduser/.ssh'.
Enable SSH access to your local machine with this newly created key.
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
The final step is to test the SSH setup by connecting to your local machine with the hduser user.
ssh hduser@localhost
This will add localhost permanently to the list of known hosts
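If SSH still prompts for a password here, the most common cause is overly open permissions on the .ssh directory; tightening them (an optional troubleshooting step, assuming the default key location) usually fixes it:
chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys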

Disabling IPv6.

We need to disable IPv6 because the 0.0.0.0 address used in various Hadoop configurations can resolve to an IPv6 address on Ubuntu. You will need to run the following command using a root (sudo) account:
sudo gedit /etc/sysctl.conf
Add the following lines to the end of the file and reboot the machine so that the settings take effect:
#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
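After the reboot you can verify that IPv6 is really disabled; the following command should print 1 if the setting took effect:
cat /proc/sys/net/ipv6/conf/all/disable_ipv6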

 

Hadoop installation

Go to the Apache downloads page and download Hadoop version 2.2.0 (prefer a stable release).
Run the following command to download Hadoop version 2.2.0:
wget http://apache.mirrors.pair.com/hadoop/common/stable2/hadoop-2.2.0.tar.gz
Unpack the compressed hadoop file by using this command:
tar -xvzf hadoop-2.2.0.tar.gz
Move the Hadoop package to a directory of your choice; I picked /opt/hadoop-2.2.0 for convenience:
sudo mv hadoop-2.2.0 /opt/hadoop-2.2.0
Make sure to change the owner of all the files to the hduser user and hadoop group by using this command:
sudo chown -R hduser:hadoop /opt/hadoop-2.2.0
Add the following lines to the .bashrc file:
root@arrakis[~]#cd ~
root@arrakis[~]#vi .bashrc
Copy and paste the following lines at the end of the file:
export HADOOP_HOME=/opt/hadoop-2.2.0
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
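To make these variables available in your current shell without logging out again, reload .bashrc and check that HADOOP_HOME resolves (it should print /opt/hadoop-2.2.0):
source ~/.bashrc
echo $HADOOP_HOME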
Modify the Hadoop environment files.
Add JAVA_HOME to libexec/hadoop-config.sh at the beginning of the file:
root@arrakis[~]#vi /opt/hadoop-2.2.0/libexec/hadoop-config.sh
….
export JAVA_HOME='/usr/local/Java/jdk1.7.0_45'
….
Add JAVA_HOME to etc/hadoop/hadoop-env.sh at the beginning of the file:
root@arrakis[~]#vi /opt/hadoop-2.2.0/etc/hadoop/hadoop-env.sh
….
export JAVA_HOME='/usr/local/Java/jdk1.7.0_45'
….
Check Hadoop installation
root@arrakis[~]#cd /opt/hadoop-2.2.0/bin
root@arrakis[bin]#./hadoop version
Hadoop 2.2.0
…..
At this point Hadoop is installed on your node.
Create a folder for tmp:
root@arrakis[~]#mkdir -p $HADOOP_HOME/tmp
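If you created the tmp folder as root, also make sure hduser owns it, otherwise the namenode and datanode will not be able to write into it (same ownership convention as used above):
sudo chown -R hduser:hadoop /opt/hadoop-2.2.0/tmp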

Configuration: Multi-node setup

Add the IP addresses of the master and all slaves to /etc/hosts – on both the master and all the slave nodes.
Add the association between the hostnames and the IP addresses for the master and the slaves to /etc/hosts on all the nodes. Make sure that all the nodes in the cluster are able to ping each other.
hduser@arrakis:/opt/hadoop-2.2.0/bin#vi /etc/hosts
10.184.39.67 master
10.184.36.134 slave
In our case we only have one slave; if you have more slaves, name them slave1, slave2, and so on.
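A quick way to confirm the /etc/hosts entries are correct is to ping each node by hostname from the other (hostnames as in the example entries above):
hduser@arrakis:[~]#ping -c 2 slave
hduser@slave:[~]#ping -c 2 master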
Password-less ssh from master to slave
hduser@arrakis:[~]#ssh-keygen -t rsa -P ""
hduser@arrakis:[~]#ssh-copy-id -i /home/hduser/.ssh/id_rsa.pub hduser@slave
hduser@arrakis:[~]#ssh slave
[Note: If you skip this step, you will have to provide passwords for all slaves when the master starts the processes with ./start-*.sh. If you have configured multiple slaves, repeat the process for every slave node.]
Add the slave entries in $HADOOP_CONF_DIR/slaves – only on the master node.
Add all the slave entries to the slaves file on the master node:
hduser@arrakis:[~]#vi /opt/hadoop-2.2.0/etc/hadoop/slaves
 slave
Note: again, we only have one slave in this example; if you have more slaves, add all the slave hostnames.



Hadoop Configuration

On both the master and all the slave nodes

Add the properties to the following Hadoop configuration files, which are available under $HADOOP_CONF_DIR.
core-site.xml
hduser@arrakis[~]#cd /opt/hadoop-2.2.0/etc/hadoop
hduser@arrakis[hadoop]#vi core-site.xml
# Paste the following between the <configuration> tags
<property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop-2.2.0/tmp</value>
  </property>
hdfs-site.xml
hduser@arrakis[hadoop]#vi hdfs-site.xml
# Paste the following between the <configuration> tags
<property>
<name>dfs.replication</name>
<value>2</value>
 </property>
  <property>
<name>dfs.permissions</name>
<value>false</value>
</property>
Note: our replication value is 2 [one master and one slave]. If you have more slaves, set the replication value accordingly.
mapred-site.xml
hduser@arrakis[hadoop]#vi mapred-site.xml
# Paste the following between the <configuration> tags
 <property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
yarn-site.xml
hduser@arakis[hadoop]#vi yarn-site.xml
# Paste the following between the <configuration> tags
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8040</value>
  </property>
Format the namenode – only on the master node:
hduser@arrakis:/opt/hadoop-2.2.0/bin#cd /opt/hadoop-2.2.0/bin
hduser@arrakis:/opt/hadoop-2.2.0/bin# ./hadoop namenode -format
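If the format succeeds, the output should end with a message similar to the following (the path reflects the hadoop.tmp.dir set earlier; exact wording may vary):
INFO common.Storage: Storage directory /opt/hadoop-2.2.0/tmp/dfs/name has been successfully formatted.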


Administering Hadoop – Start & Stop (only on the master node)

Start the processes on the master node – the slave node daemons will start automatically.
start-dfs.sh : to start namenode and datanode
hduser@arrakis:[~]# cd /opt/hadoop-2.2.0/sbin
hduser@arrakis:[sbin]# ./start-dfs.sh
check Master
hduser@arrakis:[sbin]#jps
 17675 Jps
 17578 SecondaryNameNode
 17409 NameNode
check Slave
hduser@slave:[sbin]#jps
 9317 Jps
 9250 DataNode
start-yarn.sh : to start resourcemanager and nodemanager
hduser@arrakis:[sbin]# ./start-yarn.sh
check Master
hduser@arrakis:[sbin]#jps
 17578 SecondaryNameNode
 17917 ResourceManager
 17409 NameNode
 18153 Jps
check Slave
hduser@slave:[sbin]#jps
 9317 Jps
 9250 DataNode
 9357 NodeManager
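To stop the cluster later, run the matching stop scripts from the same sbin directory on the master; they shut down the daemons on the slaves as well:
hduser@arrakis:[sbin]# ./stop-yarn.sh
hduser@arrakis:[sbin]# ./stop-dfs.sh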

Working with Hadoop

Execute these commands on the master:
hduser@arrakis:/opt/hadoop-2.2.0/bin# ./hdfs dfs -mkdir -p /user/hadoop2
hduser@arrakis:/opt/hadoop-2.2.0/bin# ./hdfs dfs -put /root/Desktop/test.html /user/hadoop2
hduser@arrakis:/opt/hadoop-2.2.0/bin# ./hdfs dfs -ls /user/hadoop2
Found 1 items
-rw-r--r-- 2 root supergroup 225 2013-11-11 20:19 /user/hadoop2/test.html
check slave node
hduser@slave:/opt/hadoop-2.2.0/bin# ./hdfs dfs -ls /user/hadoop2/
Found 1 items
-rw-r--r-- 2 root supergroup 225 2013-11-11 20:19 /user/hadoop2/test.html
hduser@slave:/opt/hadoop-2.2.0/bin# ./hdfs dfs -cat /user/hadoop2/test.html
test file. Welcome to Hadoop2.2.0 Installation. !!!!!!!!!!!
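As a final check that both nodes are participating in HDFS, you can ask the namenode for a cluster report from the master; it should list the slave's datanode as a live node (a verification step, not part of the original write-up):
hduser@arrakis:/opt/hadoop-2.2.0/bin# ./hdfs dfsadmin -report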
