
Pseudo Distributed Hadoop Setup


Basic instructions for setting up Hadoop in pseudo-distributed mode

Make sure you have a current JDK installed (Java 6 or better)
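A quick way to verify this is to check the version from a shell (assuming java is on your PATH):

java -version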

Configure ssh

Make sure you can ssh to the system

Create ssh key - no need to do this if you already have one

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

Add the ssh key to authorized keys so you can log in without a password

cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

Try connecting to localhost with ssh (you should not be prompted for a password)

ssh localhost

Download hadoop-1.0.4-bin.tar.gz

Navigate to http://hadoop.apache.org/releases.html

Click the Download link, then click the Download release now! link

Pick a download mirror

Click the hadoop-1.0.4/ directory link

Download "hadoop-1.0.4-bin.tar.gz"

Unpack the downloaded tar

I've unpacked it in a directory named '~/Hadoop'

mkdir -p ~/Hadoop
cd ~/Hadoop
tar xvzf ~/Downloads/hadoop-1.0.4-bin.tar.gz

Create a file to source for the environment settings

Create a hadoop-1.0.4-env file with the following content (modify the JAVA_HOME and HADOOP_PREFIX to match your system):

export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.19.x86_64

export HADOOP_PREFIX="/home/trisberg/Hadoop/hadoop-1.0.4"
export HADOOP_MAPRED_HOME=${HADOOP_PREFIX}
export HADOOP_COMMON_HOME=${HADOOP_PREFIX}
export HADOOP_CONF_DIR=$HADOOP_PREFIX/conf
export HADOOP_LIBEXEC_DIR=$HADOOP_PREFIX/libexec

export PATH=$HADOOP_PREFIX/bin:$HADOOP_PREFIX/sbin:$PATH
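After saving the file you can do a quick sanity check by sourcing it and asking Hadoop for its version (this assumes you saved the file in ~/Hadoop next to the unpacked distribution):

source ~/Hadoop/hadoop-1.0.4-env
hadoop version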

Modify the main config files in the hadoop-1.0.4/conf directory

core-site.xml should have this content (you can modify the 'hadoop.tmp.dir' directory to your liking):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/${user.name}/Hadoop/hadoop-1.0.4-store</value>
        <description>A base for other temporary directories.</description>
    </property>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:8020</value>
    </property>
</configuration>

hdfs-site.xml should have this content:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.support.broken.append</name>
        <value>true</value>
    </property>
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
</configuration>

mapred-site.xml should have this content:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:8021</value>
    </property>
</configuration>

Modify the hadoop-1.0.4/conf/hadoop-env.sh file.

You need to add the same JAVA_HOME setting that we used in the environment file above (again, adjust this to match your system):

...

# The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.19.x86_64

...

We are ready to run:

Start by sourcing the environment settings

source hadoop-1.0.4-env

Format the Hadoop file system

You only do this step once for a new cluster!

hadoop namenode -format

Start Hadoop namenode, datanode and secondary-namenode

start-dfs.sh

Check that you have the dfs daemons running

jps

You should see something like:

[trisberg@localhost ~]$ jps
27932 SecondaryNameNode
27827 DataNode
26384 NameNode
27988 Jps
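With the dfs daemons up, HDFS should be answering on the fs.default.name address from core-site.xml. A quick way to check, using the standard file system shell, is to create a directory and list the root (shown here just as a suggested smoke test):

hadoop fs -mkdir /tmp
hadoop fs -ls /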

Start Hadoop job-tracker and task-tracker

start-mapred.sh

Check that you have the dfs and mapred daemons running

jps

You should see something like:

[trisberg@localhost ~]$ jps
28170 TaskTracker
27932 SecondaryNameNode
28053 JobTracker
27827 DataNode
26384 NameNode
28259 Jps

If one of the processes is missing, check the logs for clues about what could be wrong. Log files are available in the hadoop-1.0.4/logs directory.

Once the cluster is up and running you can access the web interfaces at these addresses (the Hadoop 1.x defaults, since we didn't override them in the config files):

NameNode: http://localhost:50070/
JobTracker: http://localhost:50030/
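Since we enabled dfs.webhdfs.enabled in hdfs-site.xml, you can also talk to HDFS over its REST API; a directory listing of the root path makes a quick check (a standard WebHDFS call, shown here just as an illustration):

curl -i "http://localhost:50070/webhdfs/v1/?op=LISTSTATUS"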

This would be a good time to run the tests for the spring-hadoop project.
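If you first want a quick end-to-end check of MapReduce itself, you can run one of the example jobs that ship with the distribution (the jar name and location below assume the stock hadoop-1.0.4 tarball layout and that the environment file has been sourced):

hadoop jar $HADOOP_PREFIX/hadoop-examples-1.0.4.jar pi 2 10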

When you are done testing you can use these commands to shut the cluster down:

stop-mapred.sh
stop-dfs.sh