Pseudo Distributed Hadoop Setup

Basic instructions for setting up Hadoop in a Pseudo-Distributed Mode

Make sure you have a current JDK installed (Java 6 or better)

Configure ssh

Make sure you can ssh to the system

Create ssh key and add it to authorized keys

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

Try connecting to localhost with ssh (you should not be prompted for a password)

ssh localhost
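
If you want to verify this non-interactively, batch mode makes ssh fail instead of falling back to a password prompt (a quick sanity check, not part of the original steps):

ssh -o BatchMode=yes localhost true && echo "passwordless ssh is working"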

Download hadoop-1.0.4-bin.tar.gz

Navigate to http://hadoop.apache.org/releases.html

Click the Download link, then click the "Download release now!" link

Pick a download mirror

Click hadoop-1.0.4/ directory link

Download "hadoop-1.0.4-bin.tar.gz"

Unpack the downloaded tar

I've unpacked it in a directory named '~/Hadoop'

mkdir -p ~/Hadoop
cd ~/Hadoop
tar xvzf ~/Downloads/hadoop-1.0.4-bin.tar.gz
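
If the unpack succeeded you should see the distribution directory (path assumed from the steps above):

ls ~/Hadoop/hadoop-1.0.4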

Create a file to source for the environment settings

Create a hadoop-1.0.4-env file with the following content (modify the JAVA_HOME and HADOOP_PREFIX to match your system):

export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.19.x86_64

export HADOOP_PREFIX="/home/trisberg/Hadoop/hadoop-1.0.4"
export HADOOP_MAPRED_HOME=${HADOOP_PREFIX}
export HADOOP_COMMON_HOME=${HADOOP_PREFIX}
export HADOOP_CONF_DIR=$HADOOP_PREFIX/conf
export HADOOP_LIBEXEC_DIR=$HADOOP_PREFIX/libexec

export PATH=$HADOOP_PREFIX/bin:$HADOOP_PREFIX/sbin:$PATH
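
Once the file is in place, sourcing it and asking the hadoop script for its version is a quick sanity check that the paths are correct (assuming the file was created in ~/Hadoop):

source ~/Hadoop/hadoop-1.0.4-env
hadoop version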

Modify the main config files in the hadoop-1.0.4/conf directory

core-site.xml should have this content (you can modify the 'hadoop.tmp.dir' directory to your liking):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/${user.name}/Hadoop/hadoop-1.0.4-store</value>
        <description>A base for other temporary directories.</description>
    </property>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:8020</value>
    </property>
</configuration>

hdfs-site.xml should have this content:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.support.broken.append</name>
        <value>true</value>
    </property>
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
</configuration>

mapred-site.xml should have this content:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:8021</value>
    </property>
</configuration>

Modify the hadoop-1.0.4/conf/hadoop-env.sh file.

You need to set JAVA_HOME to the same value used in the environment settings file above (again, adjust this to match your system):

...

# The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.19.x86_64

...
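
If you prefer to script this edit, a one-liner like the following works with GNU sed (the JAVA_HOME value is the example path from above; substitute your own):

sed -i 's|^# export JAVA_HOME=.*|export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.19.x86_64|' ~/Hadoop/hadoop-1.0.4/conf/hadoop-env.sh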

We are ready to run:

Start by sourcing the environment settings

source hadoop-1.0.4-env

Format the Hadoop file system

You only do this step once for a new cluster!

hadoop namenode -format
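
Since formatting an existing cluster destroys its data, you can guard against running it twice; this sketch assumes the name directory lives under the 'hadoop.tmp.dir' configured above (the Hadoop 1.x default location):

if [ ! -d ~/Hadoop/hadoop-1.0.4-store/dfs/name ]; then
    hadoop namenode -format
fi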

Start Hadoop namenode, datanode and secondary-namenode

start-dfs.sh

Check that you have the dfs daemons running

jps

You should see something like:

[trisberg@localhost ~]$ jps
27932 SecondaryNameNode
27827 DataNode
26384 NameNode
27988 Jps
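
The namenode starts in safe mode while it loads the filesystem image. Before using the cluster you can wait for it to leave safe mode:

hadoop dfsadmin -safemode wait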

Start Hadoop job-tracker and task-tracker

start-mapred.sh

Check that you have the dfs and mapred daemons running

jps

You should see something like:

[trisberg@localhost ~]$ jps
28170 TaskTracker
27932 SecondaryNameNode
28053 JobTracker
27827 DataNode
26384 NameNode
28259 Jps
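
With all five daemons up, running one of the bundled examples makes a reasonable smoke test (the examples jar ships in the root of the 1.0.4 distribution; the path assumes the layout used above):

hadoop jar ~/Hadoop/hadoop-1.0.4/hadoop-examples-1.0.4.jar pi 2 10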

Once the cluster is up and running you can access the web interfaces at these addresses (the Hadoop 1.x default ports):

NameNode: http://localhost:50070/
JobTracker: http://localhost:50030/
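
A quick scripted check that both web UIs respond (assumes curl is installed):

curl -s -o /dev/null -w "namenode ui: %{http_code}\n" http://localhost:50070/
curl -s -o /dev/null -w "jobtracker ui: %{http_code}\n" http://localhost:50030/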

This would be a good time to run the tests for the spring-hadoop project.

When you are done testing, you can use these commands to shut the cluster down:

stop-mapred.sh
stop-dfs.sh
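
After both scripts finish, jps should list only the Jps process itself; anything else means a daemon is still running:

jps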