Pseudo Distributed Hadoop Setup
You need an operating system that can run Hadoop. Hadoop should run well on most Linux distributions; we have tried Ubuntu and Fedora. Mac OS X 10.6 (Snow Leopard) or later should work as well.
If you are using Windows, try installing Hadoop in a Virtual Machine running Linux.
NOTE: When you use Mac OS X, make sure to replace any /home directory references with /Users.
Make sure you can ssh to your local system
Create an ssh key - no need to do this if you already have one
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
Add the ssh key to authorized keys so you can log in without a password
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
Try connecting to localhost with ssh (you should not be prompted for a password)
ssh localhost
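If you want the whole key setup as a single idempotent snippet, something along these lines should work (a sketch; it assumes the default id_dsa path used above):

# generate a key only if one does not already exist
[ -f ~/.ssh/id_dsa ] || ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
# append the public key only if it is not already in authorized_keys
grep -qxF "$(cat ~/.ssh/id_dsa.pub)" ~/.ssh/authorized_keys 2>/dev/null || cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys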
Navigate to http://hadoop.apache.org/releases.html
Click the Download link, then click the Download release now! link
Pick a download mirror
Click the hadoop-1.0.4/ directory link
Download "hadoop-1.0.4-bin.tar.gz"
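Alternatively, if you prefer the command line, older releases are kept on the Apache archive; something like this should fetch the same tarball (the exact URL is an assumption based on the archive layout, so adjust it if the file has moved):

wget -P ~/Downloads http://archive.apache.org/dist/hadoop/core/hadoop-1.0.4/hadoop-1.0.4-bin.tar.gz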
I've unpacked it in a directory named ~/Hadoop:
mkdir ~/Hadoop
cd ~/Hadoop
tar xvzf ~/Downloads/hadoop-1.0.4-bin.tar.gz
Create a hadoop-1.0.4-env file with the following content (modify the JAVA_HOME and HADOOP_PREFIX to match your system):
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.19.x86_64
export HADOOP_PREFIX="${HOME}/Hadoop/hadoop-1.0.4"
export HADOOP_MAPRED_HOME=${HADOOP_PREFIX}
export HADOOP_COMMON_HOME=${HADOOP_PREFIX}
export HADOOP_CONF_DIR=$HADOOP_PREFIX/conf
export HADOOP_LIBEXEC_DIR=$HADOOP_PREFIX/libexec
export PATH=$HADOOP_PREFIX/bin:$HADOOP_PREFIX/sbin:$PATH
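To check that the file resolves correctly on your system, you can source it and ask Hadoop for its version (a quick sanity check; it assumes the paths above match your install):

source hadoop-1.0.4-env
hadoop version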
core-site.xml should have this content (you can modify the value of 'hadoop.tmp.dir' to your liking; remember to replace '/home' with '/Users' for Mac OS X):
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/${user.name}/Hadoop/hadoop-1.0.4-store</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>
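Hadoop expands ${user.name} from the corresponding Java system property, so the value above resolves under your own home directory. If you want the base directory in place before the first format, you can create it yourself (optional; this simply mirrors the value above):

mkdir -p ~/Hadoop/hadoop-1.0.4-store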
hdfs-site.xml should have this content:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.support.broken.append</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
</configuration>
mapred-site.xml should have this content:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
Edit conf/hadoop-env.sh and add the JAVA_HOME setting that we also set in the previous step above (again, adjust this to match your system):
...
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.19.x86_64
...
Start by sourcing the environment settings
source hadoop-1.0.4-env
Format the Hadoop file system
You only do this step once for a new cluster!
hadoop namenode -format
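If you script your setup, a small guard helps avoid re-formatting an existing cluster by accident (a sketch; it assumes the hadoop.tmp.dir value from core-site.xml above, under which Hadoop 1.x keeps the name node data in dfs/name):

# only format if the name node storage directory does not exist yet
[ -d ~/Hadoop/hadoop-1.0.4-store/dfs/name ] || hadoop namenode -format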
Start Hadoop namenode, datanode and secondary-namenode
start-dfs.sh
Check that you have the dfs daemons running
jps
You should see something like:
[trisberg@localhost ~]$ jps
27932 SecondaryNameNode
27827 DataNode
26384 NameNode
27988 Jps
Start Hadoop job-tracker and task-tracker
start-mapred.sh
Check that you have the dfs and mapred daemons running
jps
You should see something like:
[trisberg@localhost ~]$ jps
28170 TaskTracker
27932 SecondaryNameNode
28053 JobTracker
27827 DataNode
26384 NameNode
28259 Jps
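If all five daemons are running, you can optionally verify MapReduce end to end with one of the examples bundled in the root of the 1.0.4 distribution (the pi arguments are just a small sample size):

hadoop jar $HADOOP_PREFIX/hadoop-examples-1.0.4.jar pi 2 10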
If one of the processes is missing, check the logs for clues about what went wrong. Log files are available in the hadoop-1.0.4/logs directory.
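For example, to scan the NameNode log for errors (daemon log names follow the hadoop-<user>-<daemon>-<hostname>.log pattern, so adjust the glob to your system):

grep -i error ~/Hadoop/hadoop-1.0.4/logs/hadoop-*-namenode-*.log | tail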
Once the cluster is up and running, you can access the web interfaces at these addresses:
- NameNode: http://localhost:50070/dfshealth.jsp
- JobTracker: http://localhost:50030/jobtracker.jsp
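Before moving on, a quick smoke test confirms that HDFS accepts writes and that WebHDFS answers over REST (the file name is an arbitrary example; the relative HDFS path lands in your HDFS home directory):

# write a small file into HDFS and read it back
echo "hello hadoop" > /tmp/hello.txt
hadoop fs -put /tmp/hello.txt hello.txt
hadoop fs -cat hello.txt
# WebHDFS was enabled in hdfs-site.xml; list the HDFS root over REST
curl -s "http://localhost:50070/webhdfs/v1/?op=LISTSTATUS"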
This would be a good time to run the tests for the spring-hadoop project.
- source is available here: https://github.com/SpringSource/spring-hadoop
- run tests using the command:
./gradlew -Phd.fs=hdfs://localhost:8020 -Phd.jt=localhost:8021 clean build
When you are done testing you can use these commands to shut the cluster down:
stop-mapred.sh
stop-dfs.sh