Running on Amazon Elastic MapReduce
These are instructions for running a job using Amazon Elastic MapReduce. If you have already done the one-time setup steps below, skip to "the real stuff."
Set up Thrax as described in the QuickStart. Let $THRAX be the root directory of your Thrax installation.
Create an Amazon AWS account and then set up your security credentials. The values you need are:
- Access Key ID (e.g., 022QF06E7MXBSAMPLE)
- Secret Access Key (e.g., WcrlUX5JEDGM/SAMPLE/aVmYvHNif5zB+d9+ct)
If your secret key has slashes in it, you may want to regenerate it, as there are reported problems using such keys with command-line access to S3 through Hadoop.
Edit the file $THRAX/AwsCredentials.properties, filling in your access key ID and secret access key. It will look like this:
accessKey=022QF06E7MXBSAMPLE
secretKey=WcrlUX5JEDGM/SAMPLE/aVmYvHNif5zB+d9+ct
Then make sure these three environment variables are set: HADOOP and AWS_SDK (both of which can point to the joshua/lib folder), and HADOOP_VERSION (0.20.203.0).
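The environment setup can be sketched like this; the joshua/lib path below is only an example, so point HADOOP and AWS_SDK at wherever those jars actually live on your machine:

```shell
# Example paths -- adjust to your own installation.
export HADOOP="$HOME/joshua/lib"
export AWS_SDK="$HOME/joshua/lib"
export HADOOP_VERSION="0.20.203.0"
```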
The jar file includes this file, so rebuild it by typing:
ant jar
This will place a new jar file with your credentials in $THRAX/bin/thrax.jar, so that it can be accessed when running on Amazon's Elastic MapReduce.
s3cmd
is a command-line interface to Amazon's S3 cloud storage. It's easy to get: just point your browser at
http://sourceforge.net/projects/s3tools/files/s3cmd/1.0.0/s3cmd-1.0.0.tar.gz/download
then unpack the tarball. You should add the unpacked folder to your $PATH.
elastic-mapreduce
is a command-line interface to Amazon's Elastic MapReduce. s3cmd above is just for putting files into and getting files from an S3 bucket; this command lets you actually start a MapReduce job from the command line. Go to
http://aws.amazon.com/code/Elastic-MapReduce/2264
and click download, or download it directly from
http://elasticmapreduce.s3.amazonaws.com/elastic-mapreduce-ruby.zip
then unpack the resulting zip file (Warning: tarbomb; it unpacks directly into the current directory, so unzip it inside a fresh folder). Again, we recommend you put the folder on your $PATH.
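One way to contain the tarbomb is to give the archive its own directory before unpacking; the paths here are just an illustration, not required locations:

```shell
# Hypothetical install location -- any fresh directory works.
mkdir -p "$HOME/tools/elastic-mapreduce"
cd "$HOME/tools/elastic-mapreduce"
# unzip ~/Downloads/elastic-mapreduce-ruby.zip   # run this after downloading
export PATH="$HOME/tools/elastic-mapreduce:$PATH"
```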
Yes, another one. This one is for use with the elastic-mapreduce command-line utility. It is a JSON file that looks like this:
{
"access_id": "<insert your AWS access id here>",
"private_key": "<insert your AWS secret access key here>",
"keypair": "<insert the name of your Amazon ec2 key-pair here>",
"key-pair-file": "<insert the path to the .pem file for your Amazon ec2 key pair here>",
"region": "<The region where you wish to launch your job flows. Should be one of us-east-1, us-west-1 or eu-west-1>"
}
To find your key pair, go to https://console.aws.amazon.com/ec2/home?#s=KeyPairs
$ s3cmd --configure
It'll ask for a GPG program; you can leave this blank.
$ s3cmd mb s3://NAME-YOUR-BUCKET
Amazon's bucket names use a global namespace shared by all users of the service, so choose something that no one else is likely to have chosen.
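One sketch of picking a likely-unique name is to fold in your username and the date; this naming scheme is just a suggestion, not anything S3 requires:

```shell
# Build a bucket name that is unlikely to collide with anyone else's.
BUCKET="thrax-$USER-$(date +%Y%m%d)"
echo "s3://$BUCKET"
# s3cmd mb "s3://$BUCKET"   # run this once s3cmd is configured
```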
# the output path on S3
amazon-work s3://BUCKET/my-thrax-run
# where the thrax.jar will be put on S3
amazon-jar s3://BUCKET/thrax.jar
# number of nodes to request (default 5)
amazon-num-instances 1
# node type (default m1.small)
amazon-instance-type m1.small
Also, change your input-file key to point somewhere on your bucket.
input-file s3://BUCKET/europarl.unified
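Uploading the corpus yourself might look like the sketch below (the bucket and file names are hypothetical, and the run script will also offer to upload the file for you if it is missing); the s3cmd call is guarded so the snippet is a no-op where s3cmd is not installed:

```shell
# Hypothetical names -- substitute your bucket and your corpus file.
BUCKET=my-thrax-bucket
if command -v s3cmd >/dev/null 2>&1; then
  s3cmd put europarl.unified "s3://$BUCKET/europarl.unified"
fi
```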
$ $THRAX/scripts/run_on_amazon.sh thrax.conf credentials.json
Here's what the script does:
- uploads your conf file to amazon-work/thrax.conf
- tries to upload thrax.jar to amazon-jar (it will ask permission to overwrite)
- checks to make sure the input file exists (if not, it will prompt you for the local path, then upload it to the bucket at input-file)
- starts an Elastic MapReduce job of amazon-num-instances nodes of type amazon-instance-type
The logs will all be written to amazon-work/logs, and the pieces of the final grammar will be at amazon-work/final.
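Fetching the finished grammar back down can be sketched as follows; the bucket name and output prefix are hypothetical (they should match your amazon-work setting), and the call is guarded so the snippet is a no-op where s3cmd is not installed:

```shell
# Hypothetical names -- match these to your thrax.conf settings.
BUCKET=my-thrax-bucket
if command -v s3cmd >/dev/null 2>&1; then
  s3cmd get --recursive "s3://$BUCKET/my-thrax-run/final/" grammar/
fi
```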