Running on Amazon Elastic MapReduce
These are instructions for running a job using Amazon Elastic MapReduce. If you have already done the one-time setup steps below, skip to "the real stuff."
Set up Thrax as described in the QuickStart. Let $THRAX be the root directory of your Thrax installation.
Create an Amazon AWS account and then set up your security credentials. The values you need are:
- Access Key ID (e.g., 022QF06E7MXBSAMPLE)
- Secret Access Key (e.g., WcrlUX5JEDGM/SAMPLE/aVmYvHNif5zB+d9+ct)
If your secret key has slashes in it, you may want to regenerate it, as there are reported problems using such keys with command-line access to S3 through Hadoop.
Edit the file $THRAX/AwsCredentials.properties, filling in your access key ID and secret access key. It will look like this:
accessKey=022QF06E7MXBSAMPLE
secretKey=WcrlUX5JEDGM/SAMPLE/aVmYvHNif5zB+d9+ct
Then make sure these three environment variables are set: HADOOP and AWS_SDK (both of which can point to the joshua/lib folder), and HADOOP_VERSION (0.20.203.0).
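The environment setup can be sketched like this; the joshua/lib path below is only an example, so point HADOOP and AWS_SDK at wherever those jars actually live on your machine:

```shell
# Example paths -- adjust to your own installation.
export HADOOP="$HOME/joshua/lib"
export AWS_SDK="$HOME/joshua/lib"
export HADOOP_VERSION="0.20.203.0"
```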
The jar file includes this file, so rebuild it by typing:
ant jar
This will place a new jar file with your credentials in $THRAX/bin/thrax.jar, so that it can be accessed when running on Amazon's Elastic MapReduce.
s3cmd
is a command-line interface to Amazon's S3 cloud storage. It's easy to get: just point your browser at
http://sourceforge.net/projects/s3tools/files/s3cmd/1.0.0/s3cmd-1.0.0.tar.gz/download
then unpack the tarball. You should add the unpacked folder to your $PATH.
elastic-mapreduce
is a command-line interface to Amazon's Elastic MapReduce. s3cmd above is just for putting files into and getting files from an S3 bucket; this command lets you actually start a MapReduce job from the command line. Go to
http://aws.amazon.com/code/Elastic-MapReduce/2264
and click download, or download it directly from
http://elasticmapreduce.s3.amazonaws.com/elastic-mapreduce-ruby.zip
then unpack the resulting zip file (Warning: tarbomb; it unpacks directly into the current directory, so unzip it inside a fresh folder). Again, we recommend you put the folder on your $PATH.
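One way to contain the tarbomb is to give the archive its own directory before unpacking; the paths here are just an illustration, not required locations:

```shell
# Hypothetical install location -- any fresh directory works.
mkdir -p "$HOME/tools/elastic-mapreduce"
cd "$HOME/tools/elastic-mapreduce"
# unzip ~/Downloads/elastic-mapreduce-ruby.zip   # run this after downloading
export PATH="$HOME/tools/elastic-mapreduce:$PATH"
```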
Yes, another one. This one is for use with the elastic-mapreduce command-line utility. It is a JSON file that looks like this:
{
"access_id": "<insert your AWS access id here>",
"private_key": "<insert your AWS secret access key here>",
"keypair": "<insert the name of your Amazon ec2 key-pair here>",
"key-pair-file": "<insert the path to the .pem file for your Amazon ec2 key pair here>",
"region": "<The region where you wish to launch your job flows. Should be one of us-east-1, us-west-1 or eu-west-1>"
}
To find your key pair, go to https://console.aws.amazon.com/ec2/home?#s=KeyPairs
$ s3cmd --configure
It'll ask for a GPG program; you can leave this blank.
$ s3cmd mb s3://NAME-YOUR-BUCKET
Amazon's bucket names use a global namespace shared by all users of the service, so choose something that no one else is likely to have chosen.
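One sketch of picking a likely-unique name is to fold in your username and the date; this naming scheme is just a suggestion, not anything S3 requires:

```shell
# Build a bucket name that is unlikely to collide with anyone else's.
BUCKET="thrax-$USER-$(date +%Y%m%d)"
echo "s3://$BUCKET"
# s3cmd mb "s3://$BUCKET"   # run this once s3cmd is configured
```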
# the output path on S3
amazon-work s3://BUCKET/my-thrax-run
# where the thrax.jar will be put on S3
amazon-jar s3://BUCKET/thrax.jar
# number of nodes to request (default 5)
amazon-num-instances 1
# node type (default m1.small)
amazon-instance-type m1.small
Also, change your input-file key to point somewhere on your bucket.
input-file s3://BUCKET/europarl.unified
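Uploading the corpus yourself might look like the sketch below (the bucket and file names are hypothetical, and the run script will also offer to upload the file for you if it is missing); the s3cmd call is guarded so the snippet is a no-op where s3cmd is not installed:

```shell
# Hypothetical names -- substitute your bucket and your corpus file.
BUCKET=my-thrax-bucket
if command -v s3cmd >/dev/null 2>&1; then
  s3cmd put europarl.unified "s3://$BUCKET/europarl.unified"
fi
```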
$ $THRAX/scripts/run_on_amazon.sh thrax.conf credentials.json
Here's what the script does:
- uploads your conf file to amazon-work/thrax.conf
- tries to upload thrax.jar to amazon-jar (it will ask permission to overwrite)
- checks to make sure the input file exists (if not, it will prompt you for the local path, then upload it to the bucket at input-file)
- starts an Elastic MapReduce job of amazon-num-instances nodes of type amazon-instance-type
The logs will all be written to amazon-work/logs, and the pieces of the final grammar will be at amazon-work/final.
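Fetching the finished grammar back down can be sketched as follows; the bucket name and output prefix are hypothetical (they should match your amazon-work setting), and the call is guarded so the snippet is a no-op where s3cmd is not installed:

```shell
# Hypothetical names -- match these to your thrax.conf settings.
BUCKET=my-thrax-bucket
if command -v s3cmd >/dev/null 2>&1; then
  s3cmd get --recursive "s3://$BUCKET/my-thrax-run/final/" grammar/
fi
```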