Running on Amazon Elastic MapReduce
These are instructions for running a job on Amazon Elastic MapReduce. The first sections cover one-time setup; if you have already done that, skip ahead to "the real stuff."
Set up Thrax as described in the QuickStart. Let $THRAX be the root directory of your Thrax installation.
Create an Amazon AWS account and then set up your security credentials. The values you need are:
- Access Key ID (e.g., 022QF06E7MXBSAMPLE)
- Secret Access Key (e.g., WcrlUX5JEDGM/SAMPLE/aVmYvHNif5zB+d9+ct)
If your secret key has slashes in it, you may want to regenerate it, as there are reported problems with such keys and the command-line access to S3 through Hadoop.
Edit the file $THRAX/AwsCredentials.properties, filling in your access key ID and secret access key. It will look like this:
accessKey=022QF06E7MXBSAMPLE
secretKey=WcrlUX5JEDGM/SAMPLE/aVmYvHNif5zB+d9+ct
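As a quick sanity check before rebuilding the jar, you can confirm that both keys are present in the file. The sketch below writes the sample (non-working) values shown above; on a real setup you would edit $THRAX/AwsCredentials.properties by hand instead:

```shell
# Write the properties file with the sample values from above
cat > AwsCredentials.properties <<'EOF'
accessKey=022QF06E7MXBSAMPLE
secretKey=WcrlUX5JEDGM/SAMPLE/aVmYvHNif5zB+d9+ct
EOF

# Both keys should be present: this prints 2
grep -Ec '^(accessKey|secretKey)=' AwsCredentials.properties
```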
The jar file includes this file, so rebuild it by typing:
ant jar
This will place a new jar file with your credentials in $THRAX/bin/thrax.jar, so that they can be accessed when running on Amazon's Elastic MapReduce.
s3cmd is a command-line interface to Amazon's S3 cloud storage. It's easy to get: just point your browser at
http://sourceforge.net/projects/s3tools/files/s3cmd/1.0.0/s3cmd-1.0.0.tar.gz/download
then unpack the tarball. You should add the unpacked folder to your $PATH.
elastic-mapreduce is a command-line interface to Amazon's Elastic MapReduce. s3cmd above is just for putting files into and getting them from an S3 bucket; this command lets you actually start a MapReduce job from the command line. Go to
http://aws.amazon.com/code/Elastic-MapReduce/2264
and click download, then unpack the resulting tarball. Again, we recommend you put the folder on your $PATH.
Yes, another credentials file. This one is for use with the elastic-mapreduce command-line utility. It is a JSON file that looks like this:
{
  "access_id": "",
  "private_key": "",
  "keypair": "",
  "key-pair-file": "<insert the path to the .pem file for your Amazon ec2 key pair here>",
  "region": "<The region where you wish to launch your job flows. Should be one of us-east-1, us-west-1 or eu-west-1>"
}
Make an S3 bucket to hold your files:
$ s3cmd mb s3://BUCKET
Amazon's bucket names use a global namespace shared by all users of the service, so choose something that no one else is likely to have chosen.
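Since the name must be globally unique, one simple trick is to fold your username and the date into it. This naming scheme is just a suggestion, and the command is echoed rather than executed here because actually creating the bucket requires your real AWS credentials:

```shell
# Build a bucket name unlikely to collide with anyone else's
BUCKET="thrax-$(whoami)-$(date +%Y%m%d)"

# The command you would then run
echo "s3cmd mb s3://$BUCKET"
```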
# the output path on S3
amazon-work s3://BUCKET/my-thrax-run
# where the thrax.jar will be put on S3
amazon-jar s3://BUCKET/thrax.jar
# number of nodes to request (default 5)
amazon-num-instances 1
# node type (default m1.small)
amazon-instance-type m1.small
Also, change your input-file key to point somewhere on your bucket.
input-file s3://BUCKET/europarl.unified
$ $THRAX/scripts/run_on_amazon.sh thrax.conf credentials.json
Here's what the script does:
- uploads your conf file to amazon-work/thrax.conf
- tries to upload thrax.jar to amazon-jar (it will ask permission to overwrite)
- checks to make sure the input file exists (if not, it will prompt you for the local path, then upload it to the bucket at input-file)
- starts an elastic mapreduce job of amazon-num-instances nodes of type amazon-instance-type
The logs will all be written to amazon-work/logs, and the pieces of the final grammar will be at amazon-work/final.
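The output paths are derived from the amazon-work key; a quick sketch with the sample value from the conf above (BUCKET is still a placeholder):

```shell
# amazon-work from the sample conf
WORK="s3://BUCKET/my-thrax-run"

# where the job's outputs land
echo "logs:    $WORK/logs"
echo "grammar: $WORK/final"
```

Once the job finishes, the grammar pieces can be pulled back down with something like s3cmd get --recursive on the final path.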