-
Notifications
You must be signed in to change notification settings - Fork 4
User Guide
br supports processing a directory full of files rather than a single file. Any of the files may be compressed using gzip and bashreduce will detect that and transparently handle decompression when -i is specified. However, gzipped stdin is not supported, use zcat
instead.
$ ls input_directory file1.gz file2.gz file3.gz $ br -m "grep abc" -i input_directory -o output
or
$ br -m "grep abc" -i input.gz -o output
The -M option allows you to specify your own merge program instead of the default (sort -M). It enables you to create a re-reduce step on the end. You must be careful to do as little work as possible in the merge step since it is serializing the output of map and reduce.
$ br -m "cut -f2 | sort" -r "uniq -c" -M "merge.rb" -i input -o ouput
Attention: if your merge step is significant, it will dominate and performance gains will be reduced.
By default, bashreduce requires data to be sent over the network from your machine to the workers. However, you can execute your applications in a more efficient manner using the -f option. This option causes bashreduce to distribute filenames instead of lines of data over the network. This will greatly reduce the network bandwidth required, but assumes that each machine has a local copy of the data or access to a shared filesystem.
In this case, the input file must contains a list with the filenames to be processed, one filename per line.
$ cat input /path/to/file1.gz /path/to/file2.gz /path/to/file3.gz $ br -f -m "grep abc" -i input -o output
// TODO