Skip to content
jweslley edited this page Sep 13, 2010 · 4 revisions

Taking directories as input and handling gzipped files

br supports processing a directory full of files rather than a single file. Any of the files may be compressed using gzip and bashreduce will detect that and transparently handle decompression when -i is specified. However, gzipped stdin is not supported, use zcat instead.

$ ls input_directory
file1.gz file2.gz file3.gz

$ br -m "grep abc" -i input_directory -o output

or

$ br -m "grep abc" -i input.gz -o output

Applying a re-reduce, the merge option

The -M option allows you to specify your own merge program instead of the default (sort -M). It enables you to create a re-reduce step on the end. You must be careful to do as little work as possible in the merge step since it is serializing the output of map and reduce.

$ br -m "cut -f2 | sort" -r "uniq -c" -M "merge.rb" -i input -o ouput

1

Attention: if your merge step is significant, it will dominate and performance gains will be reduced.

Getting better performance

By default, bashreduce requires data to be sent over the network from your machine to the workers. However, you can execute your applications in a more efficient manner using the -f option. This option causes bashreduce to distribute filenames instead of lines of data over the network. This will greatly reduce the network bandwidth required, but assumes that each machine has a local copy of the data or access to a shared filesystem.

In this case, the input file must contains a list with the filenames to be processed, one filename per line.

$ cat input
/path/to/file1.gz
/path/to/file2.gz
/path/to/file3.gz

$ br -f -m "grep abc" -i input -o output

Using BeeFS

// TODO

Notes

1 View the merge.rb file here.