Skip to content
jweslley edited this page Sep 13, 2010 · 14 revisions

Erik Frey ’s Benchmarks 1

big honkin’ local machine

Let’s start with a simpler scenario: I have a machine with multiple cores and with normal unix tools I’m relegated to using just one core. How does br help us here? Here’s br on an 8-core machine, essentially operating as a poor man’s multi-core sort:

command using time rate
sort -k1,1 -S2G 4gb_file > 4gb_file_sorted coreutils 30m32.078s 2.24 MBps
br -i 4gb_file -o 4gb_file_sorted coreutils 11m3.111s 6.18 MBps
br -i 4gb_file -o 4gb_file_sorted brp/brm 7m13.695s 9.44 MBps

The job completely i/o saturates, but still a reasonable gain!

many cheap machines

Here lies the promise of mapreduce: rather than use my big honkin’ machine, I have a bunch of cheaper machines lying around that I can distribute my work to. How does br behave when I add four cheaper 4-core machines into the mix?

command using time rate
sort -k1,1 -S2G 4gb_file > 4gb_file_sorted coreutils 30m32.078s 2.24 MBps
br -i 4gb_file -o 4gb_file_sorted coreutils 8m30.652s 8.02 MBps
br -i 4gb_file -o 4gb_file_sorted brp/brm 4m7.596s 16.54 MBps

We have a new bottleneck: we’re limited by how quickly we can partition/pump our dataset out to the nodes. awk and sort begin to show their limitations (our clever awk script is a bit cpu bound, and sort -m can only merge so many files at once). So we use two little helper programs written in C (yes, I know! it’s cheating! if you can think of a better partition/merge using core unix tools, contact me) to partition the data and merge it back.

When BashReduce Met BeeFS

In my lab, we have working on the implementation of a distributed file system that harnesses the free disk space of desktops connected by a LAN, called BeeFS. The primary purpose of BeeFS is to provide a POSIX-compliant file server that is not only more efficient than the prevalent approach based on dedicated servers (e.g. NFS), but also cheaper and naturally scalable. Nevertheless, after getting to know bashreduce, it was clear to us that BeeFS and BashReduce were meant for each other. Thus, we made a simple modification to bashreduce, which allows it to fully explore the distributed nature of BeeFS to substantially boosts its performance.

L’union fait la force

For this benchmark, we use a 20GiB workload containing the output logs of simulation experiments. The workload consists of 22 files, stored in a BeeFS instance, whose sizes are approximately equals to 1GiB. Thus, our bashreduce application must scan through this data set looking for simulation events’ names in the map phase, while the reduce phase will summarize the number of occurrences for each simulation event. Additionally, in order to generate the final output, the bashreduce application runs a custom merge program, which concatenate the output coming from the workers and generate the final summary of simulation event occurrences.

$ br -b -m "awk '{print \$2}' | sort" -r "uniq -c" -M "merge.rb" \
       -i list_of_simulation_logs_files -o simulation_events_summary

Besides that, we performed three different scenarios of this experiment aiming to evaluate the scalability properties by measuring the effects of an increasing demand workload. Each one of these scenarios process just a fraction of the previously mentioned workload. This fraction of the workload is a function of the number of workers. Thus, these scenarios used 5, 10 and 22 workers to process 5GiB, 10GiB and 20GiB of raw data, respectively. We used cheaper machines, each one had a single 2.40 GHz Intel Core 2 Duo processor running Ubuntu 9.04 (kernel version 2.6.28) with 2GiB of RAM.

Let’s go to the results:

number of machines time rate
5 0m40.427s 126.65MBps
10 0m45.422s 225.44MBps
22 0m56.626s 361.67MBps

Why not use NFS instead of BeeFS?

Because the client-server architecture of NFS requires that data be transfered over the network, consuming lots of bandwidth. However, we have executed the same experiment previously mentioned using NFS and got these results:

number of machines time rate
5 11m35.598s 7.3MBps
10 22m15.000s 7.6MBps
22 39m48.000s 8.7MBps

Notes

1 Erik Frey is the original creator of bashreduce.