
Herding `Cat`s

When working with big data, there are thousands of housekeeping tasks to do. As you will soon discover, moving data around, munging it, transforming it, and other mundane tasks will take up a depressingly inordinate amount of your time. A proper knowledge of some useful tools and tricks will make these tasks doable, and in many cases, easy.

In this chapter we discuss a variety of Unix command-line tools that are essential for shaping, transforming, and munging text. All of these commands are covered elsewhere, and covered more completely, but here we focus on their applications for big data, specifically their use within Hadoop.

By the end of this chapter you should be comfortable chaining these commands together, both at your own terminal and as the mappers and reducers of a Hadoop streaming job, to filter, reshape, and summarize data without writing a full-blown script.

If you’re already familiar with Unix pipes and chaining commands together, feel free to skip the first few sections and jump straight into the tour of useful commands.

A series of pipes

One of the key aspects of the Unix philosophy is that it is built on simple tools that can be combined in nearly infinite ways to create complex behavior. Command-line tools are linked together through 'pipes', which take output from one tool and feed it into the input of another. For example, the cat command reads a file and outputs its contents. You can pipe this output into another command to perform useful transformations. For instance, to select only the first field (delimited by tabs) of each line in a file you could:

cat somefile.txt | cut -f1

The vertical pipe character, '|', represents the pipe between the cat and cut commands. You can chain as many commands as you like, and it’s common to construct chains of 5 commands or more.
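As a purely illustrative example (the file name is hypothetical, and the individual commands are all covered later in this chapter), here is a five-stage chain that prints the ten most common first fields in a tab-delimited file:

cat somefile.txt | cut -f1 | sort | uniq -c | sort -rn | head -10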

In addition to redirecting a command’s output into another command, you can also redirect output into a file with the > operator. For example:

echo 'foobar' > stuff.txt

writes the text 'foobar' to stuff.txt. stuff.txt is created if it doesn’t exist, and is overwritten if it does.

If you’d like to append to a file instead of overwriting it, the '>>' operator will come in handy. Instead of creating or overwriting the specified file, it creates the file if it doesn’t exist, and appends to the file if it does.
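For example, this pair of commands leaves stuff.txt holding two lines, 'foo' and then 'bar':

echo 'foo' > stuff.txt
echo 'bar' >> stuff.txt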

As a side note, the Hadoop streaming API is built using pipes. Hadoop sends each record in the map and reduce phases to your map and reduce scripts' stdin. Your scripts print the results of their processing to stdout, which Hadoop collects.

Crossing the streams

Each Unix command has 3 input/output streams: standard input, standard output, and standard error, commonly referred to by the more concise stdin, stdout, and stderr. Commands accept input via stdin and generate output through stdout. When we said that pipes take output from one tool and feed it into the input of another, what we really meant was that pipes feed one command’s stdout stream into another command’s stdin stream.

The third stream, stderr, is generally used for error messages, progress information, or other text that is not strictly 'output'. Stderr is especially useful because it allows you to see messages from a command even if you have redirected the command’s stdout. For example, if you wanted to run a command and redirect its output into a file, you could still see any errors generated via stderr. curl, a command used to make network requests, is a good illustration: it writes the content it fetches to stdout, but sends its progress meter and any error messages to stderr.

curl 'http://infochimps.com' > index.html

The page content lands in index.html, but curl's progress meter and any complaints still show up on your terminal, because they are written to stderr rather than stdout.

It’s occasionally useful to be able to redirect these streams independently or into each other. For example, if you’re running a command and want to log its output as well as any errors generated, you should redirect stderr into stdout and then direct stdout to a file:

yourcommand > output.log 2>&1

The 2>&1 tells the shell to send stream 2 (stderr) to wherever stream 1 (stdout) is currently pointing, so it must come after the redirection of stdout into the file. (yourcommand and output.log are placeholders, of course.)

Alternatively, you could redirect stderr and stdout into separate files:

yourcommand 1> results.txt 2> errors.txt

You might also want to suppress stderr if the command you’re using gets too chatty. You can do that by redirecting stderr to /dev/null, which is a special file that discards everything you hand it.
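For instance, to throw away the complaints from a chatty (and here, hypothetical) command:

yourcommand 2> /dev/null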

Now that you understand the basics of pipes and output redirection, let's get on to the fun part - munging data!

cat and echo

cat reads the content of a file and prints it to stdout. It can accept multiple files, like so, and will print them in order:

cat foo.txt bar.txt bat.txt

cat is generally used to examine the contents of a file or as the start of a chain of commands:

cat foo.txt | sort | uniq > bar.txt

In addition to examining and piping around files, cat is also useful as an 'identity mapper', a mapper which does nothing. If your data already has a key that you would like to group on, you can specify cat as your mapper and each record will pass untouched through the map phase to the sort phase. Then, the sort and shuffle will group all records with the same key at the proper reducer, where you can perform further manipulations.
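Using the wu-mapred streaming wrapper that appears later in this chapter, an identity-mapper job is as simple as:

wu-mapred --mapper='cat'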

echo is very similar to cat except it prints the supplied text to stdout. For example:

echo foo bar baz bat > foo.txt

will result in foo.txt holding 'foo bar baz bat', followed by a newline. If you don’t want the newline you can give echo the -n option.
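For example:

echo -n 'foo bar baz bat' > foo.txt

leaves foo.txt without a trailing newline.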

Filtering

cut

The cut command allows you to cut certain pieces from each line, leaving only the interesting bits. The -f option means 'keep only these fields', and takes a comma-delimited list of numbers and ranges. So, to select the first three fields and the fifth field of a TSV file you could use:

cat somefile.txt | cut -f 1-3,5

Watch out - the field numbering is one-indexed. By default cut assumes that fields are tab-delimited, but delimiters are configurable with the -d option.
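For instance, to grab the first column of a comma-delimited file (setting aside, for now, the complication of quoted commas):

cut -d, -f1 somefile.csv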

This is especially useful if you have TSV output on the HDFS and want to filter it down to only a handful of fields. You can create a Hadoop streaming job to do this like so:

wu-mapred --mapper='cut -f 1-3,5'

cut is great if you know the indices of the columns you want to keep, but if your data is schema-less or nearly so (like unstructured text), things get slightly more complicated. For example, if you want to select the last field from each of your records, but the number of fields varies from record to record, you can combine cut with the rev command, which reverses text:

cat foobar.txt | rev | cut -f1 | rev

This reverses each line, selects the first field in the reversed line (which is really the last field), and then reverses the text again before outputting it.

cut also has a -c (for 'character') option that allows you to select ranges of characters. This is useful for quickly verifying the output of a job with long lines. For example, in the Regional Flavor exploration, many of the jobs output wordbags which are just giant JSON blobs, one line of which would overflow your entire terminal. If you want to quickly verify that the output looks sane, you could use:

wu-cat /data/results/wikipedia/wordbags.tsv | cut -c 1-100

Character encodings

Cut’s -c option, like many Unix text-manipulation tools, requires a little forethought when working with different character encodings, because each encoding can use a different number of bytes per character. If cut thinks that it is reading ASCII (one byte per character) when it is really reading UTF-8 (a variable number of bytes per character), it will split characters apart and produce meaningless gibberish. Our recommendation is to get your data into UTF-8 as soon as possible and keep it that way, but the fact of the matter is that sometimes you have to deal with other encodings.

Unix’s solution to this problem is the LC_* environment variables. LC stands for 'locale', and lets you specify your preferred language and character encoding for various types of data.

LC_CTYPE (locale character type) sets the default character encoding used system-wide. In the absence of LC_CTYPE, LANG is used as the default, and LC_ALL can be used to override all other locale settings. If you’re not sure whether your locale settings are having their intended effect, check the man page of the tool you are using and make sure that it obeys the LC_* variables.

You can view your current locale settings with the locale command. Operating systems differ on how they represent languages and character encodings, but on my machine en_US.UTF-8 represents English, encoded in UTF-8.

Remember that if you’re using these commands as Hadoop mappers or reducers, you must set these environment variables across your entire cluster, or set them at the top of your script.
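For example, a single line at the top of a streaming script does the trick:

export LC_ALL=en_US.UTF-8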

head and tail

While cut is used to select columns of output, head and tail are used to select lines of output. head selects lines at the beginning of its input while tail selects lines at the end. For example, to view only the first 10 lines of a file, you could use head like so:

head -10 foobar.txt

head is especially useful for sanity-checking the output of a Hadoop job without overflowing your terminal. head and cut make a killer combination:

wu-cat /data/results/foobar | head -10 | cut -c 1-100

tail works almost identically to head. Viewing the last ten lines of a file is easy:

tail -10 foobar.txt

tail also lets you specify the selection relative to the beginning of the file with the '+' operator (on modern systems this is spelled -n +N). So, to select every line from the 10th line on:

tail -n +10 foobar.txt

What if you just finished uploading 5,000 small files to the HDFS and realized that you left a header on every one of them? No worries, just use tail as a mapper to remove the header:

wu-mapred --mapper='tail -n +2'

This outputs every line but the first one.

tail is also useful for watching files as they are written to. For example, if you have a log file that you want to watch for errors or information, you can 'tail' it with the -f option:

tail -f yourlogs.log

This outputs the end of the log to your terminal and waits for new content, updating the output as more is written to yourlogs.log.

grep

grep is a tool for finding patterns in text. You can give it a word, and it will diligently search its input, printing only the lines that contain that word:

grep foobar somefile.txt

This prints every line of somefile.txt that contains the string 'foobar'.

grep has many options, and accepts regular expressions as well as plain words and word sequences:

grep -E 'foo(bar|baz)' somefile.txt

The -E flag turns on extended regular expressions, so this matches lines containing either 'foobar' or 'foobaz'.

The -i option is very useful to make grep ignore case:

grep -i foobar somefile.txt

As is zgrep, a companion command that decompresses gzip-ed text before grepping through it. This can be tremendously useful if you keep files on your HDFS in a compressed form to save space.

When using grep in Hadoop jobs, beware its non-standard exit statuses. grep returns 0 if it finds matching lines, 1 if it doesn’t find any matching lines, and a number greater than 1 if there was an error. Because Hadoop interprets any exit code greater than 0 as an error, any task whose grep finds no matching lines will be considered 'failed', and Hadoop will re-try it, again without success. To fix this, we have to swallow grep’s exit status like so:

(grep foobar || true)

This ensures that Hadoop doesn’t erroneously kill your jobs.

Sorting and grouping

sort

As you might expect, sort sorts lines. By default it sorts alphabetically, considering the whole line:

sort foobar.txt

You can also tell it to sort numerically with the -n option, but -n only sorts integers properly. To sort decimals and numbers in scientific notation properly, use the -g option:

sort -g somefile.txt

You can reverse the sort order with -r:

sort -r somefile.txt

You can also specify a column to sort on with the -k option:

sort -k2 somefile.txt

(This sorts on the second column through the end of the line; use -k2,2 to sort on the second column alone.)

By default the column delimiter is a non-blank to blank transition, so any content character followed by a whitespace character (tab, space, etc…) is treated as a column. This can be tricky if your data is tab delimited, but contains spaces within columns. For example, if you were trying to sort some tab-delimited data containing movie titles, you would have to tell sort to use tab as the delimiter. If you try the obvious solution, you might be disappointed with the result:

sort -t"\t"
sort: multi-character tab `\\t'

Instead we have to somehow give the -t option a literal tab. The easiest way to do this is:

sort -t$'\t'

$'<string>' is a special directive that tells your shell to expand <string> into its equivalent literal. You can do the same with other control characters, including \n, \r, etc…

Another useful way of doing this is by inserting a literal tab manually:

sort -t'	'

To insert the tab literal between the single quotes, type CTRL-V and then Tab.

If you find your sort command is taking a long time, try increasing its sort buffer size with the --buffer-size option (-S for short). This can make things go a lot faster:

sort -S 1G somefile.txt

Sorting also deserves respect in a big data context: the shuffle that sits between Hadoop's map and reduce phases is, at heart, one enormous distributed sort on keys. Locally, sort earns its keep as the mandatory preparation step for uniq and join, both of which only behave correctly on sorted input.

uniq

uniq is used for working with duplicate lines: you can count them, remove them, or keep only them, among other things. For example, here is how you would find the number of Oscars each actor has won, given a list of annual Oscar winners:

sort oscar_winners.txt | uniq -c

Note the -c option, which prefixes each line of output with a count of how many times it occurred. Also note that we sort the list before piping it into uniq: input to uniq must always be sorted, or you will get erroneous results.

You can also restrict the output to lines that are not duplicated with the -u option:

sort somefile.txt | uniq -u

And only print duplicates with the -d option:

sort somefile.txt | uniq -d
In a Hadoop streaming job, uniq -c also makes a serviceable reducer: the sort and shuffle hands each reducer its records already ordered by key, so piping them through uniq -c turns a stream of keys into a count per key.

join

join matches up the lines of two files that share a common field, much like a database join. Both inputs must be sorted on the join field, which makes it a natural companion to sort:
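join -t$'\t' left.tsv right.tsv

This (illustrative) invocation joins two tab-separated files on their first fields, printing the join field followed by the remaining fields from each file. Lines whose key appears in only one of the files are dropped unless you ask for them with the -a option.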

Summarizing

wc

wc is a utility for counting words, lines, and characters in text. Without options, it reads its input and outputs the number of lines, words, and bytes, in that order:

wc foobar.txt

Given the -m option, wc will instead print the number of characters, as defined by the LC_CTYPE environment variable:

wc -m foobar.txt

We can use wc as a mapper to count the total number of words in all of our files on the HDFS:

wu-mapred --mapper='wc -w'

Each mapper emits the word count for its slice of the input; summing the job's output gives the total.

md5sum and sha1sum

md5sum and sha1sum compute checksums of their input, which is handy for spot-checking that a file arrived on (or came back off of) the HDFS intact:
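md5sum foobar.txt

Two files with the same checksum are, for all practical purposes, identical, so comparing checksums is far cheaper than comparing the files themselves.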

Transforming

expand and unexpand

expand and unexpand are simple commands for converting between tabs and spaces; expand turns tabs into spaces, while unexpand turns spaces back into tabs:

expand -t 4 somefile.txt
unexpand -a somefile.txt

The first converts each tab into four spaces; the second converts runs of spaces back into tabs (the -a flag tells unexpand to convert all of them, not just the leading whitespace).

This comes in handy when trying to get data into or out of the TSV format.

tr

tr is used to translate individual characters: give it a set of characters to find and a set of characters to replace them with, and it converts its input one character at a time. It can also delete characters (-d) or squeeze runs of a repeated character down to one (-s), which makes it perfect for little chores like swapping delimiters or stripping the carriage returns from Windows-formatted files:
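For example (both file names are placeholders, and the first line is a quick-and-dirty conversion that will mangle any quoted commas):

cat somefile.csv | tr ',' '\t' > somefile.tsv
tr -d '\r' < windows.txt > unix.txt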

Why didn’t you include sed, awk, or my other favorite scripting language here?

Because they are full programming languages in their own right, and they deserve (and have) entire books of their own. The single-purpose tools in this chapter compose cleanly and are hard to get wrong; once a transformation outgrows them, you are usually better served by a high-level language with real libraries than by an ever-growing one-line incantation.

Moving things to and fro

For bulk copies between clusters (or between a cluster and another Hadoop-compatible filesystem), Hadoop's distcp runs the transfer as a mapreduce job of its own, so it parallelizes nicely.

To put something on the HDFS directly from a pipe:

hdp-mkdir infochimps.com
curl 'http://infochimps.com' | hdp-put - infochimps.com/index.html

Don’t use NFS as your transfer mechanism: a network-mounted filesystem looks convenient, but a cluster's worth of simultaneous readers and writers will turn it into a bottleneck and a single point of failure.

Stupid Hadoop Tricks

Mappers that process filenames, not file contents

You can output anything you want in your mappers.

Every once in a while, you need to do something where getting the content onto the HDFS is almost more work than it’s worth. For instance, say you had to process a whole bunch of files located in no convenient place or organization. You could either:

  • pull in all the files

  • transfer the files to the HDFS

  • start the job to process them

  • transfer them back off

or:

  • send, as the mapper input, the files to fetch

  • each mapper fetches the page contents and emits them (see the sketch below)
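A minimal sketch of such a fetch-mapper, assuming one URL per input line (the script name and details here are illustrative, not part of any library):

#!/usr/bin/env bash
# fetch_mapper.sh (hypothetical): read URLs from stdin, emit one "url <TAB> contents" record per line
while read -r url ; do
  contents="$(curl -s "$url" | tr '\t\n' '  ')"   # flatten tabs and newlines so each page fits a single record
  printf '%s\t%s\n' "$url" "$contents"
done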

Be careful: Hadoop has no rate limiting. It will quite happily obliterate any system you point it at; to the machine on the receiving end, there is no apparent difference between a Hadoop job and a concentrated Distributed Denial of Service attack.


Benign DDOS

Speaking of which… So you have an API. And you think it’s working well, and in fact you think it’s working really well. Want to simulate a 200x load spike? Replay a week’s worth of request logs at your server, accelerated to all show up in an hour. Each mapper reads a section of the logs, and makes the corresponding request (setting its browser string and referer URL accordingly). It emits the response duration, HTTP status code, and content size. There are dedicated tools to do this kind of HTTP benchmarking, but they typically make the same request over and over. Replaying a real load at higher speed means that your caching strategy is properly exercised.
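One way a replay-mapper might emit those measurements is with curl's write-out format. This is only a sketch: the log parsing and the browser-string and referer handling are left out, and the script name is made up.

#!/usr/bin/env bash
# replay_mapper.sh (hypothetical): expects one URL per input line
while read -r url ; do
  # discard the body, emit "status <TAB> bytes <TAB> seconds" for each request
  curl -s -o /dev/null -w '%{http_code}\t%{size_download}\t%{time_total}\n' "$url"
done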