The Wikipedia corpus is a good introduction to handling XML input. It lacks some of the really hairy aspects of XML handling foonote:[FIXME: make into references see the section on XML for a list of drawbacks], which will let us concentrate on the basics.
The raw data is a single XML file, 8 GB compressed and about 40 GB uncompressed. After a brief irrelevant header, it’s simply several million records that look like this:
<page> <title>Abraham Lincoln</title> <id>307</id> <revision> <id>454480143</id> <timestamp>2011-10-08T01:36:34Z</timestamp> <contributor> <username>AnomieBOT</username> <id>7611264</id> </contributor> <minor /> <comment>Dating maintenance tags: Page needed</comment> <text xml:space="preserve">...(contents omitted; they are XML-encoded (that's helpful) with spacing preserved including newlines)...</text> </revision> </page>
I’ve omitted the article contents, which are cleanly XML encoded and not in a CDATA block, so there’s no risk of a spurious XML tag in an inappropriate place; avoid CDATA blocks if you can. The contents preserve all the whitespace of the original body, so we’ll need to ensure that our XML parser does so as well.
From this, we’d like to extract the title, numeric page id, timestamp, and text, and re-emit each record in one of our favorite formats.
Now we meet our first two XML-induced complexities: splitting the file among mappers, so that you don’t send the first half of an article to one task and the rest to a different one; and recordizing the file from a stream of lines into discrete XML fragments, each describing one article.
FIXME: we used crack
not plain text the whole way
The law of small numbers: given millions of things, your one-in-a-million occurrences become commonplace.
Out of 12,389,353 records, almost all look like <text xml:space="preserve">…stuff</text>
— but 446 records have an empty body, <text xml:space="preserve" />
.
Needless to say, we found this out not while developing locally, but rather some hundreds of thousands of records in while running on the cluster.
This crashing is a good feature of our script: it wasn’t clear that an empty article body is permissible.
At 40GB uncompressed, the articles file will occupy about 320 HDFS blocks (assuming 128MB blocks), each destined for its own mapper. However, the division points among blocks is arbitrary: it might occur in the middle of a word in the middle of a record with no regard for your feelings about the matter. However, if you do it the courtesy of pointing to the first point within a block that a split should have occurred, Hadoop will handle the details of patching it onto the trailing end of the preceding block. Pretty cool.
You need to ensure that Hadoop splits the file at a record boundary: after </page>
, before the next <page>
tag.
If you’re
Writing an input format and splitter is only as hard as your input format makes it, but it’s the kind of pesky detail that lies right at the "do it right" vs "do it (stupid/simpl)ly" decision point. Luckily there’s a third option, which is to steal somebody else’s code [1]. Oliver Grisel (@ogrisel) has written an Wikipedia XML reader as a raw Java API reader in the Mahout project, and as a Pig loader in his pignlproc project. Mahout’s XmlInputFormat (src)
If all you need to do is yank the data out of it’s ill-starred format, or if the data format’s complexity demands the agility of a high-level language, you can use Hadoop Streaming as a brute-force solution. In this case, we’ll still be reading the data as a stream of lines, and use native libraries to do the XML parsing. We only need to ensure that the splits are correct, and the StreamXmlRecordReader
(doc / source);
that ships with Hadoop is sufficient.
class Wikipedia::RawArticle field :title, Integer field :id, Integer field :revision, Wikipedia::RawArticleRevision end class Wikipedia::RawArticleRevision field :id, Integer field :timestamp, Time field :text, String end