Front Cover

Mission Statement

Big Data for Chimps will:

  1. Explain a practical, actionable view of big data, centered on tested best practices, and give readers street-fighting smarts with Hadoop

  2. Give readers a useful, conceptual idea of big data: in its simplest form, big data is a small cluster of well-tagged information that sits upon a central pivot and can be manipulated through various shifts and rotations to deliver insights ("Insight is data in context"). Key to understanding big data is scalability: infinite amounts of data can rest upon infinite pivot points. (Flip - is that accurate, or would you say there’s just one central pivot, like a Rubik’s cube?)

  3. Bring the concepts and their business applications to life through examples built on real data and real problems.

Dear Early Release Reader,

Hello and thank you, courageous and farsighted early released-to’er! We want to make sure the book delivers value to you now, and rewards your early confidence by becoming a book you’ll be proud to own.

Our Questions for You

  • "If it’s well-covered on the internet, leave it out" is the rule of thumb we’re using for the introductory material in this book. We find it annoying when technology books go on (and on) as if giving a bus tour of London ("Outside the window to your left you can see the southwest side of the British Museum!"). So we will refrain from exhaustive coverage, yet you should never find yourself completely stranded in the book. Please do let us know if you notice that we’re either too exhaustive or too brief in our coverage.

  • Analogies: Two "characters" appear in the first part of the book, Chimpanzee and Elephant. We use them because their activities are surprisingly relevant to understanding the internals of Hadoop; however, we don’t want to waste your time (or patience) remapping those activities back to the problem at hand, so please let us know how useful you find the analogies.

  • Our plan: We’ll roll material out over the next few months. Should we find we need to cut things (I hope not to), I’ve flagged a few chapters as (bubble).

We Want Your Feedback

OK! On to the book. Or, on to the introductory parts of the book and then the book.

About

How We’re Writing This Book

I plan to push chapters to the publicly viewable 'Big Data for Chimps' git repo as they are written, and to post them periodically to the Infochimps blog after minor cleanup.

We really mean it about the git social-coding thing — please comment on the text, file issues and send pull requests. However! We might not use your feedback, no matter how dazzlingly cogent it is; and while we are soliciting comments from readers, we are not seeking content from collaborators.

Who This Book Is For

We’d like for you to be familiar with at least one programming language, but it doesn’t have to be Ruby. Familiarity with SQL will help a bit, but isn’t essential.

Most importantly, you should have an actual project in mind that requires a big data toolkit to solve — a problem that requires scaling out across multiple machines. If you don’t already have a project in mind but really want to learn about the big data toolkit, take a quick browse through the exercises. At least a few of them should have you jumping up and down with excitement to learn this stuff.

Who This Book Is Not For

This is not "Hadoop: The Definitive Guide" (that’s been written, and well); this is more like "Hadoop: A Highly Opinionated Guide". The only coverage of how to use the bare Hadoop API is to say "In most cases, don’t". We recommend storing your data in one of several highly space-inefficient formats, and in many other ways we encourage you to willingly trade a small performance hit for a large increase in programmer joy. The book has a relentless emphasis on writing scalable code, but no content on writing performant code beyond the advice that the best path to a 2x speedup is to launch twice as many machines.

That is because for almost everyone, the cost of the cluster is far less than the opportunity cost of the data scientists using it. If you have not just big data but huge data — let’s say somewhere north of 100 terabytes — then you will need to make different tradeoffs for jobs that you expect to run repeatedly in production.

The book does have some content on machine learning with Hadoop, on provisioning and deploying Hadoop, and on a few important settings. But it does not cover advanced algorithms, operations or tuning in any real depth.

Hadoop

In Doug Cutting’s words, Hadoop is the "kernel of the big-data operating system". It’s the dominant batch-processing solution, has both commercial enterprise support and a huge open source community, runs on every platform and cloud, and there are no signs any of that will change in the near term.

The code in this book will run unmodified on your laptop and on an industrial-strength Hadoop cluster. (Of course, you will need to use a reduced data set on the laptop.) You do need a Hadoop installation of some sort — Appendix (TODO: ref) describes your options, including instructions for running Hadoop on a multi-machine cluster in the public cloud — for a few dollars a day you can analyze terabyte-scale datasets.

A Note on Ruby and Wukong

We’ve chosen Ruby for two reasons. First, it’s one of several high-level languages (along with Python, Scala, R and others) that have both excellent Hadoop frameworks and widespread support. More importantly, Ruby is a very readable language — the closest thing to practical pseudocode we know. The code samples provided should map cleanly to those high-level languages, and the approach we recommend is available in any language.

In particular, we’ve chosen the Ruby-language Wukong framework. We’re the principal authors, but it’s open-source and widely used. It’s also the only framework I’m aware of that runs on both Hadoop and Storm+Trident.
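
To give a concrete taste of what that looks like, here is a minimal plain-Ruby sketch of a Hadoop Streaming mapper (a hedged illustration, not the Wukong API itself; the script and file names below are invented for this example). It reads lines from standard input, writes tab-separated key/value pairs to standard output, and leaves everything else to Hadoop.

[source,ruby]
----
#!/usr/bin/env ruby
# Minimal Hadoop Streaming mapper sketch: one record in, zero or more
# (key TAB value) records out. The shuffle groups records by key
# before any reducer sees them.
STDIN.each_line do |line|
  line.downcase.scan(/[a-z']+/).each do |word|
    puts [word, 1].join("\t")
  end
end
----

Because it is just a filter over standard input, you can test a script like this on your laptop with nothing but pipes, for example `cat sample.txt | ruby mapper.rb | sort | ruby reducer.rb`, and then hand the very same script to Hadoop Streaming to run across a cluster.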

Helpful Reading

  • Hadoop: The Definitive Guide by Tom White is a must-have. Don’t try to absorb it whole — the most powerful parts of Hadoop are its simplest parts — but you’ll refer to it often as your applications reach production.

  • Hadoop Operations by Eric Sammer — hopefully you can hand this to someone else, but the person who runs your Hadoop cluster will eventually need this guide to configuring and hardening a large production cluster.

  • Big Data: Principles and Best Practices of Scalable Realtime Data Systems by Nathan Marz

  • …​

What This Book Covers

'Big Data for Chimps' shows you how to solve important hard problems using simple, fun, elegant tools.

Geographic analysis is an important hard problem. To understand a disease outbreak in Europe, you need to see the data from Zurich in the context of Paris, Milan, Frankfurt and Munich; but to understand the situation in Munich requires context from Zurich, Prague and Vienna; and so on. How do you understand the part when you can’t hold the whole world in your hand?

Finding patterns in massive event streams is an important hard problem. Most of the time, there aren’t earthquakes — but the patterns that will let you predict one in advance lie within the data from those quiet periods. How do you compare the trillions of subsequences in billions of events, each to each other, to find the very few that matter? Once you have those patterns, how do you react to them in real-time?

We’ve chosen case studies anyone can understand that generalize to problems like those and the problems you’re looking to solve. Our goal is to equip you with:

  • How to think at scale — a deep understanding of how to break a problem into efficient data transformations, and of how data must flow through the cluster to effect those transformations

  • Detailed example programs applying Hadoop to interesting problems in context

  • Advice and best practices for efficient software development

All of the examples use real data, and describe patterns found in many problem domains:

  • Statistical summaries

  • Identifying patterns and groups in the data

  • Searching, filtering and herding records in bulk

  • Advanced queries against spatial or time-series data sets

The emphasis on simplicity and fun should make this book especially appealing to beginners, but this is not an approach you’ll outgrow. We’ve found it’s the most powerful and valuable approach for creative analytics. One of our maxims is "Robots are cheap, Humans are important": write readable, scalable code now and find out later whether you want a smaller cluster. The code you see is adapted from programs we write at Infochimps to solve enterprise-scale business problems, and these simple high-level transformations (most of the book) plus the occasional Java extension (chapter XXX) meet our needs.

What This Book Does Not Cover

We are not currently planning to cover Hive. The Pig scripts will translate naturally for folks who are already familiar with Hive. There will be a brief section explaining why you might choose Hive over Pig, and why I chose Pig over Hive. If there’s popular pressure I may add a "translation guide".

Other things we do not plan to include:

  • Installing or maintaining Hadoop

  • We will cover how to design an HBase schema, but not how to use HBase as a database

  • Other map-reduce-like platforms (Disco, Spark, etc.), or other frameworks (MrJob, Scalding, Cascading)

  • At a few points we’ll use Mahout, R, D3.js and Unix text utilities (cut, wc, etc.), but only as tools for an immediate purpose. I can’t justify going deep into any of them; there are whole O’Reilly books on each.

Chimpanzee and Elephant

Starting with Chapter 2, you’ll meet the zealous members of the Chimpanzee and Elephant Computing Company. Elephants have prodigious memories and move large, heavy volumes with ease. They’ll give you a physical analogue for using relationships to assemble data into context, and help you understand what’s easy and what’s hard in moving around massive amounts of data. Chimpanzees are clever but can only think about one thing at a time. They’ll show you how to write simple transformations with a single concern and how to analyze petabytes of data with no more than megabytes of working space.

Together, they’ll equip you with a physical metaphor for how to work with data at scale.

What’s Covered in This Book?

///Revise each chapter summary into paragraph form, as you’ve done for Chapter 1. This can stay in the final book. Amy////

Most of the chapters have exercises included. If you’re a beginning user, I highly recommend you work out at least one exercise from each chapter. Deep learning will come less from having the book in front of you as you read it than from having the book next to you while you write code inspired by it. There are sample solutions and result datasets on the book’s website.

Feel free to hop around among chapters; the application chapters don’t have large dependencies on earlier chapters.

Chapter 1. First Exploration: A walkthrough of a problem you’d use Hadoop to solve, showing the workflow and thought process. Hadoop asks you to write code poems that compose what we’ll call transforms (process records independently) and pivots (restructure data).

Chapter 2. Simple Transform: Chimpanzee and Elephant are hired to translate the works of Shakespeare into every language; you’ll take over the task of translating text to Pig Latin. This is an "embarrassingly parallel" problem, so we can learn the mechanics of launching a job and gain a coarse understanding of the HDFS without having to think too hard. - Chimpanzee and Elephant start a business - Pig Latin translation - Your first job: test at the commandline - Run it on the cluster - Input Splits - Why Hadoop I: Simple Parallelism
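
As a rough sketch of the kind of record-by-record transform that chapter builds (plain Ruby written for this overview, not the book’s actual Wukong script), a Pig Latin translator only ever needs to see one line at a time, which is exactly what makes the job embarrassingly parallel:

[source,ruby]
----
#!/usr/bin/env ruby
# Illustrative Pig Latin mapper: every line is translated independently,
# so any number of mappers can each chew through their own input split.
def pig_latin(word)
  initial, rest = word.match(/\A([^aeiou]*)(.*)\z/i).captures
  initial.empty? ? "#{word}way" : "#{rest}#{initial.downcase}ay"
end

STDIN.each_line do |line|
  puts line.split.map { |word| pig_latin(word) }.join(" ")
end
----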

Chapter 3. Transform-Pivot Job: C&E help SantaCorp optimize the Christmas toymaking process, demonstrating the essential problem of data locality (the central challenge of Big Data). We’ll follow along with a job requiring map and reduce, and learn a bit more about Wukong (a Ruby-language framework for Hadoop). - Locality: the central challenge of distributed computing - The Hadoop Haiku
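
To make the map-and-reduce shape concrete before the chapter does it properly, here is an equally hedged plain-Ruby sketch of the reduce side under Hadoop Streaming conventions (again an illustration, not the book’s Wukong code). Records arrive sorted and grouped by key, so the reducer only has to notice when the key changes:

[source,ruby]
----
#!/usr/bin/env ruby
# Illustrative Streaming reducer: input lines arrive grouped and sorted
# by key, so a running total per key is all we need. The expensive part,
# regrouping records by key across machines, happens in the shuffle.
current_key, total = nil, 0
STDIN.each_line do |line|
  key, value = line.chomp.split("\t", 2)
  if key != current_key
    puts [current_key, total].join("\t") unless current_key.nil?
    current_key, total = key, 0
  end
  total += value.to_i
end
puts [current_key, total].join("\t") unless current_key.nil?
----

Paired with a word-splitting mapper like the one shown earlier, this gives the classic word-count job: the mapper is the transform, and the shuffle plus reducer is the pivot.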

Chapter 4. First Exploration: Geographic Flavor pt II

  1. The Hadoop Toolset

    • toolset overview

    • launching and debugging jobs

    • overview of Wukong

    • overview of Pig

  2. Filesystem Mojo

    • dumping, listing, moving and manipulating files on the HDFS and local filesystems

  3. Server Log Processing:

    • Parsing logs and using regular expressions

    • Histograms and time series of pageviews

    • Geolocate visitors based on IP

    • (Ab)Using Hadoop to stress-test your web server

  4. Statistics:

    • Summarizing: Averages, Percentiles, and Normalization

    • Sampling responsibly: it’s harder and more important than you think

    • Statistical aggregates and the danger of large numbers

  5. Time Series

  6. Geographic Data:

    • Spatial join (find all UFO sightings near airports)

  7. cat herding

    • total sort

    • transformations from the commandline (grep, cut, wc, etc)

    • pivots from the commandline (head, sort, etc)

    • commandline workflow tips

    • advanced Hadoop filesystem commands (chmod, setrep, fsck)

  8. Data Munging (Semi-Structured Data): The dirty art of data munging. It’s a sad fact, but too often the bulk of time spent on a data exploration is just getting the data ready. We’ll show you street-fighting tactics that lessen the time and pain. Along the way, we’ll prepare the datasets to be used throughout the book:

    • Wikipedia Articles: Every English-language article (12 million) from Wikipedia.

    • Wikipedia Pageviews: Hour-by-hour counts of pageviews for every Wikipedia article since 2007.

    • US Commercial Airline Flights: every commercial airline flight since 1987

    • Hourly Weather Data: a century of weather reports, with hourly global coverage since the 1950s.

    • "Star Wars Kid" weblogs: large collection of apache webserver logs from a popular internet site (Andy Baio’s waxy.org).

  9. Interlude I: Organizing Data:

    • How to design your data models

    • How to serialize their contents (orig, scratch, prod)

    • How to organize your scripts and your data

  10. Machine Learning without Grad School: We’ll combine the record of every commercial flight since 1987 with the hour-by-hour weather data to predict flight delays, using:

    • Naive Bayes

    • Logistic Regression

    • Random Forest (using Mahout)

    We’ll equip you with a picture of how they work, but won’t go into the math of how or why. We will show you how to choose a method, and how to cheat to win.

  11. Interlude II: Best Practices and Pedantic Points of Style

    • Pedantic Points of Style

    • Best Practices

    • How to Think: there are several design patterns for how to pivot your data, like Message Passing (objects send records to meet together); Set Operations (group, distinct, union, etc); Graph Operations (breadth-first search). Taken as a whole, they’re equivalent; with some experience under your belt it’s worth learning how to fluidly shift among these different models.

    • Why Hadoop

    • robots are cheap, people are important

  12. Hadoop Native Java API

    • don’t

  13. Advanced Pig

    • Specialized joins that can dramatically speed up (or make feasible) your data transformations

    • Basic UDF

    • Why algebraic UDFs are awesome and how to be algebraic

    • Custom Loaders

    • Performance efficiency and tunables

  14. Data Modeling for HBase-style Database

  15. Hadoop Internals

    • What happens when a job is launched

    • A shallow dive into the HDFS

  1. Hadoop Tuning

    • Tuning for the Wise and Lazy

    • Tuning for the Brave and Foolish

    • The USE Method for understanding performance and diagnosing problems

  2. Overview of Datasets and Scripts

    • Datasets

    • Wikipedia (corpus, pagelinks, pageviews, dbpedia, geolocations)

    • Airline Flights

    • UFO Sightings

    • Global Hourly Weather

    • Waxy.org "Star Wars Kid" Weblogs

    • Scripts

  3. Cheatsheets:

    • Regular Expressions

    • Sizes of the Universe

    • Hadoop Tuning & Configuration Variables

  4. Appendix:

Excised:

  1. Text Processing: We’ll show how to combine powerful existing libraries with Hadoop to do effective text handling and Natural Language Processing:

    • Indexing documents

    • Tokenizing documents using Lucene

    • Pointwise Mutual Information

    • K-means Clustering

  2. Graph Processing:

    • Graph Representations

    • Community Extraction: Use the page-to-page links in Wikipedia to identify similar documents

    • Pagerank (centrality): Reconstruct pageview paths from web logs, and use them to identify important pages

How to Contact Us

////Flip feel free to add your contact information (Twitter, email) Amy////

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
(707) 829-0515 (international or local)

To comment or ask technical questions about this book, send email to [email protected]