(The remainder of the intro section is a planning document, please skip)

Intros:

- Introduce the chapter to the reader
- take the strands from the last chapter, and show them braided together
- in this chapter, you’ll learn … OR ok, we’re done looking at that, now let’s xxx
- weave in the locality question
- Tie each chapter to the goals of the book
- perspective, philosophy, what we’ll be working, a bit of repositioning, a bit opinionated, a bit personal

Preface:

- like a "map" of the book
- "First part is about blah, next is about blah, …"
At this point, you have a good familiarity with Hadoop’s batch processing power and the powerful inquiries it unlocks, both beyond and as a counterpart to the traditional database approach. Stream analytics is a third mode of data analysis and, it is becoming clear, one that is just as essential and transformative as massive-scale batch processing has been.
Storm is an open-source framework developed at Twitter that provides scalable stream processing. Trident draws on Storm’s powerful transport mechanism to provide exactly-once processing of records in windowed batches for aggregating and persisting to an external data store.
The central challenge in building a system that reliably performs fallible operations on billions of records is doing so without producing so much bookkeeping that the bookkeeping becomes its own scalable stream processing challenge. Storm handles all the details of reliable transport and efficient routing for you, leaving you with only the business process at hand. (The remarkably elegant way Storm handles that bookkeeping challenge is one of its principal breakthroughs; you’ll learn about it in the later chapter on Storm internals.)
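To make "only the business process at hand" concrete, here is a minimal sketch of a Trident topology, in the spirit of the word-count example from the Storm documentation. The class name, the toy spout contents, and the pre-Apache package names (`storm.trident.*`, `backtype.storm.*`, as used in the Storm releases current at the time of writing) are assumptions for illustration; newer releases move everything under `org.apache.storm`.

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.TridentTopology;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.operation.builtin.Count;
import storm.trident.testing.FixedBatchSpout;
import storm.trident.testing.MemoryMapState;
import storm.trident.tuple.TridentTuple;

public class WordCountSketch {
  // Split each sentence into words: the only "business logic" we have to write.
  public static class Split extends BaseFunction {
    public void execute(TridentTuple tuple, TridentCollector collector) {
      for (String word : tuple.getString(0).split(" ")) {
        collector.emit(new Values(word));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // A toy spout that replays a fixed set of sentences in small batches.
    FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
        new Values("the cow jumped over the moon"),
        new Values("the man went to the store"));
    spout.setCycle(true);

    TridentTopology topology = new TridentTopology();
    topology.newStream("sentences", spout)
        .each(new Fields("sentence"), new Split(), new Fields("word"))
        .groupBy(new Fields("word"))
        // Per-batch aggregation persisted to a (here, in-memory) state,
        // with Trident's exactly-once bookkeeping handled for us.
        .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));

    // Storm handles transport, routing, batching and retries behind the scenes.
    new LocalCluster().submitTopology("wordcount", new Config(), topology.build());
  }
}
```

Everything below the `Split` function is wiring; the reliable delivery, routing and retry logic never appears in your code.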
This takes Storm past the mere processing of records to stream analytics: with some limitations and some advantages, you have the same ability to specify locality and to write arbitrarily powerful general-purpose code to handle every record. A lot of Storm+Trident’s adoption is in application to real-time systems. [1]
But, just as importantly, the framework exhibits a radical tolerance of latency. It’s perfectly reasonable, for every record, to read from a legacy data store, call an internet API, and the like, even when those operations have worst-case latencies of hundreds or thousands of milliseconds. That range of timescales is simply impractical within a batch processing run or a database query. In the later chapter on the Lambda Architecture, you’ll learn how to use stream and batch analytics together for latencies that span from milliseconds to years.
As an example, one of the largest hard drive manufacturers in the world ingests sensor data from its manufacturing line, test data from quality assurance processes, reports from customer support, and post-mortem analyses of returned devices. They have been able to mine the accumulated millisecond-scale sensor data for patterns that predict flaws months and years later. Hadoop produces the “slow,” deep results, uncovering the patterns that predict failure. Storm+Trident produces the fast, relevant results: operational alerts when those anomalies are observed.
Things you should take away from this chapter:
- Understand the type of problems you can solve using stream processing, and apply it to real examples using the best-in-class stream analytics frameworks.
- Acquire the practicalities of authoring, launching and validating a Storm+Trident flow.
- Understand Trident’s operators and how to use them: apply each of `CombinerAggregator`s, `ReducerAggregator`s and `AccumulatingAggregator`s (generic aggregator?); see the aggregator sketch just after this list.
- Persist records or aggregations directly to a backing database, or to Kafka for idempotent downstream storage.
- (probably not going to discuss how to do a streaming join, using either DRPC or a hashmap join)
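As a taste of those aggregator interfaces, here is a sketch of a `CombinerAggregator` that sums a numeric field, mirroring Trident’s built-in `Count`/`Sum`. The class name and the assumption that the summed field arrives as the tuple’s first value are ours; the pre-Apache package names are as above.

```java
import storm.trident.operation.CombinerAggregator;
import storm.trident.tuple.TridentTuple;

// A CombinerAggregator produces a partial value per tuple (init), merges partials
// pairwise (combine), and supplies an identity for empty partitions (zero).
// Because partials combine associatively, Trident can aggregate within each
// partition before shipping anything across the network.
public class SumOfValue implements CombinerAggregator<Long> {
  @Override
  public Long init(TridentTuple tuple) {
    return tuple.getLong(0);   // the partial aggregate contributed by one tuple
  }

  @Override
  public Long combine(Long val1, Long val2) {
    return val1 + val2;        // merge two partial aggregates
  }

  @Override
  public Long zero() {
    return 0L;                 // identity value for an empty batch or partition
  }
}
```

You would attach it with something like `.aggregate(new Fields("value"), new SumOfValue(), new Fields("total"))`, or hand it to `persistentAggregate` as in the word-count sketch above.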
Note: This chapter will only speak of Storm+Trident at a high level and from the outside. We won’t spend any time on how it makes all of this work until (TODO: ref the chapter on Storm+Trident internals).
What should you take away from this chapter:

You should:

- Understand the lifecycle of a Storm tuple, including spout, tuple tree and acking.
- (Optional but not essential) Understand the details of its reliability mechanism and how tuples are acked.
- Understand the lifecycle of partitions within a Trident batch, and thus the context behind partition operations such as Apply or PartitionPersist.
- Understand Trident’s transactional mechanism, in the case of a PartitionPersist.
- Understand how Aggregators, Statemap and the Persistence methods combine to give you exactly-once processing with transactional guarantees. Specifically, what an OpaqueValue record will look like in the database and why (see the sketch just after this list).
- Understand how the master batch coordinator and the spout coordinator for the Kafka spout in particular work together to uniquely and efficiently process all records in a Kafka topic.
- One specific: how Kafka partitions relate to Trident partitions.
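To preview what an OpaqueValue record will look like in the database, here is an illustrative sketch of the record shape and the update rule behind opaque transactional state. It is modeled on the idea behind Trident’s OpaqueValue, but the class, field and method names here (`OpaqueRecord`, `Combiner`, `update`) are ours, written for illustration rather than lifted from the Storm source.

```java
// Sketch of an opaque transactional state record: alongside the current value we
// store the transaction id that produced it and the value that preceded it.
// If a batch is replayed, possibly with different contents (the "opaque" case),
// we can still roll back to `prev` and reapply, so each batch counts exactly once.
public class OpaqueRecord<T> {
  final long currTxid;  // txid of the batch that produced `curr`
  final T prev;         // value as of the batch *before* currTxid
  final T curr;         // current value

  public OpaqueRecord(long currTxid, T prev, T curr) {
    this.currTxid = currTxid;
    this.prev = prev;
    this.curr = curr;
  }

  // Apply one batch's partial aggregate for this key.
  public OpaqueRecord<T> update(long batchTxid, T batchPartial, Combiner<T> combiner) {
    if (batchTxid == currTxid) {
      // Replay of the batch we last saw: recompute from `prev`, discarding its
      // earlier effect, so the batch is never applied twice.
      return new OpaqueRecord<T>(batchTxid, prev, combiner.combine(prev, batchPartial));
    } else {
      // A new batch: what was current becomes the rollback point.
      return new OpaqueRecord<T>(batchTxid, curr, combiner.combine(curr, batchPartial));
    }
  }

  public interface Combiner<T> {
    T combine(T stored, T batchPartial);
  }
}
```

Stored this way, a count for a given key looks in the database like a small document, for example {currTxid: 12, prev: 106, curr: 118}, rather than a bare number; that extra bookkeeping is what lets a replayed batch be applied exactly once even when its contents differ between attempts.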
Lifecycle of a File:

- What happens as the Namenode and Datanode collaborate to create a new file.
- How that file is replicated to, and acknowledged by, other Datanodes.
- What happens when a Datanode goes down or the cluster is rebalanced.
- Briefly, the S3 DFS facade // (TODO: check if HFS?).
- Lifecycle of a job at the client level, including figuring out where all the source data is; figuring out how to split it; sending the code to the JobTracker; then tracking it to completion.
- How the JobTracker and TaskTracker cooperate to run your job, including: the distinction between Job, Task and Attempt; how each TaskTracker obtains its Attempts and dispatches progress and metrics back to the JobTracker; how Attempts are scheduled, including what happens when an Attempt fails, speculative execution, __, Split.
- How the TaskTracker child and the Datanode cooperate to execute an Attempt, including what a child process is, making clear the distinction between TaskTracker and child process.
- Briefly, how the Hadoop Streaming child process works.
- How the mapper and Datanode handle record splitting, and how and when partial records are dispatched.
- The mapper sort buffer and spilling to disk (maybe here or maybe later, the io.sort.record.percent setting).
- Briefly note that data is not sent from mapper to reducer using HDFS, so you should pay attention to where you put the Map-Reduce scratch space and how stupid it is about handling an overflow volume.
- Briefly, that combiners are a thing.
- Briefly, how records are partitioned to reducers and that custom partitioners are a thing (see the sketch just after this list).
- How the Reducer accepts and tracks its mapper outputs.
- Details of the merge/sort (shuffle and sort), including the relevant buffers and flush policies, and why it can skip the last merge phase.
- (NOTE: Secondary sort and so forth will have been described earlier.)
- Delivery of output data to HDFS and commit, whether from mapper or reducer.
- Highlight the fragmentation problem with map-only jobs.
- Where memory is used: in particular, mapper sort buffers, both kinds of reducer merge buffers, and application-internal buffers.
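To preview the "custom partitioners are a thing" item above, here is a minimal sketch of one against Hadoop’s org.apache.hadoop.mapreduce API. The key/value types and the first-letter routing rule are invented for illustration; the point is only that getPartition decides which reducer receives each mapper output record.

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each record to a reducer by the first character of its key rather than
// by the default hash of the whole key, so e.g. all words starting with "s"
// land on the same reducer.
public class FirstLetterPartitioner extends Partitioner<Text, LongWritable> {
  @Override
  public int getPartition(Text key, LongWritable value, int numPartitions) {
    if (key.getLength() == 0) {
      return 0;
    }
    int firstChar = Character.toLowerCase(key.charAt(0)); // Text.charAt returns a code point
    return (firstChar & Integer.MAX_VALUE) % numPartitions;
  }
}
```

You would register it on the job with `job.setPartitionerClass(FirstLetterPartitioner.class)`; the default partitioner simply hashes the whole key modulo the number of reducers.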
- An interesting opportunity arises when the sequence order of records corresponds to one of your horizon keys.
- Explain using the example of weblogs, highlighting strict order and partial order.
- In the frequent case, the sequence order only somewhat corresponds to one of the horizon keys. There are several types of somewhat-ordered streams: block disorder, bounded band disorder, band disorder. When those conditions hold, you can use windows to recover the power you have with ordered streams, often without having to order the stream.
- Unbounded band disorder only allows "convergent truth" aggregators. If you have no idea when, or whether, some additional record from a horizon group might show up, then you can’t treat your aggregation as anything but a best possible guess at the truth.
- However, what the limited disorder does get you is the ability to efficiently cache aggregations from a practically infinite backing data store.
- With bounded band or block disorder, you can perform accumulator-style aggregations.
- How to, with the abstraction of an infinite sorting buffer or an infinite binning buffer, efficiently re-present the stream as one where sequence order and horizon key directly correspond.
- Re-explain the Hadoop Map-Reduce algorithm in this window+horizon model.
- How windows and panes correspond to horizon groups, subgroups and the secondary sort; in particular, explain the CUBE and ROLLUP operations in Pig.
- (somewhere: Describe how to use Trident batches as they currently stand to fake out windows.)
- Understand why the conceptual model is useful; in particular, it illuminates the core similarity between batch and stream analytics and also helps you reason about the architecture of your analysis.
- The basic model: organize context globally, compute locally. DO MORE HERE.
- Horizon of computation, including what we mean by horizon key. DO MORE HERE.
- Volume of justified belief. DO MORE HERE.
- Note that the direct motivation for Big Data technology is to address the situation where the necessary volume for justified belief exceeds the practical horizon of computation.
- Volume of aggregation, including holistic and algebraic aggregates. Describe briefly one or two algebraic aggregates and two holistic aggregates, including median (or something) and Unified-Profile assembly.
- Highlight that, in practice, we often and eagerly trade off truth and accuracy in favor of relevance, timeliness, cost and the other constraints we’ve described. Give a few examples.
- Timescale of acceptable delay. DO MORE HERE.
- Timescale of syndication. DO MORE HERE.
- Horizon of computational risk. DO MORE HERE.
- Horizon of external conversation. DO MORE HERE.
- Understand relativity: horizons of belief, computation, delay, etc.
- How guarantees of bounded disorder or delay, uniform sampling, etc., let you trade off.
- Aggregation types: holistic, algebraic, combinable; accumulate, accretion.
Data is worthless. Actually, it’s worse than worthless. It costs you money to gather, store, manage, replicate and analyze. What you really want is insight — a relevant summary of the essential patterns in that data — produced using relationships to analyze data in context.
Statistical summaries are the purest form of this activity, and they will be used repeatedly in the book to come; now that you’ve seen how Hadoop is used, this is a good place to focus on them.
Some statistical measures let you summarize the whole from summaries of the parts: I can count all the votes in the state by summing the votes from each county, and the votes in each county by summing the votes at each polling station. Those types of aggregations — average/standard deviation, correlation, and so forth — are naturally scalable, but just having billions of objects introduces some practical problems you need to avoid. We’ll also use them to introduce Pig, a high-level language for SQL-like queries on large datasets.
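To make "summarize the whole from summaries of the parts" concrete, here is a small sketch of an algebraic aggregate; the class and method names are ours, invented for illustration. The mean is carried as a (sum, count) pair, so partial summaries from each county, or from each mapper, merge into the exact global answer.

```java
// An algebraic aggregate: each part is summarized by a small fixed-size record,
// and any two summaries merge into the summary of their union. The mean itself
// is not directly combinable, but the (sum, count) pair that yields it is.
public class MeanSummary {
  private double sum = 0.0;
  private long count = 0;

  public void add(double x) {                    // fold one record into this part's summary
    sum += x;
    count += 1;
  }

  public MeanSummary merge(MeanSummary other) {  // combine two parts' summaries
    MeanSummary merged = new MeanSummary();
    merged.sum = this.sum + other.sum;
    merged.count = this.count + other.count;
    return merged;
  }

  public double mean() {
    return count == 0 ? Double.NaN : sum / count;
  }
}
```

A holistic aggregate such as the median admits no such fixed-size summary, which is exactly why the quantities in the next paragraph get costly at scale.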
Other statistical summaries require assembling context that grows with the size of the whole dataset. The amount of intermediate data required to count distinct objects, extract an accurate histogram, or find the median and other quantiles can become costly and cumbersome. That’s especially unfortunate because so much data at large scale has a long-tail rather than a normal (Gaussian) distribution; the median is a far more robust indicator of the "typical" value than the average. (If Bill Gates walks into a bar, everyone in there is a billionaire on average.)
But you don’t always need an exact value; you need actionable insight. There’s a clever pattern for approximating the whole by combining carefully re-mixed summaries of the parts, and we’ll apply it to:
- Holistic vs. algebraic aggregations
- Underflow and the "Law of Huge Numbers"
- Approximate holistic aggregations: median vs. remedian; percentiles; count distinct (HyperLogLog); see the remedian sketch after this list
- Count-min sketch for most frequent elements
- Approximate histograms
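As a taste of the approximate holistic aggregates listed above, here is a toy sketch of the remedian idea: keep a small buffer at each level, and whenever a buffer fills, push its median one level up. The class and parameter names are ours; for simplicity the estimate reports the median of the most-summarized buffer, whereas the published remedian also weights the leftover partial buffers.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy sketch of the remedian: an approximate streaming median that keeps only
// small fixed-size buffers. Each time a buffer of B values fills, its median is
// pushed into the next buffer up, so memory stays roughly B * log_B(N) values.
public class Remedian {
  private final int b;   // buffer size; an odd value keeps per-level medians clean
  private final List<List<Double>> levels = new ArrayList<List<Double>>();

  public Remedian(int bufferSize) {
    this.b = bufferSize;
  }

  public void add(double x) {
    addAtLevel(0, x);
  }

  private void addAtLevel(int level, double x) {
    while (levels.size() <= level) {
      levels.add(new ArrayList<Double>());
    }
    List<Double> buffer = levels.get(level);
    buffer.add(x);
    if (buffer.size() == b) {              // buffer full: summarize and push one level up
      double med = medianOf(buffer);
      buffer.clear();
      addAtLevel(level + 1, med);
    }
  }

  // Report the median of the most-summarized non-empty buffer.
  public double estimate() {
    for (int i = levels.size() - 1; i >= 0; i--) {
      if (!levels.get(i).isEmpty()) {
        return medianOf(levels.get(i));
      }
    }
    throw new IllegalStateException("no data seen yet");
  }

  private static double medianOf(List<Double> xs) {
    List<Double> sorted = new ArrayList<Double>(xs);
    Collections.sort(sorted);
    return sorted.get(sorted.size() / 2);  // upper middle for even sizes; fine for a sketch
  }
}
```

The payoff is the same as with the other approximate aggregates: a bounded, mergeable summary in place of context that grows with the whole dataset.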