-
Notifications
You must be signed in to change notification settings - Fork 171
Giraph PoC
Bryan Robbins edited this page Feb 4, 2014
·
9 revisions
- We need DG graph computations to be able to consider on the order of 100K paths efficiently
- We can use the Apache Giraph library (https://giraph.apache.org/) to distribute our data set generation phase across a Hadoop cluster.
- There is a sample Giraph problem available in the quickstart here. Get this running on a local CH4 demo VM.
- Adapt the example problem (which is doing a shortest-path computation) to solve the data set generation problem of DG. Use a simple "ABC", 3-variable data generation scenario.
- Write an Evaluator for processing assignment statements from graph edges
- Adapt input and output formats of example to store a List of Maps (i.e., List of data sets) as each node's value, and a list of Strings as each edge's value (i.e., list of assignment expressions).
- Write a PreProcessor for converting some convenient graph format into Giraph's JSON simple format.
- Write a PostProcessor to be run on the HDFS client machine to apply velocity templates to all datasets in the value of the END node in the resulting graph.
- Run the example problem on the local cluster to see if versions are compatible.
- Each node in the graph holds as its value a List (Set?) of datasets.
- Each edge in the graph holds as its value a List of assignment statements, potentially conditional on previously assigned dataset values.
- Graphs must contain an END node which is the last node in every path which should result in an output dataset.
- Nodes pass messages to one another to propagate computation.
- As its computation, each node:
- Receives any messages from its (inbound) neighbors. The value of each neighbor node is a List of datasets. For each dataset d at a neighbor node N :
- For each expression e on each inbound edge from N :
- Evaluate e given d to produce a value v for variable V in the dataset
- Update d given e (e.g., if e is "A := 1", actually set "A = 1" in d)
- Send a message to each outbound neighbor that values have been updated
- For each expression e on each inbound edge from N :
- Receives any messages from its (inbound) neighbors. The value of each neighbor node is a List of datasets. For each dataset d at a neighbor node N :
- Giraph is an HDFS-only solution. We could abstract the data generation step to be a per-node computation and implement a simple message passing solution locally to have the same architecture and code execute for local runs as well (not all DG problems/users need the 100K scale).