Bryan Robbins edited this page Feb 4, 2014 · 9 revisions

Problem

  • We need DG graph computations to be able to consider on the order of 100K paths efficiently

Proposed Solution

  • We can use the Apache Giraph library (https://giraph.apache.org/) to distribute our data set generation phase across a Hadoop cluster.

Action Plan

  • There is a sample Giraph problem available in the Giraph quickstart. Get this running on a local CDH4 demo VM.
  • Adapt the example problem (which performs a shortest-path computation) to solve DG's data set generation problem. Use a simple "ABC", three-variable data generation scenario.
    • Write an Evaluator for processing assignment statements from graph edges
    • Adapt the example's input and output formats to store a List of Maps (i.e., a list of datasets) as each node's value, and a List of Strings (i.e., a list of assignment expressions) as each edge's value.
    • Write a PreProcessor for converting some convenient graph format into Giraph's JSON simple format.
    • Write a PostProcessor, run on the HDFS client machine, that applies Velocity templates to all datasets in the value of the END node of the resulting graph.
  • Run the example problem on the local cluster to see if versions are compatible.
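The Evaluator mentioned above could start out very small. The sketch below is a hypothetical, minimal version (the class and method names are illustrative, not existing DG or Giraph code): it handles only assignments of the form "V := literal" or "V := otherVariable", which is enough for the three-variable "ABC" scenario.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical evaluator for edge assignment expressions such as "A := 1" or "B := A". */
public class AssignmentEvaluator {

    // Applies a single assignment expression to a dataset and returns an updated copy,
    // leaving the input dataset unchanged.
    public static Map<String, String> apply(Map<String, String> dataset, String expression) {
        String[] parts = expression.split(":=", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("Not an assignment: " + expression);
        }
        String variable = parts[0].trim();
        String rhs = parts[1].trim();
        Map<String, String> updated = new HashMap<>(dataset);
        // If the right-hand side names an existing variable, copy its value;
        // otherwise treat the right-hand side as a literal.
        updated.put(variable, updated.getOrDefault(rhs, rhs));
        return updated;
    }
}
```

A real Evaluator would also need the conditional expressions noted under Design thoughts; this only fixes the basic assignment semantics.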

Design thoughts

  • Each node in the graph holds as its value a List (Set?) of datasets.
  • Each edge in the graph holds as its value a List of assignment statements, potentially conditional on previously assigned dataset values.
  • Graphs must contain an END node that is the last node of every path that should produce an output dataset.
  • Nodes pass messages to one another to propagate computation.
  • As its computation, each node:
    • Receives any messages from its (inbound) neighbors. The value of each neighbor node is a List of datasets. For each dataset d at a neighbor node N:
      • For each expression e on each inbound edge from N:
        • Evaluate e against d to produce a value v for the assigned variable V
        • Update d with v (e.g., if e is "A := 1", actually set "A = 1" in d)
      • Send a message to each outbound neighbor that values have been updated
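The per-node step above, stripped of Giraph's messaging machinery, can be sketched as a plain function from inbound datasets plus edge expressions to this node's datasets. This is a local simulation only (hypothetical names, same limited "V := literal/variable" expression syntax as above), not the actual Giraph compute() implementation:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Local sketch of one node's computation: apply every assignment expression
 *  on an inbound edge to every dataset received from that edge's source node. */
public class NodeCompute {

    public static List<Map<String, String>> compute(List<Map<String, String>> inboundDatasets,
                                                    List<String> edgeExpressions) {
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> d : inboundDatasets) {
            Map<String, String> updated = new HashMap<>(d);
            for (String e : edgeExpressions) {
                String[] parts = e.split(":=", 2);
                String variable = parts[0].trim();
                String rhs = parts[1].trim();
                // Copy an existing variable's value, or treat the RHS as a literal.
                updated.put(variable, updated.getOrDefault(rhs, rhs));
            }
            result.add(updated);
        }
        return result;
    }
}
```

In a real Giraph vertex, the loop body would live inside compute(), the result would become the vertex's value, and a message would then be sent along each outbound edge.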

Additional thoughts

  • Giraph is an HDFS-only solution. We could abstract the data generation step into a per-node computation and implement a simple local message-passing driver, so that the same architecture and code run for local executions as well (not all DG problems/users need the 100K scale).
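One way to keep the per-node computation framework-agnostic is to put it behind a small interface and drive it either from Giraph or from a trivial local runner. The sketch below is hypothetical (none of these types exist in DG or Giraph today) and, for brevity, only walks a single linear path rather than a full graph with fan-out:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical abstraction so the same per-node code can run under Giraph
 *  or under a simple local driver for small-scale DG problems. */
public class LocalRunner {

    /** One node's computation: datasets in, datasets out. */
    public interface NodeStep {
        List<Map<String, String>> apply(List<Map<String, String>> inbound);
    }

    // Minimal local "message passing": feed each node's output datasets to the
    // next node along a single path. A real driver would fan out over all paths
    // and collect results at the END node.
    public static List<Map<String, String>> run(List<NodeStep> path,
                                                List<Map<String, String>> seed) {
        List<Map<String, String>> current = seed;
        for (NodeStep step : path) {
            current = step.apply(current);
        }
        return current;
    }
}
```

With this split, the Giraph adapter and the local driver would share the NodeStep implementations, which is where all the DG-specific logic lives.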