Bryan Robbins edited this page Feb 4, 2014 · 9 revisions

Problem

  • We need DG graph computations to be able to consider on the order of 100K paths efficiently

Proposed Solution

  • We can use the Apache Giraph library (https://giraph.apache.org/) to distribute our data set generation phase across a Hadoop cluster.

Action Plan

  • There is a sample Giraph problem available in the Giraph quickstart. Get this running on a local CDH4 demo VM.
  • Adapt the example problem (which performs a shortest-path computation) to solve DG's data set generation problem. Use a simple "ABC", three-variable data generation scenario.
    • Write an Evaluator for processing assignment statements from graph edges
    • Adapt the example's input and output formats to store a List of Maps (i.e., a list of datasets) as each node's value, and a List of Strings (i.e., a list of assignment expressions) as each edge's value.
    • Write a PreProcessor for converting some convenient graph format into Giraph's JSON simple format.
    • Write a PostProcessor, run on the HDFS client machine, that applies Velocity templates to all datasets in the value of the END node of the resulting graph.
  • Run the example problem on the local cluster to see if versions are compatible.
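The Evaluator mentioned above could start out very small. The sketch below is a hypothetical, minimal version (the class and method names are illustrative, not existing DG or Giraph code): it handles only assignments of the form "V := literal" or "V := otherVariable", which is enough for the three-variable "ABC" scenario.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical evaluator for edge assignment expressions such as "A := 1" or "B := A". */
public class AssignmentEvaluator {

    // Applies a single assignment expression to a dataset and returns an updated copy,
    // leaving the input dataset unchanged.
    public static Map<String, String> apply(Map<String, String> dataset, String expression) {
        String[] parts = expression.split(":=", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("Not an assignment: " + expression);
        }
        String variable = parts[0].trim();
        String rhs = parts[1].trim();
        Map<String, String> updated = new HashMap<>(dataset);
        // If the right-hand side names an existing variable, copy its value;
        // otherwise treat the right-hand side as a literal.
        updated.put(variable, updated.getOrDefault(rhs, rhs));
        return updated;
    }
}
```

A real Evaluator would also need the conditional expressions noted under Design thoughts; this only fixes the basic assignment semantics.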

Design thoughts

  • Each node in the graph holds as its value a List (Set?) of datasets.
  • Each edge in the graph holds as its value a List of assignment statements, potentially conditional on previously assigned dataset values.
  • Graphs must contain an END node that is the last node of every path that should produce an output dataset.
  • Nodes pass messages to one another to propagate computation.
  • As its computation, each node:
    • Receives any messages from its (inbound) neighbors. The value of each neighbor node is a List of datasets. For each dataset d at a neighbor node N:
      • For each expression e on each inbound edge from N:
        • Evaluate e against d to produce a value v for the assigned variable V
        • Update d with v (e.g., if e is "A := 1", actually set "A = 1" in d)
      • Send a message to each outbound neighbor that values have been updated
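The per-node step above, stripped of Giraph's messaging machinery, can be sketched as a plain function from inbound datasets plus edge expressions to this node's datasets. This is a local simulation only (hypothetical names, same limited "V := literal/variable" expression syntax as above), not the actual Giraph compute() implementation:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Local sketch of one node's computation: apply every assignment expression
 *  on an inbound edge to every dataset received from that edge's source node. */
public class NodeCompute {

    public static List<Map<String, String>> compute(List<Map<String, String>> inboundDatasets,
                                                    List<String> edgeExpressions) {
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> d : inboundDatasets) {
            Map<String, String> updated = new HashMap<>(d);
            for (String e : edgeExpressions) {
                String[] parts = e.split(":=", 2);
                String variable = parts[0].trim();
                String rhs = parts[1].trim();
                // Copy an existing variable's value, or treat the RHS as a literal.
                updated.put(variable, updated.getOrDefault(rhs, rhs));
            }
            result.add(updated);
        }
        return result;
    }
}
```

In a real Giraph vertex, the loop body would live inside compute(), the result would become the vertex's value, and a message would then be sent along each outbound edge.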

Additional thoughts

  • Giraph is an HDFS-only solution. We could abstract the data generation step into a per-node computation and implement a simple local message-passing driver, so that the same architecture and code run for local executions as well (not all DG problems/users need the 100K scale).
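One way to keep the per-node computation framework-agnostic is to put it behind a small interface and drive it either from Giraph or from a trivial local runner. The sketch below is hypothetical (none of these types exist in DG or Giraph today) and, for brevity, only walks a single linear path rather than a full graph with fan-out:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical abstraction so the same per-node code can run under Giraph
 *  or under a simple local driver for small-scale DG problems. */
public class LocalRunner {

    /** One node's computation: datasets in, datasets out. */
    public interface NodeStep {
        List<Map<String, String>> apply(List<Map<String, String>> inbound);
    }

    // Minimal local "message passing": feed each node's output datasets to the
    // next node along a single path. A real driver would fan out over all paths
    // and collect results at the END node.
    public static List<Map<String, String>> run(List<NodeStep> path,
                                                List<Map<String, String>> seed) {
        List<Map<String, String>> current = seed;
        for (NodeStep step : path) {
            current = step.apply(current);
        }
        return current;
    }
}
```

With this split, the Giraph adapter and the local driver would share the NodeStep implementations, which is where all the DG-specific logic lives.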