Skip to content

Large Static Graphs

Yogesh Simmhan edited this page Aug 13, 2016 · 7 revisions

Large Static Graphs

Presentation Link

Giraph Out of Core

  • Presented by Diptanshu
  • V props, E props and topo kept in separate files.
  • Partitions can be loaded from disk to memory on the fly, and written back to disk.
  • If property changes and not topo, only deltas of property loaded from disk
  • Sticky partitions for random set of partitions to be retained in memory. LRU to keep recently used partitions in memory.
  • Messages batched and written to disk in vertex ID order. Multiple batches can be in different files.
  • Compute method called on vertices in same order. Cursor for each file moves sequentially forward, pointing to current/newer vertex being processed. Ensures single pass thru file.
  • Desiderata
  • A "unit" of a graph (partition/subgraph) can fit in memory. Others need not, can be persisted to disk.
  • State of this unit (topo and property) need to be persisted, loaded on demand. Subset of properties could be used in each supserstep/task phase, allowing only those to be loaded.
  • Messages for a unit persisted to disk between supersteps, loaded on-demand.
  • Useful for elasticity to offload partitions when not used in non-stationary graph algos.

Hybrid Big Graph, IPDPS

  • Presented by Diptanshu
  • Each subgraph fits in mem, but not entire graph in distributed mem
  • If subgraph in memory, updated async in memory. Else write to disk. Messages read when loading that subgraph from disk.
  • Allows subgraphs to make async progress for k supersteps.
  • Inactive vertices need not be loaded into memory for a given superstep after being persisted to disk.
  • Desiderata
    • Notion of active and inactive vertices to allow memory to be flushed out. Useful for elasticity?
  • TODO
  • Additional Reading: CCGRID 2016 paper by Safi for memory offloading [RD]
  • Additional Reading: KLA programming model, Giraph unchained (move forward by supersteps/nested ssteps) [RD]
  • Additional Reading: Look at asycing/sync+async programming models: GraphLab, GraphX? [DK]

Giraph Mutations

  • Presented by Diptanshu
  • Edge owned only by single vertex. Undirected means two edges.
  • Direct mutation on edges owned by a vertex: Add, remove, convert from directed to undirected. Visible in same superstep.
  • Graph mutation requests for vertices/edges not owned by a vertex (add/remove vertex/edge, optionally with properties). Mutations are applied between two ssteps.
  • sending message to no-existed vertex can create it.
  • Mutation conflicts resolved using vertex change and vertex resolver.
  • Desiderata
  • Support for v, e, property mutations
  • Rebalancing partitions/subgraphs across (static/dynamic) set of machines between supersteps? Handling skews due to mutation?
  • TODO:
  • What is the specific order in which mutations happen? [DK]
  • Additional reading: Rebalancing of partitions (Mirzan) [RD]. Algos that have a lot of mutations (e.g. MSF) [DK]

Giraph Data model

  • Outedges can have custom implementations. Iterable also accessible thru different models, like ArrayList, HashMapEdgeList, ByteArrayList, etc.
  • Desiderata
    • Support different DS for V, E, SG
  • TODO:
  • Additional reading: Other graph DS in shared and distributed memory -- performance, storage footprint [DK].

NScale

  • Presented by Diptanshu
  • Queries over subgraphs
  • Extraction of subgraphs: Predicate, k-radius, neighbors, edge/vert attributes
  • Vertices and edges in a subgraph can be replicated so that whole subgraph present in a local partition. Only one subgraph owns a replicated vertex/edge, and can modify it.
  • Bin packing of multiple overlapping extracted graphs on same partition using min-hash over vertices to reduce vertex replication.
  • Desiderata
  • Support graph extraction based on query/distance, and distributedly apply user logic on graph(s) extracted.
  • Packing overlapping (sub)graphs into few partitions.
  • TODO:
  • MinHash functions
Clone this wiki locally