-
Notifications
You must be signed in to change notification settings - Fork 11
Large Static Graphs
Yogesh Simmhan edited this page Aug 13, 2016
·
7 revisions
- Presented by Diptanshu
- V props, E props and topo kept in separate files.
- Partitions can be loaded from disk to memory on the fly, and written back to disk.
- If property changes and not topo, only deltas of property loaded from disk
- Sticky partitions for random set of partitions to be retained in memory. LRU to keep recently used partitions in memory.
- Messages batched and written to disk in vertex ID order. Multiple batches can be in different files.
- Compute method called on vertices in same order. Cursor for each file moves sequentially forward, pointing to current/newer vertex being processed. Ensures single pass thru file.
- Desiderata
- A "unit" of a graph (partition/subgraph) can fit in memory. Others need not, can be persisted to disk.
- State of this unit (topo and property) need to be persisted, loaded on demand. Subset of properties could be used in each supserstep/task phase, allowing only those to be loaded.
- Messages for a unit persisted to disk between supersteps, loaded on-demand.
- Useful for elasticity to offload partitions when not used in non-stationary graph algos.
- Presented by Diptanshu
- Each subgraph fits in mem, but not entire graph in distributed mem
- If subgraph in memory, updated async in memory. Else write to disk. Messages read when loading that subgraph from disk.
- Allows subgraphs to make async progress for k supersteps.
- Inactive vertices need not be loaded into memory for a given superstep after being persisted to disk.
-
Desiderata
- Notion of active and inactive vertices to allow memory to be flushed out. Useful for elasticity?
- TODO
- Additional Reading: CCGRID 2016 paper by Safi for memory offloading [RD]
- Additional Reading: KLA programming model, Giraph unchained (move forward by supersteps/nested ssteps) [RD]
- Additional Reading: Look at asycing/sync+async programming models: GraphLab, GraphX? [DK]
- Presented by Diptanshu
- Edge owned only by single vertex. Undirected means two edges.
- Direct mutation on edges owned by a vertex: Add, remove, convert from directed to undirected. Visible in same superstep.
- Graph mutation requests for vertices/edges not owned by a vertex (add/remove vertex/edge, optionally with properties). Mutations are applied between two ssteps.
- sending message to no-existed vertex can create it.
- Mutation conflicts resolved using vertex change and vertex resolver.
- Desiderata
- Support for v, e, property mutations
- Rebalancing partitions/subgraphs across (static/dynamic) set of machines between supersteps? Handling skews due to mutation?
- TODO:
- What is the specific order in which mutations happen? [DK]
- Additional reading: Rebalancing of partitions (Mirzan) [RD]. Algos that have a lot of mutations (e.g. MSF) [DK]
- Outedges can have custom implementations. Iterable also accessible thru different models, like ArrayList, HashMapEdgeList, ByteArrayList, etc.
-
Desiderata
- Support different DS for V, E, SG
- TODO:
- Additional reading: Other graph DS in shared and distributed memory -- performance, storage footprint [DK].
- Presented by Diptanshu
- Queries over subgraphs
- Extraction of subgraphs: Predicate, k-radius, neighbors, edge/vert attributes
- Vertices and edges in a subgraph can be replicated so that whole subgraph present in a local partition. Only one subgraph owns a replicated vertex/edge, and can modify it.
- Bin packing of multiple overlapping extracted graphs on same partition using min-hash over vertices to reduce vertex replication.
- Desiderata
- Support graph extraction based on query/distance, and distributedly apply user logic on graph(s) extracted.
- Packing overlapping (sub)graphs into few partitions.
- TODO:
- MinHash functions