Profile patterns #20

Open
2 of 5 tasks
Nan-Zhang opened this issue Oct 21, 2013 · 8 comments

@Nan-Zhang
Collaborator

  • Hadoop graphbuild
  • Hyracks graphbuild
  • pathmerge
  • bubble merge
  • scaffolding

@anbangx to help.

@ghost ghost assigned JavierJia Oct 21, 2013
@JavierJia
Collaborator

Hyracks graphbuilding

Main Modules:

  • Read Parser 11%
  • AggregateKmer init 18% + 20% =38%
  • AggregateKmer aggregate 6%
  • AggregateKmer Output 13%

Methods that should be optimized (percentages are own time):

  • java.util.TreeMap$EntrySet.iterator() 28%
    Callers and their percentages
    • edu.uci.ics.genomix.type.EdgeMap.getLengthInBytes() 64%
    • edu.uci.ics.genomix.type.EdgeMap.write(DataOutput) 20%
    • edu.uci.ics.genomix.type.EdgeMap.unionUpdate(EdgeMap) 13%
    • edu.uci.ics.genomix.type.EdgeMap.setAsCopy(EdgeMap) 3%
  • java.util.TreeSet.iterator() 13%
    Callers and their percentages
    • edu.uci.ics.genomix.type.ReadIdSet.write(DataOutput) 38%
    • java.util.TreeSet.<init>(Collection) 24%
    • edu.uci.ics.genomix.type.ReadHeadSet.write(DataOutput) 20%
    • java.util.TreeSet.addAll(Collection) 19%
  • java.util.TreeSet.addAll(Collection) 7% (this method may be hard to improve unless we stop using TreeSet)
  • java.util.TreeMap.put(Object, Object) 6% (same as above)
  • edu.uci.ics.genomix.type.VKmer.compareTo(BinaryComparable) 6%
  • java.util.TreeSet.<init>(Collection) 6%
  • edu.uci.ics.genomix.hyracks.graph.dataflow.ReadsKeyValueParserFactory$1.setEdgeListForCurAndNext(DIR, Node, DIR, Node, ReadIdSet) 5%
  • java.util.RegularEnumSet.iterator() 5%
  • java.util.EnumSet.allOf(Class) 5%
  • edu.uci.ics.genomix.type.Node.marshalToByteArray() 4%
  • edu.uci.ics.genomix.type.EdgeMap.setAsCopy(EdgeMap) 4%

@anbangx
Collaborator

anbangx commented Oct 28, 2013

TreeMap.iterator shouldn't be called at all(?) in graph-build...

@JavierJia
Collaborator

Path Merge

Major modules

  • edu.uci.ics.pregelix.dataflow.std.IndexNestedLoopRightOuterJoinFunctionUpdateOperatorNodePushable.moveTreeCursor() 29%
  • edu.uci.ics.genomix.pregelix.operator.pathmerge.P4ForPathMergeVertex.compute(Iterator) 15%
  • edu.uci.ics.pregelix.dataflow.util.TupleDeserializer.deserializeRecord(ArrayTupleBuilder, ITupleReference) 10%
    • edu.uci.ics.genomix.type.Node.readFields(DataInput) 9%
  • edu.uci.ics.pregelix.dataflow.std.IndexNestedLoopRightOuterJoinFunctionUpdateOperatorNodePushable.outputMatch(int) 21%
    • edu.uci.ics.genomix.pregelix.operator.pathmerge.P4ForPathMergeVertex.compute(Iterator) 10%
  • edu.uci.ics.pregelix.dataflow.group.ClusteredGroupWriter.nextFrame(ByteBuffer) 16%
    • edu.uci.ics.genomix.pregelix.io.message.PathMergeMessage.readFields(DataInput) 4%
    • edu.uci.ics.genomix.pregelix.io.message.PathMergeMessage.write(DataOutput) 3%
    • edu.uci.ics.genomix.type.Node.readFields(DataInput) 4%
    • edu.uci.ics.genomix.pregelix.io.message.PathMergeMessage.write(DataOutput) 3%
  • edu.uci.ics.pregelix.dataflow.VertexFileScanOperatorDescriptor$1.loadVertices(IHyracksTaskContext, Configuration, int) 5%
    • edu.uci.ics.genomix.pregelix.format.NodeToGenericVertexInputFormat$BinaryDataCleanLoadGraphReader.getCurrentVertex() 2%
    • edu.uci.ics.genomix.pregelix.format.NodeToGenericVertexInputFormat$BinaryDataCleanLoadGraphReader.nextVertex() 2%
  • edu.uci.ics.hyracks.control.nc.Task.pushFrames(IPartitionCollector, IFrameWriter) 19%
    • edu.uci.ics.genomix.pregelix.io.message.PathMergeMessage.readFields(DataInput) 11%
    • edu.uci.ics.genomix.pregelix.io.message.PathMergeMessage.write(DataOutput) 8%

Hot spot processes (total time, sub-process time included)

  • edu.uci.ics.genomix.type.EdgeMap.readFields(DataInput) 39%
    callers are from
    • edu.uci.ics.genomix.pregelix.io.message.PathMergeMessage.readFields(DataInput) 61%
    • edu.uci.ics.genomix.pregelix.io.VertexValueWritable.readFields(DataInput) 34%
    • org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.nextKeyValue() 4%
  • edu.uci.ics.genomix.type.EdgeMap.setAsCopy(EdgeMap) 8%
    callers are from
    • edu.uci.ics.genomix.type.Node.setAsCopy(EdgeMap[], ReadHeadSet, ReadHeadSet, VKmer, float) 50%
    • edu.uci.ics.genomix.type.Node.setEdgeMap(EDGETYPE, EdgeMap) 36%
    • edu.uci.ics.genomix.type.Node.mergeEdges(EDGETYPE, Node) 14%
  • edu.uci.ics.genomix.type.VKmer.compareTo(BinaryComparable) 5%
    callers are from
    • java.util.TreeMap.put(Object, Object) 95%

Hot spot methods (own time, sub-process time excluded)

  • java.util.TreeSet.iterator() 15%
    callers are from
    • edu.uci.ics.genomix.type.ReadIdSet.write(DataOutput) 52%
    • edu.uci.ics.genomix.type.ReadHeadSet.write(DataOutput) 34%
    • java.util.TreeSet.<init>(Collection) 10%
    • edu.uci.ics.genomix.type.Node.mergeStartAndEndReadIDs(EDGETYPE, Node) 2%
  • java.util.TreeMap$EntrySet.iterator() 13%
    callers are from
    • edu.uci.ics.genomix.type.EdgeMap.write(DataOutput) 85%
    • edu.uci.ics.genomix.type.EdgeMap.setAsCopy(EdgeMap) 13%
    • edu.uci.ics.genomix.type.EdgeMap.unionUpdate(EdgeMap) 3%
  • java.util.TreeMap$KeySet.iterator() 12%
    callers are from
    • java.util.TreeSet.iterator() 79%
      • edu.uci.ics.genomix.type.ReadHeadSet.write(DataOutput) 43%
      • edu.uci.ics.genomix.type.ReadIdSet.write(DataOutput) 33%
      • edu.uci.ics.genomix.type.Node.mergeStartAndEndReadIDs(EDGETYPE, Node) 3%
    • edu.uci.ics.genomix.pregelix.operator.DeBruijnGraphCleanVertex.isTandemRepeat(VertexValueWritable) 10%
    • edu.uci.ics.genomix.type.EdgeMap.setAsCopy(EdgeMap) 6%
    • edu.uci.ics.genomix.type.Node.setAsCopy(Node) 3%
  • java.util.TreeMap.put(Object, Object) 11%
    callers are from
    • edu.uci.ics.genomix.type.EdgeMap.readFields(DataInput) 54%
    • edu.uci.ics.genomix.type.ReadIdSet.readFields(DataInput) 36%
    • edu.uci.ics.genomix.type.EdgeMap.setAsCopy(EdgeMap) 7%
  • edu.uci.ics.genomix.type.EdgeMap.readFields(DataInput) 9%
  • org.apache.hadoop.conf.Configuration.get(String) 7%
    callers are from
    • edu.uci.ics.genomix.config.GenomixJobConf.setGlobalStaticConstants(Configuration) 65%
    • edu.uci.ics.genomix.pregelix.operator.pathmerge.P4ForPathMergeVertex.initVertex() 35%

@JavierJia
Collaborator

@anbangx @jakebiesinger please have a look at these numbers.

@jakebiesinger
Contributor

Do pages like this make more sense on the wiki? I want to make inline comments on these numbers. I could edit the entries directly and make the comments inline but that seems silly.

Graphbuild

  • What's happening in AggregateKmer init? That's a huge chunk of build time.
  • In any case, for Hyracks graphbuilding, java.util.TreeMap$EntrySet.iterator() is mostly called in serialization. Easy enough to cache the length and use a dirty bit to see if it needs to be recalculated.
  • Couldn't edu.uci.ics.genomix.type.EdgeMap.unionUpdate(EdgeMap) use addAll or something? Seems a waste to have to iterate here when the types are the same.
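
A minimal sketch of the cached-length idea from the second bullet, not the actual EdgeMap code. The class name `CachedLengthMap`, the String/Integer entry types, and the byte layout (a 4-byte count header, then one byte per key character plus a 4-byte value) are all assumptions made for illustration; `recomputeCount` exists only to make the effect visible.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for EdgeMap: a sorted map that caches its serialized length. */
class CachedLengthMap {
    private final TreeMap<String, Integer> entries = new TreeMap<String, Integer>();
    private int cachedLength = 0;
    private boolean dirty = true; // true whenever entries changed since the last computation
    int recomputeCount = 0;       // demonstration only: how often the map was actually walked

    void put(String key, int value) {
        entries.put(key, value);
        dirty = true; // any mutation invalidates the cached length
    }

    void clear() {
        entries.clear();
        dirty = true;
    }

    int getLengthInBytes() {
        if (dirty) {
            int total = 4; // assumed 4-byte header holding the entry count
            for (Map.Entry<String, Integer> e : entries.entrySet()) {
                // assumed layout: one byte per key character, then a 4-byte value
                total += e.getKey().length() + 4;
            }
            cachedLength = total;
            dirty = false;
            recomputeCount++;
        }
        return cachedLength;
    }
}
```

Calling `getLengthInBytes()` repeatedly between mutations walks the entry set only once. Every mutating method has to set the flag, otherwise the cached value goes stale.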

PathMerge

  • Implementing #10 (Write-space optimization for pregelix messages and genomix-data structures) would make a lot of the overhead of PathMergeMessage.readFields and VertexValueWritable.readFields go away. Also, I'm not sure we really need to use setAsCopy when merging in other nodes' EdgeMaps (easing Node.mergeEdges). Maybe we can get away with more references, especially in our incoming messages: is the value returned by msgIterator.next() really something we have to make a new copy of? Can we get away with copying out only what we need from it if it is a non-reusable reference? @anbangx can you investigate?
  • There really is a lot of work happening in serialization and deserialization. I think we're going to need to change how we do this... but let's hold off a touch longer until our data structures settle down a bit.
  • I'm surprised to see so much time spent on Configuration.get. Can we move those callers so they're only called in the first iteration, @anbangx ?
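
One way the Configuration.get cost in the last bullet could be avoided is to read each setting once per JVM and keep it in a static field. The sketch below is only an illustration under assumptions: a plain `Map<String, String>` stands in for Hadoop's `Configuration`, and the class name `CachedSettings`, the key `example.kmer.length`, and the `lookups` counter are hypothetical.

```java
import java.util.Map;

/** Hypothetical holder for settings that are read from the configuration only once. */
class CachedSettings {
    private static volatile boolean initialized = false;
    private static int kmerLength;
    static int lookups = 0; // demonstration only: how often the configuration was read

    /** Safe to call on every vertex or iteration; only the first call reads the configuration. */
    static void initOnce(Map<String, String> conf) {
        if (initialized) {
            return; // fast path: a single volatile read, no map lookup
        }
        synchronized (CachedSettings.class) {
            if (initialized) {
                return; // another thread finished initialization first
            }
            lookups++;
            kmerLength = Integer.parseInt(conf.get("example.kmer.length")); // hypothetical key
            initialized = true; // volatile write publishes kmerLength to other threads
        }
    }

    static int kmerLength() {
        return kmerLength;
    }
}
```

The cache lives per JVM, so each worker process would still pay for one lookup, and values that change between jobs in a long-running JVM would need an explicit reset.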

@anbangx
Collaborator

anbangx commented Oct 31, 2013

I will investigate the topics you mentioned, but moving the callers so they run only in the first iteration does not seem to work.

@JavierJia
Collaborator

@jakebiesinger
Contributor

Actually, there's a wiki already available that you can use (it's really another git repo). Ours is turned off right now but is easy to add. No need to make it a submodule unless you want to edit it locally rather than using the github interface.
