-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
optimizations from profiling build_hyracks #68
optimizations from profiling build_hyracks #68
Conversation
@@ -148,7 +148,7 @@ public static void writeHDFSBinToHDFSSvg(JobConf conf, String srcDir, String des | |||
|
|||
public static String convertEdgeToGraph(String outputNode, Node value, GRAPH_TYPE graphType) { | |||
String outputEdge = ""; | |||
for (EDGETYPE et : EDGETYPE.values()) { | |||
for (EDGETYPE et : EDGETYPE.values) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain the different between the values()
and value
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
values()
is a function that's automatically generated for all enum
types and which returns a static final array including all the enum
values (so FF, FR, RF and RR in this case). values
as I've added here is shadowing that generated function with a public static final enum[]
. I'm essentially bypassing the getter for this enum type, which was taking up to 5% of the total runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can rename away from .values
so it's not so ambiguous. In fact, it's weird that java lets me have a field called values
when I can also call the function values()
.
About this
I suggest we remove this method, because every time we serialized node, we call this
When the caller want to know the length, just let it to get the length of the returned byte[]. |
About the iterator issues, the TreeSet.iterator is really complicated. By now , we didn't use some range query operations, maybe we can just change it to HashSet ? |
Good point. I had thought that this code was called directly by pregelix/hyracks. Turns out that we control the caller in this case so we can do it. Looking at the API for |
Yeah, that's an easy fix. The only thing we lose is a syntactically clear way to get the (one and) only element in the set. |
Total graph build time with one thread goes from 26.5 to 23 minutes, so about 10% savings. |
this 10% saving is decreased by this branch ? or iterator() ? or |
We still have to call the EdgeMap.getLenghtInBytes() on any copy operation. I could try caching the value and marking it dirty when there's an update. Or maybe we stop caring at this point. |
10% saving just from the changes committed so far. I'm working on the other changes now and will post timings once they're ready. |
OK, I got it. |
@anbangx don't close the pull request, please :D |
😄 |
Looks like 10% savings before was a fluke run. After running several times on the previous code, I see 5% on average. I've now pushed a fix for getLengths() and will do some timings and will switch over to HashSets instead of TreeSets. |
can we write a our own parsing function instead of using regex?, and can On Thu, Nov 21, 2013 at 12:56 PM, Jake Biesinger
|
Yes... but why? You think it will be faster than the regex? Or you mean you want to special-case the |
Okay guys, I think I'm done optimizing/profiling for a while. I'll do some timings on all these changes later but from my profiling, the remaining overhead is hyracks/pregelix code. Timing tables for BUILD:Ordered by Own Time:Ordered by Total Time:Timing tables for MERGE:Ordered by Own Time:Ordered by Total Time: |
Timings are in: so 16% faster. Not huge but a decent speedup :) Any comments? Shall I merge it in? |
Great. LGTM. :+1 Best Regards. Jianfeng Jia On Nov 23, 2013, at 3:54 PM, Jake Biesinger [email protected] wrote:
|
…wbiesing/optimizations-from-profiling-BUILD_HYRACKS Conflicts: genomix/genomix- pregelix/src/main/java/edu/uci/ics/genomix/pregelix/io/VertexValueWritable.java genomix/genomix- pregelix/src/main/java/edu/uci/ics/genomix/pregelix/operator/DeBruijnGraphCleanVertex.java
…iling-BUILD_HYRACKS optimizations from profiling build_hyracks
@anbangx @Nan-Zhang @JavierJia @Elmira88
Some simple changes to our code after looking at the profiles of BUILD_HYRACKS. the
enum.values()
function was apparently still doing a lot of work (or was being called so many times as to make it seem that way). I swapped it with a static final array of the same name.6f40f96 will cause more GC overhead but it's negligible compared to the deepcopy cost. I'm running timings now and will update this when I have them.
A few other items I noticed:
*.getSerializedLength()
! We may want to cache the length or something!