-
Notifications
You must be signed in to change notification settings - Fork 115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make grouping deterministic #225
base: master
Are you sure you want to change the base?
Conversation
than IndexThreads Introduce DocGrouper to do grouping for text and binary based LFD
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @zhaih! It would be SO AWESOME to fix this (hopefully last remaining) non-determinism in benchy indexing so that we could switch to concurrent indexing + rearrange to achieve the deterministic search index.
I left a bunch of small comments, and one big one about maybe just disallowing binary LFD + grouped indexing, since the algo get so scary-hairy.
assert docCounter == 0; | ||
outputQueue.put(END); | ||
} | ||
buffer[docCounter++] = lfd; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm are LineFileDoc
instances reused (it's unsafe to buffer/hold onto more than one at a time in a thread)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No I don't think they're reused?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for persisting here @zhaih! I think it is getting closer!!
* Consuming {@link perf.LineFileDocs.LineFileDoc}, group them and put the grouped docs into | ||
* a thread-safe queue. | ||
*/ | ||
public abstract class DocGrouper { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this final, non-abstract, and rename it to TextDocGrouper
maybe? No need for separate subclass since we have no binary case anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We still need a NoGroupImpl
to deal with the case where we don't need groups, or we can handle that difference in LineFileDocs
which requires several if/else there... I'm ok with either but think this might be (slightly) cleaner?
}; | ||
|
||
public static BytesRef[] group100; | ||
public static BytesRef[] group100K; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could these maybe become non-static now? Init them on construction of TextDocGrouper
class? And fix indexer threads to reference them in this instance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually why they are static previously?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not certain :) I think because there was no obvious singleton instantiated class to store them on (though, IndexThreads
could've been used, hmm)? But this new DocGrouper
is instantiated once, and is all about grouping, so it seems like the right place to put these compute-once group values?
/** | ||
* A simple impl when we do not need grouping | ||
*/ | ||
static final class NoGroupImpl extends DocGrouper { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm can we eliminate this? It seems wasteful way to do nothing? (putting a single doc into a new DocGroup
into the queue for threads to then read? Can we just add if (addGroupingFields == false)
and skip adding grouping fields in index threads?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then we need to retrieve lfd in two different ways, one with group one without, means we need to keep 2 blocking queues in LineFileDocs
and have 2 ways to retrieve them..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I see. Then I think you're right -- let's keep the abstract base class and the no-op subclass?
} | ||
} | ||
|
||
static final class TextBased extends DocGroups { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also elide this class up into super class, and have only a TextDocGroups
?
/** | ||
* A wrapper for singleLFD, when we don't use group fields | ||
*/ | ||
static final class SingleLFD extends DocGroups { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we must also keep this, since we're going with the NoGroupImpl
approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @zhai -- I left comments -- I think we are really close! And this can unblock using rearranger so we can use the (many!) cores in beast3
nightly benchmarking box to concurrently rearrange to a consistent segment geometry, unlocking important additional things to benchmark like more realistic KNN vectors, int[]
and float[]
vectors, etc.
In looking at this with semi-fresh eyes ... I am now wondering whether we should just pre-compute these groups when building the LineFileDocs
file, instead of this hairy logic at indexing time? It'd mean a larger LineFileDocs
file to read/parse, but we'd move all this hair to a one-time tool that adds the groups. Let's not do this now, and keep going with this PR (I think it's truly close), but can you open a follow-on issue to consider the "compute groups when building LineFileDocs" approach? Then the binary
case could also handle groups too...
Thanks!!
}; | ||
|
||
public static BytesRef[] group100; | ||
public static BytesRef[] group100K; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not certain :) I think because there was no obvious singleton instantiated class to store them on (though, IndexThreads
could've been used, hmm)? But this new DocGrouper
is instantiated once, and is all about grouping, so it seems like the right place to put these compute-once group values?
/** | ||
* A simple impl when we do not need grouping | ||
*/ | ||
static final class NoGroupImpl extends DocGrouper { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I see. Then I think you're right -- let's keep the abstract base class and the no-op subclass?
/** | ||
* A wrapper for singleLFD, when we don't use group fields | ||
*/ | ||
static final class SingleLFD extends DocGroups { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we must also keep this, since we're going with the NoGroupImpl
approach?
by allocate groups in LineFileDocs rather than IndexThreads
Introduce DocGrouper to do grouping for text and binary based LFD. Please see javadoc for implementation details.
Test
NOT YET DONE
I'll test it with text based LFD and binary based LFD locally