Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make grouping deterministic #225

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

Make grouping deterministic #225

wants to merge 4 commits into from

Conversation

zhaih
Copy link
Collaborator

@zhaih zhaih commented Aug 3, 2023

by allocate groups in LineFileDocs rather than IndexThreads

Introduce DocGrouper to do grouping for text and binary based LFD. Please see javadoc for implementation details.

Test

NOT YET DONE
I'll test it with text based LFD and binary based LFD locally

zhaih added 2 commits August 2, 2023 22:33
than IndexThreads

Introduce DocGrouper to do grouping for text and binary based LFD
Copy link
Owner

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zhaih! It would be SO AWESOME to fix this (hopefully last remaining) non-determinism in benchy indexing so that we could switch to concurrent indexing + rearrange to achieve the deterministic search index.

I left a bunch of small comments, and one big one about maybe just disallowing binary LFD + grouped indexing, since the algo get so scary-hairy.

src/main/perf/LineFileDocs.java Outdated Show resolved Hide resolved
src/main/perf/DocGrouper.java Show resolved Hide resolved
src/main/perf/DocGrouper.java Show resolved Hide resolved
src/main/perf/DocGrouper.java Outdated Show resolved Hide resolved
src/main/perf/DocGrouper.java Outdated Show resolved Hide resolved
src/main/perf/DocGrouper.java Show resolved Hide resolved
src/main/perf/DocGrouper.java Outdated Show resolved Hide resolved
src/main/perf/DocGrouper.java Outdated Show resolved Hide resolved
assert docCounter == 0;
outputQueue.put(END);
}
buffer[docCounter++] = lfd;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm are LineFileDoc instances reused (it's unsafe to buffer/hold onto more than one at a time in a thread)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I don't think they're reused?

src/main/perf/DocGrouper.java Outdated Show resolved Hide resolved
Copy link
Owner

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for persisting here @zhaih! I think it is getting closer!!

src/main/perf/DocGrouper.java Show resolved Hide resolved
* Consuming {@link perf.LineFileDocs.LineFileDoc}, group them and put the grouped docs into
* a thread-safe queue.
*/
public abstract class DocGrouper {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this final, non-abstract, and rename it to TextDocGrouper maybe? No need for separate subclass since we have no binary case anymore?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still need a NoGroupImpl to deal with the case where we don't need groups, or we can handle that difference in LineFileDocs which requires several if/else there... I'm ok with either but think this might be (slightly) cleaner?

};

public static BytesRef[] group100;
public static BytesRef[] group100K;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could these maybe become non-static now? Init them on construction of TextDocGrouper class? And fix indexer threads to reference them in this instance?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually why they are static previously?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not certain :) I think because there was no obvious singleton instantiated class to store them on (though, IndexThreads could've been used, hmm)? But this new DocGrouper is instantiated once, and is all about grouping, so it seems like the right place to put these compute-once group values?

/**
* A simple impl when we do not need grouping
*/
static final class NoGroupImpl extends DocGrouper {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm can we eliminate this? It seems wasteful way to do nothing? (putting a single doc into a new DocGroup into the queue for threads to then read? Can we just add if (addGroupingFields == false) and skip adding grouping fields in index threads?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we need to retrieve lfd in two different ways, one with group one without, means we need to keep 2 blocking queues in LineFileDocs and have 2 ways to retrieve them..

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I see. Then I think you're right -- let's keep the abstract base class and the no-op subclass?

}
}

static final class TextBased extends DocGroups {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also elide this class up into super class, and have only a TextDocGroups?

/**
* A wrapper for singleLFD, when we don't use group fields
*/
static final class SingleLFD extends DocGroups {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we must also keep this, since we're going with the NoGroupImpl approach?

Copy link
Owner

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zhai -- I left comments -- I think we are really close! And this can unblock using rearranger so we can use the (many!) cores in beast3 nightly benchmarking box to concurrently rearrange to a consistent segment geometry, unlocking important additional things to benchmark like more realistic KNN vectors, int[] and float[] vectors, etc.

In looking at this with semi-fresh eyes ... I am now wondering whether we should just pre-compute these groups when building the LineFileDocs file, instead of this hairy logic at indexing time? It'd mean a larger LineFileDocs file to read/parse, but we'd move all this hair to a one-time tool that adds the groups. Let's not do this now, and keep going with this PR (I think it's truly close), but can you open a follow-on issue to consider the "compute groups when building LineFileDocs" approach? Then the binary case could also handle groups too...

Thanks!!

};

public static BytesRef[] group100;
public static BytesRef[] group100K;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not certain :) I think because there was no obvious singleton instantiated class to store them on (though, IndexThreads could've been used, hmm)? But this new DocGrouper is instantiated once, and is all about grouping, so it seems like the right place to put these compute-once group values?

/**
* A simple impl when we do not need grouping
*/
static final class NoGroupImpl extends DocGrouper {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I see. Then I think you're right -- let's keep the abstract base class and the no-op subclass?

/**
* A wrapper for singleLFD, when we don't use group fields
*/
static final class SingleLFD extends DocGroups {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we must also keep this, since we're going with the NoGroupImpl approach?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants