Make grouping deterministic #225

zhaih · 2023-08-03T05:38:30Z

by allocating groups in LineFileDocs rather than IndexThreads

Introduce DocGrouper to do grouping for text-based and binary-based LFD. Please see the javadoc for implementation details.
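For readers skimming the thread, here is a minimal sketch of why grouping in the single reader makes the result reproducible: groups are cut in input order using a fixed-seed `Random`, so thread scheduling cannot influence which docs land together. The class name, the `group` method, and the sizing rule are all hypothetical; the actual rule lives in the DocGrouper javadoc, not here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class DeterministicGroupingSketch {

    /**
     * Splits docs into consecutive groups. Group sizes come from a Random with a
     * fixed seed, so the same input always yields the same groups. The sizing
     * rule is an assumption for illustration, not the rule used in this PR.
     */
    public static List<List<String>> group(List<String> docs, long seed, int maxGroupSize) {
        Random random = new Random(seed);
        List<List<String>> groups = new ArrayList<>();
        int start = 0;
        while (start < docs.size()) {
            // Pick a size in [1, maxGroupSize], capped by the docs that remain.
            int size = Math.min(1 + random.nextInt(maxGroupSize), docs.size() - start);
            groups.add(new ArrayList<>(docs.subList(start, start + size)));
            start += size;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("d0", "d1", "d2", "d3", "d4", "d5", "d6");
        List<List<String>> first = group(docs, 17L, 3);
        List<List<String>> second = group(docs, 17L, 3);
        System.out.println(first);
        System.out.println("same grouping on rerun: " + first.equals(second));
    }
}
```

Because the grouping is decided before any indexing thread sees a doc, the threads only consume finished groups and never race to assign group ids.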

Test

NOT YET DONE
I'll test it locally with both text-based and binary-based LFD.


mikemccand

Thanks @zhaih! It would be SO AWESOME to fix this (hopefully last remaining) non-determinism in benchy indexing so that we could switch to concurrent indexing + rearrange to achieve the deterministic search index.

I left a bunch of small comments, and one big one about maybe just disallowing binary LFD + grouped indexing, since the algo gets so scary-hairy.

src/main/perf/LineFileDocs.java

src/main/perf/DocGrouper.java

mikemccand · 2023-08-03T11:35:03Z

src/main/perf/DocGrouper.java

+                assert docCounter == 0;
+                outputQueue.put(END);
+            }
+            buffer[docCounter++] = lfd;


Hmm, are LineFileDoc instances reused? If so, it's unsafe to buffer/hold onto more than one at a time in a thread.

No I don't think they're reused?
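To make the concern concrete, the sketch below shows what would go wrong if instances were reused: buffering the same mutable object several times leaves every buffer slot pointing at the last value written. `Doc` is a hypothetical stand-in, not the real `LineFileDoc`.

```java
import java.util.ArrayList;
import java.util.List;

public class BufferReuseSketch {

    /** Hypothetical mutable doc holder; not the real LineFileDoc. */
    static final class Doc {
        String title;
    }

    /** Buffers the SAME instance repeatedly: every slot ends up with the last title. */
    public static List<String> bufferReusedInstance(List<String> titles) {
        Doc shared = new Doc();
        List<Doc> buffer = new ArrayList<>();
        for (String title : titles) {
            shared.title = title; // overwrites what the earlier slots point at
            buffer.add(shared);
        }
        return titlesOf(buffer);
    }

    /** Buffers a fresh instance per doc: each slot keeps its own title. */
    public static List<String> bufferFreshInstances(List<String> titles) {
        List<Doc> buffer = new ArrayList<>();
        for (String title : titles) {
            Doc doc = new Doc();
            doc.title = title;
            buffer.add(doc);
        }
        return titlesOf(buffer);
    }

    private static List<String> titlesOf(List<Doc> buffer) {
        List<String> out = new ArrayList<>();
        for (Doc doc : buffer) {
            out.add(doc.title);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> titles = List.of("a", "b", "c");
        System.out.println(bufferReusedInstance(titles)); // [c, c, c]
        System.out.println(bufferFreshInstances(titles)); // [a, b, c]
    }
}
```

Since the reply says the instances are not reused, the buffering in the hunk above corresponds to the second method.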

src/main/perf/DocGrouper.java

mikemccand

Thanks for persisting here @zhaih! I think it is getting closer!!

src/main/perf/DocGrouper.java

mikemccand · 2023-09-12T11:19:51Z

src/main/perf/DocGrouper.java

+ * Consuming {@link perf.LineFileDocs.LineFileDoc}, group them and put the grouped docs into
+ * a thread-safe queue.
+ */
+public abstract class DocGrouper {


Can we make this final, non-abstract, and rename it to TextDocGrouper maybe? No need for separate subclass since we have no binary case anymore?

We still need a NoGroupImpl to deal with the case where we don't need groups, or we can handle that difference in LineFileDocs, which would require several if/else branches there... I'm OK with either, but I think this might be (slightly) cleaner?

mikemccand · 2023-09-12T11:21:31Z

src/main/perf/DocGrouper.java

+    };
+
+    public static BytesRef[] group100;
+    public static BytesRef[] group100K;


Could these maybe become non-static now? Init them on construction of TextDocGrouper class? And fix indexer threads to reference them in this instance?

Actually, why were they static previously?

Not certain :) I think because there was no obvious singleton instantiated class to store them on (though, IndexThreads could've been used, hmm)? But this new DocGrouper is instantiated once, and is all about grouping, so it seems like the right place to put these compute-once group values?
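A rough sketch of the instance-field variant, for illustration only. The real fields are `BytesRef[]`; plain `String[]` is used here so the example has no Lucene dependency, and the class name and placeholder values are hypothetical.

```java
public class GroupValuesSketch {

    // Computed once per instance instead of living in static fields.
    public final String[] group100;
    public final String[] group100K;

    public GroupValuesSketch() {
        group100 = makeValues(100);
        group100K = makeValues(100_000);
    }

    /** Placeholder values; how the real group values are generated is not shown in this thread. */
    private static String[] makeValues(int count) {
        String[] values = new String[count];
        for (int i = 0; i < count; i++) {
            values[i] = "group" + i;
        }
        return values;
    }

    public static void main(String[] args) {
        GroupValuesSketch grouper = new GroupValuesSketch();
        // An indexing thread would be handed this instance and read grouper.group100.
        System.out.println(grouper.group100.length + " / " + grouper.group100K.length);
    }
}
```

With the values held by the one grouper instance, index threads would take a reference to that instance rather than reading mutable static state.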

mikemccand · 2023-09-12T11:23:10Z

src/main/perf/DocGrouper.java

+    /**
+     * A simple impl when we do not need grouping
+     */
+    static final class NoGroupImpl extends DocGrouper {


Hmm, can we eliminate this? It seems like a wasteful way to do nothing (putting a single doc into a new DocGroup into the queue for threads to then read). Can we just add if (addGroupingFields == false) and skip adding grouping fields in index threads?

Then we need to retrieve LFDs in two different ways, one with groups and one without, which means we need to keep 2 blocking queues in LineFileDocs and have 2 ways to retrieve them.

OK, I see. Then I think you're right -- let's keep the abstract base class and the no-op subclass?
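The shape agreed on here can be sketched roughly as below: one abstract base class that owns the output queue, a no-op subclass that wraps each doc in a batch of one, and a grouping subclass. Consumers then read batches the same way in both modes. All names (`Grouper`, `NoGroup`, `FixedSizeGroup`, `run`) and the fixed-size rule are hypothetical, and an unbounded `ConcurrentLinkedQueue` stands in for whatever queue the PR actually uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class GrouperSketch {

    /** Base class: every implementation publishes batches to the same queue. */
    public abstract static class Grouper {
        // Thread-safe and unbounded here; the real code may use a bounded blocking queue.
        protected final Queue<List<String>> out = new ConcurrentLinkedQueue<>();

        public abstract void add(String doc);

        /** Flushes any partially filled batch; nothing to do by default. */
        public void finish() {}

        /** Returns the next batch, or null if none is queued. */
        public List<String> next() {
            return out.poll();
        }
    }

    /** No grouping: each doc becomes a batch of one. */
    public static final class NoGroup extends Grouper {
        @Override
        public void add(String doc) {
            out.add(List.of(doc));
        }
    }

    /** Groups docs into fixed-size batches (an assumed rule, for illustration). */
    public static final class FixedSizeGroup extends Grouper {
        private final int size;
        private final List<String> buffer = new ArrayList<>();

        public FixedSizeGroup(int size) {
            this.size = size;
        }

        @Override
        public void add(String doc) {
            buffer.add(doc);
            if (buffer.size() == size) {
                flush();
            }
        }

        @Override
        public void finish() {
            if (!buffer.isEmpty()) {
                flush();
            }
        }

        private void flush() {
            out.add(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    /** Feeds all docs through the grouper, then drains the queue the way a consumer would. */
    public static List<List<String>> run(Grouper grouper, List<String> docs) {
        for (String doc : docs) {
            grouper.add(doc);
        }
        grouper.finish();
        List<List<String>> batches = new ArrayList<>();
        for (List<String> b = grouper.next(); b != null; b = grouper.next()) {
            batches.add(b);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("a", "b", "c");
        System.out.println(run(new NoGroup(), docs));         // [[a], [b], [c]]
        System.out.println(run(new FixedSizeGroup(2), docs)); // [[a, b], [c]]
    }
}
```

The cost of the no-op path is one small wrapper per doc; what it buys is a single queue and a single retrieval path in the consumer.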

mikemccand · 2023-09-12T11:23:51Z

src/main/perf/DocGrouper.java

+            }
+        }
+
+        static final class TextBased extends DocGroups {


Also fold this class up into the superclass, and have only a TextDocGroups?

mikemccand · 2023-09-12T11:24:00Z

src/main/perf/DocGrouper.java

+        /**
+         * A wrapper for singleLFD, when we don't use group fields
+         */
+        static final class SingleLFD extends DocGroups {


Remove this?

Looks like we must also keep this, since we're going with the NoGroupImpl approach?

mikemccand

Thanks @zhaih -- I left comments -- I think we are really close! And this can unblock using rearranger so we can use the (many!) cores in beast3 nightly benchmarking box to concurrently rearrange to a consistent segment geometry, unlocking important additional things to benchmark like more realistic KNN vectors, int[] and float[] vectors, etc.

In looking at this with semi-fresh eyes ... I am now wondering whether we should just pre-compute these groups when building the LineFileDocs file, instead of this hairy logic at indexing time? It'd mean a larger LineFileDocs file to read/parse, but we'd move all this hair to a one-time tool that adds the groups. Let's not do this now, and keep going with this PR (I think it's truly close), but can you open a follow-on issue to consider the "compute groups when building LineFileDocs" approach? Then the binary case could also handle groups too...
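As a starting point for that follow-on issue, the one-time tool could look roughly like the sketch below: it appends a group column to each line so that indexing only has to parse it. This assumes one doc per tab-separated line and fixed-size groups; the class name, column layout, and group naming are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class PrecomputeGroupsSketch {

    /**
     * Appends a tab-separated group column to every line. Group ids depend only
     * on the line's position, so the output is the same on every run.
     */
    public static List<String> addGroupColumn(List<String> lines, int docsPerGroup) {
        List<String> out = new ArrayList<>(lines.size());
        for (int i = 0; i < lines.size(); i++) {
            int groupId = i / docsPerGroup; // consecutive docs share a group
            out.add(lines.get(i) + "\t" + "group" + groupId);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("title0\tbody0", "title1\tbody1", "title2\tbody2");
        for (String line : addGroupColumn(lines, 2)) {
            System.out.println(line);
        }
    }
}
```

The trade-off is the one noted above: a larger file to read and parse in exchange for no grouping logic at indexing time.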

Thanks!!

mikemccand · 2023-10-27T16:19:03Z

src/main/perf/DocGrouper.java

+    };
+
+    public static BytesRef[] group100;
+    public static BytesRef[] group100K;


Not certain :) I think because there was no obvious singleton instantiated class to store them on (though, IndexThreads could've been used, hmm)? But this new DocGrouper is instantiated once, and is all about grouping, so it seems like the right place to put these compute-once group values?

mikemccand · 2023-10-28T09:53:31Z

src/main/perf/DocGrouper.java

+    /**
+     * A simple impl when we do not need grouping
+     */
+    static final class NoGroupImpl extends DocGrouper {


OK, I see. Then I think you're right -- let's keep the abstract base class and the no-op subclass?

mikemccand · 2023-10-28T09:54:37Z

src/main/perf/DocGrouper.java

+        /**
+         * A wrapper for singleLFD, when we don't use group fields
+         */
+        static final class SingleLFD extends DocGroups {


Looks like we must also keep this, since we're going with the NoGroupImpl approach?

zhaih added 2 commits August 2, 2023 22:33

Make grouping deterministic by allocate groups in LineFileDocs rather

04e6a38

than IndexThreads Introduce DocGrouper to do grouping for text and binary based LFD

Add to compile file list

e2e3735

mikemccand reviewed Aug 3, 2023


zhaih added 2 commits September 4, 2023 22:29

Address easy comments

b8ecd82

Refactor, disable binaryLFD with grouping

41d76c2

mikemccand reviewed Sep 12, 2023


mikemccand approved these changes Oct 28, 2023

