DB-12351 Remove the pre-scan of IndexPrefixIteratorMode on HBase #5686
base: master
jenkins please test branch @dbaas3.1,skipTestsLongerThan2Minutes
jenkins please test branch @dbaas3.1,skipTestsLongerThan2Minutes
jenkins please test branch @cdh6.3.0,skipTestsLongerThan2Minutes
TEST SUCCEEDED +1
Great fix!
In description, this sounds strange, though:
Since operation tree deserialization sometimes deserializes items from a map based on the target result set of the operation, the null qualifiersField in the IndexPrefixIteratorOperation was picked up as the qualifiers for the main TableScanOperation, effectively removing the qualifiers.
It sounds like a problem we should fix. Is that right?
@ascend1 It seems to be by design. The code assumes that each SpliceOperation has a different target result set number. That's not an unreasonable assumption if each operation is independent and works on a stream of rows. IndexPrefixIteratorOperation is a little different. It is more like a modifier that builds an HBase filter for the underlying TableScan based on the runtime values of the start/stop keys. On the mem platform it does iteratively apply the underlying TableScan, but it's still more like an operation modifier than a true independent operation. Let me see if I can locate the code that serializes based on result set number and open a Jira to investigate a better way to do this. Created DB-12376 for this.
+1
Short Description
Removes a costly pre-scan that finds the first row in the table before an IndexPrefixIteratorMode scan. The pre-scan is hurting the performance of livewire queries.
Long Description
DB-11930 fixed some execution issues in IndexPrefixIteratorMode table scans. One problem it tried to fix was a missing qualifier in the main TableScanOperation. Operation tree deserialization sometimes deserializes items from a map keyed on the target result set number of the operation. Because the IndexPrefixIteratorOperation and the main TableScanOperation share the same target result set number, the null qualifiersField in the IndexPrefixIteratorOperation was picked up as the qualifiers for the TableScanOperation, effectively removing the qualifiers. DB-11930 fixed this by writing the same qualifiersField to the IndexPrefixIteratorOperation as to the TableScanOperation, and then added a flag to skip building the qualifiers during the pre-scan, since that scan should retrieve the very first row. The flag did not work as intended: it was reset to false right after getNonSIScan was called, but it needed to stay set until after buildDataSet to take effect. As a result, the qualifier is also applied during the scan that finds the first row, and in control mode that scan may read the whole table if none of the rows qualify.
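The qualifier loss described above can be illustrated with a minimal sketch. It is not the actual Splice Machine deserialization code: the class `ResultSetKeyedMapSketch`, its methods, and the registration order (table scan first, iterator second) are all assumptions made for illustration. It only shows how two entries that share one result set number overwrite each other in a map keyed on that number.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for deserialization state keyed on the target
// result set number. None of these names come from the real code base.
class ResultSetKeyedMapSketch {
    // Maps a target result set number to the qualifiers field name stored for it.
    private final Map<Integer, String> qualifiersByResultSet = new HashMap<>();

    void register(int resultSetNumber, String qualifiersField) {
        // A later registration with the same key silently replaces the earlier one.
        qualifiersByResultSet.put(resultSetNumber, qualifiersField);
    }

    String lookup(int resultSetNumber) {
        return qualifiersByResultSet.get(resultSetNumber);
    }

    // Returns the qualifiers the table scan would see when both operations
    // share one result set number (assumed registration order: scan, then iterator).
    static String qualifiersSeenByTableScan(String tableScanQualifiers,
                                            String iteratorQualifiers) {
        ResultSetKeyedMapSketch state = new ResultSetKeyedMapSketch();
        int sharedResultSetNumber = 1;
        state.register(sharedResultSetNumber, tableScanQualifiers);
        state.register(sharedResultSetNumber, iteratorQualifiers);
        return state.lookup(sharedResultSetNumber);
    }

    public static void main(String[] args) {
        // A null iterator field wipes out the table scan's qualifiers.
        System.out.println(qualifiersSeenByTableScan("qualifiersField", null));
        // Writing the same field to both operations keeps the qualifiers intact.
        System.out.println(qualifiersSeenByTableScan("qualifiersField", "qualifiersField"));
    }
}
```

The second call mirrors the DB-11930 approach of writing the same qualifiersField to both operations, which makes the collision harmless without removing it.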
Really, we don't need to read the first row to get the DataValueDescriptor of the first index column. A null DVD is already built for us in the template row. So, the fix is to remove the scan for the first row entirely, for HBase platforms. For the mem platform, which still needs to collect the first column values, the scan is still done.
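As a rough sketch of the shape of this fix, and not the code in this PR, the platform check could look like the following. `FirstIndexColumnSketch`, the `Platform` enum, the `Object[]` rows, and the `firstRowScan` supplier are hypothetical placeholders for the real operation, template row, and scan classes.

```java
import java.util.function.Supplier;

// Hypothetical illustration: skip the first-row pre-scan on HBase and
// keep it on mem. All names here are invented for the sketch.
class FirstIndexColumnSketch {
    enum Platform { HBASE, MEM }

    // templateRow[0] stands in for the null DVD of the first index column.
    // firstRowScan stands in for the costly scan that reads the first row.
    static Object firstIndexColumnValue(Platform platform,
                                        Object[] templateRow,
                                        Supplier<Object[]> firstRowScan) {
        if (platform == Platform.HBASE) {
            // The template row already holds a usable placeholder, so no scan runs.
            return templateRow[0];
        }
        // The mem platform still collects the first column value from a real row.
        Object[] firstRow = firstRowScan.get();
        return firstRow == null ? templateRow[0] : firstRow[0];
    }

    public static void main(String[] args) {
        Object[] template = new Object[] { null, null };
        Supplier<Object[]> scan = () -> new Object[] { "firstKey", "x" };
        System.out.println(firstIndexColumnValue(Platform.HBASE, template, scan));
        System.out.println(firstIndexColumnValue(Platform.MEM, template, scan));
    }
}
```

Passing the scan as a `Supplier` keeps it lazy, so on the HBase branch the pre-scan is never started.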
How to test
Run the following on iotdev03. The job should show up in the spark UI right away, and it should take minutes, not hours, to run: