-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Low cardinality auto indexing #142
Comments
I went for that heuristic as I was optimizing for accelerated aggregations - (e.g. Borough column only has 5 values - Manhattan,Bronx,Brooklyn,Queens,Staten Island...), and for filtering with the DataTables view Searchbuilder, where you have an interactive filter builder and response time is important. Maybe, I can tweak the heuristic so that if the value is negative and less than -1, then the threshold is interpreted as greater than the absolute value? Alternatively, I can add another set of AutoIndex settings that can be specified in the dotenv template to |
It's possible what you want is 5 and up -- that's getting to the point that the planner might choose to use the index, and that's something that an explain analyze would tell you. Right now I'm seeing it create indexes when there's one value text value, which is truly useless. (and in my case, it's a 13mb index + a 38mb text index) |
As a side note -- I'm not sure it's worth specifically creating unique indexes when doing the loads from datapusher, unless you're specifically trying to constrain additional entries that are added. |
Yes. For that example, I would set the setting to 5 in the dotenv template. I'll change the default from 3 to 10 as you suggested as that setting is too low. I'll also add a minimum with a default of 3. |
Yes. That was what I was going for - to prevent erroneous upserts. |
Describe the bug
Auto index generation on cardinality is backwards.
https://github.com/dathere/datapusher-plus/blob/master/datapusher/jobs.py#L1728
Low cardinality indexes on postgresql aren't useful, because the planner will almost never choose them.
The planner will choose a table scan instead of an index scan if it is likely that many/most pages will be read. When you have a low cardinality index, eg a bool, the planner is going to assume that 1/2 of rows will be hit, and therefore essentially all pages. Even if you've got a distribution that's really skewed, the planner will generally not choose that index. (if you do have a known skewed distribution, you can do a partial/multicolumn index where the condition is the rare case, but that's not really applicable for a general purpose tool. )
This should be reversed, and set the default threshold to ~10 at least.
The text was updated successfully, but these errors were encountered: