Low cardinality auto indexing #142

EricSoroos · 2024-06-10T18:07:57Z

Describe the bug

Auto index generation on cardinality is backwards.

https://github.com/dathere/datapusher-plus/blob/master/datapusher/jobs.py#L1728

    # if a column's cardinality <= AUTO_INDEX_THRESHOLD, create an index for that column

Low cardinality indexes on postgresql aren't useful, because the planner will almost never choose them.

The planner will choose a table scan instead of an index scan if it is likely that many/most pages will be read. When you have a low cardinality index, eg a bool, the planner is going to assume that 1/2 of rows will be hit, and therefore essentially all pages. Even if you've got a distribution that's really skewed, the planner will generally not choose that index. (if you do have a known skewed distribution, you can do a partial/multicolumn index where the condition is the rare case, but that's not really applicable for a general purpose tool. )

This should be reversed, and set the default threshold to ~10 at least.

The text was updated successfully, but these errors were encountered:

jqnatividad · 2024-06-11T12:47:12Z

I went for that heuristic as I was optimizing for accelerated aggregations - (e.g. Borough column only has 5 values - Manhattan,Bronx,Brooklyn,Queens,Staten Island...), and for filtering with the DataTables view Searchbuilder, where you have an interactive filter builder and response time is important.

Maybe, I can tweak the heuristic so that if the value is negative and less than -1, then the threshold is interpreted as greater than the absolute value?

Alternatively, I can add another set of AutoIndex settings that can be specified in the dotenv template to eval() an expression that DP+ maintainers can set to specify their own Autoindex heuristics?

EricSoroos · 2024-06-11T13:02:21Z

It's possible what you want is 5 and up -- that's getting to the point that the planner might choose to use the index, and that's something that an explain analyze would tell you. Right now I'm seeing it create indexes when there's one value text value, which is truly useless. (and in my case, it's a 13mb index + a 38mb text index)

EricSoroos · 2024-06-11T13:03:40Z

As a side note -- I'm not sure it's worth specifically creating unique indexes when doing the loads from datapusher, unless you're specifically trying to constrain additional entries that are added.

jqnatividad · 2024-06-11T13:12:56Z

It's possible what you want is 5 and up -- that's getting to the point that the planner might choose to use the index, and that's something that an explain analyze would tell you. Right now I'm seeing it create indexes when there's one value text value, which is truly useless. (and in my case, it's a 13mb index + a 38mb text index)

Yes. For that example, I would set the setting to 5 in the dotenv template. I'll change the default from 3 to 10 as you suggested as that setting is too low. I'll also add a minimum with a default of 3.

jqnatividad · 2024-06-11T13:14:30Z

As a side note -- I'm not sure it's worth specifically creating unique indexes when doing the loads from datapusher, unless you're specifically trying to constrain additional entries that are added.

Yes. That was what I was going for - to prevent erroneous upserts.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Low cardinality auto indexing #142

Low cardinality auto indexing #142

EricSoroos commented Jun 10, 2024

jqnatividad commented Jun 11, 2024

EricSoroos commented Jun 11, 2024

EricSoroos commented Jun 11, 2024

jqnatividad commented Jun 11, 2024

jqnatividad commented Jun 11, 2024

Low cardinality auto indexing #142

Low cardinality auto indexing #142

Comments

EricSoroos commented Jun 10, 2024

jqnatividad commented Jun 11, 2024

EricSoroos commented Jun 11, 2024

EricSoroos commented Jun 11, 2024

jqnatividad commented Jun 11, 2024

jqnatividad commented Jun 11, 2024