Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use array datatype for tags storage and searching #70

Open
wants to merge 2 commits into
base: dev
Choose a base branch
from

Conversation

CumpsD
Copy link
Member

@CumpsD CumpsD commented Mar 2, 2020

Apparently tags is a semicolon-separated list with a LIKE %% query against it, which is not that performant when you have millions of events.

PostgreSQL can turn tags into an array type and efficiently index it with a GIN index.

This PR turns the tags column in an array type and changes how tags are inserted and searched for.

@Aaronontheweb
Copy link
Member

Looks like we need to update the build system here

@CumpsD
Copy link
Member Author

CumpsD commented Mar 2, 2020

Where is it defined? :p

@Arkatufus
Copy link
Contributor

@CumpsD Sorry, can you resolve the merge conflict?

@CumpsD
Copy link
Member Author

CumpsD commented Mar 5, 2021

@Arkatufus I'll try to get around to it as fast as I can, somewhere next week

@to11mtm
Copy link
Member

to11mtm commented Mar 6, 2021

@CumpsD a couple notes on this:

  • If this is a breaking change we will need a migration script/application to handle, and we should probably have proper documentation for users around how to use in the upgrade.

  • Would you be able to help work such a change into Akka.Persistence.Linq2Db? The hope is that we are able to keep backwards compatibility for people who want to migrate from existing journals, but from a perf standpoint it's much faster so we are hoping people will adopt. I can assist ofc.

@Arkatufus
Copy link
Contributor

Arkatufus commented Mar 8, 2021

I did a performance test using 7 tags and 2 million records using the latest (13.2) PostgreSql docker image, the result is not encouraging.

Current implementation
Persist()-ing 2000000 took 4020253.7091 ms
Querying took 5390.3155 ms
Querying took 5039.1806 ms
Querying took 5053.6488 ms
Querying took 5045.1138 ms
Querying took 4984.1766 ms
Querying took 5216.0742 ms
Querying took 5095.1884 ms
Querying took 4991.0002 ms
Querying took 5712.957 ms
Querying took 4915.7664 ms
Query average time: 5144.34215 ms, 194388.31454863475 msg/sec
Median time: 5049.3813 ms

==============================

Tag-array
Persist()-ing 2000000 took 4046318.0437 ms
Querying took 5587.1745 ms
Querying took 6423.1314 ms
Querying took 5028.1429 ms
Querying took 5127.729 ms
Querying took 5073.5635 ms
Querying took 5310.665 ms
Querying took 5388.9509 ms
Querying took 5120.5908 ms
Querying took 5128.799 ms
Querying took 8580.9275 ms
Query average time: 5676.967449999999 ms, 176150.38465651235 msg/sec
Median time: 5219.732 ms

@CumpsD
Copy link
Member Author

CumpsD commented Mar 8, 2021

If it helps to get an idea about scale, we have 200 million events :D

@Arkatufus
Copy link
Contributor

There are a few outliers, but if you look at the median of both tests, you see that there are no significant improvement in performance between the 2 implementation.

@Arkatufus
Copy link
Contributor

@CumpsD feel free to clone #81 and run a 200 million event test on it, my educated guess is that there wont be any significant improvement over the query.

@CumpsD
Copy link
Member Author

CumpsD commented Mar 16, 2021

@Arkatufus just wanted to let you know I have not lost sight of this, will do when I find some time in the coming week(s)

@Aaronontheweb
Copy link
Member

Interesting - is this still on the table or should we be looking at Linq2Db instead?

@to11mtm
Copy link
Member

to11mtm commented Jun 16, 2022

Interesting - is this still on the table or should we be looking at Linq2Db instead?

There's advantages and drawbacks to Array datatype.

Obviously, it simplifies parts of the query pipeline, as you're able to write everything into one row and not interleave with tables.

The drawback is (AFAIK) Indexes on arrays are GIN (Generalized inverted) indexes. These can have different performance characteristics than BTrees on inserts and updates. BUT IIRC they can be more size efficient than BTree.

(FWIW, this would be easy to add as an option to Persistence.Linq2Db, Yet another tagmode flag 😅 .)

Side note regarding query performance: It may be prudent to flush Postgres' (and other DBs) internal caches for an operation like this, at least for some scenarios. If rows are recently inserted and pages fit into memory, unless there are a -lot- of tags I wouldn't expect to see much difference in perf.

@Aaronontheweb
Copy link
Member

Sounds like we're better off sticking to Linq2Db :p

@CumpsD
Copy link
Member Author

CumpsD commented Jun 23, 2022

I never got around to testing it because I'm not using it anymore. But purely postgress wise, this made more sense :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants