Use array datatype for tags storage and searching #70

CumpsD · 2020-03-02T18:39:59Z

Apparently tags are stored as a semicolon-separated list and searched with a LIKE %% query, which does not perform well once you have millions of events.

PostgreSQL can store tags in an array column and index it efficiently with a GIN index.

This PR turns the tags column into an array type and changes how tags are inserted and searched.
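
To make the contrast in the description concrete, here is a minimal Python sketch. It is not the PR's code and not how PostgreSQL implements either path. It assumes tags are stored as `;tag1;tag2;` and matched with a substring test, which stands in for the LIKE query, and it models a GIN index as a plain tag-to-rows dictionary. The event data and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical events as (ordering, tags); illustrative data only.
EVENTS = [
    (1, ["red", "green"]),
    (2, ["blue"]),
    (3, ["green", "darkgreen"]),
    (4, []),
]


def encode_tags(tags):
    """Delimited-string storage such as ';red;green;' (assumed format)."""
    return ";" + ";".join(tags) + ";" if tags else ""


def find_by_like(rows, tag):
    """Stand-in for `tags LIKE '%;tag;%'`: every row's string is scanned."""
    needle = f";{tag};"
    return [ordering for ordering, encoded in rows if needle in encoded]


def build_inverted_index(events):
    """Map each tag to the orderings that carry it (the idea behind GIN)."""
    index = defaultdict(list)
    for ordering, tags in events:
        for tag in tags:
            index[tag].append(ordering)
    return index


def find_by_index(index, tag):
    """Stand-in for `tags @> ARRAY['tag']` answered from the index."""
    return index.get(tag, [])


rows = [(ordering, encode_tags(tags)) for ordering, tags in EVENTS]
index = build_inverted_index(EVENTS)

print(find_by_like(rows, "green"))
print(find_by_index(index, "green"))
```

With the delimited string every lookup has to inspect every row, while the dictionary lookup touches only the entry for the requested tag. Whether that translates into faster queries in practice depends on the planner, the data size, and caching.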

Aaronontheweb · 2020-03-02T19:16:47Z

Looks like we need to update the build system here

CumpsD · 2020-03-02T19:53:05Z

Where is it defined? :p

Arkatufus · 2021-03-05T17:57:04Z

@CumpsD Sorry, can you resolve the merge conflict?

CumpsD · 2021-03-05T19:53:02Z

@Arkatufus I'll try to get around to it as soon as I can, sometime next week

to11mtm · 2021-03-06T17:19:55Z

@CumpsD a couple of notes on this:

If this is a breaking change, we will need a migration script/application to handle it, and we should probably have proper documentation for users on how to upgrade.
Would you be able to help work such a change into Akka.Persistence.Linq2Db? The hope is that we are able to keep backwards compatibility for people who want to migrate from existing journals, but from a perf standpoint it's much faster, so we are hoping people will adopt it. I can assist ofc.

Arkatufus · 2021-03-08T12:22:08Z

I ran a performance test with 7 tags and 2 million records on the latest (13.2) PostgreSQL Docker image; the results are not encouraging.

Current implementation
Persist()-ing 2000000 took 4020253.7091 ms
Querying took 5390.3155 ms
Querying took 5039.1806 ms
Querying took 5053.6488 ms
Querying took 5045.1138 ms
Querying took 4984.1766 ms
Querying took 5216.0742 ms
Querying took 5095.1884 ms
Querying took 4991.0002 ms
Querying took 5712.957 ms
Querying took 4915.7664 ms
Query average time: 5144.34215 ms, 194388.31454863475 msg/sec
Median time: 5049.3813 ms

==============================

Tag-array
Persist()-ing 2000000 took 4046318.0437 ms
Querying took 5587.1745 ms
Querying took 6423.1314 ms
Querying took 5028.1429 ms
Querying took 5127.729 ms
Querying took 5073.5635 ms
Querying took 5310.665 ms
Querying took 5388.9509 ms
Querying took 5120.5908 ms
Querying took 5128.799 ms
Querying took 8580.9275 ms
Query average time: 5676.967449999999 ms, 176150.38465651235 msg/sec
Median time: 5219.732 ms

CumpsD · 2021-03-08T12:24:09Z

If it helps to get an idea about scale, we have 200 million events :D

Arkatufus · 2021-03-08T12:25:44Z

There are a few outliers, but if you look at the medians of both tests, you can see that there is no significant improvement in performance between the two implementations.
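
The summary figures can be recomputed from the ten query timings reported for each run. The sketch below copies those timings verbatim and uses Python's standard `statistics` module; the variable and function names are hypothetical and this is not the benchmark code from the PR.

```python
from statistics import mean, median

# Query timings in milliseconds, copied from the benchmark output in this thread.
CURRENT_MS = [
    5390.3155, 5039.1806, 5053.6488, 5045.1138, 4984.1766,
    5216.0742, 5095.1884, 4991.0002, 5712.957, 4915.7664,
]
TAG_ARRAY_MS = [
    5587.1745, 6423.1314, 5028.1429, 5127.729, 5073.5635,
    5310.665, 5388.9509, 5120.5908, 5128.799, 8580.9275,
]


def summarize(samples):
    """Return the mean and median of a list of timings."""
    return {"mean": mean(samples), "median": median(samples)}


current = summarize(CURRENT_MS)
tag_array = summarize(TAG_ARRAY_MS)

# Relative change of the tag-array median against the current implementation.
median_change = tag_array["median"] / current["median"] - 1.0

print(f"current:   mean={current['mean']:.2f} ms, median={current['median']:.2f} ms")
print(f"tag-array: mean={tag_array['mean']:.2f} ms, median={tag_array['median']:.2f} ms")
print(f"median change: {median_change:+.1%}")
```

The median is the more robust figure here because the 8580.9275 ms sample pulls the tag-array mean up noticeably. With only ten samples per run, a difference of a few percent between the medians is hard to distinguish from noise.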

Arkatufus · 2021-03-12T23:03:57Z

@CumpsD feel free to clone #81 and run a 200 million event test on it; my educated guess is that there won't be any significant improvement in query performance.

CumpsD · 2021-03-16T17:09:19Z

@Arkatufus just wanted to let you know I have not lost sight of this, will do when I find some time in the coming week(s)

Aaronontheweb · 2022-06-16T14:47:17Z

Interesting - is this still on the table or should we be looking at Linq2Db instead?

to11mtm · 2022-06-16T15:23:54Z

> Interesting - is this still on the table or should we be looking at Linq2Db instead?

There are advantages and drawbacks to the array datatype.

Obviously, it simplifies parts of the query pipeline, as you're able to write everything into one row and not interleave with tables.

The drawback is that (AFAIK) indexes on arrays are GIN (Generalized Inverted Index) indexes. These can have different performance characteristics than B-trees on inserts and updates, but IIRC they can be more size-efficient than a B-tree.

(FWIW, this would be easy to add as an option to Persistence.Linq2Db. Yet another tagmode flag 😅.)

Side note regarding query performance: it may be prudent to flush Postgres' (and other DBs') internal caches for a test like this, at least for some scenarios. If rows were recently inserted and the pages fit into memory, I wouldn't expect to see much difference in perf unless there are a lot of tags.

Aaronontheweb · 2022-06-16T15:24:45Z

Sounds like we're better off sticking to Linq2Db :p

CumpsD · 2022-06-23T13:14:39Z

I never got around to testing it because I'm not using it anymore. But purely PostgreSQL-wise, this made more sense :)

CumpsD added 2 commits March 2, 2020 19:36

use arrays and gin index to improve tag searching

90d2119

follow coding style for usings

a496338

Arkatufus mentioned this pull request Mar 8, 2021

[DO NOT MERGE] Add EventsByTag performace test #81

Open

to11mtm mentioned this pull request Dec 18, 2021

[FEATURE] Migrate Akka.Persistence.Sql.Common to Akka.Persistence.Linq2Db as the new "core engine" akkadotnet/akka.net#5408

Closed
