CurrentEventsByTag query on Sqlite has a weird initial delay #345

Closed
kpko opened this issue Jan 28, 2024 · 13 comments · Fixed by #415
Comments

@kpko

kpko commented Jan 28, 2024

Version Information
I'm using Akka.Cluster.Hosting v1.5.15 and the associated libraries like Akka 1.5.15.

Describe the performance issue
I'm using an ASP.NET Core background service that runs a PersistenceQuery (CurrentEventsByTag) at application start to retrieve some data and cache it in memory. I want my app to scale up using PostgreSQL but also scale down to Sqlite, which works great with Akka.Persistence.Sql by making the provider itself configurable.
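
A minimal sketch of what such a configurable-provider setup could look like with Akka.Hosting; the WithSqlPersistence call, the LinqToDB provider names, and the connection strings below are assumptions about Akka.Persistence.Sql.Hosting, not my exact setup:

// Program.cs fragment: switch the Akka.Persistence.Sql provider via configuration.
// WithSqlPersistence / ProviderName usage is assumed from Akka.Persistence.Sql.Hosting;
// connection strings are placeholders.
using Akka.Hosting;
using Akka.Persistence.Sql.Hosting;
using LinqToDB;

var builder = WebApplication.CreateBuilder(args);
var useSqlite = builder.Configuration.GetValue<bool>("UseSqlite");

builder.Services.AddAkka("test-system", akka =>
{
    if (useSqlite)
    {
        // connection string, then LinqToDB provider name
        akka.WithSqlPersistence("Data Source=app.db", ProviderName.SQLiteMS);
    }
    else
    {
        akka.WithSqlPersistence(
            "Host=localhost;Database=app;Username=app;Password=app",
            ProviderName.PostgreSQL);
    }
});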

The background service I'm using looks like this and is added to the ASP.NET Core app via AddHostedService.

public class TestBackgroundService : BackgroundService
{
    private readonly ActorSystem _actorSystem;

    public TestBackgroundService(ActorSystem actorSystem)
    {
        _actorSystem = actorSystem;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        Log.Information("TestBackgroundService: Starting query...");

        var query = PersistenceQuery
            .Get(_actorSystem)
            .ReadJournalFor<SqlReadJournal>(SqlReadJournal.Identifier)
            .CurrentEventsByTag("t:wfd", Offset.NoOffset());

        await foreach (var ee in query.RunAsAsyncEnumerable(_actorSystem).WithCancellation(stoppingToken))
        {
            Log.Information("TestBackgoundService: Got event {Event}", ee.Event);
        }
    }
}

When using the Sqlite provider (with Microsoft.Data.Sqlite v8.0.1), I get my first event after 11 seconds:

[17:25:24 INF] TestBackgroundService: Starting query...
[17:25:24 INF] Now listening on: http://localhost:12341
[17:25:24 INF] Application started. Press Ctrl+C to shut down.
[17:25:24 INF] Hosting environment: Development
[17:25:24 INF] Content root path: /Users/kp/Projects/test/src/test
[17:25:35 INF] TestBackgoundService: Got event Incremented { Tag = t:wfd }
[17:25:35 INF] TestBackgoundService: Got event Incremented { Tag = t:wfd }

This delay doesn't happen with PostgreSQL or with another query type. Is there a reason the CurrentEventsByTag query runs after a 10-11s delay? The Sqlite database itself is empty except for the two events in the journal. I'm using the tag table instead of csv tags.

Could you give me a tip on how to get more information about the persistence query performance?

Also, I expect the Sqlite provider to be slower than the PostgreSQL provider, and a 10s delay is not the end of the world for an ASP.NET Core web app - but this doesn't feel like just "bad performance"; it feels more like there's an intentional 10s delay somewhere, and I would like to understand why.

Environment

I'm using a 16-inch 2021 MacBook Pro (M1) with 16 GB RAM.
.NET SDK:
Version: 8.0.101
Commit: 6eceda187b
Workload version: 8.0.100-manifests.8a1da915

Runtime Environment:
OS Name: Mac OS X
OS Version: 14.0
OS Platform: Darwin
RID: osx-arm64
Base Path: /usr/local/share/dotnet/sdk/8.0.101/

@Aaronontheweb
Member

This delay doesn't happen with PostgreSQL or with another query type. Is there a reason the CurrentEventsByTag query runs after a 10-11s delay? The Sqlite database itself is empty except for the two events in the journal. I'm using the tag table instead of csv tags.

That is very weird - is that a Linq2Db discrepancy possibly @to11mtm ?

@to11mtm
Member

to11mtm commented Feb 13, 2024

This delay doesn't happen with PostgreSQL or with another query type. Is there a reason the CurrentEventsByTag query runs after a 10-11s delay? The Sqlite database itself is empty except for the two events in the journal. I'm using the tag table instead of csv tags.

That is very weird - is that a Linq2Db discrepancy possibly @to11mtm ?

Possibly, but I'd be surprised; MSSqlite is pretty fast. In most work environments I actually tend to use it for writing DAL quasi-integration tests and prototyping tests, so a delay like that is something I feel like I would have seen.

That said, I know #347 fixes some things that could cause pain in SQLite for existing data, but not necessarily for a 'fresh' DB. @kpko what does your connection string look like?

@kpko
Author

kpko commented Feb 16, 2024

Hi, I managed to reproduce the issue on a smaller stripped down example project: https://github.com/kpko/sqlite-query-delay-example

I needed some time to reproduce the issue because on my first tries it worked just fine and the query ran pretty quickly. I then tried to make the example app look more like my main project, and was confused when everything still worked as expected, even after I copied most of the functionality over.

Finally, I was able to reproduce this when I more or less copied the data in the database, and I found the culprit: once the ordering_id (on the journal and the tag table) is higher, the 10s delay starts. When there are two events/tags in the database with ordering ids 1 and 2, there's practically no delay. When I manually edit the database and change the ordering ids to 300 and 301 respectively, the initial delay happens.

Before, no delay; after, 00:00:10.4972806 delay (stopwatch). [screenshots of the journal and tag tables omitted]

I tried to figure out at what value this starts, and it seems it can already be reproduced when the ordering id is 2 instead of 1. Is this some kind of Sqlite query issue?
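
For anyone trying to reproduce this, a rough sketch of that manual edit using Microsoft.Data.Sqlite; the table and column names (journal.ordering, tags.ordering_id) are assumptions and need to be adjusted to the actual mapping:

// Shifts the existing ordering values so they start around 300 instead of 1.
// Table/column names are assumptions; adjust to your journal/tag table mapping.
using Microsoft.Data.Sqlite;

await using var connection = new SqliteConnection("Data Source=app.db");
await connection.OpenAsync();

await using var command = connection.CreateCommand();
command.CommandText =
    @"UPDATE journal SET ordering = ordering + 299;
      UPDATE tags SET ordering_id = ordering_id + 299;";
await command.ExecuteNonQueryAsync();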

@to11mtm
Member

to11mtm commented Feb 16, 2024

This explains a lot, thank you!
This is by design (though it can be tuned to your risk tolerance), and it's not necessarily unique to SQLite.

If the ordering numbers are non-linear or have a large gap, there will be delays on queries.

Why:

Consider the case where multiple processes are writing to the DB concurrently.

If A starts writing BEFORE B but doesn't finish until AFTER B, it's possible that A eventually succeeds (a delay in the visibility of ordered elements) or fails (a permanent gap in the ordering). In general, if we see a 'gap' in ordering IDs, the logic will 'scan' the range of IDs it considers missing for a certain period of time before finally 'releasing' the current Max Ordering ID.

Backoff defaults are configured/shown here. Note that the default backoff for a missing value is 10 queries with 1 second between them, which lines up fairly well with your timings.
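
For reference, a sketch of how those knobs could be tightened; the config path and key names here are assumptions based on the defaults discussed above, so verify them against the plugin's reference HOCON before relying on them:

// Tighten the gap-scan backoff. Config path/keys are assumptions, not verified
// against Akka.Persistence.Sql's shipped reference config.
using Akka.Configuration;

var tuning = ConfigurationFactory.ParseString(@"
    akka.persistence.query.journal.sql.journal-sequence-retrieval {
        max-tries = 3        # default 10
        query-delay = 250ms  # default 1s
        batch-size = 10000   # default
    }");

// Put 'tuning' in front of the rest of the ActorSystem configuration, e.g. via
// Akka.Hosting's AddHocon(..., HoconAddMode.Prepend).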

@to11mtm
Member

to11mtm commented Feb 16, 2024

Following up on this: after speaking with Aaron, we are keeping the issue open to continue investigating potential solutions, given the pain points this behavior can cause.

@kpko
Author

kpko commented Feb 22, 2024

Hi @to11mtm, thank you very much for the explanation. I understand now why it's implemented this way. As far as I'm concerned, this is totally fine.

I delete event streams once certain workflows in my project complete, which produced gaps in my log table pretty quickly. I hadn't considered this scenario and have no idea how to work around the issue you mentioned. Would you like the issue to stay open to work on an alternative solution? Otherwise, please feel free to close it.

Again, thank you for your time and the explanation. Keep up the good work :)

@to11mtm
Member

to11mtm commented Feb 22, 2024

We are trying to work towards a better solution for this, so we will keep this issue open for now. Thank you! :)

@lucavice
Contributor

lucavice commented Jul 4, 2024

Hi @to11mtm, I think I'm experiencing a similar issue to the OP but I'm not sure it has to do with gaps.

I'm using MS SQL instead of Postgres, and I am experiencing an initial delay with EventsByTag after moving from the Akka.Persistence.SqlServer plugin to this one with Akka 1.5.

When I use EventsByTag, I always provide an initial offset to resume from the latest point; this is how we implement persistent projections in our event-sourced system. But since moving to Akka.Persistence.Sql, I experience a delay that is proportional to the starting offset provided to EventsByTag. Examples:

  • Local environment with 500k events (starting point 500k): initial delay of 4 minutes
  • Staging environment with 1.4m events: initial delay of 17 minutes
  • Local environment with a clean journal (there is some initial seeded data): initial delay of about 10s, like the OP

Reducing the backoff parameters (number of tries, query delay, etc.) does affect this initial delay, so it does seem related to this backpressure mechanism.

Now my question is: what is the correct way to use EventsByTag with a high starting offset without having this huge initial delay? It does not seem intentional to me. There are no gaps in the ordering in my case, just a high starting offset.
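
For context, the resume pattern in question looks roughly like this (a minimal sketch: the tag, the offset store, the handler, and the actorSystem/cancellationToken variables are placeholders, not our actual code):

// Sketch of resuming a projection from a stored offset with EventsByTag.
// LoadLastProcessedOffsetAsync / ProcessEventAsync / actorSystem / cancellationToken
// are placeholders for the projection's own offset store, handler, and host.
var readJournal = PersistenceQuery
    .Get(actorSystem)
    .ReadJournalFor<SqlReadJournal>(SqlReadJournal.Identifier);

long lastOffset = await LoadLastProcessedOffsetAsync();

var source = readJournal.EventsByTag("projection-tag", Offset.Sequence(lastOffset));

await foreach (var envelope in source.RunAsAsyncEnumerable(actorSystem).WithCancellation(cancellationToken))
{
    await ProcessEventAsync(envelope);
    // persist envelope.Offset here so the projection can resume from it later
}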

At the moment, this makes Akka.Persistence.Sql unusable for me; if I were to deploy it to our production environment, it would take hours to start.

I can open a new issue if this turns out to be unrelated to the OP request.

Thanks for your help!

@ondrejpialek

ondrejpialek commented Jul 5, 2024

Would it be possible, as a temporary workaround, to update the backoff logic to disregard gaps for events written more than an hour or a day ago? A delay of tens of seconds might be acceptable; many minutes is really not...
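
Something along these lines, purely as an illustration (the names here are hypothetical and not the plugin's actual internals):

// Illustrative only: keep waiting/backing off for a gap only when the newest
// event observed before the gap was written within a configurable window.
static bool GapIsWorthWaitingFor(DateTime newestEventTimestampUtc, TimeSpan freshnessWindow)
    => DateTime.UtcNow - newestEventTimestampUtc < freshnessWindow;

// e.g. GapIsWorthWaitingFor(newestSeenWriteUtc, TimeSpan.FromHours(1))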

@Aaronontheweb
Member

A delay of tens of seconds might be acceptable; many minutes is really not...

I think that's a good idea

@lucavice
Contributor

lucavice commented Jul 5, 2024

@ondrejpialek @Aaronontheweb @to11mtm

So, I added some additional log statements into a local version of Akka.Persistence.Sql and I have a better understanding of what's happening:

  • When EventsByTag is called, gaps are searched for throughout the whole history of events, regardless of the requested starting offset
  • Because SQL Server does not guarantee that the identity column is consecutive, we do have many gaps
  • These gaps are inspected in batches. The default batch size is 10,000. Each batch will wait for at most MaxTries * QueryDelay, which by default is 10 * 1s = 10s.
  • Because at least one gap is very likely to appear in every 10,000 inserts, we end up waiting roughly 10s for every 10k events.

In a simulation of my local environment, where I have 27,393 events, the initialization of the JournalSequenceActor calls the FindGaps function 3 times, finding all the gaps in batches. [log output screenshot omitted]

In fact, my local environment projection takes 30s to start, which is consistent with this wait process (3 * 10s).

For our staging environment, we have 1.4m events and 21k gaps, which means that most 10,000-sized batches will likely encounter a gap. So, in the worst case, it will need to wait 140 * 10s with default settings, which is consistent with what I saw (17 minutes, a bit less than the worst case).
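
Putting those numbers together as a back-of-the-envelope bound (the variable names below just follow this discussion; they are not config keys):

// Rough worst-case initial delay when every batch has to sit through the full backoff.
var batchSize  = 10_000;                   // gap-scan batch size
var maxTries   = 10;                       // retries per missing ordering id
var queryDelay = TimeSpan.FromSeconds(1);  // delay between retries

long maxOrdering = 1_400_000;              // highest ordering id in the journal
long batches     = (maxOrdering + batchSize - 1) / batchSize;  // = 140
var worstCase    = TimeSpan.FromSeconds(batches * maxTries * queryDelay.TotalSeconds);
// worstCase ≈ 23 minutes; the observed 17 minutes stays a bit under that bound.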

I agree with @ondrejpialek: I think the gap-finding code should ignore any gaps in OrderingId that happened more than MaxTries * QueryDelay ago...

@Aaronontheweb
Member

When EventsByTag is called, gaps are searched for throughout the whole history of events, regardless of the requested starting offset

Yikes, that would do it.

I agree with @ondrejpialek: I think the gap-finding code should ignore any gaps in OrderingId that happened more than MaxTries * QueryDelay ago...

I will happily accept a PR to that effect - gaps should have a freshness window. We only care about very recent history, so maybe make this configurable?

@lucavice
Contributor

lucavice commented Jul 8, 2024

@Aaronontheweb @ondrejpialek @to11mtm It turns out that the logic was already in place and correct, but a small bug prevented this 'freshness' window from being applied as intended.

I provided a PR here: #415
