CurrentEventsByTag query on Sqlite has a weird initial delay #345

Closed
kpko opened this issue Jan 28, 2024 · 13 comments · Fixed by #415
Comments

@kpko

kpko commented Jan 28, 2024

Version Information
I'm using Akka.Cluster.Hosting v1.5.15 and the associated libraries like Akka 1.5.15.

Describe the performance issue
I'm using an ASP.NET Core background service that runs a PersistenceQuery (CurrentEventsByTag) at application start to retrieve some data and cache it in memory. I want my app to scale up using PostgreSQL but also scale down to Sqlite, which works great with Akka.Persistence.Sql by making the provider itself configurable.
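
A minimal sketch of what such a configurable-provider setup could look like with Akka.Hosting; the WithSqlPersistence call, the LinqToDB provider names, and the connection strings below are assumptions about Akka.Persistence.Sql.Hosting, not my exact setup:

// Program.cs fragment: switch the Akka.Persistence.Sql provider via configuration.
// WithSqlPersistence / ProviderName usage is assumed from Akka.Persistence.Sql.Hosting;
// connection strings are placeholders.
using Akka.Hosting;
using Akka.Persistence.Sql.Hosting;
using LinqToDB;

var builder = WebApplication.CreateBuilder(args);
var useSqlite = builder.Configuration.GetValue<bool>("UseSqlite");

builder.Services.AddAkka("test-system", akka =>
{
    if (useSqlite)
    {
        // connection string, then LinqToDB provider name
        akka.WithSqlPersistence("Data Source=app.db", ProviderName.SQLiteMS);
    }
    else
    {
        akka.WithSqlPersistence(
            "Host=localhost;Database=app;Username=app;Password=app",
            ProviderName.PostgreSQL);
    }
});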

The background service I'm using looks like this and is added to the ASP.NET Core app via AddHostedService.

public class TestBackgroundService : BackgroundService
{
    private readonly ActorSystem _actorSystem;

    public TestBackgroundService(ActorSystem actorSystem)
    {
        _actorSystem = actorSystem;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        Log.Information("TestBackgroundService: Starting query...");

        var query = PersistenceQuery
            .Get(_actorSystem)
            .ReadJournalFor<SqlReadJournal>(SqlReadJournal.Identifier)
            .CurrentEventsByTag("t:wfd", Offset.NoOffset());

        await foreach (var ee in query.RunAsAsyncEnumerable(_actorSystem).WithCancellation(stoppingToken))
        {
            Log.Information("TestBackgoundService: Got event {Event}", ee.Event);
        }
    }
}

When using the Sqlite provider (with Microsoft.Data.Sqlite v8.0.1), I get my first event after 11 seconds:

[17:25:24 INF] TestBackgroundService: Starting query...
[17:25:24 INF] Now listening on: http://localhost:12341
[17:25:24 INF] Application started. Press Ctrl+C to shut down.
[17:25:24 INF] Hosting environment: Development
[17:25:24 INF] Content root path: /Users/kp/Projects/test/src/test
[17:25:35 INF] TestBackgoundService: Got event Incremented { Tag = t:wfd }
[17:25:35 INF] TestBackgoundService: Got event Incremented { Tag = t:wfd }

This delay doesn't happen with PostgreSQL or with another query type. Is there a reason the CurrentEventsByTag query runs after a 10-11s delay? The Sqlite database itself is empty except for the two events in the journal. I'm using the tag table instead of csv tags.

Could you give me a tip on how to get more information about the persistence query performance?

Also, I expect the Sqlite provider to be slower than the PostgreSQL provider, and a 10s delay is not the end of the world for an ASP.NET Core web app - but this doesn't feel like just "bad performance"; it feels more like there's an intentional 10s delay somewhere, and I would like to understand why.

Environment

I'm using a 16-inch 2021 MacBook Pro (M1) with 16 GB RAM.
.NET SDK:
Version: 8.0.101
Commit: 6eceda187b
Workload version: 8.0.100-manifests.8a1da915

Runtime Environment:
OS Name: Mac OS X
OS Version: 14.0
OS Platform: Darwin
RID: osx-arm64
Base Path: /usr/local/share/dotnet/sdk/8.0.101/

@Aaronontheweb
Member

This delay doesn't happen with PostgreSQL or with another query type. Is there a reason the CurrentEventsByTag query runs after a 10-11s delay? The Sqlite database itself is empty except for the two events in the journal. I'm using the tag table instead of csv tags.

That is very weird - is that a Linq2Db discrepancy possibly @to11mtm ?

@to11mtm
Member

to11mtm commented Feb 13, 2024

This delay doesn't happen with PostgreSQL or with another query type. Is there a reason the CurrentEventsByTag query runs after a 10-11s delay? The Sqlite database itself is empty except for the two events in the journal. I'm using the tag table instead of csv tags.

That is very weird - is that a Linq2Db discrepancy possibly @to11mtm ?

Possibly, but I'd be surprised; MSSqlite is pretty fast. In most work environments I actually tend to use it for writing DAL quasi-integration tests and prototyping tests, so a delay like that is something I feel like I would have seen.

That said, I know #347 fixes some things that could cause pain in SQLite for existing data, but not necessarily for a 'fresh' DB. @kpko what does your connection string look like?

@kpko
Author

kpko commented Feb 16, 2024

Hi, I managed to reproduce the issue on a smaller stripped down example project: https://github.com/kpko/sqlite-query-delay-example

I needed some time to reproduce the issue because on my first tries it worked just fine and the query ran pretty quickly. I then tried to make the example app look more like my main project, and was confused when everything still worked as expected, even after I copied most of the functionality over.

Finally, I was able to reproduce this when I more or less copied the data in the database, and I found the culprit: once the ordering_id (on the journal and the tag table) is higher, the 10s delay starts. When there are two events/tags in the database with ordering ids 1 and 2, there's practically no delay. When I manually edit the database and change the ordering ids to 300 and 301 respectively, the initial delay happens.

Before, no delay; after, 00:00:10.4972806 delay (stopwatch). [screenshots of the journal and tag tables omitted]

I tried to figure out at what value this starts, and it seems it can already be reproduced when the ordering id is 2 instead of 1. Is this some kind of Sqlite query issue?
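
For anyone trying to reproduce this, a rough sketch of that manual edit using Microsoft.Data.Sqlite; the table and column names (journal.ordering, tags.ordering_id) are assumptions and need to be adjusted to the actual mapping:

// Shifts the existing ordering values so they start around 300 instead of 1.
// Table/column names are assumptions; adjust to your journal/tag table mapping.
using Microsoft.Data.Sqlite;

await using var connection = new SqliteConnection("Data Source=app.db");
await connection.OpenAsync();

await using var command = connection.CreateCommand();
command.CommandText =
    @"UPDATE journal SET ordering = ordering + 299;
      UPDATE tags SET ordering_id = ordering_id + 299;";
await command.ExecuteNonQueryAsync();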

@to11mtm
Member

to11mtm commented Feb 16, 2024

This explains a lot, thank you!
This is by design (though it can be tuned to your risk tolerance), and it's not necessarily unique to SQLite.

If the ordering numbers are non-linear or have a large gap, there will be delays on queries.

Why:

Consider the case where multiple processes are writing to the DB concurrently.

If A starts writing BEFORE B but doesn't finish until AFTER B, it's possible that A eventually succeeds (a delay in the visibility of ordered elements) or fails (a permanent gap in the ordering). In general, if we see a 'gap' in ordering IDs, the logic will 'scan' the range of IDs it considers missing for a certain period of time before finally 'releasing' the current Max Ordering ID.

Backoff defaults are configured/shown here. Note that the default backoff for a missing value is 10 queries with 1 second between them, which lines up fairly well with your timings.
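
For reference, a sketch of how those knobs could be tightened; the config path and key names here are assumptions based on the defaults discussed above, so verify them against the plugin's reference HOCON before relying on them:

// Tighten the gap-scan backoff. Config path/keys are assumptions, not verified
// against Akka.Persistence.Sql's shipped reference config.
using Akka.Configuration;

var tuning = ConfigurationFactory.ParseString(@"
    akka.persistence.query.journal.sql.journal-sequence-retrieval {
        max-tries = 3        # default 10
        query-delay = 250ms  # default 1s
        batch-size = 10000   # default
    }");

// Put 'tuning' in front of the rest of the ActorSystem configuration, e.g. via
// Akka.Hosting's AddHocon(..., HoconAddMode.Prepend).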

@to11mtm
Member

to11mtm commented Feb 16, 2024

Following up on this: after speaking with Aaron, we are keeping the issue open to continue investigating potential solutions, given the pain points this behavior can cause.

@kpko
Author

kpko commented Feb 22, 2024

Hi @to11mtm, thank you very much for the explanation. I understand now why it's implemented this way. As far as I'm concerned, this is totally fine.

I delete event streams once certain workflows in my project complete, which produced gaps in my log table pretty quickly. I hadn't considered this scenario and have no idea how to work around the issue you mentioned. Would you like the issue to stay open to work on an alternative solution? Otherwise, please feel free to close it.

Again, thank you for your time and the explanation. Keep up the good work :)

@to11mtm
Member

to11mtm commented Feb 22, 2024

We are trying to work towards a better solution for this, so we will keep this issue open for now. Thank you! :)

@lucavice
Contributor

lucavice commented Jul 4, 2024

Hi @to11mtm, I think I'm experiencing a similar issue to the OP but I'm not sure it has to do with gaps.

I'm using MS SQL instead of Postgres, and I am experiencing an initial delay with EventsByTag after moving from the Akka.Persistence.SqlServer plugin to this one with Akka 1.5.

When I use EventsByTag, I always provide an initial offset to resume from the latest point; this is how we implement persistent projections in our event-sourced system. But since moving to Akka.Persistence.Sql, I experience a delay that is proportional to the starting offset provided to EventsByTag. Examples:

  • Local environment with 500k events (starting point 500k): initial delay of 4 minutes
  • Staging environment with 1.4m events: initial delay of 17 minutes
  • Local environment with a clean journal (there is some initial seeded data): initial delay of about 10s, like the OP

Reducing the backoff parameters (number of tries, query delay, etc.) does affect this initial delay, so it does seem related to this backpressure mechanism.

Now my question is: what is the correct way to use EventsByTag with a high starting offset without having this huge initial delay? It does not seem intentional to me. There are no gaps in the ordering in my case, just a high starting offset.
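
For context, the resume pattern in question looks roughly like this (a minimal sketch: the tag, the offset store, the handler, and the actorSystem/cancellationToken variables are placeholders, not our actual code):

// Sketch of resuming a projection from a stored offset with EventsByTag.
// LoadLastProcessedOffsetAsync / ProcessEventAsync / actorSystem / cancellationToken
// are placeholders for the projection's own offset store, handler, and host.
var readJournal = PersistenceQuery
    .Get(actorSystem)
    .ReadJournalFor<SqlReadJournal>(SqlReadJournal.Identifier);

long lastOffset = await LoadLastProcessedOffsetAsync();

var source = readJournal.EventsByTag("projection-tag", Offset.Sequence(lastOffset));

await foreach (var envelope in source.RunAsAsyncEnumerable(actorSystem).WithCancellation(cancellationToken))
{
    await ProcessEventAsync(envelope);
    // persist envelope.Offset here so the projection can resume from it later
}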

At the moment, this makes Akka.Persistence.Sql unusable for me; if I were to deploy it to our production environment, it would take hours to start.

I can open a new issue if this turns out to be unrelated to the OP request.

Thanks for your help!

@ondrejpialek

ondrejpialek commented Jul 5, 2024

Would it be possible, as a temporary workaround, to update the backoff logic to disregard gaps for events written more than an hour or a day ago? A delay of tens of seconds might be acceptable; many minutes is really not...
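
Something along these lines, purely as an illustration (the names here are hypothetical and not the plugin's actual internals):

// Illustrative only: keep waiting/backing off for a gap only when the newest
// event observed before the gap was written within a configurable window.
static bool GapIsWorthWaitingFor(DateTime newestEventTimestampUtc, TimeSpan freshnessWindow)
    => DateTime.UtcNow - newestEventTimestampUtc < freshnessWindow;

// e.g. GapIsWorthWaitingFor(newestSeenWriteUtc, TimeSpan.FromHours(1))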

@Aaronontheweb
Member

A delay of tens of seconds might be acceptable; many minutes is really not...

I think that's a good idea

@lucavice
Contributor

lucavice commented Jul 5, 2024

@ondrejpialek @Aaronontheweb @to11mtm

So, I added some additional log statements into a local version of Akka.Persistence.Sql and I have a better understanding of what's happening:

  • When EventsByTag is called, gaps are searched for throughout the whole history of events, regardless of the requested starting offset
  • Because SQL Server does not guarantee that the identity column is consecutive, we do have many gaps
  • These gaps are inspected in batches. The default batch size is 10,000. Each batch will wait for at most MaxTries * QueryDelay, which by default is 10 * 1s = 10s.
  • Because at least one gap is very likely to appear in every 10,000 inserts, we end up waiting roughly 10s for every 10k events.

In a simulation of my local environment, where I have 27,393 events, the initialization of the JournalSequenceActor calls the FindGaps function 3 times, finding all the gaps in batches. [log output screenshot omitted]

In fact, my local environment projection takes 30s to start, which is consistent with this wait process (3 * 10s).

For our staging environment, we have 1.4m events and 21k gaps, which means that most 10,000-sized batches will likely encounter a gap. So, in the worst case, it will need to wait 140 * 10s with default settings, which is consistent with what I saw (17 minutes, a bit less than the worst case).
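
Putting those numbers together as a back-of-the-envelope bound (the variable names below just follow this discussion; they are not config keys):

// Rough worst-case initial delay when every batch has to sit through the full backoff.
var batchSize  = 10_000;                   // gap-scan batch size
var maxTries   = 10;                       // retries per missing ordering id
var queryDelay = TimeSpan.FromSeconds(1);  // delay between retries

long maxOrdering = 1_400_000;              // highest ordering id in the journal
long batches     = (maxOrdering + batchSize - 1) / batchSize;  // = 140
var worstCase    = TimeSpan.FromSeconds(batches * maxTries * queryDelay.TotalSeconds);
// worstCase ≈ 23 minutes; the observed 17 minutes stays a bit under that bound.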

I agree with @ondrejpialek: I think the gap-finding code should ignore any gaps in OrderingId that happened more than MaxTries * QueryDelay ago...

@Aaronontheweb
Member

When EventsByTag is called, gaps are searched for throughout the whole history of events, regardless of the requested starting offset

Yikes, that would do it.

I agree with @ondrejpialek: I think the gap-finding code should ignore any gaps in OrderingId that happened more than MaxTries * QueryDelay ago...

I will happily accept a PR to that effect - gaps should have a freshness window. We only care about very recent history, so maybe make this configurable?

@lucavice
Contributor

lucavice commented Jul 8, 2024

@Aaronontheweb @ondrejpialek @to11mtm It turns out that the logic was already in place and correct, but a small bug prevented this 'freshness' window from being applied as intended.

I provided a PR here: #415
