Retroactive User Recognition at scale without redis #565
The whole user recognition flow relies on the fact that we can pull a user by anonymous_id relatively quickly. Besides Redis, the following storages will work:
Here are the caveats:
I believe that the best way to deal with that would be a) sending data to S3 with Jitsu, and b) writing a Spark job that processes the data and sends the updated events back to Jitsu. The real-time aspect of Jitsu would be lost, but it would still do a better job than writing an in-house merger.
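The batch merge described above boils down to a join between the raw events dumped to S3 and an anonymous-id → user-id mapping. A minimal stand-in sketch (plain Python in place of the Spark job; the field names `anonymous_id` and `user_id` are assumptions, not Jitsu's actual schema):

```python
# Stand-in for the proposed Spark job: join raw anonymous events against the
# identity mapping and emit updated events. A real job would do the same with
# Spark DataFrames over S3 data instead of in-memory lists.
def merge_identities(events: list[dict], mapping: dict[str, str]) -> list[dict]:
    """Attach user_id to each event whose anonymous_id has been identified."""
    merged = []
    for event in events:
        user_id = mapping.get(event.get("anonymous_id"))
        # Events with no known identity pass through unchanged.
        merged.append({**event, "user_id": user_id} if user_id else dict(event))
    return merged
```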
@vklimontovich But with a DB like Redshift or Snowflake this can be achieved by only storing unique anonymous user IDs in Redis.
Or am I missing something? I believe that retroactive user recognition is inherently something that should not be real time: "retroactive" means after the fact. That's my intuition.
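With a warehouse as the destination, the retroactive part can indeed be a single after-the-fact batch statement, so Redis only has to hold the small anonymous-id → user-id mapping. A hedged sketch (table and column names are assumptions; the placeholders would be filled by the warehouse driver, e.g. a psycopg2-style connector for Redshift):

```python
# Sketch: backfill earlier anonymous events in the warehouse once the user
# identifies. Only the id mapping needs a hot store; the update runs in batch.
def build_backfill_sql(table: str = "events") -> str:
    """Build a parameterized UPDATE stamping the resolved user onto old rows."""
    return (
        f"UPDATE {table} "
        "SET user_id = %s "
        "WHERE anonymous_id = %s AND user_id IS NULL"
    )
```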
It's time to revisit this given the new architecture of Jitsu Next and the fact that we use Mongo as the underlying storage for user recognition. Here's a preliminary design:
We should think through the design in detail; generally speaking, it's easy to do if the database allows pulling events by id.
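With a store that can pull events by id, the recognition step reduces to one filtered bulk update. A minimal sketch in Mongo terms, building the documents you would pass to `collection.update_many()` (the `anonymousId`/`userId` field names are assumptions, not the actual Jitsu schema):

```python
# Sketch of the Mongo-backed flow: anonymous events are keyed by anonymousId;
# on identify, select the still-anonymous ones and stamp the resolved userId.
def recognition_update(anonymous_id: str, user_id: str):
    """Return the (filter, update) pair for collection.update_many()."""
    query = {"anonymousId": anonymous_id, "userId": {"$exists": False}}
    update = {"$set": {"userId": user_id}}
    return query, update
```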
Closing: we already got rid of Redis and moved to Mongo.
Problem
According to the docs, the current implementation stores all anonymous events in Redis.
This has two significant downsides:
As you pointed out in the documentation:
10M events / month is really not that much:
that's ~231 events per minute, or ~3.9 events per second.
Large-scale tracking load is measured in thousands of events per second.
At this point Redis RAM consumption will explode.
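The back-of-the-envelope arithmetic behind those figures (assuming a 30-day month):

```python
# 10M events/month converted to per-minute and per-second rates.
events_per_month = 10_000_000
minutes_per_month = 30 * 24 * 60            # 43,200 minutes
per_minute = events_per_month / minutes_per_month  # ≈ 231.5
per_second = per_minute / 60                       # ≈ 3.9
```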
Solution
Implement Retroactive User Recognition as a background task; there is no need to store those events in a hot cache like Redis.
Instead, they can be stored in any cloud storage as files (under a path containing the user's anonymous ID).
This solves the RAM consumption problem.
And once a user is identified,
a background process can update the records with the identified user.
In this scenario Redis would only contain the coordinating info (and could be updated asynchronously).
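The proposed flow can be sketched end to end. This is a minimal illustration using the local filesystem as a stand-in for cloud storage; the path layout and field names are assumptions, not Jitsu's actual design:

```python
import json
from pathlib import Path

# Stand-in for a cloud bucket: one folder per anonymous id.
ROOT = Path("anonymous-events")

def store_anonymous_event(anonymous_id: str, event: dict) -> None:
    """Write the event as a file under a path keyed by the anonymous id."""
    folder = ROOT / anonymous_id
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{len(list(folder.iterdir()))}.json"
    path.write_text(json.dumps(event))

def recognize_user(anonymous_id: str, user_id: str) -> int:
    """Background task: stamp the identified user onto the stored events."""
    updated = 0
    for path in (ROOT / anonymous_id).glob("*.json"):
        event = json.loads(path.read_text())
        event["user_id"] = user_id
        path.write_text(json.dumps(event))
        updated += 1
    return updated
```

In a real deployment the rewrite step would run asynchronously against object storage (e.g. S3), with Redis holding only the coordination state.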