Applying ops under load #123
Comments
Thanks for the work you've done so far! Do you happen to have the MongoDB logs? They should contain the queries being made, as well as how long those queries are taking; it'd really help to narrow down whether the issue is a slow query, or something else.
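For reference, one low-effort way to surface those query timings is MongoDB's built-in profiler (standard mongo shell commands; the 100 ms threshold below is just an example):

```js
// Record operations slower than 100 ms (profiling level 1 = slow operations only).
db.setProfilingLevel(1, 100);

// After reproducing the load, inspect the most recent slow operations.
db.system.profile.find().sort({ ts: -1 }).limit(10).pretty();

// Current profiler settings, in case you want to check or restore them later.
db.getProfilingStatus();
```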
Yeah, thanks for providing the detailed timing logs, those are illustrative! On the first attempt, the two long delays are:
Those are indeed both in sharedb-mongo + Mongo. As Alec mentions, the slowness could be coming from:
It would be interesting to take a performance profile of, say, 2 Node processes handling requests from 20 clients, to see if the bottleneck is Node process CPU, time spent waiting on Mongo, or something else entirely. With just 2 Node processes, it should be feasible to run them both with profiling enabled.
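For example (just a sketch, assuming you can control how the processes are launched and that the entry point is something like `server.js`), Node's built-in V8 profiler could be used:

```sh
# Start each process with the V8 sampling profiler enabled.
node --prof server.js

# After the test run, post-process the generated isolate log file.
node --prof-process isolate-0x*-v8.log > profile.txt
```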
@alecgibson @ericyhwang thanks for the hints - I will collect more info.
I haven't done profiling yet; it requires more effort to hook into our Kubernetes infrastructure, but I'll do it. Meanwhile, I set the mongo log level to debug. I also added a log wrapped in nextTick before fetching ops/snapshot, to check whether Node.js is overloaded (roughly the sketch below). Adapting the example above: in the cases where fetching ops/snapshot takes too much time, my nextTick log is called just after a few mongo log lines (so Node.js should not be overloaded), but there are a lot of calls to mongo (sometimes hundreds) before the ops/snapshot is actually fetched.
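The check itself is roughly this (a sketch, not the exact code):

```js
// Schedule a nextTick callback right before fetching ops/snapshot and log how
// long it takes to fire; a large delay would point at a busy Node process.
function logNextTickDelay(label) {
  const scheduledAt = Date.now();
  process.nextTick(() => {
    console.log(`[${label}] nextTick fired ${Date.now() - scheduledAt}ms after scheduling`);
  });
}

// Called right before the adapter fetches the snapshot/ops, e.g.:
logNextTickDelay('before getSnapshot');
```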
Attaching new logs with my comments (separated with //):
Based on that, it currently seems like the slowness is due to Mongo time and not Node time, so it's probably best to look further at Mongo for now instead of trying to get Node profiling working in Kubernetes. The first slow part of the log has a bunch of repeated queries; later on, there are also multiple queries against the ops collection.
Hi @ericyhwang,
Thank you for trying to guide me, but I didn't get this part - how do I check this?
This is also unclear to me. What is "enough" in this context? I'm actually doing this load testing to understand what I need in order to allow up to 500 users to work concurrently (up to 10 users per document). And at some point (around 5 users per doc), scaling sharedb stopped improving the experience.
https://docs.mongodb.com/manual/reference/method/db.collection.getIndexes/
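In the mongo shell that's simply (collection names below assume the snapshot collection is `documents` and that sharedb-mongo's ops collection uses its usual `o_` prefix):

```js
// Indexes on the snapshot collection and on its ops collection.
db.documents.getIndexes();
db.o_documents.getIndexes();
```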
🤦, thanks @alecgibson. Documents collection indexes:
Ops collection indexes:
Out of curiosity, have you got …
Also, do you have access to the MongoDB logs from the …
Yes, before I created this issue I had tried …
I'll try to collect them.
I added more logs to figure out what that queue of queries is that delays getSnapshot during a slow op, and now I see that it's because of broadcasting presence data. That bunch of identical queries is a set of getSnapshotOpLink requests (getSnapshotOpLink is part of transformPresenceToLatestVersion). So the client(s) produce this load by sharing presence data; I will look more into this. See the attached log file (it has more stack trace information).
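The extra instrumentation is roughly of this shape (a sketch, not the exact code; the argument list follows sharedb's DB adapter interface, so adjust if your version differs):

```js
// Wrap the adapter's getOps to log how long each call takes and the stack that
// triggered it (e.g. the presence transform path discussed above).
const originalGetOps = db.getOps.bind(db);
db.getOps = function (collection, id, from, to, options, callback) {
  const start = Date.now();
  const trace = new Error('getOps call site').stack;
  originalGetOps(collection, id, from, to, options, function (error, ops) {
    console.log(`getOps ${collection}/${id} [${from}, ${to}) took ${Date.now() - start}ms\n${trace}`);
    callback(error, ops);
  });
};
```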
Without presence it's faster than light.
@yaroslavputria thanks for all the digging you've done on this! Here are some of my initial thoughts:

We definitely don't want Presence bogging down "real" ShareDB work. In theory, Presence itself is OT, so it should be able to work at a similar performance level. I think the issue here is a combination of:
Let's compare this with submitting an op. In the "happy" case (clients don't submit concurrent ops), we never have to fetch any ops, and we reuse the snapshot that we already had.

Contrast this with how presence is transformed: even in the happy case, we call into the database (the getSnapshotOpLink queries seen above).

There's possibly some trickery we can do here to re-optimise for the "happy" case, because we transform Presence both server-side and client-side. If you turned off server-side transformations, this would massively reduce the calls to the database, because we'd only ever call it if the client needs to catch up on some presence. This would come at the cost of some potential lag introduced into the Presence system where ops collide with Presence.

I've opened share/sharedb#538 to continue discussion down that thread. @yaroslavputria would you be able to please re-run your performance test using that branch?
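For anyone following along, the presence traffic in question comes from the client-side presence API, roughly like this (a sketch; `connection`, `docId` and the cursor payload are placeholders):

```js
// Each localPresence.submit() is a presence update that has to be transformed
// against ops; server-side, this is what can trigger the op fetches above.
const presence = connection.getDocPresence('documents', docId);
presence.subscribe();
presence.on('receive', (presenceId, value) => {
  // value is a remote user's presence (e.g. a cursor range), or null when they leave
});

const localPresence = presence.create();
// e.g. on every local cursor/selection change:
localPresence.submit({ index: 5, length: 0 }, (error) => {
  if (error) console.error(error);
});
```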
Hi @alecgibson, |
Hi Guys!
About the project: it's a Quill collaborative rich-text editor. On the back-end side, there is a load balancer over 5 instances of sharedb connected through redis. Sharedb-mongo is the DB adapter.
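For completeness, each back-end instance is wired up roughly like this (a sketch with placeholder connection strings; I'm assuming the standard sharedb-redis-pubsub package for the redis part):

```js
// One of the 5 sharedb instances behind the load balancer.
const ShareDB = require('sharedb');
const ShareDBMongo = require('sharedb-mongo');
const RedisPubSub = require('sharedb-redis-pubsub');

const backend = new ShareDB({
  db: ShareDBMongo('mongodb://mongo:27017/app'),   // placeholder connection string
  pubsub: RedisPubSub('redis://redis:6379'),       // placeholder connection string
  presence: true                                   // presence enabled (see the comments above)
});
```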
I'm doing a kind of load testing: I'm running end-to-end tests which simulate users editing, where each user produces changes with some frequency. I'm limited in resources, so I'm able to run about 50 chromedriver instances (so 50 users). As I have 5 instances of sharedb, I expect 5 users per document working concurrently, so these tests simulate 50 users typing within 10 documents. The consumed CPU and RAM look good. When I noticed that under such a load synchronizing users takes time, I added logs to trace the time of applying an op. I see that applying an op mostly takes milliseconds, BUT sometimes it takes even more than 10 seconds; that's the case when the "commit" is unsuccessful and sharedb has to re-fetch the snapshot/ops (to re-transform the op). Most of this time is spent on accessing mongo.
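This kind of timing can be collected with ShareDB middleware (a sketch using sharedb's 'submit' and 'afterWrite' middleware actions; not necessarily the exact logging used here):

```js
// Log when an op enters the backend and when it has been written, so the two
// lines can be correlated by collection/id/version.
backend.use('submit', (context, next) => {
  console.log(`${Date.now()} submit     ${context.collection}/${context.id} v${context.op && context.op.v}`);
  next();
});

backend.use('afterWrite', (context, next) => {
  console.log(`${Date.now()} afterWrite ${context.collection}/${context.id} v${context.op && context.op.v}`);
  next();
});
```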
Is there anything that I can improve? Or is it a limit?
I use only one mongo collection - will using multiple collections improve the experience?
Applying op timing: