
fix(group): make MLS group thread safe #1349 #1404

Open
wants to merge 23 commits into main

Conversation

mchenani
Contributor

Changes:

  • Make adding and removing members in an MLS group thread-safe.
  • Add a Mutex<HashMap<Vec<u8>, Arc<Semaphore>>> so that each group has its own lock.

Issue:
In rare parallel scenarios, invoking Add or Remove members on an MLS group could result in a ForkedGroups issue, leaving some users in an outdated state. The problem arises when the group state is fetched and a commit is generated concurrently, potentially publishing two or more intents based on the same group state. If one of these intents is published and the commit it contains is merged, the second intent, even if republished, continues to reference the outdated group state.

As a result, the client is unable to decrypt the commit due to an AEAD error. Furthermore, all subsequent messages, which belong to future epochs, remain undecryptable.
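Roughly, the per-group lock registry described above could look like the following sketch (illustrative only, assuming tokio's Semaphore; GroupCommitLock and its method names are stand-ins, not necessarily the exact identifiers used in this PR):

use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

// One single-permit semaphore per group, keyed by the group id bytes.
pub struct GroupCommitLock {
    locks: Mutex<HashMap<Vec<u8>, Arc<Semaphore>>>,
}

impl GroupCommitLock {
    pub fn new() -> Self {
        Self { locks: Mutex::new(HashMap::new()) }
    }

    // Async path: wait until no other commit-producing operation
    // (add/remove members, sync, ...) holds this group's permit.
    pub async fn get_lock_async(&self, group_id: Vec<u8>) -> OwnedSemaphorePermit {
        let semaphore = {
            let mut locks = self.locks.lock().expect("lock map poisoned");
            locks
                .entry(group_id)
                .or_insert_with(|| Arc::new(Semaphore::new(1)))
                .clone()
        };
        semaphore.acquire_owned().await.expect("semaphore is never closed")
    }
}

Holding the returned permit for the duration of a commit serializes add/remove/sync per group; a synchronous caller can use try_acquire_owned instead, which is the variant discussed further down in this review.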

@mchenani mchenani requested a review from a team as a code owner December 11, 2024 16:29
…hread-safe-groups

# Conflicts:
#	xmtp_mls/src/client.rs
#	xmtp_mls/src/groups/mls_sync.rs
#	xmtp_mls/src/groups/mod.rs
#	xmtp_mls/src/groups/subscriptions.rs
#	xmtp_mls/src/subscriptions.rs
operation: F,
) -> Result<R, GroupError>
where
F: FnOnce(OpenMlsGroup) -> Result<R, GroupError>,
Contributor

@codabrink codabrink Dec 11, 2024


Rather than taking in a closure, what would you think about creating a Lock struct like LockedOpenMlsGroup that wraps the mls group and returning that? Like a MutexGuard does.

Contributor Author


tbh I'm not sure I get your solution, could you please elaborate on it?

Contributor


Something like this:

@@ -358,6 +358,23 @@ impl<ScopedClient: ScopedGroupClient> MlsGroup<ScopedClient> {
         operation(mls_group)
     }
 
+    pub(crate) fn lock(
+        self,
+        provider: &impl OpenMlsProvider,
+    ) -> Result<LockedMlsGroup, GroupError> {
+        // Get the group ID for locking
+        let group_id = self.group_id.clone();
+
+        // Acquire the lock synchronously using blocking_lock
+        let lock = MLS_COMMIT_LOCK.get_lock_sync(group_id.clone())?;
+        // Load the MLS group
+        let group = OpenMlsGroup::load(provider.storage(), &GroupId::from_slice(&self.group_id))
+            .map_err(|_| GroupError::GroupNotFound)?
+            .ok_or(GroupError::GroupNotFound)?;
+
+        Ok(LockedMlsGroup { group, lock })
+    }
+
     // Load the stored OpenMLS group from the OpenMLS provider's keystore
     #[tracing::instrument(level = "trace", skip_all)]
     pub(crate) async fn load_mls_group_with_lock_async<F, E, R, Fut>(
@@ -1643,6 +1660,18 @@ fn build_group_join_config() -> MlsGroupJoinConfig {
         .build()
 }
 
+struct LockedMlsGroup {
+    group: OpenMlsGroup,
+    lock: SemaphoreGuard,
+}
+
+impl Deref for LockedMlsGroup {
+    type Target = OpenMlsGroup;
+    fn deref(&self) -> &Self::Target {
+        &self.group
+    }
+}
+
 #[cfg(test)]
 pub(crate) mod tests {
     #[cfg(target_arch = "wasm32")]
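For reference, a call site under this suggestion might look roughly like the following (a hedged sketch, not code from the PR; it assumes a group: MlsGroup<_> and a provider in scope):

// Hypothetical caller: the semaphore permit lives as long as `locked`,
// and Deref exposes the wrapped OpenMlsGroup for read access.
let locked = group.lock(&provider)?;
let mls_group: &OpenMlsGroup = &locked;
// ... build, publish, and merge the commit while `locked` is alive ...
drop(locked); // permit released here; other callers may now commit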

Contributor Author


@codabrink I'd suggest we treat your solution as a refactor for later; tbh right now it would take more time to implement and adjust the code, wdyt?

Contributor


If you want, I can help you swap out the patterns. Shouldn't take long. The main reason I suggested the change is that this approach would result in far fewer line changes. Wrapping everything in a closure makes it hard to track what changed between this and what was there before, which is a little scary. If we end up keeping this pattern, it's not the end of the world.

Contributor Author


Perfect. Let's do it in an online pair session today and see if we can achieve it in a short time.

@mchenani mchenani requested review from insipx and a team December 12, 2024 21:38
Contributor

@insipx insipx left a comment


Nice! Huge, hope we run into fewer "drop everything, the world is forked" situations now. Left a comment about an additional test that could help us figure out possible next steps.

Would be good to maybe get @cameronvoell's and @neekolas's review? Otherwise good to go, hope we can get bindings out tomorrow morning.

};

// Synchronously acquire the permit
let permit = semaphore.clone().try_acquire_owned()?;
Contributor

@insipx insipx Dec 12, 2024


This is the part I'm most uncertain about; it's most problematic in specific multi-threaded circumstances, which I think would be worth creating a test for, probably in bindings_ffi since that's where multi-threading is most relevant (wasm will always be single-threaded):

  • Create a client
  • Create a reference to the client, and start a stream_all_messages with this reference in its own thread
  • Clone two references to the client
  • Spawn two threads, each holding a reference to this client
  • Do a few syncs in each thread with each client reference; on the main thread, send a bunch of messages with a different user's client
  • In addition to messages, try operations on the group like updating the name/picture/etc., since those predominantly use the sync variant

The aim here is to try to create a situation that races the two syncs to acquire a (synchronous) permit. The worst case is losing messages in a stream, since it can be cumbersome for integrators to restart streams. An error from calling sync is less bad and recoverable by integrators, but we may want to create a descriptive error message for this case (like "critical updates already ongoing" or something) so it doesn't cause too much turmoil in our support chats when it happens. An error because a sync is happening while trying to update the group photo or something else is probably the least bad.

Maybe this can be done in a follow-up?
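Along those lines, a minimal, self-contained illustration of the permit race itself (plain tokio, not the bindings_ffi test described above) might look like this:

use std::sync::Arc;
use tokio::sync::Semaphore;

// While one "sync" holds the single per-group permit, a second synchronous
// try_acquire fails immediately instead of blocking; this is the case where
// a descriptive "critical updates already ongoing" error would help integrators.
fn main() {
    let group_lock = Arc::new(Semaphore::new(1));

    // The first operation wins the permit and holds it while it commits.
    let permit = group_lock
        .clone()
        .try_acquire_owned()
        .expect("first acquire succeeds");

    // A concurrent operation racing for the same group fails immediately.
    assert!(group_lock.clone().try_acquire_owned().is_err());

    // Once the first operation finishes, the permit is released and a
    // retried sync succeeds.
    drop(permit);
    assert!(group_lock.clone().try_acquire_owned().is_ok());
}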

Contributor

Curious why we are using a Semaphore instead of a Mutex for each group, given that we only want one thread to be able to operate on a group at a time.

Contributor

@cameronvoell cameronvoell left a comment


See comment on the testing draft PR in xmtp-android - I'm seeing a crash when running group streaming tests, but still need to test whether it is caused by this PR or another recent change in libxmtp. Confirmed the latest libxmtp main works fine in xmtp-android, so something in this PR is causing a crash in the streaming tests in xmtp-android and xmtp-react-native.

xmtp-android test: xmtp/xmtp-android#350 (comment)
xmtp-react-native test: xmtp/xmtp-react-native#566 (comment)

@mchenani
Contributor Author

Curious why we are using a Semaphore instead of a Mutex for each group, given that we only want one thread to be able to operate on a group at a time.

In the beginning it was simply easier to shift to locking the group without touching a lot of code, and it's more reliable from my POV.
Reasons:

  1. I didn't want to touch the group object itself.
  2. I'm still not sure what the effects would be if I just made the group mut; in some places, obtaining the lock makes it easier to track the group reference.

But still, today I will go through what @codabrink suggested, and if it doesn't require a lot of time we'll shift to the mut solution.

@mchenani
Contributor Author

Just tested the mutex. I'm not sure, but it looks like more changes would be needed to use a mutex instead of a semaphore in our current code base. In the end, switching from Semaphore to Mutex shouldn't be that hard for us, since the lock is isolated in one place.

@mchenani
Contributor Author

Curious why we are using a Semaphore instead of a Mutex for each group, given that we only want one thread to be able to operate on a group at a time.

Hey again! Sorry for the noise. Today I tested Coda's solution (not directly related to your question), but it helped me try replacing the semaphore with a mutex. My conclusion: if we just use a HashMap of mutexes, we have to handle the blocking ourselves when the lock isn't acquired, and insert the entry manually; the Semaphore handles that automatically!
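For comparison, a per-group tokio Mutex variant might look roughly like the sketch below (illustrative names, not code from the PR); the map entry still has to be created on first use, and the caller has to decide how to handle the case where the lock is already held:

use std::collections::HashMap;
use std::sync::{Arc, Mutex as StdMutex};
use tokio::sync::{Mutex, OwnedMutexGuard, TryLockError};

// One tokio Mutex per group; the () payload just carries the exclusivity.
struct GroupMutexes {
    locks: StdMutex<HashMap<Vec<u8>, Arc<Mutex<()>>>>,
}

impl GroupMutexes {
    fn try_lock(&self, group_id: Vec<u8>) -> Result<OwnedMutexGuard<()>, TryLockError> {
        let mutex = {
            let mut locks = self.locks.lock().expect("lock map poisoned");
            locks
                .entry(group_id)
                .or_insert_with(|| Arc::new(Mutex::new(())))
                .clone()
        };
        // Fails immediately if another thread holds this group's lock,
        // analogous to Semaphore::try_acquire_owned with a single permit.
        mutex.try_lock_owned()
    }
}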

@cameronvoell cameronvoell self-requested a review December 13, 2024 20:28
Contributor

@cameronvoell cameronvoell left a comment


Tested the latest commit in xmtp-android and xmtp-react-native. The stream crash issues seem resolved, but unfortunately the original React Native fork reproduction is failing again about 4 out of 5 times, so I think these changes are not addressing the underlying issue they were originally meant to prevent.

Comment on the xmtp-react-native test draft PR here, with logs of both a passing and a failing run: xmtp/xmtp-react-native#566 (comment)
