Submission duplication: add advisory lock on xml_hash #859
The problem
During a project we're managing for a national government, we noticed several duplicated submissions with the same uuid and xml_hash, as described in #470.
After a bit of investigation, we found a race condition caused by transaction isolation between different sessions: in https://github.com/kobotoolbox/kobocat/blob/main/onadata/libs/utils/logger_tools.py#L177 there is a check that looks for other submissions with the same xml_hash from the same user. However, this check cannot detect submissions being created concurrently, because each request runs in its own database transaction (since ATOMIC_REQUESTS is set to True). In this case we need another mechanism to detect whether other documents with the same xml_hash are being submitted but have not yet been fully written to the database.
Reproducing duplication reliably
The main difficulty in testing a fix is that the duplication usually happens only when you have many remote devices submitting forms with attachments over unreliable networks.
Since we suspected a race condition, we wrote an async script that submits the same form and its attachments 5 times from "parallel" async tasks (in hindsight, using requests with multiprocessing would have been a better use of my time, but hey, async is fashionable... 😅).
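The gist of that script looks roughly like this (a simplified sketch, not the multiple_submission_test.py from the zip below; the URL, credentials, auth scheme and file names are placeholders):

```python
import asyncio

import aiohttp

# Placeholder values -- point these at your own test instance.
SUBMISSION_URL = 'https://kc.example.com/submission'  # OpenRosa submission endpoint
USERNAME, PASSWORD = 'someuser', 'somepass'
N_PARALLEL = 5


async def submit_once(session: aiohttp.ClientSession, xml_payload: bytes) -> int:
    # OpenRosa expects a multipart POST with the XML in 'xml_submission_file';
    # attachments would go in as additional parts of the same request.
    form = aiohttp.FormData()
    form.add_field('xml_submission_file', xml_payload,
                   filename='submission.xml', content_type='text/xml')
    async with session.post(SUBMISSION_URL, data=form,
                            headers={'X-OpenRosa-Version': '1.0'}) as resp:
        return resp.status


async def main() -> None:
    with open('submission.xml', 'rb') as f:
        xml_payload = f.read()
    auth = aiohttp.BasicAuth(USERNAME, PASSWORD)
    async with aiohttp.ClientSession(auth=auth) as session:
        statuses = await asyncio.gather(
            *(submit_once(session, xml_payload) for _ in range(N_PARALLEL))
        )
    # On an unpatched server, most (often all) of these come back as accepted.
    print(statuses)


if __name__ == '__main__':
    asyncio.run(main())
```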
Possible solution
A PostgreSQL advisory lock is perfect for the task: it works across transaction and session boundaries and guarantees atomic access keyed on an integer id. So we took the xml_hash, converted it to an integer, and kept its 60 least significant bits to obtain a (hopefully still) unique id for the lock.
The strategy is to acquire the advisory lock on that id before running the duplicate check, so that concurrent submissions with the same xml_hash are serialized and the later ones can actually see the row written by the first; see the sketch below.
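A minimal sketch of that idea from Django's side (illustrative only: the helper name is made up, and it assumes a transaction-level pg_advisory_xact_lock, while the lock flavour actually used by the PR may differ):

```python
from django.db import connection


def lock_xml_hash(xml_hash: str) -> None:
    """Block until the current transaction holds the advisory lock for xml_hash.

    The hex digest is converted to an integer and truncated to its 60 least
    significant bits so it fits in PostgreSQL's signed 64-bit advisory lock key.
    pg_advisory_xact_lock releases the lock automatically when the surrounding
    transaction commits or rolls back.
    """
    lock_id = int(xml_hash, 16) & ((1 << 60) - 1)
    with connection.cursor() as cursor:
        cursor.execute('SELECT pg_advisory_xact_lock(%s)', [lock_id])
```

With ATOMIC_REQUESTS enabled and the default READ COMMITTED isolation, calling something like this before the duplicate check makes concurrent submissions with the same xml_hash queue up behind the first one; by the time the next request re-runs the check, the first row has been committed and is visible.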
Testing the problem and the fix
If you extract the files from this duplication_repro.zip file, you'll find:

- sample_form.xml, which you can import in your kf instance (a local development instance works fine)
- testfiles/
- the multiple_submission_test.py script

1. First of all, import the test form in your kf application, taking note of its ID (something like the a7G7MvQMmzARfb2irNsyn3 already in the script).
2. Create and activate a virtual environment if you're not already in one, then pip install aiohttp aiofiles dict2xml, then edit the script: in the first few lines after the imports, change the configuration variables so they match your test instance.
3. Run the script from the same directory in which you extracted it: on an unpatched system most of the 5 submissions (often all) will get accepted and you'll have 5 new submissions with exactly the same content.
If you switch to a branch that includes this PR, you should get 1 accepted submission and 4 Duplicate submission responses.
Remaining issues and questions
Most importantly, I'm not sure whether "DuplicateSubmission" is the best exception to raise: maybe telling the client to retry (how?) is a safer approach.
In my opinion, the big question, however, is why it's so common to receive multiple submissions at the same time.
This was an obvious race condition, but there could be more. It took us a long time to figure out how to write a reproduction script that works reliably both locally and remotely, so we were unable to deploy the fix to our production system in time (our data collection efforts were practically over when we finished this).
P.S.
Many thanks to whoever wrote the clean_duplicated_submissions admin command (probably @noliveleger).