add event tracking #349

sravfeyn · 2024-07-01T07:12:07Z

https://dimagi.atlassian.net/browse/CCCT-369

sravfeyn · 2024-07-01T07:12:30Z

This is a work in progress, but I wanted to get whatever initial review I can, @calellowitz.

sravfeyn · 2024-07-01T07:14:08Z

commcare_connect/events/models..py

+
+
+class RecordsFlagged(InferredEventSpec):
+


This style of inferred events lets us avoid saving redundant info on the Event model.

sravfeyn · 2024-07-01T07:17:07Z

commcare_connect/events/tasks.py

+
+
+@celery_app.task
+def process_events_batch():


I went with this route over a few other options:

Simpler option: a single celery task for saving each event would be simpler, but it would add too much celery overhead and wouldn't take advantage of saving in bulk, which should give better performance.

More complicated option: using kafka would let us do this in a much more scalable way, but it would probably be overkill for now.

We can run this task periodically, say every 30 seconds.
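
To make the trade-off concrete, here is a minimal sketch of the batching idea, not the code in this PR. It assumes an in-memory `deque` as a stand-in for the redis list and a plain Python list as a stand-in for the `Event` table; `PendingEvent`, `track_event`, `queue` and `saved_events` are hypothetical names used only for illustration.

```python
from collections import deque
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class PendingEvent:
    event_type: str
    user_id: int
    date_created: datetime


# Stand-in for the redis list; request handlers only append to it.
queue: deque = deque()
# Stand-in for the Event table.
saved_events: list = []


def track_event(event_type: str, user_id: int) -> None:
    """Cheap enqueue done inside the request cycle."""
    queue.append(PendingEvent(event_type, user_id, datetime.now(timezone.utc)))


def process_events_batch() -> int:
    """Periodic job: drain everything queued so far and save it in one write."""
    batch = []
    while queue:
        batch.append(queue.popleft())
    if batch:
        # With Django this step would be one bulk insert rather than N inserts.
        saved_events.extend(batch)
    return len(batch)
```

Calling `track_event(...)` a few times and then `process_events_batch()` moves all queued items in one step; a second call with an empty queue returns 0.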

Thanks for describing your thought process a bit. A few questions

Why redis instead of postgres for the queue?

Why does this need to be async at all? I can imagine it's slightly faster to write to redis than postgres, but if we are doing a network write for every event anyway, I would be surprised if this were much more efficient than just writing directly to postgres.

The main reason: wouldn't writing to redis (in memory) be faster than writing to the Postgres DB (on disk)? It also means fewer writes to Postgres (because they are committed in batches), which should make it that much less likely for the DB to run into performance issues.

I would rather we start with a purely synchronous implementation for now if we want to be sure about event durability. We can always make these async later, but this adds a lot of complexity for something we aren't sure we need yet. If we do want to make it async, I think we should use celery unless we have a very strong reason not to since we are already using that. That keeps us to only one async task method in our code and makes it a bit easier to manage.

I am pretty sure that it would impact performance in sync mode, at least in some cases if not all. For example, in process_learn_form we need to track multiple events (Finishes a Module, Finishes all Modules, Failed Assessment, Succeeded Assessment). Without async, these will add up enough to slow down the request.

(Though maybe this isn't a big issue if these requests are coming from Repeaters as opposed to directly from mobile submissions.)

And if we use a separate task for each event, we are going to face a lot of constant celery troubleshooting. We have about 47 lines of code for the async path and a single periodic task. I don't feel that qualifies as 'a lot of complexity'.

I am happy to remove all the async code or make it multiple individual tasks, but ¯\_(ツ)_/¯, you will have to handle it if/when things slow down 🐢 ⌛ 🐌 or if we end up doing constant celery firefighting 🧯 🔥 🚒

My instinct was that nearly all of these come from background tasks (like approvals) or non-time-sensitive requests. It is certainly possible that future ones won't (the mobile endpoint should be performant, but we also need to be able to tell mobile whether the write succeeded and we can't do that with an async task anyway), and we can deal with async then, but at this point it seems more important to have guarantees. I am also less convinced of the long-term sustainability of this implementation. My understanding of the code is that if any event has an issue, no events will ever be written, based on the error handling there. For example, mobile could send a new event before we have migrated, or a web dev could forget to add the migration. That seems quite dangerous. Additionally, it's not paged at all; if the list is very long (maybe mobile sends a bunch at once), it could consume a lot of memory.

None of these issues are unfixable, but the more we work to make this a production ready queuing system, the more complex it becomes, and the more we benefit from a real one. However, given our current needs, unless there is a clear place where the async will benefit us, it is a bit hard for me to believe that one postgres write will be enough worse than the one redis write at this time and for our use cases.

I am not going to block the PR over this, and if you think there is a specific place where that performance matters (the form processor would be reasonable if that were blocking a mobile call), I am certainly willing to consider it, but my strong presumption is that this is overkill for our current needs, while not being robust enough to be a long term solution.
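
For reference, the two concerns above (one bad event blocking all others, and no paging) could be addressed along these lines. This is only a sketch under assumptions: `save_many` and `save_one` are hypothetical callables standing in for a bulk write and a single-row write, and `page_size` is an arbitrary illustrative default.

```python
from itertools import islice


def chunked(items, size):
    """Yield lists of at most `size` items so memory use stays bounded."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def save_in_pages(raw_events, save_many, save_one, page_size=500):
    """Save page by page; if a page fails, retry its items one at a time.

    Returns the items that could not be saved so they can be logged or
    parked instead of blocking every later event.
    """
    failed = []
    for page in chunked(raw_events, page_size):
        try:
            save_many(page)
        except Exception:
            # Fall back to per-item writes to isolate the bad rows.
            for item in page:
                try:
                    save_one(item)
                except Exception:
                    failed.append(item)
    return failed
```

The fallback assumes `save_many` is all-or-nothing for a page (for example, wrapped in a transaction); otherwise the retry could write some rows twice.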

calellowitz

Left a couple of questions. Super happy you chose to do this in a new app; way fewer migration issues.

calellowitz · 2024-07-09T21:59:51Z

commcare_connect/events/models..py

+
+class Event(models.Model):
+
+	class Type(models.TextChoices):


I am a little concerned that this means we will need a migration each time we add a new event, something that is likely to be quite frequent. What is the reason to make this a choice set?

Yeah, I am considering changing it for a few other reasons as well.

calellowitz · 2024-07-09T22:14:35Z

commcare_connect/events/models..py

+class Event(models.Model):
+
+	class Type(models.TextChoices):
+	    INVITE_SENT = "is", gettext("Invite Sent")


I think the slugs should be human readable. No reason to save a few bytes, and it will make it easier to use them later

We already have the first part (INVITE_SENT) for readability inside the code. Of course, the values in the raw DB might not be readable. The number of events is going to be very large (larger than any other table), so I thought we could use every opportunity to keep the table smaller.

calellowitz · 2024-07-09T22:16:12Z

commcare_connect/events/models..py

+			)
+
+
+class RecordsFlagged(InferredEventSpec):


What does this class do?

This is a specification for 'inferred' events. For example, this particular one specifies that a UserVisit matching the listed mapping indicates that visits got flagged. This avoids saving redundant info.

This spec is then used in get_events to build the list of events.
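
A rough sketch of how such a spec could work, assuming plain dataclasses in place of the Django models. Only the names `InferredEventSpec`, `RecordsFlagged`, `UserVisit`, `Event` and `INFERRED_EVENT_SPECS` come from the PR; the fields, the `matches`/`to_event` methods and the `collect_events` helper are hypothetical and do not mirror the real `get_events` signature.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Event:
    event_type: str
    user: str
    date_created: datetime


@dataclass
class UserVisit:  # stand-in for the real model; fields are assumed
    user: str
    flagged: bool
    visit_date: datetime


class InferredEventSpec:
    """Describes how rows of an existing table imply an event."""

    event_type = ""

    def matches(self, row) -> bool:
        raise NotImplementedError

    def to_event(self, row) -> Event:
        raise NotImplementedError

    def infer(self, rows):
        return [self.to_event(row) for row in rows if self.matches(row)]


class RecordsFlagged(InferredEventSpec):
    event_type = "records_flagged"

    def matches(self, row):
        return row.flagged

    def to_event(self, row):
        return Event(self.event_type, row.user, row.visit_date)


INFERRED_EVENT_SPECS = [RecordsFlagged()]


def collect_events(stored_events, visits, user=None):
    """Merge stored events with inferred ones, oldest first."""
    events = list(stored_events)
    for spec in INFERRED_EVENT_SPECS:
        events.extend(spec.infer(visits))
    if user is not None:
        events = [e for e in events if e.user == user]
    return sorted(events, key=lambda e: e.date_created)
```

The point of the pattern is that a flagged visit never needs its own row in the events table; it is derived at read time from data that already exists.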

calellowitz · 2024-07-09T22:17:46Z

commcare_connect/events/models..py

+    date_created = models.DateTimeField(auto_now_add=True, db_index=True)
+    event_type = models.CharField(max_length='2', choices=Type.choices)
+    user = models.ForeignKey(User, on_delete=models.CASCADE, null=True)
+    opportunity = models.ForeignKey(Opportunity, on_delete=models.PROTECT, null=True)


is there no way to save additional metadata?

For all the metrics so far, I haven't found one which needs more metadata.

calellowitz · 2024-07-09T22:20:26Z

commcare_connect/events/tasks.py

+
+
+@celery_app.task
+def process_events_batch():


Thanks for describing your thought process a bit. A few questions

Why redis instead of postgres for the queue?

Why does this need to be async at all? I can imagine it's slightly faster to write to redis than postgres, but if we are doing a network write for every event anyway, I would be surprised if this were much more efficient than just writing directly to postgres.

calellowitz

Left a few followup questions/comments

calellowitz · 2024-07-25T02:05:29Z

commcare_connect/events/views.py

+    serializer_class = EventSerializer
+    permission_classes = [IsAuthenticated]
+
+    def create(self, request, *args, **kwargs):


What happens if one item passed by the phone is invalid? Does the whole request fail or are the rest created? If the first, I think that could cause issues because then one bad event will make all subsequent requests from the phone fail as it will be included in each one. If the second, I think we need to let the phone know which succeeded and which failed so it knows which records to retry and which it can delete

This is a great point. As your other comment suggests, maintaining some sort of ID would address this. I will add that.

It looks like this change is still pending?

Instead of having an ID, I have gone with a simpler approach of just sending back the rows that fail.

calellowitz · 2024-07-25T02:06:04Z

commcare_connect/events/views.py

+class EventSerializer(serializers.ModelSerializer):
+    class Meta:
+        model = Event
+        fields = ["date_created", "event_type", "user", "opportunity"]


I think this will need some kind of event id from the phone, so that we can communicate success info for individual items in the list as described below

Agreed. I was thinking of using the timestamp as the ID, but a UUID would be nice.

Responded here https://github.com/dimagi/commcare-connect/pull/349/files#r1701902592

calellowitz · 2024-07-25T02:07:07Z

commcare_connect/events/models.py

+    Type = types
+
+    date_created = models.DateTimeField(auto_now_add=True, db_index=True)
+    event_type = models.CharField(max_length=40, choices=types.EVENT_TYPE_CHOICES)


You mentioned you were going to rethink this in response to my last comment, but it appears to still be here. How has your thinking here evolved? Why did you decide to require a migration each time we have a new event?

Yeah, I am going to make it a callable to avoid migrations getting created

So, it looks like this won't work since we are on Django 4 (and a callable only works in 5). We can do it using validators, but the trade-off is that we need some extra code to convert between the slug and verbose_name (and back) in some of the places where events are processed, which is ugly and unnecessary. I could live with an extra migration until Django 5 instead of the unnecessary code. Do you feel strongly against migrations?

Do you feel strongly against migrations?

I am not sure I totally understand why we need restricted choices here at all. That seems pretty nonstandard for an analytics system (if you think of GA or kissmetrics, they allow arbitrary events), and makes adding new events, which we expect to do incrementally, much trickier. Mobile will need to coordinate releases around web running these migrations, and simple communication errors could cause confusing problems. I like the idea on web of predefining our events, and only using ones we have listed in something like a constant file, but I am not sure I am sold on enforcing them at the db level.

As with other comments, I am open to being convinced
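
One way to picture the "constant file" idea is the sketch below, which keeps the list of known events in application code while the database column stays unconstrained. The module layout and every name other than `INVITE_SENT` and its "Invite Sent" label are assumptions made for illustration.

```python
# Hypothetical constants module (for example events/types.py).
INVITE_SENT = "invite_sent"
RECORDS_FLAGGED = "records_flagged"

# Labels live in code, so adding an entry here does not touch the schema
# as long as the model field does not declare `choices`.
EVENT_TYPE_LABELS = {
    INVITE_SENT: "Invite Sent",
    RECORDS_FLAGGED: "Records Flagged",
}


def is_known_event_type(slug: str) -> bool:
    """Application-level check; the column itself accepts any string."""
    return slug in EVENT_TYPE_LABELS


def label_for(slug: str) -> str:
    """Readable label for reports; unknown slugs fall back to the raw value."""
    return EVENT_TYPE_LABELS.get(slug, slug)
```

With this shape, web code refers only to the named constants, while an event type that mobile ships before web knows about it is still stored and shows up in reports under its raw slug.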

calellowitz · 2024-07-25T02:08:16Z

commcare_connect/events/models.py

+    # this allows referring to event types in this style: Event.Type.INVITE_SENT
+    Type = types
+
+    date_created = models.DateTimeField(auto_now_add=True, db_index=True)


If these events are processed asynchronously, I don't think we can use auto_now_add especially since there is data from the phone that could be hours or days delayed.

Hmm, that's a good point. Though, this field is getting overridden when the events are created. I agree it's better to remove the auto_now_add and let the callers worry about setting it.

calellowitz · 2024-07-25T02:09:40Z

commcare_connect/events/models.py

+INFERRED_EVENT_SPECS = [RecordsFlagged()]
+
+
+def get_events(user=None, from_date=None, to_date=None):


how is this used?

This will be used in the Events timeline report.

calellowitz · 2024-07-25T02:12:05Z

commcare_connect/events/tasks.py

+
+
+@celery_app.task
+def process_events_batch():


I would rather we start with a purely synchronous implementation for now if we want to be sure about event durability. We can always make these async later, but this adds a lot of complexity for something we aren't sure we need yet. If we do want to make it async, I think we should use celery unless we have a very strong reason not to since we are already using that. That keeps us to only one async task method in our code and makes it a bit easier to manage.

calellowitz

Left a few comments/questions. You can definitely convince me that your architecture is better than what I had in mind, but I am not sure I see the benefits for a few of the pieces yet

calellowitz · 2024-08-02T02:52:26Z

commcare_connect/events/models.py

@@ -26,6 +27,12 @@ class Event(models.Model):
    event_type = models.CharField(max_length=40, choices=get_event_type_choices())
    user = models.ForeignKey(User, on_delete=models.CASCADE, null=True)
    opportunity = models.ForeignKey(Opportunity, on_delete=models.PROTECT, null=True)
+    organization = models.ForeignKey(


Why is this necessary? Opportunity encodes this info as well.

I was considering cases where events are org-specific without an opportunity, but maybe we don't have any right now. Let me remove it.

calellowitz · 2024-08-02T02:54:47Z

commcare_connect/events/views.py

+    serializer_class = EventSerializer
+    permission_classes = [IsAuthenticated]
+
+    def create(self, request, *args, **kwargs):


It looks like this change is still pending?

calellowitz · 2024-08-02T03:13:07Z

commcare_connect/events/tasks.py

+
+
+@celery_app.task
+def process_events_batch():


My instinct was that nearly all of these come from background tasks (like approvals) or non-time-sensitive requests. It is certainly possible that future ones won't (the mobile endpoint should be performant, but we also need to be able to tell mobile whether the write succeeded and we can't do that with an async task anyway), and we can deal with async then, but at this point it seems more important to have guarantees. I am also less convinced of the long-term sustainability of this implementation. My understanding of the code is that if any event has an issue, no events will ever be written, based on the error handling there. For example, mobile could send a new event before we have migrated, or a web dev could forget to add the migration. That seems quite dangerous. Additionally, it's not paged at all; if the list is very long (maybe mobile sends a bunch at once), it could consume a lot of memory.

None of these issues are unfixable, but the more we work to make this a production ready queuing system, the more complex it becomes, and the more we benefit from a real one. However, given our current needs, unless there is a clear place where the async will benefit us, it is a bit hard for me to believe that one postgres write will be enough worse than the one redis write at this time and for our use cases.

I am not going to block the PR over this, and if you think there is a specific place where that performance matters (the form processor would be reasonable if that were blocking a mobile call), I am certainly willing to consider it, but my strong presumption is that this is overkill for our current needs, while not being robust enough to be a long term solution.

calellowitz · 2024-08-02T03:18:28Z

commcare_connect/events/models.py

+    Type = types
+
+    date_created = models.DateTimeField(auto_now_add=True, db_index=True)
+    event_type = models.CharField(max_length=40, choices=types.EVENT_TYPE_CHOICES)


Do you feel strongly against migrations?

I am not sure I totally understand why we need restricted choices here at all. That seems pretty nonstandard for an analytics system (if you think of GA or kissmetrics, they allow arbitrary events), and makes adding new events, which we expect to do incrementally, much trickier. Mobile will need to coordinate releases around web running these migrations, and simple communication errors could cause confusing problems. I like the idea on web of predefining our events, and only using ones we have listed in something like a constant file, but I am not sure I am sold on enforcing them at the db level.

As with other comments, I am open to being convinced

calellowitz

None of the comments I left are blocking, in case this is holding up development on the user timeline view or other work, but I think the mobile API and model fields could still be improved in the ways we previously discussed.

Thanks for how responsive you have been on this

calellowitz · 2024-08-08T01:04:45Z

commcare_connect/events/views.py

+            if failed_items:
+                partial_error_response = {"error": "Some items could not be saved", "failed_items": failed_items}
+                headers = self.get_success_headers(serializer.data)
+                return Response(partial_error_response, status=status.HTTP_206_PARTIAL_CONTENT, headers=headers)


it looks like 206 is really intended for "range" requests. Since that isn't what this is, I think it's fine to use a more standard response like 200, especially since that is still distinguishable from a full success.

Cool, I will update it.

Do you still plan to update this?

Yeah, I have updated it to 201

calellowitz · 2024-08-08T01:13:13Z

commcare_connect/events/views.py

-
-        event_objects = [Event(**item) for item in serializer.validated_data]
-        Event.objects.bulk_create(event_objects)
+        try:


We talked a few times about having these events include IDs that you could send back to the phone to indicate which ones succeeded. I don't see any code that could handle that here. Specifically, it looks like the code will error if they send IDs with these events (or it will try to set that ID as the PK, which is even more dangerous).

I do see that you instead send down failures (though still not with uniquely identifiable IDs, so it will not necessarily be straightforward for the phone to know which events they match). Successes are generally preferable because there are fewer ways for that to go wrong: it doesn't require the phone to track which events it sent, and there is no possibility of a mismatch. The phone just deletes the ones you say succeeded.

Yeah, I updated the code (and responded to that comment above https://github.com/dimagi/commcare-connect/pull/349/files#r1701902592). I took a simpler approach of sending down just the failed events. I don't understand how including IDs makes it any better for mobile to track which events failed.
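
To illustrate the difference being discussed, here is a minimal sketch of per-event reporting keyed by a client-supplied ID. It is not the view in this PR: the `uid` field name, the `REQUIRED_FIELDS` tuple, the response shape and the `save` callable are all assumptions.

```python
REQUIRED_FIELDS = ("uid", "event_type", "date_created")


def save_events(payload, save):
    """Try each event independently and report the outcome by client uid."""
    succeeded, failed = [], []
    for item in payload:
        uid = item.get("uid")
        try:
            missing = [f for f in REQUIRED_FIELDS if f not in item]
            if missing:
                raise ValueError(f"missing fields: {missing}")
            # Pass only known fields; the uid is never used as the primary key.
            save({
                "event_type": item["event_type"],
                "date_created": item["date_created"],
            })
            succeeded.append(uid)
        except Exception as exc:
            failed.append({"uid": uid, "error": str(exc)})
    return {"succeeded": succeeded, "failed": failed}
```

With a response like this, the phone can delete exactly the events listed under `succeeded` and retry the rest, without having to match failed rows back to its local records by content.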

calellowitz · 2024-08-08T01:14:46Z

commcare_connect/events/views.py

-        event_objects = [Event(**item) for item in serializer.validated_data]
-        Event.objects.bulk_create(event_objects)
+        try:
+            event_objects = [Event(**item) for item in serializer.validated_data]


It's generally safer to be explicit about incoming fields in case the phone adds unexpected ones, and to allow it to intentionally send fields you are not including (like the event id we have discussed).

I have listed the fields here explicitly https://github.com/dimagi/commcare-connect/pull/349/files#diff-a7c344d1bdc227452ddf03733e8218375041272c8e3c6157bb96c11a7fa195ebR23 (Maybe I am not getting your point.)

This might have been my ignorance about the serializer. What happens if unexpected fields are sent up? Are they ignored, or do they raise an error?

Yeah, they get ignored.

calellowitz · 2024-08-08T01:16:38Z

commcare_connect/events/models.py

+    date_created = models.DateTimeField(db_index=True)
+    event_type = models.CharField(max_length=40, db_index=True)
+    user = models.ForeignKey(User, on_delete=models.CASCADE, null=True)
+    opportunity = models.ForeignKey(Opportunity, on_delete=models.PROTECT, null=True)


I will add again that I think a metadata field will be very beneficial. Even with the existing set of events we have, data like which record was approved, or how much the payment was could be very useful, and I know that mobile was also hoping to include additional metadata.

sravfeyn · 2024-09-13T04:45:52Z

@calellowitz Is this good to merge?

sravfeyn · 2024-11-20T09:46:15Z

@calellowitz bumping this for review.

sravfeyn · 2024-11-28T15:14:34Z

@calellowitz Bumping for review

sravfeyn commented Jul 1, 2024

calellowitz reviewed Jul 9, 2024

sravfeyn added 2 commits July 13, 2024 19:42

add event tracking

51ffdb5

Add mobile endpoint, tests

f97b2c5

sravfeyn force-pushed the sr/events branch from ec8f001 to f97b2c5 Compare July 14, 2024 12:44

calellowitz reviewed Jul 25, 2024

sravfeyn added 4 commits July 29, 2024 17:08

Implement Events report

4a2467c

Fix sort reset, add icons

dadbeae

remove auto_now_add

99467b8

merge main

46e95e5

sravfeyn changed the title ~~WIP: add event tracking~~ add event tracking Jul 31, 2024

Make user field a select2 widget

6be106a

calellowitz reviewed Aug 2, 2024

sravfeyn added 4 commits August 2, 2024 18:43

don't constraint on event type.choices

a7b1db5

Remove async event saving

e85f605

Remove org column, fix user select

e1c03fd

Handle partial failure

c59a387

calellowitz approved these changes Aug 8, 2024

Add more events

51b7719

sravfeyn force-pushed the sr/events branch from 3fb92a8 to 3166425 Compare August 16, 2024 08:21

Merge branch 'main' into sr/events

34bb90f

sravfeyn force-pushed the sr/events branch from 3166425 to 34bb90f Compare August 16, 2024 10:03

sravfeyn added 3 commits August 16, 2024 15:51

fix order

fdd182a

lazy evaluating

badcef5

accept uid to track errors

a753311

move prod only module import

dac5557

sravfeyn mentioned this pull request Oct 4, 2024

add delivery-type, year, quarter filter to admin report #401

Merged

Merge main

0a6e359
