Dashboard Survey Metrics #966
Conversation
@JGreenlee [ Processing survey data steps ]
I don't know how the survey data will be stored in the database, so I haven't figured out how to code this step. Can you please guide me? I think I can code this shortly once I understand how the survey data will be stored. Thanks!
The current structure of the server response for aggregate metrics is as follows:
So aggregate metrics is a 2D array; the outer elements correspond to
I propose this structure:
Then we would be able to use a similar structure when we add the
I think that this restructuring looks good.
@JGreenlee Couple of high-level comments here:
Yes, I think it would have to be day-by-day as long as we still allow people to choose specific days with the datepicker, and keep the graphs which show distance, duration, and count by day.
I think that a memoization/caching pattern would work well here (as opposed to precomputing everything). Every request is a range of days. For each day, we'll check if that day is in the cache. If not we'll compute it on the fly and cache it. The cache expires after some time (maybe 24 hours). The advantages of this:
Disadvantages:
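The per-day memoization pattern described above could be sketched roughly like this (a hypothetical illustration, not the actual server code; `get_metrics_for_range` and `compute_day_fn` are made-up names):

```python
import time

CACHE_TTL_SECS = 24 * 60 * 60  # entries expire after roughly 24 hours
_day_cache = {}  # maps "YYYY-MM-DD" -> (cached_at_timestamp, metrics_for_day)

def get_metrics_for_range(days, compute_day_fn):
    """For each requested day, reuse a fresh cached result if present;
    otherwise compute on the fly and store it in the cache."""
    results = []
    now = time.time()
    for day in days:
        cached = _day_cache.get(day)
        if cached is not None and now - cached[0] < CACHE_TTL_SECS:
            results.append(cached[1])  # cache hit: skip recomputation
        else:
            metrics = compute_day_fn(day)  # cache miss: compute on the fly
            _day_cache[day] = (now, metrics)
            results.append(metrics)
    return results
```

With this shape, overlapping date ranges only recompute the days that are not already cached.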
The next day,
At the meeting last week, we decided to continue computing on the fly initially and defer the question of how to cache until later as a performance optimization. To support backwards compatibility, one option is to use the new structure for the fleet version only. @JGreenlee had planned to use the local date as part of the change to the structure, so the backwards compat could be "new structure for local date, old structure for timestamp". @JGreenlee I remembered one issue with using the queries by local date - querying by a single day at a time would be fine, but querying by a range of dates can run into issues. For example, if we query for "2024-04-28" to "2024-05-03", then the local date query won't return anything because
We met today to divide the tasks. I will be making the server changes and Jiji will be tweaking the UI to accommodate the new structure. If she finishes that before I am done with server changes, she will work on some admin dashboard tasks (e-mission/e-mission-docs#1066)
I will have to investigate this. It may explain some of the weirdness I found. I tested the existing metrics endpoint with:

```js
window.cordova.plugins.BEMServerComm.pushGetJSON('/result/metrics/local_date', (m) => Object.assign(m, {
  aggregate: true,
  end_time: {year: 2024, month: 4, day: 24},
  freq: 'DAILY',
  is_return_aggregate: true,
  metric_list: ["duration", "count", "distance"],
  start_time: {year: 2024, month: 5, day: 24},
}), (...p) => console.log(...p), (...p) => console.error(...p));
```

And got back an interesting result. I expected every day from Apr 24 to May 24, but it seems like it gave data for Jan 24, Feb 24, Mar 24, and Apr 24:

```json
{
  "aggregate_metrics": [
    [
      {
        "ts": 1706079600,
        "label_unlabeled": 25550.522970199585,
        "fmt_time": "2024-01-24",
        "nUsers": 7,
        "label_walk": 4241.722962617874,
        "local_dt": {
          "timezone": "America/Denver",
          "year": 2024,
          "month": 1,
          "day": 24
        }
      },
      {
        "nUsers": 6,
        "label_hybrid_drove_alone": 10569.768949985504,
        "label_walk": 2147.520366668701,
        "ts": 1708758000,
        "label_unlabeled": 26029.928629398346,
        "fmt_time": "2024-02-24",
        "label_hybrid_shared_ride": 2881.8205523490906,
        "local_dt": {
          "timezone": "America/Denver",
          "year": 2024,
          "month": 2,
          "day": 24
        }
      },
      {
        "nUsers": 4,
        "label_walk": 1730.4454126358032,
        "label_bike": 4934.96901011467,
        "ts": 1711252800,
        "label_unlabeled": 10737.807738542557,
        "fmt_time": "2024-03-24",
        "label_e_car_shared_ride": 3664.9217643737793,
        "local_dt": {
          "timezone": "America/New_York",
          "year": 2024,
          "month": 3,
          "day": 24
        }
      },
      {
        "nUsers": 5,
        "label_e-bike": 4831.289476633072,
        "label_hybrid_drove_alone": 10362.993374109268,
        "label_walk": 1366.0865366458893,
        "ts": 1713938400,
        "label_not_a_trip": 3969.932086467743,
        "label_unlabeled": 5386.386969804764,
        "fmt_time": "2024-04-24",
        "label_e_car_shared_ride": 2676.1139929294586,
        "local_dt": {
          "timezone": "America/Denver",
          "year": 2024,
          "month": 4,
          "day": 24
        }
      }
    ],
    ...
}
```
That is why we don't use the local date more often. It works as a filter (e.g. if you want to see the 24th of each month, or only trips on Tuesdays), but is not great for a range. See: https://github.com/e-mission/e-mission-server/blob/master/emission/storage/timeseries/tcquery.py
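To illustrate the failure mode, here is a simplified sketch (not the actual tcquery implementation): a component-wise local-date query ANDs independent per-field conditions, so it behaves like a filter rather than a true range.

```python
def naive_local_date_match(doc, start, end):
    # Compares year, month, and day independently -- this is the
    # simplification that makes ranges behave like filters.
    return all(start[k] <= doc[k] <= end[k] for k in ('year', 'month', 'day'))

# The "2024-04-28" to "2024-05-03" example from above: no day value
# satisfies 28 <= day <= 3, so the query matches nothing at all.
start = {'year': 2024, 'month': 4, 'day': 28}
end = {'year': 2024, 'month': 5, 'day': 3}
assert not any(
    naive_local_date_match({'year': 2024, 'month': m, 'day': d}, start, end)
    for m in (4, 5) for d in range(1, 31))

# As a filter it works fine: Jan 24 through Apr 24 matches exactly
# the 24th of each month in that window.
start = {'year': 2024, 'month': 1, 'day': 24}
end = {'year': 2024, 'month': 4, 'day': 24}
assert naive_local_date_match({'year': 2024, 'month': 2, 'day': 24}, start, end)
assert not naive_local_date_match({'year': 2024, 'month': 2, 'day': 25}, start, end)
```

This also matches the observed behavior above, where a query returned the 24th of several consecutive months.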
I got it to work with ranges with these changes: #968
Changes merged, with some comments requesting additional comments and restoring filter functionality
I implemented the above structure with a few modifications to make it more flexible. I realized that we basically have 4 metrics we want right now:

So I think that when we query for a list of metrics, we should pass the metric name and the 'grouping fields' instead of just a single-dimensional list. Concretely, this:
which will get a response like this:
This would be very flexible, allowing us to easily pick and choose what dimensions of the data to dive into.
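A rough sketch of that grouping idea (the field values and the `summarize` helper are illustrative, not the actual e-mission-common implementation): the metric list becomes a mapping of metric name to grouping fields, and each metric is summed per group.

```python
def summarize(trips, metric_list):
    """metric_list maps each metric name to the fields to group by,
    e.g. {"distance": ["mode_confirm", "purpose_confirm"]}."""
    out = {}
    for metric, group_fields in metric_list.items():
        grouped_by_field = {}
        for field in group_fields:
            totals = {}
            for trip in trips:
                key = trip.get(field, "unlabeled")
                # sum the metric's value per group (e.g. meters of distance)
                totals[key] = totals.get(key, 0) + trip.get(metric, 0)
            grouped_by_field[field] = totals
        out[metric] = grouped_by_field
    return out

trips = [
    {"distance": 1000, "mode_confirm": "bike", "purpose_confirm": "home"},
    {"distance": 500, "mode_confirm": "walk", "purpose_confirm": "home"},
]
summary = summarize(trips, {"distance": ["mode_confirm", "purpose_confirm"]})
# summary == {"distance": {"mode_confirm": {"bike": 1000, "walk": 500},
#                          "purpose_confirm": {"home": 1500}}}
```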
The list of metrics input to generate_summaries is now a dict of {metric name: [list of fields to group by]} (e-mission/e-mission-server#966 (comment)). Add a 'response_count' metric, which counts the occurrences of 'responded' and 'not_responded', as opposed to the existing metrics, which just sum up number values. Make app_config and trip_labels globals to avoid argument/parameter drilling. Add comments and rename variables throughout to improve comprehension.
Instead of querying by timestamp, we can now query by "YYYY-MM-DD" strings. The yyyy_mm_dd endpoint also has a different format for metric_list: instead of just an array of metrics, it is a mapping of "metric names" to "grouping fields" (e-mission/e-mission-server#966 (comment)). We will specify this in the config as "metrics_list". The yyyy_mm_dd endpoint will also ask for app_config, so we will pass that through to `fetchMetricsFromServer` and include it in the `query` (note that `survey_info` is the only field that needs to be included). We will be supporting another metric called "response_count"; added a card for it to the 'summary' section (if it is configured to show). Updated types in appConfigTypes.ts.
If you recall our discussion about the public dashboard, the structure is very similar. Basically, the generic outline is:
e.g. The filtering is not explicit in your example above, but it is implicit in the date range, which is the only kind of filtering that we currently support for this query. This is good because it gives us some validation that this structure is consistent with what makes sense for our data model.
@jiji14 Would you mind giving me write access to your fork of
In the meantime, these are the changes:
diff --git a/emission/net/api/cfc_webapp.py b/emission/net/api/cfc_webapp.py
index e585d6a2..a5e0d1dd 100644
--- a/emission/net/api/cfc_webapp.py
+++ b/emission/net/api/cfc_webapp.py
@@ -328,14 +328,18 @@ def summarize_metrics(time_type):
else:
old_style = True
is_return_aggregate = True
+
+ app_config = request.json['app_config'] if 'app_config' in request.json else None
+
time_type_map = {
- 'timestamp': metrics.summarize_by_timestamp,
- 'local_date': metrics.summarize_by_local_date
+ 'timestamp': metrics.summarize_by_timestamp, # used by old UI
+ 'local_date': metrics.summarize_by_local_date,
+ 'yyyy_mm_dd': metrics.summarize_by_yyyy_mm_dd # used by new UI
}
metric_fn = time_type_map[time_type]
ret_val = metric_fn(user_uuid,
start_time, end_time,
- freq_name, metric_list, is_return_aggregate)
+ freq_name, metric_list, is_return_aggregate, app_config)
if old_style:
logging.debug("old_style metrics found, returning array of entries instead of array of arrays")
assert(len(metric_list) == 1)
diff --git a/emission/net/api/metrics.py b/emission/net/api/metrics.py
index 9bd5ffda..5385f1ad 100644
--- a/emission/net/api/metrics.py
+++ b/emission/net/api/metrics.py
@@ -10,16 +10,31 @@ import logging
import emission.analysis.result.metrics.time_grouping as earmt
import emission.analysis.result.metrics.simple_metrics as earms
+import emission.storage.decorations.analysis_timeseries_queries as esda
+import emission.storage.decorations.local_date_queries as esdl
+import emission.storage.timeseries.tcquery as esttc
-def summarize_by_timestamp(user_id, start_ts, end_ts, freq, metric_list, include_aggregate):
+import emcommon.metrics.metrics_summaries as emcms
+
+def summarize_by_timestamp(user_id, start_ts, end_ts, freq, metric_list, include_aggregate, app_config):
return _call_group_fn(earmt.group_by_timestamp, user_id, start_ts, end_ts,
freq, metric_list, include_aggregate)
-def summarize_by_local_date(user_id, start_ld, end_ld, freq_name, metric_list, include_aggregate):
+def summarize_by_local_date(user_id, start_ld, end_ld, freq_name, metric_list, include_aggregate, app_config):
local_freq = earmt.LocalFreq[freq_name]
return _call_group_fn(earmt.group_by_local_date, user_id, start_ld, end_ld,
local_freq, metric_list, include_aggregate)
+def summarize_by_yyyy_mm_dd(user_id, start_ymd, end_ymd, freq, metric_list, include_agg, app_config):
+ time_query = esttc.TimeComponentQuery(
+ "data.start_local_dt",
+ esdl.yyyy_mm_dd_to_local_date(start_ymd),
+ esdl.yyyy_mm_dd_to_local_date(end_ymd)
+ )
+ trips = esda.get_entries(esda.COMPOSITE_TRIP_KEY, None, time_query)
+ print('found ' + str([e for e in trips]))
+ return emcms.generate_summaries(metric_list, trips, app_config)
+
def _call_group_fn(group_fn, user_id, start_time, end_time, freq, metric_list, include_aggregate):
summary_fn_list = [earms.get_summary_fn(metric_name)
for metric_name in metric_list]
diff --git a/emission/storage/decorations/local_date_queries.py b/emission/storage/decorations/local_date_queries.py
index 8a0e2d14..d6aca9fb 100644
--- a/emission/storage/decorations/local_date_queries.py
+++ b/emission/storage/decorations/local_date_queries.py
@@ -48,3 +48,15 @@ def get_comparison_query(field_prefix, base_ld, limit_ld, units, gt_or_lt):
return { "$or": or_conditions }
else:
return {}
+
+def yyyy_mm_dd_to_local_date(ymd: str) -> ecwl.LocalDate:
+ return ecwl.LocalDate({
+ 'year': int(ymd[0:4]),
+ 'month': int(ymd[5:7]),
+ 'day': int(ymd[8:10])
+ })
+
+def get_yyyy_mm_dd_range_query(field_name, start_ymd: str, end_ymd: str) -> dict:
+    start_local_date = yyyy_mm_dd_to_local_date(start_ymd)
+    end_local_date = yyyy_mm_dd_to_local_date(end_ymd)
+ return get_range_query(field_name, start_local_date, end_local_date)
diff --git a/setup/environment36.yml b/setup/environment36.yml
index d0223175..49dc9c1f 100644
--- a/setup/environment36.yml
+++ b/setup/environment36.yml
@@ -19,7 +19,7 @@ dependencies:
- scipy=1.10.0
- utm=0.7.0
- pip:
- - git+https://github.com/JGreenlee/[email protected]
+ - git+https://github.com/JGreenlee/[email protected]
- pyfcm==1.5.4
- pygeocoder==1.2.5
- pymongo==4.3.3
Added you as a collaborator! Let me know if you still can't push the code.
…into survey-dashboard
The updated in-app dashboard will be using this new time type via `/result/metrics/yyyy_mm_dd`. When using this time type, `e-mission-common` is used for the metrics calculations, which has a new structure (e-mission#966 (comment)) `start_time` and `end_time` are given as ISO dates
Other than that, I don't see anything much in here. Briefly reviewing the common code as well since that is where the secret sauce is.
app_config = request.json['app_config'] if 'app_config' in request.json else None
wait, so you are assuming that the client passes in the app config? That sounds fairly bonkers, particularly since the app config is fairly large. We read the app_config directly from the study_config in multiple locations, including in cfc_webapp - why would we not just read in the app_config where it is used (presumably in api/metrics.py) instead of passing it around?
> wait, so you are assuming that the client passes in the app config? That sounds fairly bonkers, particularly since the app config is fairly large.

I agree it's not the best solution. To mitigate the size issue, I have the phone send a partial app_config with only survey_info, since that is what is really needed for the metrics.

> why would we not just read in the app_config where it is used (presumably in api/metrics.py) instead of passing it around?

If we did this, then every time we query the server, the server has to query GitHub and wait for the config before proceeding, right? I thought this would probably introduce even more latency. I also didn't think there was a good way to keep the config in memory, because I believe cfc_webapp is continuously running and wouldn't receive config updates.

Is there a better way?
Ah I see. The standard way to mitigate the performance issue is through caching. You would read the config once a day, and use the cached version otherwise. That would not necessarily introduce a perceptible performance impact.
However, that may still leave us with a consistency issue. If the phone config has updated, but the server has not yet updated, I suspect we may have issues because the survey names may be inconsistent across them. The standard way to resolve such consistency issues is to send over the version that the phone is using. The server would cache the last several versions and use the version that the phone sends in.
While thinking through this, I just realized that this might be even more complex. For example, if you had surveyNameA, surveyNameB, and surveyNameC in May, and switched to surveyNameD, surveyNameE, and surveyNameF in October, you should still apply the older version of the config (e.g. use surveyNameA, surveyNameB, and surveyNameC when generating metrics for June in November). That will not work with the current approach of passing the config from the phone either.
I will merge this for now to unblock ourselves, and we should say that, after a study has been deployed to real users, we will not support changing the surveys. That is a reasonable restriction to impose, given that changing the survey while the data collection is in progress is not a recommended best-practice from a social science perspective - the results are not necessarily comparable with each other.
And then we should figure out the correct performance optimization vs. complexity tradeoff that we want to be at.
@JGreenlee I just checked the server code and I can see the flow. I'd love to discuss the details during the meeting. Thanks for all your work!
Includes bugfixes for metrics calculations: JGreenlee/e-mission-common@0.5.0...0.5.1
As I indicated earlier, the main changes here are high-level comments on the code in e-mission-common:

- why are we using composite trips? The user inputs are also available in the confirmed trips, and using confirmed trips prevents us from having to read all the trajectory data in the composite trips, and reduces network demands on the database.
- what is labels_map used for, and why do we use it as a fallback if there is no user_input? It is passed in as input to generate_summaries, but the invocation of generate_summaries does not pass it in.
- Given that we are changing this structure anyway, I wonder if it would be helpful to have this structured as

  ```json
  { "date": "2024-05-20",
    "mode_confirm": {"bike": 1000, "walk": 500},
    "purpose_confirm": {"home": 1500} }
  ```

  They will then be similar to the summary objects. We could also add the metric and then they would have an even more similar type.
- why do we parse out the fmt_time instead of using the local date objects directly?

  ```python
  date = trip['start_fmt_time'].split('T')[0]
  if date not in days_of_metrics_data:
      days_of_metrics_data[date] = []
  days_of_metrics_data[date].append(trip)
  ```

  Is it just for the return values?

Most of these are "future fix". The only actual potential bug is the comment below.
Co-authored-by: K. Shankari <[email protected]>
I was using composite trips because confirmed trips don't have any vehicle-specific info (until e-mission/e-mission-docs#1073 is done). However, I didn't implement vehicle-specific metrics yet anyway, so there's really no reason it needs to be composite trips.
That's for unprocessed labels, which are passed in on the phone but not applicable on the server
Yes, I think that would be much better.
What would this look like in context? Currently each day is a child of the metric:

```json
{
  "distance": [
    {
      "date": "2024-05-20",
      "mode_confirm": {
        "bike": 1000,
        "walk": 500
      },
      "purpose_confirm": {
        "home": 1500
      }
    },
    ...
  ],
  "duration": [ ... ],
  "count": [ ... ]
}
```

Are you suggesting this, where the metric is a child of the day? To me, this makes more sense:

```json
[
  {
    "date": "2024-05-20",
    "distance": {
      "mode_confirm": {
        "bike": 1000,
        "walk": 500
      },
      "purpose_confirm": {
        "home": 1500
      }
    },
    "duration": ...,
    "count": ...
  },
  ...
]
```
It's for the

It could be changed to use the full local date objects instead of YYYY-MM-DD strings; it's just more verbose. Combined with the above changes, it would look like this:

```json
[
  {
    "local_date": {
      "year": 2024,
      "month": 5,
      "day": 20
    },
    "distance": {
      "mode_confirm": {
        "bike": 1000,
        "walk": 500
      },
      "purpose_confirm": {
        "home": 1500
      }
    },
    "duration": ...,
    "count": ...
  },
  ...
]
```
Related PR:
e-mission/e-mission-phone#1142
Overview
Get survey data for phone app dashboard - leaderboard, comparison, trip
Related UI