Data validation #186
Thanks for the suggestion! This is something we are actively looking into. A couple of notes:
We already have a staging bucket and copy the files over to the production bucket. The issue with doing validation at this level is that it's all-or-nothing, and we would not want to block updating the entire output because a handful of inputs look off. We also wouldn't want to "fix" data at this stage since it may mask issues with underlying data sources.
That looks very interesting! The claims around performance seem great too. We would be open to implementing something similar, but I would suggest doing it at the individual data source level rather than the final output. If this is something you are interested in, we would definitely welcome your contributions. My only request is having a high-level discussion about design first, before submitting a big PR :-)
We are aware of some of the data irregularities, but please don't feel discouraged from opening an issue anytime you see something that looks off!
Thanks for the feedback!
I’d assume it doesn’t have to be all-or-nothing; that is why the data validation utility I referred to has configurable thresholds. The idea is to stop issues like #59, where the final output was truncated from containing data for ~200 countries down to a single country’s data. Unless something major like that happens, the utility will happily return a zero exit code indicating success and, as a side benefit, produce error files with rejected epidemiology and index rows for review. The rejections can be treated as warnings. For instance, right now the side benefit includes flagging the index
I’d recommend considering both types of checks as complementary to each other. When it comes to epidemiology data, most use cases will need both.
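To make the threshold idea concrete, here is a minimal sketch. The names (`MIN_COUNTRY_RATIO`, `validate`, the `key`/`date` column layout) are my own illustration, not the utility's actual API: per-row rejections go into an error file as warnings, and only a severe regression like #59 yields a nonzero exit code.

```python
import csv

# Illustrative threshold; in the real utility the thresholds are configurable.
MIN_COUNTRY_RATIO = 0.5  # fail only if over half of the countries disappear


def validate(rows, baseline_country_count, errors_path="rejected.csv"):
    """Return 0 (success) unless a major regression is detected.

    Rows failing basic sanity checks are written to ``errors_path`` for
    review, but are treated as warnings rather than failures.
    """
    accepted, rejected = [], []
    for row in rows:
        # Per-row checks: every record needs a location key and a date.
        (accepted if row.get("key") and row.get("date") else rejected).append(row)

    if rejected:
        # Side benefit: an error file with the rejected rows for review.
        with open(errors_path, "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=["key", "date"])
            writer.writeheader()
            writer.writerows(rejected)

    # Only a collapse like #59 (~200 countries down to one) is fatal.
    countries = {row["key"].split("_")[0] for row in accepted}
    if len(countries) < baseline_country_count * MIN_COUNTRY_RATIO:
        return 1
    return 0
```

A run over yesterday's output with yesterday's country count as the baseline would exit zero on minor per-row problems but nonzero on a #59-style collapse.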
Good point, let me clarify. When I say "all or nothing" I mean that we would not want to selectively filter outputs at the final stage: either we copy them or we don't. That said, it's perfectly valid to run some validation on those outputs, so we can spot issues and fix them at the data source level. So the validation can be run with different thresholds: some findings can be considered errors, which would prevent copying altogether (like #59), and some can be considered warnings, which would result in an issue being opened for us to investigate.
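That error/warning split could be expressed along these lines; the `Finding` type and severity names below are hypothetical, not an existing interface:

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """A single validation result (hypothetical type for illustration)."""
    message: str
    severity: str  # "error" blocks the copy; "warning" gets investigated


def should_copy(findings):
    """Copy staging to production only if no error-level findings exist.

    Returns ``(copy_ok, warnings)``; the warnings would typically be
    turned into an issue for follow-up rather than blocking the release.
    """
    warnings = [f for f in findings if f.severity == "warning"]
    copy_ok = not any(f.severity == "error" for f in findings)
    return copy_ok, warnings
```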
If you'd like to go ahead with integrating the utility, then its repository can be cloned/copied and the existing functionality used as is; I assume there is no need for a PR. Please let me know if this assumption is incorrect, or if you’d like to change the utility and make it more suitable for testing of the unit comprised of the
It would be good to have a staging bucket and copy files from there to the current production bucket after the data has been validated by a data validation utility, for example like this one or similar.
The set of validation checks would address the existing data-irregularity issues and cover the space where future issues could develop.
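The staging-then-promote flow could look roughly like this sketch; `run_validation` and `copy_to_production` are placeholders for whatever the pipeline actually uses (e.g. the validation utility's CLI and a bucket-to-bucket copy):

```python
def promote(run_validation, copy_to_production):
    """Promote staged outputs to production only when validation succeeds.

    ``run_validation`` is expected to behave like a process: it returns a
    zero exit code on success and nonzero on a severe regression.
    """
    if run_validation() != 0:
        # Severe failure: leave the production bucket untouched.
        return False
    copy_to_production()
    return True
```

Injecting the two steps as callables keeps the gating logic trivial to test without touching real buckets.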