UK Information is incorrect #418
Comments
The rest of the data is also incorrect, however, I included only the most recent for brevity. |
The count in question 4,105,399 (+1,447) is taken from virusquery.com output:
'Calculated total' means the count has been calculated as the running sum of the State/Province counts shown. 'Total' means the country-wide total was taken from the dataset. |
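The distinction between 'Calculated total' and 'Total' can be sketched as below. This is only an illustration, not how virusquery.com is implemented: the function names and all counts are made up for the example.

```python
def calculated_total(subregion_counts):
    """Return the sum of the cumulative counts of all subregions."""
    return sum(subregion_counts.values())


def relative_difference(calculated, reported):
    """Return |calculated - reported| as a fraction of the reported total."""
    if reported == 0:
        raise ValueError("reported total must be non-zero")
    return abs(calculated - reported) / reported


# Made-up cumulative counts, for illustration only (not real data).
example_counts = {"England": 1000, "Scotland": 90, "Wales": 60, "Northern Ireland": 40}
example_reported_total = 1200

total = calculated_total(example_counts)
diff = relative_difference(total, example_reported_total)
print(f"calculated={total} reported={example_reported_total} diff={diff:.2%}")
```

A non-zero difference here means the country-wide figure in the dataset and the sum of its subregions were not taken from the same metric or the same release.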
@winwiz1 From today's https://virusquery.com/ it looks like the countrywide total is updated later than the nationwide totals. e.g. 2021-02-19 now agrees (but 2021-02-20 does not). |
@themonk911 Yes, this is what seems to be happening. |
Can I suggest to replace in Public Health England API calls Would be good to get the previously collected |
From https://coronavirus.data.gov.uk/details/developers-guide, they claim to not support publish date for regions, only for nations.
So there will be a mismatch between region/nation, but given it appears to resolve within a day, I'm not convinced this is a huge issue. |
As of now the page https://coronavirus.data.gov.uk/details/cases shows:
At the bottom:
This latest count is already quoted by hundreds of sources: https://www.google.com/search?q=UK+COVID+cases+%224%2C126%2C150%22+site%3Auk As for the availability of this data for various levels, the bottom heading "Cases by area (whole pandemic)" shows counts at different levels and my initial assumption would be that all this data is consistent e.g. comes from one API metric. You can compare data for the
The difference is substantial, does not resolve within a day and depends on the specific date specified as a part of API call since |
I see. The problem at the moment is that we'd like to be consistent between the different levels. cumCasesByPublishDate is not available for L3 and Regions data (e.g. https://api.coronavirus.data.gov.uk/v2/data?areaType=region&metric=cumCasesByPublishDate&metric=cumCasesBySpecimenDate&format=csv&release=2021-02-22 shows a blank column for cumCasesByPublishDate). At one point we were using PublishDate for nations + UK, and SpecimenDate for the rest (L3 + regions), but that gives a different set of inconsistencies from the ones we currently have. I'm not sure whether one is really better than the other, and we're limited by the data available to us. @owahltinez not sure whether you have thoughts on this matter? |
correction: regions data has cumCasesByPublishDate for only the most recent date. |
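The check described above (which metric columns are actually populated for a given area type) can be sketched roughly as follows. The helper names are hypothetical, the sample CSV is made up, and the sketch makes no network request: it only rebuilds the query URL quoted above and counts non-blank cells in CSV text that has already been downloaded.

```python
import csv
import io
from urllib.parse import urlencode

API_BASE = "https://api.coronavirus.data.gov.uk/v2/data"


def build_url(area_type, metrics, release, fmt="csv"):
    """Build a query URL with one repeated 'metric' parameter per metric."""
    params = [("areaType", area_type)]
    params += [("metric", m) for m in metrics]
    params += [("format", fmt), ("release", release)]
    return f"{API_BASE}?{urlencode(params)}"


def populated_row_counts(csv_text, metrics):
    """Count, per metric, the rows in which that metric's cell is non-blank."""
    counts = {m: 0 for m in metrics}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for m in metrics:
            if (row.get(m) or "").strip():
                counts[m] += 1
    return counts


METRICS = ["cumCasesByPublishDate", "cumCasesBySpecimenDate"]

# Made-up sample shaped like the behaviour described in the thread:
# the publish-date metric is filled in for the most recent date only.
SAMPLE_CSV = (
    "date,areaName,cumCasesByPublishDate,cumCasesBySpecimenDate\n"
    "2021-02-22,London,700000,699000\n"
    "2021-02-21,London,,698000\n"
)

print(build_url("region", METRICS, "2021-02-22"))
print(populated_row_counts(SAMPLE_CSV, METRICS))
```

A metric whose populated count is much lower than the number of rows is one that cannot be used on its own for historical data at that level.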
Right, so there are issues with L3 and Regions as far as switching to
To be more precise, an API call for a region will return the Issue with L3.
Multiple calls like that can be used as described above to get London historical data. Issue with NUTS regions.
The list of regions returned includes London and The issue is however that those regions seem to be different from NUTS regions (including these in Again, for utla like
Root cause of the issue with NUTS regions. On the UK government page there is a link to the document Hierarchical Representation of UK Statistical Geographies (December 2020). It tells us there are eight UK Statistical Geographies as of December 2020, each with its own hierarchy. It would be reasonable to assume Public Health England focuses on the Health Geography and its hierarchy and ensures the API supports it, whereas NUTS regions belong to the Eurostat Geography. It would appear that blending both geographies into one dataset was based on best intentions to accommodate a request from a researcher, but it created issues down the track since PHE caters mostly for the Health Geography. |
I don't have a very strong preference here. I recall going back and forth about this and eventually settling on our current metric; it was an informed decision, not an arbitrary one. It seems that the difference between the two API calls for the larger regions is <1%, so either way it wouldn't be the end of the world. If users expect an exact count and the inconsistency across the different levels is such a small difference, I wouldn't oppose changing the metric used to match what is being reported elsewhere (while keeping the more reliable metric for smaller subregions). That said, @themonk911 is the local expert and has been working the longest with this data so I'll defer to their decision.
Blending data from multiple datasets is the core value proposition of our project. We harmonize geographical locations as much as we can so the data from different sources can be merged seamlessly. Sometimes a few regions are present in one system but not another. The UK has actually been the most challenging to work with because of the many different ways there are to divide the country into smaller admin regions. To the best of my knowledge, the NUTS regions from the UK that we report data for have identical boundaries. Using the same example of Blackburn with Darwen, you can see in the Wikidata page that it has multiple identifiers associated with it: one is NUTS and another UTLA. In some cases we only have the name of a region to go by, for example in the Google Mobility Reports. So the matching of regions is not an exact science, but we find it close enough to be useful. The mobility reports recently started publishing an identifier we can use to disambiguate, so this will be a smaller problem in the future. |
Sure. However the value derived from a particular blending depends on factors like correctness and completeness of the implementation. Once all that is factored in, the value needs to be balanced against the cost of functional regressions it caused if any. I understand the implementation of NUTS regions initially caused undesirable inconsistency (concurrent use of both metrics) and later contributed to switching to As a side note, significant research value of
Correct for this particular NUTS level 3 region named after a single local authority. Looking at the Wiki page we can see in the table that the names of some NUTS 3 regions include more than one local authority. For example, the first region is UKC11 “Hartlepool and Stockton-on-Tees”. Searching There are 174 NUTS 3 regions in the UK. The dataset contains index entries for 49 level 3 regions. Data for each region can be collected either by a direct API call (in cases when there is a one-to-one match between a region and a local authority) or by summing up the counts provided by the relevant local authorities. It looks like the UK government has renamed NUTS regions to ITLs. So this area could be looked at some time in the future, in contrast to fixing the metrics, which is a more urgent issue. |
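The two collection paths mentioned above (a one-to-one match, or a sum over the relevant local authorities) can be sketched like this. It is a minimal illustration under stated assumptions: the mapping table and function name are hypothetical, the mapping covers only the two regions named in this thread, and the counts are made up.

```python
# Hypothetical mapping from a NUTS 3 region name to its local authorities.
# "Hartlepool and Stockton-on-Tees" is the UKC11 example from this thread.
NUTS3_TO_AUTHORITIES = {
    "Hartlepool and Stockton-on-Tees": ["Hartlepool", "Stockton-on-Tees"],
    "Blackburn with Darwen": ["Blackburn with Darwen"],  # one-to-one match
}


def nuts3_count(region, counts_by_authority):
    """Return the region's count as the sum over its local authorities."""
    authorities = NUTS3_TO_AUTHORITIES[region]
    missing = [a for a in authorities if a not in counts_by_authority]
    if missing:
        # Refuse to return a partial sum that would silently undercount.
        raise KeyError(f"no counts for: {', '.join(missing)}")
    return sum(counts_by_authority[a] for a in authorities)


# Made-up counts per local authority, for illustration only.
example_authority_counts = {
    "Hartlepool": 10,
    "Stockton-on-Tees": 25,
    "Blackburn with Darwen": 7,
}

for name in NUTS3_TO_AUTHORITIES:
    print(name, nuts3_count(name, example_authority_counts))
```

Raising on a missing authority, rather than skipping it, keeps a region with incomplete source data from being reported with a count that looks valid but is too low.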
It may be confusing if you're expecting the counts to exactly match other sources, but since the difference is <1% I don't think it will affect research significantly. We determined that the currently used metric was more accurate and consistent across aggregation levels, but based on your feedback we are evaluating switching to the metric that matches other data sources.
If I understand what you are saying correctly, this is technically possible to do with our dataset but sadly very difficult. You can get a "snapshot" of what our dataset looked like at any arbitrary point in time by accessing the object versioning of the file. I hope to make some time in the future to provide step-by-step examples of how to do this...
This is not a bad idea, but it would incur a huge penalty in the total file size since it would be an empty column for nearly all rows. I would much rather choose one metric or the other.
This sounds like a bug in our mapping of regions from NUTS 3 system to ours. This is not done automatically based on the region name, so a region having the word 'and' is not the root cause (although it probably makes it more likely that we got confused when mapping the regions). In this case, fortunately, it's only a
The choice of which UK regions to use for data reporting was made based on what epidemiological data was available nearly a year ago. It seems many more regions are covered now, so we would probably make different choices today. I believe there is an ongoing effort to include more of the newly available regions in our dataset as part of the "catch-all" aggregation level 3, but I don't know the timeline for that, nor whether those regions correspond to the NUTS3 admin breakdown or something else entirely. |
Actual data for UK: