This repository has been archived by the owner on Dec 16, 2024. It is now read-only.

Null values currently fail validation for EmergencyCareEpisodeSchema due to int64 type #44

Open
georgm8 opened this issue Mar 11, 2023 · 6 comments


@georgm8
Contributor

georgm8 commented Mar 11, 2023

The following columns currently fail validation because they contain null values but are set to dtype=np.int64 in the EmergencyCareEpisodeSchema:

edcomorb_[0-9]{2}$
eddiag_[0-9]{2}$
eddiagqual_[0-9]{2}$
edentryseq_[0-9]{2}$
edinvest_[0-9]{2}$
edtreat_[0-9]{2}$

Suggest changing to pd.Int64Dtype() to allow null values

The columns could be changed to a float type instead, but then you get the awkward situation where pandas adds a decimal point onto the end of the SNOMED code
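As a rough illustration of the three dtype options mentioned above, here is a minimal pandas-only sketch. The column name and values are made up for the example, and it does not touch the actual EmergencyCareEpisodeSchema definition.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing SNOMED code.
raw = pd.Series([183964008, None], name="edtreat_01")

# With a missing value present, pandas infers float64,
# so the code picks up a trailing ".0" when rendered as text.
inferred_dtype = raw.dtype
as_text = raw.astype(str).tolist()

# Casting straight to np.int64 fails because NaN has no integer representation.
try:
    raw.astype(np.int64)
    int64_cast_failed = False
except ValueError:
    int64_cast_failed = True

# The nullable extension dtype keeps integers and stores the gap as <NA>.
nullable = raw.astype(pd.Int64Dtype())
```

One caveat with any route through float64: integers above 2**53 are not all exactly representable, so long SNOMED identifiers are safer read directly into a nullable integer (or string) column.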

@georgm8
Contributor Author

georgm8 commented Mar 13, 2023

Having had a better look at feature_maps.py, I think this might be better managed by doing a fillna(0) on the relevant columns!

Was wondering if I could clarify a few other things however @vvcb

  1. Missing values have already been accounted for in edinvest_[0-9]{2}$, with the SNOMED code specified for this as 1088291000000101

  2. However, missing values have not been accounted for in edtreat_[0-9]{2}$ - the HDRUK document specifies the SNOMED code for this as 183964008, so this could easily be added into feature_maps.py (happy to do this)

  3. Only the eddiag_01 column is required for analysis and presumably if this value is missing we should discard the row from the dataset (same applies to eddiagqual_01)

  4. There is currently no pipeline to manage edcomorb_[0-9]{2}$ - however, I could pull this code in from admitted_care_features.py

  5. Although edentryseq is specified in the Regional Data Specification it doesn't appear to be used in the analysis

@vvcb
Member

vvcb commented Mar 13, 2023

Missing SNOMED codes should be replaced with 0. This avoids pandas NaN issues (I have included a link in the documentation). Is PR #46 still necessary if this is already done?

If there is a specific code for missing values, then this should be included in feature_maps. It would be great if you are happy to do this.
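A minimal sketch of that approach, assuming plain pandas: fill with the dedicated "missing" code where one exists and with 0 otherwise, then cast to np.int64. The `MISSING_CODES` mapping and the `fill_missing_codes` helper are hypothetical names for illustration and are not taken from feature_maps.py; the two non-zero codes are the ones quoted earlier in this thread.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical mapping from column-name pattern to the code used for "missing".
MISSING_CODES = {
    r"edinvest_[0-9]{2}$": 1088291000000101,
    r"edtreat_[0-9]{2}$": 183964008,
    r"edcomorb_[0-9]{2}$": 0,  # no dedicated code mentioned, so fall back to 0
}


def fill_missing_codes(df, missing_codes=MISSING_CODES):
    """Return a copy with nulls in matching columns replaced, cast to int64."""
    out = df.copy()
    for col in out.columns:
        for pattern, code in missing_codes.items():
            if re.match(pattern, col):
                # Go through the nullable dtype first so the fill value
                # is never rounded through float64.
                filled = out[col].astype("Int64").fillna(code)
                out[col] = filled.astype(np.int64)
                break
    return out


# Made-up example rows; None marks a missing code.
example = pd.DataFrame(
    {
        "edinvest_01": [1, None],
        "edtreat_01": [None, 2],
        "edcomorb_01": [None, 3],
    }
)
filled = fill_missing_codes(example)
```

After this step the columns contain no nulls, so a plain np.int64 dtype check should pass for them.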

@vvcb
Member

vvcb commented Mar 13, 2023

  1. Not sure if you would discard the entire row unless this column is necessary for the inclusion criteria... in which case it should not be missing. @quindavies, thoughts?

  2. Yes please.

  3. @quindavies ? I haven't looked at the ED spec in detail and am happy to be guided by Quin and others on this. Eyeballs deep in TRE and OMOP work.

@quindavies
Collaborator

  1. I don't recall the presence of a diagnosis being part of the inclusion criteria; however, the analysis tables are all summarised by ACSC flag, which is based on diagnosis. We can't assume that the absence of a diagnosis means that the attendance wasn't ambulatory related. What percentage of records does this affect?
  2. No I can't see it being used either 😄 in this or the winter pressures work

@georgm8
Contributor Author

georgm8 commented Mar 14, 2023

  1. Yes, I believe that is correct - the analysis tables are grouped by 'ACSC' and 'Non-ACSC', which are both derived from the diagnosis. So I think the options would be:

a) Treat absence of diagnosis as 'Non-ACSC'
b) Discard the row as we don't have a 'No Diagnosis' category and therefore can't include these patients in the analysis

We have about 10% of patients where there is no diagnosis assigned within the emergency care dataset
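To make the two options concrete, here is a rough pandas sketch. The diagnosis values, the `ACSC_CODES` set and the `acsc_group` column name are placeholders for illustration only, not real SNOMED codes or names from the pipeline.

```python
import numpy as np
import pandas as pd

# Placeholder data: eddiag_01 held as nullable Int64, two rows with no diagnosis.
df = pd.DataFrame(
    {"eddiag_01": pd.array([1001, None, 1002, None], dtype="Int64")}
)
ACSC_CODES = [1001]  # hypothetical stand-in for the real ACSC code list

# Share of records with no primary diagnosis.
missing = df["eddiag_01"].isna()
share_missing = missing.mean()

# A row is ACSC only if it has a diagnosis and that diagnosis is in the list.
is_acsc = df["eddiag_01"].isin(ACSC_CODES).astype(bool) & ~missing
labelled = df.assign(acsc_group=np.where(is_acsc, "ACSC", "Non-ACSC"))

# Option a: keep every row; a missing diagnosis falls into 'Non-ACSC'.
option_a = labelled

# Option b: drop rows with no diagnosis before grouping.
option_b = labelled.loc[~missing]
```

Comparing the group counts from `option_a` and `option_b` would show how much the choice shifts the 'Non-ACSC' totals.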

@vvcb
Member

vvcb commented Mar 14, 2023

Ah... I see it now 😊. Option b may be the correct one, but it is worth checking with the lead team regarding how they want this handled. 10% is a sizable proportion to be discarding.
