missing data visualization and imputation
Goal: to provide an easy-to-use yet thorough assessment of missing values in one's dataset:
- in addition to the blackholes plot below,
- show the variable-to-variable and subject-to-subject co-missingness, and
- quantify the type of missingness, etc.
To easily manage data with missing values, I strongly recommend moving away from CSV files and managing your data in self-contained, flexible data structures like pyradigm. Your data, as well as your needs, will only get bigger and more complicated, e.g. with mixed types, missing values and a large number of groups.
These would be great contributions if you have time.
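As a rough illustration of what variable-to-variable and subject-to-subject co-missingness could mean, here is a minimal sketch that uses only pandas and NumPy. It is not part of missingdata; the toy DataFrame and the names `var_co_missing` and `sub_co_missing` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 4 subjects x 3 variables, NaN marks a missing value.
toy = pd.DataFrame(
    {
        "age": [30.0, np.nan, 41.0, np.nan],
        "iq": [100.0, np.nan, np.nan, 95.0],
        "bmi": [22.5, 24.0, 27.1, 23.3],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Indicator matrix: 1 where a value is missing, 0 where it is present.
missing = toy.isna().astype(int)

# Variable-to-variable: entry (a, b) counts subjects missing both a and b.
var_co_missing = missing.T.dot(missing)

# Subject-to-subject: entry (i, j) counts variables missing in both i and j.
sub_co_missing = missing.dot(missing.T)

print(var_co_missing)
print(sub_co_missing)
```

The diagonal of each matrix holds the plain missing-value count for that variable or subject, so dividing by it would turn the counts into conditional proportions if that is more useful.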
- visualization
- imputation (coming!)
- other handling
- The software is in beta and under active development. Please update regularly!
- Contributions most welcome, esp. reporting bugs and improving usability.
pip install -U missingdata
We encourage you to update often, particularly when you run into any issues.
Take a look at the help text before diving in, using the following code:
from missingdata import blackholes
help(blackholes)
I encourage you to read the text for each parameter carefully to understand the behaviour of this plotting mechanism.
Note
If you don't see any labels (for rows or columns) when you try the blackholes plot for the first time, it may be because the total effective number of rows/cols being displayed, after applying filter_spec_*, exceeded a preset number (60/80) and we removed the labels to avoid them getting occluded or becoming illegible. You can use the parameter freq_thresh_show_labels to bring the effective number of rows/cols displayed down to a smaller number, or pass show_all_labels=True to force the display of labels. If the number of subjects or variables is large, you may want to increase figsize (width or height) to minimize occlusion and improve label readability.
Also, the defaults chosen may not work for you, hence I strongly encourage you to control as many parameters as needed to customize the plot to your liking. If a feature you need is not served currently, send a PR with improvements, or open an issue. Thanks.
Let's say you have all the data in a pandas DataFrame, where subject IDs are in a 'sub_ids' column and variable names are in a 'var_names' column, and they belong to groups identified by sub_class and var_group. You can then use the following code to produce the blackholes plot:
blackholes(data_frame,
label_rows_with='sub_ids', label_cols_with='var_names',
group_rows_by=sub_class, group_cols_by=var_group)
If you are interested in seeing the subjects/variables with the least amount of missing data, you can control the window of missing-data percentage with filter_spec_samples and/or filter_spec_variables by passing a tuple of two floats, e.g. (0, 0.1), which will filter away those with more than 10% of missing data.
blackholes(data_frame,
label_rows_with='sub_ids', label_cols_with='var_names',
filter_spec_samples=(0, 0.1))
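To make the idea of a missing-percentage window concrete, the following sketch reproduces the same kind of filtering with plain pandas. It is not how missingdata implements filter_spec_samples or filter_spec_variables; the helper `keep_within_missing_window` is hypothetical, and treating both ends of the window as inclusive is an assumption.

```python
import numpy as np
import pandas as pd


def keep_within_missing_window(df, window=(0.0, 0.1), axis="rows"):
    """Hypothetical helper: keep rows (or columns) of df whose fraction of
    missing values lies within the window (both ends assumed inclusive)."""
    low, high = window
    # Fraction of missing values per row (axis=1) or per column (axis=0).
    frac = df.isna().mean(axis=1 if axis == "rows" else 0)
    keep = (frac >= low) & (frac <= high)
    return df.loc[keep] if axis == "rows" else df.loc[:, keep]


# Toy data (hypothetical): rows r1, r2, r3 miss 0%, 25% and 75% of their values.
toy = pd.DataFrame(
    [
        [1.0, 2.0, 3.0, 4.0],
        [1.0, np.nan, 3.0, 4.0],
        [np.nan, np.nan, np.nan, 4.0],
    ],
    index=["r1", "r2", "r3"],
    columns=["a", "b", "c", "d"],
)

rows_kept = keep_within_missing_window(toy, window=(0, 0.1))
cols_kept = keep_within_missing_window(toy, window=(0, 0.1), axis="columns")
print(list(rows_kept.index), list(cols_kept.columns))
```

Widening the window, for example to (0, 0.5), would also keep row r2 in this toy data.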
The other parameters for the function are self-explanatory.
Please open an issue if you find something confusing, or have feedback to improve, or identify a bug. Thanks.
If you find this package useful, I'd greatly appreciate it if you cite it via:
Pradeep Reddy Raamana (2019), "missingdata python library for visualization and handling of missing values" (Version v0.1). Zenodo. DOI: 10.5281/zenodo.3352336, http://doi.org/10.5281/zenodo.3352336