Merge branch 'dev' into chp_add_rand_state_ddpm
Julien Roussel authored and Julien Roussel committed Jun 13, 2024
2 parents ddb6d69 + 47565ff commit 210e2f4
Showing 67 changed files with 4,308 additions and 3,037 deletions.
2 changes: 1 addition & 1 deletion .bumpversion.cfg
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
[bumpversion]
current_version = 0.0.15
current_version = 0.1.7
commit = True
tag = True

2 changes: 2 additions & 0 deletions .coveragerc
@@ -0,0 +1,2 @@
[run]
omit = qolmat/_version.py
2 changes: 1 addition & 1 deletion .flake8
@@ -1,5 +1,5 @@
[flake8]
exclude = .git,__pycache__,.vscode,tests
exclude = .git,__pycache__,.vscode
max-line-length=99
ignore=E302,E305,W503,E203,E731,E402,E266,E712,F401,F821
indent-size = 4
4 changes: 2 additions & 2 deletions .github/workflows/publish.yml
@@ -11,9 +11,9 @@ jobs:
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v3
uses: actions/setup-python@v4
with:
python-version: '3.10'
- name: Install dependencies
4 changes: 1 addition & 3 deletions .github/workflows/test.yml
@@ -1,4 +1,4 @@
name: Unit test on many environments
name: Unit tests

on:
push:
@@ -31,11 +31,9 @@ jobs:
environment-file: environment.ci.yml
- name: Lint with flake8
run: |
conda install flake8
flake8
- name: Test with pytest
run: |
conda install pytest
make coverage
- name: typing with mypy
run: |
39 changes: 32 additions & 7 deletions .github/workflows/test_quick.yml
@@ -1,4 +1,4 @@
name: Unit test Qolmat
name: Unit tests fast

on:
push:
@@ -21,19 +21,44 @@ jobs:
steps:
- name: Git clone
uses: actions/checkout@v3
- name: Set up venv for ci

# See caching environments
# https://github.com/conda-incubator/setup-miniconda#caching-environments
- name: Setup Mambaforge
uses: conda-incubator/setup-miniconda@v2
with:
python-version: ${{matrix.python-version}}
environment-file: environment.ci.yml
miniforge-variant: Mambaforge
miniforge-version: latest
activate-environment: env_qolmat_ci
use-mamba: true

- name: Get Date
id: get-date
run: echo "today=$(/bin/date -u '+%Y%m%d')" >> $GITHUB_OUTPUT

- name: Cache Conda env
uses: actions/cache@v2
with:
path: ${{ env.CONDA }}/envs
key:
conda-${{ runner.os }}--${{ runner.arch }}--${{
steps.get-date.outputs.today }}-${{
hashFiles('environment.ci.yml') }}-${{ env.CACHE_NUMBER
}}
env:
# Increase this value to reset cache if environment.ci.yml has not changed
CACHE_NUMBER: 0
id: cache

- name: Update environment
run: mamba env update -n env_qolmat_ci -f environment.ci.yml
if: steps.cache.outputs.cache-hit != 'true'

- name: Lint with flake8
run: |
conda install flake8
flake8
- name: Test with pytest
run: |
conda install pytest
pip install -e .[pytorch]
make coverage
- name: Test docstrings
run: make doctest
2 changes: 1 addition & 1 deletion .gitignore
@@ -59,7 +59,7 @@ examples/*.ipynb
examples/figures/*
examples/data/*
examples/local

data/data_local/*

# VSCode
.vscode
2 changes: 2 additions & 0 deletions .readthedocs.yml
@@ -9,6 +9,8 @@ python:
install:
- method: pip
path: .
extra_requirements:
- pytorch

conda:
environment: environment.doc.yml
9 changes: 5 additions & 4 deletions AUTHORS.rst
@@ -5,10 +5,10 @@ Credits
Development Team
----------------

* Julien Roussel <jroussel@quantmetry.com>
* Anh Khoa Ngo Ho <angoho@quantmetry.com>
* Charles-Henri Prat <chprat@quantmetry.com>
* Guillaume Saës <gsaes@quantmetry.com>
* Julien Roussel <julien.a.roussel@capgemini.com>
* Anh Khoa Ngo Ho <anh-khoa.ngo-ho@capgemini.com>
* Guillaume Saës <guillaume.saes@capgemini.com>
* Yasser Zidani <yasser.zidani@capgemini.com>

Past Contributors
-----------------
@@ -19,3 +19,4 @@ Past Contributors
* Mikaïl Duran
* Rima Hajou
* Thomas Morzadec
* Charles-Henri Prat
47 changes: 46 additions & 1 deletion HISTORY.rst
@@ -2,10 +2,55 @@
History
=======

0.1.1 (2023-??-??)
0.1.7 (2024-06-13)
------------------
* Little's test implemented in a new hole_characterization module
* Documentation now includes an analysis section with a tutorial
* Hole generators now provide reproducible outputs

0.1.5 (2024-04-17)
------------------

* CICD now relies on Node.js 20
* New tests for comparator.py and data.py

0.1.4 (2024-04-15)
------------------

* ImputerMean, ImputerMedian and ImputerMode have been merged into ImputerSimple
* File preprocessing.py added with new classes MixteHGBM, BinTransformer, OneHotEncoderProjector and WrapperTransformer providing tools to manage mixed-type data
* Tutorial plot_tuto_categorical showcasing mixed type imputation
* Titanic dataset added
* accuracy metric implemented
* metrics.py rationalized, and split with algebra.py

0.1.3 (2024-03-07)
------------------

* RPCA algorithms now start with a normalizing scaler
* The EM algorithms now include a gradient projection step to be more robust to colinearity
* The EM algorithm based on the Gaussian model is now initialized using a robust estimation of the covariance matrix
* A bug in the EM algorithm has been patched: the normalizing matrix gamma was creating a sampling bias
* Speed up of the EM algorithm likelihood maximization, using the conjugate gradient method
* The ImputeRegressor class now handles the nans by `row` by default
* The metric `frechet` was not correctly called and has been patched
* The EM algorithm with VAR(p) now fills initial holes in order to avoid exponential explosions

0.1.2 (2024-02-28)
------------------

* RPCA Noisy now has separate fit and transform methods, allowing new data to be imputed efficiently without retraining
* The class ImputerRPCA has been split into a class ImputerRpcaNoisy, which can fit then transform, and a class ImputerRpcaPcp which can only fit_transform
* The class SoftImpute has been recoded to better fit the architecture, and is more tested
* The class RPCANoisy now relies on sparse matrices for H, speeding it up for large instances

0.1.1 (2023-11-03)
-------------------

* Hotfix reference to tensorflow in the documentation, when it should be pytorch
* Metrics KL forest has been removed from package
* EM imputer made more robust to colinearity, and transform bug patched
* CICD made faster with mamba and a quick test setting

0.1.0 (2023-10-11)
-------------------
2 changes: 1 addition & 1 deletion Makefile
@@ -1,5 +1,5 @@
coverage:
pytest --cov-branch --cov=qolmat --cov-report=xml
pytest --cov-branch --cov=qolmat --cov-report=xml tests

doctest:
pytest --doctest-modules --pyargs qolmat
11 changes: 7 additions & 4 deletions README.rst
@@ -23,7 +23,7 @@
.. |Commits| image:: https://img.shields.io/github/commits-since/Quantmetry/qolmat/latest/main
.. _Commits: https://github.com/Quantmetry/qolmat/commits/main

.. |Codecov| image:: https://codecov.io/gh/quantmetry/qolmat/branch/master/graph/badge.svg
.. |Codecov| image:: https://codecov.io/gh/quantmetry/qolmat/branch/main/graph/badge.svg
.. _Codecov: https://codecov.io/gh/quantmetry/qolmat

.. image:: https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/logo.png
@@ -47,7 +47,7 @@ Qolmat can be installed in different ways:
.. code:: sh
$ pip install qolmat # installation via `pip`
$ pip install qolmat[pytorch] # if you need pytorch
$ pip install qolmat[pytorch] # if you need ImputerDiffusion relying on pytorch
$ pip install git+https://github.com/Quantmetry/qolmat # or directly from the github repository
⚡️ Quickstart
@@ -106,7 +106,8 @@ The full documentation can be found `on this link <https://qolmat.readthedocs.io
**How does Qolmat work ?**

Qolmat allows model selection for scikit-learn compatible imputation algorithms, by performing three steps pictured below:
1) For each of the K folds, Qolmat artificially masks a set of observed values using a default or user specified `hole generator <explanation.html#hole-generator>`_,

1) For each of the K folds, Qolmat artificially masks a set of observed values using a default or user specified `hole generator <explanation.html#hole-generator>`_.
2) For each fold and each compared `imputation method <imputers.html>`_, Qolmat fills both the missing and the masked values, then computes each of the default or user specified `performance metrics <explanation.html#metrics>`_.
3) For each compared imputer, Qolmat pools the computed metrics from the K folds into a single value.

@@ -117,7 +118,7 @@ This is very similar in spirit to the `cross_val_score <https://scikit-learn.org

**Imputation methods**

The following table contains the available imputation methods. We distinguish single imputation methods (aiming for pointwise accuracy, mostly deterministic) from multiple imputation methods (aiming for distribution similarity, mostly stochastic).
The following table contains the available imputation methods. We distinguish single imputation methods (aiming for pointwise accuracy, mostly deterministic) from multiple imputation methods (aiming for distribution similarity, mostly stochastic). For further details regarding the distinction between single and multiple imputation, you can refer to the `Imputation article <https://en.wikipedia.org/wiki/Imputation_(statistics)>`_ on Wikipedia.

.. list-table::
:widths: 25 70 15 15
@@ -231,6 +232,8 @@ Selected Topics in Signal Processing 10.4 (2016): 740-756.
[6] García, S., Luengo, J., & Herrera, F. "Data preprocessing in data mining". 2015.
(`pdf <https://www.academia.edu/download/60477900/Garcia__Luengo__Herrera-Data_Preprocessing_in_Data_Mining_-_Springer_International_Publishing_201520190903-77973-th1o73.pdf>`__)

[7] Botterman, HL., Roussel, J., Morzadec, T., Jabbari, A., Brunel, N. "Robust PCA for Anomaly Detection and Data Imputation in Seasonal Time Series" (2022) in International Conference on Machine Learning, Optimization, and Data Science. Cham: Springer Nature Switzerland, (`pdf <https://link.springer.com/chapter/10.1007/978-3-031-25891-6_21>`__)

📝 License
==========

68 changes: 68 additions & 0 deletions docs/analysis.rst
@@ -0,0 +1,68 @@

Analysis
========
This section gives a better understanding of the holes in a dataset.

1. General approach
-------------------

As described in section :ref:`hole_generator`, there are 3 main types of missing data mechanism: MCAR, MAR and MNAR.
The analysis module provides tools to characterize the type of holes.

The MNAR case is the trickiest: the user must first consider whether their missing-data mechanism is MNAR. In the meantime, we assume that the missing-data mechanism is ignorable (i.e., it is not MNAR). If an MNAR mechanism is suspected, please see the article :ref:`An approach to test for MNAR [1]<Noonan-article>` for relevant actions.

Then Qolmat proposes a test to determine whether the missing data mechanism is MCAR or MAR.

2. How to use the results
-------------------------

The outcome of the MCAR test indicates whether or not the missing-data mechanism can be assumed to be MCAR. This serves three different purposes:

a. Diagnosis
^^^^^^^^^^^^

If the result of the MCAR test is "The MCAR hypothesis is rejected", we can then ask ourselves in which range of values holes are more frequent.
The test result can then be used for continuous data quality management.

b. Estimation
^^^^^^^^^^^^^

Some estimation methods are not suitable for the MAR case. For example, dropping the NaNs introduces bias into the estimator unless the missing-data mechanism has been validated as MCAR.
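
The bias caused by dropping rows can be illustrated with a small simulation. This is a minimal sketch using only the Python standard library, not Qolmat code: the variables ``xs`` and ``ys``, the sample size and the missingness rules are all hypothetical choices made for illustration. The mean of ``ys`` is estimated after dropping rows under an MCAR mechanism (rows dropped at random) and under a MAR mechanism (``y`` is missing whenever the always-observed ``x`` is positive).

```python
import random
import statistics


def simulate(n=5000, seed=0):
    """Compare complete-case means of y under MCAR and MAR mechanisms."""
    rng = random.Random(seed)
    # x is always observed; y is correlated with x.
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [x + rng.gauss(0.0, 0.5) for x in xs]
    full_mean = statistics.fmean(ys)

    # MCAR: each y is kept with probability 0.5, independently of the data.
    mcar_kept = [y for y in ys if rng.random() < 0.5]

    # MAR: y is missing whenever the observed x is positive.
    mar_kept = [y for x, y in zip(xs, ys) if x <= 0.0]

    return full_mean, statistics.fmean(mcar_kept), statistics.fmean(mar_kept)


full_mean, mcar_mean, mar_mean = simulate()
print(f"mean of y on full data:        {full_mean:.3f}")
print(f"mean of y after MCAR deletion: {mcar_mean:.3f}")
print(f"mean of y after MAR deletion:  {mar_mean:.3f}")
```

Under these assumptions the MCAR estimate stays close to the full-data mean, while the MAR estimate is clearly shifted downwards, because only rows with a non-positive ``x`` remain.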

c. Imputation
^^^^^^^^^^^^^

Qolmat allows model selection for imputation algorithms. For each of the K folds, Qolmat artificially masks a set of observed values using a default or user-specified hole generator. It seems natural to create these masks according to the same missing-data mechanism as determined by the test. Here is the documentation on using Qolmat for imputation `model selection <https://qolmat.readthedocs.io/en/latest/#:~:text=How%20does%20Qolmat%20work%20%3F>`_.

3. The MCAR Tests
-----------------

There are several statistical tests to determine if the missing data mechanism is MCAR or MAR. Most tests are based on the notion of missing pattern.
A missing pattern, also called a pattern, is the structure of observed and missing values in a dataset. For example, for a dataset with two columns, the possible patterns are: (0, 0), (1, 0), (0, 1), (1, 1). The value 1 indicates that the value in the column is missing.

The MCAR missing-data mechanism means that there is independence between the presence of holes and the observed values. In other words, the data distribution is the same for all patterns.
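
The notion of missing pattern can be made concrete with a short sketch. The toy ``rows`` below and the helper ``missing_pattern`` are hypothetical and independent of Qolmat's implementation; missing values are represented by ``None``, and each row is mapped to its pattern with the convention used above (1 means missing).

```python
from collections import Counter

# Toy dataset with two columns; None marks a missing value.
rows = [
    (1.2, 3.4),
    (None, 2.0),
    (0.7, None),
    (None, None),
    (2.5, 1.1),
    (None, 0.3),
]


def missing_pattern(row):
    """Return the pattern of a row: 1 where the value is missing, 0 otherwise."""
    return tuple(int(value is None) for value in row)


# Count how many rows share each missing pattern.
pattern_counts = Counter(missing_pattern(row) for row in rows)
for pattern, count in sorted(pattern_counts.items()):
    print(pattern, count)
```

An MCAR test then compares the distribution of the observed values between the groups of rows defined by these patterns.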

a. Little's Test
^^^^^^^^^^^^^^^^

The best-known MCAR test is the :ref:`Little [2]<Little-article>` test, which is implemented in :class:`LittleTest`. Keep in mind that Little's test is designed to test the homogeneity of means across the missing patterns and will not efficiently detect heterogeneity of covariance across missing patterns.
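
For reference, the statistic behind Little's test can be written as follows. The notation is introduced here for illustration and is not taken from the Qolmat code: :math:`J` is the number of missing patterns, :math:`n_j` the number of rows in pattern :math:`j`, :math:`p_j` the number of variables observed in that pattern, :math:`p` the total number of variables, :math:`\bar{y}_j` the mean of the observed variables within pattern :math:`j`, and :math:`\hat{\mu}_{o_j}`, :math:`\hat{\Sigma}_{o_j}` the restrictions to those variables of the mean and covariance estimated on the whole dataset.

```latex
d^2 = \sum_{j=1}^{J} n_j \,
      \left(\bar{y}_j - \hat{\mu}_{o_j}\right)^{\top}
      \hat{\Sigma}_{o_j}^{-1}
      \left(\bar{y}_j - \hat{\mu}_{o_j}\right),
\qquad
d^2 \;\overset{\text{approx.}}{\sim}\; \chi^2\!\left(\sum_{j=1}^{J} p_j - p\right)
\quad \text{under MCAR.}
```

Since only the pattern-wise means :math:`\bar{y}_j` enter the statistic, two patterns with equal means but different covariances leave :math:`d^2` essentially unchanged, which is the limitation mentioned above.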

b. PKLM Test
^^^^^^^^^^^^

The :ref:`PKLM [3]<PKLM-article>` (Projected Kullback-Leibler MCAR) test compares the distributions of different missing patterns on random projections in the variable space of the data. This recent test applies to mixed-type data. It is not yet implemented in Qolmat.

References
----------

.. _Noonan-article:

[1] Noonan, Jack, et al. `An integrated approach to test for missing not at random. <https://arxiv.org/abs/2208.07813>`_ arXiv preprint arXiv:2208.07813 (2022).

.. _Little-article:

[2] Little, R. J. A. `A Test of Missing Completely at Random for Multivariate Data with Missing Values. <https://www.tandfonline.com/doi/abs/10.1080/01621459.1988.10478722>`_ Journal of the American Statistical Association, Volume 83, 1988 - Issue 404.

.. _PKLM-article:

[3] Spohn, Meta-Lina, et al. `PKLM: A flexible MCAR test using Classification. <https://arxiv.org/abs/2109.10150>`_ arXiv preprint arXiv:2109.10150 (2021).