Merge branch 'dev' into chp_add_rand_state_ddpm

scikit-learn-contrib · Jun 13, 2024 · 210e2f4
2 parents ddb6d69 + 47565ff
commit 210e2f4

Showing 67 changed files with 4,308 additions and 3,037 deletions.
diff --git a/.bumpversion.cfg b/.bumpversion.cfg
@@ -1,5 +1,5 @@
 [bumpversion]
-current_version = 0.0.15
+current_version = 0.1.7
 commit = True
 tag = True
 

diff --git a/.coveragerc b/.coveragerc
@@ -0,0 +1,2 @@
+[run]
+omit = qolmat/_version.py
diff --git a/.flake8 b/.flake8
@@ -1,5 +1,5 @@
 [flake8]
-exclude = .git,__pycache__,.vscode,tests
+exclude = .git,__pycache__,.vscode
 max-line-length=99
 ignore=E302,E305,W503,E203,E731,E402,E266,E712,F401,F821
 indent-size = 4

diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml
@@ -11,9 +11,9 @@ jobs:
     runs-on: ubuntu-latest
 
     steps:
-    - uses: actions/checkout@v3
+    - uses: actions/checkout@v4
     - name: Set up Python
-      uses: actions/setup-python@v3
+      uses: actions/setup-python@v4
       with:
         python-version: '3.10'
     - name: Install dependencies

diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
@@ -1,4 +1,4 @@
-name: Unit test on many environments
+name: Unit tests
 
 on:
   push:
@@ -31,11 +31,9 @@ jobs:
           environment-file: environment.ci.yml
       - name: Lint with flake8
         run: |
-          conda install flake8
           flake8
       - name: Test with pytest
         run: |
-          conda install pytest
           make coverage
       - name: typing with mypy
         run: |

diff --git a/.github/workflows/test_quick.yml b/.github/workflows/test_quick.yml
@@ -1,4 +1,4 @@
-name: Unit test Qolmat
+name: Unit tests fast
 
 on:
   push:
@@ -21,19 +21,44 @@ jobs:
     steps:
       - name: Git clone
         uses: actions/checkout@v3
-      - name: Set up venv for ci
+
+      # See caching environments
+      # https://github.com/conda-incubator/setup-miniconda#caching-environments
+      - name: Setup Mambaforge
         uses: conda-incubator/setup-miniconda@v2
         with:
-          python-version: ${{matrix.python-version}}
-          environment-file: environment.ci.yml
+            miniforge-variant: Mambaforge
+            miniforge-version: latest
+            activate-environment: env_qolmat_ci
+            use-mamba: true
+
+      - name: Get Date
+        id: get-date
+        run: echo "today=$(/bin/date -u '+%Y%m%d')" >> $GITHUB_OUTPUT
+
+      - name: Cache Conda env
+        uses: actions/cache@v2
+        with:
+          path: ${{ env.CONDA }}/envs
+          key:
+            conda-${{ runner.os }}--${{ runner.arch }}--${{
+            steps.get-date.outputs.today }}-${{
+            hashFiles('environment.ci.yml') }}-${{ env.CACHE_NUMBER
+            }}
+        env:
+          # Increase this value to reset cache if environment.ci.yml has not changed
+          CACHE_NUMBER: 0
+        id: cache
+
+      - name: Update environment
+        run: mamba env update -n env_qolmat_ci -f environment.ci.yml
+        if: steps.cache.outputs.cache-hit != 'true'
+
       - name: Lint with flake8
         run: |
-          conda install flake8
           flake8
       - name: Test with pytest
         run: |
-          conda install pytest
-          pip install -e .[pytorch]
           make coverage
       - name: Test docstrings
         run: make doctest

diff --git a/.gitignore b/.gitignore
@@ -59,7 +59,7 @@ examples/*.ipynb
 examples/figures/*
 examples/data/*
 examples/local
-
+data/data_local/*
 
 # VSCode
 .vscode

diff --git a/.readthedocs.yml b/.readthedocs.yml
@@ -9,6 +9,8 @@ python:
   install:
     - method: pip
       path: .
+      extra_requirements:
+        - pytorch
 
 conda:
   environment: environment.doc.yml

diff --git a/AUTHORS.rst b/AUTHORS.rst
@@ -5,10 +5,10 @@ Credits
 Development Team
 ----------------
 
-* Julien Roussel <jroussel@quantmetry.com>
-* Anh Khoa Ngo Ho <angoho@quantmetry.com>
-* Charles-Henri Prat <chprat@quantmetry.com>
-* Guillaume Saës <gsaes@quantmetry.com>
+* Julien Roussel <julien.a.roussel@capgemini.com>
+* Anh Khoa Ngo Ho <anh-khoa.ngo-ho@capgemini.com>
+* Guillaume Saës <guillaume.saes@capgemini.com>
+* Yasser Zidani <yasser.zidani@capgemini.com>
 
 Past Contributors
 -----------------
@@ -19,3 +19,4 @@ Past Contributors
 * Mikaïl Duran
 * Rima Hajou
 * Thomas Morzadec
+* Charles-Henri Prat
diff --git a/HISTORY.rst b/HISTORY.rst
@@ -2,10 +2,55 @@
 History
 =======
 
-0.1.1 (2023-??-??)
+0.1.7 (2024-06-13)
+------------------
+* Little's test implemented in a new hole_characterization module
+* Documentation now includes an analysis section with a tutorial
+* Hole generators now provide reproducible outputs
+
+0.1.5 (2024-04-17)
+------------------
+
+* CICD now relies on Node.js 20
+* New tests for comparator.py and data.py
+
+0.1.4 (2024-04-15)
+------------------
+
+* ImputerMean, ImputerMedian and ImputerMode have been merged into ImputerSimple
+* File preprocessing.py added with new classes MixteHGBM, BinTransformer, OneHotEncoderProjector and WrapperTransformer providing tools to manage mixed-type data
+* Tutorial plot_tuto_categorical showcasing mixed type imputation
+* Titanic dataset added
+* accuracy metric implemented
+* metrics.py rationalized, and split with algebra.py
+
+0.1.3 (2024-03-07)
+------------------
+
+* RPCA algorithms now start with a normalizing scaler
+* The EM algorithms now include a gradient projection step to be more robust to colinearity
+* The EM algorithm based on the Gaussian model is now initialized using a robust estimation of the covariance matrix
+* A bug in the EM algorithm has been patched: the normalizing matrix gamma was creating a sampling bias
+* Speed up of the EM algorithm likelihood maximization, using the conjugate gradient method
+* The ImputeRegressor class now handles the nans by `row` by default
+* The metric `frechet` was not correctly called and has been patched
+* The EM algorithm with VAR(p) now fills initial holes in order to avoid exponential explosions
+
+0.1.2 (2024-02-28)
+------------------
+
+* RPCA Noisy now has separate fit and transform methods, allowing new data to be imputed efficiently without retraining
+* The class ImputerRPCA has been split into a class ImputerRpcaNoisy, which can fit then transform, and a class ImputerRpcaPcp which can only fit_transform
+* The class SoftImpute has been recoded to better fit the architecture, and is more tested
+* The class RPCANoisy now relies on sparse matrices for H, speeding it up for large instances
+
+0.1.1 (2023-11-03)
 -------------------
 
 * Hotfix reference to tensorflow in the documentation, when it should be pytorch
+* Metrics KL forest has been removed from package
+* EM imputer made more robust to colinearity, and transform bug patched
+* CICD made faster with mamba and a quick test setting
 
 0.1.0 (2023-10-11)
 -------------------

diff --git a/Makefile b/Makefile
@@ -1,5 +1,5 @@
 coverage:
-	pytest --cov-branch --cov=qolmat --cov-report=xml
+	pytest --cov-branch --cov=qolmat --cov-report=xml tests
 
 doctest:
 	pytest --doctest-modules --pyargs qolmat

diff --git a/README.rst b/README.rst
@@ -23,7 +23,7 @@
 .. |Commits| image:: https://img.shields.io/github/commits-since/Quantmetry/qolmat/latest/main
 .. _Commits: https://github.com/Quantmetry/qolmat/commits/main
 
-.. |Codecov| image:: https://codecov.io/gh/quantmetry/qolmat/branch/master/graph/badge.svg
+.. |Codecov| image:: https://codecov.io/gh/quantmetry/qolmat/branch/main/graph/badge.svg
 .. _Codecov: https://codecov.io/gh/quantmetry/qolmat
 
 .. image:: https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/logo.png
@@ -47,7 +47,7 @@ Qolmat can be installed in different ways:
 .. code:: sh
 
     $ pip install qolmat  # installation via `pip`
-    $ pip install qolmat[pytorch] # if you need pytorch
+    $ pip install qolmat[pytorch] # if you need ImputerDiffusion relying on pytorch
     $ pip install git+https://github.com/Quantmetry/qolmat  # or directly from the github repository
 
 ⚡️ Quickstart
@@ -106,7 +106,8 @@ The full documentation can be found `on this link <https://qolmat.readthedocs.io
 **How does Qolmat work ?**
 
 Qolmat allows model selection for scikit-learn compatible imputation algorithms, by performing three steps pictured below:
-1) For each of the K folds, Qolmat artificially masks a set of observed values using a default or user specified `hole generator <explanation.html#hole-generator>`_,
+
+1) For each of the K folds, Qolmat artificially masks a set of observed values using a default or user specified `hole generator <explanation.html#hole-generator>`_.
 2) For each fold and each compared `imputation method <imputers.html>`_, Qolmat fills both the missing and the masked values, then computes each of the default or user specified `performance metrics <explanation.html#metrics>`_.
 3) For each compared imputer, Qolmat pools the computed metrics from the K folds into a single value.
 
@@ -117,7 +118,7 @@ This is very similar in spirit to the `cross_val_score <https://scikit-learn.org
 
 **Imputation methods**
 
-The following table contains the available imputation methods. We distinguish single imputation methods (aiming for pointwise accuracy, mostly deterministic) from multiple imputation methods (aiming for distribution similarity, mostly stochastic).
+The following table contains the available imputation methods. We distinguish single imputation methods (aiming for pointwise accuracy, mostly deterministic) from multiple imputation methods (aiming for distribution similarity, mostly stochastic). For further details regarding the distinction between single and multiple imputation, you can refer to the `Imputation article <https://en.wikipedia.org/wiki/Imputation_(statistics)>`_ on Wikipedia.
 
 .. list-table::
    :widths: 25 70 15 15
@@ -231,6 +232,8 @@ Selected Topics in Signal Processing 10.4 (2016): 740-756.
 [6] García, S., Luengo, J., & Herrera, F. "Data preprocessing in data mining". 2015.
 (`pdf <https://www.academia.edu/download/60477900/Garcia__Luengo__Herrera-Data_Preprocessing_in_Data_Mining_-_Springer_International_Publishing_201520190903-77973-th1o73.pdf>`__)
 
+[7] Botterman, HL., Roussel, J., Morzadec, T., Jabbari, A., Brunel, N. "Robust PCA for Anomaly Detection and Data Imputation in Seasonal Time Series" (2022) in International Conference on Machine Learning, Optimization, and Data Science. Cham: Springer Nature Switzerland, (`pdf <https://link.springer.com/chapter/10.1007/978-3-031-25891-6_21>`__)
+
 📝 License
 ==========
 

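Review note on the README hunk above: the three steps it describes (mask observed values, impute, pool the metrics over the K folds) can be sketched in plain Python. This is only an illustration under stated assumptions, not Qolmat's API: `mask_uniform`, `impute_constant` and `compare` are hypothetical helpers, the hole generator is assumed to mask uniformly at random, and the metric is assumed to be the mean absolute error on the masked entries.

```python
import math
import random
import statistics


def mask_uniform(values, ratio, rng):
    """Hypothetical hole generator: pick a share `ratio` of the observed indices."""
    observed = [i for i, v in enumerate(values) if not math.isnan(v)]
    n_masked = max(1, int(ratio * len(observed)))
    return sorted(rng.sample(observed, n_masked))


def impute_constant(values, statistic):
    """Fill every NaN with a statistic computed on the observed values."""
    fill = statistic([v for v in values if not math.isnan(v)])
    return [fill if math.isnan(v) else v for v in values]


def compare(values, imputers, n_folds=3, ratio=0.3, seed=0):
    """Return, for each imputer, its error pooled (averaged) over the folds."""
    rng = random.Random(seed)
    errors = {name: [] for name in imputers}
    for _ in range(n_folds):
        # Step 1: artificially mask some observed values.
        masked = mask_uniform(values, ratio, rng)
        corrupted = [math.nan if i in masked else v for i, v in enumerate(values)]
        # Step 2: impute with each method and score on the masked entries only.
        for name, statistic in imputers.items():
            imputed = impute_constant(corrupted, statistic)
            mae = statistics.fmean([abs(imputed[i] - values[i]) for i in masked])
            errors[name].append(mae)
    # Step 3: pool the per-fold metrics into a single value per imputer.
    return {name: statistics.fmean(maes) for name, maes in errors.items()}


data = [1.0, 2.0, math.nan, 2.0, 3.0, 100.0, 2.0, math.nan, 1.0, 3.0]
scores = compare(data, {"mean": statistics.fmean, "median": statistics.median})
```

The genuinely missing entries are never scored, since their true values are unknown; only the artificially masked ones contribute to the error.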
diff --git a/docs/analysis.rst b/docs/analysis.rst
@@ -0,0 +1,68 @@
+
+Analysis
+========
+This section gives a better understanding of the holes in a dataset.
+
+1. General approach
+-------------------
+
+As described in section :ref:`hole_generator`, there are 3 main types of missing data mechanism: MCAR, MAR and MNAR.
+The analysis module provides tools to characterize the type of holes.
+
+The MNAR case is the trickiest: the user must first consider whether their missing-data mechanism is MNAR. In the meantime, we assume that the missing-data mechanism is ignorable (i.e., it is not MNAR). If an MNAR mechanism is suspected, please see the article :ref:`An approach to test for MNAR [1]<Noonan-article>` for relevant actions.
+
+Then Qolmat proposes a test to determine whether the missing data mechanism is MCAR or MAR.
+
+2. How to use the results
+-------------------------
+
+Once the MCAR test has been run, one can decide whether or not to assume that the missing-data mechanism is MCAR. This serves three different purposes:
+
+a. Diagnosis
+^^^^^^^^^^^^
+
+If the result of the MCAR test is "The MCAR hypothesis is rejected", we can then ask over which range of values holes are more frequent.
+The test result can then be used for continuous data quality management.
+
+b. Estimation
+^^^^^^^^^^^^^
+
+Some estimation methods are not suitable for the MAR case. For example, dropping the NaNs introduces bias into the estimator unless the missing-data mechanism has first been validated as MCAR.
+
+c. Imputation
+^^^^^^^^^^^^^
+
+Qolmat allows model selection for imputation algorithms. For each of the K folds, Qolmat artificially masks a set of observed values using a default or user-specified hole generator. It seems natural to create these masks according to the same missing-data mechanism as determined by the test. Here is the documentation on using Qolmat for imputation `model selection <https://qolmat.readthedocs.io/en/latest/#:~:text=How%20does%20Qolmat%20work%20%3F>`_.
+
+3. The MCAR Tests
+-----------------
+
+There are several statistical tests to determine if the missing data mechanism is MCAR or MAR. Most tests are based on the notion of missing pattern.
+A missing pattern, also called a pattern, is the structure of observed and missing values in a dataset. For example, for a dataset with two columns, the possible patterns are: (0, 0), (1, 0), (0, 1), (1, 1). The value 1 indicates that the value in the column is missing.
+
+The MCAR missing-data mechanism means that there is independence between the presence of holes and the observed values. In other words, the data distribution is the same for all patterns.
+
+a. Little's Test
+^^^^^^^^^^^^^^^^
+
+The best-known MCAR test is the :ref:`Little [2]<Little-article>` test, and it has been implemented in :class:`LittleTest`. Keep in mind that Little's test is designed to test the homogeneity of means across the missing patterns and won't be efficient at detecting heterogeneity of covariance across missing patterns.
+
+b. PKLM Test
+^^^^^^^^^^^^
+
+The :ref:`PKLM [3]<PKLM-article>` (Projected Kullback-Leibler MCAR) test compares the distributions of different missing patterns on random projections in the variable space of the data. This recent test applies to mixed-type data. It is not implemented yet in Qolmat.
+
+References
+----------
+
+.. _Noonan-article:
+
+[1] Noonan, Jack, et al. `An integrated approach to test for missing not at random. <https://arxiv.org/abs/2208.07813>`_ arXiv preprint arXiv:2208.07813 (2022).
+
+.. _Little-article:
+
+[2] Little, R. J. A. `A Test of Missing Completely at Random for Multivariate Data with Missing Values. <https://www.tandfonline.com/doi/abs/10.1080/01621459.1988.10478722>`_ Journal of the American Statistical Association, Volume 83, 1988 - Issue 404.
+
+.. _PKLM-article:
+
+[3] Spohn, Meta-Lina, et al. `PKLM: A flexible MCAR test using Classification. <https://arxiv.org/abs/2109.10150>`_ arXiv preprint arXiv:2109.10150 (2021).
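
Review note on section 3 of docs/analysis.rst: the notion of missing pattern could be illustrated with a small example. The sketch below is not part of Qolmat; `missing_patterns` is a hypothetical helper that encodes each row as a tuple of 0 (observed) and 1 (missing), following the convention stated in the text, and counts how often each pattern occurs.

```python
import math
from collections import Counter


def missing_patterns(rows):
    """Count missing patterns; 1 marks a missing value, 0 an observed one."""
    return Counter(tuple(int(math.isnan(v)) for v in row) for row in rows)


# Two-column dataset covering the four possible patterns.
rows = [
    (1.0, 2.0),
    (math.nan, 2.5),
    (0.5, math.nan),
    (math.nan, math.nan),
    (3.0, 1.0),
]
patterns = missing_patterns(rows)
```

Under MCAR, the observed values are expected to follow the same distribution within every pattern group; Little's test checks this for the group means only.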