make release-tag: Merge branch 'master' into stable

sdv-dev · Jul 12, 2021 · fbf610f
2 parents f43c0fc + ee3e243
Showing 27 changed files with 3,988 additions and 472 deletions.
diff --git a/HISTORY.md b/HISTORY.md
@@ -1,5 +1,35 @@
 # Release Notes
 
+## 0.11.0 - 2021-07-12
+
+This release primarily addresses bugs and feature requests related to using constraints for the single-table models.
+Users can now enforce scalar comparison with the existing `GreaterThan` constraint and apply 5 new constraints: `OneHotEncoding`, `Positive`, `Negative`, `Between` and `Rounding`.
+Additionally, the SDV will now auto-apply constraints for rounding numerical values, and for keeping the data within the observed bounds.
+All related user guides are updated with the new functionality.
+
+### New Features
+
+* Add OneHotEncoding Constraint - Issue [#303](https://github.com/sdv-dev/SDV/issues/303) by @fealho
+* GreaterThan Constraint should apply to scalars - Issue [#410](https://github.com/sdv-dev/SDV/issues/410) by @amontanez24
+* Improve GreaterThan constraint - Issue [#368](https://github.com/sdv-dev/SDV/issues/368) by @amontanez24
+* Add Non-negative and Positive constraints across multiple columns - Issue [#409](https://github.com/sdv-dev/SDV/issues/409) by @amontanez24
+* Add Between values constraint - Issue [#367](https://github.com/sdv-dev/SDV/issues/367) by @fealho
+* Ensure values fall within the specified range - Issue [#423](https://github.com/sdv-dev/SDV/issues/423) by @amontanez24
+* Add Rounding constraint - Issue [#482](https://github.com/sdv-dev/SDV/issues/482) by @katxiao
+* Add rounding and min/max arguments that are passed down to the NumericalTransformer - Issue [#491](https://github.com/sdv-dev/SDV/issues/491) by @amontanez24
+
+### Bugs Fixed
+
+* GreaterThan constraint between Date columns raises TypeError - Issue [#421](https://github.com/sdv-dev/SDV/issues/421) by @amontanez24
+* GreaterThan constraint's transform strategy fails on columns that are not float - Issue [#448](https://github.com/sdv-dev/SDV/issues/448) by @amontanez24
+* AttributeError on UniqueCombinations constraint with non-strings - Issue [#196](https://github.com/sdv-dev/SDV/issues/196) by @katxiao
+* Use reject sampling to sample missing columns for constraints - Issue [#435](https://github.com/sdv-dev/SDV/issues/435) by @amontanez24
+
+### Documentation Changes
+
+* Ensure privacy metrics are available in the API docs - Issue [#458](https://github.com/sdv-dev/SDV/issues/458) by @fealho
+* Ensure formula constraint is called ColumnFormula everywhere in the docs - Issue [#449](https://github.com/sdv-dev/SDV/issues/449) by @fealho
+
 ## 0.10.1 - 2021-06-10
 
 This release changes the way we sample conditions to not only group by the conditions passed by the user, but also by the transformed conditions that result from them.

diff --git a/conda/meta.yaml b/conda/meta.yaml
@@ -1,5 +1,5 @@
 {% set name = 'sdv' %}
-{% set version = '0.10.1' %}
+{% set version = '0.11.0.dev1' %}
 
 package:
   name: "{{ name|lower }}"
@@ -26,10 +26,10 @@ requirements:
     - pytorch >=1.4,<2
     - tqdm >=4.14,<5
     - copulas >=0.5.0,<0.6
-    - ctgan >=0.4.2,<0.5
+    - ctgan >=0.4.3,<0.5
     - deepecho >=0.2.0,<0.3
-    - rdt >=0.4.1,<0.5
-    - sdmetrics >=0.3.0,<0.4
+    - rdt >=0.5.0,<0.6
+    - sdmetrics >=0.3.1,<0.4
     - torchvision >=0.5.0,<1
     - sktime >=0.4,<0.6
     - pomegranate >=0.13.4,<0.14.2
@@ -43,10 +43,10 @@ requirements:
     - pytorch >=1.4,<2
     - tqdm >=4.14,<5
     - copulas >=0.5.0,<0.6
-    - ctgan >=0.4.2,<0.5
+    - ctgan >=0.4.3,<0.5
     - deepecho >=0.2.0,<0.3
-    - rdt >=0.4.1,<0.5
-    - sdmetrics >=0.3.0,<0.4
+    - rdt >=0.5.0,<0.6
+    - sdmetrics >=0.3.1,<0.4
     - torchvision >=0.5.0,<1
     - sktime >=0.4,<0.6
     - pomegranate >=0.13.4,<0.14.2
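
Reviewer note, not part of the commit: each pin in the requirements above is a half-open range, with an inclusive lower bound and an exclusive upper bound. The sketch below only illustrates that reading. It is not how conda resolves versions, `parse_version` and `in_range` are hypothetical helpers, and pre-release tags such as `.dev1` are not handled.

```python
def parse_version(text, width=3):
    """Turn '0.4.3' into (0, 4, 3), padding short versions such as '0.5' with zeros."""
    parts = tuple(int(part) for part in text.split("."))
    return parts + (0,) * (width - len(parts))


def in_range(version, lower, upper):
    """Return True when lower <= version < upper (upper bound is exclusive)."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


# A pin written as "ctgan >=0.4.3,<0.5" accepts 0.4.3 but not 0.4.2 or 0.5.0.
print(in_range("0.4.3", "0.4.3", "0.5"))
print(in_range("0.4.2", "0.4.3", "0.5"))
print(in_range("0.5.0", "0.4.3", "0.5"))
```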

diff --git a/docs/api_reference/constraints/tabular.rst b/docs/api_reference/constraints/tabular.rst
@@ -53,6 +53,38 @@ GreaterThan
    GreaterThan.from_dict
    GreaterThan.to_dict
 
+Positive
+~~~~~~~~~~~~~~~~
+
+.. autosummary::
+   :toctree: api/
+
+   Positive
+   Positive.fit
+   Positive.transform
+   Positive.fit_transform
+   Positive.reverse_transform
+   Positive.is_valid
+   Positive.filter_valid
+   Positive.from_dict
+   Positive.to_dict
+
+Negative
+~~~~~~~~~~~~~~~~
+
+.. autosummary::
+   :toctree: api/
+
+   Negative
+   Negative.fit
+   Negative.transform
+   Negative.fit_transform
+   Negative.reverse_transform
+   Negative.is_valid
+   Negative.filter_valid
+   Negative.from_dict
+   Negative.to_dict
+
 ColumnFormula
 ~~~~~~~~~~~~~~~~
 
@@ -68,3 +100,51 @@ ColumnFormula
    ColumnFormula.filter_valid
    ColumnFormula.from_dict
    ColumnFormula.to_dict
+
+Between
+~~~~~~~
+
+.. autosummary::
+   :toctree: api/
+
+   Between
+   Between.fit
+   Between.transform
+   Between.fit_transform
+   Between.reverse_transform
+   Between.is_valid
+   Between.filter_valid
+   Between.from_dict
+   Between.to_dict
+
+Rounding
+~~~~~~~~
+
+.. autosummary::
+   :toctree: api/
+
+   Rounding
+   Rounding.fit
+   Rounding.transform
+   Rounding.fit_transform
+   Rounding.reverse_transform
+   Rounding.is_valid
+   Rounding.filter_valid
+   Rounding.from_dict
+   Rounding.to_dict
+
+OneHotEncoding
+~~~~~~~~~~~~~~~~
+
+.. autosummary::
+   :toctree: api/
+
+   OneHotEncoding
+   OneHotEncoding.fit
+   OneHotEncoding.transform
+   OneHotEncoding.fit_transform
+   OneHotEncoding.reverse_transform
+   OneHotEncoding.is_valid
+   OneHotEncoding.filter_valid
+   OneHotEncoding.from_dict
+   OneHotEncoding.to_dict
diff --git a/docs/api_reference/metrics/tabular.rst b/docs/api_reference/metrics/tabular.rst
@@ -115,3 +115,50 @@ Single Table Efficacy Metrics
     MLPRegressor
     MLPRegressor.get_subclasses
     MLPRegressor.compute
+
+Single Table Privacy Metrics
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autosummary::
+   :toctree: api/
+
+    CategoricalPrivacyMetric
+    CategoricalPrivacyMetric.get_subclasses
+    NumericalPrivacyMetric
+    NumericalPrivacyMetric.get_subclasses
+    CategoricalCAP
+    CategoricalCAP.get_subclasses
+    CategoricalCAP.compute
+    CategoricalZeroCAP
+    CategoricalZeroCAP.get_subclasses
+    CategoricalZeroCAP.compute
+    CategoricalGeneralizedCAP
+    CategoricalGeneralizedCAP.get_subclasses
+    CategoricalGeneralizedCAP.compute
+    CategoricalKNN
+    CategoricalKNN.get_subclasses
+    CategoricalKNN.compute
+    CategoricalNB
+    CategoricalNB.get_subclasses
+    CategoricalNB.compute
+    CategoricalRF
+    CategoricalRF.get_subclasses
+    CategoricalRF.compute
+    CategoricalSVM
+    CategoricalSVM.get_subclasses
+    CategoricalSVM.compute
+    NumericalMLP
+    NumericalMLP.get_subclasses
+    NumericalMLP.compute
+    NumericalLR
+    NumericalLR.get_subclasses
+    NumericalLR.compute
+    NumericalSVR
+    NumericalSVR.get_subclasses
+    NumericalSVR.compute
+    CategoricalEnsemble
+    CategoricalEnsemble.get_subclasses
+    CategoricalEnsemble.compute
+    NumericalRadiusNearestNeighbor
+    NumericalRadiusNearestNeighbor.get_subclasses
+    NumericalRadiusNearestNeighbor.compute
diff --git a/docs/user_guides/single_table/constraints.rst b/docs/user_guides/single_table/constraints.rst
@@ -61,6 +61,13 @@ If we observe the data closely we will find a few **constraints**:
    years passed since they joined the company, which means that the
    ``years_in_the_company`` will always be equal to the ``age`` minus
    the ``age_when_joined``.
+4. We have a ``salary`` column that should always be rounded to 2
+   decimal places.
+5. The ``age`` column is bounded, since realistically an employee can only be
+   so old (or so young).
+6. The ``full_time``, ``part_time`` and ``contractor`` columns
+   are related in such a way that one of them will always be one and the others
+   zero, since the employee must be part of one of the three categories.
 
 How does SDV Handle Constraints?
 --------------------------------
@@ -150,7 +157,47 @@ passing:
         handling_strategy='reject_sampling'
     )
 
-CustomFormula Constraint
+The ``GreaterThan`` constraint can also be used to guarantee a column is greater
+than a scalar value or specific datetime value instead of another column. To use
+this functionality, we can pass:
+
+-  a scalar value for ``low`` or for ``high``, in place of a column name
+-  the name of the column for the other bound
+-  optionally, a boolean indicating that ``low`` or ``high`` is a scalar
+
+.. ipython:: python
+    :okwarning:
+
+    salary_gt_30000_constraint = GreaterThan(
+        low=30000,
+        high='salary',
+        handling_strategy='reject_sampling'
+    )
+
+
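Reviewer note, not part of the commit: a plain-Python sketch of what the scalar form of the check amounts to. It is not SDV's implementation, and `resolve_bound` and `greater_than_is_valid` are hypothetical helpers. The assumption here is that a bound is read from the row when it names a column and is used as a literal value otherwise.

```python
def resolve_bound(bound, row):
    """Return row[bound] when `bound` names a column, else treat it as a scalar."""
    if isinstance(bound, str) and bound in row:
        return row[bound]
    return bound


def greater_than_is_valid(rows, low, high, strict=False):
    """Flag each row where high > low (or high >= low when strict is False)."""
    flags = []
    for row in rows:
        low_value = resolve_bound(low, row)
        high_value = resolve_bound(high, row)
        flags.append(high_value > low_value if strict else high_value >= low_value)
    return flags


rows = [{"salary": 45000.0}, {"salary": 28000.0}, {"salary": 30000.0}]
print(greater_than_is_valid(rows, low=30000, high="salary"))
print(greater_than_is_valid(rows, low=30000, high="salary", strict=True))
```

With `strict=False` the row whose salary is exactly 30000 passes; with `strict=True` it does not.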
+Positive and Negative Constraints
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Similar to the ``GreaterThan`` constraint, we can use the ``Positive``
+or ``Negative`` constraints. These constraints enforce that a specified
+column is always positive or negative. We can create an instance passing:
+
+- the name of the ``low`` column for ``Negative`` or the name of the ``high`` column for ``Positive``
+- the handling strategy that we want to use
+- a boolean specifying whether to make the data strictly above or below 0, or include 0 as a possible value
+
+.. ipython:: python
+    :okwarning:
+
+    from sdv.constraints import Positive
+
+    positive_prior_exp_constraint = Positive(
+        high='prior_years_experience',
+        strict=False,
+        handling_strategy='reject_sampling'
+    )
+
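Reviewer note, not part of the commit: the example above uses the ``reject_sampling`` strategy. The general idea of rejection sampling can be sketched as below: draw candidates, keep the valid ones, and repeat until enough rows are collected. This is an illustration only, not SDV's sampling code; `sample_with_rejection`, the Gaussian sampler and the `max_tries` limit are assumptions made for the example.

```python
import random


def sample_with_rejection(sampler, is_valid, num_rows, max_tries=100):
    """Draw batches from `sampler`, keep valid values, stop once num_rows are kept."""
    accepted = []
    for _ in range(max_tries):
        batch = [sampler() for _ in range(num_rows)]
        accepted.extend(value for value in batch if is_valid(value))
        if len(accepted) >= num_rows:
            return accepted[:num_rows]
    raise ValueError("could not sample enough valid rows")


rng = random.Random(0)  # fixed seed so the run is repeatable

# Non-strict positivity: 0 is allowed, mirroring strict=False in the example above.
prior_years_experience = sample_with_rejection(
    sampler=lambda: rng.gauss(2.0, 3.0),
    is_valid=lambda value: value >= 0,
    num_rows=5,
)
print(prior_years_experience)
```

The trade-off is that the model is left untouched, at the cost of discarding some samples; a validity rule that rejects most candidates makes this slow.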
+ColumnFormula Constraint
 ~~~~~~~~~~~~~~~~~~~~~~~~
 
 In some cases, one column will need to be computed based on the other
@@ -184,6 +231,78 @@ constraint by passing it:
         handling_strategy='transform'
     )
 
+Rounding Constraint
+~~~~~~~~~~~~~~~~~~~
+
+For the data to look realistic, we might also want to round values
+to a certain number of decimal digits. To do this, we can use the
+``Rounding`` constraint. We will pass this constraint:
+
+-  the name of the column(s) that should be rounded
+-  the number of digits each column should be rounded to
+-  the handling strategy that we want to use
+-  (optional) when using reject sampling, a threshold for how close
+   the sampled values must be to their rounded form
+
+.. ipython:: python
+    :okwarning:
+
+    from sdv.constraints import Rounding
+
+    salary_rounding_constraint = Rounding(
+        columns='salary',
+        digits=2,
+        handling_strategy='transform'
+    )
+
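Reviewer note, not part of the commit: a plain-Python sketch of the two ways a rounding rule can be enforced. Under a transform-style strategy the values are simply rounded; under reject sampling each value is checked against its rounded form. The helper names and the default `tolerance` are assumptions for illustration, not SDV's API.

```python
def round_column(values, digits):
    """Transform-style handling: round every value to `digits` decimal digits."""
    return [round(value, digits) for value in values]


def is_rounded(values, digits, tolerance=None):
    """Reject-sampling-style check: is each value within `tolerance` of its rounded form?"""
    if tolerance is None:
        # Assumed default: one order of magnitude below the last kept digit.
        tolerance = 10 ** -(digits + 1)
    return [abs(value - round(value, digits)) <= tolerance for value in values]


salaries = [31234.5678, 40000.0]
print(round_column(salaries, digits=2))
print(is_rounded([31234.57, 31234.5678], digits=2))
```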
+Between Constraint
+~~~~~~~~~~~~~~~~~~
+
+Another possibility is the ``Between`` constraint. It guarantees
+that one column is always between two other columns or values. For example,
+the ``age`` column in our demo data is realistically bounded to the ages of
+15 and 90, since actual employees won't be too young or too old.
+
+In order to use it, we need to create an instance passing:
+
+-  the name of the ``low`` column or a scalar value to be used as the lower bound
+-  the name of the ``high`` column or a scalar value to be used as the upper bound
+-  the handling strategy that we want to use
+
+.. ipython:: python
+    :okwarning:
+    
+    from sdv.constraints import Between
+
+    reasonable_age_constraint = Between(
+        column='age',
+        low=15,
+        high=90,
+        handling_strategy='transform'
+    )
+
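Reviewer note, not part of the commit: one common way to enforce fixed bounds with a transform is to map the bounded column onto the whole real line before modeling and map samples back afterwards, so that any sampled value lands inside the range. The logit/sigmoid sketch below illustrates that idea under stated assumptions; the helper names and the `margin` value are hypothetical, and this is not presented as SDV's exact transform.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1 / (1 + math.exp(-x))
    e = math.exp(x)
    return e / (1 + e)


def to_unbounded(value, low, high, margin=0.025):
    """Map a value in [low, high] onto the real line with a logit."""
    scaled = (value - low) / (high - low)        # now in [0, 1]
    scaled = scaled * (1 - 2 * margin) + margin  # keep away from 0 and 1
    return math.log(scaled / (1 - scaled))


def to_bounded(value, low, high, margin=0.025):
    """Inverse mapping: any real number comes back inside [low, high]."""
    scaled = (sigmoid(value) - margin) / (1 - 2 * margin)
    result = scaled * (high - low) + low
    return min(max(result, low), high)           # clip the margin overshoot


print(to_bounded(to_unbounded(42, 15, 90), 15, 90))  # round trip, approximately 42
print(to_bounded(1000.0, 15, 90))                    # extreme sample is clipped to the upper bound
```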
+OneHotEncoding Constraint
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Another constraint available is the ``OneHotEncoding`` constraint.
+This constraint allows the user to specify a list of columns where each row
+is a one-hot vector. Then, the constraint will make sure that the output
+of the model is transformed so that the column with the largest value is
+set to 1 while all other columns are set to 0. To apply the constraint we
+need to create an instance passing:
+
+- A list of the names of the columns of interest
+- The strategy we want to use (``transform`` is recommended)
+
+.. ipython:: python
+    :okwarning:
+
+    from sdv.constraints import OneHotEncoding
+
+    one_hot_constraint = OneHotEncoding(
+        columns=['full_time', 'part_time', 'contractor']
+    )
+
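Reviewer note, not part of the commit: the rule described above (largest value becomes 1, all others 0) can be sketched in plain Python as follows. `to_one_hot` and `is_one_hot` are hypothetical helpers for illustration; breaking ties in favor of the first column is an assumption of this sketch.

```python
def to_one_hot(row):
    """Set the position of the largest value to 1 and every other position to 0."""
    best = max(range(len(row)), key=row.__getitem__)  # first index wins on ties
    return [1 if index == best else 0 for index in range(len(row))]


def is_one_hot(row):
    """A row is valid when it contains only 0s and 1s and exactly one 1."""
    return all(value in (0, 1) for value in row) and sum(row) == 1


# Columns, in order: full_time, part_time, contractor
raw_model_output = [0.2, 0.7, 0.1]
print(to_one_hot(raw_model_output))
print(is_one_hot(to_one_hot(raw_model_output)))
```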
 Using the Constraints
 ---------------------
 
@@ -200,7 +319,12 @@ constraints that we just defined as a ``list``:
     constraints = [
         unique_company_department_constraint,
         age_gt_age_when_joined_constraint,
-        years_in_the_company_constraint
+        years_in_the_company_constraint,
+        salary_gt_30000_constraint,
+        positive_prior_exp_constraint,
+        salary_rounding_constraint,
+        reasonable_age_constraint,
+        one_hot_constraint
     ]
 
     gc = GaussianCopula(constraints=constraints)