make release-tag: Merge branch 'master' into stable
npatki committed Jul 12, 2021
2 parents f43c0fc + ee3e243 commit fbf610f
Showing 27 changed files with 3,988 additions and 472 deletions.
30 changes: 30 additions & 0 deletions HISTORY.md
@@ -1,5 +1,35 @@
# Release Notes

## 0.11.0 - 2021-07-12

This release primarily addresses bugs and feature requests related to using constraints for the single-table models.
Users can now enforce scalar comparison with the existing `GreaterThan` constraint and apply 5 new constraints: `OneHotEncoding`, `Positive`, `Negative`, `Between` and `Rounding`.
Additionally, the SDV will now auto-apply constraints for rounding numerical values, and for keeping the data within the observed bounds.
All related user guides are updated with the new functionality.

### New Features

* Add OneHotEncoding Constraint - Issue [#303](https://github.com/sdv-dev/SDV/issues/303) by @fealho
* GreaterThan Constraint should apply to scalars - Issue [#410](https://github.com/sdv-dev/SDV/issues/410) by @amontanez24
* Improve GreaterThan constraint - Issue [#368](https://github.com/sdv-dev/SDV/issues/368) by @amontanez24
* Add Non-negative and Positive constraints across multiple columns - Issue [#409](https://github.com/sdv-dev/SDV/issues/409) by @amontanez24
* Add Between values constraint - Issue [#367](https://github.com/sdv-dev/SDV/issues/367) by @fealho
* Ensure values fall within the specified range - Issue [#423](https://github.com/sdv-dev/SDV/issues/423) by @amontanez24
* Add Rounding constraint - Issue [#482](https://github.com/sdv-dev/SDV/issues/482) by @katxiao
* Add rounding and min/max arguments that are passed down to the NumericalTransformer - Issue [#491](https://github.com/sdv-dev/SDV/issues/491) by @amontanez24

### Bugs Fixed

* GreaterThan constraint between Date columns raises TypeError - Issue [#421](https://github.com/sdv-dev/SDV/issues/421) by @amontanez24
* GreaterThan constraint's transform strategy fails on columns that are not float - Issue [#448](https://github.com/sdv-dev/SDV/issues/448) by @amontanez24
* AttributeError on UniqueCombinations constraint with non-strings - Issue [#196](https://github.com/sdv-dev/SDV/issues/196) by @katxiao
* Use reject sampling to sample missing columns for constraints - Issue [#435](https://github.com/sdv-dev/SDV/issues/435) by @amontanez24

### Documentation Changes

* Ensure privacy metrics are available in the API docs - Issue [#458](https://github.com/sdv-dev/SDV/issues/458) by @fealho
* Ensure formula constraint is called ColumnFormula everywhere in the docs - Issue [#449](https://github.com/sdv-dev/SDV/issues/449) by @fealho

## 0.10.1 - 2021-06-10

This release changes the way we sample conditions to not only group by the conditions passed by the user, but also by the transformed conditions that result from them.
14 changes: 7 additions & 7 deletions conda/meta.yaml
@@ -1,5 +1,5 @@
{% set name = 'sdv' %}
{% set version = '0.10.1' %}
{% set version = '0.11.0.dev1' %}

package:
name: "{{ name|lower }}"
@@ -26,10 +26,10 @@ requirements:
- pytorch >=1.4,<2
- tqdm >=4.14,<5
- copulas >=0.5.0,<0.6
- ctgan >=0.4.2,<0.5
- ctgan >=0.4.3,<0.5
- deepecho >=0.2.0,<0.3
- rdt >=0.4.1,<0.5
- sdmetrics >=0.3.0,<0.4
- rdt >=0.5.0,<0.5
- sdmetrics >=0.3.1,<0.4
- torchvision >=0.5.0,<1
- sktime >=0.4,<0.6
- pomegranate >=0.13.4,<0.14.2
@@ -43,10 +43,10 @@ requirements:
- pytorch >=1.4,<2
- tqdm >=4.14,<5
- copulas >=0.5.0,<0.6
- ctgan >=0.4.2,<0.5
- ctgan >=0.4.3,<0.5
- deepecho >=0.2.0,<0.3
- rdt >=0.4.1,<0.5
- sdmetrics >=0.3.0,<0.4
- rdt >=0.5.0,<0.5
- sdmetrics >=0.3.1,<0.4
- torchvision >=0.5.0,<1
- sktime >=0.4,<0.6
- pomegranate >=0.13.4,<0.14.2
80 changes: 80 additions & 0 deletions docs/api_reference/constraints/tabular.rst
@@ -53,6 +53,38 @@ GreaterThan
GreaterThan.from_dict
GreaterThan.to_dict

Positive
~~~~~~~~~~~~~~~~

.. autosummary::
:toctree: api/

Positive
Positive.fit
Positive.transform
Positive.fit_transform
Positive.reverse_transform
Positive.is_valid
Positive.filter_valid
Positive.from_dict
Positive.to_dict

Negative
~~~~~~~~~~~~~~~~

.. autosummary::
:toctree: api/

Negative
Negative.fit
Negative.transform
Negative.fit_transform
Negative.reverse_transform
Negative.is_valid
Negative.filter_valid
Negative.from_dict
Negative.to_dict

ColumnFormula
~~~~~~~~~~~~~~~~

@@ -68,3 +100,51 @@ ColumnFormula
ColumnFormula.filter_valid
ColumnFormula.from_dict
ColumnFormula.to_dict

Between
~~~~~~~

.. autosummary::
:toctree: api/

Between
Between.fit
Between.transform
Between.fit_transform
Between.reverse_transform
Between.is_valid
Between.filter_valid
Between.from_dict
Between.to_dict

Rounding
~~~~~~~~

.. autosummary::
:toctree: api/

Rounding
Rounding.fit
Rounding.transform
Rounding.fit_transform
Rounding.reverse_transform
Rounding.is_valid
Rounding.filter_valid
Rounding.from_dict
Rounding.to_dict

OneHotEncoding
~~~~~~~~~~~~~~~~

.. autosummary::
:toctree: api/

OneHotEncoding
OneHotEncoding.fit
OneHotEncoding.transform
OneHotEncoding.fit_transform
OneHotEncoding.reverse_transform
OneHotEncoding.is_valid
OneHotEncoding.filter_valid
OneHotEncoding.from_dict
OneHotEncoding.to_dict
47 changes: 47 additions & 0 deletions docs/api_reference/metrics/tabular.rst
@@ -115,3 +115,50 @@ Single Table Efficacy Metrics
MLPRegressor
MLPRegressor.get_subclasses
MLPRegressor.compute

Single Table Privacy Metrics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autosummary::
:toctree: api/

CategoricalPrivacyMetric
CategoricalPrivacyMetric.get_subclasses
NumericalPrivacyMetric
NumericalPrivacyMetric.get_subclasses
CategoricalCAP
CategoricalCAP.get_subclasses
CategoricalCAP.compute
CategoricalZeroCAP
CategoricalZeroCAP.get_subclasses
CategoricalZeroCAP.compute
CategoricalGeneralizedCAP
CategoricalGeneralizedCAP.get_subclasses
CategoricalGeneralizedCAP.compute
CategoricalKNN
CategoricalKNN.get_subclasses
CategoricalKNN.compute
CategoricalNB
CategoricalNB.get_subclasses
CategoricalNB.compute
CategoricalRF
CategoricalRF.get_subclasses
CategoricalRF.compute
CategoricalSVM
CategoricalSVM.get_subclasses
CategoricalSVM.compute
NumericalMLP
NumericalMLP.get_subclasses
NumericalMLP.compute
NumericalLR
NumericalLR.get_subclasses
NumericalLR.compute
NumericalSVR
NumericalSVR.get_subclasses
NumericalSVR.compute
CategoricalEnsemble
CategoricalEnsemble.get_subclasses
CategoricalEnsemble.compute
NumericalRadiusNearestNeighbor
NumericalRadiusNearestNeighbor.get_subclasses
NumericalRadiusNearestNeighbor.compute
128 changes: 126 additions & 2 deletions docs/user_guides/single_table/constraints.rst
@@ -61,6 +61,13 @@ If we observe the data closely we will find a few **constraints**:
years passed since they joined the company, which means that the
``years_in_the_company`` will always be equal to the ``age`` minus
the ``age_when_joined``.
4. We have a ``salary`` column that should always be rounded to 2
decimal places.
5. The ``age`` column is bounded, since realistically an employee can only be
so old (or so young).
6. The ``full_time``, ``part_time`` and ``contractor`` columns
are related in such a way that exactly one of them is always one and the others
are zero, since each employee must belong to one of the three categories.

How does SDV Handle Constraints?
--------------------------------
@@ -150,7 +157,47 @@ passing:
handling_strategy='reject_sampling'
)
CustomFormula Constraint
The ``GreaterThan`` constraint can also be used to guarantee a column is greater
than a scalar value or specific datetime value instead of another column. To use
this functionality, we can pass:

- the scalar value for ``low``
- the scalar value for ``high``
- a boolean indicating whether ``low`` or ``high`` is a scalar

.. ipython:: python
    :okwarning:

    salary_gt_30000_constraint = GreaterThan(
        low=30000,
        high='salary',
        handling_strategy='reject_sampling'
    )

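
As a rough illustration of what a ``reject_sampling`` strategy implies for a scalar bound, the sketch below repeatedly draws rows and keeps only those whose ``salary`` exceeds 30000. It does not use SDV: ``reject_sample`` and ``sample_salaries`` are hypothetical helpers, the sampler is a stand-in for a fitted model, and the strict ``>`` comparison is an assumption.

```python
import numpy as np
import pandas as pd


def reject_sample(sample_fn, is_valid, num_rows, max_tries=100):
    """Draw batches from ``sample_fn`` and keep only rows where ``is_valid`` is True."""
    kept = []
    total = 0
    for _ in range(max_tries):
        batch = sample_fn(num_rows)
        batch = batch[is_valid(batch)]  # discard rows that break the constraint
        kept.append(batch)
        total += len(batch)
        if total >= num_rows:
            break
    return pd.concat(kept, ignore_index=True).head(num_rows)


rng = np.random.default_rng(0)


def sample_salaries(n):
    # Hypothetical stand-in for sampling from a fitted model.
    return pd.DataFrame({'salary': rng.normal(40000, 15000, size=n)})


valid_rows = reject_sample(
    sample_salaries,
    lambda df: df['salary'] > 30000,  # assumption: strict comparison
    num_rows=50,
)
```

Every row in ``valid_rows`` satisfies the bound, at the cost of sampling more rows than are finally returned.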
Positive and Negative Constraints
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Similar to the ``GreaterThan`` constraint, we can use the ``Positive``
or ``Negative`` constraints. These constraints enforce that a specified
column is always positive or negative. We can create an instance passing:

- the name of the ``low`` column for ``Negative`` or the name of the ``high`` column for ``Positive``
- the handling strategy that we want to use
- a boolean specifying whether to make the data strictly above or below 0, or include 0 as a possible value

.. ipython:: python
    :okwarning:

    from sdv.constraints import Positive

    positive_prior_exp_constraint = Positive(
        high='prior_years_experience',
        strict=False,
        handling_strategy='reject_sampling'
    )

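
The effect of the strictness flag can be sketched with plain pandas. This is only an illustration of the semantics described above, not SDV's implementation; ``is_positive`` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd


def is_positive(data, column, strict=False):
    """Return a boolean Series marking the rows that satisfy the constraint."""
    if strict:
        return data[column] > 0   # 0 is not allowed
    return data[column] >= 0      # 0 is allowed


experience = pd.DataFrame({'prior_years_experience': [-1, 0, 3]})

non_strict = is_positive(experience, 'prior_years_experience', strict=False)
strict = is_positive(experience, 'prior_years_experience', strict=True)
```

With ``strict=False`` the row containing ``0`` is accepted, while with ``strict=True`` it is rejected.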
ColumnFormula Constraint
~~~~~~~~~~~~~~~~~~~~~~~~

In some cases, one column will need to be computed based on the other
@@ -184,6 +231,78 @@ constraint by passing it:
handling_strategy='transform'
)
Rounding Constraint
~~~~~~~~~~~~~~~~~~~

In order for data to be realistic, we also might want to round data
to a certain number of digits. To do this, we can use the ``Rounding``
constraint. We will pass this constraint:

- the name of the column(s) that should be rounded
- the number of digits each column should be rounded to
- the handling strategy that we want to use
- (optional) when using reject sampling, a threshold for accepting
  the sampled values

.. ipython:: python
    :okwarning:

    from sdv.constraints import Rounding

    salary_rounding_constraint = Rounding(
        columns='salary',
        digits=2,
        handling_strategy='transform'
    )

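
The end result of rounding a column to a fixed number of digits can be previewed with pandas alone. The sketch below is not SDV code; ``round_columns`` is a hypothetical helper and the salaries are invented for illustration.

```python
import pandas as pd


def round_columns(data, columns, digits):
    """Return a copy of ``data`` with ``columns`` rounded to ``digits`` decimals."""
    result = data.copy()
    result[columns] = result[columns].round(digits)
    return result


sampled = pd.DataFrame({'salary': [45123.4567, 30000.0, 61999.999]})
rounded = round_columns(sampled, ['salary'], digits=2)
```

After the call, no value in ``rounded['salary']`` carries more than two decimal places.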
Between Constraint
~~~~~~~~~~~~~~~~~~

Another possibility is the ``Between`` constraint. It guarantees
that one column always lies between two other columns or values. For example,
the ``age`` column in our demo data is realistically bounded to the ages of
15 and 90, since actual employees won't be too young or too old.

In order to use it, we need to create an instance passing:

- the name of the ``column`` to constrain
- the name of the ``low`` column or a scalar value to be used as the lower bound
- the name of the ``high`` column or a scalar value to be used as the upper bound
- the handling strategy that we want to use

.. ipython:: python
    :okwarning:

    from sdv.constraints import Between

    reasonable_age_constraint = Between(
        column='age',
        low=15,
        high=90,
        handling_strategy='transform'
    )

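
To see which rows such a bound would accept, here is a small pandas sketch. It is independent of SDV: ``is_between`` is a hypothetical helper, the ages are made up, and treating both bounds as inclusive is an assumption.

```python
import pandas as pd


def is_between(data, column, low, high):
    """Return a boolean Series that is True where low <= data[column] <= high."""
    return data[column].between(low, high)  # inclusive on both ends by default


employees = pd.DataFrame({'age': [14, 15, 42, 90, 91]})

valid = is_between(employees, 'age', low=15, high=90)
accepted = employees[valid]
```

Here the rows with ages 14 and 91 fall outside the range and are excluded from ``accepted``.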
OneHotEncoding Constraint
~~~~~~~~~~~~~~~~~~~~~~~~~

Another constraint available is the ``OneHotEncoding`` constraint.
This constraint allows the user to specify a list of columns where each row
is a one hot vector. Then, the constraint will make sure that the output
of the model is transformed so that the column with the largest value is
set to 1 while all other columns are set to 0. To apply the constraint we
need to create an instance passing:

- A list of the names of the columns of interest
- The strategy we want to use (``transform`` is recommended)

.. ipython:: python
    :okwarning:

    from sdv.constraints import OneHotEncoding

    one_hot_constraint = OneHotEncoding(
        columns=['full_time', 'part_time', 'contractor']
    )

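
The "largest value becomes 1, everything else becomes 0" rule described above can be sketched with NumPy and pandas. This is an illustration of the idea rather than SDV's own code; ``to_one_hot`` is a hypothetical helper and the raw values are invented to resemble unconstrained model output.

```python
import numpy as np
import pandas as pd


def to_one_hot(data, columns):
    """Set the largest value in each row of ``columns`` to 1 and the rest to 0."""
    values = data[columns].to_numpy(dtype=float)
    winners = values.argmax(axis=1)  # position of the largest value in each row
    one_hot = np.zeros_like(values)
    one_hot[np.arange(len(values)), winners] = 1.0
    result = data.copy()
    result[columns] = one_hot
    return result


columns = ['full_time', 'part_time', 'contractor']
raw = pd.DataFrame({
    'full_time': [0.8, 0.1, 0.3],
    'part_time': [0.3, 0.7, 0.2],
    'contractor': [-0.1, 0.4, 0.9],
})

fixed = to_one_hot(raw, columns)
```

Each row of ``fixed`` now has exactly one of the three columns set to 1, so every employee belongs to exactly one category.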
Using the Constraints
---------------------

@@ -200,7 +319,12 @@ constraints that we just defined as a ``list``:
constraints = [
unique_company_department_constraint,
age_gt_age_when_joined_constraint,
years_in_the_company_constraint
years_in_the_company_constraint,
salary_gt_30000_constraint,
positive_prior_exp_constraint,
salary_rounding_constraint,
reasonable_age_constraint,
one_hot_constraint
]
gc = GaussianCopula(constraints=constraints)