
Merge pull request #165 from ipums/add_new_ml_algs
Add support for XGBoost and LightGBM
riley-harper authored Dec 4, 2024
2 parents 71c4fea + 52d7721 commit c52d835
Showing 33 changed files with 1,515 additions and 179 deletions.
5 changes: 3 additions & 2 deletions .github/workflows/docker-build.yml
@@ -17,12 +17,13 @@ jobs:
fail-fast: false
matrix:
python_version: ["3.10", "3.11", "3.12"]
hlink_extras: ["dev", "dev,lightgbm,xgboost"]
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v4
- name: Build the Docker image
run: docker build . --file Dockerfile --tag $HLINK_TAG-${{ matrix.python_version}} --build-arg PYTHON_VERSION=${{ matrix.python_version }}
run: docker build . --file Dockerfile --tag $HLINK_TAG-${{ matrix.python_version}} --build-arg PYTHON_VERSION=${{ matrix.python_version }} --build-arg HLINK_EXTRAS=${{ matrix.hlink_extras }}

- name: Check dependency versions
run: |
@@ -34,7 +35,7 @@ jobs:
run: docker run $HLINK_TAG-${{ matrix.python_version}} black --check .

- name: Test
run: docker run $HLINK_TAG-${{ matrix.python_version}} pytest
run: docker run $HLINK_TAG-${{ matrix.python_version}} pytest -ra

- name: Build sdist and wheel
run: docker run $HLINK_TAG-${{ matrix.python_version}} python -m build
1 change: 1 addition & 0 deletions .gitignore
@@ -15,6 +15,7 @@ scala_jar/target
scala_jar/project/target
*.class
*.cache
.metals/

# MacOS
.DS_Store
3 changes: 2 additions & 1 deletion Dockerfile
@@ -1,5 +1,6 @@
ARG PYTHON_VERSION=3.10
FROM python:${PYTHON_VERSION}
ARG HLINK_EXTRAS=dev

RUN apt-get update && apt-get install default-jre-headless -y

@@ -8,4 +9,4 @@ WORKDIR /hlink

COPY . .
RUN python -m pip install --upgrade pip
RUN pip install -e .[dev]
RUN pip install -e .[${HLINK_EXTRAS}]
49 changes: 43 additions & 6 deletions README.md
@@ -26,19 +26,56 @@ We do our best to make hlink compatible with Python 3.10-3.12. If you have a
problem using hlink on one of these versions of Python, please open an issue
through GitHub. Versions of Python older than 3.10 are not supported.

Note that pyspark 3.5 does not yet officially support Python 3.12. If you
encounter pyspark-related import errors while running hlink on Python 3.12, try
Note that PySpark 3.5 does not yet officially support Python 3.12. If you
encounter PySpark-related import errors while running hlink on Python 3.12, try

- Installing the setuptools package. The distutils package was deleted from the
standard library in Python 3.12, but some versions of pyspark still import
standard library in Python 3.12, but some versions of PySpark still import
it. The setuptools package provides a hacky stand-in distutils library which
should fix some import errors in pyspark. We install setuptools in our
should fix some import errors in PySpark. We install setuptools in our
development and test dependencies so that our tests work on Python 3.12.

- Downgrading Python to 3.10 or 3.11. Pyspark officially supports these
versions of Python. So you should have better chances getting pyspark to work
- Downgrading Python to 3.10 or 3.11. PySpark officially supports these
versions of Python. So you should have better chances getting PySpark to work
well on Python 3.10 or 3.11.

### Additional Machine Learning Algorithms

hlink has optional support for two additional machine learning algorithms,
[XGBoost](https://xgboost.readthedocs.io/en/stable/index.html) and
[LightGBM](https://lightgbm.readthedocs.io/en/latest/index.html). Both of these
algorithms are highly performant gradient boosting libraries, each with its own
characteristics. These algorithms are not implemented directly in Spark, so
they require some additional dependencies. To install the required Python
dependencies, run

```
pip install hlink[xgboost]
```

for XGBoost or

```
pip install hlink[lightgbm]
```

for LightGBM. If you would like to install both at once, you can run

```
pip install hlink[xgboost,lightgbm]
```

to get the Python dependencies for both. Both XGBoost and LightGBM also require
libomp, which will need to be installed separately if you don't already have it.

After installing the dependencies for one or both of these algorithms, you can
use them as model types in training and model exploration. You can read more
about these models in the hlink documentation [here](https://hlink.docs.ipums.org/models.html).

*Note: The XGBoost-PySpark integration provided by the xgboost Python package is
currently unstable. So the hlink xgboost support is experimental and may change
in the future.*

## Docs

The documentation site can be found at [hlink.docs.ipums.org](https://hlink.docs.ipums.org).
1 change: 1 addition & 0 deletions docs/_sources/model_exploration.md.txt
@@ -0,0 +1 @@
# Configuring Model Exploration
195 changes: 152 additions & 43 deletions docs/_sources/models.md.txt
@@ -1,53 +1,80 @@
# Models

These are models available to be used in the model evaluation, training, and household training link tasks.

* Attributes for all models:
* `threshold` -- Type: `float`. Alpha threshold (model hyperparameter).
* `threshold_ratio` -- Type: `float`. Beta threshold (de-duplication distance ratio).
* Any parameters available in the model as defined in the Spark documentation can be passed as params using the label given in the Spark docs. Commonly used parameters are listed below with descriptive explanations from the Spark docs.
These are the machine learning models available for use in the model evaluation
and training tasks and in their household counterparts.

There are a few attributes available for all models.

* `type` -- Type: `string`. The name of the model type. The available model
types are listed below.
* `threshold` -- Type: `float`. The "alpha threshold". This is the probability
score required for a potential match to be labeled a match. `0 ≤ threshold ≤
1`.
* `threshold_ratio` -- Type: `float`. The threshold ratio or "beta threshold".
This applies to records which have multiple potential matches when
`training.decision` is set to `"drop_duplicate_with_threshold_ratio"`. For
each record, only potential matches which have the highest probability, have
a probability of at least `threshold`, *and* whose probabilities are at least
`threshold_ratio` times larger than the second-highest probability are
matches. This is sometimes called the "de-duplication distance ratio". `1 ≤
threshold_ratio < ∞`.
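
The interaction between `threshold` and `threshold_ratio` can be sketched in a few lines of Python. This is an illustrative reimplementation of the decision rule, not hlink's internal code; the function name `pick_match` is made up for the example.

```python
def pick_match(probabilities, threshold, threshold_ratio):
    """Illustrative sketch of the "drop_duplicate_with_threshold_ratio"
    decision rule: among one record's potential matches, return the index
    of the winning match, or None if no candidate qualifies.

    Not hlink's internal code -- just a demonstration of the rule.
    """
    if not probabilities:
        return None
    # Rank candidates by predicted match probability, highest first.
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i], reverse=True)
    best = ranked[0]
    second_best_prob = probabilities[ranked[1]] if len(ranked) > 1 else 0.0
    # The winner must clear the alpha threshold *and* beat the runner-up
    # by at least a factor of threshold_ratio.
    if (probabilities[best] >= threshold
            and probabilities[best] >= threshold_ratio * second_best_prob):
        return best
    return None

# 0.9 clears threshold=0.8 and is 2.25x the runner-up's 0.4, so index 0 wins.
print(pick_match([0.9, 0.4], threshold=0.8, threshold_ratio=1.5))  # 0
# 0.9 is only 1.125x the runner-up's 0.8, below threshold_ratio=1.5.
print(pick_match([0.9, 0.8], threshold=0.8, threshold_ratio=1.5))  # None
```

In the second case the two candidates are too close together, so neither is declared a match and the record is left unlinked.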

In addition, any model parameters documented in a model type's Spark
documentation can be passed as parameters to the model through hlink's
`training.chosen_model` and `training.model_exploration` configuration
sections.

Here is an example `training.chosen_model` configuration. The `type`,
`threshold`, and `threshold_ratio` attributes are hlink-specific. `maxDepth` is
a parameter to the random forest model which hlink passes through to the
underlying Spark classifier.

```toml
[training.chosen_model]
type = "random_forest"
threshold = 0.2
threshold_ratio = 1.2
maxDepth = 5
```

## random_forest

Uses [pyspark.ml.classification.RandomForestClassifier](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.RandomForestClassifier.html). Returns probability as an array.
Uses [pyspark.ml.classification.RandomForestClassifier](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.RandomForestClassifier.html).
* Parameters:
* `maxDepth` -- Type: `int`. Maximum depth of the tree. Spark default value is 5.
* `numTrees` -- Type: `int`. The number of trees to train. Spark default value is 20, must be >= 1.
* `featureSubsetStrategy` -- Type: `string`. Per the Spark docs: "The number of features to consider for splits at each tree node. Supported options: auto, all, onethird, sqrt, log2, (0.0-1.0], [1-n]."

```
model_parameters = {
type = "random_forest",
maxDepth = 5,
numTrees = 75,
featureSubsetStrategy = "sqrt",
threshold = 0.15,
threshold_ratio = 1.0
}
```

```toml
[training.chosen_model]
type = "random_forest"
threshold = 0.15
threshold_ratio = 1.0
maxDepth = 5
numTrees = 75
featureSubsetStrategy = "sqrt"
```

## probit

Uses [pyspark.ml.regression.GeneralizedLinearRegression](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.GeneralizedLinearRegression.html) with `family="binomial"` and `link="probit"`.

```
model_parameters = {
type = "probit",
threshold = 0.85,
threshold_ratio = 1.2
}
```

```toml
[training.chosen_model]
type = "probit"
threshold = 0.85
threshold_ratio = 1.2
```

## logistic_regression

Uses [pyspark.ml.classification.LogisticRegression](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.LogisticRegression.html)

```
chosen_model = {
type = "logistic_regression",
threshold = 0.5,
threshold_ratio = 1.0
}
```

```toml
[training.chosen_model]
type = "logistic_regression"
threshold = 0.5
threshold_ratio = 1.0
```

## decision_tree
@@ -59,13 +86,14 @@ Uses [pyspark.ml.classification.DecisionTreeClassifier](https://spark.apache.org
* `minInstancesPerNode` -- Type: `int`. Per the Spark docs: "Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1."
* `maxBins` -- Type: `int`. Per the Spark docs: "Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature."

```
chosen_model = {
type = "decision_tree",
maxDepth = 6,
minInstancesPerNode = 2,
maxBins = 4
}
```

```toml
[training.chosen_model]
type = "decision_tree"
threshold = 0.5
threshold_ratio = 1.5
maxDepth = 6
minInstancesPerNode = 2
maxBins = 4
```

## gradient_boosted_trees
@@ -77,13 +105,94 @@ Uses [pyspark.ml.classification.GBTClassifier](https://spark.apache.org/docs/lat
* `minInstancesPerNode` -- Type: `int`. Per the Spark docs: "Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1."
* `maxBins` -- Type: `int`. Per the Spark docs: "Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature."

```toml
[training.chosen_model]
type = "gradient_boosted_trees"
threshold = 0.7
threshold_ratio = 1.3
maxDepth = 4
minInstancesPerNode = 1
maxBins = 6
```

## xgboost

*Added in version 3.8.0.*

XGBoost is an alternate, high-performance implementation of gradient boosting.
It uses [xgboost.spark.SparkXGBClassifier](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.spark.SparkXGBClassifier).
Since the XGBoost-PySpark integration which the xgboost Python package provides
is currently unstable, support for the xgboost model type is disabled in hlink
by default. hlink will stop with an error if you try to use this model type
without enabling support for it. To enable support for xgboost, install hlink
with the `xgboost` extra.

```
chosen_model = {
type = "gradient_boosted_trees",
maxDepth = 4,
minInstancesPerNode = 1,
maxBins = 6,
threshold = 0.7,
threshold_ratio = 1.3
}
```

```
pip install hlink[xgboost]
```

This installs the xgboost package and its Python dependencies. Depending on
your machine and operating system, you may also need to install the libomp
library, which is another dependency of xgboost. xgboost should raise a helpful
error if it detects that you need to install libomp.

You can view a list of xgboost's parameters
[here](https://xgboost.readthedocs.io/en/latest/parameter.html).

```toml
[training.chosen_model]
type = "xgboost"
threshold = 0.8
threshold_ratio = 1.5
max_depth = 5
eta = 0.5
gamma = 0.05
```

## lightgbm

*Added in version 3.8.0.*

LightGBM is another alternate, high-performance implementation of gradient
boosting. It uses
[synapse.ml.lightgbm.LightGBMClassifier](https://mmlspark.blob.core.windows.net/docs/1.0.8/pyspark/synapse.ml.lightgbm.html#module-synapse.ml.lightgbm.LightGBMClassifier).
`synapse.ml` is a library which provides various integrations with PySpark,
including integrations between the C++ LightGBM library and PySpark.

LightGBM requires some additional Scala libraries that hlink does not usually
install, so support for the lightgbm model is disabled in hlink by default.
hlink will stop with an error if you try to use this model type without
enabling support for it. To enable support for lightgbm, install hlink with the
`lightgbm` extra.

```
pip install hlink[lightgbm]
```

This installs the lightgbm package and its Python dependencies. Depending on
your machine and operating system, you may also need to install the libomp
library, which is another dependency of lightgbm. If you encounter errors when
training a lightgbm model, please try installing libomp if you do not have it
installed.

lightgbm has an enormous number of available parameters. Many of these are
available as normal in hlink, via the [LightGBMClassifier
class](https://mmlspark.blob.core.windows.net/docs/1.0.8/pyspark/synapse.ml.lightgbm.html#module-synapse.ml.lightgbm.LightGBMClassifier).
Others are available through the special `passThroughArgs` parameter, which
passes additional parameters through to the C++ library. You can see a full
list of the supported parameters
[here](https://lightgbm.readthedocs.io/en/latest/Parameters.html).

```toml
[training.chosen_model]
type = "lightgbm"
# hlink's threshold and threshold_ratio
threshold = 0.8
threshold_ratio = 1.5
# LightGBMClassifier supports these parameters (and many more).
maxDepth = 5
learningRate = 0.5
# LightGBMClassifier does not directly support this parameter,
# so we have to send it to the C++ library with passThroughArgs.
passThroughArgs = "force_row_wise=true"
```
1 change: 1 addition & 0 deletions docs/column_mappings.html
@@ -402,6 +402,7 @@ <h1 class="logo"><a href="index.html">hlink</a></h1>
<li class="toctree-l1"><a class="reference internal" href="pipeline_features.html">Pipeline Features</a></li>
<li class="toctree-l1"><a class="reference internal" href="substitutions.html">Substitutions</a></li>
<li class="toctree-l1"><a class="reference internal" href="models.html">Models</a></li>
<li class="toctree-l1"><a class="reference internal" href="model_exploration.html">Model Exploration</a></li>
</ul>

<div class="relations">
1 change: 1 addition & 0 deletions docs/comparison_features.html
@@ -1301,6 +1301,7 @@ <h1 class="logo"><a href="index.html">hlink</a></h1>
<li class="toctree-l1"><a class="reference internal" href="pipeline_features.html">Pipeline Features</a></li>
<li class="toctree-l1"><a class="reference internal" href="substitutions.html">Substitutions</a></li>
<li class="toctree-l1"><a class="reference internal" href="models.html">Models</a></li>
<li class="toctree-l1"><a class="reference internal" href="model_exploration.html">Model Exploration</a></li>
</ul>

<div class="relations">
1 change: 1 addition & 0 deletions docs/comparisons.html
@@ -197,6 +197,7 @@ <h1 class="logo"><a href="index.html">hlink</a></h1>
<li class="toctree-l1"><a class="reference internal" href="pipeline_features.html">Pipeline Features</a></li>
<li class="toctree-l1"><a class="reference internal" href="substitutions.html">Substitutions</a></li>
<li class="toctree-l1"><a class="reference internal" href="models.html">Models</a></li>
<li class="toctree-l1"><a class="reference internal" href="model_exploration.html">Model Exploration</a></li>
</ul>

<div class="relations">
1 change: 1 addition & 0 deletions docs/config.html
@@ -958,6 +958,7 @@ <h1 class="logo"><a href="index.html">hlink</a></h1>
<li class="toctree-l1"><a class="reference internal" href="pipeline_features.html">Pipeline Features</a></li>
<li class="toctree-l1"><a class="reference internal" href="substitutions.html">Substitutions</a></li>
<li class="toctree-l1"><a class="reference internal" href="models.html">Models</a></li>
<li class="toctree-l1"><a class="reference internal" href="model_exploration.html">Model Exploration</a></li>
</ul>

<div class="relations">
1 change: 1 addition & 0 deletions docs/feature_selection_transforms.html
@@ -220,6 +220,7 @@ <h1 class="logo"><a href="index.html">hlink</a></h1>
<li class="toctree-l1"><a class="reference internal" href="pipeline_features.html">Pipeline Features</a></li>
<li class="toctree-l1"><a class="reference internal" href="substitutions.html">Substitutions</a></li>
<li class="toctree-l1"><a class="reference internal" href="models.html">Models</a></li>
<li class="toctree-l1"><a class="reference internal" href="model_exploration.html">Model Exploration</a></li>
</ul>

<div class="relations">
2 changes: 2 additions & 0 deletions docs/index.html
@@ -135,6 +135,8 @@ <h1>Configuration API<a class="headerlink" href="#configuration-api" title="Link
<li class="toctree-l2"><a class="reference internal" href="models.html#logistic-regression">logistic_regression</a></li>
<li class="toctree-l2"><a class="reference internal" href="models.html#decision-tree">decision_tree</a></li>
<li class="toctree-l2"><a class="reference internal" href="models.html#gradient-boosted-trees">gradient_boosted_trees</a></li>
<li class="toctree-l2"><a class="reference internal" href="models.html#xgboost">xgboost</a></li>
<li class="toctree-l2"><a class="reference internal" href="models.html#lightgbm">lightgbm</a></li>
</ul>
</li>
</ul>
