
Merge pull request #165 from ipums/add_new_ml_algs
Add support for XGBoost and LightGBM
riley-harper authored Dec 4, 2024
2 parents 71c4fea + 52d7721 commit c52d835
Showing 33 changed files with 1,515 additions and 179 deletions.
5 changes: 3 additions & 2 deletions .github/workflows/docker-build.yml
@@ -17,12 +17,13 @@ jobs:
fail-fast: false
matrix:
python_version: ["3.10", "3.11", "3.12"]
hlink_extras: ["dev", "dev,lightgbm,xgboost"]
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v4
- name: Build the Docker image
run: docker build . --file Dockerfile --tag $HLINK_TAG-${{ matrix.python_version}} --build-arg PYTHON_VERSION=${{ matrix.python_version }}
run: docker build . --file Dockerfile --tag $HLINK_TAG-${{ matrix.python_version}} --build-arg PYTHON_VERSION=${{ matrix.python_version }} --build-arg HLINK_EXTRAS=${{ matrix.hlink_extras }}

- name: Check dependency versions
run: |
@@ -34,7 +35,7 @@ jobs:
run: docker run $HLINK_TAG-${{ matrix.python_version}} black --check .

- name: Test
run: docker run $HLINK_TAG-${{ matrix.python_version}} pytest
run: docker run $HLINK_TAG-${{ matrix.python_version}} pytest -ra

- name: Build sdist and wheel
run: docker run $HLINK_TAG-${{ matrix.python_version}} python -m build
1 change: 1 addition & 0 deletions .gitignore
@@ -15,6 +15,7 @@ scala_jar/target
scala_jar/project/target
*.class
*.cache
.metals/

# MacOS
.DS_Store
3 changes: 2 additions & 1 deletion Dockerfile
@@ -1,5 +1,6 @@
ARG PYTHON_VERSION=3.10
FROM python:${PYTHON_VERSION}
ARG HLINK_EXTRAS=dev

RUN apt-get update && apt-get install default-jre-headless -y

@@ -8,4 +9,4 @@ WORKDIR /hlink

COPY . .
RUN python -m pip install --upgrade pip
RUN pip install -e .[dev]
RUN pip install -e .[${HLINK_EXTRAS}]
49 changes: 43 additions & 6 deletions README.md
@@ -26,19 +26,56 @@ We do our best to make hlink compatible with Python 3.10-3.12. If you have a
problem using hlink on one of these versions of Python, please open an issue
through GitHub. Versions of Python older than 3.10 are not supported.

Note that pyspark 3.5 does not yet officially support Python 3.12. If you
encounter pyspark-related import errors while running hlink on Python 3.12, try
Note that PySpark 3.5 does not yet officially support Python 3.12. If you
encounter PySpark-related import errors while running hlink on Python 3.12, try

- Installing the setuptools package. The distutils package was deleted from the
standard library in Python 3.12, but some versions of pyspark still import
standard library in Python 3.12, but some versions of PySpark still import
it. The setuptools package provides a hacky stand-in distutils library which
should fix some import errors in pyspark. We install setuptools in our
should fix some import errors in PySpark. We install setuptools in our
development and test dependencies so that our tests work on Python 3.12.

- Downgrading Python to 3.10 or 3.11. Pyspark officially supports these
versions of Python. So you should have better chances getting pyspark to work
- Downgrading Python to 3.10 or 3.11. PySpark officially supports these
versions of Python. So you should have better chances getting PySpark to work
well on Python 3.10 or 3.11.

### Additional Machine Learning Algorithms

hlink has optional support for two additional machine learning algorithms,
[XGBoost](https://xgboost.readthedocs.io/en/stable/index.html) and
[LightGBM](https://lightgbm.readthedocs.io/en/latest/index.html). Both of these
algorithms are highly performant gradient boosting libraries, each with its own
characteristics. These algorithms are not implemented directly in Spark, so
they require some additional dependencies. To install the required Python
dependencies, run

```
pip install hlink[xgboost]
```

for XGBoost or

```
pip install hlink[lightgbm]
```

for LightGBM. If you would like to install both at once, you can run

```
pip install hlink[xgboost,lightgbm]
```

to get the Python dependencies for both. Both XGBoost and LightGBM also require
libomp, which will need to be installed separately if you don't already have it.

After installing the dependencies for one or both of these algorithms, you can
use them as model types in training and model exploration. You can read more
about these models in the hlink documentation [here](https://hlink.docs.ipums.org/models.html).

*Note: The XGBoost-PySpark integration provided by the xgboost Python package is
currently unstable. So the hlink xgboost support is experimental and may change
in the future.*

## Docs

The documentation site can be found at [hlink.docs.ipums.org](https://hlink.docs.ipums.org).
1 change: 1 addition & 0 deletions docs/_sources/model_exploration.md.txt
@@ -0,0 +1 @@
# Configuring Model Exploration
195 changes: 152 additions & 43 deletions docs/_sources/models.md.txt
@@ -1,53 +1,80 @@
# Models

These are models available to be used in the model evaluation, training, and household training link tasks.

* Attributes for all models:
* `threshold` -- Type: `float`. Alpha threshold (model hyperparameter).
* `threshold_ratio` -- Type: `float`. Beta threshold (de-duplication distance ratio).
* Any parameters available in the model as defined in the Spark documentation can be passed as params using the label given in the Spark docs. Commonly used parameters are listed below with descriptive explanations from the Spark docs.
These are the machine learning models available for use in the model evaluation
and training tasks and in their household counterparts.

There are a few attributes available for all models.

* `type` -- Type: `string`. The name of the model type. The available model
types are listed below.
* `threshold` -- Type: `float`. The "alpha threshold". This is the probability
score required for a potential match to be labeled a match. `0 ≤ threshold ≤
1`.
* `threshold_ratio` -- Type: `float`. The threshold ratio or "beta threshold".
This applies to records which have multiple potential matches when
`training.decision` is set to `"drop_duplicate_with_threshold_ratio"`. For
each record, only potential matches which have the highest probability, have
a probability of at least `threshold`, *and* whose probabilities are at least
`threshold_ratio` times larger than the second-highest probability are
matches. This is sometimes called the "de-duplication distance ratio". `1 ≤
threshold_ratio < ∞`.
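
The interaction between `threshold` and `threshold_ratio` can be sketched in a few lines of Python. This is an illustrative reimplementation of the decision rule, not hlink's internal code; the function name `pick_match` is made up for the example.

```python
def pick_match(probabilities, threshold, threshold_ratio):
    """Illustrative sketch of the "drop_duplicate_with_threshold_ratio"
    decision rule: among one record's potential matches, return the index
    of the winning match, or None if no candidate qualifies.

    Not hlink's internal code -- just a demonstration of the rule.
    """
    if not probabilities:
        return None
    # Rank candidates by predicted match probability, highest first.
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i], reverse=True)
    best = ranked[0]
    second_best_prob = probabilities[ranked[1]] if len(ranked) > 1 else 0.0
    # The winner must clear the alpha threshold *and* beat the runner-up
    # by at least a factor of threshold_ratio.
    if (probabilities[best] >= threshold
            and probabilities[best] >= threshold_ratio * second_best_prob):
        return best
    return None

# 0.9 clears threshold=0.8 and is 2.25x the runner-up's 0.4, so index 0 wins.
print(pick_match([0.9, 0.4], threshold=0.8, threshold_ratio=1.5))  # 0
# 0.9 is only 1.125x the runner-up's 0.8, below threshold_ratio=1.5.
print(pick_match([0.9, 0.8], threshold=0.8, threshold_ratio=1.5))  # None
```

In the second case the two candidates are too close together, so neither is declared a match and the record is left unlinked.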

In addition, any model parameters documented in a model type's Spark
documentation can be passed as parameters to the model through hlink's
`training.chosen_model` and `training.model_exploration` configuration
sections.

Here is an example `training.chosen_model` configuration. The `type`,
`threshold`, and `threshold_ratio` attributes are hlink-specific. `maxDepth` is
a parameter to the random forest model which hlink passes through to the
underlying Spark classifier.

```toml
[training.chosen_model]
type = "random_forest"
threshold = 0.2
threshold_ratio = 1.2
maxDepth = 5
```

## random_forest

Uses [pyspark.ml.classification.RandomForestClassifier](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.RandomForestClassifier.html). Returns probability as an array.
Uses [pyspark.ml.classification.RandomForestClassifier](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.RandomForestClassifier.html).
* Parameters:
* `maxDepth` -- Type: `int`. Maximum depth of the tree. Spark default value is 5.
* `numTrees` -- Type: `int`. The number of trees to train. Spark default value is 20, must be >= 1.
* `featureSubsetStrategy` -- Type: `string`. Per the Spark docs: "The number of features to consider for splits at each tree node. Supported options: auto, all, onethird, sqrt, log2, (0.0-1.0], [1-n]."

```
model_parameters = {
type = "random_forest",
maxDepth = 5,
numTrees = 75,
featureSubsetStrategy = "sqrt",
threshold = 0.15,
threshold_ratio = 1.0
}
```

```toml
[training.chosen_model]
type = "random_forest"
threshold = 0.15
threshold_ratio = 1.0
maxDepth = 5
numTrees = 75
featureSubsetStrategy = "sqrt"
```

## probit

Uses [pyspark.ml.regression.GeneralizedLinearRegression](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.GeneralizedLinearRegression.html) with `family="binomial"` and `link="probit"`.

```
model_parameters = {
type = "probit",
threshold = 0.85,
threshold_ratio = 1.2
}
```

```toml
[training.chosen_model]
type = "probit"
threshold = 0.85
threshold_ratio = 1.2
```

## logistic_regression

Uses [pyspark.ml.classification.LogisticRegression](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.LogisticRegression.html)

```
chosen_model = {
type = "logistic_regression",
threshold = 0.5,
threshold_ratio = 1.0
}
```

```toml
[training.chosen_model]
type = "logistic_regression"
threshold = 0.5
threshold_ratio = 1.0
```

## decision_tree
@@ -59,13 +86,14 @@ Uses [pyspark.ml.classification.DecisionTreeClassifier](https://spark.apache.org
* `minInstancesPerNode` -- Type: `int`. Per the Spark docs: "Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1."
* `maxBins` -- Type: `int`. Per the Spark docs: "Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature."

```
chosen_model = {
type = "decision_tree",
maxDepth = 6,
minInstancesPerNode = 2,
maxBins = 4
}
```

```toml
[training.chosen_model]
type = "decision_tree"
threshold = 0.5
threshold_ratio = 1.5
maxDepth = 6
minInstancesPerNode = 2
maxBins = 4
```

## gradient_boosted_trees
@@ -77,13 +105,94 @@ Uses [pyspark.ml.classification.GBTClassifier](https://spark.apache.org/docs/lat
* `minInstancesPerNode` -- Type: `int`. Per the Spark docs: "Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1."
* `maxBins` -- Type: `int`. Per the Spark docs: "Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature."

```toml
[training.chosen_model]
type = "gradient_boosted_trees"
threshold = 0.7
threshold_ratio = 1.3
maxDepth = 4
minInstancesPerNode = 1
maxBins = 6
```

## xgboost

*Added in version 3.8.0.*

XGBoost is an alternate, high-performance implementation of gradient boosting.
It uses [xgboost.spark.SparkXGBClassifier](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.spark.SparkXGBClassifier).
Since the XGBoost-PySpark integration which the xgboost Python package provides
is currently unstable, support for the xgboost model type is disabled in hlink
by default. hlink will stop with an error if you try to use this model type
without enabling support for it. To enable support for xgboost, install hlink
with the `xgboost` extra.

```
chosen_model = {
type = "gradient_boosted_trees",
maxDepth = 4,
minInstancesPerNode = 1,
maxBins = 6,
threshold = 0.7,
threshold_ratio = 1.3
}
```

```
pip install hlink[xgboost]
```

This installs the xgboost package and its Python dependencies. Depending on
your machine and operating system, you may also need to install the libomp
library, which is another dependency of xgboost. xgboost should raise a helpful
error if it detects that you need to install libomp.

You can view a list of xgboost's parameters
[here](https://xgboost.readthedocs.io/en/latest/parameter.html).

```toml
[training.chosen_model]
type = "xgboost"
threshold = 0.8
threshold_ratio = 1.5
max_depth = 5
eta = 0.5
gamma = 0.05
```

## lightgbm

*Added in version 3.8.0.*

LightGBM is another alternate, high-performance implementation of gradient
boosting. It uses
[synapse.ml.lightgbm.LightGBMClassifier](https://mmlspark.blob.core.windows.net/docs/1.0.8/pyspark/synapse.ml.lightgbm.html#module-synapse.ml.lightgbm.LightGBMClassifier).
`synapse.ml` is a library which provides various integrations with PySpark,
including integrations between the C++ LightGBM library and PySpark.

LightGBM requires some additional Scala libraries that hlink does not usually
install, so support for the lightgbm model is disabled in hlink by default.
hlink will stop with an error if you try to use this model type without
enabling support for it. To enable support for lightgbm, install hlink with the
`lightgbm` extra.

```
pip install hlink[lightgbm]
```

This installs the lightgbm package and its Python dependencies. Depending on
your machine and operating system, you may also need to install the libomp
library, which is another dependency of lightgbm. If you encounter errors when
training a lightgbm model, please try installing libomp if you do not have it
installed.

lightgbm has an enormous number of available parameters. Many of these are
available as normal in hlink, via the [LightGBMClassifier
class](https://mmlspark.blob.core.windows.net/docs/1.0.8/pyspark/synapse.ml.lightgbm.html#module-synapse.ml.lightgbm.LightGBMClassifier).
Others are available through the special `passThroughArgs` parameter, which
passes additional parameters through to the C++ library. You can see a full
list of the supported parameters
[here](https://lightgbm.readthedocs.io/en/latest/Parameters.html).

```toml
[training.chosen_model]
type = "lightgbm"
# hlink's threshold and threshold_ratio
threshold = 0.8
threshold_ratio = 1.5
# LightGBMClassifier supports these parameters (and many more).
maxDepth = 5
learningRate = 0.5
# LightGBMClassifier does not directly support this parameter,
# so we have to send it to the C++ library with passThroughArgs.
passThroughArgs = "force_row_wise=true"
```
1 change: 1 addition & 0 deletions docs/column_mappings.html
@@ -402,6 +402,7 @@ <h1 class="logo"><a href="index.html">hlink</a></h1>
<li class="toctree-l1"><a class="reference internal" href="pipeline_features.html">Pipeline Features</a></li>
<li class="toctree-l1"><a class="reference internal" href="substitutions.html">Substitutions</a></li>
<li class="toctree-l1"><a class="reference internal" href="models.html">Models</a></li>
<li class="toctree-l1"><a class="reference internal" href="model_exploration.html">Model Exploration</a></li>
</ul>

<div class="relations">
1 change: 1 addition & 0 deletions docs/comparison_features.html
@@ -1301,6 +1301,7 @@ <h1 class="logo"><a href="index.html">hlink</a></h1>
<li class="toctree-l1"><a class="reference internal" href="pipeline_features.html">Pipeline Features</a></li>
<li class="toctree-l1"><a class="reference internal" href="substitutions.html">Substitutions</a></li>
<li class="toctree-l1"><a class="reference internal" href="models.html">Models</a></li>
<li class="toctree-l1"><a class="reference internal" href="model_exploration.html">Model Exploration</a></li>
</ul>

<div class="relations">
1 change: 1 addition & 0 deletions docs/comparisons.html
@@ -197,6 +197,7 @@ <h1 class="logo"><a href="index.html">hlink</a></h1>
<li class="toctree-l1"><a class="reference internal" href="pipeline_features.html">Pipeline Features</a></li>
<li class="toctree-l1"><a class="reference internal" href="substitutions.html">Substitutions</a></li>
<li class="toctree-l1"><a class="reference internal" href="models.html">Models</a></li>
<li class="toctree-l1"><a class="reference internal" href="model_exploration.html">Model Exploration</a></li>
</ul>

<div class="relations">
1 change: 1 addition & 0 deletions docs/config.html
@@ -958,6 +958,7 @@ <h1 class="logo"><a href="index.html">hlink</a></h1>
<li class="toctree-l1"><a class="reference internal" href="pipeline_features.html">Pipeline Features</a></li>
<li class="toctree-l1"><a class="reference internal" href="substitutions.html">Substitutions</a></li>
<li class="toctree-l1"><a class="reference internal" href="models.html">Models</a></li>
<li class="toctree-l1"><a class="reference internal" href="model_exploration.html">Model Exploration</a></li>
</ul>

<div class="relations">
1 change: 1 addition & 0 deletions docs/feature_selection_transforms.html
@@ -220,6 +220,7 @@ <h1 class="logo"><a href="index.html">hlink</a></h1>
<li class="toctree-l1"><a class="reference internal" href="pipeline_features.html">Pipeline Features</a></li>
<li class="toctree-l1"><a class="reference internal" href="substitutions.html">Substitutions</a></li>
<li class="toctree-l1"><a class="reference internal" href="models.html">Models</a></li>
<li class="toctree-l1"><a class="reference internal" href="model_exploration.html">Model Exploration</a></li>
</ul>

<div class="relations">
2 changes: 2 additions & 0 deletions docs/index.html
@@ -135,6 +135,8 @@ <h1>Configuration API<a class="headerlink" href="#configuration-api" title="Link
<li class="toctree-l2"><a class="reference internal" href="models.html#logistic-regression">logistic_regression</a></li>
<li class="toctree-l2"><a class="reference internal" href="models.html#decision-tree">decision_tree</a></li>
<li class="toctree-l2"><a class="reference internal" href="models.html#gradient-boosted-trees">gradient_boosted_trees</a></li>
<li class="toctree-l2"><a class="reference internal" href="models.html#xgboost">xgboost</a></li>
<li class="toctree-l2"><a class="reference internal" href="models.html#lightgbm">lightgbm</a></li>
</ul>
</li>
</ul>
