FairGBM refactor (multi-threading etc.) #18

Open · wants to merge 32 commits into base: main-fairgbm
Commits (32, showing changes from all commits)
2bd6b99
added FairGBM config examples
AndreFCruz Jun 24, 2022
7ea1fbc
Python API implementation for the FairGBM algorithm
AndreFCruz Jun 24, 2022
408e7fc
corrected typo support->supports
AndreFCruz Jun 27, 2022
cbb9deb
added check for input_weights > 0
AndreFCruz Jun 27, 2022
6619d0e
minor change for python readability
AndreFCruz Jun 27, 2022
ae6942a
reordering arguments of _construct_dataset calls for consistency
AndreFCruz Jun 27, 2022
51be6be
fixed typo in examples configs
AndreFCruz Jun 27, 2022
d9f63a5
added type checks to SetIntField for constraint_group data
AndreFCruz Jun 27, 2022
b5e139e
omit aliases of constraint_group for getting and setting field
AndreFCruz Jul 1, 2022
44f0047
removing ifdef guards from debug output
AndreFCruz Jul 1, 2022
42addb6
added guards against using int8_t with C_API
AndreFCruz Jul 1, 2022
bd75b7f
changed SetConstraintGroup from taking constraint_group_t* to int*
AndreFCruz Jul 1, 2022
dc5f863
removing useless max_value call
AndreFCruz Jul 1, 2022
376800f
constraint_group_t is now int (was int32_t)
AndreFCruz Jul 4, 2022
7016f9d
added GE and LT checks for constraint group value
AndreFCruz Jul 6, 2022
bc3bbc6
SetConstraintGroupAt now has a generic type
AndreFCruz Jul 6, 2022
2d90ebd
drafting tests with COMPAS data
AndreFCruz Jun 24, 2022
16d5e20
added FairGBM example with BAF-base dataset
AndreFCruz Jun 28, 2022
779c51f
minor clarification in example confs
AndreFCruz Jun 28, 2022
48b3b76
moved FairGBM examples
AndreFCruz Jun 28, 2022
f9feb53
python package requirements for testing
AndreFCruz Jun 28, 2022
48926e3
util for loading BAF data in python
AndreFCruz Jun 28, 2022
329c297
implemented FairGBM python tests
AndreFCruz Jun 29, 2022
43f5ccd
adding tests for FairGBM FPR and FNR fairness
AndreFCruz Jun 30, 2022
eaa54d6
tests should now pass
AndreFCruz Jun 30, 2022
aed7564
minor fix
AndreFCruz Jun 30, 2022
1a71cfe
added fairgbm main branch to python tests workflow
AndreFCruz Jul 1, 2022
c8339fb
large constraint gradients loop is now parallelized
AndreFCruz Jul 4, 2022
48b4aea
parallelizing computation of all metrics
AndreFCruz Jul 4, 2022
bc4cb17
changed atomic int to omp reduction clause
AndreFCruz Jul 4, 2022
4883893
removed parallelization of group-wise metrics
AndreFCruz Jul 4, 2022
bccc70e
loss_proxies to stop doing string comparisons
AndreFCruz Jul 5, 2022
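
Two of the threading commits above (c8339fb, which parallelizes the large constraint-gradients loop, and bc4cb17, which swaps an atomic int for an omp reduction clause) follow a standard OpenMP pattern. The snippet below is a minimal standalone sketch of that pattern only; the loop body and names are illustrative, not the actual FairGBM code.

```cpp
#include <omp.h>

#include <cstdio>
#include <vector>

int main() {
  // Illustrative stand-in for per-instance group ids.
  std::vector<int> group(1000000);
  for (size_t i = 0; i < group.size(); ++i) group[i] = static_cast<int>(i % 2);

  // Instead of serializing every increment with "#pragma omp atomic",
  // a reduction gives each thread a private counter that OpenMP merges
  // once when the parallel loop ends.
  int count = 0;
  #pragma omp parallel for schedule(static) reduction(+ : count)
  for (int i = 0; i < static_cast<int>(group.size()); ++i) {
    if (group[i] == 1) ++count;
  }

  std::printf("count = %d\n", count);
  return 0;
}
```

The reduction avoids contention on the shared counter, which matters in the tight per-instance loops these commits touch.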
2 changes: 2 additions & 0 deletions .github/workflows/python_package.yml
@@ -4,9 +4,11 @@ on:
push:
branches:
- master
- main-fairgbm
pull_request:
branches:
- master
- main-fairgbm

env:
CONDA_ENV: test-env
2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -178,7 +178,7 @@ if(USE_CUDA)

LIST(APPEND CMAKE_CUDA_FLAGS ${CUDA_ARCH_FLAGS})
if(USE_DEBUG)
SET(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -g")
SET(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -g -fno-omit-frame-pointer")
else()
SET(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -O3 -lineinfo")
endif()
2,131 changes: 2,131 additions & 0 deletions examples/FairGBM-other/COMPAS.test

Large diffs are not rendered by default.

3,149 changes: 3,149 additions & 0 deletions examples/FairGBM-other/COMPAS.train

Large diffs are not rendered by default.

38 changes: 38 additions & 0 deletions examples/FairGBM-other/README.md
@@ -0,0 +1,38 @@
Binary Classification with Fairness Constraints
===============================================
> Example of training a binary classifier with fairness constraints.

Dataset source: https://github.com/propublica/compas-analysis/

***You must follow the [installation instructions](https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html)
for the following commands to work. The `lightgbm` binary must be built and available at the root of this project.***


Training
--------
Run the following command in this folder to train **FairGBM**:

```bash
"../../lightgbm" config=train.conf
```

To train vanilla LightGBM on the same data, use:
```bash
"../../lightgbm" config=train_unconstrained.conf
```

Prediction
----------

You should finish training first, since prediction loads the model files written by the training configs.

Run the following command in this folder to compute test predictions for **FairGBM**:

```bash
"../../lightgbm" config=predict.conf
```

To compute test predictions for vanilla LightGBM, use:
```bash
"../../lightgbm" config=predict_unconstrained.conf
```
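
The CLI examples above pass the protected group through `constraint_group_column`. Per the commits in this PR (d9f63a5, 42addb6, bd75b7f), the group can also be attached programmatically as a plain-`int` dataset field through the C API, with `int8_t` explicitly rejected. The sketch below shows how that might look; the `"constraint_group"` field name follows this PR's `SetIntField` changes, and the handle setup and error handling are illustrative assumptions, not confirmed API usage.

```cpp
#include <LightGBM/c_api.h>

#include <cstdio>
#include <vector>

// Sketch: attach per-instance constraint groups to an existing dataset.
// Assumes `dataset` was created elsewhere (e.g. LGBM_DatasetCreateFromFile)
// and holds `num_rows` rows.
int SetConstraintGroups(DatasetHandle dataset, int num_rows) {
  // Group ids are plain ints; this PR guards against int8_t in the C API.
  std::vector<int> groups(num_rows);
  for (int i = 0; i < num_rows; ++i) {
    groups[i] = i % 2;  // illustrative binary protected attribute
  }
  // "constraint_group" is the field name assumed from this PR's changes.
  const int err = LGBM_DatasetSetField(dataset, "constraint_group",
                                       groups.data(), num_rows,
                                       C_API_DTYPE_INT32);
  if (err != 0) {
    std::fprintf(stderr, "SetField failed: %s\n", LGBM_GetLastError());
  }
  return err;
}
```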
11 changes: 11 additions & 0 deletions examples/FairGBM-other/predict.conf
@@ -0,0 +1,11 @@
task = predict

data = COMPAS.test

has_header = true

label_column = name:two_year_recid

input_model = FairGBM_model.txt

output_result = FairGBM_predictions.txt
11 changes: 11 additions & 0 deletions examples/FairGBM-other/predict_unconstrained.conf
@@ -0,0 +1,11 @@
task = predict

data = COMPAS.test

has_header = true

label_column = name:two_year_recid

input_model = LightGBM_model.txt

output_result = LightGBM_predictions.txt
123 changes: 123 additions & 0 deletions examples/FairGBM-other/train.conf
@@ -0,0 +1,123 @@
# task type, supports train and predict
task = train

# boosting type, supports gbdt for now, alias: boosting, boost
boosting_type = gbdt

# application type, supports the following applications:
# regression , regression task
# binary , binary classification task
# lambdarank , LambdaRank task
# constrained_cross_entropy , constrained optimization task (classification)
# alias: application, app
objective = constrained_cross_entropy # training FairGBM!

# eval metrics, supports multiple metrics delimited by ',', including the following:
# l1
# l2 , default metric for regression
# ndcg , default metric for lambdarank
# auc
# binary_logloss , default metric for binary
# binary_error
metric = binary_logloss,auc

# frequency for metric output
metric_freq = 1

# true if need output metric for training data, alias: training_metric, train_metric
is_training_metric = true

# number of bins for feature buckets; 255 is a recommended setting, as it saves memory while keeping good accuracy
max_bin = 255

# training data
# if a weight file exists, it should be named "COMPAS.train.weight"
# alias: train_data, train
data = COMPAS.train

# validation data, supports multiple validation sets, separated by ','
# if a weight file exists, it should be named "COMPAS.test.weight"
# alias: valid, test, test_data
valid_data = COMPAS.test

# first row has column names
has_header = true

# name of label column
label_column = name:two_year_recid

# number of trees (iterations), alias: num_tree, num_iteration, num_iterations, num_round, num_rounds
num_trees = 200

# shrinkage rate , alias: shrinkage_rate
learning_rate = 0.1

# number of leaves for one tree, alias: num_leaf
num_leaves = 63

# type of tree learner, supports the following types:
# serial , single machine version
# feature , use feature parallel to train
# data , use data parallel to train
# voting , use voting based parallel to train
# alias: tree
tree_learner = serial

# number of threads for multi-threading; each thread uses one CPU core. The default is the number of CPUs.
# num_threads = 8

# feature sub-sampling; will randomly select 80% of features to train on at each iteration
# alias: sub_feature
feature_fraction = 0.8

# bagging (data sub-sampling); will perform bagging every 5 iterations
bagging_freq = 5

# bagging fraction; will randomly select 80% of the data when bagging
# alias: sub_row
bagging_fraction = 0.8

# minimal number of data points in one leaf; use this to deal with over-fitting
# alias: min_data_per_leaf, min_data
min_data_in_leaf = 50

# max depth of each individual base learner
max_depth = 10

# minimal sum of Hessians in one leaf; use this to deal with over-fitting
min_sum_hessian_in_leaf = 5.0

# saves memory and runs faster on sparse features, alias: is_sparse
is_enable_sparse = true

# set this to true when data is bigger than memory; otherwise, false is faster
# alias: two_round_loading, two_round
use_two_round_loading = false

# true to save data to a binary file; the application will auto-load data from the binary file next time
# alias: is_save_binary, save_binary
is_save_binary_file = false

# output model file
output_model = FairGBM_model.txt

# output prediction file for predict task
# output_result= prediction.txt

# fixed random state for reproducibility
# (results may still vary when multi-threaded)
random_state = 42

# ** FairGBM-specific parameters **
constraint_stepwise_proxy = cross_entropy
constraint_group_column = name:race_Caucasian
constraint_type = fpr
constraint_fpr_threshold = 0
proxy_margin = 1
score_threshold = 0.5
lagrangian_learning_rate = 200

# global_constraint_type = fpr,fnr  # no global constraint in this example, as we're just aiming to maximize accuracy
# global_score_threshold = 0.5
# global_target_fpr = 0.05
# global_target_fnr = 0.30
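
For context on the block above: FairGBM trains under fairness constraints by running gradient descent-ascent on a Lagrangian, with `constraint_stepwise_proxy` selecting a differentiable proxy used for the constraint gradients at each boosting step. The formulation below is a sketch of that general setup from the FairGBM paper, not a transcription of this PR's code; eta_lambda corresponds to `lagrangian_learning_rate` above.

```latex
% Min-max training problem over the model f and multipliers \lambda:
\min_{f} \; \max_{\lambda \ge 0} \;
  L(f, \lambda) = \ell(f) + \sum_{i} \lambda_i \, c_i(f)

% Each boosting round takes a projected ascent step on the multipliers;
% c_i(f) \le 0 encodes a constraint such as a group's FPR staying within
% constraint_fpr_threshold:
\lambda_i \leftarrow \max\bigl(0, \; \lambda_i + \eta_\lambda \, c_i(f)\bigr)
```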
109 changes: 109 additions & 0 deletions examples/FairGBM-other/train_unconstrained.conf
@@ -0,0 +1,109 @@
# task type, supports train and predict
task = train

# boosting type, supports gbdt for now, alias: boosting, boost
boosting_type = gbdt

# application type, supports the following applications:
# regression , regression task
# binary , binary classification task
# lambdarank , LambdaRank task
# constrained_cross_entropy , constrained optimization task (classification)
# alias: application, app
objective = cross_entropy # training vanilla LightGBM!

# eval metrics, supports multiple metrics delimited by ',', including the following:
# l1
# l2 , default metric for regression
# ndcg , default metric for lambdarank
# auc
# binary_logloss , default metric for binary
# binary_error
metric = binary_logloss,auc

# frequency for metric output
metric_freq = 1

# true if need output metric for training data, alias: training_metric, train_metric
is_training_metric = true

# number of bins for feature buckets; 255 is a recommended setting, as it saves memory while keeping good accuracy
max_bin = 255

# training data
# if a weight file exists, it should be named "COMPAS.train.weight"
# alias: train_data, train
data = COMPAS.train

# validation data, supports multiple validation sets, separated by ','
# if a weight file exists, it should be named "COMPAS.test.weight"
# alias: valid, test, test_data
valid_data = COMPAS.test

# first row has column names
has_header = true

# name of label column
label_column = name:two_year_recid

# number of trees (iterations), alias: num_tree, num_iteration, num_iterations, num_round, num_rounds
num_trees = 200

# shrinkage rate , alias: shrinkage_rate
learning_rate = 0.1

# number of leaves for one tree, alias: num_leaf
num_leaves = 63

# type of tree learner, supports the following types:
# serial , single machine version
# feature , use feature parallel to train
# data , use data parallel to train
# voting , use voting based parallel to train
# alias: tree
tree_learner = serial

# number of threads for multi-threading; each thread uses one CPU core. The default is the number of CPUs.
# num_threads = 8

# feature sub-sampling; will randomly select 80% of features to train on at each iteration
# alias: sub_feature
feature_fraction = 0.8

# bagging (data sub-sampling); will perform bagging every 5 iterations
bagging_freq = 5

# bagging fraction; will randomly select 80% of the data when bagging
# alias: sub_row
bagging_fraction = 0.8

# minimal number of data points in one leaf; use this to deal with over-fitting
# alias: min_data_per_leaf, min_data
min_data_in_leaf = 50

# max depth of each individual base learner
max_depth = 10

# minimal sum of Hessians in one leaf; use this to deal with over-fitting
min_sum_hessian_in_leaf = 5.0

# saves memory and runs faster on sparse features, alias: is_sparse
is_enable_sparse = true

# set this to true when data is bigger than memory; otherwise, false is faster
# alias: two_round_loading, two_round
use_two_round_loading = false

# true to save data to a binary file; the application will auto-load data from the binary file next time
# alias: is_save_binary, save_binary
is_save_binary_file = false

# output model file
output_model = LightGBM_model.txt

# output prediction file for predict task
# output_result= prediction.txt

# fixed random state for reproducibility
# (results may still vary when multi-threaded)
random_state = 42