diff --git a/README.md b/README.md
index 3d985389f..da09d1903 100644
--- a/README.md
+++ b/README.md
@@ -16,7 +16,8 @@ Triage is designed to:
## Quick Links
-- [Dirty Duck Tutorial](https://dssg.github.io/triage/dirtyduck/) - Are you completely new to Triage? Go through the tutorial here with sample data
+- [Tutorial on Google Colab](https://colab.research.google.com/github/dssg/triage/blob/master/example/colab/colab_triage.ipynb) - Are you completely new to Triage? Run through a quick tutorial hosted on Google Colab (no setup necessary) to see what Triage can do!
+- [Dirty Duck Tutorial](https://dssg.github.io/triage/dirtyduck/) - Want a more in-depth walkthrough of Triage's functionality and concepts? Go through the Dirty Duck tutorial here with sample data
- [QuickStart Guide](https://dssg.github.io/triage/quickstart/) - Try Triage out with your own project and data
- [Triage Documentation Site](https://dssg.github.io/triage/) - Used Triage before and want more reference documentation?
- [Development](https://github.com/dssg/triage#development) - Contribute to Triage development.
diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml
index 7721f4d58..8e4b13f08 100644
--- a/docs/mkdocs.yml
+++ b/docs/mkdocs.yml
@@ -81,6 +81,7 @@ plugins:
nav:
- Home: index.md
+ - Online Tutorial (Google Colab): https://colab.research.google.com/github/dssg/triage/blob/master/example/colab/colab_triage.ipynb
- Get started with your own project:
- Quickstart guide: quickstart.md
- Suggested workflow: triage_project_workflow.md
diff --git a/docs/sources/index.md b/docs/sources/index.md
index a2c1f04c4..a0d8d68c0 100644
--- a/docs/sources/index.md
+++ b/docs/sources/index.md
@@ -1,122 +1,26 @@
-Triage
-======
+# Triage
-Data Science Toolkit for Social Good and Public Policy Problems
+[![Build Status](https://travis-ci.org/dssg/triage.svg?branch=master)](https://travis-ci.org/dssg/triage)
+[![codecov](https://codecov.io/gh/dssg/triage/branch/master/graph/badge.svg)](https://codecov.io/gh/dssg/triage)
+[![codeclimate](https://codeclimate.com/github/dssg/triage.png)](https://codeclimate.com/github/dssg/triage)
-[![image](https://travis-ci.com/dssg/triage.svg?branch=master)](https://travis-ci.org/dssg/triage)
-[![image](https://codecov.io/gh/dssg/triage/branch/master/graph/badge.svg)](https://codecov.io/gh/dssg/triage)
-[![image](https://codeclimate.com/github/dssg/triage.png)](https://codeclimate.com/github/dssg/triage)
-Building data science systems requires answering many design questions, turning them into modeling choices, which in turn run machine learning models. Questions such as cohort selection, unit of analysis determination, outcome determination, feature (explanantory variables) generation, model/classifier training, evaluation, selection, and list generation are often complicated and hard to choose apriori. In addition, once these choices are made, they have to be combined in different ways throughout the course of a project.
+## What is Triage?
-Triage is designed to:
+Triage is an open source machine learning toolkit to help data scientists, machine learning developers, and analysts quickly prototype, build, and evaluate end-to-end predictive risk modeling systems for public policy and social good problems.
-- Guide users (data scientists, analysts, researchers) through these design choices by highlighting critical operational use questions.
-- Provide an integrated interface to components that are needed throughout a data science project workflow.
+While many tools (sklearn, keras, pytorch, etc.) exist to build ML models, an end-to-end project requires a lot more than just building models. Developing data science systems requires making many design decisions that need to match how the system is going to be used. These choices then get turned into modeling choices and code. Triage lets you focus on the problem you’re solving and guides you through the design choices you need to make at each step of the machine learning pipeline.
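+
+For a flavor of what that looks like in practice, a Triage experiment is driven by a single YAML configuration file and can be run from the command line (`triage experiment example/config/experiment.yaml`) or from Python. Here is a minimal sketch of the Python entrypoint, assuming you already have an experiment config dictionary and a database engine:
+
+```python
+from triage.experiments import SingleThreadedExperiment
+
+experiment = SingleThreadedExperiment(
+    config=experiment_config,        # a dictionary parsed from your experiment YAML
+    db_engine=create_engine(...),    # SQLAlchemy engine for your Postgres database
+    project_path='/path/to/directory/to/save/data'  # could also be an S3 path: 's3://mybucket/myprefix/'
+)
+experiment.run()
+```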
-## Quick Links
+## How to get started with Triage?
-- [Dirty Duck Tutorial](https://dssg.github.io/triage/dirtyduck/) - Are you completely new to Triage? Go through the tutorial here with sample data
-- [QuickStart Guide](https://dssg.github.io/triage/quickstart/) - Try Triage out with your own project and data
-- [Triage Documentation Site](https://dssg.github.io/triage/) - Used Triage before and want more reference documentation?
-- [Development](https://github.com/dssg/triage#development) - Contribute to Triage development.
+### [Go through a quick online tutorial with sample data (no setup required)](https://colab.research.google.com/github/dssg/triage/blob/master/example/colab/colab_triage.ipynb)
-## Installation
+### [Go through a more in-depth tutorial with sample data](dirtyduck/index.md)
-To install Triage, you need:
+### [Get started with your own project and data](quickstart.md)
-- Python 3.8+
-- A PostgreSQL 9.6+ database with your source data (events,
- geographical data, etc) loaded.
- - **NOTE**: If your database is PostgreSQL 11+ you will get some
- speed improvements. We recommend to update to a recent
- version of PostgreSQL.
-- Ample space on an available disk, (or for example in Amazon Web
- Services's S3), to store the needed matrices and models for your
- experiments
-We recommend starting with a new python virtual environment (with Python 3.6 or greater) and pip installing triage there.
-```bash
-$ virtualenv triage-env
-$ . triage-env/bin/activate
-(triage-env) $ pip install triage
-```
+## Background
-## Data
-Triage needs data in a postgres database and a configuration file that has credentials for the database. The Triage CLI defaults database connection information to a file stored in 'database.yaml' (example in [example/database.yaml](https://github.com/dssg/triage/blob/master/example/database.yaml)).
+Triage was initially developed at the University of Chicago's [Center For Data Science and Public Policy](http://dsapp.uchicago.edu) and is now being maintained and enhanced at Carnegie Mellon University.
-If you don't want to install Postgres yourself, try `triage db up` to create a vanilla Postgres 12 database using docker. For more details on this command, check out [Triage Database Provisioner](db.md)
-
-## Configure Triage for your project
-
-Triage is configured with a config.yaml file that has parameters defined for each component. You can see some [sample configuration with explanations](https://github.com/dssg/triage/blob/master/example/config/experiment.yaml) to see what configuration looks like.
-
-## Using Triage
-
-1. Via CLI:
-```bash
-
-triage experiment example/config/experiment.yaml
-```
-2. Import as a python package:
-```python
-from triage.experiments import SingleThreadedExperiment
-
-experiment = SingleThreadedExperiment(
- config=experiment_config, # a dictionary
- db_engine=create_engine(...), # http://docs.sqlalchemy.org/en/latest/core/engines.html
- project_path='/path/to/directory/to/save/data' # could be an S3 path too: 's3://mybucket/myprefix/'
-)
-experiment.run()
-```
-
-There are a plethora of options available for experiment running, affecting things like parallelization, storage, and more. These options are detailed in the [Running an Experiment](https://dssg.github.io/triage/experiments/running/) page.
-
-## Development
-
-Triag was initially developed at [University of Chicago's Center For Data Science and Public Policy](http://dsapp.uchicago.edu) and is now being maintained at Carnegie Mellon University.
-
-To build this package (without installation), its dependencies may
-alternatively be installed from the terminal using `pip`:
-
- pip install -r requirement/main.txt
-
-### Testing
-
-To add test (and development) dependencies, use **test.txt**:
-
- pip install -r requirement/test.txt [-r requirement/dev.txt]
-
-Then, to run tests:
-
- pytest
-
-### Development Environment
-
-To quickly bootstrap a development environment, having cloned the
-repository, invoke the executable `develop` script from your system
-shell:
-
- ./develop
-
-A "wizard" will suggest set-up steps and optionally execute these, for
-example:
-
- (install) begin
-
- (pyenv) installed
-
- (python-3.9.10) installed
-
- (virtualenv) installed
-
- (activation) installed
-
- (libs) install?
- 1) yes, install {pip install -r requirement/main.txt -r requirement/test.txt -r requirement/dev.txt}
- 2) no, ignore
- #? 1
-
-### Contributing
-
-If you'd like to contribute to Triage development, see the [CONTRIBUTING.md](https://github.com/dssg/triage/blob/master/CONTRIBUTING.md) document.
diff --git a/docs/update_docs.py b/docs/update_docs.py
index 2895ea8f2..51e0a9aaa 100644
--- a/docs/update_docs.py
+++ b/docs/update_docs.py
@@ -30,5 +30,5 @@ def copy_templates():
if __name__ == "__main__":
#copy_templates()
- update_index_md()
+ #update_index_md()
#generate_api_docs()
diff --git a/example/colab/colab_triage.ipynb b/example/colab/colab_triage.ipynb
new file mode 100644
index 000000000..1bdc3829c
--- /dev/null
+++ b/example/colab/colab_triage.ipynb
@@ -0,0 +1,5993 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "name": "colab_triage.ipynb",
+ "provenance": [],
+ "collapsed_sections": [],
+ "include_colab_link": true
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "4ix00QRsvd45"
+ },
+ "source": [
+ "# Colab Triage Tutorial\n",
+ "\n",
+ "## Problem Overview\n",
+ "\n",
+ "This notebook provides a quick, interactive tutorial for [triage](http://www.datasciencepublicpolicy.org/our-work/tools-guides/triage/), a python machine learning pipeline for social good problems, using a sample of the data provided by [DonorsChoose](https://www.donorschoose.org/) to the [2014 KDD Cup](https://www.kaggle.com/c/kdd-cup-2014-predicting-excitement-at-donors-choose/data). Public schools in the United States face large disparities in funding, often resulting in teachers and staff members filling these gaps by purchasing classroom supplies out of their own pockets. DonorsChoose is an online crowdfunding platform that tries to help fill this gap by allowing teachers to seek funding for projects and resources from the community (projects can include classroom basics like books and markers, larger items like lab equipment or musical instruments, specific experiences like field trips or guest speakers). Projects on DonorsChoose expire after 4 months, and if the target funding level isn't reached, the project receives no funding. Since its launch in 2000, the platform has helped fund over 2 million projects at schools across the US, but about 1/3 of the projects that are posted nevertheless fail to meet their goal and go unfunded.\n",
+ "\n",
+ "For the purposes of this tutorial, we'll imagine that DonorsChoose has hired a digital content expert who will review projects and help teachers improve their postings and increase their chances of reaching their funding threshold. Because this individualized review is a labor-intensive process, the digital content expert has time to review only 10% of the projects posted to the platform on a given day. \n",
+ "\n",
+ "### The Modeling Problem\n",
+ "\n",
+ "You are a data scientist working with DonorsChoose, and your task is to help this content expert focus their limited resources on projects that most need the help. As such, you want to build a model to identify projects that are least likely to be fully funded before they expire and pass them off to the digital content expert for review.\n",
+ "\n",
+ "In building that model, our unit of analysis (what triage calls a **cohort**) will be new projects right at the time they're posted, while the **label** we're seeing to predict is whether or not the project reaches its funding goal in the subsequent 4 months (before it expires), making our task a binary classification problem. In order to make this prediction, we'll develop **features** that include information we know about the project when it's posted as well as historical performance of other projects posted by this teacher, school, etc.\n",
+ "\n",
+ "### Outline of the Tutorial\n",
+ "\n",
+ "The remainder of this tutorial will focus on how to use `triage` to solve this problem. Starting from scratch, we'll:\n",
+ "- **Install our tools**, including triage and a postgres server.\n",
+ "- **Explore the data** to get familiar with its structure.\n",
+ "- **Formulate the project** to make sure the models we build meet the needs of the context (and see how to configure `triage` along the way).\n",
+ "- **Build models**, using `triage` to run the modeling pipeline.\n",
+ "- **Look at the results** to ensure they make sense.\n",
+ "- **Select the model to deploy** using the `audition` component of `triage`.\n",
+ "- **Audit our models for bias** using `aequitas`.\n",
+ "- **Lear about next steps** and where to go from here.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "PtAMABPn971u"
+ },
+ "source": [
+ "## Getting Set Up\n",
+ "\n",
+ "We'll need a few dependencies to run triage in a colab notebook:\n",
+ "- A local postgresql server (we'll use version 11)\n",
+ "- A simplified dataset loaded into this database (we'll use data from DonorsChoose)\n",
+ "- Triage and its dependencies (we'll use the current version in pypi)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "-htIBoS7N4gK",
+ "outputId": "4eff8fd0-3ab1-43c6-b9eb-e8efaf7ab9ce"
+ },
+ "source": [
+ "# Install and start postgresql-11 server\n",
+ "!sudo apt-get -y -qq update\n",
+ "!wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -\n",
+ "!echo \"deb http://apt.postgresql.org/pub/repos/apt/ `lsb_release -cs`-pgdg main\" |sudo tee /etc/apt/sources.list.d/pgdg.list\n",
+ "!sudo apt-get -y -qq update\n",
+ "!sudo apt-get -y -qq install postgresql-11 postgresql-client-11\n",
+ "!sudo service postgresql start\n",
+ "\n",
+ "# Setup a password `postgres` for username `postgres`\n",
+ "!sudo -u postgres psql -U postgres -c \"ALTER USER postgres PASSWORD 'postgres';\"\n",
+ "\n",
+ "# Setup a database with name `donors_choose` to be used\n",
+ "!sudo -u postgres psql -U postgres -c 'DROP DATABASE IF EXISTS donors_choose;'\n",
+ "\n",
+ "!sudo -u postgres psql -U postgres -c 'CREATE DATABASE donors_choose;'\n",
+ "\n",
+ "# Environment variables for connecting to the database\n",
+ "%env DEMO_DATABASE_NAME=donors_choose\n",
+ "%env DEMO_DATABASE_HOST=localhost\n",
+ "%env DEMO_DATABASE_PORT=5432\n",
+ "%env DEMO_DATABASE_USER=postgres\n",
+ "%env DEMO_DATABASE_PASS=postgres"
+ ],
+ "execution_count": 2,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "OK\n",
+ "deb http://apt.postgresql.org/pub/repos/apt/ bionic-pgdg main\n",
+ "debconf: unable to initialize frontend: Dialog\n",
+ "debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 16.)\n",
+ "debconf: falling back to frontend: Readline\n",
+ "debconf: unable to initialize frontend: Readline\n",
+ "debconf: (This frontend requires a controlling tty.)\n",
+ "debconf: falling back to frontend: Teletype\n",
+ "dpkg-preconfigure: unable to re-open stdin: \n",
+ "Selecting previously unselected package cron.\n",
+ "(Reading database ... 155639 files and directories currently installed.)\n",
+ "Preparing to unpack .../00-cron_3.0pl1-128.1ubuntu1.2_amd64.deb ...\n",
+ "Unpacking cron (3.0pl1-128.1ubuntu1.2) ...\n",
+ "Selecting previously unselected package logrotate.\n",
+ "Preparing to unpack .../01-logrotate_3.11.0-0.1ubuntu1_amd64.deb ...\n",
+ "Unpacking logrotate (3.11.0-0.1ubuntu1) ...\n",
+ "Selecting previously unselected package netbase.\n",
+ "Preparing to unpack .../02-netbase_5.4_all.deb ...\n",
+ "Unpacking netbase (5.4) ...\n",
+ "Selecting previously unselected package libcommon-sense-perl.\n",
+ "Preparing to unpack .../03-libcommon-sense-perl_3.74-2build2_amd64.deb ...\n",
+ "Unpacking libcommon-sense-perl (3.74-2build2) ...\n",
+ "Selecting previously unselected package libjson-perl.\n",
+ "Preparing to unpack .../04-libjson-perl_2.97001-1_all.deb ...\n",
+ "Unpacking libjson-perl (2.97001-1) ...\n",
+ "Selecting previously unselected package libtypes-serialiser-perl.\n",
+ "Preparing to unpack .../05-libtypes-serialiser-perl_1.0-1_all.deb ...\n",
+ "Unpacking libtypes-serialiser-perl (1.0-1) ...\n",
+ "Selecting previously unselected package libjson-xs-perl.\n",
+ "Preparing to unpack .../06-libjson-xs-perl_3.040-1_amd64.deb ...\n",
+ "Unpacking libjson-xs-perl (3.040-1) ...\n",
+ "Preparing to unpack .../07-libpq-dev_14.4-1.pgdg18.04+1_amd64.deb ...\n",
+ "Unpacking libpq-dev (14.4-1.pgdg18.04+1) over (10.21-0ubuntu0.18.04.1) ...\n",
+ "Preparing to unpack .../08-libpq5_14.4-1.pgdg18.04+1_amd64.deb ...\n",
+ "Unpacking libpq5:amd64 (14.4-1.pgdg18.04+1) over (10.21-0ubuntu0.18.04.1) ...\n",
+ "Selecting previously unselected package pgdg-keyring.\n",
+ "Preparing to unpack .../09-pgdg-keyring_2018.2_all.deb ...\n",
+ "Unpacking pgdg-keyring (2018.2) ...\n",
+ "Selecting previously unselected package postgresql-client-common.\n",
+ "Preparing to unpack .../10-postgresql-client-common_241.pgdg18.04+1_all.deb ...\n",
+ "Unpacking postgresql-client-common (241.pgdg18.04+1) ...\n",
+ "Selecting previously unselected package postgresql-client-11.\n",
+ "Preparing to unpack .../11-postgresql-client-11_11.16-1.pgdg18.04+1_amd64.deb ...\n",
+ "Unpacking postgresql-client-11 (11.16-1.pgdg18.04+1) ...\n",
+ "Selecting previously unselected package ssl-cert.\n",
+ "Preparing to unpack .../12-ssl-cert_1.0.39_all.deb ...\n",
+ "Unpacking ssl-cert (1.0.39) ...\n",
+ "Selecting previously unselected package postgresql-common.\n",
+ "Preparing to unpack .../13-postgresql-common_241.pgdg18.04+1_all.deb ...\n",
+ "Adding 'diversion of /usr/bin/pg_config to /usr/bin/pg_config.libpq-dev by postgresql-common'\n",
+ "Unpacking postgresql-common (241.pgdg18.04+1) ...\n",
+ "Selecting previously unselected package postgresql-11.\n",
+ "Preparing to unpack .../14-postgresql-11_11.16-1.pgdg18.04+1_amd64.deb ...\n",
+ "Unpacking postgresql-11 (11.16-1.pgdg18.04+1) ...\n",
+ "Selecting previously unselected package sysstat.\n",
+ "Preparing to unpack .../15-sysstat_11.6.1-1ubuntu0.1_amd64.deb ...\n",
+ "Unpacking sysstat (11.6.1-1ubuntu0.1) ...\n",
+ "Setting up libcommon-sense-perl (3.74-2build2) ...\n",
+ "Setting up sysstat (11.6.1-1ubuntu0.1) ...\n",
+ "debconf: unable to initialize frontend: Dialog\n",
+ "debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76.)\n",
+ "debconf: falling back to frontend: Readline\n",
+ "\n",
+ "Creating config file /etc/default/sysstat with new version\n",
+ "update-alternatives: using /usr/bin/sar.sysstat to provide /usr/bin/sar (sar) in auto mode\n",
+ "Created symlink /etc/systemd/system/multi-user.target.wants/sysstat.service → /lib/systemd/system/sysstat.service.\n",
+ "Setting up ssl-cert (1.0.39) ...\n",
+ "debconf: unable to initialize frontend: Dialog\n",
+ "debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76.)\n",
+ "debconf: falling back to frontend: Readline\n",
+ "Setting up libtypes-serialiser-perl (1.0-1) ...\n",
+ "Setting up libpq5:amd64 (14.4-1.pgdg18.04+1) ...\n",
+ "Setting up pgdg-keyring (2018.2) ...\n",
+ "Removing apt.postgresql.org key from trusted.gpg: OK\n",
+ "Setting up libjson-perl (2.97001-1) ...\n",
+ "Setting up cron (3.0pl1-128.1ubuntu1.2) ...\n",
+ "Adding group `crontab' (GID 110) ...\n",
+ "Done.\n",
+ "Created symlink /etc/systemd/system/multi-user.target.wants/cron.service → /lib/systemd/system/cron.service.\n",
+ "update-rc.d: warning: start and stop actions are no longer supported; falling back to defaults\n",
+ "invoke-rc.d: could not determine current runlevel\n",
+ "invoke-rc.d: policy-rc.d denied execution of start.\n",
+ "Setting up logrotate (3.11.0-0.1ubuntu1) ...\n",
+ "Setting up netbase (5.4) ...\n",
+ "Setting up libpq-dev (14.4-1.pgdg18.04+1) ...\n",
+ "Setting up libjson-xs-perl (3.040-1) ...\n",
+ "Setting up postgresql-client-common (241.pgdg18.04+1) ...\n",
+ "Setting up postgresql-common (241.pgdg18.04+1) ...\n",
+ "debconf: unable to initialize frontend: Dialog\n",
+ "debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76.)\n",
+ "debconf: falling back to frontend: Readline\n",
+ "Adding user postgres to group ssl-cert\n",
+ "\n",
+ "Creating config file /etc/postgresql-common/createcluster.conf with new version\n",
+ "Building PostgreSQL dictionaries from installed myspell/hunspell packages...\n",
+ "Removing obsolete dictionary files:\n",
+ "Created symlink /etc/systemd/system/multi-user.target.wants/postgresql.service → /lib/systemd/system/postgresql.service.\n",
+ "Setting up postgresql-client-11 (11.16-1.pgdg18.04+1) ...\n",
+ "update-alternatives: using /usr/share/postgresql/11/man/man1/psql.1.gz to provide /usr/share/man/man1/psql.1.gz (psql.1.gz) in auto mode\n",
+ "Setting up postgresql-11 (11.16-1.pgdg18.04+1) ...\n",
+ "debconf: unable to initialize frontend: Dialog\n",
+ "debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76.)\n",
+ "debconf: falling back to frontend: Readline\n",
+ "Creating new PostgreSQL cluster 11/main ...\n",
+ "/usr/lib/postgresql/11/bin/initdb -D /var/lib/postgresql/11/main --auth-local peer --auth-host md5\n",
+ "The files belonging to this database system will be owned by user \"postgres\".\n",
+ "This user must also own the server process.\n",
+ "\n",
+ "The database cluster will be initialized with locale \"en_US.UTF-8\".\n",
+ "The default database encoding has accordingly been set to \"UTF8\".\n",
+ "The default text search configuration will be set to \"english\".\n",
+ "\n",
+ "Data page checksums are disabled.\n",
+ "\n",
+ "fixing permissions on existing directory /var/lib/postgresql/11/main ... ok\n",
+ "creating subdirectories ... ok\n",
+ "selecting default max_connections ... 100\n",
+ "selecting default shared_buffers ... 128MB\n",
+ "selecting default timezone ... Etc/UTC\n",
+ "selecting dynamic shared memory implementation ... posix\n",
+ "creating configuration files ... ok\n",
+ "running bootstrap script ... ok\n",
+ "performing post-bootstrap initialization ... ok\n",
+ "syncing data to disk ... ok\n",
+ "\n",
+ "Success. You can now start the database server using:\n",
+ "\n",
+ " pg_ctlcluster 11 main start\n",
+ "\n",
+ "update-alternatives: using /usr/share/postgresql/11/man/man1/postmaster.1.gz to provide /usr/share/man/man1/postmaster.1.gz (postmaster.1.gz) in auto mode\n",
+ "invoke-rc.d: could not determine current runlevel\n",
+ "invoke-rc.d: policy-rc.d denied execution of start.\n",
+ "Processing triggers for systemd (237-3ubuntu10.53) ...\n",
+ "Processing triggers for man-db (2.8.3-2ubuntu0.1) ...\n",
+ "Processing triggers for libc-bin (2.27-3ubuntu1.3) ...\n",
+ "/sbin/ldconfig.real: /usr/local/lib/python3.7/dist-packages/ideep4py/lib/libmkldnn.so.0 is not a symbolic link\n",
+ "\n",
+ " * Starting PostgreSQL 11 database server\n",
+ " ...done.\n",
+ "ALTER ROLE\n",
+ "NOTICE: database \"donors_choose\" does not exist, skipping\n",
+ "DROP DATABASE\n",
+ "CREATE DATABASE\n",
+ "env: DEMO_DATABASE_NAME=donors_choose\n",
+ "env: DEMO_DATABASE_HOST=localhost\n",
+ "env: DEMO_DATABASE_PORT=5432\n",
+ "env: DEMO_DATABASE_USER=postgres\n",
+ "env: DEMO_DATABASE_PASS=postgres\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "3mWNhJ2rOVtS"
+ },
+ "source": [
+ "# Download sampled DonorsChoose data and load it into our postgres server\n",
+ "!curl -s -OL https://dsapp-public-data-migrated.s3.us-west-2.amazonaws.com/donors_sampled_20210920_v3.dmp\n",
+ "!PGPASSWORD=$DEMO_DATABASE_PASS pg_restore -h $DEMO_DATABASE_HOST -p $DEMO_DATABASE_PORT -d $DEMO_DATABASE_NAME -U $DEMO_DATABASE_USER -O -j 8 donors_sampled_20210920_v3.dmp"
+ ],
+ "execution_count": 3,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 1000
+ },
+ "id": "t5E-9VRjRlSk",
+ "outputId": "8c470c70-6b25-4cca-c102-7d88907281c3"
+ },
+ "source": [
+ "# Install triage and its dependencies\n",
+ "!pip install triage"
+ ],
+ "execution_count": 4,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
+ "Collecting triage\n",
+ " Downloading triage-5.1.1-py2.py3-none-any.whl (250 kB)\n",
+ "\u001b[K |████████████████████████████████| 250 kB 13.6 MB/s \n",
+ "\u001b[?25hCollecting matplotlib==3.3.4\n",
+ " Downloading matplotlib-3.3.4-cp37-cp37m-manylinux1_x86_64.whl (11.5 MB)\n",
+ "\u001b[K |████████████████████████████████| 11.5 MB 77.9 MB/s \n",
+ "\u001b[?25hCollecting pebble==4.5.3\n",
+ " Downloading Pebble-4.5.3-py2.py3-none-any.whl (24 kB)\n",
+ "Collecting signalled-timeout==1.0.0\n",
+ " Downloading signalled-timeout-1.0.0.tar.gz (2.7 kB)\n",
+ "Collecting seaborn==0.10.1\n",
+ " Downloading seaborn-0.10.1-py3-none-any.whl (215 kB)\n",
+ "\u001b[K |████████████████████████████████| 215 kB 66.9 MB/s \n",
+ "\u001b[?25hCollecting pandas==1.0.5\n",
+ " Downloading pandas-1.0.5-cp37-cp37m-manylinux1_x86_64.whl (10.1 MB)\n",
+ "\u001b[K |████████████████████████████████| 10.1 MB 45.5 MB/s \n",
+ "\u001b[?25hCollecting s3fs==0.4.2\n",
+ " Downloading s3fs-0.4.2-py3-none-any.whl (19 kB)\n",
+ "Collecting sqlalchemy-postgres-copy==0.5.0\n",
+ " Downloading sqlalchemy_postgres_copy-0.5.0-py2.py3-none-any.whl (6.6 kB)\n",
+ "Requirement already satisfied: click==7.1.2 in /usr/local/lib/python3.7/dist-packages (from triage) (7.1.2)\n",
+ "Collecting python-dateutil==2.8.1\n",
+ " Downloading python_dateutil-2.8.1-py2.py3-none-any.whl (227 kB)\n",
+ "\u001b[K |████████████████████████████████| 227 kB 72.6 MB/s \n",
+ "\u001b[?25hCollecting boto3==1.14.45\n",
+ " Downloading boto3-1.14.45-py2.py3-none-any.whl (129 kB)\n",
+ "\u001b[K |████████████████████████████████| 129 kB 70.7 MB/s \n",
+ "\u001b[?25hCollecting Dickens==1.0.1\n",
+ " Downloading Dickens-1.0.1.tar.gz (2.3 kB)\n",
+ "Collecting argcmdr==0.7.0\n",
+ " Downloading argcmdr-0.7.0-py3-none-any.whl (33 kB)\n",
+ "Collecting scipy==1.5.0\n",
+ " Downloading scipy-1.5.0-cp37-cp37m-manylinux1_x86_64.whl (25.9 MB)\n",
+ "\u001b[K |████████████████████████████████| 25.9 MB 81.5 MB/s \n",
+ "\u001b[?25hCollecting numpy==1.21.1\n",
+ " Downloading numpy-1.21.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)\n",
+ "\u001b[K |████████████████████████████████| 15.7 MB 36.1 MB/s \n",
+ "\u001b[?25hCollecting graphviz==0.14\n",
+ " Downloading graphviz-0.14-py2.py3-none-any.whl (18 kB)\n",
+ "Collecting requests==2.24.0\n",
+ " Downloading requests-2.24.0-py2.py3-none-any.whl (61 kB)\n",
+ "\u001b[K |████████████████████████████████| 61 kB 509 kB/s \n",
+ "\u001b[?25hCollecting retrying==1.3.3\n",
+ " Downloading retrying-1.3.3.tar.gz (10 kB)\n",
+ "Collecting aequitas==0.42.0\n",
+ " Downloading aequitas-0.42.0-py3-none-any.whl (2.2 MB)\n",
+ "\u001b[K |████████████████████████████████| 2.2 MB 47.9 MB/s \n",
+ "\u001b[?25hCollecting alembic==1.4.2\n",
+ " Downloading alembic-1.4.2.tar.gz (1.1 MB)\n",
+ "\u001b[K |████████████████████████████████| 1.1 MB 67.3 MB/s \n",
+ "\u001b[?25h Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n",
+ " Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n",
+ " Preparing wheel metadata ... \u001b[?25l\u001b[?25hdone\n",
+ "Collecting verboselogs==1.7\n",
+ " Downloading verboselogs-1.7-py2.py3-none-any.whl (11 kB)\n",
+ "Requirement already satisfied: sqlparse==0.4.2 in /usr/local/lib/python3.7/dist-packages (from triage) (0.4.2)\n",
+ "Collecting PyYAML==5.4.1\n",
+ " Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)\n",
+ "\u001b[K |████████████████████████████████| 636 kB 56.6 MB/s \n",
+ "\u001b[?25hCollecting psycopg2-binary==2.8.5\n",
+ " Downloading psycopg2_binary-2.8.5-cp37-cp37m-manylinux1_x86_64.whl (2.9 MB)\n",
+ "\u001b[K |████████████████████████████████| 2.9 MB 50.2 MB/s \n",
+ "\u001b[?25hCollecting SQLAlchemy==1.3.18\n",
+ " Downloading SQLAlchemy-1.3.18-cp37-cp37m-manylinux2010_x86_64.whl (1.3 MB)\n",
+ "\u001b[K |████████████████████████████████| 1.3 MB 49.3 MB/s \n",
+ "\u001b[?25hCollecting adjustText==0.7.3\n",
+ " Downloading adjustText-0.7.3.tar.gz (7.5 kB)\n",
+ "Collecting wrapt==1.13.3\n",
+ " Downloading wrapt-1.13.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (79 kB)\n",
+ "\u001b[K |████████████████████████████████| 79 kB 9.2 MB/s \n",
+ "\u001b[?25hCollecting scikit-learn==0.23.1\n",
+ " Downloading scikit_learn-0.23.1-cp37-cp37m-manylinux1_x86_64.whl (6.8 MB)\n",
+ "\u001b[K |████████████████████████████████| 6.8 MB 52.4 MB/s \n",
+ "\u001b[?25hCollecting ohio==0.5.0\n",
+ " Downloading ohio-0.5.0-py3-none-any.whl (26 kB)\n",
+ "Collecting coloredlogs==14.0\n",
+ " Downloading coloredlogs-14.0-py2.py3-none-any.whl (43 kB)\n",
+ "\u001b[K |████████████████████████████████| 43 kB 2.4 MB/s \n",
+ "\u001b[?25hCollecting inflection==0.5.0\n",
+ " Downloading inflection-0.5.0-py2.py3-none-any.whl (5.8 kB)\n",
+ "Collecting xhtml2pdf==0.2.2\n",
+ " Downloading xhtml2pdf-0.2.2.tar.gz (97 kB)\n",
+ "\u001b[K |████████████████████████████████| 97 kB 7.7 MB/s \n",
+ "\u001b[?25hCollecting tabulate==0.8.2\n",
+ " Downloading tabulate-0.8.2.tar.gz (45 kB)\n",
+ "\u001b[K |████████████████████████████████| 45 kB 4.6 MB/s \n",
+ "\u001b[?25hCollecting markdown2==2.3.5\n",
+ " Downloading markdown2-2.3.5.zip (161 kB)\n",
+ "\u001b[K |████████████████████████████████| 161 kB 23.0 MB/s \n",
+ "\u001b[?25hCollecting altair==4.1.0\n",
+ " Downloading altair-4.1.0-py3-none-any.whl (727 kB)\n",
+ "\u001b[K |████████████████████████████████| 727 kB 68.1 MB/s \n",
+ "\u001b[?25hCollecting millify==0.1.1\n",
+ " Downloading millify-0.1.1.tar.gz (1.2 kB)\n",
+ "Collecting Flask-Bootstrap==3.3.7.1\n",
+ " Downloading Flask-Bootstrap-3.3.7.1.tar.gz (456 kB)\n",
+ "\u001b[K |████████████████████████████████| 456 kB 69.4 MB/s \n",
+ "\u001b[?25hCollecting Flask==0.12.2\n",
+ " Downloading Flask-0.12.2-py2.py3-none-any.whl (83 kB)\n",
+ "\u001b[K |████████████████████████████████| 83 kB 1.6 MB/s \n",
+ "\u001b[?25hCollecting Mako\n",
+ " Downloading Mako-1.2.0-py3-none-any.whl (78 kB)\n",
+ "\u001b[K |████████████████████████████████| 78 kB 8.4 MB/s \n",
+ "\u001b[?25hCollecting python-editor>=0.3\n",
+ " Downloading python_editor-1.0.4-py3-none-any.whl (4.9 kB)\n",
+ "Requirement already satisfied: jsonschema in /usr/local/lib/python3.7/dist-packages (from altair==4.1.0->aequitas==0.42.0->triage) (4.3.3)\n",
+ "Requirement already satisfied: entrypoints in /usr/local/lib/python3.7/dist-packages (from altair==4.1.0->aequitas==0.42.0->triage) (0.4)\n",
+ "Requirement already satisfied: toolz in /usr/local/lib/python3.7/dist-packages (from altair==4.1.0->aequitas==0.42.0->triage) (0.11.2)\n",
+ "Requirement already satisfied: jinja2 in /usr/local/lib/python3.7/dist-packages (from altair==4.1.0->aequitas==0.42.0->triage) (2.11.3)\n",
+ "Collecting argcomplete==1.9.4\n",
+ " Downloading argcomplete-1.9.4-py2.py3-none-any.whl (36 kB)\n",
+ "Collecting plumbum==1.6.4\n",
+ " Downloading plumbum-1.6.4-py2.py3-none-any.whl (110 kB)\n",
+ "\u001b[K |████████████████████████████████| 110 kB 73.9 MB/s \n",
+ "\u001b[?25hCollecting botocore<1.18.0,>=1.17.45\n",
+ " Downloading botocore-1.17.63-py2.py3-none-any.whl (6.6 MB)\n",
+ "\u001b[K |████████████████████████████████| 6.6 MB 53.5 MB/s \n",
+ "\u001b[?25hCollecting jmespath<1.0.0,>=0.7.1\n",
+ " Downloading jmespath-0.10.0-py2.py3-none-any.whl (24 kB)\n",
+ "Collecting s3transfer<0.4.0,>=0.3.0\n",
+ " Downloading s3transfer-0.3.7-py2.py3-none-any.whl (73 kB)\n",
+ "\u001b[K |████████████████████████████████| 73 kB 2.4 MB/s \n",
+ "\u001b[?25hCollecting humanfriendly>=7.1\n",
+ " Downloading humanfriendly-10.0-py2.py3-none-any.whl (86 kB)\n",
+ "\u001b[K |████████████████████████████████| 86 kB 7.5 MB/s \n",
+ "\u001b[?25hRequirement already satisfied: Werkzeug>=0.7 in /usr/local/lib/python3.7/dist-packages (from Flask==0.12.2->aequitas==0.42.0->triage) (1.0.1)\n",
+ "Requirement already satisfied: itsdangerous>=0.21 in /usr/local/lib/python3.7/dist-packages (from Flask==0.12.2->aequitas==0.42.0->triage) (1.1.0)\n",
+ "Collecting dominate\n",
+ " Downloading dominate-2.6.0-py2.py3-none-any.whl (29 kB)\n",
+ "Collecting visitor\n",
+ " Downloading visitor-0.1.3.tar.gz (3.3 kB)\n",
+ "Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib==3.3.4->triage) (0.11.0)\n",
+ "Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib==3.3.4->triage) (1.4.3)\n",
+ "Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in /usr/local/lib/python3.7/dist-packages (from matplotlib==3.3.4->triage) (3.0.9)\n",
+ "Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.7/dist-packages (from matplotlib==3.3.4->triage) (7.1.2)\n",
+ "Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas==1.0.5->triage) (2022.1)\n",
+ "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil==2.8.1->triage) (1.15.0)\n",
+ "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests==2.24.0->triage) (2022.6.15)\n",
+ "Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests==2.24.0->triage) (3.0.4)\n",
+ "Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests==2.24.0->triage) (2.10)\n",
+ "Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests==2.24.0->triage) (1.24.3)\n",
+ "Collecting fsspec>=0.6.0\n",
+ " Downloading fsspec-2022.5.0-py3-none-any.whl (140 kB)\n",
+ "\u001b[K |████████████████████████████████| 140 kB 84.6 MB/s \n",
+ "\u001b[?25hRequirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn==0.23.1->triage) (1.1.0)\n",
+ "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn==0.23.1->triage) (3.1.0)\n",
+ "Requirement already satisfied: psycopg2 in /usr/local/lib/python3.7/dist-packages (from sqlalchemy-postgres-copy==0.5.0->triage) (2.7.6.1)\n",
+ "Requirement already satisfied: html5lib>=1.0 in /usr/local/lib/python3.7/dist-packages (from xhtml2pdf==0.2.2->aequitas==0.42.0->triage) (1.0.1)\n",
+ "Requirement already satisfied: httplib2 in /usr/local/lib/python3.7/dist-packages (from xhtml2pdf==0.2.2->aequitas==0.42.0->triage) (0.17.4)\n",
+ "Collecting pyPdf2\n",
+ " Downloading PyPDF2-2.3.1-py3-none-any.whl (198 kB)\n",
+ "\u001b[K |████████████████████████████████| 198 kB 68.3 MB/s \n",
+ "\u001b[?25hCollecting reportlab>=3.0\n",
+ " Downloading reportlab-3.6.10-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.8 MB)\n",
+ "\u001b[K |████████████████████████████████| 2.8 MB 55.7 MB/s \n",
+ "\u001b[?25hCollecting docutils<0.16,>=0.10\n",
+ " Downloading docutils-0.15.2-py3-none-any.whl (547 kB)\n",
+ "\u001b[K |████████████████████████████████| 547 kB 62.2 MB/s \n",
+ "\u001b[?25hRequirement already satisfied: webencodings in /usr/local/lib/python3.7/dist-packages (from html5lib>=1.0->xhtml2pdf==0.2.2->aequitas==0.42.0->triage) (0.5.1)\n",
+ "Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.7/dist-packages (from jinja2->altair==4.1.0->aequitas==0.42.0->triage) (2.0.1)\n",
+ "Requirement already satisfied: typing-extensions in /usr/local/lib/python3.7/dist-packages (from kiwisolver>=1.0.1->matplotlib==3.3.4->triage) (4.1.1)\n",
+ "Collecting pillow>=6.2.0\n",
+ " Downloading Pillow-9.1.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)\n",
+ "\u001b[K |████████████████████████████████| 3.1 MB 51.7 MB/s \n",
+ "\u001b[?25hRequirement already satisfied: pyrsistent!=0.17.0,!=0.17.1,!=0.17.2,>=0.14.0 in /usr/local/lib/python3.7/dist-packages (from jsonschema->altair==4.1.0->aequitas==0.42.0->triage) (0.18.1)\n",
+ "Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.7/dist-packages (from jsonschema->altair==4.1.0->aequitas==0.42.0->triage) (4.11.4)\n",
+ "Requirement already satisfied: importlib-resources>=1.4.0 in /usr/local/lib/python3.7/dist-packages (from jsonschema->altair==4.1.0->aequitas==0.42.0->triage) (5.7.1)\n",
+ "Requirement already satisfied: attrs>=17.4.0 in /usr/local/lib/python3.7/dist-packages (from jsonschema->altair==4.1.0->aequitas==0.42.0->triage) (21.4.0)\n",
+ "Requirement already satisfied: zipp>=3.1.0 in /usr/local/lib/python3.7/dist-packages (from importlib-resources>=1.4.0->jsonschema->altair==4.1.0->aequitas==0.42.0->triage) (3.8.0)\n",
+ "Building wheels for collected packages: adjustText, alembic, Dickens, Flask-Bootstrap, markdown2, millify, retrying, signalled-timeout, tabulate, xhtml2pdf, visitor\n",
+ " Building wheel for adjustText (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
+ " Created wheel for adjustText: filename=adjustText-0.7.3-py3-none-any.whl size=7097 sha256=5cb8e926cf54a794e98db5ddc8dfead14dbe5b87cf2363c1724c124d55691b6f\n",
+ " Stored in directory: /root/.cache/pip/wheels/2f/98/32/afbf902d8f040fadfdf0a44357e4ab750afe165d873bf5893d\n",
+ " Building wheel for alembic (PEP 517) ... \u001b[?25l\u001b[?25hdone\n",
+ " Created wheel for alembic: filename=alembic-1.4.2-py2.py3-none-any.whl size=159554 sha256=b0276e52e77a3cd7cefb29a7e05f54c5d6d443c041942bc41839adf15c3df576\n",
+ " Stored in directory: /root/.cache/pip/wheels/4e/b5/00/f93fe1c90b3d501774e91e2e99987f49d16019e40e4bd3afc3\n",
+ " Building wheel for Dickens (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
+ " Created wheel for Dickens: filename=Dickens-1.0.1-py3-none-any.whl size=2643 sha256=e8ab520fb261b3233ee613553c6eb81dbbd1a18c9649c77be85311f3d33f2199\n",
+ " Stored in directory: /root/.cache/pip/wheels/11/7b/87/87c72b3ffee9c8830070dfc690b0df03833753e2197c7ed230\n",
+ " Building wheel for Flask-Bootstrap (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
+ " Created wheel for Flask-Bootstrap: filename=Flask_Bootstrap-3.3.7.1-py3-none-any.whl size=460123 sha256=4ee5321f47b128334bf1b80349b57ac5ce742fdc98f5c22b8dfa34bafbc3d40d\n",
+ " Stored in directory: /root/.cache/pip/wheels/67/a2/d6/50d039c9b59b4caca6d7b53839c8100354a52ab7553d2456eb\n",
+ " Building wheel for markdown2 (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
+ " Created wheel for markdown2: filename=markdown2-2.3.5-py3-none-any.whl size=33327 sha256=06fc0f9ede786a1d83e67a52414c4024cdb0ba538a6335fe825ab07451b1a51d\n",
+ " Stored in directory: /root/.cache/pip/wheels/46/b9/ae/4050b5eeeedc7cba8ed5a0203189c89c0fa980f683822bfa31\n",
+ " Building wheel for millify (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
+ " Created wheel for millify: filename=millify-0.1.1-py3-none-any.whl size=1866 sha256=841923968d9c2e34ca7de09da6d07b7e3abbbe03a70ffe19fe4c292a50dde626\n",
+ " Stored in directory: /root/.cache/pip/wheels/38/26/25/c2a8bb99a5cf348903e6ac35a29878e221cc9daeb698545148\n",
+ " Building wheel for retrying (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
+ " Created wheel for retrying: filename=retrying-1.3.3-py3-none-any.whl size=11447 sha256=e3f02b7975c11140b34230dc83347c67f5e23103d78063728519c9a440026da4\n",
+ " Stored in directory: /root/.cache/pip/wheels/f9/8d/8d/f6af3f7f9eea3553bc2fe6d53e4b287dad18b06a861ac56ddf\n",
+ " Building wheel for signalled-timeout (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
+ " Created wheel for signalled-timeout: filename=signalled_timeout-1.0.0-py3-none-any.whl size=2973 sha256=d947f52f2f3cc5b0b58980c0b9353c9375f3dcf9400f1684e9f31a329ec79596\n",
+ " Stored in directory: /root/.cache/pip/wheels/b8/67/0e/f8daac45e46330192ff71cc9c65c86e817df05f7a4a79531d4\n",
+ " Building wheel for tabulate (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
+ " Created wheel for tabulate: filename=tabulate-0.8.2-py3-none-any.whl size=23550 sha256=a85c2762c9ae5625348cc2b8227f13d4081de9935a418930d16bbc2d3d4af1e3\n",
+ " Stored in directory: /root/.cache/pip/wheels/33/63/72/4156fe55e8e06830d7aed3d20a6d1aacc753536843ab7330f6\n",
+ " Building wheel for xhtml2pdf (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
+ " Created wheel for xhtml2pdf: filename=xhtml2pdf-0.2.2-py3-none-any.whl size=230264 sha256=d712ab3897a5e39f980482c164d8440dc63e81f29174a84a082250c9a9fb3531\n",
+ " Stored in directory: /root/.cache/pip/wheels/65/e6/3a/9851102d40dd8e643a4ff3ce5d69988f95d1d9b7448e37a916\n",
+ " Building wheel for visitor (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
+ " Created wheel for visitor: filename=visitor-0.1.3-py3-none-any.whl size=3946 sha256=d46157218c08ad97d2597bddbe54e35495289d0bb4b232396c1d9ad4309c2080\n",
+ " Stored in directory: /root/.cache/pip/wheels/64/34/11/053f47218984c9a31a00f911ed98dda036b867481dcc527a12\n",
+ "Successfully built adjustText alembic Dickens Flask-Bootstrap markdown2 millify retrying signalled-timeout tabulate xhtml2pdf visitor\n",
+ "Installing collected packages: python-dateutil, pillow, numpy, jmespath, docutils, visitor, scipy, reportlab, pyPdf2, pandas, matplotlib, Flask, dominate, botocore, xhtml2pdf, tabulate, SQLAlchemy, seaborn, s3transfer, PyYAML, python-editor, plumbum, ohio, millify, markdown2, Mako, humanfriendly, fsspec, Flask-Bootstrap, Dickens, argcomplete, altair, wrapt, verboselogs, sqlalchemy-postgres-copy, signalled-timeout, scikit-learn, s3fs, retrying, requests, psycopg2-binary, pebble, inflection, graphviz, coloredlogs, boto3, argcmdr, alembic, aequitas, adjustText, triage\n",
+ " Attempting uninstall: python-dateutil\n",
+ " Found existing installation: python-dateutil 2.8.2\n",
+ " Uninstalling python-dateutil-2.8.2:\n",
+ " Successfully uninstalled python-dateutil-2.8.2\n",
+ " Attempting uninstall: pillow\n",
+ " Found existing installation: Pillow 7.1.2\n",
+ " Uninstalling Pillow-7.1.2:\n",
+ " Successfully uninstalled Pillow-7.1.2\n",
+ " Attempting uninstall: numpy\n",
+ " Found existing installation: numpy 1.21.6\n",
+ " Uninstalling numpy-1.21.6:\n",
+ " Successfully uninstalled numpy-1.21.6\n",
+ " Attempting uninstall: docutils\n",
+ " Found existing installation: docutils 0.17.1\n",
+ " Uninstalling docutils-0.17.1:\n",
+ " Successfully uninstalled docutils-0.17.1\n",
+ " Attempting uninstall: scipy\n",
+ " Found existing installation: scipy 1.4.1\n",
+ " Uninstalling scipy-1.4.1:\n",
+ " Successfully uninstalled scipy-1.4.1\n",
+ " Attempting uninstall: pandas\n",
+ " Found existing installation: pandas 1.3.5\n",
+ " Uninstalling pandas-1.3.5:\n",
+ " Successfully uninstalled pandas-1.3.5\n",
+ " Attempting uninstall: matplotlib\n",
+ " Found existing installation: matplotlib 3.2.2\n",
+ " Uninstalling matplotlib-3.2.2:\n",
+ " Successfully uninstalled matplotlib-3.2.2\n",
+ " Attempting uninstall: Flask\n",
+ " Found existing installation: Flask 1.1.4\n",
+ " Uninstalling Flask-1.1.4:\n",
+ " Successfully uninstalled Flask-1.1.4\n",
+ " Attempting uninstall: tabulate\n",
+ " Found existing installation: tabulate 0.8.9\n",
+ " Uninstalling tabulate-0.8.9:\n",
+ " Successfully uninstalled tabulate-0.8.9\n",
+ " Attempting uninstall: SQLAlchemy\n",
+ " Found existing installation: SQLAlchemy 1.4.37\n",
+ " Uninstalling SQLAlchemy-1.4.37:\n",
+ " Successfully uninstalled SQLAlchemy-1.4.37\n",
+ " Attempting uninstall: seaborn\n",
+ " Found existing installation: seaborn 0.11.2\n",
+ " Uninstalling seaborn-0.11.2:\n",
+ " Successfully uninstalled seaborn-0.11.2\n",
+ " Attempting uninstall: PyYAML\n",
+ " Found existing installation: PyYAML 3.13\n",
+ " Uninstalling PyYAML-3.13:\n",
+ " Successfully uninstalled PyYAML-3.13\n",
+ " Attempting uninstall: altair\n",
+ " Found existing installation: altair 4.2.0\n",
+ " Uninstalling altair-4.2.0:\n",
+ " Successfully uninstalled altair-4.2.0\n",
+ " Attempting uninstall: wrapt\n",
+ " Found existing installation: wrapt 1.14.1\n",
+ " Uninstalling wrapt-1.14.1:\n",
+ " Successfully uninstalled wrapt-1.14.1\n",
+ " Attempting uninstall: scikit-learn\n",
+ " Found existing installation: scikit-learn 1.0.2\n",
+ " Uninstalling scikit-learn-1.0.2:\n",
+ " Successfully uninstalled scikit-learn-1.0.2\n",
+ " Attempting uninstall: requests\n",
+ " Found existing installation: requests 2.23.0\n",
+ " Uninstalling requests-2.23.0:\n",
+ " Successfully uninstalled requests-2.23.0\n",
+ " Attempting uninstall: graphviz\n",
+ " Found existing installation: graphviz 0.10.1\n",
+ " Uninstalling graphviz-0.10.1:\n",
+ " Successfully uninstalled graphviz-0.10.1\n",
+ "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
+ "yellowbrick 1.4 requires scikit-learn>=1.0.0, but you have scikit-learn 0.23.1 which is incompatible.\n",
+ "xarray 0.20.2 requires pandas>=1.1, but you have pandas 1.0.5 which is incompatible.\n",
+ "imbalanced-learn 0.8.1 requires scikit-learn>=0.24, but you have scikit-learn 0.23.1 which is incompatible.\n",
+ "google-colab 1.0.0 requires pandas>=1.1.0; python_version >= \"3.0\", but you have pandas 1.0.5 which is incompatible.\n",
+ "google-colab 1.0.0 requires requests~=2.23.0, but you have requests 2.24.0 which is incompatible.\n",
+ "datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.\n",
+ "albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you have imgaug 0.2.9 which is incompatible.\u001b[0m\n",
+ "Successfully installed Dickens-1.0.1 Flask-0.12.2 Flask-Bootstrap-3.3.7.1 Mako-1.2.0 PyYAML-5.4.1 SQLAlchemy-1.3.18 adjustText-0.7.3 aequitas-0.42.0 alembic-1.4.2 altair-4.1.0 argcmdr-0.7.0 argcomplete-1.9.4 boto3-1.14.45 botocore-1.17.63 coloredlogs-14.0 docutils-0.15.2 dominate-2.6.0 fsspec-2022.5.0 graphviz-0.14 humanfriendly-10.0 inflection-0.5.0 jmespath-0.10.0 markdown2-2.3.5 matplotlib-3.3.4 millify-0.1.1 numpy-1.21.1 ohio-0.5.0 pandas-1.0.5 pebble-4.5.3 pillow-9.1.1 plumbum-1.6.4 psycopg2-binary-2.8.5 pyPdf2-2.3.1 python-dateutil-2.8.1 python-editor-1.0.4 reportlab-3.6.10 requests-2.24.0 retrying-1.3.3 s3fs-0.4.2 s3transfer-0.3.7 scikit-learn-0.23.1 scipy-1.5.0 seaborn-0.10.1 signalled-timeout-1.0.0 sqlalchemy-postgres-copy-0.5.0 tabulate-0.8.2 triage-5.1.1 verboselogs-1.7 visitor-0.1.3 wrapt-1.13.3 xhtml2pdf-0.2.2\n"
+ ]
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "application/vnd.colab-display-data+json": {
+ "pip_warning": {
+ "packages": [
+ "PIL",
+ "dateutil",
+ "matplotlib",
+ "mpl_toolkits",
+ "numpy"
+ ]
+ }
+ }
+ },
+ "metadata": {}
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "8mQ1nY6lksXD"
+ },
+ "source": [
+ "🛑 **NOTE: Before continuing, your colab runtime may need to be restarted for the installed packages to take effect. If a \"Restart Runtime\" button appeared at the bottom of the output above, be sure to click it before moving on to the next section!**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "reskRriKlcpO"
+ },
+ "source": [
+ "## A Quick Look at the DonorsChoose Data\n",
+ "\n",
+ "Before getting into triage, let's just take a quick look at the data we'll be using here. To get started, we'll need to connect to the database we just created..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "mvDxGoeCmSKQ"
+ },
+ "source": [
+ "from sqlalchemy.engine.url import URL\n",
+ "from triage.util.db import create_engine\n",
+ "import pandas as pd\n",
+ "\n",
+ "db_url = URL(\n",
+ " 'postgres',\n",
+ " host='localhost',\n",
+ " username='postgres',\n",
+ " database='donors_choose',\n",
+ " password='postgres',\n",
+ " port=5432,\n",
+ " )\n",
+ "\n",
+ "db_engine = create_engine(db_url)"
+ ],
+ "execution_count": 1,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "xfgHR1NfnMI4"
+ },
+ "source": [
+ "The DonorsChoose dataset contains four main tables we'll need here:\n",
+ "- **Projects** contains information about each project as well as some details about the teacher posting it and their school and district\n",
+ "- **Essays** contains the detailed descriptions that the teacher post describing their project and needs\n",
+ "- **Resources** contains detailed information about the specific number, type, and cost of resources being asked for in the project\n",
+ "- **Donations** contains information about the donations received by each project on a transactional level, as well as some details about the donor\n",
+ "\n",
+ "Let's take a look at the projects:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 81
+ },
+ "id": "dhElc5PMprk0",
+ "outputId": "a5a874bf-c637-45ea-d1e7-f027dcf04f51"
+ },
+ "source": [
+ "pd.read_sql('SELECT COUNT(*) FROM data.projects', db_engine)"
+ ],
+ "execution_count": 2,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ " count\n",
+ "0 16480"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 6
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "gIj8GvQF_1cV"
+ },
+ "source": [
+ "## Formulating the project\n",
+ "\n",
+ "Now that we're familiar with the available data, let's turn to the prediction problem at hand. Because reviewing and offering suggestions to posted projects will be time and resource-intensive, we might assume that DonorsChoose can only help a fraction of all projects that get posted, let's suppose 10%. Then, we might formulate our problem along the lines of:\n",
+ "\n",
+ "**Each day, for all the projects posted on that day, can we identify the 10% of projects with the highest risk of not being fully funded within 4 months to prioritize for review by digital content experts.**\n",
+ "\n",
+ "With this formulation in mind, we can define a cohort and label for our analysis. `triage` will allow us to define these directly as a SQL query, so let's start there..."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "6A7_a1SADxE9"
+ },
+ "source": [
+ "### Defining the Cohort\n",
+ "\n",
+ "Because most models to inform important decisions will need to generalize into the future, `triage` focuses on respecting the temporal nature of the data (discussed in more detail below). The `cohort` is the set of relevant entities for model training/prediction at a given point in time, which `triage` referrs to as an `as_of_date`.\n",
+ "\n",
+ "🚧 NOTE: In `triage`, an `as_of_date` is taken to be midnight at the **beginning** of that date.\n",
+ "\n",
+ "Here, the cohort is relatively straightforward: we simply want to identify all of the projects that were posted, right on the day of posting. Although we were looking at the identifier `projectid_str` above, `triage` looks for a column called `entity_id` to uniquely identify entities to its models. We've already added this column to this dataset, so we'll use that below.\n",
+ "\n",
+ "🚧 NOTE: `triage` expects entities in the data to be identified by an **integer column** called `entity_id`.\n",
+ "\n",
+ "With those details in mind, let's look at an example of how we might define the cohort from our data for this project:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 143
+ },
+ "id": "Fq5W82jLLwQF",
+ "outputId": "1b1bb3b0-d5a3-48db-cb5b-5d6d43671c54"
+ },
+ "source": [
+ "example_as_of_date = '2012-08-07'\n",
+ "\n",
+ "pd.read_sql(\"\"\"\n",
+ " SELECT distinct(entity_id)\n",
+ " FROM data.projects\n",
+ " WHERE date_posted = '{as_of_date}'::date - interval '1day'\n",
+ " ;\n",
+ " \"\"\".format(as_of_date=example_as_of_date), db_engine)"
+ ],
+ "execution_count": 7,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ " entity_id\n",
+ "0 234035\n",
+ "1 234148\n",
+ "2 234234"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 7
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "qiDh3DdOMrGp"
+ },
+ "source": [
+ "In `triage` we'll be able to use `{as_of_date}` as a placeholder for time just as we're doing here.\n",
+ "\n",
+ "Also note that because the `as_of_date` is taken to be midnight, we're looking at the projects posted the previous day (hence subtracting the 1 day interval in the query).\n",
+ "\n",
+ "For `triage`, we use a yaml format for configuration (described further below) and we'll be able to provide this query directly:\n",
+ "```\n",
+ "cohort_config:\n",
+ " query: |\n",
+ " SELECT distinct(entity_id)\n",
+ " FROM data.projects\n",
+ " WHERE date_posted = '{as_of_date}'::date - interval '1day'\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Tv-YbkWuOFnI"
+ },
+ "source": [
+ "### Defining the Label\n",
+ "\n",
+ "For modeling, we also need to consider the outcome we care about. Returning to our formulation, we described trying to identify projects which will not be fully funded within the four months they are active on the platform.\n",
+ "\n",
+ "As with the cohort, notice that labels are calculated relative to a given point in time (the `as_of_date` described above) and over a specific time horizon (here, 4 months from posting). In triage, this time horizon is referred to as a `label_timespan` and is also available as a parameter to your label definition, again specified as a query:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 143
+ },
+ "id": "Odpo6nluPGLk",
+ "outputId": "783747f1-7e10-4fc1-9683-ec76d818715e"
+ },
+ "source": [
+ "example_as_of_date = '2012-08-07'\n",
+ "example_label_timespan = '4month'\n",
+ "\n",
+ "pd.read_sql(\"\"\"\n",
+ " WITH cohort_query AS (\n",
+ " SELECT distinct(entity_id)\n",
+ " FROM data.projects\n",
+ " WHERE date_posted = '{as_of_date}'::date - interval '1day'\n",
+ " )\n",
+ " , cohort_donations AS (\n",
+ " SELECT \n",
+ " c.entity_id, \n",
+ " COALESCE(SUM(d.donation_to_project), 0) AS total_donation\n",
+ " FROM cohort_query c\n",
+ " LEFT JOIN data.donations d \n",
+ " ON c.entity_id = d.entity_id\n",
+ " AND d.donation_timestamp \n",
+ " BETWEEN '{as_of_date}'::date - interval '1day'\n",
+ " AND '{as_of_date}'::date + interval '{label_timespan}'\n",
+ " GROUP BY 1\n",
+ " )\n",
+ " SELECT c.entity_id,\n",
+ " CASE \n",
+ " WHEN COALESCE(d.total_donation, 0) >= p.total_asking_price THEN 0\n",
+ " ELSE 1\n",
+ " END AS outcome \n",
+ " FROM cohort_query c\n",
+ " JOIN data.projects p USING(entity_id)\n",
+ " LEFT JOIN cohort_donations d using(entity_id)\n",
+ " ;\n",
+ " \"\"\".format(as_of_date=example_as_of_date, label_timespan=example_label_timespan), db_engine)"
+ ],
+ "execution_count": 8,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ " entity_id outcome\n",
+ "0 234035 1\n",
+ "1 234148 0\n",
+ "2 234234 0"
+ ],
+ "text/html": [
+ "\n",
+ "
\n",
+ " "
+ ]
+ },
+ "metadata": {},
+ "execution_count": 8
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "MLeE7N-NSWam"
+ },
+ "source": [
+ "A little more complicated than our cohort query, but still reasonably straightforward: we start with the cohort defined above, then find all the donations to those projects within the label timespan (e.g., the following 4 months after posting), and finally compare that to the total price of the project to create a binary classification label for whether or not the project was fully funded.\n",
+ "\n",
+ "Notice here that because we will intervene on projects at risk for **NOT** being fully funded, we define this as our class 1 label while those that do reach their funding goal are given class 0.\n",
+ "\n",
+ "As with the cohort, we'll be able to specify this label query directly to triage in our yaml configuation:\n",
+ "```\n",
+ "label_config:\n",
+ " query: |\n",
+ " WITH cohort_query AS (\n",
+ " SELECT distinct(entity_id)\n",
+ " FROM data.projects\n",
+ " WHERE date_posted = '{as_of_date}'::date - interval '1day'\n",
+ " )\n",
+ " , cohort_donations AS (\n",
+ " SELECT \n",
+ " c.entity_id, \n",
+ " COALESCE(SUM(d.donation_to_project), 0) AS total_donation\n",
+ " FROM cohort_query c\n",
+ " LEFT JOIN data.donations d \n",
+ " ON c.entity_id = d.entity_id\n",
+ " AND d.donation_timestamp \n",
+ " BETWEEN '{as_of_date}'::date - interval '1day'\n",
+ " AND '{as_of_date}'::date + interval '{label_timespan}'\n",
+ " GROUP BY 1\n",
+ " )\n",
+ " SELECT c.entity_id,\n",
+ " CASE \n",
+ " WHEN COALESCE(d.total_donation, 0) >= p.total_asking_price THEN 0\n",
+ " ELSE 1\n",
+ " END AS outcome \n",
+ " FROM cohort_query c\n",
+ " JOIN data.projects p USING(entity_id)\n",
+ " LEFT JOIN cohort_donations d using(entity_id)\n",
+ "\n",
+ " name: 'fully_funded'\n",
+ "```\n",
+ "\n",
+ "For more details these two pieces of the modeling pipeline, see the [cohort and label deep dive in the triage docs](https://dssg.github.io/triage/experiments/cohort-labels/). "
+ ]
+ },
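+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Before moving on, it can be helpful to sanity-check the label by looking at its base rate, i.e., the fraction of projects in the cohort that end up in class 1 (not fully funded), since this provides useful context for the precision metrics discussed later. The snippet below is a minimal sketch rather than part of the triage configuration; it assumes the output of the label query above has been stored in a DataFrame named `labels_df` (a hypothetical name, e.g., by assigning the `pd.read_sql(...)` result to it):\n",
+ "\n",
+ "```\n",
+ "# hypothetical: labels_df holds the result of the label query shown above\n",
+ "base_rate = labels_df['outcome'].mean()  # share of the cohort not fully funded (class 1)\n",
+ "print(f'{len(labels_df)} projects in cohort, base rate = {base_rate:.1%}')\n",
+ "```"
+ ]
+ },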
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "3RTN8p6VWQ8z"
+ },
+ "source": [
+ "### Dealing with Time\n",
+ "\n",
+ "As noted above, `triage` is designed for problems where the desire to generalize to future data and therefore is careful to respect the temporal nature of the problem. This is particularly salient in two places: defining the validation strategy for model evaluation and ensuring that features only make use of information available at the time of analysis/prediction.\n",
+ "\n",
+ "For validation, the idea is generally simple: models should be trained on historical data and validated on future data. As such, `triage` constructs validation splits that reflect this process by using a certain point in time as the cut-off between training and validation and then moving this cut-off back through the data to generate multiple splits. The implementation is a bit more complicated and relies on several parameters, the details of which we won't go deep into here, but you can find a much deeper discussion in the [longer \"dirty duck\" tutorial](https://dssg.github.io/triage/dirtyduck/triage_intro/) as well as in the [experiment config docs](https://dssg.github.io/triage/experiments/experiment-config/).\n",
+ "\n",
+ "![temporal figure](https://dssg.github.io/triage/experiments/temporal_config_graph.png)\n",
+ "\n",
+ "In short, these parameters are (illustrated across three training/validation splits in the figure above):\n",
+ "- feature start/end times: what range of history is feature information available for?\n",
+ "- label start/end times: what range of history is outcome (label) data available for?\n",
+ "- model update frequency: what is the interval between refreshes of the model?\n",
+ "- test durations: over what time period will the model be in use for making predictions?\n",
+ "- max training history: how much historical data should be used for model training (that is, for rows/examples)?\n",
+ "- training/test as_of_date frequencies: within a training or validation (test) set, how frequently should cohorts be sampled?\n",
+ "- training/test label timespans: over what time horizon are labels (outcomes) collected?\n",
+ "\n",
+ "As with the cohorts and labels, these parameters are specified to `triage` via its yaml configuration file. Here's what this will look like for our setting:\n",
+ "```\n",
+ "temporal_config:\n",
+ "\n",
+ " # first date our feature data is good\n",
+ " feature_start_time: '2000-01-01'\n",
+ " feature_end_time: '2013-06-01'\n",
+ "\n",
+ " # first date our label data is good\n",
+ " # donorschoose: as far back as we have good donation data\n",
+ " label_start_time: '2011-09-02'\n",
+ " label_end_time: '2013-06-01'\n",
+ "\n",
+ " model_update_frequency: '4month'\n",
+ "\n",
+ " # length of time defining a test set\n",
+ " test_durations: ['3month']\n",
+ " # defines how far back a training set reaches\n",
+ " max_training_histories: ['1y']\n",
+ "\n",
+ " # we sample every day, since new projects are posted\n",
+ " # every day\n",
+ " training_as_of_date_frequencies: ['1day']\n",
+ " test_as_of_date_frequencies: ['1day']\n",
+ " \n",
+ " # when posted project timeout\n",
+ " label_timespans: ['3month']\n",
+ "```"
+ ]
+ },
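+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If you'd like to inspect exactly which train/validation splits a `temporal_config` like this implies before running a full experiment, triage's `timechop` component can be used on its own. The sketch below mirrors the parameters above; note that we're assuming the standalone `Timechop` interface, where the label timespan is passed separately as `training_label_timespans` and `test_label_timespans`, and that `visualize_chops` needs matplotlib available:\n",
+ "\n",
+ "```\n",
+ "from triage.component.timechop import Timechop\n",
+ "from triage.component.timechop.plotting import visualize_chops\n",
+ "\n",
+ "chopper = Timechop(\n",
+ "    feature_start_time='2000-01-01',\n",
+ "    feature_end_time='2013-06-01',\n",
+ "    label_start_time='2011-09-02',\n",
+ "    label_end_time='2013-06-01',\n",
+ "    model_update_frequency='4month',\n",
+ "    training_as_of_date_frequencies=['1day'],\n",
+ "    max_training_histories=['1y'],\n",
+ "    training_label_timespans=['3month'],\n",
+ "    test_as_of_date_frequencies=['1day'],\n",
+ "    test_durations=['3month'],\n",
+ "    test_label_timespans=['3month'],\n",
+ ")\n",
+ "\n",
+ "splits = chopper.chop_time()  # list of dicts, one per train/validation split\n",
+ "print(len(splits), 'splits')\n",
+ "visualize_chops(chopper)      # plots the splits, similar to the figure above\n",
+ "```"
+ ]
+ },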
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "kQtPjBJUaaVJ"
+ },
+ "source": [
+ "### Model Evaluation Metrics\n",
+ "\n",
+ "The temporal configuration described above will create several training and validation splits that can be used to estimate the generalization performance of your models and select a model specification to use going forward. In order to do so, of course, you need to choose an appropriate metric (or metrics) by which to evaluate your models. `triage` can use any of the metrics specified by `sklearn` and in general you'll want to focus on those that best reflect the goals, constraints, and deployment scenario of your project. For instance, in our example project, DonorsChoose can help only 10% of the projects posted to the site, so a metric like precision in the top 10% would reflect how efficiently these limited resources are being allocated to projects that would not be fully funded without additional support.\n",
+ "\n",
+ "Although we might want to focus on `precision@10%` as our primary metric, often it can be helpful to look at both precision and recall at a range of thresholds (both percentiles and absolute numbers) both for the purposes of debugging and understanding how sensitive your results are to the available resources, describing a \"menu\" of policy choices.\n",
+ "\n",
+ "The `scoring` section of the yaml configuration file allows you specify separate evaluation metrics for both the training and validation set results, indicating both the type of metric (e.g., `precision`, `recall`, etc) and, where needed, the thresholds at which to calculate them. Here's what that looks like for our example project:\n",
+ "\n",
+ "```\n",
+ "scoring:\n",
+ " testing_metric_groups:\n",
+ " -\n",
+ " metrics: [precision@, recall@]\n",
+ " thresholds:\n",
+ " percentiles: [1, 2, 3, 4, 5, 6, 7, 8, 9, \n",
+ " 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,\n",
+ " 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, \n",
+ " 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, \n",
+ " 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,\n",
+ " 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,\n",
+ " 60, 61, 62, 63, 64, 65, 66, 67, 68, 69,\n",
+ " 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,\n",
+ " 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,\n",
+ " 90, 91, 92, 93, 94, 95, 96, 97, 98, 99,\n",
+ " 100]\n",
+ " top_n: [25, 50, 100]\n",
+ "\n",
+ " training_metric_groups:\n",
+ " -\n",
+ " metrics: [precision@, recall@]\n",
+ " thresholds:\n",
+ " percentiles: [1, 2, 3, 4, 5, 6, 7, 8, 9, \n",
+ " 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,\n",
+ " 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, \n",
+ " 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, \n",
+ " 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,\n",
+ " 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,\n",
+ " 60, 61, 62, 63, 64, 65, 66, 67, 68, 69,\n",
+ " 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,\n",
+ " 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,\n",
+ " 90, 91, 92, 93, 94, 95, 96, 97, 98, 99,\n",
+ " 100]\n",
+ " top_n: [25, 50, 100]\n",
+ "```"
+ ]
+ },
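+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To make these threshold-based metrics concrete, here's a small illustration of what precision and recall at the top 10% mean, using plain pandas on made-up scores and labels (this is just a sketch of the arithmetic, not how triage computes its evaluations internally): sort entities by predicted score, take the top 10%, and compute precision and recall within that set.\n",
+ "\n",
+ "```\n",
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "\n",
+ "# made-up scores and true labels for 1,000 entities\n",
+ "rng = np.random.default_rng(0)\n",
+ "df = pd.DataFrame({'score': rng.random(1000), 'label': rng.integers(0, 2, 1000)})\n",
+ "\n",
+ "k = int(len(df) * 0.10)  # size of the top 10% of the list\n",
+ "top_k = df.sort_values('score', ascending=False).head(k)\n",
+ "\n",
+ "precision_at_10pct = top_k['label'].mean()                   # true positives / k\n",
+ "recall_at_10pct = top_k['label'].sum() / df['label'].sum()   # true positives / all positives\n",
+ "print(precision_at_10pct, recall_at_10pct)\n",
+ "```"
+ ]
+ },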
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "4cY_UC4Taey9"
+ },
+ "source": [
+ "### Defining Features\n",
+ "\n",
+ "Feature generation is typically the most important aspect of how well your machine learning models will work, so `triage` provides considerable flexibility for feature definition. However, this also means that this section of the configuration file can be particularly complicated and may require some experimentation to get familiar with. A few resources may be helpful for a deeper look at how features work in `triage`:\n",
+ "- [Feature Definition in the Quickstart Guide](https://dssg.github.io/triage/triage_project_workflow/#define-some-additional-features)\n",
+ "- [Feature Generation in the triage Documentation](https://dssg.github.io/triage/experiments/experiment-config/#feature-generation)\n",
+ "- [Features in the Example Configuration File](https://github.com/dssg/triage/blob/master/example/config/experiment.yaml#L102)\n",
+ "\n",
+ "Features in `triage` are defined in blocks, grouping together features drawn from the same data source and allowing several related features to be constructed in a very compact format. Each of these blocks is a list item under the `feature_aggregations` section of your yaml configuration file and contains the following information:\n",
+ "- A `prefix` that is used to identify the group of features.\n",
+ "- A `from_obj` that specifies the underlying information used to construct the features in this group. This can be either a table or a query in itself (in the later case, be sure to give it an alias) and must contain both an `entity_id` column as well as a date column indicating when the information was known, identified to the feature config as the `knowledge_date_column`.\n",
+ "- Information about how missing values should be imputed (see the documentation for details and available options here).\n",
+ "- Definitions of the feature quantities/columns themselves, specified either as `aggregates` or `categoricals`, including the `metrics` for aggregations over time (e.g., `sum`, `max`, `avg`, etc).\n",
+ "- Time ranges over which to calculate feature information, called `intervals` (e.g., last 6 months, last 5 years, etc.)\n",
+ "- A level of aggregation for feature information (`groups`) -- this will almost always be just `entity_id`.\n",
+ "\n",
+ "🚧 NOTE: All features in `triage` are temporal aggregates. Just as `triage` is designed to carefully account for time in temporal cross-validation, it also does so in feature construction focusing on what information was known at training or validation time. Even features you might generally consider \"static\" need to be associated with a knowledge date for these purposes as well as an aggregation metric. This is also true for categoricals, which are first one-hot encoded from each instance then aggregated over the given time interval with the specified metric. For instance, if a patient has had several hospital stays with different primary diagnosis codes at each stay, a categorical feature using a `sum` aggregation would yield a count of how many stays had a given diagnosis while a `max` aggregation would provide an indicator of whether a given diagnosis was ever present. \n",
+ "\n",
+ "For aggregations of numeric features, the resulting feature names will have the format: \n",
+ "`{prefix}_entity_id_{interval}_{quantity}_{metric}`\n",
+ "\n",
+ "For categoricals, the feature names will include each categorical value after one-hot encoding:\n",
+ "`{prefix}_entity_id_{interval}_{quantity}_{value}_{metric}`\n",
+ "\n",
+ "🚧 WARNING: Because `triage`'s features are stored in a `postgres` database, this naming convention can sometimes run afoul of the database's 63 character limit for column names, leading to truncation. When this happens, you might encounter errors indicating a given feature column appears to be missing. This can be common with categoricals with particularly long values, so recoding can be useful in those cases (as can choosing shorter prefix names).\n",
+ "\n",
+ "For illustrative purposes here, we'll start with a single feature group including one categorical and continuous aggregate feature: the primary resource type for the project and the amount being asked for. Because these are both specified once at project posting time, we simply aggegate them over all time (that is, using `all` for our `interval`). Here's how we specify this in our feature configuration:\n",
+ "\n",
+ "```\n",
+ "feature_aggregations:\n",
+ " -\n",
+ " prefix: 'project_features'\n",
+ " from_obj: 'data.projects'\n",
+ " knowledge_date_column: 'date_posted'\n",
+ "\n",
+ " aggregates_imputation:\n",
+ " all:\n",
+ " type: 'zero'\n",
+ "\n",
+ " categoricals_imputation:\n",
+ " all:\n",
+ " type: 'null_category' \n",
+ "\n",
+ " categoricals:\n",
+ " -\n",
+ " column: 'resource_type'\n",
+ " metrics:\n",
+ " - 'max' \n",
+ " choice_query: 'select distinct resource_type from data.projects'\n",
+ " \n",
+ " aggregates:\n",
+ " -\n",
+ " quantity: 'total_asking_price'\n",
+ " metrics:\n",
+ " - 'sum'\n",
+ " \n",
+ " # Since our time-aggregate features are precomputed, feature interval is \n",
+ " # irrelvant. We keep 'all' as a default.\n",
+ " intervals: ['all'] \n",
+ " groups: ['entity_id']\n",
+ "```"
+ ]
+ },
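+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a small illustration of the categorical aggregation described above (entirely outside triage, with made-up data), the sketch below shows how one-hot encoding followed by a `max` aggregation produces ever-present indicators per entity, and lists the column names we'd expect triage to generate for this feature block based on the naming convention above:\n",
+ "\n",
+ "```\n",
+ "import pandas as pd\n",
+ "\n",
+ "# toy events: one row per (entity, category value) observation\n",
+ "events = pd.DataFrame({\n",
+ "    'entity_id':     [1, 1, 2],\n",
+ "    'resource_type': ['Books', 'Technology', 'Books'],\n",
+ "})\n",
+ "\n",
+ "# one-hot encode the categorical, then aggregate per entity with max\n",
+ "onehot = pd.get_dummies(events, columns=['resource_type'])\n",
+ "print(onehot.groupby('entity_id').max())\n",
+ "\n",
+ "# expected column names for the feature block above (interval 'all'):\n",
+ "#   project_features_entity_id_all_total_asking_price_sum\n",
+ "#   project_features_entity_id_all_resource_type_Books_max   (one column per category value)\n",
+ "```"
+ ]
+ },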
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "0YNaIVySaj2Z"
+ },
+ "source": [
+ "### Model and Hyperparameter Grid\n",
+ "\n",
+ "You specify the types of models you want to explore, along with their hyperparameters, in the `grid_config` section of your yaml configuration file. Because there's generally no way to know a priori what model specification will work best for a given problem, `triage` makes it easy to run and explore an extensive grid by providing lists of values for each hyperparameter and training models for the full cross-product of these values.\n",
+ "\n",
+ "Currently, `triage` can work with any classifiction method with an `sklearn`-style interface. In addition to machine learning algorithms found in standard packages, `triage` includes a couple of built-in methods you might find useful:\n",
+ "- `ScaledLogisticRegression` wraps the `sklearn` logistic regression with a min-max scaler to ensure that the input features are on the same scale for regularization. It accepts the same hyperparameters as the underlying `sklearn` method.\n",
+ "- `BaselineRankMultiFeature` is a simple baseline method that ranks examples by one or more features, replicating a comonsense approach that could be taken without making use of machine learning. This method takes a single hyperparameter, `rules`, specified as a list of dictionaries with the keys `feature` and `low_value_high_score` to specify the directin of the ranking. Examples are sorted first by the first feature in this list, then the next, and so on.\n",
+ "- `SimpleThresholder` is another basic baseline method, allowing you to specify a heuristic, rule-based approach to classifying examples. It uses two hyperparameters: a list of `rules` (e.g., `feature_1 > 3`) and a `logical_operator` (e.g., `and` or `or`) to specify how the rules are combined.\n",
+ "\n",
+ "To specify a model type in your grid config, you use the model's class path as a key and each hyperparameter as a key another level down. For example:\n",
+ "```\n",
+ "'module.submodule.ClassName':\n",
+ " param_1: [1,3,5,10,20]\n",
+ " param_2: [100, 500, 1000]\n",
+ "```\n",
+ "\n",
+ "For our purposes here, we'll start with a very small grid that can run quickly in a colab notebook. Here's how that will look:\n",
+ "```\n",
+ "grid_config:\n",
+ " 'sklearn.ensemble.RandomForestClassifier':\n",
+ " n_estimators: [150]\n",
+ " max_depth: [50]\n",
+ " min_samples_split: [25]\n",
+ " \n",
+ " 'sklearn.tree.DecisionTreeClassifier':\n",
+ " max_depth: [3]\n",
+ " max_features: [null]\n",
+ " min_samples_split: [25]\n",
+ " \n",
+ " 'triage.component.catwalk.estimators.classifiers.ScaledLogisticRegression':\n",
+ " C: [0.1]\n",
+ " penalty: ['l1']\n",
+ " \n",
+ " 'triage.component.catwalk.baselines.rankers.BaselineRankMultiFeature':\n",
+ " rules:\n",
+ " - [{feature: 'project_features_entity_id_all_total_asking_price_sum', low_value_high_score: False}]\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "-lOxjerFm79c"
+ },
+ "source": [
+ "### Auditing Models for Bias\n",
+ "\n",
+ "The final section of the configuration file specifies how you want to evaluate your models for bias and fairness using the [aequitas](http://www.datasciencepublicpolicy.org/our-work/tools-guides/aequitas/) toolkit. In order to do so, you need to tell `triage` what attributes are relavent for bias audits, a table or SQL query to specify these attributes (the `from_obj`), and the reference group to calculate disparities relative to (the value for this group will serve as the denominator for disparity calculations. Like the evaluation metrics described above, you'll also need to specify the set of thresholds against which you want to calculate fairness metrics. Note that `aequitas` will calculate the full range of confusion matrix-derived disparity metrics for all of your models, allowing you to explore how your models perform under different conceptualizations of fairness.\n",
+ "\n",
+ "To illustrate the use of a bias audit in our example project, we'll look at the `teacher_prefix` attribute as a proxy for the sex of the teacher, using `Mr.` as a reference group. Note that the `from_obj_table` will be joined using an `entity_id` and `as_of_date`, so you must specify a `knowledge_date_column` in the config, as some attributes (or your knowledge of them) might change over time. `aequitas` will use the most recent value of the attribute it finds for a given entity prior to the specified `as_of_date`. Here's how we turn that into a section in our configuration yaml:\n",
+ "\n",
+ "```\n",
+ "bias_audit_config:\n",
+ " from_obj_table: 'data.projects'\n",
+ " attribute_columns:\n",
+ " - 'teacher_prefix'\n",
+ " knowledge_date_column: 'date_posted'\n",
+ " entity_id_column: 'entity_id'\n",
+ " ref_groups_method: 'predefined'\n",
+ " ref_groups:\n",
+ " 'teacher_prefix': 'Mr.'\n",
+ " thresholds:\n",
+ " percentiles: [5, 10, 15, 20, 25, 50, 100]\n",
+ " top_n: [25, 50, 100]\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "7ZSSdXx-NcTE"
+ },
+ "source": [
+ "## Running Triage\n",
+ "\n",
+ "Now that we've walked through the various aspects of configuring triage, we're ready to run our model grid! In order to do so, we need three pieces:\n",
+ "- Our configuration file, pulling together the elements described above into a single yaml file we'll call `experiment_config.yaml` (in `triage`, an \"experiment\" is a run with a set of parameters and model types).\n",
+ "- Credentials for connecting to your database, stored in a configuration file called `database.yaml` (alternatively, you can specify them through environment variables)\n",
+ "- Code to run your `triage` experiment. This can be done via either a command line tool or python interface, the latter of which provides more flexibility so we'll focus on that approach here with a short python script called `run.py`.\n",
+ "\n",
+ "The following three sections provides the contents of each of these three files for our DonorsChoose project. In a real project, of course, these would be stored as separate files on your system, but here we include them inline. The `run.py` sets up logging, connects to the database, loads your configuration file, and creates and runs a `MultiCoreExperiment` object from `triage`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "XX-inX6o7QBE"
+ },
+ "source": [
+ "### experiment_config.yaml"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "RdDjQCovS1GG"
+ },
+ "source": [
+ "config_yaml = \"\"\"\n",
+ "config_version: 'v7'\n",
+ "\n",
+ "model_comment: 'triage demo'\n",
+ "\n",
+ "random_seed: 1995\n",
+ "\n",
+ "temporal_config:\n",
+ "\n",
+ " # first date our feature data is good\n",
+ " feature_start_time: '2000-01-01'\n",
+ " feature_end_time: '2013-06-01'\n",
+ "\n",
+ " # first date our label data is good\n",
+ " # donorschoose: as far back as we have good donation data\n",
+ " label_start_time: '2011-09-02'\n",
+ " label_end_time: '2013-06-01'\n",
+ "\n",
+ " model_update_frequency: '4month'\n",
+ "\n",
+ " # length of time defining a test set\n",
+ " test_durations: ['3month']\n",
+ " # defines how far back a training set reaches\n",
+ " max_training_histories: ['1y']\n",
+ "\n",
+ " # we sample every day, since new projects are posted\n",
+ " # every day\n",
+ " training_as_of_date_frequencies: ['1day']\n",
+ " test_as_of_date_frequencies: ['1day']\n",
+ " \n",
+ " # when posted project timeout\n",
+ " label_timespans: ['3month']\n",
+ " \n",
+ "\n",
+ "cohort_config:\n",
+ " query: |\n",
+ " SELECT distinct(entity_id)\n",
+ " FROM data.projects\n",
+ " WHERE date_posted = '{as_of_date}'::date - interval '1day'\n",
+ "\n",
+ "label_config:\n",
+ " query: |\n",
+ " WITH cohort_query AS (\n",
+ " SELECT distinct(entity_id)\n",
+ " FROM data.projects\n",
+ " WHERE date_posted = '{as_of_date}'::date - interval '1day'\n",
+ " )\n",
+ " , cohort_donations AS (\n",
+ " SELECT \n",
+ " c.entity_id, \n",
+ " COALESCE(SUM(d.donation_to_project), 0) AS total_donation\n",
+ " FROM cohort_query c\n",
+ " LEFT JOIN data.donations d \n",
+ " ON c.entity_id = d.entity_id\n",
+ " AND d.donation_timestamp \n",
+ " BETWEEN '{as_of_date}'::date - interval '1day'\n",
+ " AND '{as_of_date}'::date + interval '{label_timespan}'\n",
+ " GROUP BY 1\n",
+ " )\n",
+ " SELECT c.entity_id,\n",
+ " CASE \n",
+ " WHEN COALESCE(d.total_donation, 0) >= p.total_asking_price THEN 0\n",
+ " ELSE 1\n",
+ " END AS outcome \n",
+ " FROM cohort_query c\n",
+ " JOIN data.projects p USING(entity_id)\n",
+ " LEFT JOIN cohort_donations d using(entity_id)\n",
+ "\n",
+ " name: 'fully_funded'\n",
+ "\n",
+ "\n",
+ "feature_aggregations:\n",
+ " -\n",
+ " prefix: 'project_features'\n",
+ " from_obj: 'data.projects'\n",
+ " knowledge_date_column: 'date_posted'\n",
+ "\n",
+ " aggregates_imputation:\n",
+ " all:\n",
+ " type: 'zero'\n",
+ "\n",
+ " categoricals_imputation:\n",
+ " all:\n",
+ " type: 'null_category' \n",
+ "\n",
+ " categoricals:\n",
+ " -\n",
+ " column: 'resource_type'\n",
+ " metrics:\n",
+ " - 'max' \n",
+ " choice_query: 'select distinct resource_type from data.projects'\n",
+ " \n",
+ " aggregates:\n",
+ " -\n",
+ " quantity: 'total_asking_price'\n",
+ " metrics:\n",
+ " - 'sum'\n",
+ " \n",
+ " # Since our time-aggregate features are precomputed, feature interval is \n",
+ " # irrelvant. We keep 'all' as a default.\n",
+ " intervals: ['all'] \n",
+ " groups: ['entity_id']\n",
+ "\n",
+ "grid_config:\n",
+ " 'sklearn.ensemble.RandomForestClassifier':\n",
+ " n_estimators: [150]\n",
+ " max_depth: [50]\n",
+ " min_samples_split: [25]\n",
+ " \n",
+ " 'sklearn.tree.DecisionTreeClassifier':\n",
+ " max_depth: [3]\n",
+ " max_features: [null]\n",
+ " min_samples_split: [25]\n",
+ " \n",
+ " 'triage.component.catwalk.estimators.classifiers.ScaledLogisticRegression':\n",
+ " C: [0.1]\n",
+ " penalty: ['l1']\n",
+ " \n",
+ " 'triage.component.catwalk.baselines.rankers.BaselineRankMultiFeature':\n",
+ " rules:\n",
+ " - [{feature: 'project_features_entity_id_all_total_asking_price_sum', low_value_high_score: False}]\n",
+ "\n",
+ "\n",
+ "scoring:\n",
+ " testing_metric_groups:\n",
+ " -\n",
+ " metrics: [precision@, recall@]\n",
+ " thresholds:\n",
+ " percentiles: [1, 2, 3, 4, 5, 6, 7, 8, 9, \n",
+ " 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,\n",
+ " 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, \n",
+ " 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, \n",
+ " 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,\n",
+ " 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,\n",
+ " 60, 61, 62, 63, 64, 65, 66, 67, 68, 69,\n",
+ " 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,\n",
+ " 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,\n",
+ " 90, 91, 92, 93, 94, 95, 96, 97, 98, 99,\n",
+ " 100]\n",
+ " top_n: [25, 50, 100]\n",
+ "\n",
+ " training_metric_groups:\n",
+ " -\n",
+ " metrics: [precision@, recall@]\n",
+ " thresholds:\n",
+ " percentiles: [1, 2, 3, 4, 5, 6, 7, 8, 9, \n",
+ " 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,\n",
+ " 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, \n",
+ " 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, \n",
+ " 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,\n",
+ " 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,\n",
+ " 60, 61, 62, 63, 64, 65, 66, 67, 68, 69,\n",
+ " 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,\n",
+ " 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,\n",
+ " 90, 91, 92, 93, 94, 95, 96, 97, 98, 99,\n",
+ " 100]\n",
+ " top_n: [25, 50, 100]\n",
+ " \n",
+ "bias_audit_config:\n",
+ " from_obj_table: 'data.projects'\n",
+ " attribute_columns:\n",
+ " - 'teacher_prefix'\n",
+ " knowledge_date_column: 'date_posted'\n",
+ " entity_id_column: 'entity_id'\n",
+ " ref_groups_method: 'predefined'\n",
+ " ref_groups:\n",
+ " 'teacher_prefix': 'Mr.'\n",
+ " thresholds:\n",
+ " percentiles: [5, 10, 15, 20, 25, 50, 100]\n",
+ " top_n: [25, 50, 100]\n",
+ "\n",
+ "individual_importance:\n",
+ " methods: [] # empty list means don't calculate individual importances\n",
+ " n_ranks: 1 \n",
+ "\"\"\""
+ ],
+ "execution_count": 9,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "qcqeRvUT7V1D"
+ },
+ "source": [
+ "### database.yaml"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "3i88ZFQupfp6"
+ },
+ "source": [
+ "database_yaml = \"\"\"\n",
+ "host: localhost\n",
+ "user: postgres\n",
+ "db: donors_choose\n",
+ "pass: postgres\n",
+ "port: 5432\n",
+ "role: postgres\n",
+ "\"\"\""
+ ],
+ "execution_count": 10,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "wHR9wnAw7d__"
+ },
+ "source": [
+ "### run.py"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "jYzBKFG3qDhQ",
+ "outputId": "31af303e-74df-4ce9-8332-09a96593ba07"
+ },
+ "source": [
+ "import yaml\n",
+ "\n",
+ "from sqlalchemy.engine.url import URL\n",
+ "from triage.util.db import create_engine\n",
+ "from triage.experiments import MultiCoreExperiment\n",
+ "import logging\n",
+ "\n",
+ "import os\n",
+ "\n",
+ "from sqlalchemy.event import listens_for\n",
+ "from sqlalchemy.pool import Pool\n",
+ "\n",
+ "def run_triage():\n",
+ "\n",
+ " # andrew_id = os.getenv('USER')\n",
+ " # user_path = os.path.join('/data/users/', andrew_id)\n",
+ " user_path = '/content'\n",
+ "\n",
+ " # add logging to a file (it will also go to stdout via triage logging config)\n",
+ " log_filename = os.path.join(user_path, 'triage.log')\n",
+ " logger = logging.getLogger('')\n",
+ " hdlr = logging.FileHandler(log_filename)\n",
+ " hdlr.setLevel(15) # verbose level\n",
+ " hdlr.setFormatter(logging.Formatter('%(name)-30s %(asctime)s %(levelname)10s %(process)6d %(filename)-24s %(lineno)4d: %(message)s', '%d/%m/%Y %I:%M:%S %p'))\n",
+ " logger.addHandler(hdlr)\n",
+ "\n",
+ " # creating database engine\n",
+ " # dbfile = os.path.join(user_path, 'database.yaml')\n",
+ "\n",
+ " # with open(dbfile, 'r') as dbf:\n",
+ " # dbconfig = yaml.safe_load(dbf)\n",
+ "\n",
+ " dbconfig = yaml.safe_load(database_yaml)\n",
+ " print(dbconfig['role'])\n",
+ "\n",
+ " # assume group role to ensure shared permissions\n",
+ " @listens_for(Pool, \"connect\")\n",
+ " def assume_role(dbapi_con, connection_record):\n",
+ " logging.debug(f\"setting role {dbconfig['role']};\")\n",
+ " dbapi_con.cursor().execute(f\"set role {dbconfig['role']};\")\n",
+ " # logging.debug(f\"setting role postres;\")\n",
+ " # dbapi_con.cursor().execute(f\"set role postgres;\")\n",
+ "\n",
+ " db_url = URL(\n",
+ " 'postgres',\n",
+ " host=dbconfig['host'],\n",
+ " username=dbconfig['user'],\n",
+ " database=dbconfig['db'],\n",
+ " password=dbconfig['pass'],\n",
+ " port=dbconfig['port'],\n",
+ " )\n",
+ "\n",
+ " db_engine = create_engine(db_url)\n",
+ "\n",
+ " triage_output_path = os.path.join(user_path, 'triage_output')\n",
+ " os.makedirs(triage_output_path, exist_ok=True)\n",
+ "\n",
+ " # loading config file\n",
+ " # with open('%s_triage_config.yaml' % andrew_id, 'r') as fin:\n",
+ " # config = yaml.safe_load(fin)\n",
+ "\n",
+ " config = yaml.safe_load(config_yaml)\n",
+ "\n",
+ " # creating experiment object\n",
+ " experiment = MultiCoreExperiment(\n",
+ " config = config,\n",
+ " db_engine = db_engine,\n",
+ " project_path = triage_output_path,\n",
+ " n_processes=2,\n",
+ " n_bigtrain_processes=1,\n",
+ " n_db_processes=2,\n",
+ " replace=True,\n",
+ " save_predictions=True\n",
+ " )\n",
+ "\n",
+ " # experiment.validate()\n",
+ " experiment.run()"
+ ],
+ "execution_count": 11,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "/usr/local/lib/python3.7/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.\n",
+ " import pandas.util.testing as tm\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "FyqBcKHk7lTC"
+ },
+ "source": [
+ "### Let's run triage!\n",
+ "\n",
+ "With these three files in place, we can simply run our model grid by calling `run_triage()`. Doing so will train and validate the four model specifications described above across three temporal validation splits. The run will output a log of its progress and store results into the postgres database."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "ZUAcMwv2qzLe",
+ "outputId": "957d5b08-2230-4b7d-a7f1-4b7399d5d1c5"
+ },
+ "source": [
+ "run_triage()"
+ ],
+ "execution_count": 12,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "postgres\n",
+ "\u001b[32m2022-06-21 18:43:50\u001b[0m - \u001b[1;30mVERBOSE\u001b[0m \u001b[34mMatrices and trained models will be saved in /content/triage_output\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:43:50\u001b[0m - \u001b[1;30m NOTICE\u001b[0m \u001b[35mReplace flag is set to true. Matrices, models, evaluations and predictions (if they exist) will be replaced\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:43:50\u001b[0m - \u001b[1;30m INFO\u001b[0m No results_schema_versions table exists, which means that this installation is fresh. Upgrading db.\n",
+ "\u001b[32m2022-06-21 18:43:50\u001b[0m - \u001b[1;30m INFO\u001b[0m Context impl PostgresqlImpl.\n",
+ "\u001b[32m2022-06-21 18:43:50\u001b[0m - \u001b[1;30m INFO\u001b[0m Will assume transactional DDL.\n",
+ "\u001b[32m2022-06-21 18:43:50\u001b[0m - \u001b[1;30m INFO\u001b[0m Running upgrade -> 8b3f167d0418, empty message\n",
+ "\u001b[32m2022-06-21 18:43:50\u001b[0m - \u001b[1;30m INFO\u001b[0m Running upgrade 8b3f167d0418 -> 0d44655e35fd, empty message\n",
+ "\u001b[32m2022-06-21 18:43:50\u001b[0m - \u001b[1;30m INFO\u001b[0m Running upgrade 0d44655e35fd -> 264245ddfce2, empty message\n",
+ "\u001b[32m2022-06-21 18:43:50\u001b[0m - \u001b[1;30m INFO\u001b[0m Running upgrade 264245ddfce2 -> 72ac5cbdca05, Change importance to float\n",
+ "\u001b[32m2022-06-21 18:43:50\u001b[0m - \u001b[1;30m INFO\u001b[0m Running upgrade 72ac5cbdca05 -> 7d57d1cf3429, empty message\n",
+ "\u001b[32m2022-06-21 18:43:50\u001b[0m - \u001b[1;30m INFO\u001b[0m Running upgrade 7d57d1cf3429 -> 89a8ce240bae, Split results into model_metadata, test_results, and train_resultss\n",
+ "\u001b[32m2022-06-21 18:43:50\u001b[0m - \u001b[1;30m INFO\u001b[0m Running upgrade 89a8ce240bae -> 2446a931de7a, Changing column names and removing redundancies in table names\n",
+ "\u001b[32m2022-06-21 18:43:50\u001b[0m - \u001b[1;30m INFO\u001b[0m Running upgrade 2446a931de7a -> d0ac573eaf1a, model_group_stored_procedure\n",
+ "\u001b[32m2022-06-21 18:43:50\u001b[0m - \u001b[1;30m INFO\u001b[0m Running upgrade d0ac573eaf1a -> 38f37d013686, Associate experiments with models and matrices\n",
+ "\u001b[32m2022-06-21 18:43:50\u001b[0m - \u001b[1;30m INFO\u001b[0m Running upgrade 38f37d013686 -> 0bca1ba9706e, add_matrix_uuid_to_eval\n",
+ "\u001b[32m2022-06-21 18:43:50\u001b[0m - \u001b[1;30m INFO\u001b[0m Running upgrade 0bca1ba9706e -> 50e1f1bc2cac, empty message\n",
+ "\u001b[32m2022-06-21 18:43:50\u001b[0m - \u001b[1;30m INFO\u001b[0m Running upgrade 50e1f1bc2cac -> cfd5c3386014, add_experiment_runs\n",
+ "\u001b[32m2022-06-21 18:43:50\u001b[0m - \u001b[1;30m INFO\u001b[0m Running upgrade cfd5c3386014 -> 97cf99b7348f, evaluation_randomness\n",
+ "\u001b[32m2022-06-21 18:43:50\u001b[0m - \u001b[1;30m INFO\u001b[0m Running upgrade 97cf99b7348f -> 609c7cc51794, rankify_predictions\n",
+ "\u001b[32m2022-06-21 18:43:50\u001b[0m - \u001b[1;30m INFO\u001b[0m Running upgrade 609c7cc51794 -> b4d7569d31cb, aequitas\n",
+ "\u001b[32m2022-06-21 18:43:50\u001b[0m - \u001b[1;30m INFO\u001b[0m Running upgrade b4d7569d31cb -> 8cef808549dd, empty message\n",
+ "\u001b[32m2022-06-21 18:43:50\u001b[0m - \u001b[1;30m INFO\u001b[0m Running upgrade 8cef808549dd -> a20104116533, empty message\n",
+ "\u001b[32m2022-06-21 18:43:50\u001b[0m - \u001b[1;30m INFO\u001b[0m Running upgrade a20104116533 -> fa1760d35710, empty message\n",
+ "\u001b[32m2022-06-21 18:43:50\u001b[0m - \u001b[1;30m INFO\u001b[0m Running upgrade fa1760d35710 -> 9bbfdcf8bab0, empty message\n",
+ "\u001b[32m2022-06-21 18:43:50\u001b[0m - \u001b[1;30m INFO\u001b[0m Running upgrade 9bbfdcf8bab0 -> 4ae804cc0977, empty message\n",
+ "\u001b[32m2022-06-21 18:43:50\u001b[0m - \u001b[1;30m INFO\u001b[0m Running upgrade 4ae804cc0977 -> a98acf92fd48, add nuke triage function\n",
+ "\u001b[32m2022-06-21 18:43:50\u001b[0m - \u001b[1;30m INFO\u001b[0m Running upgrade a98acf92fd48 -> 45219f25072b, hash-partitioning predictions tables\n",
+ "\u001b[32m2022-06-21 18:43:50\u001b[0m - \u001b[1;30m INFO\u001b[0m PostgreSQL 11 or greater found (PostgreSQL 11): Using hash partitioning\n",
+ "\u001b[32m2022-06-21 18:43:51\u001b[0m - \u001b[1;30m INFO\u001b[0m Running upgrade 45219f25072b -> 1b990cbc04e4, empty message\n",
+ "\u001b[32m2022-06-21 18:43:51\u001b[0m - \u001b[1;30m INFO\u001b[0m Running upgrade 1b990cbc04e4 -> 264786a9fe85, add label_value to prodcution table\n",
+ "\u001b[32m2022-06-21 18:43:51\u001b[0m - \u001b[1;30m INFO\u001b[0m Running upgrade 264786a9fe85 -> ce5b50ffa8e2, Break ties in list predictions\n",
+ "\u001b[32m2022-06-21 18:43:51\u001b[0m - \u001b[1;30m INFO\u001b[0m Running upgrade ce5b50ffa8e2 -> 670289044eb2, Add production prediction metadata\n",
+ "\u001b[32m2022-06-21 18:43:51\u001b[0m - \u001b[1;30m INFO\u001b[0m Running upgrade 670289044eb2 -> cdd0dc9d9870, rename production schema and list_predcitons to triage_predcition and predictions \n",
+ "\u001b[32m2022-06-21 18:43:51\u001b[0m - \u001b[1;30m INFO\u001b[0m Running upgrade 45219f25072b -> b097e47ba829, empty message\n",
+ "\u001b[32m2022-06-21 18:43:51\u001b[0m - \u001b[1;30m INFO\u001b[0m Running upgrade b097e47ba829, cdd0dc9d9870 -> 079a74c15e8b, merge b097e47ba829 with cdd0dc9d9870\n",
+ "\u001b[32m2022-06-21 18:43:51\u001b[0m - \u001b[1;30m INFO\u001b[0m Running upgrade 079a74c15e8b -> 5dd2ba8222b1, add run_type\n",
+ "\u001b[32m2022-06-21 18:43:51\u001b[0m - \u001b[1;30mVERBOSE\u001b[0m \u001b[34mUsing random seed [1995] for running the experiment\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:43:51\u001b[0m - \u001b[1;30m NOTICE\u001b[0m \u001b[35mscoring.subsets missing in the configuration file or unrecognized. No subsets will be generated\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:43:52\u001b[0m - \u001b[1;30mSUCCESS\u001b[0m \u001b[1;32mExperiment validation ran to completion with no errors\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:43:52\u001b[0m - \u001b[1;30mVERBOSE\u001b[0m \u001b[34mComputed and stored temporal split definitions\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:43:52\u001b[0m - \u001b[1;30m INFO\u001b[0m Setting up cohort\n",
+ "\u001b[32m2022-06-21 18:43:57\u001b[0m - \u001b[1;30mSUCCESS\u001b[0m \u001b[1;32mCohort set up in the table cohort_all_entities_005a7918d4c2be39b7e923a84f33ded2 successfully\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:43:57\u001b[0m - \u001b[1;30m INFO\u001b[0m Setting up labels\n",
+ "\u001b[32m2022-06-21 18:44:10\u001b[0m - \u001b[1;30mSUCCESS\u001b[0m \u001b[1;32mLabels set up in the table labels_fully_funded_1feff6f9a63afed112773010f3dc4254 successfully \u001b[0m\n",
+ "\u001b[32m2022-06-21 18:44:10\u001b[0m - \u001b[1;30m INFO\u001b[0m Creating features tables (before imputation) \n",
+ "\u001b[32m2022-06-21 18:44:10\u001b[0m - \u001b[1;30m INFO\u001b[0m Creating collate aggregations\n",
+ "\u001b[32m2022-06-21 18:44:10\u001b[0m - \u001b[1;30mVERBOSE\u001b[0m \u001b[34mStarting Feature aggregation\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:44:10\u001b[0m - \u001b[1;30m NOTICE\u001b[0m \u001b[35mImputed feature table project_features_aggregation_imputed did not exist, need to build features\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:44:10\u001b[0m - \u001b[1;30m INFO\u001b[0m Processing query tasks with 2 processes\n",
+ "\u001b[32m2022-06-21 18:44:10\u001b[0m - \u001b[1;30m INFO\u001b[0m Processing features for project_features_entity_id\n",
+ "\u001b[32m2022-06-21 18:44:10\u001b[0m - \u001b[1;30m INFO\u001b[0m Beginning insert batch\n",
+ "\u001b[32m2022-06-21 18:44:10\u001b[0m - \u001b[1;30m INFO\u001b[0m Beginning insert batch\n",
+ "\u001b[32m2022-06-21 18:44:11\u001b[0m - \u001b[1;30m INFO\u001b[0m Beginning insert batch\n",
+ "\u001b[32m2022-06-21 18:44:11\u001b[0m - \u001b[1;30m INFO\u001b[0m Beginning insert batch\n",
+ "\u001b[32m2022-06-21 18:44:11\u001b[0m - \u001b[1;30m INFO\u001b[0m Beginning insert batch\n",
+ "\u001b[32m2022-06-21 18:44:11\u001b[0m - \u001b[1;30m INFO\u001b[0m Beginning insert batch\n",
+ "\u001b[32m2022-06-21 18:44:11\u001b[0m - \u001b[1;30m INFO\u001b[0m Beginning insert batch\n",
+ "\u001b[32m2022-06-21 18:44:11\u001b[0m - \u001b[1;30m INFO\u001b[0m Beginning insert batch\n",
+ "\u001b[32m2022-06-21 18:44:11\u001b[0m - \u001b[1;30m INFO\u001b[0m Beginning insert batch\n",
+ "\u001b[32m2022-06-21 18:44:11\u001b[0m - \u001b[1;30m INFO\u001b[0m Beginning insert batch\n",
+ "\u001b[32m2022-06-21 18:44:11\u001b[0m - \u001b[1;30m INFO\u001b[0m Beginning insert batch\n",
+ "\u001b[32m2022-06-21 18:44:11\u001b[0m - \u001b[1;30m INFO\u001b[0m Beginning insert batch\n",
+ "\u001b[32m2022-06-21 18:44:11\u001b[0m - \u001b[1;30m INFO\u001b[0m Beginning insert batch\n",
+ "\u001b[32m2022-06-21 18:44:11\u001b[0m - \u001b[1;30m INFO\u001b[0m Beginning insert batch\n",
+ "\u001b[32m2022-06-21 18:44:11\u001b[0m - \u001b[1;30m INFO\u001b[0m Beginning insert batch\n",
+ "\u001b[32m2022-06-21 18:44:11\u001b[0m - \u001b[1;30m INFO\u001b[0m Beginning insert batch\n",
+ "\u001b[32m2022-06-21 18:44:11\u001b[0m - \u001b[1;30m INFO\u001b[0m Beginning insert batch\n",
+ "\u001b[32m2022-06-21 18:44:11\u001b[0m - \u001b[1;30m INFO\u001b[0m Beginning insert batch\n",
+ "\u001b[32m2022-06-21 18:44:12\u001b[0m - \u001b[1;30m INFO\u001b[0m Beginning insert batch\n",
+ "\u001b[32m2022-06-21 18:44:12\u001b[0m - \u001b[1;30m INFO\u001b[0m Beginning insert batch\n",
+ "\u001b[32m2022-06-21 18:44:12\u001b[0m - \u001b[1;30m INFO\u001b[0m Beginning insert batch\n",
+ "\u001b[32m2022-06-21 18:44:12\u001b[0m - \u001b[1;30m INFO\u001b[0m Done. successes: 21, failures: 0\n",
+ "\u001b[32m2022-06-21 18:44:12\u001b[0m - \u001b[1;30m INFO\u001b[0m project_features_entity_id completed\n",
+ "\u001b[32m2022-06-21 18:44:12\u001b[0m - \u001b[1;30m INFO\u001b[0m Processing features for project_features_aggregation\n",
+ "\u001b[32m2022-06-21 18:44:24\u001b[0m - \u001b[1;30m INFO\u001b[0m Done. successes: 0, failures: 0\n",
+ "\u001b[32m2022-06-21 18:44:29\u001b[0m - \u001b[1;30m INFO\u001b[0m project_features_aggregation completed\n",
+ "\u001b[32m2022-06-21 18:44:29\u001b[0m - \u001b[1;30mSUCCESS\u001b[0m \u001b[1;32mFeatures (before imputation) were stored in the tables \"features\".\"project_features_aggregation\" successfully\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:44:29\u001b[0m - \u001b[1;30m INFO\u001b[0m Imputing missing values in features\n",
+ "\u001b[32m2022-06-21 18:44:29\u001b[0m - \u001b[1;30mVERBOSE\u001b[0m \u001b[34mStarting Feature imputation\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:44:29\u001b[0m - \u001b[1;30m INFO\u001b[0m Processing query tasks with 2 processes\n",
+ "\u001b[32m2022-06-21 18:44:29\u001b[0m - \u001b[1;30m INFO\u001b[0m Processing features for project_features_aggregation_imputed\n",
+ "\u001b[32m2022-06-21 18:44:29\u001b[0m - \u001b[1;30m INFO\u001b[0m Done. successes: 0, failures: 0\n",
+ "\u001b[32m2022-06-21 18:44:29\u001b[0m - \u001b[1;30m INFO\u001b[0m project_features_aggregation_imputed completed\n",
+ "\u001b[32m2022-06-21 18:44:29\u001b[0m - \u001b[1;30mSUCCESS\u001b[0m \u001b[1;32mImputed features were stored in the tables \"features\".\"project_features_aggregation_imputed\" successfully\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:44:29\u001b[0m - \u001b[1;30mVERBOSE\u001b[0m \u001b[34mFound 1 total feature subsets\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:44:29\u001b[0m - \u001b[1;30m INFO\u001b[0m Building matrices\n",
+ "\u001b[32m2022-06-21 18:44:29\u001b[0m - \u001b[1;30mVERBOSE\u001b[0m \u001b[34mIt is necessary to build 6 matrices\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:44:29\u001b[0m - \u001b[1;30m INFO\u001b[0m Starting parallel matrix building: 6 matrices, 2 processes\n",
+ "\u001b[32m2022-06-21 18:44:29\u001b[0m - \u001b[1;30m INFO\u001b[0m Matrix 6a751eabcf4722abda70f77c0d9d712d saved in /content/triage_output/matrices/6a751eabcf4722abda70f77c0d9d712d.csv.gz\n",
+ "\u001b[32m2022-06-21 18:44:29\u001b[0m - \u001b[1;30m INFO\u001b[0m Matrix 495afa5517735f1e336108cd7911b8aa saved in /content/triage_output/matrices/495afa5517735f1e336108cd7911b8aa.csv.gz\n",
+ "\u001b[32m2022-06-21 18:44:29\u001b[0m - \u001b[1;30m INFO\u001b[0m Matrix 051df0ba6431460b81bd18a25fea0d99 saved in /content/triage_output/matrices/051df0ba6431460b81bd18a25fea0d99.csv.gz\n",
+ "\u001b[32m2022-06-21 18:44:30\u001b[0m - \u001b[1;30m INFO\u001b[0m Matrix 67a0cc5dc9ab89cdf0b1ae3a7883b145 saved in /content/triage_output/matrices/67a0cc5dc9ab89cdf0b1ae3a7883b145.csv.gz\n",
+ "\u001b[32m2022-06-21 18:44:30\u001b[0m - \u001b[1;30m INFO\u001b[0m Matrix 10b581471b80ac2c5ca865e56be6cfe7 saved in /content/triage_output/matrices/10b581471b80ac2c5ca865e56be6cfe7.csv.gz\n",
+ "\u001b[32m2022-06-21 18:44:30\u001b[0m - \u001b[1;30m INFO\u001b[0m Matrix 363cae6e28d220afc10d2be99b01a09a saved in /content/triage_output/matrices/363cae6e28d220afc10d2be99b01a09a.csv.gz\n",
+ "\u001b[32m2022-06-21 18:44:30\u001b[0m - \u001b[1;30m INFO\u001b[0m Done. successes: 6, failures: 0\n",
+ "\u001b[32m2022-06-21 18:44:30\u001b[0m - \u001b[1;30mSUCCESS\u001b[0m \u001b[1;32mMatrices were stored in /content/triage_output/matrices successfully\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:44:30\u001b[0m - \u001b[1;30m INFO\u001b[0m Starting parallel subset creation: 0 subsets, 2 processes\n",
+ "\u001b[32m2022-06-21 18:44:30\u001b[0m - \u001b[1;30m INFO\u001b[0m Done. successes: 0, failures: 0\n",
+ "\u001b[32m2022-06-21 18:44:32\u001b[0m - \u001b[1;30mSUCCESS\u001b[0m \u001b[1;32mProtected groups stored in the table protected_groups_cf250342c293e834c6fa34f241461aef successfully\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:44:32\u001b[0m - \u001b[1;30mVERBOSE\u001b[0m \u001b[34mSplit train/test tasks into three task batches. - each batch has models from all splits\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:44:32\u001b[0m - \u001b[1;30mVERBOSE\u001b[0m \u001b[34mBatch 1: Baselines or simple classifiers (e.g. DecisionTree, SLR) (9 tasks total)\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:44:32\u001b[0m - \u001b[1;30mVERBOSE\u001b[0m \u001b[34mBatch 2: Heavyweight classifiers. (3 tasks total)\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:44:32\u001b[0m - \u001b[1;30mVERBOSE\u001b[0m \u001b[34mBatch 3: All classifiers not found in one of the other batches. (0 tasks total)\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:44:32\u001b[0m - \u001b[1;30m INFO\u001b[0m 4 models groups will be trained, tested and evaluated\n",
+ "\u001b[32m2022-06-21 18:44:32\u001b[0m - \u001b[1;30m INFO\u001b[0m Training, testing and evaluating models\n",
+ "\u001b[32m2022-06-21 18:44:32\u001b[0m - \u001b[1;30mVERBOSE\u001b[0m \u001b[34m3 train/test tasks found.\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:44:32\u001b[0m - \u001b[1;30m INFO\u001b[0m Starting parallelizable batch train/testing with 9 tasks, 2 processes\n",
+ "\u001b[32m2022-06-21 18:44:32\u001b[0m - \u001b[1;30mVERBOSE\u001b[0m \u001b[34mTraining sklearn.tree.DecisionTreeClassifier({'max_depth': 3, 'max_features': None, 'min_samples_split': 25}) [de38e91e633948e5a5b4198f520ed837] on train matrix 495afa5517735f1e336108cd7911b8aa\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:44:32\u001b[0m - \u001b[1;30mVERBOSE\u001b[0m \u001b[34mTraining triage.component.catwalk.estimators.classifiers.ScaledLogisticRegression({'C': 0.1, 'penalty': 'l1'}) [e1e70aac9d362ffc6416b38afab7dbe2] on train matrix 495afa5517735f1e336108cd7911b8aa\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:44:33\u001b[0m - \u001b[1;30m NOTICE\u001b[0m \u001b[35mYou got feature values that are out of the range: (0, 1). The feature values will cutoff to fit in the range (0, 1).\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:44:33\u001b[0m - \u001b[1;30m NOTICE\u001b[0m \u001b[35mModel 1, not found from previous runs. Adding the new model\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:44:33\u001b[0m - \u001b[1;30m NOTICE\u001b[0m \u001b[35mModel 2, not found from previous runs. Adding the new model\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:44:33\u001b[0m - \u001b[1;30mWARNING\u001b[0m \u001b[33mThe selected algorithm, doesn't support a standard way of calculate the importance of each feature used. Falling back to ad-hoc methods (e.g. in LogisticRegression we will return Odd Ratios instead coefficients)\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:44:33\u001b[0m - \u001b[1;30mSUCCESS\u001b[0m \u001b[1;32mTrained model id 1: sklearn.tree.DecisionTreeClassifier({'max_depth': 3, 'max_features': None, 'min_samples_split': 25}) [de38e91e633948e5a5b4198f520ed837] on train matrix 495afa5517735f1e336108cd7911b8aa. \u001b[0m\n",
+ "\u001b[32m2022-06-21 18:44:33\u001b[0m - \u001b[1;30mSUCCESS\u001b[0m \u001b[1;32mTrained model id 2: triage.component.catwalk.estimators.classifiers.ScaledLogisticRegression({'C': 0.1, 'penalty': 'l1'}) [e1e70aac9d362ffc6416b38afab7dbe2] on train matrix 495afa5517735f1e336108cd7911b8aa. \u001b[0m\n",
+ "\u001b[32m2022-06-21 18:44:33\u001b[0m - \u001b[1;30m NOTICE\u001b[0m \u001b[35mYou got feature values that are out of the range: (0, 1). The feature values will cutoff to fit in the range (0, 1).\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:44:40\u001b[0m - \u001b[1;30m INFO\u001b[0m NumExpr defaulting to 2 threads.\n",
+ "\u001b[32m2022-06-21 18:44:40\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "\u001b[32m2022-06-21 18:44:41\u001b[0m - \u001b[1;30m INFO\u001b[0m NumExpr defaulting to 2 threads.\n",
+ "\u001b[32m2022-06-21 18:44:41\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:44:41\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:44:41\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:44:42\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:44:42\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:44:42\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:44:42\u001b[0m - \u001b[1;30m INFO\u001b[0m Model 1 evaluation on test matrix 6a751eabcf4722abda70f77c0d9d712d completed.\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:44:43\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:44:43\u001b[0m - \u001b[1;30m INFO\u001b[0m Model 2 evaluation on test matrix 6a751eabcf4722abda70f77c0d9d712d completed.\n",
+ "\u001b[32m2022-06-21 18:44:43\u001b[0m - \u001b[1;30m NOTICE\u001b[0m \u001b[35mYou got feature values that are out of the range: (0, 1). The feature values will cutoff to fit in the range (0, 1).\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:44:52\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "\u001b[32m2022-06-21 18:44:53\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:44:53\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:44:53\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:44:53\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:44:54\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:44:54\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:44:54\u001b[0m - \u001b[1;30m INFO\u001b[0m Model 1 evaluation on train matrix 495afa5517735f1e336108cd7911b8aa completed.\n",
+ "\u001b[32m2022-06-21 18:44:54\u001b[0m - \u001b[1;30mVERBOSE\u001b[0m \u001b[34mTraining triage.component.catwalk.baselines.rankers.BaselineRankMultiFeature({'rules': [{'feature': 'project_features_entity_id_all_total_asking_price_sum', 'low_value_high_score': False}]}) [942fda8a13abbf322a3a78c8cdb7ba1a] on train matrix 495afa5517735f1e336108cd7911b8aa\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:44:54\u001b[0m - \u001b[1;30m NOTICE\u001b[0m \u001b[35mModel 3, not found from previous runs. Adding the new model\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:44:54\u001b[0m - \u001b[1;30mSUCCESS\u001b[0m \u001b[1;32mTrained model id 3: triage.component.catwalk.baselines.rankers.BaselineRankMultiFeature({'rules': [{'feature': 'project_features_entity_id_all_total_asking_price_sum', 'low_value_high_score': False}]}) [942fda8a13abbf322a3a78c8cdb7ba1a] on train matrix 495afa5517735f1e336108cd7911b8aa. \u001b[0m\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:44:54\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:44:55\u001b[0m - \u001b[1;30m INFO\u001b[0m Model 2 evaluation on train matrix 495afa5517735f1e336108cd7911b8aa completed.\n",
+ "\u001b[32m2022-06-21 18:44:55\u001b[0m - \u001b[1;30mVERBOSE\u001b[0m \u001b[34mTraining sklearn.tree.DecisionTreeClassifier({'max_depth': 3, 'max_features': None, 'min_samples_split': 25}) [468fc2d1aca51d1150deeb77b030c679] on train matrix 67a0cc5dc9ab89cdf0b1ae3a7883b145\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:44:55\u001b[0m - \u001b[1;30m NOTICE\u001b[0m \u001b[35mModel 4, not found from previous runs. Adding the new model\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:44:55\u001b[0m - \u001b[1;30mSUCCESS\u001b[0m \u001b[1;32mTrained model id 4: sklearn.tree.DecisionTreeClassifier({'max_depth': 3, 'max_features': None, 'min_samples_split': 25}) [468fc2d1aca51d1150deeb77b030c679] on train matrix 67a0cc5dc9ab89cdf0b1ae3a7883b145. \u001b[0m\n",
+ "\u001b[32m2022-06-21 18:44:55\u001b[0m - \u001b[1;30m INFO\u001b[0m NumExpr defaulting to 2 threads.\n",
+ "\u001b[32m2022-06-21 18:44:55\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:44:56\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:44:56\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:44:57\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:44:57\u001b[0m - \u001b[1;30m INFO\u001b[0m Model 3 evaluation on test matrix 6a751eabcf4722abda70f77c0d9d712d completed.\n",
+ "\u001b[32m2022-06-21 18:44:59\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:45:00\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:45:00\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:45:01\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:45:01\u001b[0m - \u001b[1;30m INFO\u001b[0m Model 3 evaluation on train matrix 495afa5517735f1e336108cd7911b8aa completed.\n",
+ "\u001b[32m2022-06-21 18:45:01\u001b[0m - \u001b[1;30mVERBOSE\u001b[0m \u001b[34mTraining triage.component.catwalk.estimators.classifiers.ScaledLogisticRegression({'C': 0.1, 'penalty': 'l1'}) [c2f98311493fd3b59c40358764835f19] on train matrix 67a0cc5dc9ab89cdf0b1ae3a7883b145\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:45:01\u001b[0m - \u001b[1;30m NOTICE\u001b[0m \u001b[35mModel 5, not found from previous runs. Adding the new model\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:45:01\u001b[0m - \u001b[1;30mWARNING\u001b[0m \u001b[33mThe selected algorithm, doesn't support a standard way of calculate the importance of each feature used. Falling back to ad-hoc methods (e.g. in LogisticRegression we will return Odd Ratios instead coefficients)\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:45:01\u001b[0m - \u001b[1;30mSUCCESS\u001b[0m \u001b[1;32mTrained model id 5: triage.component.catwalk.estimators.classifiers.ScaledLogisticRegression({'C': 0.1, 'penalty': 'l1'}) [c2f98311493fd3b59c40358764835f19] on train matrix 67a0cc5dc9ab89cdf0b1ae3a7883b145. \u001b[0m\n",
+ "\u001b[32m2022-06-21 18:45:01\u001b[0m - \u001b[1;30m NOTICE\u001b[0m \u001b[35mYou got feature values that are out of the range: (0, 1). The feature values will cutoff to fit in the range (0, 1).\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:45:05\u001b[0m - \u001b[1;30m INFO\u001b[0m NumExpr defaulting to 2 threads.\n",
+ "\u001b[32m2022-06-21 18:45:05\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:45:06\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:45:06\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:45:07\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:45:07\u001b[0m - \u001b[1;30m INFO\u001b[0m Model 4 evaluation on test matrix 051df0ba6431460b81bd18a25fea0d99 completed.\n",
+ "\u001b[32m2022-06-21 18:45:12\u001b[0m - \u001b[1;30m INFO\u001b[0m NumExpr defaulting to 2 threads.\n",
+ "\u001b[32m2022-06-21 18:45:12\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:45:13\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:45:13\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:45:14\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:45:14\u001b[0m - \u001b[1;30m INFO\u001b[0m Model 5 evaluation on test matrix 051df0ba6431460b81bd18a25fea0d99 completed.\n",
+ "\u001b[32m2022-06-21 18:45:20\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:45:21\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:45:21\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:45:22\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:45:22\u001b[0m - \u001b[1;30m INFO\u001b[0m Model 4 evaluation on train matrix 67a0cc5dc9ab89cdf0b1ae3a7883b145 completed.\n",
+ "\u001b[32m2022-06-21 18:45:22\u001b[0m - \u001b[1;30mVERBOSE\u001b[0m \u001b[34mTraining triage.component.catwalk.baselines.rankers.BaselineRankMultiFeature({'rules': [{'feature': 'project_features_entity_id_all_total_asking_price_sum', 'low_value_high_score': False}]}) [4c850b235b18c32436c2c9158d47316e] on train matrix 67a0cc5dc9ab89cdf0b1ae3a7883b145\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:45:22\u001b[0m - \u001b[1;30m NOTICE\u001b[0m \u001b[35mModel 6, not found from previous runs. Adding the new model\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:45:22\u001b[0m - \u001b[1;30mSUCCESS\u001b[0m \u001b[1;32mTrained model id 6: triage.component.catwalk.baselines.rankers.BaselineRankMultiFeature({'rules': [{'feature': 'project_features_entity_id_all_total_asking_price_sum', 'low_value_high_score': False}]}) [4c850b235b18c32436c2c9158d47316e] on train matrix 67a0cc5dc9ab89cdf0b1ae3a7883b145. \u001b[0m\n",
+ "\u001b[32m2022-06-21 18:45:24\u001b[0m - \u001b[1;30m INFO\u001b[0m NumExpr defaulting to 2 threads.\n",
+ "\u001b[32m2022-06-21 18:45:24\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:45:24\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:45:25\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:45:25\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:45:26\u001b[0m - \u001b[1;30m INFO\u001b[0m Model 6 evaluation on test matrix 051df0ba6431460b81bd18a25fea0d99 completed.\n",
+ "\u001b[32m2022-06-21 18:45:27\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:45:27\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:45:28\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "\u001b[32m2022-06-21 18:45:28\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:45:28\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:45:28\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:45:29\u001b[0m - \u001b[1;30m INFO\u001b[0m Model 5 evaluation on train matrix 67a0cc5dc9ab89cdf0b1ae3a7883b145 completed.\n",
+ "\u001b[32m2022-06-21 18:45:29\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "\u001b[32m2022-06-21 18:45:29\u001b[0m - \u001b[1;30mVERBOSE\u001b[0m \u001b[34mTraining sklearn.tree.DecisionTreeClassifier({'max_depth': 3, 'max_features': None, 'min_samples_split': 25}) [6e904c2e00f3ae0b31de9b98ca0c1584] on train matrix 363cae6e28d220afc10d2be99b01a09a\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:45:29\u001b[0m - \u001b[1;30m NOTICE\u001b[0m \u001b[35mModel 7, not found from previous runs. Adding the new model\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:45:29\u001b[0m - \u001b[1;30mSUCCESS\u001b[0m \u001b[1;32mTrained model id 7: sklearn.tree.DecisionTreeClassifier({'max_depth': 3, 'max_features': None, 'min_samples_split': 25}) [6e904c2e00f3ae0b31de9b98ca0c1584] on train matrix 363cae6e28d220afc10d2be99b01a09a. \u001b[0m\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:45:30\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:45:30\u001b[0m - \u001b[1;30m INFO\u001b[0m Model 6 evaluation on train matrix 67a0cc5dc9ab89cdf0b1ae3a7883b145 completed.\n",
+ "\u001b[32m2022-06-21 18:45:30\u001b[0m - \u001b[1;30mVERBOSE\u001b[0m \u001b[34mTraining triage.component.catwalk.estimators.classifiers.ScaledLogisticRegression({'C': 0.1, 'penalty': 'l1'}) [ee410a6a06ba9e5b6eea7f1794ed1b39] on train matrix 363cae6e28d220afc10d2be99b01a09a\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:45:30\u001b[0m - \u001b[1;30m NOTICE\u001b[0m \u001b[35mModel 8, not found from previous runs. Adding the new model\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:45:30\u001b[0m - \u001b[1;30mWARNING\u001b[0m \u001b[33mThe selected algorithm, doesn't support a standard way of calculate the importance of each feature used. Falling back to ad-hoc methods (e.g. in LogisticRegression we will return Odd Ratios instead coefficients)\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:45:30\u001b[0m - \u001b[1;30mSUCCESS\u001b[0m \u001b[1;32mTrained model id 8: triage.component.catwalk.estimators.classifiers.ScaledLogisticRegression({'C': 0.1, 'penalty': 'l1'}) [ee410a6a06ba9e5b6eea7f1794ed1b39] on train matrix 363cae6e28d220afc10d2be99b01a09a. \u001b[0m\n",
+ "\u001b[32m2022-06-21 18:45:37\u001b[0m - \u001b[1;30m INFO\u001b[0m NumExpr defaulting to 2 threads.\n",
+ "\u001b[32m2022-06-21 18:45:37\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:45:38\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:45:38\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "\u001b[32m2022-06-21 18:45:39\u001b[0m - \u001b[1;30m INFO\u001b[0m NumExpr defaulting to 2 threads.\n",
+ "\u001b[32m2022-06-21 18:45:39\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:45:39\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:45:39\u001b[0m - \u001b[1;30m INFO\u001b[0m Model 7 evaluation on test matrix 10b581471b80ac2c5ca865e56be6cfe7 completed.\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:45:40\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:45:40\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:45:41\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:45:41\u001b[0m - \u001b[1;30m INFO\u001b[0m Model 8 evaluation on test matrix 10b581471b80ac2c5ca865e56be6cfe7 completed.\n",
+ "\u001b[32m2022-06-21 18:45:54\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:45:54\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:45:55\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:45:55\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:45:56\u001b[0m - \u001b[1;30m INFO\u001b[0m Model 7 evaluation on train matrix 363cae6e28d220afc10d2be99b01a09a completed.\n",
+ "\u001b[32m2022-06-21 18:45:56\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "\u001b[32m2022-06-21 18:45:56\u001b[0m - \u001b[1;30mVERBOSE\u001b[0m \u001b[34mTraining triage.component.catwalk.baselines.rankers.BaselineRankMultiFeature({'rules': [{'feature': 'project_features_entity_id_all_total_asking_price_sum', 'low_value_high_score': False}]}) [cec877d278f1ea40c1677dbf2354a9b8] on train matrix 363cae6e28d220afc10d2be99b01a09a\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:45:56\u001b[0m - \u001b[1;30m NOTICE\u001b[0m \u001b[35mModel 9, not found from previous runs. Adding the new model\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:45:56\u001b[0m - \u001b[1;30mSUCCESS\u001b[0m \u001b[1;32mTrained model id 9: triage.component.catwalk.baselines.rankers.BaselineRankMultiFeature({'rules': [{'feature': 'project_features_entity_id_all_total_asking_price_sum', 'low_value_high_score': False}]}) [cec877d278f1ea40c1677dbf2354a9b8] on train matrix 363cae6e28d220afc10d2be99b01a09a. \u001b[0m\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:45:57\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:45:57\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "\u001b[32m2022-06-21 18:45:57\u001b[0m - \u001b[1;30m INFO\u001b[0m NumExpr defaulting to 2 threads.\n",
+ "\u001b[32m2022-06-21 18:45:57\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:45:58\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:45:58\u001b[0m - \u001b[1;30m INFO\u001b[0m Model 8 evaluation on train matrix 363cae6e28d220afc10d2be99b01a09a completed.\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:45:58\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:45:58\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:45:59\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:45:59\u001b[0m - \u001b[1;30m INFO\u001b[0m Model 9 evaluation on test matrix 10b581471b80ac2c5ca865e56be6cfe7 completed.\n",
+ "\u001b[32m2022-06-21 18:46:00\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:46:01\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:46:01\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:46:02\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:46:02\u001b[0m - \u001b[1;30m INFO\u001b[0m Model 9 evaluation on train matrix 363cae6e28d220afc10d2be99b01a09a completed.\n",
+ "\u001b[32m2022-06-21 18:46:02\u001b[0m - \u001b[1;30m INFO\u001b[0m Done. successes: 9, failures: 0\n",
+ "\u001b[32m2022-06-21 18:46:02\u001b[0m - \u001b[1;30m INFO\u001b[0m Starting parallelizable batch train/testing with 3 tasks, 1 processes\n",
+ "\u001b[32m2022-06-21 18:46:02\u001b[0m - \u001b[1;30mVERBOSE\u001b[0m \u001b[34mTraining sklearn.ensemble.RandomForestClassifier({'max_depth': 50, 'min_samples_split': 25, 'n_estimators': 150}) [b86731b91ced3187be3a876b23303cb8] on train matrix 495afa5517735f1e336108cd7911b8aa\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:46:02\u001b[0m - \u001b[1;30m NOTICE\u001b[0m \u001b[35mModel 10, not found from previous runs. Adding the new model\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:46:02\u001b[0m - \u001b[1;30mSUCCESS\u001b[0m \u001b[1;32mTrained model id 10: sklearn.ensemble.RandomForestClassifier({'max_depth': 50, 'min_samples_split': 25, 'n_estimators': 150}) [b86731b91ced3187be3a876b23303cb8] on train matrix 495afa5517735f1e336108cd7911b8aa. \u001b[0m\n",
+ "\u001b[32m2022-06-21 18:46:04\u001b[0m - \u001b[1;30m INFO\u001b[0m NumExpr defaulting to 2 threads.\n",
+ "\u001b[32m2022-06-21 18:46:04\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:46:04\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:46:04\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:46:05\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:46:05\u001b[0m - \u001b[1;30m INFO\u001b[0m Model 10 evaluation on test matrix 6a751eabcf4722abda70f77c0d9d712d completed.\n",
+ "\u001b[32m2022-06-21 18:46:06\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:46:06\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:46:07\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:46:07\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:46:07\u001b[0m - \u001b[1;30m INFO\u001b[0m Model 10 evaluation on train matrix 495afa5517735f1e336108cd7911b8aa completed.\n",
+ "\u001b[32m2022-06-21 18:46:07\u001b[0m - \u001b[1;30mVERBOSE\u001b[0m \u001b[34mTraining sklearn.ensemble.RandomForestClassifier({'max_depth': 50, 'min_samples_split': 25, 'n_estimators': 150}) [9c5d0d402e699c5622ae2af9660248de] on train matrix 67a0cc5dc9ab89cdf0b1ae3a7883b145\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:46:08\u001b[0m - \u001b[1;30m NOTICE\u001b[0m \u001b[35mModel 11, not found from previous runs. Adding the new model\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:46:08\u001b[0m - \u001b[1;30mSUCCESS\u001b[0m \u001b[1;32mTrained model id 11: sklearn.ensemble.RandomForestClassifier({'max_depth': 50, 'min_samples_split': 25, 'n_estimators': 150}) [9c5d0d402e699c5622ae2af9660248de] on train matrix 67a0cc5dc9ab89cdf0b1ae3a7883b145. \u001b[0m\n",
+ "\u001b[32m2022-06-21 18:46:09\u001b[0m - \u001b[1;30m INFO\u001b[0m NumExpr defaulting to 2 threads.\n",
+ "\u001b[32m2022-06-21 18:46:09\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:46:10\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:46:10\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:46:11\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:46:11\u001b[0m - \u001b[1;30m INFO\u001b[0m Model 11 evaluation on test matrix 051df0ba6431460b81bd18a25fea0d99 completed.\n",
+ "\u001b[32m2022-06-21 18:46:12\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:46:12\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:46:13\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:46:13\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:46:13\u001b[0m - \u001b[1;30m INFO\u001b[0m Model 11 evaluation on train matrix 67a0cc5dc9ab89cdf0b1ae3a7883b145 completed.\n",
+ "\u001b[32m2022-06-21 18:46:13\u001b[0m - \u001b[1;30mVERBOSE\u001b[0m \u001b[34mTraining sklearn.ensemble.RandomForestClassifier({'max_depth': 50, 'min_samples_split': 25, 'n_estimators': 150}) [e7e6927dd1de32f2249b42bb7acf8b51] on train matrix 363cae6e28d220afc10d2be99b01a09a\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:46:14\u001b[0m - \u001b[1;30m NOTICE\u001b[0m \u001b[35mModel 12, not found from previous runs. Adding the new model\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:46:14\u001b[0m - \u001b[1;30mSUCCESS\u001b[0m \u001b[1;32mTrained model id 12: sklearn.ensemble.RandomForestClassifier({'max_depth': 50, 'min_samples_split': 25, 'n_estimators': 150}) [e7e6927dd1de32f2249b42bb7acf8b51] on train matrix 363cae6e28d220afc10d2be99b01a09a. \u001b[0m\n",
+ "\u001b[32m2022-06-21 18:46:16\u001b[0m - \u001b[1;30m INFO\u001b[0m NumExpr defaulting to 2 threads.\n",
+ "\u001b[32m2022-06-21 18:46:16\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:46:16\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:46:16\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:46:17\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:46:17\u001b[0m - \u001b[1;30m INFO\u001b[0m Model 12 evaluation on test matrix 10b581471b80ac2c5ca865e56be6cfe7 completed.\n",
+ "\u001b[32m2022-06-21 18:46:18\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:46:19\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:46:19\u001b[0m - \u001b[1;30m INFO\u001b[0m getcrosstabs: attribute columns to perform crosstabs:teacher_prefix\n",
+ "get_disparity_predefined_group()\n",
+ "\u001b[32m2022-06-21 18:46:20\u001b[0m - \u001b[1;30m INFO\u001b[0m get_group_value_fairness...\n",
+ "\u001b[32m2022-06-21 18:46:20\u001b[0m - \u001b[1;30m INFO\u001b[0m Model 12 evaluation on train matrix 363cae6e28d220afc10d2be99b01a09a completed.\n",
+ "\u001b[32m2022-06-21 18:46:20\u001b[0m - \u001b[1;30m INFO\u001b[0m Done. successes: 3, failures: 0\n",
+ "\u001b[32m2022-06-21 18:46:20\u001b[0m - \u001b[1;30m INFO\u001b[0m Starting parallelizable batch train/testing with 0 tasks, 2 processes\n",
+ "\u001b[32m2022-06-21 18:46:20\u001b[0m - \u001b[1;30m INFO\u001b[0m Done. successes: 0, failures: 0\n",
+ "\u001b[32m2022-06-21 18:46:20\u001b[0m - \u001b[1;30mSUCCESS\u001b[0m \u001b[1;32mTraining, testing and evaluatiog models completed\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:46:20\u001b[0m - \u001b[1;30mSUCCESS\u001b[0m \u001b[1;32mAll matrices that were supposed to be build were built. Awesome!\u001b[0m\n",
+ "\u001b[32m2022-06-21 18:46:20\u001b[0m - \u001b[1;30mSUCCESS\u001b[0m \u001b[1;32mAll models that were supposed to be trained were trained. Awesome!\u001b[0m\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "sPXmtJ667osT"
+ },
+ "source": [
+ "## Checking the results\n",
+ "\n",
+ "Running `triage` will generate two types of outputs: objects stored to disk and results stored in the database.\n",
+ "\n",
+ "Two types of objects will be stored to disk in the `project_path` specified in creating the experiment object:\n",
+ "- The matrices used for model training and validation, stored as CSV files and associated metadata in yaml format.\n",
+ "- The trained model objects themselves, stored as `joblib` pickles, which can be loaded and applied to new data.\n",
+ "\n",
+ "In the database, `triage` will store results and metadata in several tables. Below is a very brief tour of the most important of these tables.\n",
+ "\n",
+ "In the **triage_metadata** schema, you'll find information about your run and the models that were created:\n",
+ "- `triage_metadata.triage_runs`: metadata about every time `triage` is run, identified by a `run_id`\n",
+ "- `triage_metadata.experiments`: configuration information for an experiment, identified by an `experiment_hash`. Note that a config file can be run multiple times, so a specific experiment might be associated with multiple `triage_runs` records. The `experiment_hash` can be linked to the `run_hash` in the `triage_runs` table where `run_type='experiment'`\n",
+ "- `triage_metadata.model_groups`: in `triage` a `model_group` represents a full specification of a model type, set of hyperparameters, set of features, and training set parameters\n",
+ "- `triage_metadata.models`: a `model` represents the application of a `model_group` to a given training set, yielding a set of trained parameters (such as the coefficients of a logistic regression, the splits of a decision tree, etc). The models are identified by both a `model_id` and `model_hash` and can be linked to their `model_group` via the `model_group_id`\n",
+ "- `triage_metadata.experiment_models`: the association between models and experiments (linking an `experiment_hash` to a `model_hash`)\n",
+ "\n",
+ "In the **test_results** schema, you'll find information about the validation performance of the models:\n",
+ "- `test_results.evaluations`: performance of each model on the metrics specified in the `scoring` section of your configuration file\n",
+ "- `test_results.predictions`: individual entity-level predicted scores from each model\n",
+ "- `test_results.prediction_metadata`: metadata associated with the predictions\n",
+ "- `test_results.aequitas`: performance of each model on the fairness metrics using the parameters specified in your `bias_audit_config`\n",
+ "\n",
+ "In the **train_results** schema, you'll find model performance on the training set, as well as feature importances:\n",
+ "- `train_results.evaluations`: similar to `test_results.evaluations` but for the training set (often may be overfit, but can be useful for debugging)\n",
+ "- `train_results.predictions`: similar to `test_results.predictions` but for the training set\n",
+ "- `train_results.prediction_metadata`: metadata associated with the predictions\n",
+ "- `train_results.feature_importances`: overall feature importances from model training, usining the built-in method for the classifier (if one exists)\n",
+ "\n",
+ "Finally, a few intermediate tables can be particularly useful for debugging:\n",
+ "- Tables containing your `cohort` and `label` will be generated in the `public` schema and identified by an associated hash that can be found in your logs.\n",
+ "- The `features` schema contains two types of useful tables: tables containing calculated features for each feature group and \"matrix\" tables that provide the mapping from each training/validation matrix to `(entity_id, as_of_date)` pairs. Note, however, that these tables may be overwritten if a new run is performed with different feature logic, cohort, or underlying data and should not be assumed to be persistant across runs.\n",
+ "\n",
+ "Let's take a quick look at some of these outputs to confirm that our models ran as expected. First, we'll need a database connection..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "CpfPG-nyq1Nk"
+ },
+ "source": [
+ "import yaml\n",
+ "from sqlalchemy.engine.url import URL\n",
+ "from triage.util.db import create_engine\n",
+ "import pandas as pd\n",
+ "\n",
+ "dbconfig = yaml.safe_load(database_yaml)\n",
+ "db_url = URL(\n",
+ " 'postgres',\n",
+ " host=dbconfig['host'],\n",
+ " username=dbconfig['user'],\n",
+ " database=dbconfig['db'],\n",
+ " password=dbconfig['pass'],\n",
+ " port=dbconfig['port'],\n",
+ " )\n",
+ "\n",
+ "db_engine = create_engine(db_url)"
+ ],
+ "execution_count": 13,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ajIxNg-S57e6"
+ },
+ "source": [
+ "`triage_metadata.model_groups` should contain four records (for each model type/hyperparameter combination specified in our grid), while `triage_metadata.models` should have twelve (each model group trained on the three validation splits):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 358
+ },
+ "id": "8Xjhe5S86V2H",
+ "outputId": "b33a0fb8-130d-4758-e43d-b6ac7431f592"
+ },
+ "source": [
+ "pd.read_sql('SELECT * FROM triage_metadata.model_groups;', db_engine)"
+ ],
+ "execution_count": 14,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ " model_group_id model_type \\\n",
+ "0 1 sklearn.ensemble.RandomForestClassifier \n",
+ "1 2 sklearn.tree.DecisionTreeClassifier \n",
+ "2 3 triage.component.catwalk.estimators.classifier... \n",
+ "3 4 triage.component.catwalk.baselines.rankers.Bas... \n",
+ "\n",
+ " hyperparameters \\\n",
+ "0 {'max_depth': 50, 'n_estimators': 150, 'min_sa... \n",
+ "1 {'max_depth': 3, 'max_features': None, 'min_sa... \n",
+ "2 {'C': 0.1, 'penalty': 'l1'} \n",
+ "3 {'rules': [{'feature': 'project_features_entit... \n",
+ "\n",
+ " feature_list \\\n",
+ "0 [project_features_entity_id_all_resource_type_... \n",
+ "1 [project_features_entity_id_all_resource_type_... \n",
+ "2 [project_features_entity_id_all_resource_type_... \n",
+ "3 [project_features_entity_id_all_resource_type_... \n",
+ "\n",
+ " model_config \n",
+ "0 {'state': 'active', 'label_name': 'fully_funde... \n",
+ "1 {'state': 'active', 'label_name': 'fully_funde... \n",
+ "2 {'state': 'active', 'label_name': 'fully_funde... \n",
+ "3 {'state': 'active', 'label_name': 'fully_funde... "
+            ]
+ },
+ "metadata": {},
+ "execution_count": 18
+ }
+ ]
+ },
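+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "As a quick sanity check (a sketch along the same lines), you can also confirm the expected twelve trained models:\n",
+        "\n",
+        "```python\n",
+        "pd.read_sql('SELECT COUNT(*) FROM triage_metadata.models;', db_engine)\n",
+        "```\n"
+      ]
+    },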
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "z_vwKbQo727Q"
+ },
+ "source": [
+ "Finally, if you need to work with the training/validation matrices generated by triage or the model objects themselves, you can find them in your project path (here, `triage_output`). Let's take a quick look..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "Tyx1QXJ3sAtQ",
+ "outputId": "17ef6319-142f-40df-ff76-e04497b8286f"
+ },
+ "source": [
+ "!ls triage_output/"
+ ],
+ "execution_count": 19,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "matrices trained_models\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "YnM6vADdvx-N",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "aeeebacd-3c3d-4abd-f54a-6919f8a0dba9"
+ },
+ "source": [
+ "!ls -la triage_output/matrices/\n"
+ ],
+ "execution_count": 20,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "total 128\n",
+ "drwxr-xr-x 2 root root 4096 Jun 21 18:44 .\n",
+ "drwxr-xr-x 4 root root 4096 Jun 21 18:44 ..\n",
+ "-rw-r--r-- 1 root root 10686 Jun 21 18:44 051df0ba6431460b81bd18a25fea0d99.csv.gz\n",
+ "-rw-r--r-- 1 root root 3391 Jun 21 18:44 051df0ba6431460b81bd18a25fea0d99.yaml\n",
+ "-rw-r--r-- 1 root root 6201 Jun 21 18:44 10b581471b80ac2c5ca865e56be6cfe7.csv.gz\n",
+ "-rw-r--r-- 1 root root 3343 Jun 21 18:44 10b581471b80ac2c5ca865e56be6cfe7.yaml\n",
+ "-rw-r--r-- 1 root root 23598 Jun 21 18:44 363cae6e28d220afc10d2be99b01a09a.csv.gz\n",
+ "-rw-r--r-- 1 root root 9979 Jun 21 18:44 363cae6e28d220afc10d2be99b01a09a.yaml\n",
+ "-rw-r--r-- 1 root root 9055 Jun 21 18:44 495afa5517735f1e336108cd7911b8aa.csv.gz\n",
+ "-rw-r--r-- 1 root root 4123 Jun 21 18:44 495afa5517735f1e336108cd7911b8aa.yaml\n",
+ "-rw-r--r-- 1 root root 17532 Jun 21 18:44 67a0cc5dc9ab89cdf0b1ae3a7883b145.csv.gz\n",
+ "-rw-r--r-- 1 root root 7027 Jun 21 18:44 67a0cc5dc9ab89cdf0b1ae3a7883b145.yaml\n",
+ "-rw-r--r-- 1 root root 3990 Jun 21 18:44 6a751eabcf4722abda70f77c0d9d712d.csv.gz\n",
+ "-rw-r--r-- 1 root root 3367 Jun 21 18:44 6a751eabcf4722abda70f77c0d9d712d.yaml\n"
+ ]
+ }
+ ]
+ },
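+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "As a quick sketch, you could also load these artifacts directly -- the matrices are gzipped CSVs (with yaml metadata alongside), and the trained models are `joblib` pickles:\n",
+        "\n",
+        "```python\n",
+        "import os\n",
+        "import joblib\n",
+        "import pandas as pd\n",
+        "\n",
+        "# read one of the stored train/validation matrices (hash taken from the listing above)\n",
+        "matrix = pd.read_csv('triage_output/matrices/051df0ba6431460b81bd18a25fea0d99.csv.gz')\n",
+        "\n",
+        "# load one of the trained model objects\n",
+        "model_dir = 'triage_output/trained_models'\n",
+        "model = joblib.load(os.path.join(model_dir, os.listdir(model_dir)[0]))\n",
+        "```\n"
+      ]
+    },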
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "PlJ017aVMhdS"
+ },
+ "source": [
+ "# clean up the database connection\n",
+ "db_engine.dispose()"
+ ],
+ "execution_count": 21,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "KFingR1B32rm"
+ },
+ "source": [
+ "## Model Selection\n",
+ "\n",
+ "`triage` includes a component called `audition` that can help you visualize your model results over time and narrow down your best-performing models. Here we'll provide a quick introduction, but you can find more depth in the [audition tutorial](https://github.com/dssg/triage/blob/master/src/triage/component/audition/Audition_Tutorial.ipynb) as well as the [audition documentation](https://dssg.github.io/triage/audition/audition_intro/) and [model selection concepts overview](https://dssg.github.io/triage/audition/model_selection/). In general, `audition` is best run using a notebook to iteratively explore and narrow down your models."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Csg-HDeiPDma"
+ },
+ "source": [
+ "### Audition Parameters\n",
+ "\n",
+ "To run `audition`, you'll need to specify a few parameters:\n",
+ "\n",
+ "`metric` and `parameter` together specify the evaluation metric of interest for your project. Note that these need to be calculated as part of the scoring section in your `triage` config and should match the values in the columns of the same name in `test_results.evaluations`\n",
+ "\n",
+ "The `run_hash` is an identifier for the run with your complete model grid that you want to evaluate -- the easiest way to find this is from the `triage_metadata.triage_runs` table. This will likely be the `run_hash` associated with the most recent record in that table, but you should be able to figure out which run you want to use from there."
+ ]
+ },
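+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "For instance, once the `conn` engine is created below, one way to find the `run_hash` (just a sketch) is to list the runs with `run_type = 'experiment'` and pick the most recent one:\n",
+        "\n",
+        "```python\n",
+        "pd.read_sql(\"SELECT * FROM triage_metadata.triage_runs WHERE run_type = 'experiment';\", conn)\n",
+        "```\n"
+      ]
+    },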
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "RW1vSrhQP4ww"
+ },
+ "source": [
+ "from triage.component.audition import Auditioner\n",
+ "from triage.component.audition.pre_audition import PreAudition\n",
+ "from triage.component.audition.rules_maker import SimpleRuleMaker, RandomGroupRuleMaker, create_selection_grid\n",
+ "\n",
+ "from matplotlib import pyplot as plt\n",
+ "plt.style.use('ggplot')\n",
+ "%matplotlib inline\n",
+ "\n",
+ "import yaml\n",
+ "from sqlalchemy.engine.url import URL\n",
+ "from triage.util.db import create_engine\n",
+ "\n",
+ "import logging\n",
+ "logging.basicConfig(level=logging.WARNING)\n",
+ "\n",
+ "import pandas as pd\n",
+ "pd.set_option('precision', 4)"
+ ],
+ "execution_count": 22,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Jwg62Co837C4"
+ },
+ "source": [
+ "metric = 'precision@'\n",
+ "parameter = '10_pct'\n",
+ "run_hash = '8440d33cdc0f4cb808d573a05b065d17'"
+ ],
+ "execution_count": 23,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "HyeFeaOEPLzk"
+ },
+ "source": [
+ "dbconfig = yaml.safe_load(database_yaml)\n",
+ "db_url = URL(\n",
+ " 'postgres',\n",
+ " host=dbconfig['host'],\n",
+ " username=dbconfig['user'],\n",
+ " database=dbconfig['db'],\n",
+ " password=dbconfig['pass'],\n",
+ " port=dbconfig['port'],\n",
+ " )\n",
+ "\n",
+ "conn = create_engine(db_url)"
+ ],
+ "execution_count": 24,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Sq9qO4-_QPJl"
+ },
+ "source": [
+ "# table where audition results will be stored\n",
+ "best_dist_table = 'audition_best_dist'"
+ ],
+ "execution_count": 25,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ZVRftIpfQZxs"
+ },
+ "source": [
+ "### Pre-Audition: Models and Temporal Splits\n",
+ "\n",
+ "Because you may have run several experiments as you iterate, explore, and debug, `audition` needs to know which set of model groups and temporal validation splits to focus on for model selection. While you can specify these directly, `triage` also provides some `pre-audition` tools to help define these.\n",
+ "\n",
+ "For example, `get_model_groups_from_experiment()` and `get_train_end_times()` (note that this will return the `train_end_times` associated with the set of model groups returned by one of the `get_model_groups` methods, so those should be run first). Note that the `baseline_model_types` parameter in the constructor is optional and can be used to identify model groups as baselines rather than candidates for model selection."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "5NmsPZT7Rx_J"
+ },
+ "source": [
+ "pre_aud = PreAudition(\n",
+ " conn, \n",
+ " baseline_model_types=[\n",
+ " 'sklearn.dummy.DummyClassifier',\n",
+ " 'triage.component.catwalk.baselines.rankers.BaselineRankMultiFeature',\n",
+ " 'triage.component.catwalk.baselines.thresholders.SimpleThresholder'\n",
+ " ]\n",
+ ")\n",
+ "\n",
+ "# select model groups by experiment hash id\n",
+ "model_groups = pre_aud.get_model_groups_from_experiment(run_hash)\n",
+ "\n",
+ "# Note that this will find train_end_times associated with the model groups defined above\n",
+ "end_times = pre_aud.get_train_end_times(after='1900-01-01')"
+ ],
+ "execution_count": 26,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "rLpWTrG_SM88"
+ },
+ "source": [
+ "`get_model_groups_from_experiment()` returns a dictionary with keys `model_groups` and `baseline_model_groups`.\n",
+ "\n",
+ "How many of each did we get?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "zFqr_IEBSbui",
+ "outputId": "a1010b01-12f8-47bd-d240-731df28226d1"
+ },
+ "source": [
+ "# Number of non-baseline model groups:\n",
+ "print(len(model_groups['model_groups']))"
+ ],
+ "execution_count": 27,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "3\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "HLuZIZr9SfiY",
+ "outputId": "78a69415-dd7b-487b-e434-187121ef4c94"
+ },
+ "source": [
+ "# Number of baseline model groups:\n",
+ "print(len(model_groups['baseline_model_groups']))"
+ ],
+ "execution_count": 28,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "1\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "AOYgNVIGSkl2"
+ },
+ "source": [
+ "`get_train_end_times()` returns a list of `train_end_times`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "YSVYkQNzSqJR",
+ "outputId": "73f7804e-4651-4ee1-f90b-bd1acb480927"
+ },
+ "source": [
+ "end_times"
+ ],
+ "execution_count": 29,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "[Timestamp('2012-04-01 00:00:00'),\n",
+ " Timestamp('2012-08-01 00:00:00'),\n",
+ " Timestamp('2012-12-01 00:00:00')]"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 29
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "fHeaYey0S3mr"
+ },
+ "source": [
+ "### Setting Up Your Auditioner\n",
+ "\n",
+ "`Auditioner` is the main API to do the rules selection and model groups selection. It filters model groups using a two-step process.\n",
+ "\n",
+ "- Broad thresholds to filter out truly bad models\n",
+ "- A selection rule grid to find the best model groups over time for each of a variety of methods\n",
+ "\n",
+ "Note that model groups that don't have a full set of `train_end_time` splits associated with them will be excluded from the analysis, so **it's important to ensure that all model groups have been completed across all train/test splits**\n",
+ "\n",
+ "When we set up our auditioner object, we need to give it a database connection, the model groups to consider (and optionally baseline model groups), train_end_times, and tell it how we're going to filter the models. Note that the `initial_metric_filters` parameter specified below tells `Auditioner` what metric and parameter we'll be using and starts off without any initial filtering constraints (which is what you'll typically want):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "HRjTq_cITqv1"
+ },
+ "source": [
+ "aud = Auditioner(\n",
+ " db_engine = conn,\n",
+ " model_group_ids = model_groups['model_groups'],\n",
+ " train_end_times = end_times,\n",
+ " initial_metric_filters = [{'metric': metric, 'parameter': parameter, 'max_from_best': 1.0, 'threshold_value': 0.0}],\n",
+ " distance_table = best_dist_table,\n",
+ " baseline_model_group_ids = model_groups['baseline_model_groups'] # optional\n",
+ ")"
+ ],
+ "execution_count": 30,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "vnYukR-2WJYI"
+ },
+ "source": [
+ "### Using Audition for Model Selection\n",
+ "\n",
+ "We can use the `plot_model_groups` method to visualize the performance of our model groups over temporal split (note that the plot may take a few minutes to generate). When this method is called, it applies the metric filters specificied to get rid of really bad model groups with respect the metric of interest. A model group is discarded if:\n",
+ "\n",
+ "- It’s never close to the “best” model (based on the `max_from_best` filter) or\n",
+ "- If it’s metric is below a certain number (based on the `threshold_value` filter) at least once\n",
+ "\n",
+ "As a starting point, we don't filter out any models, but can iteratively narrow our grid by refining these filters. Let's take a look at our models:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 811
+ },
+ "id": "x-FZ-8JVYbvl",
+ "outputId": "12912ea9-b213-46f3-fa67-cef7ab7cb43f"
+ },
+ "source": [
+ "aud.plot_model_groups()"
+ ],
+ "execution_count": 31,
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ "