[WIP] Added automl workflows #333

Draft
wants to merge 47 commits into
base: master
47 commits
093f764
Added automl workflows
myui Jun 9, 2022
799f99e
Added eda workflow
myui Jun 13, 2022
c4bc7ea
Fixed table name
myui Jun 13, 2022
cc3993f
Fixed EDA workflow to load sample datasets
myui Jun 13, 2022
d316f7b
Revised options
myui Jun 17, 2022
57922be
Updated comments
myui Jun 18, 2022
cdf77a2
Revised options
myui Jun 23, 2022
44eb67c
Copyed from ml_experiment.dig
myui Jul 1, 2022
422683d
Added a missing file
myui Jul 1, 2022
41d05ff
Added parameterized automl workflow
myui Jul 1, 2022
250c32d
td.database is required
myui Jul 1, 2022
61c6b8e
Fixed var ref
myui Jul 1, 2022
03a6963
Fixed to properly create output database if missing
myui Jul 7, 2022
4ea4661
Minor comment format change
myui Jul 7, 2022
b0f5a87
Fixed td_ddl
myui Jul 8, 2022
3753711
Add a workaround for input_table is required
myui Jul 12, 2022
2faf3d3
Added NBA and network_analysis notebook sample workflows
myui Dec 1, 2022
8d47420
Added timeseries forecasting example workflow
myui Feb 7, 2023
612fe5e
Set default time_limit
myui Feb 7, 2023
cc32839
Add shepley workflow
myui Feb 9, 2023
eae7002
Add experimental MTA workflow
myui Feb 10, 2023
74892f4
Added a new option
myui Feb 16, 2023
db9a098
Added shared_model option
myui May 18, 2023
6261a34
Added missing '
myui May 18, 2023
dcd73ad
Revised to record AUC
myui May 18, 2023
e8f28e5
Fixed y is missing
myui May 18, 2023
a5ffeaa
Fixed a bug
myui May 18, 2023
6d52e9b
Fixed a bug
myui May 18, 2023
e95cba4
Added vehicle coupon workflow to demonstrate adding an attribute tabl…
myui May 25, 2023
f21dd81
Updated audience script
myui May 26, 2023
711718c
Revised a workflow
myui May 26, 2023
f849caf
Changed parameters to accept multiple attribute columns
myui May 26, 2023
ac28fad
Added two variations for adding attribute to CDP master segment
myui May 27, 2023
03ac633
Added an example to add next_action to CDP master segment
myui May 28, 2023
cae911e
Removed branch setting
myui May 30, 2023
d93474f
Fixed a typo
myui Jun 15, 2023
edb33dc
Fixed a typo
myui Jun 15, 2023
35f343c
Removed to record test table name
myui Jun 15, 2023
b24b27d
Fixed test_table value
myui Jun 15, 2023
2e80b31
Added a drift detection example
myui Jun 15, 2023
b6f758e
Fixed a bug in cdp endpoints
myui Jul 10, 2023
7192e85
Added rfm workflow
myui Aug 3, 2023
7b8065c
Revised not to use custom script
myui Aug 3, 2023
04eb204
Added clustering example
myui Sep 26, 2023
44f72f4
Added CLTV notebook
myui Dec 4, 2023
6d45d53
Added branch option
myui Dec 5, 2023
708c760
Removed branch
myui Dec 18, 2023
1 change: 1 addition & 0 deletions machine-learning-box/automl/.ruby-version
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
2.6.3
15 changes: 15 additions & 0 deletions machine-learning-box/automl/README.md
@@ -0,0 +1,15 @@
## How to use

Example workflows for the AutoML operator.

Note: This feature is still in beta and is available only to a limited set of customers.


```sh
# Push project
$ td -c ~/.td/td.conf wf push <project_name> --project .

# The automl operator requires the td.apikey secret to be set.

$ td -c ~/.td/td.conf wf secrets --project <project_name> --set td.apikey
```
25 changes: 25 additions & 0 deletions machine-learning-box/automl/cltv.dig
@@ -0,0 +1,25 @@
_export:
  !include : config/params.yaml
  td:
    engine: presto
    database: ${output_database}

+create_db_tbl_if_not_exists:
  td_ddl>:
  create_databases: ["${output_database}"]

+load_datasets:
  ipynb>:
    notebook: ml_datasets
    output_database: ${input_database}
    datasets: online_retail

+run_cltv:
  ipynb>:
    notebook: CLTV
    input_table: ${input_database}.online_retail_txn
    output_table: ${output_database}.online_retail_cltv_result
    user_column: customerid
    tstamp_column: invoicedate
    amount_column: purchaseamount
    audience_name: online_retail_cltv
23 changes: 23 additions & 0 deletions machine-learning-box/automl/clustering.dig
@@ -0,0 +1,23 @@
_export:
  !include : config/params.yaml
  td:
    engine: presto
    database: ${output_database}

+create_db_tbl_if_not_exists:
  td_ddl>:
  create_databases: ["${output_database}"]

+load_datasets:
  ipynb>:
    notebook: ml_datasets
    output_database: ml_datasets
    datasets: dermatology

+clustering_gluon_new_model:
  ipynb>:
    notebook: clustering
    input_table: ml_datasets.dermatology
    output_table: ${output_database}.dermatology_clusters_${session_id}
    export_feature_importance: ${output_database}.feature_importance_${session_id}
    export_shap_values: ${output_database}.shap_values_${session_id}
10 changes: 10 additions & 0 deletions machine-learning-box/automl/config/params.yaml
@@ -0,0 +1,10 @@
input_database: ml_datasets
output_database: automl_test

train_data_table: gluon_train
target_column: class
test_data_table: gluon_test

fit_time_limit: 60 * 3 # fit timeout in sec. 3 min just for demo. Default: 60 * 60 (1hr).

drift_auc_threshold: 0.93
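
Note on `fit_time_limit: 60 * 3`: YAML reads this value as the plain string `60 * 3`, not the integer 180, so whatever consumes the parameter has to evaluate the arithmetic itself. How the AutoML operator does this is not shown in this diff. The following is only a minimal sketch of one safe way to evaluate such a value; the function name `eval_seconds` is hypothetical and not part of any Treasure Data API.

```python
import ast
import operator

# Only plain integer arithmetic is allowed; anything else is rejected.
_ALLOWED_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.FloorDiv: operator.floordiv,
}


def eval_seconds(expr):
    """Evaluate a small integer expression such as '60 * 3' without eval()."""

    def walk(node):
        # Accept integer literals (but not booleans, which are also ints).
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, int)
            and not isinstance(node.value, bool)
        ):
            return node.value
        # Accept binary operations whose operator is whitelisted.
        if isinstance(node, ast.BinOp) and type(node.op) in _ALLOWED_OPS:
            return _ALLOWED_OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"unsupported expression: {expr!r}")

    return walk(ast.parse(str(expr).strip(), mode="eval").body)


print(eval_seconds("60 * 3"))   # 180
print(eval_seconds("60 * 60"))  # 3600
```

Walking the parsed syntax tree instead of calling `eval()` keeps arbitrary code in a config value from being executed.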
27 changes: 27 additions & 0 deletions machine-learning-box/automl/eda.dig
@@ -0,0 +1,27 @@
timezone: Asia/Tokyo
#timezone: PST

+load_datasets:
  ipynb>:
    notebook: ml_datasets
    output_database: ml_datasets
    datasets: all
    # datasets: gluon, bank_marketing, vehicle_coupon, online_retail, telco_churn, boston_house

+datasets:
  for_each>:
    table: [gluon_train, bank_marketing_train, vehicle_coupon_train, online_retail_ltv_train, telco_churn_train, boston_house_train]
  _parallel:
    limit: 3
  _do:
    +run_eda:
      ipynb>:
        docker:
          task_mem: 128g
        notebook: EDA
        input_table: ml_datasets.${table}
        # The following options are optional
        eda: all
        # eda: pandas-profiling, sweetviz
        # target_column: label
        sampling_threshold: 1000000
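
Note on the `for_each>` loop with `_parallel: limit: 3`: the workflow runs the EDA task once per table, with at most three tasks in flight at a time. As a rough analogy only (this is not how Digdag schedules tasks internally), the same bounded fan-out can be sketched with a thread pool. `run_eda` below is a hypothetical stand-in for the notebook task and only returns the table it would read.

```python
from concurrent.futures import ThreadPoolExecutor

TABLES = [
    "gluon_train",
    "bank_marketing_train",
    "vehicle_coupon_train",
    "online_retail_ltv_train",
    "telco_churn_train",
    "boston_house_train",
]


def run_eda(table):
    # Hypothetical placeholder: a real task would execute the EDA notebook.
    return f"ml_datasets.{table}"


def run_all(tables, limit=3):
    # max_workers caps concurrency, mirroring `_parallel: limit: 3`.
    with ThreadPoolExecutor(max_workers=limit) as pool:
        # pool.map preserves input order in its results.
        return list(pool.map(run_eda, tables))


if __name__ == "__main__":
    for input_table in run_all(TABLES):
        print(input_table)
```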
13 changes: 13 additions & 0 deletions machine-learning-box/automl/ml_datasets.dig
@@ -0,0 +1,13 @@
timezone: Asia/Tokyo
#timezone: PST

_export:
  td:
    engine: presto

+load_datasets:
  ipynb>:
    notebook: ml_datasets
    output_database: ml_datasets
    datasets: all
    # datasets: gluon, bank_marketing
71 changes: 71 additions & 0 deletions machine-learning-box/automl/ml_experiment.dig
@@ -0,0 +1,71 @@
_export:
  !include : config/params.yaml
  td:
    engine: presto
    database: ${output_database}

+create_db_tbl_if_not_exists:
  td_ddl>:
  create_databases: ["${output_database}"]
  create_tables: ["automl_experiments", "automl_eval_results"]

+train:
  ml_train>:
    docker:
      task_mem: 128g # 64g/128g/256g/384g/512g
    notebook: gluon_train
    model_name: gluon_model_${session_id}
    input_table: ${input_database}.${train_data_table}
    target_column: ${target_column}
    time_limit: ${fit_time_limit}
    share_model: true
    export_leaderboard: ${output_database}.leaderboard_${train_data_table}
    export_feature_importance: ${output_database}.feature_importance_${train_data_table}

+track_experiment:
  td>: queries/track_experiment.sql
  insert_into: ${output_database}.automl_experiments
  last_executed_notebook: ${automl.last_executed_notebook}
  user_id: ${automl.last_executed_user_id}
  user_email: ${automl.last_executed_user_email}
  model_name: gluon_model_${session_id}
  shared_model: ${automl.shared_model}
  task_attempt_id: ${attempt_id}
  session_time: ${session_local_time}
  engine: presto

# Note: If input_table contains target labels, ml_predict shows evaluation results
+predict:
  ml_predict>:
    docker:
      task_mem: 64g # 64g/128g/256g/384g/512g
    notebook: gluon_predict
    model_name: gluon_model_${session_id}
    input_table: ${input_database}.${test_data_table}
    output_table: ${output_database}.predicted_${test_data_table}_${session_id}

+evaluation:
  td>: queries/auc.sql
  table: ${output_database}.predicted_${test_data_table}_${session_id}
  target_column: ${target_column}
  positive_class: ' >50K'
  store_last_results: true
  engine: hive

+alert_if_drift_detected:
  if>: ${td.last_results.auc < drift_auc_threshold}
  _do:
    mail>:
      data: Drift detected in model performance. AUC was ${td.last_results.auc}.
    subject: Drift detected
    to: [[email protected]]
    # bcc: [[email protected],[email protected]]

+record_evaluation:
  td>: queries/record_evaluation.sql
  insert_into: ${output_database}.automl_eval_results
  engine: presto
  model_name: gluon_model_${session_id}
  test_table: ${input_database}.${test_data_table}
  session_time: ${session_local_time}
  auc: ${td.last_results.auc}
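
Note on the drift check: the workflow computes an AUC in `queries/auc.sql` (not shown in this diff) and sends mail when it falls below `drift_auc_threshold`. The sketch below illustrates the same decision in Python, assuming the standard pairwise definition of AUC; it is not a port of the SQL, and the function names `roc_auc` and `drift_detected` are hypothetical.

```python
def roc_auc(labels, scores, positive_class):
    """AUC as the probability that a random positive example is scored
    above a random negative one; ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == positive_class]
    neg = [s for y, s in zip(labels, scores) if y != positive_class]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # O(len(pos) * len(neg)) pairwise comparison: fine for a small sample.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def drift_detected(auc, threshold=0.93):
    # Mirrors `if>: ${td.last_results.auc < drift_auc_threshold}`.
    return auc < threshold


labels = [" >50K", " <=50K", " >50K", " <=50K"]
scores = [0.9, 0.1, 0.4, 0.6]
auc = roc_auc(labels, scores, positive_class=" >50K")
print(auc, drift_detected(auc))  # 0.75 True
```

The positive class keeps its leading space (`' >50K'`) because the workflow passes it that way; a label comparison that strips whitespace on one side only would silently find no positives.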
67 changes: 67 additions & 0 deletions machine-learning-box/automl/ml_experiment_demo.dig
@@ -0,0 +1,67 @@
timezone: Asia/Tokyo
#timezone: PST

_export:
  !include : config/params.yaml
  td:
    engine: presto
    database: ${output_database}

+create_db_tbl_if_not_exists:
  td_ddl>:
  create_databases: ["${output_database}"]
  create_tables: ["${expr_tracking_table}"]

+load_datasets:
  ipynb>:
    notebook: ml_datasets
    output_database: ${input_database}
    input_table: ${input_database}.dummy
    # datasets: gluon, bank_marketing
    datasets: gluon

+gluon_train:
  ml_train>:
    notebook: gluon_train
    model_name: gluon_model_${session_id}
    input_table: ${input_database}.gluon_train # expects database_name.table_name
    target_column: class
    # The following options are optional
    #problem_type: binary # 'binary', 'multiclass', 'regression', or 'quantile'. AutoGluon automatically detects the problem type.
    #eval_metric: roc_auc # AutoGluon automatically selects a suitable eval_metric for the given setting if not specified.
    ignore_columns: time,rowid # Note: the time column is ignored by default.
    time_limit: 60 * 3 # Fit timeout: 3 min here to keep training short. Default: 60 * 60 (1 hr). 1 hr or more is recommended for production (24 hours at most). Note this is a soft limit, not a hard limit.
    # timeout: 60 * 3 # Timeout for notebook execution. This is a hard limit applied per cell. No timeout if not specified.
    export_leaderboard: ${output_database}.leaderboard_gluon_train
    export_feature_importance: ${output_database}.feature_importance_gluon_train
    # hide_table_contents: true

+print_train_result:
  echo>: "executed ${automl.last_executed_notebook}.ipynb"

+track_experiment:
  td>: queries/track_experiment.sql
  insert_into: automl_experiments
  last_executed_notebook: ${automl.last_executed_notebook}
  user_id: ${automl.last_executed_user_id}
  user_email: ${automl.last_executed_user_email}
  model_name: gluon_model_${session_id}
  task_attempt_id: ${attempt_id}
  session_time: ${session_local_time}
  engine: presto

+gluon_predict:
  ml_predict>:
    notebook: gluon_predict
    model_name: gluon_model_${session_id}
    input_table: ${input_database}.gluon_test # expects database_name.table_name
    output_table: ${output_database}.gluon_predicted # expects database_name.table_name. The database is created if it does not exist; the table is overwritten.
    # optional
    #rowid_column: rowid # Note: when rowid_column is specified, only the rowid column and the prediction result columns are written to the output table
    #ignore_columns: time # target column should not be in test data
    export_leaderboard: ${output_database}.leaderboard_gluon_predict
    export_feature_importance: ${output_database}.feature_importance_gluon_predict
    # hide_table_contents: true

+print_predict_result:
  echo>: "executed ${automl.last_executed_notebook}.ipynb"
41 changes: 41 additions & 0 deletions machine-learning-box/automl/mta.dig
@@ -0,0 +1,41 @@
#timezone: Asia/Tokyo
#timezone: PST

_export:
  !include : config/params.yaml
  td:
    engine: presto
    database: sample_datasets # dummy to avoid error on create_databases
  output_db: ml_test

+create_db_tbl_if_not_exists:
  td_ddl>:
  create_databases: ["ml_datasets", "${output_db}"]

+load_datasets:
  ipynb>:
    docker:
      task_mem: 64g
    notebook: ml_datasets
    output_database: ml_datasets
    datasets: mta

+run_mta:
  ipynb>:
    docker:
      task_mem: 128g # 64g/128g/256g/384g/512g
    notebook: MTA
    # required param
    input_table: ml_datasets.mta
    # optional param
    tstamp_column: tstamp
    user_column: user
    channel_column: channel
    conversion_column: conversion
    # optional columns (usually not needed)
    analyze_topk_channels: 50
    ignore_channels: Facebook
    overwrite_channel: Direct
    export_channel_interactions: ${output_db}.channel_interactions
    export_shapley_attributions: ${output_db}.shapley_attributions
    export_attributed_conversions: ${output_db}.attributed_conversions
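
Note on `export_shapley_attributions`: the option name suggests the MTA notebook attributes conversions to channels with Shapley values, but the notebook itself is not part of this diff. The sketch below shows the textbook exact Shapley computation under one common assumption, namely that a coalition of channels is worth the conversions of every journey that used only channels in that coalition. The function name and the sample journeys are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_attribution(journeys):
    """journeys: list of (set_of_channels, conversions).
    Returns {channel: attributed conversions} using exact Shapley values."""
    channels = sorted(set().union(*(chs for chs, _ in journeys)))
    n = len(channels)

    def value(coalition):
        # Assumed worth of a coalition: conversions from journeys that
        # touched only channels inside it.
        return sum(conv for chs, conv in journeys if chs <= coalition)

    attribution = {}
    for ch in channels:
        others = [c for c in channels if c != ch]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k in the Shapley formula.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {ch}) - value(s))
        attribution[ch] = total
    return attribution


journeys = [({"Email"}, 10), ({"Search"}, 20), ({"Email", "Search"}, 30)]
print(shapley_attribution(journeys))  # {'Email': 25.0, 'Search': 35.0}
```

The attributions sum to the 60 total conversions. Exact enumeration grows exponentially with the number of channels, which may be one reason the workflow exposes `analyze_topk_channels`.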
51 changes: 51 additions & 0 deletions machine-learning-box/automl/nba.dig
@@ -0,0 +1,51 @@
_export:
  !include : config/params.yaml
  td:
    engine: presto
    database: ${output_database}

+create_db_tbl_if_not_exists:
  td_ddl>:
  create_databases: ["${output_database}"]

+load_datasets:
  ipynb>:
    notebook: ml_datasets
    output_database: ml_datasets
    datasets: nba

+nba_only_qtable:
  ipynb>:
    notebook: NBA
    train_table: ml_datasets.nba_train
    # optional
    export_q_table: ${output_database}.rl_qtable_${session_id}
    export_state_action: ${output_database}.rl_state_action_${session_id}

+nba_with_eval:
  ipynb>:
    notebook: NBA
    train_table: ml_datasets.nba_train
    test_table: ml_datasets.nba_test
    budget: 10000
    value_per_cv: 100
    # optional
    # export_q_table: ${output_database}.rl_qtable_${session_id}
    export_channel_ratio: ${output_database}.rl_channel_ratio_${session_id}
    export_predictions: ${output_database}.rl_predictions_${session_id}
    export_model_performance: ${output_database}.rl_model_performance_${session_id}
    ignore_actions: client_domain_organic_visit, organic_search
    action_cost: |
      {
        "display": 2,
        "social-social": 1.4,
        "social": 2,
        "social-paid": 5,
        "organic_search": 1,
        "email": 3.2,
        "cpc": 3,
        "referral": 2,
        "linkedin": 3,
        "search-paid": 2,
        "twitter": 1
      }
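
Note on `export_q_table` and `action_cost`: the option names suggest the NBA notebook learns a Q-table, that is, an estimated value for each (state, action) pair, and takes per-action costs into account. The notebook is not part of this diff, so the following is only a sketch of generic tabular Q-learning over logged transitions with the action cost subtracted from the reward. The function name, the transition format, the state names, and the hyperparameters are all assumptions.

```python
def fit_q_table(transitions, action_cost, alpha=0.5, gamma=0.9, epochs=50):
    """transitions: list of (state, action, reward, next_state);
    next_state is None when the journey ends.
    Returns {(state, action): estimated value}."""
    actions = sorted({action for _, action, _, _ in transitions})
    q = {}
    for _ in range(epochs):
        for state, action, reward, next_state in transitions:
            # Assumed reward shaping: conversion value minus the action's cost.
            net_reward = reward - action_cost.get(action, 0.0)
            if next_state is None:
                best_next = 0.0
            else:
                best_next = max(q.get((next_state, a), 0.0) for a in actions)
            old = q.get((state, action), 0.0)
            # Standard Q-learning update toward the bootstrapped target.
            q[(state, action)] = old + alpha * (net_reward + gamma * best_next - old)
    return q


# Hypothetical two-step journey: an email, then a paid click that converts.
transitions = [
    ("start", "email", 0.0, "engaged"),
    ("engaged", "cpc", 100.0, None),
]
q_table = fit_q_table(transitions, action_cost={"email": 3.2, "cpc": 3})
for key in sorted(q_table):
    print(key, round(q_table[key], 2))
```

In this toy example the converting click settles near 100 - 3 = 97, and the earlier email near -3.2 + 0.9 * 97 = 84.1, which shows how value propagates back to actions that precede a conversion.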