V0.15.0 runs hours longer than V0.14.0 #895

Open
Jiaqi-ads opened this issue Jul 4, 2024 · 3 comments

Comments

@Jiaqi-ads

Hi EconML team,

I've just upgraded EconML to v0.15.0, and the new version seems to run much slower than v0.14.0, even with one of the simplest CATE estimators. For example, training a linear DR model took less than 5 minutes with v0.14.0 but hours with v0.15.0, with all variables and datasets unchanged. What has changed in v0.15.0 that might cause this?

@kbattocchi
Collaborator

The only change that I can think of is that we have changed the default first-stage propensity and regression models to do model selection between linear and forest models instead of always just using a linear model.

We made this change because the accuracy of the CATE estimate depends strongly on having good models, and for many datasets we'd expect forest models to fit the data much better. In general, this has not resulted in large slowdowns in our own internal testing, but perhaps you have a much larger number of rows or columns than we've been testing on - what are the shapes of your Y, T, X, and W inputs?

If fitting forest models is the cause of the slowdown, you can explicitly pass first-stage models of your choice instead. However, as I mentioned, the accuracy of the CATE estimate depends on using models that actually fit your data well, so I would only fall back on linear models if you are confident that they have good predictive power in your setting.

As a side note, we released v0.15.1 yesterday, which contains some bugfixes, so you may want to upgrade to that, but I don't expect it to affect your performance issues if the cause is what I've outlined above.

@Jiaqi-ads
Author

Thanks for your prompt response! @kbattocchi

The dataset I was testing on has about 500,000 rows and about 50 columns in X and W combined, most of which are one-hot encoded categorical variables. So maybe the slowdown is due to the change in the default first-stage models?

On the accuracy of the first-stage models, though: I agree that forest models tend to be more accurate and that more accurate first-stage models lead to better CATE estimation, but I'm aware of arguments that forest models tend to produce more extreme probability scores in classification tasks. This could affect the output of the propensity model and, if the outcome variable is binary, of the regression model as well, which would ultimately affect the performance of the final CATE model. May I ask what your thoughts are on this? Thanks in advance.
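One way to check this concern on a given dataset is to compare out-of-fold propensity predictions from a linear and a forest classifier. The sketch below uses only scikit-learn and synthetic data; the helper `propensity_summary` and the 0.05 threshold for "extreme" are hypothetical choices for illustration, not part of EconML.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data (hypothetical; replace with your own X/W and T).
rng = np.random.default_rng(0)
n = 2000
XW = rng.normal(size=(n, 5))
T = rng.binomial(1, 1 / (1 + np.exp(-XW[:, 0])))

def propensity_summary(model, features, treatment, eps=0.05):
    """Summarize out-of-fold estimates of P(T=1 | features)."""
    p = cross_val_predict(
        model, features, treatment, cv=3, method="predict_proba"
    )[:, 1]
    return {
        "min": float(p.min()),
        "max": float(p.max()),
        # Share of rows whose predicted propensity is within eps of 0 or 1.
        "share_extreme": float(np.mean((p < eps) | (p > 1 - eps))),
    }

summaries = {
    "logistic": propensity_summary(LogisticRegression(max_iter=1000), XW, T),
    "forest": propensity_summary(
        RandomForestClassifier(n_estimators=100, random_state=0), XW, T
    ),
}
for name, summary in summaries.items():
    print(name, summary)
```

If the forest's share of extreme predictions is much larger than the linear model's, that would be worth raising alongside the timing results.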

@Jiaqi-ads
Author

Hi, just wanted to follow up on the speed issue. I've upgraded to v0.15.1 and tried setting both model_propensity and model_regression to 'linear'. Training on the dataset still took hours, whereas it took only four minutes with v0.14.0. The execution time was also the same as with 'auto', and switching both parameters to 'forest' didn't change it much either. So I wonder if there could be some other issue?
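To make this kind of comparison easier to reproduce, the fit could be timed on growing subsamples for each setting. The helper below is a minimal sketch using only the standard library; `time_call`, `time_over_sizes`, and `fit_subsample` are hypothetical names, and `fit_subsample` is a stand-in workload rather than an actual EconML fit.

```python
import time

def time_call(fn, *args, **kwargs):
    """Return (result, elapsed_seconds) for one call to fn."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def time_over_sizes(fit_subsample, sizes):
    """Time fit_subsample(n) for each n in sizes; returns {n: seconds}."""
    return {n: time_call(fit_subsample, n)[1] for n in sizes}

def fit_subsample(n):
    # Hypothetical stand-in workload so the sketch runs on its own.
    # In practice this would build the estimator and fit it on the first n rows.
    return sum(i * i for i in range(n))

timings = time_over_sizes(fit_subsample, [1_000, 10_000, 100_000])
for n, seconds in timings.items():
    print(f"n={n:>7}: {seconds:.4f}s")
```

Replacing the body of `fit_subsample` with a real fit on the first `n` rows, once per setting ('linear', 'auto', 'forest') and once per EconML version, would show whether run time scales differently between versions or is dominated by something other than the first-stage models.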
