
Why train the model with only one data set? #8

Open
lingling93 opened this issue Aug 19, 2019 · 1 comment

@lingling93

Hi Daniel:

I'm wondering: you have four valid data sets that contain drug-disease pairs, so why not train the final model with all the data we know?
Do you think that would be a good idea?

Lingling

@dhimmel
Owner

dhimmel commented Aug 22, 2019

> why not train the final model with all the data we know?

For the Project Rephetio study, we wanted to have some hold-out treatments for evaluating our final performance.

However, since then, I have been wanting to try training on indications in clinical trials. If you exclude the disease-modifying treatments, there are 5,594 treatments in this clinical trials set. If you still want holdout testing data, you could set aside some proportion of these 5,594 pseudo-treatments (a rough sketch of such a split follows the list below). This approach could have several benefits compared to what we did in Rephetio:

  • a greater number of positives, which could yield models that draw on more metapaths. If you remember from Figure 2, many of the metapaths through high-throughput/systematic edges were given zero coefficients in the logistic regression model despite showing predictive ability according to Δ AUROC. I think features with smaller effects might be retained by the model if the limiting sample size (that of treatments, i.e. positives) were increased.

  • training positives and negatives would never have treatment edges in the hetnet, which would help address edge-dropout contamination. In other words, I think it's best to train on compound-disease pairs that have no direct connections in the network. This could also yield better models, because, IIRC, we struggled to avoid the deleterious effects of edge dropout in Rephetio.
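As a rough sketch of the holdout idea above (pandas/scikit-learn; the file name and column contents are hypothetical stand-ins, not anything from Rephetio):

```python
# Sketch: set aside a fraction of the 5,594 clinical-trial pseudo-treatments
# as holdout positives. `clinical_trial_pairs.csv` is a hypothetical table of
# compound-disease pairs; substitute your actual data source.
import pandas as pd
from sklearn.model_selection import train_test_split

trial_pairs = pd.read_csv("clinical_trial_pairs.csv")

train_pos, holdout_pos = train_test_split(
    trial_pairs,
    test_size=0.2,    # proportion of pseudo-treatments to hold out
    random_state=0,   # fixed seed for a reproducible split
)
```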

The big downside to training on clinical trials is that they are not all disease-modifying indications. However, I think it's reasonable to assume that clinical trials enrich for true treatments compared to random compound-disease pairs. My understanding is that classifiers like our regularized logistic regression will do just fine with imperfect positives and negatives, and that a larger sample size probably matters more than perfect class labels.
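For what it's worth, here is a minimal sketch of the kind of regularized logistic regression described above, using scikit-learn's elastic-net penalty as a stand-in; the feature matrix and labels are synthetic placeholders, not Rephetio data:

```python
# Sketch: elastic-net logistic regression on metapath features with noisy
# clinical-trial labels. X and y are random placeholders; in practice X would
# hold DWPC-style metapath features and y the pseudo-treatment labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(5594, 100))    # placeholder: one row per compound-disease pair
y = rng.integers(0, 2, size=5594)   # placeholder: imperfect 0/1 labels

model = LogisticRegression(
    penalty="elasticnet",  # L1 component zeroes out weak metapath features
    solver="saga",         # solver that supports the elastic-net penalty
    l1_ratio=0.5,          # illustrative L1/L2 mix
    C=1.0,                 # inverse regularization strength
    max_iter=5000,
)
model.fit(X, y)

# With more positives, fewer coefficients should be shrunk to exactly zero.
print("nonzero coefficients:", np.count_nonzero(model.coef_))
print("training AUROC:", roc_auc_score(y, model.predict_proba(X)[:, 1]))
```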
