Dataset size and creation #3
Comments
Hey, thanks!
Hi, thank you for your fast answer. Sorry for the confusion; I will try to explain what I mean.
My question referred to list item number 5 in the Data Set Generation. I assumed that, for each molecule in your dataset, you computed the similarity to those 10 drugs, and that the molecule was discarded if the similarity was higher than 0.323. I was curious how you selected this cutoff and what type of similarity was used. As a follow-up question: for your pre-trained model you set max_seq_length (SMILES length) to 128, but in some tests you set it to 512. If I want to use your pre-trained model (max_seq_length = 128) to embed SMILES longer than 128 characters, can I simply change the max_seq_length argument, or would that embedding be incorrect?
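For readers following along, the filtering step as I understand it can be sketched roughly as below. This is only an illustration under assumptions, not the actual dataset code: it assumes Tanimoto similarity computed on fingerprints represented as sets of "on" bit indices, and the names `tanimoto`, `filter_similar`, and the toy bit sets are all hypothetical. Generating real fingerprints (e.g. ECFP4) is not shown.

```python
from typing import Dict, List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def filter_similar(
    candidates: Dict[str, Set[int]],
    references: List[Set[int]],
    threshold: float = 0.323,  # cutoff quoted in this thread
) -> List[str]:
    """Keep candidates whose maximum similarity to any reference is <= threshold."""
    kept = []
    for name, fp in candidates.items():
        max_sim = max((tanimoto(fp, ref) for ref in references), default=0.0)
        if max_sim <= threshold:  # discard only when similarity exceeds the cutoff
            kept.append(name)
    return kept


# Toy, made-up bit sets standing in for real fingerprints.
references = [{1, 2, 3, 4}]
candidates = {
    "mol_a": {1, 2, 3, 5},     # shares 3 of 5 bits with the reference
    "mol_b": {7, 8, 9},        # shares nothing
    "mol_c": {1, 10, 11, 12},  # shares 1 of 7 bits
}
kept = filter_similar(candidates, references)
```

With these toy inputs, `mol_a` exceeds the cutoff and is dropped, while `mol_b` and `mol_c` are kept.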
Hi @LivC182, thanks for your interest in our work. Regarding threshold selection for similarity filtering in the GuacaMol training dataset, I can point you to reference (86) in the GuacaMol paper, which is this blog post: http://rdkit.blogspot.com/2013/10/fingerprint-thresholds.html. I believe Tanimoto similarity was used (admittedly, the relevant figure from the blog seems to have been transcribed as 0.323 instead of 0.321). This is in line with other suggested Tanimoto thresholds for ECFP4 fingerprints (e.g. here). If you have any follow-up questions regarding the training dataset, it might be worth asking in the GuacaMol repo (apologies for the slow response here). Regarding the
Hi, first of all congrats on your article and the NeurIPS workshop.
I have a few questions: