-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MAJOR-BUG for node/edge predictions: Added ordered=True to to_mol
#503
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## main #503 +/- ##
==========================================
- Coverage 71.52% 69.72% -1.81%
==========================================
Files 93 93
Lines 8707 8710 +3
==========================================
- Hits 6228 6073 -155
- Misses 2479 2637 +158
Flags with carried forward coverage won't be shown. Click here to find out more.
|
to_mol
to_mol
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Looking through the rest of the code base we have a number of other places where this pattern is used that haven't been updated, is that intentional or do we need the same logic applied. i.e. is this only needed to be applied to the
mol
objects when usingdm.to_mol
or with the smiles as well? I've listed a few examples below, but searching the repo gives more.
graphs = []
--
for s in tqdm(smiles):
mol = dm.to_mol(s)
graphs.append(mol_to_graph_dict(mol, **featurizer))
print(graphs[0])
mol = dm.to_mol(mol=smiles)
mol_id = dm.unique_id(mol)
if isinstance(mol, str):
mol = dm.to_mol(mol)
- Most obviously in the
test_featurizer.py
- which if that needs changing to be ordered we could add an additional check to make sure that molecules are ordered, or if onlymol
objects need ordering a separate check could be added please?
err_msg = f"\n\tError for params:\n\t\tSMILES: {s}"
mol = dm.to_mol(s)
- I would appreciate it if a test could be added that confirms the ordering and to ensure this can't silently break?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, thanks for checking!
Thanks @s-maddrellmander for reviewing that. I have added comments on why the other calls to In your examples, in order, here are the reasons:
Regarding the testing, I have create issue #504 . I currently do not have time to test it properly and believe I should merge the fix, and test it later. I am confident that it will work thanks to the much better training dynamics observed above. If you disagree, and think that it should be tested before being merged, let me know. |
A major bug was discovered when working with node, nodepair and edge-level tasks.
The bug is due to RDKIT's weird behaviour in that it takes the AtomMapping and displays them correctly, but doesn't use them as indices. Unless you specify that you want to canonicalize the molecule. In case you have AtomMaps defined, it will NOT search for a canonical form, instead it will use the AtomMaps.
Below is the weird behaviour by RDKIT, the big numbers next to the carbons are the AtomMapping that are used to order the atoms with the same ordering as the labels. The small numbers are the indices. When visualizing the molecules to validate that the code was working, I didn't not look at the intrinsic indices, but only at the big number, assuming that the indices were correct.
I'm sorry for the inconvenience I have caused everyone... This bug was discovered this morning when trying to fix #467 .
I have validated the fix experimentally. Simply adding
ordered=True
indm.to_mol
yields massive improvements on both the PCQM4M and the PM6 datasets. In the plots below, I randomly subsample 20k molecules and run an MPNN model on the node-level tasks only. You can see drastic improvements.What surprises me is that the metrics on PCQM are not that bad, especially since it can reach r2=0.85 on the 4M large-mix dataset. This led me to believe that the ordering was correct. But as you can see here, we can reach r2=0.95 with only 20k molecules and 100 epochs.