Several tensorflow-related issues #14

JeremyMolineau · 2024-03-15T17:20:42Z

Hello,

I'm having a problem during model training (step 5. in USAGE) for the split_data step.
A file uspto_split_indices.npz was not found. After moving from python files to python files, I had difficulty understanding how it was generated and why.

During the previous step to generate the final reaction and reaction templates files (step 4 in USAGE), I followed the instructions of reaction_selection.py in a notebook to understand the process.
Is there the same type of python files to understand the model training process (step 5)?
Thanks for your help.

SGenheden · 2024-03-18T15:41:36Z

Hello.

This file should have been created by the first step of the modelling pipeline, and contains the indices of the data for the training, validation and test set.

The easiest in these situations is usually to run that part of the pipeline in isolation.

Once, you have activated the conda environment you can run this step with

python -m aizynthtrain.modelling.expansion_policy.create_template_lib expansion_model_pipeline_config.yml

where expansion_model_pipeline_config.yml is the name of the config file.

Try that and see if the split file is generated or if you get another error message.

JeremyMolineau · 2024-03-19T09:42:21Z

Hello,
The .npz file has been created. The split_data process has finished.

I'm encountering a new problem during the model_training step with this output:
ValueError: The filepath provided must end in .keras (Keras model format). Received: filepath=uspto_keras_model.hdf5

I don't understand because in the configs.py, the file is specified as : training_checkpoint: str = "keras_model.hdf5"
I tried to run it with the command you proposed and got the same error :
python -m aizynthtrain.modelling.expansion_policy.training expansion_model_pipeline_config.yml

SGenheden · 2024-03-19T12:04:48Z

This seems to be an issue with a new Tensorflow version. I will try to find where this is coming from.

But you could try changing the filename in the config file and see if that helps. It doesn't have to end with "hdf5", that we used historically, but if TF now want it to end with ".keras" you can change the config file.

JeremyMolineau · 2024-03-19T17:04:19Z

It works with ".keras" instead of ".hdf5"
I encountered a small problem with 3 parameters in keras_utils.py in the fit() (max_queue_size=20, workers=1 & use_multiprocessing=False).
They were not recognized as parameters.
The model runs without them. It seems that these parameters are specific to fit_generator and not to fit.

SGenheden · 2024-03-20T09:38:23Z

Glad you got it working.

It is clear that the Tensorflow code needs to be updated to support the latest version, so thank you for your patience and for reporting these issues.

JeremyMolineau · 2024-03-20T13:08:01Z

Thank you for your prompt reply.

I'm encountering an error when converting the Keras model to ONNX.
TypeError: Unable to reconstruct an instance of 'partial' because the class lacks a from_config() method. Full object config: {'module': 'functools', 'class_name': 'partial', 'config': 'top10_acc', 'registered_name': 'partial'}

I've tried to change the way top10_acc & top50_acc are registered and haven't found a solution to load the model with keras.models load_model()

SGenheden · 2024-04-01T09:34:49Z

Hard to help on this last one unless you post a more specific error trace or share the trained model.
This appears to be a Tensorflow issue and changed functionality on the tensorflow end.

JeremyMolineau · 2024-04-02T08:53:47Z

You'll find the error message in the image below.
What is the best way to share the model I've trained based on the USPTO?

Thank you for your help.

SGenheden · 2024-04-04T11:14:35Z

I am curious to what versions of tensorflow and keras you have in your environment.
According to the latest poetry.lock file you ought to have 2.8.1 for tensorflow and 2.8.0 for keras.
I could not reproduce you error with my environment.

You can reproduce the behavior in the pipeline with

import tensorflow as tf
from tensorflow.keras.models import load_model
from tensorflow.keras.metrics import top_k_categorical_accuracy
import functools

top10_acc = functools.partial(top_k_categorical_accuracy, k=10)
top10_acc.__name__ = "top10_acc" 

top50_acc = functools.partial(top_k_categorical_accuracy, k=50)
top50_acc.__name__ = "top50_acc" 

custom_objects = {
        "top10_acc": top10_acc,
        "top50_acc": top50_acc,
        "tf": tf,
}
model = load_model(
    path_to_hdf5, 
    custom_objects=custom_objects
)

and this worked for me with tensorflow=2.8.1 but not with the latest version

But then I realized that you changed the file format, and I am wondering if this is causing the issues.

How big is your model file? You could share it via e.g. Dropbox

JeremyMolineau · 2024-04-09T08:16:08Z

You are right, the versions of tensorflow (2.16.2) & keras (3.0.5) in my environment are different from the poetry.lock file.
This explains many of the problems I encountered in training a new model.

I encountered difficulties installing tensorflow from poetry and the installation of tensorflow followed #13

SBC-ICOA · 2024-04-10T09:09:41Z

Hello,

To follow up what Jeremy wrote, I reinstalled everything, from scratch, on my machine and, with Jeremy, was able to apply the whole process from downloading the uspto data to generating the models uspto_expansion.onnx and uspto_ringbreaker_expansion.onnx.
Applying modifications about siblings was helpful (issue 17).
I've the the same versions as you for tensorflow (2.8.1) and keras (2.8.0).

But I have some interrogations ?

During the installation process I saw the following message several times :
DEPRECATION: pytorch-lightning 1.8.3.post2 has a non-standard dependency specifier torch>=1.9.*.

In addition, when I first used aizynthtrain.pipelines.expansion_model_pipeline I got a message about tensorflow :

TypeError: Descriptors cannot be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

Dowgrading protobuf from 5.26 to 3.20.2 solved this problem.

Restarting the pipeline I got the new message :

W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2024-04-05 17:13:16.883078: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

I think the script is not able to find cuda 11 on my machine because cuda 12 is installed.
The pipeline is finally running but using CPU instead GPU. It doesn't really matter, because it takes less than a day to get the models and I will not have to retrain a model every day !

In my opinion, the problems are linked to the tensorflow version. To solve my problems I had to downgrade some libraries. I've seen that another solution might be to update tensorflow, but in that case I'll probably get the same problems as Jeremy.

Do you plan to integrate the latest versions of tensorflow and keras, although I suppose it could mean a lot of work to adapt your code?

Regards

SGenheden · 2024-04-15T14:16:30Z

Thanks for the feedback and your willingness to live with a few bumps.

It has made us realize that we need to substantially re-factor the code, most likely move tensorflow models to pytorch, which are easier to maintain.

SGenheden changed the title ~~files .npz during split_data step~~ Several tensorflow-related issues Apr 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Several tensorflow-related issues #14

Several tensorflow-related issues #14

JeremyMolineau commented Mar 15, 2024

SGenheden commented Mar 18, 2024

JeremyMolineau commented Mar 19, 2024 •

edited

Loading

SGenheden commented Mar 19, 2024

JeremyMolineau commented Mar 19, 2024

SGenheden commented Mar 20, 2024

JeremyMolineau commented Mar 20, 2024 •

edited

Loading

SGenheden commented Apr 1, 2024

JeremyMolineau commented Apr 2, 2024

SGenheden commented Apr 4, 2024 •

edited

Loading

JeremyMolineau commented Apr 9, 2024 •

edited

Loading

SBC-ICOA commented Apr 10, 2024

SGenheden commented Apr 15, 2024

Several tensorflow-related issues #14

Several tensorflow-related issues #14

Comments

JeremyMolineau commented Mar 15, 2024

SGenheden commented Mar 18, 2024

JeremyMolineau commented Mar 19, 2024 • edited Loading

SGenheden commented Mar 19, 2024

JeremyMolineau commented Mar 19, 2024

SGenheden commented Mar 20, 2024

JeremyMolineau commented Mar 20, 2024 • edited Loading

SGenheden commented Apr 1, 2024

JeremyMolineau commented Apr 2, 2024

SGenheden commented Apr 4, 2024 • edited Loading

JeremyMolineau commented Apr 9, 2024 • edited Loading

SBC-ICOA commented Apr 10, 2024

SGenheden commented Apr 15, 2024

JeremyMolineau commented Mar 19, 2024 •

edited

Loading

JeremyMolineau commented Mar 20, 2024 •

edited

Loading

SGenheden commented Apr 4, 2024 •

edited

Loading

JeremyMolineau commented Apr 9, 2024 •

edited

Loading