Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Several tensorflow-related issues #14

Open
JeremyMolineau opened this issue Mar 15, 2024 · 12 comments
Open

Several tensorflow-related issues #14

JeremyMolineau opened this issue Mar 15, 2024 · 12 comments

Comments

@JeremyMolineau
Copy link

Hello,

I'm having a problem during model training (step 5. in USAGE) for the split_data step.
A file uspto_split_indices.npz was not found. After moving from python files to python files, I had difficulty understanding how it was generated and why.

npz_file_missing

During the previous step to generate the final reaction and reaction templates files (step 4 in USAGE), I followed the instructions of reaction_selection.py in a notebook to understand the process.
Is there the same type of python files to understand the model training process (step 5)?
Thanks for your help.

@SGenheden
Copy link
Contributor

Hello.

This file should have been created by the first step of the modelling pipeline, and contains the indices of the data for the training, validation and test set.

The easiest in these situations is usually to run that part of the pipeline in isolation.

Once, you have activated the conda environment you can run this step with

python -m aizynthtrain.modelling.expansion_policy.create_template_lib expansion_model_pipeline_config.yml

where expansion_model_pipeline_config.yml is the name of the config file.

Try that and see if the split file is generated or if you get another error message.

@JeremyMolineau
Copy link
Author

JeremyMolineau commented Mar 19, 2024

Hello,
The .npz file has been created. The split_data process has finished.

I'm encountering a new problem during the model_training step with this output:
ValueError: The filepath provided must end in .keras (Keras model format). Received: filepath=uspto_keras_model.hdf5

I don't understand because in the configs.py, the file is specified as : training_checkpoint: str = "keras_model.hdf5"
I tried to run it with the command you proposed and got the same error :
python -m aizynthtrain.modelling.expansion_policy.training expansion_model_pipeline_config.yml

@SGenheden
Copy link
Contributor

This seems to be an issue with a new Tensorflow version. I will try to find where this is coming from.

But you could try changing the filename in the config file and see if that helps. It doesn't have to end with "hdf5", that we used historically, but if TF now want it to end with ".keras" you can change the config file.

@JeremyMolineau
Copy link
Author

It works with ".keras" instead of ".hdf5"
I encountered a small problem with 3 parameters in keras_utils.py in the fit() (max_queue_size=20, workers=1 & use_multiprocessing=False).
They were not recognized as parameters.
The model runs without them. It seems that these parameters are specific to fit_generator and not to fit.

@SGenheden
Copy link
Contributor

Glad you got it working.

It is clear that the Tensorflow code needs to be updated to support the latest version, so thank you for your patience and for reporting these issues.

@JeremyMolineau
Copy link
Author

JeremyMolineau commented Mar 20, 2024

Thank you for your prompt reply.

I'm encountering an error when converting the Keras model to ONNX.
TypeError: Unable to reconstruct an instance of 'partial' because the class lacks a from_config() method. Full object config: {'module': 'functools', 'class_name': 'partial', 'config': 'top10_acc', 'registered_name': 'partial'}

I've tried to change the way top10_acc & top50_acc are registered and haven't found a solution to load the model with keras.models load_model()

@SGenheden
Copy link
Contributor

Hard to help on this last one unless you post a more specific error trace or share the trained model.
This appears to be a Tensorflow issue and changed functionality on the tensorflow end.

@JeremyMolineau
Copy link
Author

You'll find the error message in the image below.
What is the best way to share the model I've trained based on the USPTO?

Thank you for your help.
onnx_convert_failed

@SGenheden
Copy link
Contributor

SGenheden commented Apr 4, 2024

I am curious to what versions of tensorflow and keras you have in your environment.
According to the latest poetry.lock file you ought to have 2.8.1 for tensorflow and 2.8.0 for keras.
I could not reproduce you error with my environment.

You can reproduce the behavior in the pipeline with

import tensorflow as tf
from tensorflow.keras.models import load_model
from tensorflow.keras.metrics import top_k_categorical_accuracy
import functools

top10_acc = functools.partial(top_k_categorical_accuracy, k=10)
top10_acc.__name__ = "top10_acc" 

top50_acc = functools.partial(top_k_categorical_accuracy, k=50)
top50_acc.__name__ = "top50_acc" 

custom_objects = {
        "top10_acc": top10_acc,
        "top50_acc": top50_acc,
        "tf": tf,
}
model = load_model(
    path_to_hdf5, 
    custom_objects=custom_objects
)

and this worked for me with tensorflow=2.8.1 but not with the latest version

But then I realized that you changed the file format, and I am wondering if this is causing the issues.

How big is your model file? You could share it via e.g. Dropbox

@JeremyMolineau
Copy link
Author

JeremyMolineau commented Apr 9, 2024

You are right, the versions of tensorflow (2.16.2) & keras (3.0.5) in my environment are different from the poetry.lock file.
This explains many of the problems I encountered in training a new model.

I encountered difficulties installing tensorflow from poetry and the installation of tensorflow followed #13

@SBC-ICOA
Copy link

Hello,

To follow up what Jeremy wrote, I reinstalled everything, from scratch, on my machine and, with Jeremy, was able to apply the whole process from downloading the uspto data to generating the models uspto_expansion.onnx and uspto_ringbreaker_expansion.onnx.
Applying modifications about siblings was helpful (issue 17).
I've the the same versions as you for tensorflow (2.8.1) and keras (2.8.0).

But I have some interrogations ?

During the installation process I saw the following message several times :
DEPRECATION: pytorch-lightning 1.8.3.post2 has a non-standard dependency specifier torch>=1.9.*.

In addition, when I first used aizynthtrain.pipelines.expansion_model_pipeline I got a message about tensorflow :

TypeError: Descriptors cannot be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

Dowgrading protobuf from 5.26 to 3.20.2 solved this problem.

Restarting the pipeline I got the new message :

W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2024-04-05 17:13:16.883078: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

I think the script is not able to find cuda 11 on my machine because cuda 12 is installed.
The pipeline is finally running but using CPU instead GPU. It doesn't really matter, because it takes less than a day to get the models and I will not have to retrain a model every day !

In my opinion, the problems are linked to the tensorflow version. To solve my problems I had to downgrade some libraries. I've seen that another solution might be to update tensorflow, but in that case I'll probably get the same problems as Jeremy.

Do you plan to integrate the latest versions of tensorflow and keras, although I suppose it could mean a lot of work to adapt your code?

Regards

@SGenheden
Copy link
Contributor

Thanks for the feedback and your willingness to live with a few bumps.

It has made us realize that we need to substantially re-factor the code, most likely move tensorflow models to pytorch, which are easier to maintain.

@SGenheden SGenheden changed the title files .npz during split_data step Several tensorflow-related issues Apr 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants