Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make the code compatible with other keras backends #2137

Merged
merged 9 commits into from
Dec 6, 2024

Conversation

scarlehoff
Copy link
Member

@scarlehoff scarlehoff commented Jul 26, 2024

This deals with #2134

It exchanges tf imports to keras (with small changes when the functions signatures change). It was necessary to add a few explicit output shapes here and there so that Keras knows how to compile the graph (and then we can also use the jit_compile flag, at least with the tensorflow backend).

Now, for some comparisons:

  1. In this report master is compared to this branch https://vp.nnpdf.science/ZGmx7ro7SNCESqFEmR2IcA==, TensorFlow backend and 70 replicas running in GPU (A6000) (53 min vs 43 min)
  2. Tensorflow Vs Pytorch (both running in this branch and GPU) https://vp.nnpdf.science/zKyfSvs8TxSI1FXGR72f3w== (43 min vs 178 min)

Turns out that the code is mostly general enough to work with pytorch ootb. Very few changes are needed* (the second commit, and not even, because some of the changes are tf. that I missed).

However, I've run a few quick benchmarks and turns out that it is about 4 times slower than the tensorflow backend. This is not surprising since we have been using tensorflow for a long time so we might be doing thing that just happen to be better for tensorflow. Also, while jit_compile is set to auto, I think the default for pytorch is False.

@scarlehoff scarlehoff added the n3fit Issues and PRs related to n3fit label Jul 26, 2024
@scarlehoff scarlehoff marked this pull request as draft July 26, 2024 18:47
Base automatically changed from disable_xla_tf216 to master July 28, 2024 11:12
@scarlehoff scarlehoff force-pushed the make_code_tf_independent branch from f2484d3 to 4448ba4 Compare August 16, 2024 09:38
@scarlehoff scarlehoff force-pushed the make_code_tf_independent branch 3 times, most recently from 502ac27 to 2f150e6 Compare November 22, 2024 14:42
@scarlehoff scarlehoff changed the title [WIP] Make the code compatible with other keras backends Make the code compatible with other keras backends Nov 22, 2024
@scarlehoff scarlehoff marked this pull request as ready for review November 22, 2024 16:20
@scarlehoff scarlehoff added the run-fit-bot Starts fit bot from a PR. label Nov 22, 2024
@scarlehoff scarlehoff force-pushed the make_code_tf_independent branch 2 times, most recently from 45fe562 to b3c5b27 Compare November 22, 2024 21:11
@scarlehoff
Copy link
Member Author

scarlehoff commented Nov 22, 2024

This makes the code:

  1. Keras-only (so no more tensorflow functions in the middle)
  2. Compilable wtih xla (the fit is very much not the bottleneck anymore, but for reference, running in GPU it goes from 20 replicas in 23 minutes to 20 replicas in 18 minutes)
  3. Compatible with pytorch in both cpu and gpu (however running pytorch in graph mode crashed*)
  4. Almost compatible with jax. For jax there are 2 problems: a) the preprocessing layer b) the computation of the losses. a) is fixable but b) is too much work imho so I won't do anything for that unless it turns out to be the best thing ever smh.
  5. And we can now test in conda with python 3.12 :) (fair enough, probably this can be achieved also in master removing the pin for tf 2.17)

*I don't think we want to run with pytorch, but having the possibility is nice. If we start using it for whatever reason I can debug that.

@scarlehoff scarlehoff added run-fit-bot Starts fit bot from a PR. and removed run-fit-bot Starts fit bot from a PR. labels Nov 22, 2024
Copy link

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

@RoyStegeman RoyStegeman force-pushed the make_code_tf_independent branch from 6b94412 to b859c07 Compare November 25, 2024 14:47
.github/workflows/pytorch_test.yml Outdated Show resolved Hide resolved
n3fit/src/n3fit/backends/keras_backend/MetaModel.py Outdated Show resolved Hide resolved
n3fit/src/n3fit/backends/keras_backend/MetaModel.py Outdated Show resolved Hide resolved
n3fit/src/n3fit/backends/keras_backend/operations.py Outdated Show resolved Hide resolved
n3fit/src/n3fit/layers/mask.py Outdated Show resolved Hide resolved
n3fit/src/n3fit/backends/keras_backend/operations.py Outdated Show resolved Hide resolved
@scarlehoff scarlehoff force-pushed the make_code_tf_independent branch 2 times, most recently from 59464e9 to ee35538 Compare November 27, 2024 16:34
@scarlehoff scarlehoff changed the base branch from master to rerun_fitbot_241127 November 27, 2024 16:35
@scarlehoff scarlehoff added run-fit-bot Starts fit bot from a PR. and removed run-fit-bot Starts fit bot from a PR. labels Nov 27, 2024
Copy link

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

Base automatically changed from rerun_fitbot_241127 to master November 27, 2024 18:20
@scarlehoff scarlehoff force-pushed the make_code_tf_independent branch from ee35538 to bd05e0f Compare November 27, 2024 18:20
@scarlehoff scarlehoff removed the run-fit-bot Starts fit bot from a PR. label Nov 28, 2024
@scarlehoff scarlehoff force-pushed the make_code_tf_independent branch from 3ca453d to 15e9f34 Compare November 29, 2024 07:28
@scarlehoff scarlehoff added the run-fit-bot Starts fit bot from a PR. label Nov 29, 2024
Copy link

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

@scarlehoff
Copy link
Member Author

Ok, this is ready for review. Some comparisons in the first post.

I've heard you are completely free this weekend @RoyStegeman in case you want to have another look :P

@scarlehoff scarlehoff force-pushed the make_code_tf_independent branch from 15e9f34 to 23e7620 Compare December 4, 2024 15:36
@RoyStegeman RoyStegeman force-pushed the make_code_tf_independent branch from 23e7620 to da1d6bd Compare December 5, 2024 10:21
@RoyStegeman RoyStegeman linked an issue Dec 5, 2024 that may be closed by this pull request
Copy link
Member

@RoyStegeman RoyStegeman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good. Main point is that it should be documented

.github/workflows/pytorch_test.yml Show resolved Hide resolved
pyproject.toml Show resolved Hide resolved
n3fit/src/n3fit/backends/keras_backend/internal_state.py Outdated Show resolved Hide resolved
n3fit/src/n3fit/backends/keras_backend/callbacks.py Outdated Show resolved Hide resolved
n3fit/src/n3fit/backends/keras_backend/operations.py Outdated Show resolved Hide resolved
n3fit/src/n3fit/model_gen.py Outdated Show resolved Hide resolved
@RoyStegeman RoyStegeman force-pushed the make_code_tf_independent branch from f97a928 to af832de Compare December 5, 2024 13:59
scarlehoff and others added 7 commits December 6, 2024 12:22
make the code compatible with pytorch

make it more keras only, working also with sum rules

update keras limits

test with newer tf

fix for normalized distributions
test

change workflow

test in 3.12 in conda
remove other instances of kops

Update n3fit/src/n3fit/backends/keras_backend/MetaModel.py

Co-authored-by: Roy Stegeman <[email protected]>
@RoyStegeman RoyStegeman force-pushed the make_code_tf_independent branch from 300c865 to 59ce25a Compare December 6, 2024 12:22
doc/sphinx/source/n3fit/methodology.rst Outdated Show resolved Hide resolved
doc/sphinx/source/tutorials/run-fit.rst Outdated Show resolved Hide resolved
@RoyStegeman RoyStegeman merged commit f4467d0 into master Dec 6, 2024
7 checks passed
@RoyStegeman RoyStegeman deleted the make_code_tf_independent branch December 6, 2024 13:29
@scarlehoff scarlehoff mentioned this pull request Dec 6, 2024
34 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
n3fit Issues and PRs related to n3fit run-fit-bot Starts fit bot from a PR.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Move from tensorflow.keras to keras
2 participants