# rm np_config.enable_numpy_behavior() #1093

Base branch: main
Changes from all commits:
```diff
@@ -4,9 +4,6 @@
 import numpy as np
 import tensorflow as tf

-# Enables printing of loss tensors during custom training loop
-from tensorflow.python.ops.numpy_ops import np_config
-
 import hls4ml.optimization.dsp_aware_pruning.keras.utils as utils
 from hls4ml.optimization.dsp_aware_pruning.config import SUPPORTED_STRUCTURES
 from hls4ml.optimization.dsp_aware_pruning.keras.builder import build_optimizable_model, remove_custom_regularizers
@@ -15,7 +12,6 @@
 from hls4ml.optimization.dsp_aware_pruning.keras.reduction import reduce_model
 from hls4ml.optimization.dsp_aware_pruning.scheduler import OptimizationScheduler

-np_config.enable_numpy_behavior()
 default_regularization_range = np.logspace(-6, -2, num=16).tolist()

@@ -121,7 +117,7 @@ def optimize_model(
     model.compile(optimizer, loss_fn, metrics=[validation_metric])
     baseline_performance = model.evaluate(validation_dataset, verbose=0, return_dict=False)[-1]
     if verbose:
-        print(f'Baseline performance on validation set: {baseline_performance}')
+        tf.print(f'Baseline performance on validation set: {baseline_performance}')
```
**Comment:** Will this actually print the statement, or will it "maybe print" depending on some external case?

**Reply:** In a TF execution environment (e.g., a `tf.function` with JIT) it is supposed to print, and it worked for me in other cases. I did not have the setup locally to run this part specifically, so it is not tested here.

**Comment:** The NumPy flag is only required for printing the loss tensors during the custom training loop; everything else should work fine with normal Python prints. However, there are no tests that run the full DSP-aware pruning flow and train a model with sparsity, so we need to make sure that the behaviour stays the same and that the loss tensors are still printed.
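The question above hinges on *when* Python-side code runs in compiled TF code. A minimal sketch (assuming TensorFlow is installed; the function and names below are illustrative, not code from this PR): inside a `tf.function`, a plain Python side effect executes only while the function is being traced, whereas `tf.print` becomes a graph op that runs on every call.

```python
import tensorflow as tf

# Illustrative sketch, not code from the PR: contrast trace-time Python
# execution with the graph op tf.print inside a tf.function.
trace_count = {'n': 0}

@tf.function
def step(x):
    trace_count['n'] += 1   # Python side effect: runs only when tracing
    tf.print('loss:', x)    # graph op: runs on every call
    return x + 1

step(tf.constant(1.0))
step(tf.constant(2.0))  # same input signature, so no retrace
# The Python side effect fired once (at trace time), while tf.print fired twice.
assert trace_count['n'] == 1
```

In the eager custom training loop touched by this diff, both `print` and `tf.print` execute each iteration; the distinction only matters once the loop body is wrapped in a traced/jitted function, which is the case the replies above are hedging about.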
```diff
@@ -222,7 +218,7 @@ def optimize_model(
         # Train model with weight freezing [pruning]
         if verbose:
-            print(f'Pruning with a target sparsity of {target_sparsity * 100.0}% [relative to objective]')
+            tf.print(f'Pruning with a target sparsity of {target_sparsity * 100.0}% [relative to objective]')
         for epoch in range(epochs - rewinding_epochs):
             start_time = time.time()
             epoch_loss_avg = tf.keras.metrics.Mean()
@@ -237,14 +233,14 @@ def optimize_model(
             val_res = optimizable_model.evaluate(validation_dataset, verbose=0, return_dict=False)
             t = time.time() - start_time
             avg_loss = round(epoch_loss_avg.result(), 3)
-            print(f'Epoch: {epoch + 1} - Time: {t}s - Average training loss: {avg_loss}')
-            print(f'Epoch: {epoch + 1} - learning_rate: {optimizable_model.optimizer.learning_rate.numpy()}')
-            print(f'Epoch: {epoch + 1} - Validation loss: {val_res[0]} - Performance on validation set: {val_res[1]}')
+            tf.print(f'Epoch: {epoch + 1} - Time: {t}s - Average training loss: {avg_loss}')
+            tf.print(f'Epoch: {epoch + 1} - learning_rate: {optimizable_model.optimizer.learning_rate.numpy()}')
+            tf.print(f'Epoch: {epoch + 1} - Validation loss: {val_res[0]} - Performance on validation set: {val_res[1]}')

         # Check if model works after pruning
         pruned_performance = optimizable_model.evaluate(validation_dataset, verbose=0, return_dict=False)[-1]
         if verbose:
-            print(f'Optimized model performance on validation set, after fine-tuning: {pruned_performance}')
+            tf.print(f'Optimized model performance on validation set, after fine-tuning: {pruned_performance}')

         if __compare__(pruned_performance, rtol * baseline_performance, not increasing):
             bad_trials = 0
@@ -260,7 +256,7 @@ def optimize_model(
         # Train model without weight freezing [rewinding]
         if verbose:
-            print(f'Starting weight rewinding for {rewinding_epochs} epochs')
+            tf.print(f'Starting weight rewinding for {rewinding_epochs} epochs')
         optimizable_model.fit(
             train_dataset,
             validation_data=validation_dataset,
@@ -293,7 +289,7 @@ def optimize_model(
     # Evaluate final optimized model [purely for debugging / informative purposes]
     if verbose:
         pruned_performance = optimizable_model.evaluate(validation_dataset, verbose=0, return_dict=False)[-1]
-        print(f'Optimized model performance on validation set: {pruned_performance}')
+        tf.print(f'Optimized model performance on validation set: {pruned_performance}')

     return optimizable_model
```
**Comment:** Can you provide more details as to why this breaks the flow on some setups?

**Reply:** Some indexing and other behaviours change, with some undocumented effects. For example, in default mode `tf.reduce_all(a > b)` is fine, but it raises once this flag is set. It also causes some qkeras quantizers to return dicts for an unknown reason; I did not track down the direct cause. Failed tests:

- qkeras po2 quantizer: failed on a condition using the `x >= y` pattern
- hgq activation: failure in jitted code, not yet located
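For context on why this flag can break unrelated tests: `enable_numpy_behavior` patches `tf.Tensor` process-wide, so enabling it in one module changes tensor semantics for everything imported afterwards, and there is no public API to turn it back off. A hedged sketch (assuming TensorFlow is installed; this demonstrates the global monkey-patching, not the exact qkeras/hgq failure):

```python
import tensorflow as tf

t = tf.constant([[1, 2], [3, 4]])
# Before the flag is enabled, Tensor does not expose NumPy-style methods.
had_ravel_before = hasattr(t, 'ravel')

# Process-global switch: this cannot be undone for the rest of the process.
tf.experimental.numpy.experimental_enable_numpy_behavior()

# Even the tensor created *before* the call now has NumPy methods,
# because the Tensor class itself was patched.
assert not had_ravel_before
assert hasattr(t, 'ravel')
```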
**Comment:** Where does this break occur? Is it with a newer version of NumPy or TensorFlow? The CI/CD on GitHub doesn't show this, so I am wondering under what environment it occurs.

**Reply:** The GitLab CI is not affected because this line is not executed before those (qkeras po2 and hgq) tests: only the DSP-aware pruning tests import the optimization API and trigger this line, and those don't run before the broken tests.

**Comment:** So what set-up would cause this error? Importing both `dsp_aware_pruning` and hgq/po2 in the same script?

**Reply:** Any setup where some tests run with NumPy behaviour enabled. Full failure pattern: https://gitlab.cern.ch/fastmachinelearning/hls4ml/-/pipelines/8406093 (NumPy behaviour enabled at hls4ml import).

**Comment:** I see. I have two comments, but in general I am okay with merging this PR as long as the loss is actually printed with `tf.print`. First, the error seems a bit artificial: is a user ever likely to import both `dsp_aware_pruning` and `hgq`/`po2` in the same module? Secondly, looking at the test logs, this seems to be a TensorFlow issue from a recent update. Looking at the docs for this function (https://www.tensorflow.org/api_docs/python/tf/experimental/numpy/experimental_enable_numpy_behavior), maybe the `dtype_conversion_mode` parameter can help solve this issue without changing the print statements.

**Reply:** Response to your comments: either keep `tf.print` to avoid affecting things outside, or pack the flag into a context manager that restores the original state after exiting the module. Regarding whether the losses are actually printed: while I believe it should do so, I don't have the setup to validate it on my side.