rm np_config.enable_numpy_behavior() #1093

Open · wants to merge 1 commit into main
hls4ml/optimization/dsp_aware_pruning/keras/__init__.py (20 changes: 8 additions & 12 deletions)
@@ -4,9 +4,6 @@
 import numpy as np
 import tensorflow as tf
 
-# Enables printing of loss tensors during custom training loop
-from tensorflow.python.ops.numpy_ops import np_config
-
 import hls4ml.optimization.dsp_aware_pruning.keras.utils as utils
 from hls4ml.optimization.dsp_aware_pruning.config import SUPPORTED_STRUCTURES
 from hls4ml.optimization.dsp_aware_pruning.keras.builder import build_optimizable_model, remove_custom_regularizers
@@ -15,7 +12,6 @@
 from hls4ml.optimization.dsp_aware_pruning.keras.reduction import reduce_model
 from hls4ml.optimization.dsp_aware_pruning.scheduler import OptimizationScheduler
 
-np_config.enable_numpy_behavior()
Contributor:
Can you provide more details as to why this breaks the flow on some setups?

@calad0i (Author), Oct 26, 2024:
Enabling NumPy behavior changes indexing and some other tensor behaviors, with some undocumented effects. For example, in default mode tf.reduce_all(a > b) is fine, but it raises once this flag is set. This also causes some qkeras quantizers to return dicts for an unknown reason; I did not track down the direct cause.

Failed tests:
- qkeras po2 quantizer: fails on the x >= y condition pattern
- hgq activation: failure in jitted code, not yet located

Contributor:
Where does this break occur? Is it with a newer version of NumPy or TensorFlow? The CI/CD on GitHub doesn't show this, so I am wondering under what environment it occurs.

@calad0i (Author):
The GitLab CI tests are not affected because this line is not executed before the qkeras po2 and hgq tests: only the dsp-aware-pruning tests import the optimization API and trigger this line, and those do not run before the broken tests.

Contributor:
So what setup would cause this error? Importing both dsp_aware_pruning and hgq / po2 in the same script?

@calad0i (Author), Oct 28, 2024:
Any setup where some tests are run with NumPy behavior enabled. Full failure pattern: https://gitlab.cern.ch/fastmachinelearning/hls4ml/-/pipelines/8406093 (NumPy behavior enabled at hls4ml import).

Contributor:
I see. I have two comments, but in general I am okay with merging this PR as long as the loss is actually printed with tf.print. First, the error scenario seems a bit artificial: is a user ever likely to import both dsp_aware_pruning and hgq / po2 in the same module? Secondly, the test logs suggest this is a TensorFlow issue introduced in a recent update. Looking at the docs for this function, https://www.tensorflow.org/api_docs/python/tf/experimental/numpy/experimental_enable_numpy_behavior, the dtype_conversion_mode parameter may solve this issue without changing the print statements.
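A minimal sketch of that suggestion, following the linked documentation (the mode names and default are taken from the TF docs and may differ across TensorFlow versions):

```python
import tensorflow as tf

# Hypothetical configuration: keep the NumPy-style conveniences but choose the
# dtype promotion mode explicitly instead of relying on the default.
tf.experimental.numpy.experimental_enable_numpy_behavior(
    dtype_conversion_mode='legacy'  # the docs list 'off', 'legacy', 'safe', 'all'
)
```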

@calad0i (Author), Oct 29, 2024:
Response to your comments:

  1. I don't think the errors are artificial: if one runs the tests on a single node, this is what to expect. I spent at least half an hour debugging why some unrelated tests broke after an edit, only to find that the test execution order had changed and things broke because of it.
  2. The underlying issue is that global behavior is silently overridden by merely importing a submodule of a package, and this should be avoided to the maximum extent possible (see the sketch below). In this specific case it only breaks hgq/po2 conversion in hls4ml, but it could introduce undefined behavior elsewhere. I would suggest switching the print statements to tf.print to avoid affecting anything outside the module, or packing the flag into a context manager that restores the original state when the module exits.

Regarding whether the losses are actually printed: I believe they should be, but I don't have the setup to validate it on my side.
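A hedged illustration of the import side effect described in point 2, assuming TensorFlow 2.x; the .reshape probe is one of the NumPy-style methods that enable_numpy_behavior() is documented to patch onto tf.Tensor:

```python
import tensorflow as tf

t = tf.constant([1, 2, 3])
# A plain tf.Tensor exposes no NumPy-style methods; reshaping requires tf.reshape.
assert not hasattr(t, 'reshape')

# Pre-PR behavior: this import ran np_config.enable_numpy_behavior() as a side effect.
import hls4ml.optimization.dsp_aware_pruning.keras  # noqa: F401

# The flag is process-wide, so even tensors created before the import are affected.
assert hasattr(t, 'reshape')
```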

 default_regularization_range = np.logspace(-6, -2, num=16).tolist()


@@ -121,7 +117,7 @@ def optimize_model(
     model.compile(optimizer, loss_fn, metrics=[validation_metric])
     baseline_performance = model.evaluate(validation_dataset, verbose=0, return_dict=False)[-1]
     if verbose:
-        print(f'Baseline performance on validation set: {baseline_performance}')
+        tf.print(f'Baseline performance on validation set: {baseline_performance}')
Contributor:
Will this actually print the statement, or will it "maybe print" depending on some external case?

@calad0i (Author):
In a TF execution environment (e.g., a tf.function with jit), it is supposed to print, and it has worked for me in other cases. I did not have the setup locally to run this part specifically, so it is not tested here.
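For context, a minimal, self-contained illustration of the standard TensorFlow behavior at play here: inside a tf.function, Python print runs only when the function is traced, while tf.print executes on every call.

```python
import tensorflow as tf

@tf.function
def step(x):
    print('traced')        # Python print: runs once, at trace time
    tf.print('loss:', x)   # tf.print: runs on every invocation, even compiled
    return x * 2

step(tf.constant(1.0))  # prints 'traced' and 'loss: 1'
step(tf.constant(2.0))  # prints only 'loss: 2'
```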

Contributor:
The NumPy flag is only required for printing the loss tensors during training; everything else should work fine with normal Python prints. However, there are no tests that run the full dsp-optimization flow and train a model with sparsity, so we need to make sure that the behaviour stays the same and that the loss tensors are still printed.


     # Save best weights
     # Always save weights to a file, to reduce memory utilization
@@ -222,7 +218,7 @@ def optimize_model(

         # Train model with weight freezing [pruning]
         if verbose:
-            print(f'Pruning with a target sparsity of {target_sparsity * 100.0}% [relative to objective]')
+            tf.print(f'Pruning with a target sparsity of {target_sparsity * 100.0}% [relative to objective]')
         for epoch in range(epochs - rewinding_epochs):
             start_time = time.time()
             epoch_loss_avg = tf.keras.metrics.Mean()
@@ -237,14 +233,14 @@
             val_res = optimizable_model.evaluate(validation_dataset, verbose=0, return_dict=False)
             t = time.time() - start_time
             avg_loss = round(epoch_loss_avg.result(), 3)
-            print(f'Epoch: {epoch + 1} - Time: {t}s - Average training loss: {avg_loss}')
-            print(f'Epoch: {epoch + 1} - learning_rate: {optimizable_model.optimizer.learning_rate.numpy()}')
-            print(f'Epoch: {epoch + 1} - Validation loss: {val_res[0]} - Performance on validation set: {val_res[1]}')
+            tf.print(f'Epoch: {epoch + 1} - Time: {t}s - Average training loss: {avg_loss}')
+            tf.print(f'Epoch: {epoch + 1} - learning_rate: {optimizable_model.optimizer.learning_rate.numpy()}')
+            tf.print(f'Epoch: {epoch + 1} - Validation loss: {val_res[0]} - Performance on validation set: {val_res[1]}')

         # Check if model works after pruning
         pruned_performance = optimizable_model.evaluate(validation_dataset, verbose=0, return_dict=False)[-1]
         if verbose:
-            print(f'Optimized model performance on validation set, after fine-tuning: {pruned_performance}')
+            tf.print(f'Optimized model performance on validation set, after fine-tuning: {pruned_performance}')
 
         if __compare__(pruned_performance, rtol * baseline_performance, not increasing):
             bad_trials = 0
@@ -260,7 +256,7 @@

         # Train model without weight freezing [rewinding]
         if verbose:
-            print(f'Starting weight rewinding for {rewinding_epochs} epochs')
+            tf.print(f'Starting weight rewinding for {rewinding_epochs} epochs')
         optimizable_model.fit(
             train_dataset,
             validation_data=validation_dataset,
@@ -293,7 +289,7 @@ def optimize_model(
     # Evaluate final optimized model [purely for debugging / informative purposes]
     if verbose:
         pruned_performance = optimizable_model.evaluate(validation_dataset, verbose=0, return_dict=False)[-1]
-        print(f'Optimized model performance on validation set: {pruned_performance}')
+        tf.print(f'Optimized model performance on validation set: {pruned_performance}')

return optimizable_model

@@ -82,7 +82,7 @@ def __call__(self, weights):
         # The matrix is transposed, according to Resource strategy, and reshaped into (pattern_offset, pattern_number)
         # Pattern offset corresponds to the number of patterns and is equivalent to RF
         if (np.prod(weights.shape)) % self.pattern_offset != 0:
-            print(np.prod(weights.shape), self.pattern_offset)
+            tf.print(np.prod(weights.shape), self.pattern_offset)
             raise Exception(f'{self.__class__.__name__}: pattern offset needs to be a factor of matrix size')
 
         if self.pattern_offset % self.consecutive_patterns != 0: