Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] validation check failed is always sent prematurely #550

Closed
sadsfae opened this issue Nov 6, 2024 · 5 comments
Closed

[BUG] validation check failed is always sent prematurely #550

sadsfae opened this issue Nov 6, 2024 · 5 comments

Comments

@sadsfae
Copy link
Member

sadsfae commented Nov 6, 2024

This is a much older and minor issue but the validation check notification email is always sent.

Validation check failed for cloud61 / wfoster / 3885

A post allocation test has failed for:

   cloud: cloud61
   owner: wfoster
   ticket: 3885

The release of the environment is contingent on the success
of the validation scripts.  Before users are notified, make
sure to address the environment and ensure all validation steps
succeed to ensure a timely release to the end user.

DevOps Team

Results:
Unable to query Foreman for cloud: cloud61
Verify Foreman password is correct: ***@****

We should not be sending any notification emails from validate_environment phases while in the configurable grace window.

Fix

I think that we should run an pre-emptive quads --foreman-rbac here as in testing this makes sure the message is actually relevant (just tested this with another env)

@kambiz-aghaiepour
Copy link
Contributor

Given we only run this out of cron on Monday through Saturday:

-=>>crontab -l | grep foreman
0 */3 * * 1-6 ( echo "=== Foreman RBAC == @ " $(date) ; flock -n /tmp/foremanrbac.lock -c "quads --foreman-rbac" ) 1>/dev/null 2>&1

and new schedules go out on Sundays, maybe we want to also run the cron on Sundays as well? Also, for the hour, */3 means it runs at 0, 3, 6, 9, 12, 15, 18, and 21. The schedules typically start at 22 hours, so the next check won't run until 2 hours after the schedules start, and our grace period for starting the check is 30 minutes after the schedule starts. So with the validation check running at:

*/20 * * * * ( echo "=== Validating == @ " $(date) ; flock -n /tmp/validateenv.lock -c "quads --validate-env" ) 1>>/var/log/validate-env.log 2>&1

it means the next validation run will start at 22:40 and be after the end of the grace period. This is still well before the next rbac run. Depending on the cost of running quads --foreman-rbac, I think we can fix this by setting the rbac cron to run more frequently with:

5 */2 * * * ( echo "=== Foreman RBAC == @ " $(date) ; flock -n /tmp/foremanrbac.lock -c "quads --foreman-rbac" ) 1>/dev/null 2>&1

which should have the rbac run to trigger at 22:05, well before the next validation run.

@sadsfae
Copy link
Member Author

sadsfae commented Nov 8, 2024

@kambiz-aghaiepour running more frequently wouldn't be a bad thing but I think we should also run a pre-emptive quads --foreman-rbac too somewhere that makes sense. Let's just go with the cron adjustments though for now.

I'd want to discuss with @grafuls on how to handle this, but likely we'd want an assignment table field in the database to ensure it's run only once and then just run it out of move_and_rebuild. Like the notification labels this can also be reset when the assignment is reclaimed.

@sadsfae
Copy link
Member Author

sadsfae commented Nov 8, 2024

Currently this message gets generated here in the notify_failure method: https://github.com/redhat-performance/quads/blob/latest/src/quads/tools/validate_env.py#L43 and it will always false-positive unless we're exactly spot on with cron runs (which seems like a bandaid).

If we had an one-time-per-release --foreman-rbac call in M&R before VE gets processed it will avoid getting the failure message entirely rather than increase the frequency from cron.

At that point we may be able to just completely remove the cron.

@sadsfae sadsfae added the foreman label Nov 8, 2024
@sadsfae
Copy link
Member Author

sadsfae commented Nov 8, 2024

Lastly, we want a check that we're not in grace period before notify_failure gets processed anyway: https://github.com/redhat-performance/quads/blob/latest/src/quads/tools/validate_env.py#L43

e.g.

if time_delta.total_seconds() // 60 > Config["validation_grace_period"]:

https://github.com/redhat-performance/quads/blob/latest/conf/quads.yml#L154

@sadsfae
Copy link
Member Author

sadsfae commented Nov 8, 2024

@github-project-automation github-project-automation bot moved this from To do to Done in QUADS 2.1 Series Nov 11, 2024
@sadsfae sadsfae self-assigned this Nov 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

No branches or pull requests

2 participants