-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove redundant alerts docs #4703
Draft
richardTowers
wants to merge
20
commits into
main
Choose a base branch
from
remove-redundant-alerts-docs
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
These were fairly self explanatory anyway, and there's no longer an Icinga to alert on them.
There's already a troubleshooting section here https://docs.publishing.service.gov.uk/manual/fall-back-to-mirror.html#troubleshooting And there's no Icinga check for this anymore.
These were under "Icinga alerts", but those aren't a thing anymore. There's still quite a bit of potentially useful information here though, so I think it's best to keep it but move it under the Content Data section (which already exists)
This app is hardly used now, and we haven't got this alert in place anymore. If it breaks, we'll just have to work it out.
There's no longer an Incinga instance checking the health of the elasticsearch cluster for us, but a lot of this documentation is still salvageable. I've been lazy and not fixed the guidance on tunneling, instead just linking to a similar idea for OpenSearch. We should probably make that documentation more generic, as it's already used for Manage Amazon MQ and Manage OpenSearch on AWS.
Once you remove the bits about icinga, and other bits of infrastructure which are no longer relevant, what remains of this documentation is mostly just generic advice on how to debug things on GOV.UK.
This alert doesn't exist anymore, and there's already good documentation on how to query fastly logs. Nothing to be salvaged here.
This alert doesn't exist anymore, and if we accidentally block traffic from getting to the apps we'll find out about it in other ways (pingdom, smoke tests, 5xx rates).
The docs at least say that we'll be paged in the pingdom check fails, so I think this should be in that section.
As far as I can see, we don't have a pingdom check for the search results page. And even if we did, this documentation is pretty bad... "This is not as critical a problem as you might assume" - how do the docs know how critical a problem someone would assume this is??
There's no alert for this anymore, and I feel like this documentation is probably no better than the existence of the rake task we have to fix the issue.
There's no alert for these anymore, so let's not mention alerts. Moving this into the Publishing section feels like the right thing to do. I also couldn't resist simplifying the k exec call - rails runner will run the statements you give it non interactively.
I don't think this is really all that different from any other app which has a healthcheck that might fail. The steps for investigating are either obvious (look at the logs) or just point to other docs (elasticsearch cluster health), which the user probably would have found straightaway through search if these docs weren't here.
Both the alert and the learn to rank process are now pushing up the daisies.
There's no longer an alert for this. I don't think there's really anything here that's specific to search, and not just general sidekiq guidance (which we already have).
There's no longer an alert for this. I suspect we also don't have a scheduled job runnin on Monday at 9pm anymore, because without the alert to let us know it had failed that would be pretty pointless.
There's no longer an alert for this. I'm reasonably confident we're also not using the machine learning it refers to. There are no sagemaker endpoints in our AWS account.
There's no longer an alert for this, but the docs still seem useful. Moving them to the Publishing section so they're not lost, and making some slight changes to wording / updating a link.
These alerts no longer exist, and most of this documentation is talking about what the alerts are, and what they mean. There are some specific suggestions at the end of the doc, but they involve running things in the rails console which is a very icky anyway. The rest of the documentation is generic enough I'm confortable deleting it.
richardTowers
force-pushed
the
remove-redundant-alerts-docs
branch
from
July 16, 2024 12:28
a4cb68f
to
99658b0
Compare
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
🔥
Best reviewed commit by commit, as I've explained my reasoning for each deletion / change / move.
I've done my best to check that the files I've deleted are not linked to anywhere, and that the files I've moved have redirects and have had any links within the docs updated.
This is a precurser to going through and fixing all the links we've got to old grafanas, jenkinss etc. - no point trying to fix those links in obsolete docs.
I probably need to double check some of my own commits, because I've just worked out that whitehall scheduled publishing does have an alert and that it still pages - https://github.com/alphagov/govuk-helm-charts/blob/main/charts/monitoring-config/rules/whitehall.yaml#L9
A few things it would be good to follow up on afterwards: