Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove redundant alerts docs #4703

Draft
wants to merge 20 commits into
base: main
Choose a base branch
from
Draft

Conversation

richardTowers
Copy link
Contributor

@richardTowers richardTowers commented Jun 21, 2024

🔥

Best reviewed commit by commit, as I've explained my reasoning for each deletion / change / move.

I've done my best to check that the files I've deleted are not linked to anywhere, and that the files I've moved have redirects and have had any links within the docs updated.

This is a precurser to going through and fixing all the links we've got to old grafanas, jenkinss etc. - no point trying to fix those links in obsolete docs.

I probably need to double check some of my own commits, because I've just worked out that whitehall scheduled publishing does have an alert and that it still pages - https://github.com/alphagov/govuk-helm-charts/blob/main/charts/monitoring-config/rules/whitehall.yaml#L9

A few things it would be good to follow up on afterwards:

@richardTowers richardTowers marked this pull request as draft June 21, 2024 19:29
These were fairly self explanatory anyway, and there's no longer an
Icinga to alert on them.
There's already a troubleshooting section here
https://docs.publishing.service.gov.uk/manual/fall-back-to-mirror.html#troubleshooting

And there's no Icinga check for this anymore.
These were under "Icinga alerts", but those aren't a thing anymore.
There's still quite a bit of potentially useful information here though,
so I think it's best to keep it but move it under the Content Data
section (which already exists)
This app is hardly used now, and we haven't got this alert in place
anymore. If it breaks, we'll just have to work it out.
There's no longer an Incinga instance checking the health of the
elasticsearch cluster for us, but a lot of this documentation is still
salvageable.

I've been lazy and not fixed the guidance on tunneling, instead just
linking to a similar idea for OpenSearch. We should probably make that
documentation more generic, as it's already used for Manage Amazon MQ
and Manage OpenSearch on AWS.
Once you remove the bits about icinga, and other bits of infrastructure
which are no longer relevant, what remains of this documentation is
mostly just generic advice on how to debug things on GOV.UK.
This alert doesn't exist anymore, and there's already good documentation
on how to query fastly logs. Nothing to be salvaged here.
This alert doesn't exist anymore, and if we accidentally block traffic
from getting to the apps we'll find out about it in other ways (pingdom,
smoke tests, 5xx rates).
The docs at least say that we'll be paged in the pingdom check fails, so
I think this should be in that section.
As far as I can see, we don't have a pingdom check for the search
results page. And even if we did, this documentation is pretty bad...
"This is not as critical a problem as you might assume" - how do the
docs know how critical a problem someone would assume this is??
There's no alert for this anymore, and I feel like this documentation is
probably no better than the existence of the rake task we have to fix
the issue.
There's no alert for these anymore, so let's not mention alerts. Moving
this into the Publishing section feels like the right thing to do.

I also couldn't resist simplifying the k exec call - rails runner will
run the statements you give it non interactively.
I don't think this is really all that different from any other app which
has a healthcheck that might fail. The steps for investigating are
either obvious (look at the logs) or just point to other docs
(elasticsearch cluster health), which the user probably would have found
straightaway through search if these docs weren't here.
Both the alert and the learn to rank process are now pushing up the daisies.
There's no longer an alert for this. I don't think there's really
anything here that's specific to search, and not just general sidekiq
guidance (which we already have).
There's no longer an alert for this. I suspect we also don't have a
scheduled job runnin on Monday at 9pm anymore, because without the alert
to let us know it had failed that would be pretty pointless.
There's no longer an alert for this.

I'm reasonably confident we're also not using the machine learning it
refers to. There are no sagemaker endpoints in our AWS account.
There's no longer an alert for this, but the docs still seem useful.

Moving them to the Publishing section so they're not lost, and making
some slight changes to wording / updating a link.
These alerts no longer exist, and most of this documentation is talking
about what the alerts are, and what they mean.

There are some specific suggestions at the end of the doc, but they
involve running things in the rails console which is a very icky anyway.
The rest of the documentation is generic enough I'm confortable deleting
it.
@richardTowers richardTowers force-pushed the remove-redundant-alerts-docs branch from a4cb68f to 99658b0 Compare July 16, 2024 12:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant