docs: ✏️ fix broken links, some updates to out-of-date info
Relates to [link-checker-report](#6429)
sj-williams committed Nov 11, 2024
1 parent 7ffb207 commit 8faf919
Showing 4 changed files with 13 additions and 8 deletions.
6 changes: 3 additions & 3 deletions architecture-decision-record/022-EKS.md
@@ -1,6 +1,6 @@
# EKS

-Date: 02/05/2021
+Date: 11/11/2024

## Status

@@ -32,7 +32,7 @@ We already run the Manager cluster on EKS, and have gained a lot of insight and

Developers in service teams need to use the k8s auth, and GitHub continues to be the most common SSO amongst them with good tie-in to JML processes - see [ADR 6 Use GitHub as our identity provider](006-Use-github-as-user-directory.md)

-Auth0 is useful as a broker, for a couple of important [rules that it runs at login time](https://github.com/ministryofjustice/cloud-platform-infrastructure/tree/main/terraform/global-resources/resources/auth0-rules):
+Auth0 is useful as a broker, for a couple of important [rules that it runs at login time](https://github.com/ministryofjustice/cloud-platform-terraform-global-resources-auth0):

* it ensures that the user is in the ministryofjustice GitHub organization, so only staff can get a kubeconfig and login to CP websites like Grafana
* it inserts the user's GitHub teams into the OIDC ID token as claims. These are used by k8s RBAC to authorize the user for the correct namespaces
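
For illustration, a minimal sketch of what a login rule of this kind could look like, using Auth0's legacy rules signature (`function (user, context, callback)`). The function name, claim namespace, and the shape of the GitHub profile data are assumptions for the example, not taken from the cloud-platform-terraform-global-resources-auth0 repository:

```typescript
// Hypothetical sketch only - the real rules live in cloud-platform-terraform-global-resources-auth0.
function addGithubTeamsClaim(user: any, context: any, callback: any) {
  const requiredOrg = "ministryofjustice";
  const claimNamespace = "https://cloud-platform.example/"; // placeholder namespace for custom claims

  // The shape of the GitHub profile data here is an assumption for the example.
  const orgs: string[] = (user.organizations || []).map((o: any) => o.login);
  const teams: string[] = user.github_teams || [];

  if (!orgs.includes(requiredOrg)) {
    // Only members of the ministryofjustice org should get a kubeconfig or reach CP websites
    return callback(new Error(`user is not a member of ${requiredOrg}`));
  }

  // Insert the user's GitHub teams into the OIDC ID token as a custom claim;
  // k8s RBAC then maps these team names to the correct namespaces.
  context.idToken[claimNamespace + "github_teams"] = teams;
  return callback(null, user, context);
}
```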
@@ -157,7 +157,7 @@ Advantages of AWS's CNI:
* it is the default with EKS, native to AWS, is fully supported by AWS - low management overhead
* offers good network performance

-The concern with AWS's CNI would be that it uses an IP address for every pod, and there is a [limit per node](https://github.com/awslabs/amazon-eks-ami/blob/master/files/eni-max-pods.txt), depending on the EC2 instance type and the number of ENIs it supports. The calculations in [Node Instance Types](#node-instance-types) show that with a change of instance type, the cost of the cluster increases by 17% or $8k, which is acceptable - likely less than the engineering cost of maintaining and supporting full Calico networking and custom node image.
+The concern with AWS's CNI would be that it uses an IP address for every pod, and there is a [limit per node](https://github.com/awslabs/amazon-eks-ami/blob/main/nodeadm/internal/kubelet/eni-max-pods.txt), depending on the EC2 instance type and the number of ENIs it supports. The calculations in [Node Instance Types](#node-instance-types) show that with a change of instance type, the cost of the cluster increases by 17% or $8k, which is acceptable - likely less than the engineering cost of maintaining and supporting full Calico networking and custom node image.
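
As a rough illustration of where the per-node limit in that file comes from, the commonly cited AWS VPC CNI formula is `maxPods = ENIs * (IPv4 addresses per ENI - 1) + 2`; the sketch below assumes that formula, and the instance figures are examples to be checked against the linked file:

```typescript
// Sketch of the calculation behind eni-max-pods.txt (formula as documented for the AWS VPC CNI):
//   maxPods = ENIs * (IPv4 addresses per ENI - 1) + 2
function maxPodsPerNode(enis: number, ipv4PerEni: number): number {
  return enis * (ipv4PerEni - 1) + 2;
}

// Example figures (verify against the linked file for authoritative numbers):
console.log(maxPodsPerNode(4, 15)); // m5.xlarge  -> 58 pods per node
console.log(maxPodsPerNode(8, 30)); // m5.4xlarge -> 234 pods per node
```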

The alternative considered was [Calico networking](https://docs.projectcalico.org/getting-started/kubernetes/managed-public-cloud/eks#install-eks-with-calico-networking). This has the advantage of not needing an IP address per pod, and associated instance limit. And it is open source. However:

7 changes: 5 additions & 2 deletions architecture-decision-record/023-Logging.md
@@ -1,14 +1,17 @@
# 23 Logging

-Date: 02/06/2021
+Date: 11/11/2024

## Status

✅ Accepted

## Context

-Cloud Platform's existing strategy for logs has been to **centralize** them in an ElasticSearch instance (Saas hosted by AWS OpenSearch). This allows [service teams](https://user-guide.cloud-platform.service.justice.gov.uk/documentation/logging-an-app/access-logs.html#accessing-application-log-data) and Cloud Platform team to use Kibana's search and browse functionality, for the purpose of debug and resolving incidents. All pods' stdout get [shipped using Fluentbit](https://user-guide.cloud-platform.service.justice.gov.uk/documentation/logging-an-app/log-collection-and-storage.html#application-log-collection-and-storage) and ElasticSearch stored them for 30 days.
+> Cloud Platform's existing strategy for logs has been to **centralize** them in an ElasticSearch instance (SaaS hosted by AWS OpenSearch).
+As of November 2024, we have migrated the logging service over to AWS OpenSearch, with ElasticSearch due for retirement (pending some decisions and actions on how to manage existing data retention on that cluster).
+Service teams can use OpenSearch's [search and browse functionality](https://app-logs.cloud-platform.service.justice.gov.uk/_dashboards/app/home#/) for the purposes of debugging and resolving incidents. All pods' stdout gets [shipped using Fluentbit](https://user-guide.cloud-platform.service.justice.gov.uk/documentation/logging-an-app/log-collection-and-storage.html#application-log-collection-and-storage) and ElasticSearch stores it for 30 days.

Concerns with existing ElasticSearch logging:

6 changes: 4 additions & 2 deletions architecture-decision-record/026-Managed-Prometheus.md
@@ -1,6 +1,6 @@
# 26 Managed Prometheus

-Date: 2021-10-08
+Date: 2024-11-11

## Status

@@ -67,7 +67,9 @@ We also need to address:

**Sharding**: We could split/shard the Prometheus instance: perhaps dividing into two - tenants and platform. Or if we did multi-cluster we could have one Prometheus instance per cluster. This appears relatively straightforward to do. There would be concern that however we split it, as we scale in the future we'll hit future scaling thresholds, where it will be necessary to change how to divide it into shards, so a bit of planning would be needed.

-**High Availability**: The recommended approach would be to run multiple instances of Prometheus configured the same, scraping the same endpoints independently. [Source](https://prometheus-operator.dev/docs/operator/high-availability/#prometheus) There is a `replicas` option to do this. However for HA we would also need to have a load balancer for the PromQL queries to the Prometheus API, to fail-over if the primary is unresponsive. And it's not clear how this works with duplicate alerts being sent to AlertManager. This doesn't feel like a very paved path, with Prometheus Operator [saying](https://prometheus-operator.dev/docs/operator/high-availability/) "We are currently implementing some of the groundwork to make this possible, and figuring out the best approach to do so, but it is definitely on the roadmap!" - Jan 2017, and not updated since.
+**High Availability**: We are now running Prometheus in HA mode [with 3 replicas](https://github.com/ministryofjustice/cloud-platform-terraform-monitoring/pull/239). We keep the earlier findings below, as there may be additional elements of HA to consider in the future:
+
+> The recommended approach would be to run multiple instances of Prometheus configured the same, scraping the same endpoints independently. [Source](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/high-availability.md#prometheus) There is a `replicas` option to do this. However for HA we would also need to have a load balancer for the PromQL queries to the Prometheus API, to fail-over if the primary is unresponsive. And it's not clear how this works with duplicate alerts being sent to AlertManager. This doesn't feel like a very paved path, with Prometheus Operator [saying](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/high-availability.md) "We are currently implementing some of the groundwork to make this possible, and figuring out the best approach to do so, but it is definitely on the roadmap!" - Jan 2017, and not updated since.
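
For context, a minimal sketch of the HA-relevant part of a prometheus-operator `Prometheus` resource, written here as a plain TypeScript object purely for illustration; the real configuration is managed in cloud-platform-terraform-monitoring, and the names and fields beyond `replicas` are assumptions of this example:

```typescript
// Illustrative only - not the actual Cloud Platform configuration.
const prometheusHaSketch = {
  apiVersion: "monitoring.coreos.com/v1",
  kind: "Prometheus",
  metadata: { name: "prometheus", namespace: "monitoring" }, // placeholder names
  spec: {
    // Each replica is configured identically and scrapes the same targets independently,
    // which is the prometheus-operator approach referenced in the quote above.
    replicas: 3,
  },
};

console.log(JSON.stringify(prometheusHaSketch, null, 2));
```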
**Managed Prometheus**: Using a managed service of prometheus, such as AMP, would address most of these concerns, and is evaluated in detail in the next section.

2 changes: 1 addition & 1 deletion runbooks/source/leavers-guide.html.md.erb
@@ -70,7 +70,7 @@ Below are the list of 3rd party accounts that need to be removed when a member l

4. [Pagerduty](https://moj-digital-tools.pagerduty.com/users)

-5. [DockerHub MoJ teams](https://cloud.docker.com/orgs/ministryofjustice/teams)
+5. DockerHub MoJ teams

6. [Pingdom](https://www.pingdom.com)

