docs: ✏️ fix broken links, some updates to out-of-date info
Relates to [link-checker-report](#6429)
sj-williams committed Nov 11, 2024
1 parent 7ffb207 commit 8faf919
Showing 4 changed files with 13 additions and 8 deletions.
6 changes: 3 additions & 3 deletions architecture-decision-record/022-EKS.md
@@ -1,6 +1,6 @@
# EKS

-Date: 02/05/2021
+Date: 11/11/2024

## Status

@@ -32,7 +32,7 @@ We already run the Manager cluster on EKS, and have gained a lot of insight and

Developers in service teams need to use the k8s auth, and GitHub continues to be the most common SSO amongst them with good tie-in to JML processes - see [ADR 6 Use GitHub as our identity provider](006-Use-github-as-user-directory.md)

-Auth0 is useful as a broker, for a couple of important [rules that it runs at login time](https://github.com/ministryofjustice/cloud-platform-infrastructure/tree/main/terraform/global-resources/resources/auth0-rules):
+Auth0 is useful as a broker, for a couple of important [rules that it runs at login time](https://github.com/ministryofjustice/cloud-platform-terraform-global-resources-auth0):

* it ensures that the user is in the ministryofjustice GitHub organization, so only staff can get a kubeconfig and login to CP websites like Grafana
* it inserts the user's GitHub teams into the OIDC ID token as claims. These are used by k8s RBAC to authorize the user for the correct namespaces
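
For illustration, a minimal sketch of what a login rule of this kind could look like, using Auth0's legacy rules signature (`function (user, context, callback)`). The function name, claim namespace, and the shape of the GitHub profile data are assumptions for the example, not taken from the cloud-platform-terraform-global-resources-auth0 repository:

```typescript
// Hypothetical sketch only - the real rules live in cloud-platform-terraform-global-resources-auth0.
function addGithubTeamsClaim(user: any, context: any, callback: any) {
  const requiredOrg = "ministryofjustice";
  const claimNamespace = "https://cloud-platform.example/"; // placeholder namespace for custom claims

  // The shape of the GitHub profile data here is an assumption for the example.
  const orgs: string[] = (user.organizations || []).map((o: any) => o.login);
  const teams: string[] = user.github_teams || [];

  if (!orgs.includes(requiredOrg)) {
    // Only members of the ministryofjustice org should get a kubeconfig or reach CP websites
    return callback(new Error(`user is not a member of ${requiredOrg}`));
  }

  // Insert the user's GitHub teams into the OIDC ID token as a custom claim;
  // k8s RBAC then maps these team names to the correct namespaces.
  context.idToken[claimNamespace + "github_teams"] = teams;
  return callback(null, user, context);
}
```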
@@ -157,7 +157,7 @@ Advantages of AWS's CNI:
* it is the default with EKS, native to AWS, is fully supported by AWS - low management overhead
* offers good network performance

-The concern with AWS's CNI would be that it uses an IP address for every pod, and there is a [limit per node](https://github.com/awslabs/amazon-eks-ami/blob/master/files/eni-max-pods.txt), depending on the EC2 instance type and the number of ENIs it supports. The calculations in [Node Instance Types](#node-instance-types) show that with a change of instance type, the cost of the cluster increases by 17% or $8k, which is acceptable - likely less than the engineering cost of maintaining and supporting full Calico networking and custom node image.
+The concern with AWS's CNI would be that it uses an IP address for every pod, and there is a [limit per node](https://github.com/awslabs/amazon-eks-ami/blob/main/nodeadm/internal/kubelet/eni-max-pods.txt), depending on the EC2 instance type and the number of ENIs it supports. The calculations in [Node Instance Types](#node-instance-types) show that with a change of instance type, the cost of the cluster increases by 17% or $8k, which is acceptable - likely less than the engineering cost of maintaining and supporting full Calico networking and custom node image.
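
As a rough illustration of where the per-node limit in that file comes from, the commonly cited AWS VPC CNI formula is `maxPods = ENIs * (IPv4 addresses per ENI - 1) + 2`; the sketch below assumes that formula, and the instance figures are examples to be checked against the linked file:

```typescript
// Sketch of the calculation behind eni-max-pods.txt (formula as documented for the AWS VPC CNI):
//   maxPods = ENIs * (IPv4 addresses per ENI - 1) + 2
function maxPodsPerNode(enis: number, ipv4PerEni: number): number {
  return enis * (ipv4PerEni - 1) + 2;
}

// Example figures (verify against the linked file for authoritative numbers):
console.log(maxPodsPerNode(4, 15)); // m5.xlarge  -> 58 pods per node
console.log(maxPodsPerNode(8, 30)); // m5.4xlarge -> 234 pods per node
```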

The alternative considered was [Calico networking](https://docs.projectcalico.org/getting-started/kubernetes/managed-public-cloud/eks#install-eks-with-calico-networking). This has the advantage of not needing an IP address per pod, and associated instance limit. And it is open source. However:

7 changes: 5 additions & 2 deletions architecture-decision-record/023-Logging.md
@@ -1,14 +1,17 @@
# 23 Logging

-Date: 02/06/2021
+Date: 11/11/2024

## Status

✅ Accepted

## Context

-Cloud Platform's existing strategy for logs has been to **centralize** them in an ElasticSearch instance (Saas hosted by AWS OpenSearch). This allows [service teams](https://user-guide.cloud-platform.service.justice.gov.uk/documentation/logging-an-app/access-logs.html#accessing-application-log-data) and Cloud Platform team to use Kibana's search and browse functionality, for the purpose of debug and resolving incidents. All pods' stdout get [shipped using Fluentbit](https://user-guide.cloud-platform.service.justice.gov.uk/documentation/logging-an-app/log-collection-and-storage.html#application-log-collection-and-storage) and ElasticSearch stored them for 30 days.
+> Cloud Platform's existing strategy for logs has been to **centralize** them in an ElasticSearch instance (SaaS hosted by AWS OpenSearch).
+As of November 2024, we have migrated the logging service over to AWS OpenSearch, with ElasticSearch due for retirement (pending some decisions and actions on how to manage existing data retention on that cluster).
+Service teams can use OpenSearch's [search and browse functionality](https://app-logs.cloud-platform.service.justice.gov.uk/_dashboards/app/home#/) for the purposes of debugging and resolving incidents. All pods' stdout gets [shipped using Fluentbit](https://user-guide.cloud-platform.service.justice.gov.uk/documentation/logging-an-app/log-collection-and-storage.html#application-log-collection-and-storage) and ElasticSearch stores it for 30 days.

Concerns with existing ElasticSearch logging:

6 changes: 4 additions & 2 deletions architecture-decision-record/026-Managed-Prometheus.md
@@ -1,6 +1,6 @@
# 26 Managed Prometheus

-Date: 2021-10-08
+Date: 2024-11-11

## Status

@@ -67,7 +67,9 @@ We also need to address:

**Sharding**: We could split/shard the Prometheus instance: perhaps dividing into two - tenants and platform. Or if we did multi-cluster we could have one Prometheus instance per cluster. This appears relatively straightforward to do. There would be concern that however we split it, as we scale in the future we'll hit future scaling thresholds, where it will be necessary to change how to divide it into shards, so a bit of planning would be needed.

-**High Availability**: The recommended approach would be to run multiple instances of Prometheus configured the same, scraping the same endpoints independently. [Source](https://prometheus-operator.dev/docs/operator/high-availability/#prometheus) There is a `replicas` option to do this. However for HA we would also need to have a load balancer for the PromQL queries to the Prometheus API, to fail-over if the primary is unresponsive. And it's not clear how this works with duplicate alerts being sent to AlertManager. This doesn't feel like a very paved path, with Prometheus Operator [saying](https://prometheus-operator.dev/docs/operator/high-availability/) "We are currently implementing some of the groundwork to make this possible, and figuring out the best approach to do so, but it is definitely on the roadmap!" - Jan 2017, and not updated since.
+**High Availability**: We are now running Prometheus in HA mode [with 3 replicas](https://github.com/ministryofjustice/cloud-platform-terraform-monitoring/pull/239). We keep the earlier findings below, as there may be additional elements of HA to consider in the future:
+
+> The recommended approach would be to run multiple instances of Prometheus configured the same, scraping the same endpoints independently. [Source](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/high-availability.md#prometheus) There is a `replicas` option to do this. However for HA we would also need to have a load balancer for the PromQL queries to the Prometheus API, to fail-over if the primary is unresponsive. And it's not clear how this works with duplicate alerts being sent to AlertManager. This doesn't feel like a very paved path, with Prometheus Operator [saying](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/high-availability.md) "We are currently implementing some of the groundwork to make this possible, and figuring out the best approach to do so, but it is definitely on the roadmap!" - Jan 2017, and not updated since.
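
For context, a minimal sketch of the HA-relevant part of a prometheus-operator `Prometheus` resource, written here as a plain TypeScript object purely for illustration; the real configuration is managed in cloud-platform-terraform-monitoring, and the names and fields beyond `replicas` are assumptions of this example:

```typescript
// Illustrative only - not the actual Cloud Platform configuration.
const prometheusHaSketch = {
  apiVersion: "monitoring.coreos.com/v1",
  kind: "Prometheus",
  metadata: { name: "prometheus", namespace: "monitoring" }, // placeholder names
  spec: {
    // Each replica is configured identically and scrapes the same targets independently,
    // which is the prometheus-operator approach referenced in the quote above.
    replicas: 3,
  },
};

console.log(JSON.stringify(prometheusHaSketch, null, 2));
```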
**Managed Prometheus**: Using a managed service of prometheus, such as AMP, would address most of these concerns, and is evaluated in detail in the next section.

2 changes: 1 addition & 1 deletion runbooks/source/leavers-guide.html.md.erb
@@ -70,7 +70,7 @@ Below are the list of 3rd party accounts that need to be removed when a member l

4. [Pagerduty](https://moj-digital-tools.pagerduty.com/users)

-5. [DockerHub MoJ teams](https://cloud.docker.com/orgs/ministryofjustice/teams)
+5. DockerHub MoJ teams

6. [Pingdom](https://www.pingdom.com)

