Monitoring, Alerts, and Dashboards

Joel Thibault edited this page Jan 11, 2023 · 17 revisions

See also the Workbench Reporting Dataset (WRD), which works similarly.

Overview

The API server periodically records metrics that power Stackdriver dashboards and alerts.

Stackdriver/OpenCensus (TODO: what is the distinction between these?)

Stackdriver metrics come in three kinds: Gauge, Delta, and Cumulative. From the Google Cloud monitoring docs:

A gauge metric, in which the value measures a specific instant in time. For example, metrics measuring CPU utilization are gauge metrics; each point records the CPU utilization at the time of measurement. Another example of a gauge metric is the current temperature.

A delta metric, in which the value measures the change since it was last recorded. For example, metrics measuring request counts are delta metrics; each value records how many requests were received since the last data point was recorded.

A cumulative metric, in which the value constantly increases over time. For example, a metric for “sent bytes” might be cumulative; each value records the total number of bytes sent by a service at that time.
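
To make the three kinds concrete, here is a minimal sketch that is not taken from the Workbench code base. It uses made-up numbers, and the class name MetricKindsSketch and its helper methods are hypothetical. It shows how a delta series and a cumulative series describe the same request counts, while a gauge series is simply a list of point-in-time samples.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of metric kinds; not part of the Workbench API code. */
final class MetricKindsSketch {

  /** Converts per-interval deltas into a cumulative series (a running total). */
  static List<Long> toCumulative(List<Long> deltas) {
    List<Long> cumulative = new ArrayList<>();
    long total = 0;
    for (long delta : deltas) {
      total += delta;
      cumulative.add(total);
    }
    return cumulative;
  }

  /** Recovers per-interval deltas from a cumulative series that starts at zero. */
  static List<Long> toDeltas(List<Long> cumulative) {
    List<Long> deltas = new ArrayList<>();
    long previous = 0;
    for (long total : cumulative) {
      deltas.add(total - previous);
      previous = total;
    }
    return deltas;
  }

  public static void main(String[] args) {
    // Delta: requests received in four consecutive sampling intervals (made-up numbers).
    List<Long> requestsPerInterval = List.of(5L, 3L, 0L, 7L);
    System.out.println("delta:      " + requestsPerInterval);

    // Cumulative: total requests received so far at each sample.
    System.out.println("cumulative: " + toCumulative(requestsPerInterval));

    // Gauge: a value observed at each instant, e.g. open connections (made-up numbers).
    // Gauge points are independent samples, so summing them is not meaningful.
    List<Long> openConnections = List.of(2L, 4L, 1L, 3L);
    System.out.println("gauge:      " + openConnections);
  }
}
```

A cumulative series never decreases, whereas delta and gauge points can go up or down from one sample to the next.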

TODO: reconcile this with our implementation, which has Gauge, Event, and Distribution metrics ...

Ideas for Stackdriver topics:

  • What is a metric?
  • How do we create a Custom Metric?

API Code Structure

Definitions

Metric (AKA Measurement?) - a numerical value that we are measuring. There are a few subtypes of Metric, such as GaugeMetric.

Tag, MetricLabel, Attachment (TODO: distinguish?) - a string label used to categorize and give context to a Metric.

Implementation

A MeasurementBundle consists of Metrics and Tags. It has specific validation rules (see MeasurementBundle.java):

  1. It must contain at least one Metric/Measurement
  2. Every Tag/MetricLabel/Attachment must be supported by all Metrics in the bundle
  3. Some MetricLabels are restricted to a finite set of TagValues

Example: DirectoryServiceImpl.addDomainCountMeasurement()

MeasurementBundle.builder()
  .addMeasurement(GaugeMetric.GSUITE_USER_COUNT, domainUserCount)
  .addTag(MetricLabel.GSUITE_DOMAIN, gSuiteDomain)
  .build();

Because of rule 2, the GaugeMetric GSUITE_USER_COUNT must include MetricLabel.GSUITE_DOMAIN in its allowedAttachments.
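
The three validation rules can be sketched as follows. This is a simplified stand-in written for illustration, not the contents of MeasurementBundle.java: the class MeasurementBundleSketch, its nested enums, the allowedValues field, and the listed status values are all assumptions. Only the metric and label names and the allowedAttachments idea come from this page.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified stand-in for the real MeasurementBundle. */
final class MeasurementBundleSketch {

  /** Stand-in for MetricLabel. An empty allowedValues set means "any value". */
  enum Label {
    GSUITE_DOMAIN(Set.of()),
    // The status values below are invented for this sketch.
    BUFFER_ENTRY_STATUS(Set.of("CREATING", "AVAILABLE", "ERROR"));

    final Set<String> allowedValues;

    Label(Set<String> allowedValues) {
      this.allowedValues = allowedValues;
    }
  }

  /** Stand-in for GaugeMetric, with the labels each metric supports. */
  enum Gauge {
    GSUITE_USER_COUNT(EnumSet.of(Label.GSUITE_DOMAIN)),
    BILLING_BUFFER_PROJECT_COUNT(EnumSet.of(Label.BUFFER_ENTRY_STATUS));

    final Set<Label> allowedAttachments;

    Gauge(Set<Label> allowedAttachments) {
      this.allowedAttachments = allowedAttachments;
    }
  }

  private final Map<Gauge, Number> measurements = new EnumMap<>(Gauge.class);
  private final Map<Label, String> tags = new EnumMap<>(Label.class);

  MeasurementBundleSketch addMeasurement(Gauge metric, Number value) {
    measurements.put(metric, value);
    return this;
  }

  MeasurementBundleSketch addTag(Label label, String value) {
    tags.put(label, value);
    return this;
  }

  /** Applies the three rules and returns this bundle if all of them pass. */
  MeasurementBundleSketch build() {
    // Rule 1: at least one measurement.
    if (measurements.isEmpty()) {
      throw new IllegalStateException("Bundle must contain at least one measurement");
    }
    for (Map.Entry<Label, String> tag : tags.entrySet()) {
      Label label = tag.getKey();
      // Rule 2: every tag must be supported by every metric in the bundle.
      for (Gauge metric : measurements.keySet()) {
        if (!metric.allowedAttachments.contains(label)) {
          throw new IllegalStateException(metric + " does not support label " + label);
        }
      }
      // Rule 3: restricted labels only accept values from their finite set.
      if (!label.allowedValues.isEmpty() && !label.allowedValues.contains(tag.getValue())) {
        throw new IllegalStateException(tag.getValue() + " is not allowed for " + label);
      }
    }
    return this;
  }

  public static void main(String[] args) {
    new MeasurementBundleSketch()
        .addMeasurement(Gauge.GSUITE_USER_COUNT, 42)
        .addTag(Label.GSUITE_DOMAIN, "example.org")
        .build();
    System.out.println("valid bundle accepted");
  }
}
```

In this sketch, attaching Label.BUFFER_ENTRY_STATUS to a bundle that contains Gauge.GSUITE_USER_COUNT fails at build() because of rule 2.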

Dashboard naming coordination

Metrics and Labels have well-defined names that must match the names in the Terraform dashboard config. If you change the structure of a metric, also increment the numeric suffix in its name, e.g. from workspace_count_2 to workspace_count_3.

Example:

  BILLING_BUFFER_PROJECT_COUNT(
      "billing_buffer_project_count",
      "Number of projects in the billing buffer for each status",
      ImmutableList.of(MetricLabel.BUFFER_ENTRY_STATUS, MetricLabel.ACCESS_TIER_SHORT_NAMES)),

This matches the billing_buffer_project_count entry in custom_gauge_metrics.json.
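
The renaming convention for structural changes (workspace_count_2 to workspace_count_3) can be sketched as a small helper. The class MetricNameVersionSketch and the method nextName are hypothetical and do not exist in the code base. The choice to give an unversioned name the suffix _2 is also an assumption made for this sketch, not a documented rule.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper illustrating the metric name versioning convention. */
final class MetricNameVersionSketch {

  // A base name followed by an underscore and a numeric version, e.g. "workspace_count_2".
  private static final Pattern VERSIONED = Pattern.compile("^(.*)_(\\d+)$");

  /** Returns the name to use after a structural change to the metric. */
  static String nextName(String currentName) {
    Matcher matcher = VERSIONED.matcher(currentName);
    if (matcher.matches()) {
      int nextVersion = Integer.parseInt(matcher.group(2)) + 1;
      return matcher.group(1) + "_" + nextVersion;
    }
    // Assumption: a name without a numeric suffix is treated as version 1.
    return currentName + "_2";
  }

  public static void main(String[] args) {
    System.out.println(nextName("workspace_count_2")); // prints workspace_count_3
  }
}
```

Whatever name is chosen, the same string has to appear in both the Java metric definition and the Terraform dashboard config.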

Development

Making a monitoring change may involve interacting with two logical components:

  1. The API server code - generally must be touched for all monitoring changes
  2. Stackdriver configuration, managed via Terraform - must be modified if there are any structural changes to the metrics

Note that modifying Terraform typically requires two PRs: one to update the module definitions, and another to pull in the new modules (and apply them) in all environments.

Example PRs:

Testing

  • If you need to update a dashboard or an alert, apply your changes to a local dashboard like "Custom Gauge Metrics [local]"
  • Get a local API server running
    • Make any necessary changes to the reporting server code
    • Run the local API server: ./project.rb dev-up
    • Ensure you have data locally - the easiest way to create this is by connecting a local UI and creating users/workspaces
  • The monitoring cron will run locally automatically
  • Wait a bit (~5-10 min) while continuing to run the local API, then verify that the new Dashboard is updated

Deployment

  • Send PRs for the server code change
  • If Terraform changes are needed, follow the process for applying changes in Terraform and apply this to all environments
    • TODO (check whether this is necessary): File a PD ticket for security approval if Preprod/Prod is affected by this change
  • Merge the server changes