Monitoring, Alerts, and Dashboards
See also the Workbench Reporting Dataset (WRD), which works similarly.
The API server periodically records various metrics which power Stackdriver dashboards and alerts.
Stackdriver metrics come in three kinds: Gauge, Delta, and Cumulative. From the Google Cloud Monitoring docs:
A gauge metric, in which the value measures a specific instant in time. For example, metrics measuring CPU utilization are gauge metrics; each point records the CPU utilization at the time of measurement. Another example of a gauge metric is the current temperature.
A delta metric, in which the value measures the change since it was last recorded. For example, metrics measuring request counts are delta metrics; each value records how many requests were received since the last data point was recorded.
A cumulative metric, in which the value constantly increases over time. For example, a metric for “sent bytes” might be cumulative; each value records the total number of bytes sent by a service at that time.
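To make the relationship between the kinds concrete, here is a minimal sketch that converts a cumulative series into the equivalent delta series and back. The class and method names are hypothetical and exist only for illustration; nothing here is taken from the API server or the Stackdriver client libraries.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: these names are hypothetical, not part of the API server.
final class MetricKindsSketch {

  // Converts cumulative samples (running totals) into delta samples
  // (change since the previous sample). The first sample has no
  // predecessor, so it produces no delta.
  static List<Long> cumulativeToDeltas(List<Long> cumulative) {
    List<Long> deltas = new ArrayList<>();
    for (int i = 1; i < cumulative.size(); i++) {
      deltas.add(cumulative.get(i) - cumulative.get(i - 1));
    }
    return deltas;
  }

  // Rebuilds running totals from deltas, starting at the given initial total.
  static List<Long> deltasToCumulative(long initialTotal, List<Long> deltas) {
    List<Long> cumulative = new ArrayList<>();
    long total = initialTotal;
    cumulative.add(total);
    for (long delta : deltas) {
      total += delta;
      cumulative.add(total);
    }
    return cumulative;
  }

  public static void main(String[] args) {
    // "Sent bytes" as a cumulative metric: each sample is the total so far.
    List<Long> sentBytesTotal = List.of(100L, 250L, 250L, 400L);
    // The same series as a delta metric: bytes sent since the last sample.
    System.out.println(cumulativeToDeltas(sentBytesTotal)); // [150, 0, 150]
    // A gauge, by contrast, is simply the value at the instant of
    // measurement (e.g. a current user count); it is neither summed
    // nor differenced.
  }
}
```

A gauge needs no conversion at all, which is one reason it is the simplest kind to reason about when reading a dashboard.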
TODO: but wait, we implement Gauge, Event, and Distribution ...
ideas for Stackdriver topics:
- What is a metric
- How do we create a Custom Metric
Metric (AKA Measurement?) - a numerical value representing something we are measuring. There are a few subtypes of Metric, such as GaugeMetric.
Tag, MetricLabel, Attachment (TODO: distinguish?) - a string label used to categorize and give context to a Metric.
A MeasurementBundle consists of Metrics and Tags. It has specific validation rules (see MeasurementBundle.java):
- It must contain at least one Metric/Measurement
- Every Tag/MetricLabel/Attachment must be supported by all Metrics in the bundle
- Some MetricLabels are restricted to a finite set of TagValues
Example: DirectoryServiceImpl.addDomainCountMeasurement()
MeasurementBundle.builder()
    .addMeasurement(GaugeMetric.GSUITE_USER_COUNT, domainUserCount)
    .addTag(MetricLabel.GSUITE_DOMAIN, gSuiteDomain)
    .build();
Therefore the GaugeMetric GSUITE_USER_COUNT must include MetricLabel.GSUITE_DOMAIN in its allowedAttachments.
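The validation rules above can be sketched roughly as follows. This is a simplified, hypothetical model and not the real MeasurementBundle.java: the `Label` and `Metric` enums, the `validate` method, and the placeholder tag values are all assumptions made for illustration. Only the metric and label names are borrowed from this page.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the rules above. This is NOT the real
// MeasurementBundle.java; all types here are stand-ins for illustration.
final class BundleValidationSketch {

  enum Label {
    GSUITE_DOMAIN(Set.of()), // empty set: any value accepted (assumption)
    BUFFER_ENTRY_STATUS(Set.of("STATUS_A", "STATUS_B")); // placeholder values

    final Set<String> allowedValues;

    Label(Set<String> allowedValues) {
      this.allowedValues = allowedValues;
    }
  }

  enum Metric {
    GSUITE_USER_COUNT(Set.of(Label.GSUITE_DOMAIN)),
    BILLING_BUFFER_PROJECT_COUNT(Set.of(Label.BUFFER_ENTRY_STATUS));

    final Set<Label> allowedAttachments;

    Metric(Set<Label> allowedAttachments) {
      this.allowedAttachments = allowedAttachments;
    }
  }

  // Throws IllegalArgumentException if the bundle breaks any of the three rules.
  static void validate(Map<Metric, ? extends Number> measurements, Map<Label, String> tags) {
    // Rule 1: at least one measurement.
    if (measurements.isEmpty()) {
      throw new IllegalArgumentException("bundle has no measurements");
    }
    for (Map.Entry<Label, String> tag : tags.entrySet()) {
      Label label = tag.getKey();
      // Rule 2: every tag must be supported by every metric in the bundle.
      for (Metric metric : measurements.keySet()) {
        if (!metric.allowedAttachments.contains(label)) {
          throw new IllegalArgumentException(metric + " does not allow " + label);
        }
      }
      // Rule 3: some labels accept only a finite set of values.
      if (!label.allowedValues.isEmpty() && !label.allowedValues.contains(tag.getValue())) {
        throw new IllegalArgumentException(tag.getValue() + " is not valid for " + label);
      }
    }
  }

  public static void main(String[] args) {
    // Mirrors the example above: a user count tagged with its domain.
    validate(Map.of(Metric.GSUITE_USER_COUNT, 42), Map.of(Label.GSUITE_DOMAIN, "example.org"));
    System.out.println("bundle accepted");

    try {
      // GSUITE_USER_COUNT does not list BUFFER_ENTRY_STATUS, so this is rejected.
      validate(Map.of(Metric.GSUITE_USER_COUNT, 42), Map.of(Label.BUFFER_ENTRY_STATUS, "STATUS_A"));
    } catch (IllegalArgumentException e) {
      System.out.println("bundle rejected: " + e.getMessage());
    }
  }
}
```

The real class may enforce these rules at build time or in a different order; consult MeasurementBundle.java for the authoritative behavior.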
Metrics and Labels have well-defined names, which must match the names in the Terraform Dashboard config. If you change the structure of a metric, you should also increment the version suffix in its name, e.g. from workspace_count_2 to workspace_count_3.
Example:
BILLING_BUFFER_PROJECT_COUNT(
"billing_buffer_project_count",
"Number of projects in the billing buffer for each status",
ImmutableList.of(MetricLabel.BUFFER_ENTRY_STATUS, MetricLabel.ACCESS_TIER_SHORT_NAMES)),
This matches a billing_buffer_project_count entry in custom_gauge_metrics.json.
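As a rough illustration of the naming convention, the sketch below bumps a metric name's numeric suffix and reports metric names that have no counterpart in the dashboard config. Both helpers are hypothetical. The sketch assumes the config's metric names have already been read into a set, since the layout of custom_gauge_metrics.json is not described here, and it assumes that a name without a suffix counts as version 1.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helpers for illustration; not part of the API server.
final class MetricNameSketch {

  // A name ending in "_<digits>" is treated as versioned.
  private static final Pattern VERSIONED = Pattern.compile("^(.*)_(\\d+)$");

  // Returns the name to use after a structural change to the metric.
  static String nextVersion(String metricName) {
    Matcher m = VERSIONED.matcher(metricName);
    if (m.matches()) {
      int next = Integer.parseInt(m.group(2)) + 1;
      return m.group(1) + "_" + next;
    }
    // Assumption: a name without a suffix is implicitly version 1.
    return metricName + "_2";
  }

  // Returns the metric names defined in code that are absent from the config,
  // sorted so that the result is stable.
  static Set<String> missingFromConfig(Set<String> codeNames, Set<String> configNames) {
    Set<String> missing = new TreeSet<>(codeNames);
    missing.removeAll(configNames);
    return missing;
  }

  public static void main(String[] args) {
    System.out.println(nextVersion("workspace_count_2")); // workspace_count_3

    Set<String> inCode = Set.of("billing_buffer_project_count", "workspace_count_3");
    Set<String> inConfig = Set.of("billing_buffer_project_count", "workspace_count_2");
    // The renamed metric has not been added to the config yet.
    System.out.println(missingFromConfig(inCode, inConfig)); // [workspace_count_3]
  }
}
```

A check along these lines could run as a unit test, so that a renamed metric without a matching Terraform change is caught before review.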
Making a monitoring change may involve interacting with two logical components:
- The API server code - generally must be touched for all monitoring changes
- Stackdriver configuration, managed via Terraform - must be modified if there are any structural changes to the metrics
Note that modifying Terraform typically requires two PRs: one to update the module definitions, and another to pull in the new modules (and apply them) in all environments.
Example PRs:
- RW-6137 Update billing buffer, user, and workspace count metrics for multi-tier: API server, Terraform modules, Terraform version bump (TODO)
- If you need to update a dashboard or an alert, apply your changes to a local dashboard like "Custom Gauge Metrics [local]"
- TODO: overview of Gauge vs. not-Gauge Metrics
- Make any structural changes in the monitoring Terraform modules
- Push up a branch with these changes
- Locally in workbench-devops, temporarily change the module reference to point to your terraform-modules branch
- e.g. change ?ref=v0.1.4 -> ?ref=my/branch-123
- Apply the terraform change to the local environment (instructions)
- The Custom Gauge Metrics [local] dashboard in the test environment should now reflect your changes
- Get a local API server running
- Make any necessary changes to the reporting server code
- Run the local API server:
./project.rb dev-up
- Ensure you have data locally - the easiest way to create this is by connecting a local UI and creating users/workspaces
- The monitoring cron will run locally automatically
- Wait a bit (~5-10 min) while continuing to run the local API server, then verify that the local dashboard shows the new data
- Send PRs for the server code change
- If Terraform changes are needed, follow the process for applying changes in Terraform and apply this to all environments
- File a PD ticket for security approval if Preprod/Prod is affected by this change (TODO: check if this is necessary)
- Merge the server changes