This plugin allows you to check certain AWS Cloudwatch metrics and set alerts on certain values.
The script is written in bash. It is tested on OSX and Ubuntu 16.04.
This plugin fetches the data from X minutes back until now.
To run this script you should have installed the following packages:
- jq - json processor
- awscli - AWS command line interface
- bc - used for working with floating point numbers
We assume that the user who execute this script has configured his account so that he/she can connect to Amazon.
If not, please do this first. See here for more info: http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html
See the help message:
-h or --help Show this message
-v or --verbose Optional: Show verbose output
--profile=x Optional: Which AWS profile should be used to connect to aws?
--namespace=x Required: Enter the AWS namespace where you want to check your metrics for. The "AWS/" prefix can be
left out, this is the default namespace prefix. See below. Example: "CloudFront", "EC2" or "Firehose".
More information: http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/aws-namespaces.html
--namespace-prefix=x The prefix for the namespace which should be used. By default this is "AWS". If you do not want to use this prefix
you should pass this parameter with an empty value.
Default: AWS
--mins=x Required: Supply the minutes (time window) of which you want to check the AWS metrics. We will fetch the data
between NOW-%mins and NOW.
--region=x Required: Enter the AWS region which we need to use. For example: "eu-west-1"
--metric=x Required: The metric name which you want to check. For example "IncomingBytes"
--timeout=x Optional: Specify the max duration in seconds of this script.
When the timeout is reached, we will return a UNKNOWN alert status.
--statistics=x Required: The statistics which you want to fetch.
Possible values: Sum, Average, Maximum, Minimum, SampleCount
Default: Average
--dimensions=x Required: The dimensions which you want to fetch.
Examples:
Name=DBInstanceIdentifier,Value=i-1235534
Name=DeliveryStreamName,Value=MyStream
See also: http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html#dimension-combinations
--warning=x:x Required: The warning threshold. You can supply min:max or just max value. Use the format: [@]min:max
When no minimal value is given, a default min value of 0 is used.
By default we will raise a warning alert when the value is outside the given range. You can start the range
with an @ sign to change this logic. We then will alert when the value is inside the range.
See below for some examples.
--critical=x:x Required: The critical threshold. You can supply min:max or just max value. Use the format: [@]min:max
When no minimal value is given, a default min value of 0 is used.
By default we will raise a critical alert when the value is outside the given range. You can start the range
with an @ sign to change this logic. We then will alert when the value is inside the range.
See below for some examples.
--default="x" When no data points are returned, it could be because there is no data. By default this script will return
the nagios state UNKNOWN. You could also supply a default value here (like 0). In that case we will work
with that value when no data points are returned.
--http_proxy="x" When you use a proxy to connect to the AWS Cli, you can use this option. See for more information
this link: http://docs.aws.amazon.com/cli/latest/userguide/cli-http-proxy.html
--https_proxy="x" When you use a proxy to connect to the AWS Cli, you can use this option. See for more information
this link: http://docs.aws.amazon.com/cli/latest/userguide/cli-http-proxy.html
--last-known When given, we will fetch the last known values up to 20 minutes ago. Cloudwatch metrics are not always up to date.
By specifying this option we will walk back in 1 minute steps when no data is known for max 20 minutes.
Example threshold values:
--critical=10
We will raise an alert when the value is < 0 or > 10
--critical=5:10
We will raise an alert when the value is < 5 or > 10
--critical=@5:10
We will raise an alert when the value is >= 5 and <= 10
--critical=~:10
We will raise an alert when the value is > 10 (there is no lower limit)
--critical=10:~
We will raise an alert when the value is < 10 (there is no upper limit)
--critical=10:
(Same as above) We will raise an alert when the value is < 10 (there is no upper limit)
--critical=@1:~
Alert when the value is >= 1. Zero is OK.
--critical=@~:0
Alert when the value is <= 0. So 0.1 or higher is okay.
See for more info: https://www.monitoring-plugins.org/doc/guidelines.html#THRESHOLDFORMAT
This plugin uses the AWS Command Line Interface to retrieve the metrics data from Amazon. To make this plugin work you should make sure that the user who execute's this plugin can use the Amazon CLI.
The AWS CLI will automatically search for your credentials (access key id and secret access key) in a few places. See also here: http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html#config-settings-and-precedence
I would suggest that you add the credentials in a file like ~/.aws/credentials
, where ~
is the home directory
of the user who will execute the plugin. This will likely be the nagios user, so then the file will be
~nagios/.aws/credentials
.
If you run nagios on an EC2 machine you can also apply a IAM role to the machine with the correct security rights.
To make use of this plugin, you should checkout this script in your nagios plugins directory.
cd /usr/lib/nagios/plugins/
git clone https://github.com/level23/nagios-cloudwatch-metrics.git
Then you should define a command to your nagios configuration. Some example commands:
#
# Generic cloudwatch_check
# $ARG1$: Namespace (i.e., ELB, EC2, RDS, etc.)
# $ARG2$: Metric
# $ARG3$: Dimension (i.e., InstanceId)
# $ARG4$: Dimension Value (i.e., i-1a2b3c4d)
# $ARG5$: Warning Level
# $ARG6$: Critical Level
# $ARG7$: Time Interval
# $ARG8$: Default (0 if null)
define command {
command_name check_aws
command_line $USER1$/nagios-cloudwatch-metrics/check_cloudwatch.sh --region=us-east-1 --namespace="$ARG1$" --metric="$ARG2$" --statistics="Average" --mins=$ARG7$ --dimensions="Name=$ARG3$,Value=$ARG4$" --warning=$ARG5$ --critical=$ARG6$ --default=$ARG8$
}
#
# Check check_aws_firehose
# $ARG1$: Metric, for example: IncomingBytes
# $ARG2$: DeliveryStreamName
# $ARG3$: Warning value
# $ARG4$: Critical value
define command {
command_name check_aws_firehose
command_line $USER1$/nagios-cloudwatch-metrics/check_cloudwatch.sh --region=eu-west-1 --namespace="Firehose" --metric="$ARG1$" --statistics="Average" --mins=15 --dimensions="Name=DeliveryStreamName,Value=$ARG2$" --warning=$ARG3$ --critical=$ARG4$
}
#
# Check check_aws_lambda
# $ARG1$: Metric, for example: Duration
# $ARG2$: FunctionName
# $ARG3$: Warning value
# $ARG4$: Critical value
define command {
command_name check_aws_lambda
command_line $USER1$/nagios-cloudwatch-metrics/check_cloudwatch.sh --region=eu-west-1 --namespace="Lambda" --metric="$ARG1$" --statistics="Average" --mins=15 --dimensions="Name=FunctionName,Value=$ARG2$" --warning=$ARG3$ --critical=$ARG4$
}
#
# Check check_aws_sqs
# $ARG1$: Metric, for example: NumberOfMessagesReceived
# $ARG2$: QueueName
# $ARG3$: Warning value
# $ARG4$: Critical value
define command {
command_name check_aws_sqs
command_line $USER1$/nagios-cloudwatch-metrics/check_cloudwatch.sh --region=eu-west-1 --namespace="SQS" --metric="$ARG1$" --statistics="Sum" --mins=15 --dimensions="Name=QueueName,Value=$ARG2$" --warning=$ARG3$ --critical=$ARG4$
}
#
# Check check_aws_sns
# $ARG1$: Metric, for example: NumberOfNotificationsFailed
# $ARG2$: TopicName
# $ARG3$: Warning value
# $ARG4$: Critical value
define command {
command_name check_aws_sns
command_line $USER1$/nagios-cloudwatch-metrics/check_cloudwatch.sh --region=eu-west-1 --namespace="SNS" --metric="$ARG1$" --statistics="Sum" --mins=15 --dimensions="Name=TopicName,Value=$ARG2$" --warning=$ARG3$ --critical=$ARG4$
}
#
# Check check_aws_elb
# $ARG1$: Metric, for example: UnHealthyHostCount or HTTPCode_ELB_5XX
# $ARG2$: LoadBalancerName
# $ARG3$: Warning value
# $ARG4$: Critical value
define command {
command_name check_aws_elb
command_line $USER1$/nagios-cloudwatch-metrics/check_cloudwatch.sh --region=eu-west-1 --namespace="ELB" --metric="$ARG1$" --statistics="Maximum" --mins=1 --dimensions="Name=LoadBalancerName,Value=$ARG2$" --warning=$ARG3$ --critical=$ARG4$
}
#
# Check check_aws_elasticache
# $ARG1$: Metric, for example: CPUUtilization
# $ARG2$: CacheClusterId
# $ARG3$: Warning
# $ARG4$: Critical value
define command {
command_name check_aws_elasticache
command_line $USER1$/nagios-cloudwatch-metrics/check_cloudwatch.sh --region=eu-west-1 --namespace="ElastiCache" --metric="$ARG1$" --statistics="Average" --mins=15 --dimensions="Name=CacheClusterId,Value=$ARG2$" --warning=$ARG3$ --critical=$ARG4$
}
#
# Check check_aws_cloudfront
# $ARG1$: Metric, for example: 4xxErrorRate
# $ARG2$: DistributionId
# $ARG3$: Warning
# $ARG4$: Critical value
define command {
command_name check_aws_cloudfront
command_line $USER1$/nagios-cloudwatch-metrics/check_cloudwatch.sh --region=us-east-1 --namespace="CloudFront" --metric="$ARG1$" --statistics="Average" --mins=15 --dimensions="Name=DistributionId,Value=$ARG2$ Name=Region,Value=Global" --warning=$ARG3$ --critical=$ARG4$
}
#
# Check check_aws_rds over last 5 minutes
# $ARG1$: Metric, for example: CPUUtilization
# $ARG2$: ClusterId
# $ARG2$: READER/WRITER
# $ARG3$: Warning
# $ARG4$: Critical value
define command {
command_name check_aws_rds
command_line $USER1$/nagios-cloudwatch-metrics/check_cloudwatch.sh --region=eu-west-1 --namespace="RDS" --metric="$ARG1$" --statistics="Average" --mins=5 --dimensions="Name=DBClusterIdentifier,Value=$ARG2$ Name=Role,Value=$ARG3$" --warning=$ARG4$ --critical=$ARG5$
}
# Defined in HOST configuration are:
# $HOSTALIAS$ = InstanceId
# $HOSTNOTES$ = region
#
# Check check_aws_ec2
# $ARG1$: Metric, for example CPUUtilization
# $ARG2$: Data statistics. Possible values: Maximum, Minimum, Sum, Average
# $ARG3$: Minutes timewindow
# $ARG4$: Warning value
# $ARG5$: Critical value
define command {
command_name check_aws_ec2
command_line $USER1$/nagios-cloudwatch-metrics/check_cloudwatch.sh --timeout=30 --region="$HOSTNOTES$" --namespace="EC2" --metric="$ARG1$" --statistics="$ARG2$" --mins="$ARG3$" --dimensions="Name=InstanceId,Value=$HOSTALIAS$" --warning=$ARG4$ --critical=$ARG5$
}
In these examples we have hard-coded defined our region and the X minutes time window.
Then, you can configure your nagios services like this:
#
# We assume that there is at least an average of 100 bytes per minute for myStream. If lower, then a warning.
# If lower than 50 Bytes, then it's critical and we should receive an SMS!
#
define service {
use generic-service
hostgroup_name cloudwatch
service_description Firehose: Incoming Bytes for myStream
max_check_attempts 2
normal_check_interval 5
retry_check_interval 5
contact_groups group_sms
notification_interval 30
check_command check_aws_firehose!IncomingBytes!myStream!100!50
}
#
# We assume that myFunction does not run longer than 60000 ms (60s). If so, trigger a warning.
# If it runs longer than 120000 ms (120s), trigger an critical notification.
#
define service {
use generic-service
hostgroup_name cloudwatch
service_description Lambda: duration of myFunction
max_check_attempts 2
normal_check_interval 5
retry_check_interval 5
contact_groups group_sms
notification_interval 30
check_command check_aws_lambda!Duration!myFunction!0:60000!0:120000
}
# etc.