Monitor Type: collectd/cpu
(Source)
Accepts Endpoints: No
Multiple Instances Allowed: No
This monitor collects cpu usage data using the
collectd cpu
plugin. It aggregates the per-core CPU data into a single
metric and sends it to the SignalFx Metadata plugin in collectd, where the
raw jiffy counts from the cpu
plugin are converted to percent utilization
(the cpu.utilization
metric).
See https://collectd.org/wiki/index.php/Plugin:CPU
To activate this monitor in the Smart Agent, add the following to your agent config:
monitors: # All monitor config goes under this key
- type: collectd/cpu
... # Additional config
For a list of monitor options that are common to all monitors, see Common Configuration.
This monitor has no configuration options.
These are the metrics available for this monitor. Metrics that are categorized as container/host (default) are in bold and italics in the list below.
-
cpu.idle
(cumulative)
CPU time spent not in any other state. In order to get a percentage this value must be compared against the sum of all CPU states. -
cpu.interrupt
(cumulative)
CPU time spent while servicing hardware interrupts. A hardware interrupt happens at the physical layer. When this occurs, the CPU will stop whatever else it is doing and service the interrupt. This metric measures how many jiffies were spent handling these interrupts. In order to get a percentage this value must be compared against the sum of all CPU states. A sustained high value for this metric may be caused by faulty hardware such as a broken peripheral. -
cpu.nice
(cumulative)
CPU time spent in userspace running 'nice'-ed processes. In order to get a percentage this value must be compared against the sum of all CPU states. A sustained high value for this metric may be caused by: 1) The server not having enough CPU capacity for a process, 2) A programming error which causes a process to use an unexpected amount of CPU -
cpu.softirq
(cumulative)
CPU time spent while servicing software interrupts. Unlike a hardware interrupt, a software interrupt happens at the sofware layer. Usually it is a userspace program requesting a service of the kernel. This metric measures how many jiffies were spent by the CPU handling these interrupts. In order to get a percentage this value must be compared against the sum of all CPU states. A sustained high value for this metric may be caused by a programming error which causes a process to unexpectedly request too many services from the kernel. -
cpu.steal
(cumulative)
CPU time spent waiting for a hypervisor to service requests from other virtual machines. This metric is only present on virtual machines. This metric records how much time this virtual machine had to wait to have the hypervisor kernel service a request. In order to get a percentage this value must be compared against the sum of all CPU states. A sustained high value for this metric may be caused by: 1) Another VM on the same hypervisor using too many resources, or 2) An underpowered hypervisor -
cpu.system
(cumulative)
CPU time spent running in the kernel. This value reflects how often processes are calling into the kernel for services (e.g to log to the console). In order to get a percentage this value must be compared against the sum of all CPU states. A sustained high value for this metric may be caused by: 1) A process that needs to be re-written to use kernel resources more efficiently, or 2) A userspace driver that is broken -
cpu.user
(cumulative)
CPU time spent running in userspace. In order to get a percentage this value must be compared against the sum of all CPU states. If this value is high: 1) A process requires more CPU to run than is available on the server, or 2) There is an application programming error which is causing the CPU to be used unexpectedly. -
cpu.wait
(cumulative)
Amount of total CPU time spent idle while waiting for an I/O operation to complete. In order to get a percentage this value must be compared against the sum of all CPU states. A high value for a sustained period may be caused by: 1) A slow hardware device that is taking too long to service requests, or 2) Too many requests being sent to an I/O device
The following information applies to the agent version 4.7.0+ that has
enableBuiltInFiltering: true
set on the top level of the agent config.
To emit metrics that are not default, you can add those metrics in the
generic monitor-level extraMetrics
config option. Metrics that are derived
from specific configuration options that do not appear in the above list of
metrics do not need to be added to extraMetrics
.
To see a list of metrics that will be emitted you can run agent-status monitors
after configuring this monitor in a running agent instance.
The following information only applies to agent version older than 4.7.0. If
you have a newer agent and have set enableBuiltInFiltering: true
at the top
level of your agent config, see the section above. See upgrade instructions in
Old-style whitelist filtering.
If you have a reference to the whitelist.json
in your agent's top-level
metricsToExclude
config option, and you want to emit metrics that are not in
that whitelist, then you need to add an item to the top-level
metricsToInclude
config option to override that whitelist (see Inclusion
filtering. Or you can just
copy the whitelist.json, modify it, and reference that in metricsToExclude
.