Skip to content

Latest commit

 

History

History
91 lines (56 loc) · 6.02 KB

collectd-cpu.md

File metadata and controls

91 lines (56 loc) · 6.02 KB

collectd/cpu

Monitor Type: collectd/cpu (Source)

Accepts Endpoints: No

Multiple Instances Allowed: No

Overview

This monitor collects cpu usage data using the collectd cpu plugin. It aggregates the per-core CPU data into a single metric and sends it to the SignalFx Metadata plugin in collectd, where the raw jiffy counts from the cpu plugin are converted to percent utilization (the cpu.utilization metric).

See https://collectd.org/wiki/index.php/Plugin:CPU

Configuration

To activate this monitor in the Smart Agent, add the following to your agent config:

monitors:  # All monitor config goes under this key
 - type: collectd/cpu
   ...  # Additional config

For a list of monitor options that are common to all monitors, see Common Configuration.

This monitor has no configuration options.

Metrics

These are the metrics available for this monitor. Metrics that are categorized as container/host (default) are in bold and italics in the list below.

  • cpu.idle (cumulative)
    CPU time spent not in any other state. In order to get a percentage this value must be compared against the sum of all CPU states.

  • cpu.interrupt (cumulative)
    CPU time spent while servicing hardware interrupts. A hardware interrupt happens at the physical layer. When this occurs, the CPU will stop whatever else it is doing and service the interrupt. This metric measures how many jiffies were spent handling these interrupts. In order to get a percentage this value must be compared against the sum of all CPU states. A sustained high value for this metric may be caused by faulty hardware such as a broken peripheral.

  • cpu.nice (cumulative)
    CPU time spent in userspace running 'nice'-ed processes. In order to get a percentage this value must be compared against the sum of all CPU states. A sustained high value for this metric may be caused by: 1) The server not having enough CPU capacity for a process, 2) A programming error which causes a process to use an unexpected amount of CPU

  • cpu.softirq (cumulative)
    CPU time spent while servicing software interrupts. Unlike a hardware interrupt, a software interrupt happens at the sofware layer. Usually it is a userspace program requesting a service of the kernel. This metric measures how many jiffies were spent by the CPU handling these interrupts. In order to get a percentage this value must be compared against the sum of all CPU states. A sustained high value for this metric may be caused by a programming error which causes a process to unexpectedly request too many services from the kernel.

  • cpu.steal (cumulative)
    CPU time spent waiting for a hypervisor to service requests from other virtual machines. This metric is only present on virtual machines. This metric records how much time this virtual machine had to wait to have the hypervisor kernel service a request. In order to get a percentage this value must be compared against the sum of all CPU states. A sustained high value for this metric may be caused by: 1) Another VM on the same hypervisor using too many resources, or 2) An underpowered hypervisor

  • cpu.system (cumulative)
    CPU time spent running in the kernel. This value reflects how often processes are calling into the kernel for services (e.g to log to the console). In order to get a percentage this value must be compared against the sum of all CPU states. A sustained high value for this metric may be caused by: 1) A process that needs to be re-written to use kernel resources more efficiently, or 2) A userspace driver that is broken

  • cpu.user (cumulative)
    CPU time spent running in userspace. In order to get a percentage this value must be compared against the sum of all CPU states. If this value is high: 1) A process requires more CPU to run than is available on the server, or 2) There is an application programming error which is causing the CPU to be used unexpectedly.

  • cpu.wait (cumulative)
    Amount of total CPU time spent idle while waiting for an I/O operation to complete. In order to get a percentage this value must be compared against the sum of all CPU states. A high value for a sustained period may be caused by: 1) A slow hardware device that is taking too long to service requests, or 2) Too many requests being sent to an I/O device

Non-default metrics (version 4.7.0+)

The following information applies to the agent version 4.7.0+ that has enableBuiltInFiltering: true set on the top level of the agent config.

To emit metrics that are not default, you can add those metrics in the generic monitor-level extraMetrics config option. Metrics that are derived from specific configuration options that do not appear in the above list of metrics do not need to be added to extraMetrics.

To see a list of metrics that will be emitted you can run agent-status monitors after configuring this monitor in a running agent instance.

Legacy non-default metrics (version < 4.7.0)

The following information only applies to agent version older than 4.7.0. If you have a newer agent and have set enableBuiltInFiltering: true at the top level of your agent config, see the section above. See upgrade instructions in Old-style whitelist filtering.

If you have a reference to the whitelist.json in your agent's top-level metricsToExclude config option, and you want to emit metrics that are not in that whitelist, then you need to add an item to the top-level metricsToInclude config option to override that whitelist (see Inclusion filtering. Or you can just copy the whitelist.json, modify it, and reference that in metricsToExclude.