collectd/consul

Monitor Type: collectd/consul (Source)

Accepts Endpoints: Yes

Multiple Instances Allowed: Yes

Overview

Monitors the Consul data store by using the Consul collectd Python plugin, which collects metrics from Consul instances by querying the Consul HTTP API.

Supports Consul 0.7.0+.

Agent Statsd listener

If you are running a Consul version below 0.9.1, configure each Consul agent that is to be monitored to send telemetry to a SignalFx Agent instance by adding the following to the Consul agent's configuration file:

{"telemetry":
  {"statsd_address": "<agent host>:<agent port, default 8125>"}
}

This monitor should then be configured with the telemetryServer: true option set. This starts a UDP server that listens on 0.0.0.0:8125 by default.
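
As an illustration of the telemetry setup described above, a monitor entry might look like the following. This is a minimal sketch, not a definitive configuration: the host and port values are assumptions for a Consul agent running on the same machine (8500 is Consul's default HTTP API port), and the telemetry options are set to their documented defaults only to make them visible.

```yaml
monitors:
  - type: collectd/consul
    host: 127.0.0.1          # assumed: Consul agent on the local machine
    port: 8500               # assumed: Consul's default HTTP API port
    telemetryServer: true    # start the UDP listener for Consul telemetry
    telemetryHost: 0.0.0.0   # default, shown explicitly
    telemetryPort: 8125      # default; should match the port in Consul's statsd_address
```

The telemetryPort here has to agree with the port given in statsd_address in the Consul agent's configuration file, otherwise the telemetry packets will not reach the listener.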

Configuration

To activate this monitor in the Smart Agent, add the following to your agent config:

monitors:  # All monitor config goes under this key
 - type: collectd/consul
   ...  # Additional config

For a list of monitor options that are common to all monitors, see Common Configuration.

| Config option | Required | Type | Description |
| --- | --- | --- | --- |
| pythonBinary | no | string | Path to a python binary that should be used to execute the Python code. If not set, a built-in runtime will be used. Can include arguments to the binary as well. |
| host | yes | string | |
| port | yes | integer | |
| aclToken | no | string | Consul ACL token |
| useHTTPS | no | bool | Set to true to connect to Consul using HTTPS. You can configure the certificate for the server with the caCertificate config option. (default: false) |
| telemetryServer | no | bool | (default: false) |
| telemetryHost | no | string | IP address or DNS name to which Consul is configured to send telemetry UDP packets. Relevant only if telemetryServer is set to true. (default: 0.0.0.0) |
| telemetryPort | no | integer | Port to which Consul is configured to send telemetry UDP packets. Relevant only if telemetryServer is set to true. (default: 8125) |
| enhancedMetrics | no | bool | Set to true to enable collecting all metrics from Consul's runtime telemetry sent via UDP or from the /agent/metrics endpoint. (default: false) |
| caCertificate | no | string | If the Consul server has HTTPS enabled for the API, specifies the path to the CA's certificate. |
| clientCertificate | no | string | If client-side authentication is enabled, specifies the path to the certificate file. |
| clientKey | no | string | If client-side authentication is enabled, specifies the path to the key file. |
| signalFxAccessToken | no | string | |
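
To show how the HTTPS and authentication options fit together, here is a hedged sketch of a monitor entry for a Consul server with HTTPS and client-side authentication enabled. The hostname, port, and file paths are hypothetical placeholders, and the ACL token is left as a placeholder to be filled in.

```yaml
monitors:
  - type: collectd/consul
    host: consul.example.internal                    # hypothetical hostname
    port: 8501                                       # hypothetical HTTPS port
    useHTTPS: true
    caCertificate: /etc/consul/tls/ca.pem            # hypothetical path to the CA certificate
    clientCertificate: /etc/consul/tls/client.pem    # hypothetical path; only for client-side authentication
    clientKey: /etc/consul/tls/client-key.pem        # hypothetical path; only for client-side authentication
    aclToken: "<consul ACL token>"                   # placeholder
    enhancedMetrics: true                            # collect all runtime telemetry metrics
```

If client-side authentication is not enabled on the Consul server, the clientCertificate and clientKey lines can be dropped.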

Metrics

These are the metrics available for this monitor. Metrics that are categorized as container/host (default) are in bold and italics in the list below.

  • consul.dns.stale_queries (gauge)
    Number of times an agent serves a DNS query based on information from a server that is more than 5 seconds out of date. This metric has the dimensions datacenter, consul_node and consul_mode.
  • consul.memberlist.msg.suspect (gauge)
    This increments when an agent suspects another as failed when executing random probes as part of the gossip protocol. These can be an indicator of overloaded agents, network problems, or configuration errors where agents cannot connect to each other on the required ports. This metric has the dimensions datacenter, consul_node and consul_mode.
  • consul.serf.member.flap (gauge)
    This metric increments when an agent is marked dead and then recovers within a short time period. This can be an indicator of overloaded agents, network problems, or configuration errors where agents cannot connect to each other on the required ports. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.catalog.nodes.total (gauge)
    The total number of nodes in the Consul datacenter. This metric is common to the cluster and, therefore, reported by leader only. This metric is reported with the dimension datacenter, consul_node name and consul_mode to indicate which mode - server or client - is the reporting consul agent.
  • gauge.consul.catalog.nodes_by_service (gauge)
    Number of nodes providing a given service. This metric is reported by the leader only. The dimension consul_service indicates which service the metric corresponds to. Additionally, the metric has the datacenter and consul_mode dimensions.
  • gauge.consul.catalog.services.total (gauge)
    The total number of services registered with Consul in the given datacenter. This metric is common to the cluster and, therefore, reported by leader only. This metric is reported with the dimension datacenter, consul_node name and consul_mode to indicate which mode - server or client - is the reporting consul agent.
  • gauge.consul.catalog.services_by_node (gauge)
    Number of services registered with a node. This metric is reported by the leader only. The dimension consul_node indicates which node the metric corresponds to. Additionally, the metric has the datacenter and consul_mode dimensions.
  • gauge.consul.consul.dns.domain_query.AGENT.avg (gauge)
    This tracks the average time it takes to service forward DNS lookups on the given Consul agent. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.consul.dns.domain_query.AGENT.max (gauge)
    This tracks the maximum time it takes to service forward DNS lookups on the given Consul agent. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.consul.dns.domain_query.AGENT.min (gauge)
    This tracks minimum time it takes to service forward DNS lookups on the given Consul agent. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.consul.dns.ptr_query.AGENT.avg (gauge)
    This tracks average time it takes to service reverse DNS lookups on the given Consul agent. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.consul.dns.ptr_query.AGENT.max (gauge)
    This tracks maximum time it takes to service reverse DNS lookups on the given Consul agent. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.consul.dns.ptr_query.AGENT.min (gauge)
    This tracks minimum time it takes to service reverse DNS lookups on the given Consul agent. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.consul.leader.reconcile.avg (gauge)
    Time it takes the leader to reconcile the differences between Serf membership and Consul's store. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.consul.rpc.query (gauge)
    A general measure of all read volume. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.health.nodes.critical (gauge)
    Number of nodes for which health checks are reporting Critical state. This metric is reported by leader only. This metric is reported with the dimension datacenter, consul_node name and consul_mode.
  • gauge.consul.health.nodes.passing (gauge)
    Number of nodes for which health checks are reporting Passing state. This metric is reported by leader only. This metric is reported with the dimension datacenter, consul_node name and consul_mode.
  • gauge.consul.health.nodes.warning (gauge)
    Number of nodes for which health checks are reporting Warning state. This metric is reported by leader only. This metric is reported with the dimension datacenter, consul_node name and consul_mode.
  • gauge.consul.health.services.critical (gauge)
    Number of services for which health checks are reporting Critical state. This metric is reported by leader only. This metric is reported with the dimension datacenter, consul_node name and consul_mode.
  • gauge.consul.health.services.passing (gauge)
    Number of services for which health checks are reporting Passing state. This metric is reported by leader only. This metric is reported with the dimension datacenter, consul_node name and consul_mode.
  • gauge.consul.health.services.warning (gauge)
    Number of services for which health checks are reporting Warning state. This metric is reported by leader only. This metric is reported with the dimension datacenter, consul_node name and consul_mode.
  • gauge.consul.is_leader (gauge)
    Indicates whether a Consul server is in the leader or follower state. A follower instance returns a value of 0 and the leader returns a value of 1. Used by a Heat Map in the dashboard, which makes it easy to tell the leader from the followers visually. This metric comes with the dimension consul_server_state, which can be either leader or follower. It also has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.network.dc.latency.avg (gauge)
    Average latency between 2 datacenters. This metric has the additional dimension destination_dc. The latency is calculated between this destination datacenter and the agent's datacenter given by the datacenter dimension. Only the leader in the source datacenter calculates this metric. The metric also has the dimensions consul_mode and consul_node.
  • gauge.consul.network.dc.latency.max (gauge)
    Maximum latency between 2 datacenters. This metric has the additional dimension destination_dc. The latency is calculated between this destination datacenter and the agent's datacenter given by the datacenter dimension. Only the leader in the source datacenter calculates this metric. The metric also has the dimensions consul_mode and consul_node.
  • gauge.consul.network.dc.latency.min (gauge)
    Minimum latency between 2 datacenters. This metric has the additional dimension destination_dc. The latency is calculated between this destination datacenter and the agent's datacenter given by the datacenter dimension. Only the leader in the source datacenter calculates this metric. The metric also has the dimensions consul_mode and consul_node.
  • gauge.consul.network.node.latency.avg (gauge)
    Average network latency between given node and other nodes in the datacenter. The dimension consul_node corresponds to the source node. The metric also has the dimensions datacenter and consul_mode.
  • gauge.consul.network.node.latency.max (gauge)
    Maximum network latency between given node and other nodes in the datacenter. The dimension consul_node corresponds to the source node. The metric also has the dimensions datacenter and consul_mode.
  • gauge.consul.network.node.latency.min (gauge)
    Minimum network latency between given node and other nodes in the datacenter. The dimension consul_node corresponds to the source node. The metric also has the dimensions datacenter and consul_mode.
  • gauge.consul.peers (gauge)
    Number of Consul Raft peers, that is, Consul agents in server mode, in a given datacenter. This metric is reported by the leader only. This metric is reported with the dimension datacenter, consul_node name and consul_mode.
  • gauge.consul.raft.apply (gauge)
    This metric is a general indicator of the write load on the Consul servers. This metric has the global dimensions consul_node, consul_mode and datacenter.
  • gauge.consul.raft.commitTime.avg (gauge)
    This measures the mean time it takes to commit a new entry to the Raft log on the leader. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.raft.commitTime.max (gauge)
    This measures the max time it takes to commit a new entry to the Raft log on the leader. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.raft.commitTime.min (gauge)
    This measures the minimum time it takes to commit a new entry to the Raft log on the leader. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.raft.leader.dispatchLog.avg (gauge)
    This measures the mean time it takes for the leader to write log entries to disk. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.raft.leader.dispatchLog.max (gauge)
    This measures the maximum time it takes for the leader to write log entries to disk. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.raft.leader.dispatchLog.min (gauge)
    This measures the minimum time it takes for the leader to write log entries to disk. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.raft.leader.lastContact.avg (gauge)
    This measures the time since the leader was last able to contact the follower nodes when checking its leader lease. It can be used as a measure for how stable the Raft timing is and how close the leader is to timing out its lease. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.raft.leader.lastContact.max (gauge)
    This measures the maximum time since the leader was last able to contact the follower nodes when checking its leader lease. It can be used as a measure for how stable the Raft timing is and how close the leader is to timing out its lease. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.raft.leader.lastContact.min (gauge)
    This measures the minimum time since the leader was last able to contact the follower nodes when checking its leader lease. It can be used as a measure for how stable the Raft timing is and how close the leader is to timing out its lease. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.raft.replication.appendEntries.rpc.AGENT.avg (gauge)
    This measures the average time it takes to replicate log entries to followers. This is a general indicator of the load pressure on the Consul servers, as well as the performance of the communication between the servers. This metric is sent by the leader for each follower. The metric has the follower's IP or hostname added to the metric name. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.raft.replication.appendEntries.rpc.AGENT.max (gauge)
    This measures the maximum time it takes to replicate log entries to followers. This is a general indicator of the load pressure on the Consul servers, as well as the performance of the communication between the servers. This metric is sent by the leader for each follower. The metric has the follower's IP or hostname added to the metric name. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.raft.replication.appendEntries.rpc.AGENT.min (gauge)
    This measures the minimum time it takes to replicate log entries to followers. This is a general indicator of the load pressure on the Consul servers, as well as the performance of the communication between the servers. This metric is sent by the leader for each follower. The metric has the follower's IP or hostname added to the metric name. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.raft.state.candidate (gauge)
    Tracks the number of times given node enters the candidate state, i.e., the number of times the Consul server starts a leader election. If this increments without a leadership change occurring it could indicate that a single server is overloaded or is experiencing network connectivity issues. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.raft.state.leader (gauge)
    This metric increments whenever a Consul server becomes a leader. Frequent leadership changes may be an indication that the servers are overloaded and aren't meeting the soft real-time requirements for Raft, or that there are networking problems between the servers. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.rpc.query (gauge)
  • gauge.consul.runtime.alloc_bytes (gauge)
    Number of bytes allocated to Consul process on the node. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.runtime.heap_objects (gauge)
    Number of heap objects allocated to Consul, indicates memory pressure on a Consul agent. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.runtime.num_goroutines (gauge)
    Number of goroutines run by the Consul process on the node. Gives a general indication of the load pressure on the Consul agent. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.serf.events (gauge)
    Number of serf events processed by Consul. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.serf.events.consul:new-leader (gauge)
  • gauge.consul.serf.member.join (gauge)
    This metric tracks successful node joins to the Serf memberlist. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.serf.member.left (gauge)
    This metric tracks successful node leaves from the Serf memberlist. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.serf.queue.Event.avg (gauge)
    Average number of serf events in queue yet to be processed by Consul agent. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.serf.queue.Event.max (gauge)
    Maximum number of serf events in queue yet to be processed by Consul agent. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.serf.queue.Event.min (gauge)
    Minimum number of serf events in queue yet to be processed by Consul agent. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.serf.queue.Query.avg (gauge)
    Average number of serf queries in queue yet to be processed by Consul agent. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.serf.queue.Query.max (gauge)
    Maximum number of serf queries in queue yet to be processed by Consul agent. This metric has the dimensions datacenter, consul_node and consul_mode.
  • gauge.consul.serf.queue.Query.min (gauge)
    Minimum number of serf queries in queue yet to be processed by Consul agent. This metric has the dimensions datacenter, consul_node and consul_mode.

Non-default metrics (version 4.7.0+)

The following information applies to the agent version 4.7.0+ that has enableBuiltInFiltering: true set on the top level of the agent config.

To emit metrics that are not default, you can add those metrics in the generic monitor-level extraMetrics config option. Metrics that are derived from specific configuration options that do not appear in the above list of metrics do not need to be added to extraMetrics.

To see a list of metrics that will be emitted you can run agent-status monitors after configuring this monitor in a running agent instance.
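
The extraMetrics option described above can be sketched as follows. This is an illustrative fragment under stated assumptions: the host and port values are hypothetical, and gauge.consul.serf.queue.Query.max is only assumed to be a non-default metric for the sake of the example, since the default/non-default marking is not visible in the list above.

```yaml
enableBuiltInFiltering: true   # set at the top level of the agent config

monitors:
  - type: collectd/consul
    host: 127.0.0.1            # assumed: Consul agent on the local machine
    port: 8500                 # assumed: Consul's default HTTP API port
    extraMetrics:
      - gauge.consul.serf.queue.Query.max   # assumed to be non-default; replace with the metrics you need
```

Running agent-status monitors afterwards is a way to confirm which metrics the monitor actually emits.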

Legacy non-default metrics (version < 4.7.0)

The following information only applies to agent version older than 4.7.0. If you have a newer agent and have set enableBuiltInFiltering: true at the top level of your agent config, see the section above. See upgrade instructions in Old-style whitelist filtering.

If you have a reference to the whitelist.json in your agent's top-level metricsToExclude config option, and you want to emit metrics that are not in that whitelist, then you need to add an item to the top-level metricsToInclude config option to override that whitelist (see Inclusion filtering). Alternatively, you can copy the whitelist.json, modify it, and reference the modified copy in metricsToExclude.
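
A metricsToInclude override of the kind described above might look like the sketch below. It assumes that filter items accept the metricNames and monitorType keys, as in the agent's filtering documentation; check the Inclusion filtering section for the exact format supported by your agent version. The metric name is again only an example.

```yaml
# Top level of the agent config (legacy filtering, agent versions older than 4.7.0)
metricsToInclude:
  - metricNames:
      - gauge.consul.serf.queue.Query.max   # example metric not in the whitelist (assumption)
    monitorType: collectd/consul            # restrict the override to this monitor
```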

Dimensions

The following dimensions may occur on metrics emitted by this monitor. Some dimensions may be specific to certain metrics.

| Name | Description |
| --- | --- |
| consul_mode | Whether this consul instance is running as a server or client |
| consul_node | The name of the consul node |
| datacenter | The name of the consul datacenter |