Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Telegraf should log if InfluxDB returns a 400 #16386

Open
btasker opened this issue Jan 9, 2025 · 3 comments · May be fixed by #16388
Open

Telegraf should log if InfluxDB returns a 400 #16386

btasker opened this issue Jan 9, 2025 · 3 comments · May be fixed by #16388
Labels
bug unexpected problem or unintended behavior

Comments

@btasker
Copy link
Contributor

btasker commented Jan 9, 2025

Use Case

Upstream InfluxDB servers will return a HTTP 400 status code if there some kind of issue with the request (usually, but not always) something to do with the format of the line protocol.

In #16380 it wasn't clear that InfluxDB was rejecting the writes, because Telegraf claimed that the metrics had been written successfully

Note: Whilst trying to build an easy repro for this ticket I noticed that it also doesn't log if a 404 is returned (usually indicating the specified bucket doesn't exist)

As a somewhat contrived example for reproing, we can force a 400 by telling Telegraf to gzip the request body and then overriding the Content-Encoding header with an invalid value

[[outputs.influxdb_v2]]
  urls = ["http://192.168.3.5:18086"]
  http_headers = {"Content-Encoding" = "NotReal"}
  token = "<MY TOKEN>"
  organization = "1f493ade9638cc83"
  bucket = "testingdb"
  insecure_skip_verify = true
  content_encoding = "gzip"

This will cause an InfluxDB2 upstream to return a 400:

HTTP/1.1 400 Bad Request
Content-Type: application/json; charset=utf-8
X-Influxdb-Build: OSS
X-Influxdb-Version: v2.3.0+SNAPSHOT.090f681737
X-Platform-Error-Code: invalid
Date: Thu, 09 Jan 2025 10:54:47 GMT
Transfer-Encoding: chunked

c9c
{"code":"invalid","message":"unable to parse '\u001f\ufffd\

Expected behavior

Telegraf should log that an error was received

Actual behavior

Telegraf reports that the batch was written

2025-01-09T10:55:37Z D! [outputs.influxdb_v2] Wrote batch of 13 metrics in 5.009877ms
2025-01-09T10:55:37Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2025-01-09T10:55:42Z D! [outputs.influxdb_v2] Wrote batch of 13 metrics in 5.598315ms
2025-01-09T10:55:42Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2025-01-09T10:55:47Z D! [outputs.influxdb_v2] Wrote batch of 13 metrics in 3.626823ms
2025-01-09T10:55:47Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2025-01-09T10:55:52Z D! [outputs.influxdb_v2] Wrote batch of 13 metrics in 4.33643ms
2025-01-09T10:55:52Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics

Additional info

No response

@btasker btasker added the bug unexpected problem or unintended behavior label Jan 9, 2025
@btasker
Copy link
Contributor Author

btasker commented Jan 9, 2025

Interestingly, this doesn't occur if there's a 1.x instance upstream (so we're writing in using the compatability API).

In that instance, Telegraf correctly logs that there was an issue (though it's not terminal safe with the repro above):

image

@btasker
Copy link
Contributor Author

btasker commented Jan 9, 2025

It turns out the repo above isn't as helpful as intended.

Although it generates a 400, the response body is different because it's complaining about not being able to parse the LP.

In #16380 the response body was

{"error":"Authorization header was malformed and should be in the form 'Authorization: Bearer <token>'"}

The existing handling appears to be in support of Partial writes

@jdstrand
Copy link
Contributor

jdstrand commented Jan 9, 2025

I was able to confirm in my testing of #16388 that telegraf is now logging the 400. See #16350 (comment):

$ telegraf --debug --config ./telegraf.conf 
2025-01-09T22:03:40Z I! Loading config: ./telegraf.conf
2025-01-09T22:03:40Z I! Starting Telegraf 1.34.0-1304dd90 brought to you by InfluxData the makers of InfluxDB
2025-01-09T22:03:40Z I! Available plugins: 236 inputs, 9 aggregators, 33 processors, 26 parsers, 63 outputs, 6 secret-stores
2025-01-09T22:03:40Z I! Loaded inputs: cpu
2025-01-09T22:03:40Z I! Loaded aggregators:
2025-01-09T22:03:40Z I! Loaded processors:
2025-01-09T22:03:40Z I! Loaded secretstores:
2025-01-09T22:03:40Z I! Loaded outputs: influxdb_v2
2025-01-09T22:03:40Z I! Tags enabled: host=tg16388
2025-01-09T22:03:40Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"tg16388", Flush Interval:10s
2025-01-09T22:03:40Z W! [agent] The default value of 'skip_processors_after_aggregators' will change to 'true' with Telegraf v1.40.0! If you need the current default behavior, please explicitly set the option to 'false'!
2025-01-09T22:03:40Z D! [agent] Initializing plugins
2025-01-09T22:03:40Z D! [agent] Connecting outputs
2025-01-09T22:03:40Z D! [agent] Attempting connection to [outputs.influxdb_v2]
2025-01-09T22:03:40Z D! [agent] Successfully connected to outputs.influxdb_v2
2025-01-09T22:03:40Z D! [agent] Starting service inputs
2025-01-09T22:03:50Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2025-01-09T22:04:00Z E! [outputs.influxdb_v2] When writing to [http://127.0.0.1:8000/api/v2/write]: failed to write metric to Telegraf_bucket_name (will be dropped: 400 Bad Request)
2025-01-09T22:04:00Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2025-01-09T22:04:00Z E! [agent] Error writing to outputs.influxdb_v2: failed to write metric to Telegraf_bucket_name (will be dropped: 400 Bad Request)
2025-01-09T22:04:10Z E! [outputs.influxdb_v2] When writing to [http://127.0.0.1:8000/api/v2/write]: failed to write metric to Telegraf_bucket_name (will be dropped: 400 Bad Request)
2025-01-09T22:04:10Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2025-01-09T22:04:10Z E! [agent] Error writing to outputs.influxdb_v2: failed to write metric to Telegraf_bucket_name (will be dropped: 400 Bad Request)
...

To be extra sure, I modified the 'server' script:

#!/usr/bin/env python3

import getopt
from http.server import BaseHTTPRequestHandler, HTTPServer
import sys

class ErrorServerHandler(BaseHTTPRequestHandler):
    status: int = 200
    def fake_request(self):
        self.send_response(self.status)
        if self.status == 200:
            self.send_header("Content-type", "application/json")
        elif self.status == 400:
            self.send_header("Content-type", "application/json; charset=utf-8")
            self.send_header("X-Influxdb-Build", "OSS")
            self.send_header("X-Influxdb-Version", "v2.3.0+SNAPSHOT.090f681737")
            self.send_header("X-Platform-Error-Code", "invalid")
            self.send_header("Transfer-Enncoding", "chunked")
        self.end_headers()
        if self.status == 200:
            try:
                self.wfile.write(b"{}\n")
            except Exception:
                print("%s: broken pipe (client disconnect?)" % self.path)
        if self.status == 400:
            try:
                self.wfile.write(b'c9c\n{"code":"invalid","message":"unable to parse \'\u001f\ufffd\n')
            except Exception:
                print("%s: broken pipe (client disconnect?)" % self.path)

    def do_GET(self):
        self.fake_request()

    def do_POST(self):
        self.fake_request()

def run(port: int, status: int = 200):
    # XXX: validate delay and port
    addr = ('', port)
    httpd = HTTPServer(addr, ErrorServerHandler)
    ErrorServerHandler.status = status
    if status == 400:
        ErrorServerHandler.protocol_version = "HTTP/1.1"
    httpd.serve_forever()

if __name__ == '__main__':
    global delay
    try:
        opts, rest = getopt.getopt(sys.argv[1:], "hp:s:", [])
    except Exception:
        print("Bad args. See -h")
        sys.exit(1)

    port: int = 8000
    status: int = 200
    for opt, arg in opts:
        if opt == "-h":
            print("""Usage: server.py ARGS
 -h     this message
 -p N   listen on port N
 -s N   return status code N

GET and POST are supported an all paths and simply return an empty JSON
document.
""")
            sys.exit(0)
        elif opt == "-p":
            port = int(arg)
        elif opt == "-s":
            status = int(arg)
    try:
        run(port, status)
    except KeyboardInterrupt:
        pass

To reproduce:

  1. start the server:
$ python3 ./server.py -s 400
...
  1. use this telegraf.conf:
[global_tags]
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = "0s"
  hostname = ""
  omit_hostname = false

[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false

[[outputs.influxdb_v2]]
  urls = ["http://127.0.0.1:8000"]
  token = "mytoken"
  organization = "myorganization"
  bucket = "Telegraf"
  bucket_tag = "destination"
  exclude_bucket_tag = true
  1. start telegraf (no 400s logged with 1.33.0):
$ telegraf --debug --config ./telegraf.conf 
2025-01-09T22:30:49Z I! Loading config: ./telegraf.conf
2025-01-09T22:30:49Z I! Starting Telegraf 1.33.0 brought to you by InfluxData the makers of InfluxDB
2025-01-09T22:30:49Z I! Available plugins: 236 inputs, 9 aggregators, 33 processors, 26 parsers, 63 outputs, 6 secret-stores
2025-01-09T22:30:49Z I! Loaded inputs: cpu
2025-01-09T22:30:49Z I! Loaded aggregators:
2025-01-09T22:30:49Z I! Loaded processors:
2025-01-09T22:30:49Z I! Loaded secretstores:
2025-01-09T22:30:49Z I! Loaded outputs: influxdb_v2
2025-01-09T22:30:49Z I! Tags enabled: host=tg16388
2025-01-09T22:30:49Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"tg16388", Flush Interval:10s
2025-01-09T22:30:49Z D! [agent] Initializing plugins
2025-01-09T22:30:49Z D! [agent] Connecting outputs
2025-01-09T22:30:49Z D! [agent] Attempting connection to [outputs.influxdb_v2]
2025-01-09T22:30:49Z D! [agent] Successfully connected to outputs.influxdb_v2
2025-01-09T22:30:49Z D! [agent] Starting service inputs
2025-01-09T22:30:59Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2025-01-09T22:31:09Z D! [outputs.influxdb_v2] Wrote batch of 21 metrics in 4.395281ms
2025-01-09T22:31:09Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
^C2025-01-09T22:31:12Z D! [agent] Stopping service inputs
2025-01-09T22:31:12Z D! [agent] Input channel closed
2025-01-09T22:31:12Z I! [agent] Hang on, flushing any cached metrics before shutdown
2025-01-09T22:31:12Z D! [outputs.influxdb_v2] Wrote batch of 21 metrics in 1.947469ms
2025-01-09T22:31:12Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2025-01-09T22:31:12Z I! [agent] Stopping running outputs
2025-01-09T22:31:12Z D! [agent] Stopped Successfully

Using the telegraf binary from PR #16388, it now logs the 400:

$ telegraf --debug --config ./telegraf.conf 
2025-01-09T22:32:03Z I! Loading config: ./telegraf.conf
2025-01-09T22:32:03Z I! Starting Telegraf 1.34.0-1304dd90 brought to you by InfluxData the makers of InfluxDB
2025-01-09T22:32:03Z I! Available plugins: 236 inputs, 9 aggregators, 33 processors, 26 parsers, 63 outputs, 6 secret-stores
2025-01-09T22:32:03Z I! Loaded inputs: cpu
2025-01-09T22:32:03Z I! Loaded aggregators:
2025-01-09T22:32:03Z I! Loaded processors:
2025-01-09T22:32:03Z I! Loaded secretstores:
2025-01-09T22:32:03Z I! Loaded outputs: influxdb_v2
2025-01-09T22:32:03Z I! Tags enabled: host=tg16388
2025-01-09T22:32:03Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"tg16388", Flush Interval:10s
2025-01-09T22:32:03Z W! [agent] The default value of 'skip_processors_after_aggregators' will change to 'true' with Telegraf v1.40.0! If you need the current default behavior, please explicitly set the option to 'false'!
2025-01-09T22:32:03Z D! [agent] Initializing plugins
2025-01-09T22:32:03Z D! [agent] Connecting outputs
2025-01-09T22:32:03Z D! [agent] Attempting connection to [outputs.influxdb_v2]
2025-01-09T22:32:03Z D! [agent] Successfully connected to outputs.influxdb_v2
2025-01-09T22:32:03Z D! [agent] Starting service inputs
2025-01-09T22:32:13Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2025-01-09T22:32:23Z E! [outputs.influxdb_v2] When writing to [http://127.0.0.1:8000/api/v2/write]: failed to write metric to Telegraf (will be dropped: 400 Bad Request)
2025-01-09T22:32:23Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2025-01-09T22:32:23Z E! [agent] Error writing to outputs.influxdb_v2: failed to write metric to Telegraf (will be dropped: 400 Bad Request)
^C2025-01-09T22:32:26Z D! [agent] Stopping service inputs
2025-01-09T22:32:26Z D! [agent] Input channel closed
2025-01-09T22:32:26Z I! [agent] Hang on, flushing any cached metrics before shutdown
2025-01-09T22:32:26Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2025-01-09T22:32:26Z I! [agent] Stopping running outputs
2025-01-09T22:32:26Z D! [agent] Stopped Successfully

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants