Linux Agent gets unhealthy on adding Linux integration. #6155

Closed
amolnater-qasource opened this issue Nov 27, 2024 · 23 comments · Fixed by elastic/beats#41825 or elastic/beats#41930
Labels
bug Something isn't working impact:medium QA:Validated Validated by the QA Team Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Comments

@amolnater-qasource

Kibana Build details:

VERSION: 8.17.0 BC1
BUILD: 80364
COMMIT: e3c75d19d796c366aedc5788960b2c6cc868014f

Artifact Link: https://staging.elastic.co/8.17.0-8031025a/downloads/beats/elastic-agent/elastic-agent-8.17.0-linux-x86_64.tar.gz

Host OS:
SLES15

Preconditions:

  1. 8.17.0-BC1 Kibana cloud environment should be available.

Steps to reproduce:

  1. Install Linux agent.
  2. Add linux integration to this agent.
  3. Observe that the agent becomes unhealthy, with errors in the Linux integration:
Degraded
Error fetching data for metricset system.raid: failed to parse sysfs: no matches from path /sys/block

Degraded
Error fetching data for metricset linux.conntrack: error fetching conntrack stats: open /proc/net/stat/nf_conntrack: no such file or directory

Expected Result:
Linux Agent should remain healthy on adding Linux integration.

Screenshot:
Image

Agent Logs:
elastic-agent-diagnostics-2024-11-27T08-39-33Z-00.zip

@amolnater-qasource amolnater-qasource added bug Something isn't working impact:medium Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Nov 27, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@ycombinator ycombinator added Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team and removed Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Nov 27, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@mauri870
Member

mauri870 commented Nov 28, 2024

@amolnater-qasource Can you confirm you have the conntrack module loaded on this system?

lsmod | grep conntrack

Also, was this system upgraded without a restart? This can cause failures sometimes.

@mauri870
Member

I deployed a SUSE 15 instance in Azure to debug this, running the 8.17.0 version from the staging environment. I added the Beta Linux Metrics integration and couldn't find any errors from the agent:

$ uname -a
Linux mauri-suse 5.14.21-150500.33.66-azure #1 SMP PREEMPT_DYNAMIC Wed Sep 4 05:47:04 UTC 2024 (4885a53) x86_64 x86_64 x86_64 GNU/Linux
$ sudo elastic-agent status
┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   └─ status: (HEALTHY) Running
azureuser@mauri-suse:~>

Image

Image

Image

@mauri870
Member

Aha, after enabling "Collect system metrics from Linux instances > Linux host raid metrics", the agent went into a degraded state:

$ sudo elastic-agent status
┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (DEGRADED) 1 or more components/units in a degraded state
   └─ system/metrics-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '3031'
      └─ system/metrics-default-system/metrics-system-7afde2f3-7310-4e56-8554-4847fa1c1567
         └─ status: (DEGRADED) Error fetching data for metricset system.raid: failed to parse sysfs: no matches from path /sys/block

@mauri870
Member

I just managed to reproduce the same issue on Arch Linux as well, so this is definitely not a SUSE-specific issue.

@mauri870
Member

Could it be that we expect to find devices in raid mode, and if we find zero then we return an error? We should probably return nil in this case or treat the error accordingly.

https://github.com/elastic/beats/blob/42e25f7216862b6779c2e8a87a82c1ae30d9a6e1/metricbeat/module/system/raid/blockinfo/getdev.go#L46-L48

@pierrehilbert
Contributor

Wouldn't we have faced this issue before if this were the root cause?
From what I can see, we haven't changed anything there for a while.

Could a change in newer Linux kernels be disabling nf_conntrack by default?

@mauri870
Member

That does not seem to be the case. I confirmed that conntrack is loaded on both SUSE and Arch. SUSE is on the 5.14 kernel and Arch on 6.12, so I doubt this is related to the kernel itself.

cc @fearful-symmetry since you worked on the raid metrics.

@mauri870
Member

I have pushed a fix for the metricbeat system module. To my understanding, there is no point in a customer enabling RAID metrics on a system that does not have a RAID configuration. But if they do, the agent should not go into a degraded state because of it; it should simply report no metrics at all. That is basically the solution I'm going with.

Please feel free to comment on other ways to handle this.

@mauri870
Member

mauri870 commented Dec 3, 2024

From my testing, this is partially fixed. The error is not causing the agent to go into a degraded state anymore, and it is properly shown in the logs:

Image

In the pull request we decided to use PartialMetricsError so that the error is reported in the output of elastic-agent status, but it is not showing up for me. I'll investigate and see why that is the case.

@VihasMakwana
Contributor

@mauri870 can you do elastic-agent status --output full and see?

@mauri870
Member

mauri870 commented Dec 3, 2024

Thanks, here is the full output. From my understanding, it should be reporting this message, right?

elastic-agent status --output=full
┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (HEALTHY) Running
   ├─ info
   │  ├─ id: 78e62940-b597-4bc0-afa4-91000d164ccb
   │  ├─ version: 8.17.0
   │  └─ commit: 8a91d5c2306860fa88a1bae9bb7b37b7eabeddf5
   ├─ beat/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '96884'
   │  ├─ beat/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ beat/metrics-monitoring-metrics-monitoring-beats
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ filestream-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '96864'
   │  ├─ filestream-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ filestream-monitoring-filestream-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ http/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '96913'
   │  ├─ http/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ http/metrics-monitoring-metrics-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ linux/metrics-default
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '96838'
   │  ├─ linux/metrics-default
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ linux/metrics-default-linux/metrics-system-d192f191-c94f-4c99-9363-ff4e8cfb68a5
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ log-default
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '96786'
   │  ├─ log-default
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ log-default-logfile-system-fa497a42-ebd4-4117-8c4b-dde7ce717735
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   └─ system/metrics-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '96812'
      ├─ system/metrics-default
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: OUTPUT
      ├─ system/metrics-default-system/metrics-system-d192f191-c94f-4c99-9363-ff4e8cfb68a5
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: INPUT
      └─ system/metrics-default-system/metrics-system-fa497a42-ebd4-4117-8c4b-dde7ce717735
         ├─ status: (HEALTHY) Healthy
         └─ type: INPUT

@VihasMakwana
Contributor

It should...

Wondering why it doesn't report the error? 🤔

@mauri870
Member

mauri870 commented Dec 3, 2024

That is quite intriguing. I'm fairly certain the error I see in the logs originates from the logp line below, suggesting that we have updated the agent's status, but for some reason, it is not being displayed.

// mark module as running if metrics are partially available and display the error message
msw.module.UpdateStatus(status.Running, fmt.Sprintf("Error fetching data for metricset %s.%s: %v", msw.module.Name(), msw.MetricSet.Name(), err))
logp.Err("Error fetching data for metricset %s.%s: %s", msw.module.Name(), msw.Name(), err)

@mauri870
Member

mauri870 commented Dec 3, 2024

I spoke with Vihas on Slack and have opened elastic/beats#41867 to track this bug. I will keep this issue closed, as the reported bug with a degraded agent state is now fixed.

@amolnater-qasource
Author

amolnater-qasource commented Dec 6, 2024

Hi Team,
We have revalidated this issue on the latest 8.17.0 BC5 Kibana cloud environment and found it still reproducible.

Observations:

  • Linux agent still gets unhealthy with error: Error fetching data for metricset linux.conntrack: error fetching conntrack stats: open /proc/net/stat/nf_conntrack: no such file or directory.

  • No output is observed on running lsmod | grep conntrack.
    Image

  • (DEGRADED) Error fetching data for metricset linux.conntrack: error fetching conntrack stats: open /proc/net/stat/nf_conntrack: no such file or directory is observed on running sudo elastic-agent status --output full.

  • We have observed data for linux.conntrack dataset under Data Streams tab.

Screenshots:
Image

Image
Image

Logs:
elastic-agent-diagnostics-2024-12-06T04-47-57Z-00.zip

Build details:
VERSION: 8.17.0 BC5
BUILD: 80495
COMMIT: 5c78fb5e4e9b5063bd83ae9bd1e5b32c63f5cc34
Artifact Link: https://staging.elastic.co/8.17.0-a18e6540/downloads/beats/elastic-agent/elastic-agent-8.17.0-linux-x86_64.tar.gz

Please let us know if this is expected.

For now we are reopening this issue until further clarity.

Thanks!

@VihasMakwana
Contributor

@mauri870 @cmacknz Isn't this expected? The user is trying to use the conntrack module without loading the appropriate kernel module.
We can suppress this error, but as a long-term solution, I would rather have the error thrown in the metricset's New(...) method.

@mauri870
Member

mauri870 commented Dec 6, 2024

Looks like this issue covers two different errors, the RAID metrics and conntrack metrics. My fix was only for the RAID metrics as I couldn't reproduce the conntrack one. It makes sense that the conntrack failure is due to the module not being loaded.

I think this can be a partial metrics error as well, and perhaps a more descriptive one: the error for a missing /proc/net/stat/nf_conntrack could have "conntrack module not loaded/found" appended to it.

@mauri870
Member

mauri870 commented Dec 6, 2024

I have opened a PR to fix this. I agree with @VihasMakwana that we should probably check this in the New call. I'm not that familiar with the metricset initialization, but what happens if the New method from a metricset fails?

@cmacknz
Member

cmacknz commented Dec 6, 2024

Big +1 to us displaying an error when we are told to collect data from a source and collecting data from that source is impossible without modifying the host system in some way. Showing the user this is the point of the feature that does this.

but what happens if the New method from a metricset fails?

Try it and find out :) What will likely happen is the input in the UI shows as failed with an error that it couldn't reload the configuration because the module couldn't be created.

I would expect the error to pop out here in the Beats code.

@mauri870
Member

mauri870 commented Dec 9, 2024

I have filed elastic/beats#41963 to look into handling these in the New call instead of during metric fetching. We should probably look into the other system metricsets to see if they fall into the same category.

@amolnater-qasource
Author

amolnater-qasource commented Dec 11, 2024

Hi Team,
We have revalidated this issue on the latest 8.17.0 BC6 Kibana cloud environment and found it fixed now.

Observations:

  • Linux Agent remains healthy on adding Linux integration.

Screenshots:
Image
Image
Image

Logs:
elastic-agent-diagnostics-2024-12-11T07-55-02Z-00.zip

Build details:
VERSION: 8.17.0 BC6
BUILD: 80521
COMMIT: e8a820624a03a412433584d3e3df951838e4c63c
Artifact Link: https://staging.elastic.co/8.17.0-6b31e673/downloads/beats/elastic-agent/elastic-agent-8.17.0-amd64.deb

Hence we are marking this issue as QA:Validated.

Thanks!

@amolnater-qasource amolnater-qasource added QA:Validated Validated by the QA Team and removed QA:Ready For Testing Code is merged and ready for QA to validate labels Dec 11, 2024
7 participants