Logs missing during heavy log volume #4693
Comments
Consistently seeing the following errors in the logs (changed the wait time to 60s):
2024-11-04 15:27:48 +0000 [info]: #0 [in_tail_container_logs] detected rotation of /var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log; waiting 60.0 seconds
2024-11-04 15:27:48 +0000 [warn]: #0 [in_tail_container_logs] Skip update_watcher because watcher has been already updated by other inotify event path="/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log" inode=100695028 inode_in_pos_file=0
2024-11-04 15:27:48 +0000 [info]: #0 [in_tail_container_logs] detected rotation of /var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log; waiting 60.0 seconds
2024-11-04 15:27:48 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log
Contents of containers.log.pos file:
|
Thanks for this report. |
Thanks. I've tried different combinations of settings since opening this issue, e.g. using a file buffer, increasing the chunk size, increasing the mem/CPU allocated to the fluentd daemonset, etc. None of them seems to have an impact on Fluentd's ability to tail the logs. It's as if it's losing track of the files it's supposed to tail. I have the notebook I've been using to find gaps in the sequence. Let me know if you want me to post it here. |
@daipom I just ran a test where I set the kubelet's containerLogMaxSize to 50Mi (the default is 10Mi). After doing that I saw zero log loss. I'm not totally sure why that would be. My only guess is that the files are being rotated less often and so there are fewer files for fluentd to keep track of. |
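For reference, outside of EKS-specific tooling that limit is a kubelet setting; a minimal sketch of the relevant part of a KubeletConfiguration (the 50Mi value mirrors the test above, the file count is just an example):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Rotate a container's log file once it reaches 50Mi (the default is 10Mi)
containerLogMaxSize: 50Mi
# Number of rotated log files to retain per container (the default is 5)
containerLogMaxFiles: 5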
@daipom Do you think increasing the number of workers and allocating them to the source block for in_tail would help? |
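For what it's worth, a single in_tail source doesn't fan out across workers, so any multi-worker setup would need each source pinned to a specific worker. A rough sketch of that idea (worker count, path globs, pos_file names, and the parser are hypothetical, just to show the shape):

<system>
  # run two worker processes (hypothetical count)
  workers 2
</system>

<worker 0>
  <source>
    @type tail
    path /var/log/containers/*[0-7].log          # hypothetical split of the glob
    pos_file /var/log/fluentd-containers-w0.log.pos
    tag kubernetes.*
    read_from_head true
    <parse>
      @type none          # placeholder; use json or a CRI parser depending on the runtime
    </parse>
  </source>
</worker>

<worker 1>
  <source>
    @type tail
    path /var/log/containers/*[89a-f].log        # the rest of the hex-suffixed files
    pos_file /var/log/fluentd-containers-w1.log.pos
    tag kubernetes.*
    read_from_head true
    <parse>
      @type none          # placeholder; use json or a CRI parser depending on the runtime
    </parse>
  </source>
</worker>

Whether that would actually help here is unclear, since the bottleneck may be rotation handling rather than raw read throughput.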
I tried it briefly in my local environment, but I could not reproduce this. @jicowan Can you reproduce this without Kubernetes? |
I only tried this on k8s. I ran multiple replicas of it (at least 10). When the logs grew to 10MB, they were rotated by the kubelet. That's where I saw the issue. Fluentd lost track of the inodes because the files were being rotated so quickly. |
@jicowan |
I think I need to have a file like |
First you need a Kubernetes cluster (try not to use kind, minikube, or another single-node version of Kubernetes). Then you need to install the Fluentd DaemonSet. You can download the manifests from here. I used the version for Amazon CloudWatch, but you can use a different backend if you like. So long as it can absorb the volume of logs that you're sending to it, the choice of backend shouldn't affect the results of the tests. The default log file size is 10MB. At 10MB the kubelet (the Kubernetes "agent") will rotate the log file. You can use the Kubernetes Deployment I created to deploy the logging application:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: logger-deployment
  labels:
    app: logger
spec:
  replicas: 1 # Adjust the number of replicas as needed
  selector:
    matchLabels:
      app: logger
  template:
    metadata:
      labels:
        app: logger
    spec:
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - logger
                topologyKey: "kubernetes.io/hostname"
      containers:
        - name: logger
          image: jicowan/logger:v3.0
          resources:
            requests:
              cpu: 4
              memory: 128Mi
            limits:
              cpu: 4
              memory: 256Mi
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name

The configuration for Fluentd is typically stored in a ConfigMap. If this isn't descriptive enough, I can walk you through the configuration during a web conference. |
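As a rough idea of the shape of that ConfigMap, a stripped-down sketch (the name, data key, and tail options here are placeholders rather than the exact config in use):

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config        # hypothetical name
  namespace: kube-system
data:
  containers.conf: |
    # Tail the container log files that the kubelet writes and rotates
    <source>
      @type tail
      @id in_tail_container_logs
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        @type none          # placeholder; use json or a CRI parser depending on the runtime
      </parse>
    </source>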
I can't verify this is happening yet, but it may be that the files are being rotated so fast that fluentd doesn't have enough time to read them before they are compressed. As the kubelet rotates the logs, it renames the file 0.log to 0.log.<timestamp>. It keeps that log uncompressed for one log rotation and then compresses it. If fluentd falls too far behind, it may not be able to read the log before it is compressed. Here is the kubelet code where this happens: https://github.com/kubernetes/kubernetes/blob/f545438bd347d2ac853b03983576bf0a6f1cc98b/pkg/kubelet/logs/container_log_manager.go#L400-L421. I assume there would be an error in the fluentd logs if it could no longer read the log file, but I haven't seen such an error in my testing so far. There is no way to disable compression, so I don't have a way to test my theory. |
I'm trying to reproduce this in a local environment. Was your environment on AWS? I'm going to try to reproduce it on that environment as well. |
Yes, the environment was on AWS. You can use this eksctl configuration file to provision a similar environment. You can adjust the maximum size of the log file by changing the value of containerLogMaxSize. The default is 10Mi. The default containerLogMaxWorkers is 1. I also changed the storage type from gp3 to io1 because I was using a file buffer and wanted a disk with better I/O characteristics. You can change it back to gp3 if you want.

# An advanced example of ClusterConfig object with customised nodegroups:
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: logging
  region: us-west-2
  version: "1.30"
nodeGroups:
  - name: ng3
    instanceType: m5.4xlarge
    desiredCapacity: 2
    privateNetworking: true
    ssh:
      enableSsm: true
    kubeletExtraConfig:
      containerLogMaxWorkers: 5
      containerLogMaxSize: "50Mi"
    ebsOptimized: true
    volumeType: io1
iam:
  withOIDC: true
accessConfig:
  authenticationMode: API_AND_CONFIG_MAP
vpc:
  nat:
    gateway: Single

If you send the logs to CloudWatch, you'll need to use IRSA or pod identities to assign an IAM role to the pod. |
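For the IRSA part, eksctl can create the service account and its IAM role in the same ClusterConfig; a sketch of that section (the service account name and attached policy are placeholders and depend on which CloudWatch plugin you use):

iam:
  withOIDC: true
  serviceAccounts:
    # Hypothetical service account for the fluentd DaemonSet
    - metadata:
        name: fluentd
        namespace: kube-system
      attachPolicyARNs:
        - "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"

The DaemonSet's pod spec then has to reference that service account so the pods pick up the role.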
If the log files are rotated in a shorter time than the period specified in rotate_wait, that could cause this kind of problem. |
If you increase the size limit of the rotated file, then because fluentd reads more slowly than the logs are written, at some point you lose one of the rotated files. |
Describe the bug
Under heavy log volume, e.g. >10k log entries per second, fluentd consistently drops logs. It may be related to log rotation (on Kubernetes). When I ran a load test, I saw the following entries in the fluentd logs:
When I added follow_inodes=true and rotate_wait=0 to the container configuration, the errors went away, but large chunks of logs were still missing and the following entries appeared in the fluentd logs.
I am running the latest version of the fluentd kubernetes daemonset for CloudWatch, fluent/fluentd-kubernetes-daemonset:v1.17.1-debian-cloudwatch-1.2. During the test, both memory and CPU utilization for fluentd remained fairly low.
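For context, the in_tail source block with those two settings looks roughly like this (the other options shown are assumptions based on the daemonset defaults, not the full container.conf):

<source>
  @type tail
  @id in_tail_container_logs
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  follow_inodes true    # watch files by inode so rotation renames are tracked
  rotate_wait 0         # stop watching a rotated-out file immediately
  <parse>
    @type none          # placeholder; use json or a CRI parser depending on the runtime
  </parse>
</source>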
To Reproduce
Run multiple replicas of the following program:
Here's the deployment for the test application:
Here's the container.conf file for fluentd:
Expected behavior
The test application assigns a sequence number to each log entry. I have a Python notebook that flattens the JSON log output, sorts the logs by sequence number, then finds gaps in the sequence. This is how I know that fluentd is dropping logs. If everything is working as it should, there should be no log loss.
I ran the same tests with Fluent Bit and experienced no log loss.
Your Environment
Your Configuration
Your Error Log
Additional context
No response