How to troubleshoot memory issues in etcd #82

Open
iamnst19 opened this issue Jun 26, 2024 · 6 comments

Comments

@iamnst19

Hi, I would like to know how to troubleshoot memory issues in etcd, and how to mitigate them.

@Quentin-M
Owner

Hey!

As you said, you'd be looking at etcd itself, since the operator's own memory usage is minimal; it's best to refer to the etcd repository, docs, and code. Note that etcd is started as an embedded server inside the etcd-cloud-operator process, so it may at first look as if the operator is the one using the memory.

@iamnst19
Author

I think the memory spike is due to the S3 backup. How do I disable the S3 backup? Also, how and where do I need to add profiling (https://github.com/google/pprof) to check the memory profile?

@Quentin-M
Owner

The snapshot provider streams the data from etcd to the snapshot destination, so I'd expect it to be fine as long as everything is implemented correctly, unless etcd itself has a memory spike as part of the save somehow. Do you have a memory chart?

Disabling S3 snapshots is not recommended, as this will cripple your ability to do disaster recovery, unless you enable the file backup provider backed by separate, reliable storage. By default, the operator requires a snapshot provider.

To enable pprof, you'd want to inject it in the main here behind a command-line flag:

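import (
  "net/http"
  _ "net/http/pprof" // side effect: registers /debug/pprof/* on http.DefaultServeMux
)

// Assumes flagPprof is a *string (e.g. from flag.String) holding a listen
// address such as "localhost:6060"; an empty value leaves pprof disabled.
if flagPprof != nil && len(*flagPprof) > 0 {
  go func() {
    zap.S().Infof("enabling pprof on %s", *flagPprof)
    // A nil handler serves http.DefaultServeMux, where pprof registered itself.
    if err := http.ListenAndServe(*flagPprof, nil); err != nil {
      zap.S().Errorf("pprof server stopped: %v", err)
    }
  }()
}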
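
For anyone wanting to try this outside the operator first, here is a minimal, self-contained sketch of the same idea. It is not the operator's actual code: the `-pprof-addr` flag name and the `startPprof` helper are made up for illustration, and it logs with the standard library instead of zap so it has no external dependencies.

```go
package main

import (
	"flag"
	"log"
	"net/http"
	_ "net/http/pprof" // side effect: registers /debug/pprof/* on http.DefaultServeMux
	"os"
	"os/signal"
)

// startPprof serves the pprof handlers on addr in a background goroutine.
// It returns false, and starts nothing, when addr is empty.
// (Hypothetical helper for this sketch, not part of the operator.)
func startPprof(addr string) bool {
	if addr == "" {
		return false
	}
	go func() {
		log.Printf("enabling pprof on %s", addr)
		// A nil handler means http.DefaultServeMux, where pprof registered itself.
		if err := http.ListenAndServe(addr, nil); err != nil {
			log.Printf("pprof server stopped: %v", err)
		}
	}()
	return true
}

func main() {
	// Hypothetical flag name chosen for this sketch.
	pprofAddr := flag.String("pprof-addr", "", "address to serve pprof on, e.g. localhost:6060 (empty disables it)")
	flag.Parse()

	if startPprof(*pprofAddr) {
		// Stand-in for the real program's work: wait for Ctrl-C.
		stop := make(chan os.Signal, 1)
		signal.Notify(stop, os.Interrupt)
		<-stop
	}
}
```

With the server listening on, say, localhost:6060, a heap profile can be pulled with `go tool pprof http://localhost:6060/debug/pprof/heap`. Binding to localhost rather than a public interface is the safer default, since the pprof endpoints expose process internals.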

@iamnst19
Author

[Image: Screenshot 2024-07-11 at 11 18 51 AM]

The baseline has shifted and memory keeps piling up, and I can see that these spikes happen during the backup to S3. Can I make an adjustment to this configuration?

snapshot:
    provider: s3 # This should be configured to S3 in any real environments.
    interval: 30m
    ttl: 24h

I'd like the backup to be less aggressive. Should I increase the interval or reduce the TTL? If so, what values would you recommend here?

@iamnst19
Author

Ideally, this backup activity should happen during off-peak hours. How can I schedule the backup to run once a week during off-peak hours?

@iamnst19
Author

Can you please help here?
