
[Bug]: Milvus cannot operate stably for a long time #38745

Open

hou718945431 opened this issue Dec 25, 2024 · 9 comments

@hou718945431

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.3.9
- Deployment mode (standalone or cluster): standalone
- MQ type (rocksmq, pulsar or kafka): rocksmq
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS): CentOS
- CPU/Memory:
- GPU:
- Others:

Current Behavior

After running Milvus 2.3.9 for six months, I found that etcd storage had reached 2 GB. According to an existing GitHub issue, an etcd compaction policy can be configured, but I found that the space is not released after etcd compaction; I have to run defrag to release it. Moreover, the etcd service is unavailable while defrag runs. Does running Milvus for a long time therefore require regularly stopping Milvus and defragmenting etcd?
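
For reference, a minimal sketch of the compaction-plus-defrag sequence described above, run with etcdctl (the endpoint URL is a placeholder and the use of jq is an assumption; adjust to your deployment):

    # Read the current revision from the endpoint status (assumes jq is installed)
    REV=$(etcdctl --endpoints=http://127.0.0.1:2379 endpoint status --write-out=json \
      | jq -r '.[0].Status.header.revision')

    # Compact the key history up to that revision; this frees space inside the
    # db file but does not shrink the file on disk
    etcdctl --endpoints=http://127.0.0.1:2379 compact "$REV"

    # Defragment to return the freed space to the filesystem; the member is
    # blocked (unavailable) while this runs
    etcdctl --endpoints=http://127.0.0.1:2379 defrag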

Expected Behavior

I hope Milvus can operate stably for a long time

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

@hou718945431 hou718945431 added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 25, 2024
@yanliang567
Contributor

/assign @LoveEachDay

/unassign

@yanliang567 yanliang567 added help wanted Extra attention is needed and removed kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 25, 2024
@tosone

tosone commented Dec 26, 2024

@hou718945431 Based on my experience, you need to back up Milvus first and then re-import it.

@xiaofan-luan
Collaborator

Please upgrade to 2.4.19. The older version leaked some meta information that was never cleaned up.

It's recommended to upgrade to 2.3.20 first, run it for a while, and then upgrade to 2.4.19.

@xiaofan-luan
Collaborator

Fragmentation might not be the only reason. You can use birdwatcher to check which directories are left in etcd.
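
If birdwatcher is not at hand, a rough substitute (my own suggestion, not part of the comment above) is to count keys per prefix directly with etcdctl; this sketch assumes the default Milvus rootPath "by-dev", so adjust the prefix and endpoint if yours differ:

    # List all keys under the Milvus root path and count them per second-level prefix
    etcdctl --endpoints=http://127.0.0.1:2379 get "by-dev/" --prefix --keys-only \
      | awk -F/ 'NF >= 3 {print $1"/"$2"/"$3}' \
      | sort | uniq -c | sort -rn | head -n 20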

@xiaofan-luan
Collaborator

Also, you can defrag your etcd members one by one. @hou718945431 May I know how much storage you saved after the etcd defrag?
Defrag does help, but I guess this is not the main cause.

@hou718945431
Author

[{"Endpoint":"http://172.29.235.44:2378","Status":{"header":{"cluster_id":14841639068965178418,"member_id":10276657743932975437,"revision":14557492,"raft_term":2},"version":"3.5.5","dbSize":2139557888,"leader":10276657743932975437,"raftIndex":4399568,"raftTerm":2,"raftAppliedIndex":4399568,"dbSizeInUse":1486848}}]

After applying etcd's compaction policy, I can see that etcd's dbSizeInUse is much smaller than dbSize, but the db file is still 2 GB unless I use defrag to repeatedly rebuild the db file.
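
As a quick check of how much of that file is reclaimable, the gap between dbSize and dbSizeInUse can be computed from the same status output (a sketch assuming jq is available; the endpoint is the one shown above):

    # dbSize is the bolt db file size on disk; dbSizeInUse is the live data.
    # The difference is space that only defrag can return to the filesystem.
    etcdctl --endpoints=http://172.29.235.44:2378 endpoint status --write-out=json \
      | jq '.[0].Status | {dbSize, dbSizeInUse, reclaimable: (.dbSize - .dbSizeInUse)}'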

@hou718945431
Author

I used an etcd visualization tool and, comparing against the source code, the content in etcd looks normal, but there are many keys with multiple revisions. I guess the growth may be due to too many revisions. Can etcd reuse the space freed by compaction?

@xiaofan-luan
Collaborator

xiaofan-luan commented Dec 30, 2024

    etcd:
      container_name: milvus-etcd
      image: quay.io/coreos/etcd:v3.5.0
      environment:
        - ETCD_AUTO_COMPACTION_MODE=revision
        - ETCD_AUTO_COMPACTION_RETENTION=1000
        - ETCD_QUOTA_BACKEND_BYTES=4294967296
        - ETCD_SNAPSHOT_COUNT=50000
In the etcd configuration, consider adjusting the value of ETCD_AUTO_COMPACTION_RETENTION.

Additionally, does your use case involve a high frequency of creating tables or partitions?

There is nothing we can do if it's due to etcd fragmentation; the only thing we can do is probably defrag one etcd server at a time to reduce bolt fragmentation.
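
A minimal sketch of that rolling defrag, assuming a hypothetical three-member cluster (a standalone deployment with a single etcd would simply pause briefly during its one defrag call):

    # Defragment one member at a time so the others keep serving requests;
    # each member is blocked only while its own defrag runs.
    for ep in http://etcd-0:2379 http://etcd-1:2379 http://etcd-2:2379; do
      etcdctl --endpoints="$ep" defrag
      etcdctl --endpoints="$ep" endpoint status --write-out=table
    done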

@xiaofan-luan
Collaborator

Maybe we can look at which keys are updated frequently and see if there is a way to improve.
