
[Bug]: Milvus cannot operate stably for a long time #38745

Open

hou718945431 opened this issue Dec 25, 2024 · 9 comments

@hou718945431

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.3.9
- Deployment mode (standalone or cluster): standalone
- MQ type (rocksmq, pulsar or kafka): rocksmq
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS): CentOS
- CPU/Memory:
- GPU:
- Others:

Current Behavior

After running Milvus 2.3.9 for six months, I found that etcd storage had reached 2 GB. According to an existing GitHub issue, an etcd compaction policy can be configured, but I found that the space is not released after etcd compaction; I have to run defrag to release it. Moreover, the etcd service is unavailable while defrag runs. Does running Milvus for a long time therefore require regularly stopping Milvus and defragmenting etcd?
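
For reference, a minimal sketch of the compaction-plus-defrag sequence described above, run with etcdctl (the endpoint URL is a placeholder and the use of jq is an assumption; adjust to your deployment):

    # Read the current revision from the endpoint status (assumes jq is installed)
    REV=$(etcdctl --endpoints=http://127.0.0.1:2379 endpoint status --write-out=json \
      | jq -r '.[0].Status.header.revision')

    # Compact the key history up to that revision; this frees space inside the
    # db file but does not shrink the file on disk
    etcdctl --endpoints=http://127.0.0.1:2379 compact "$REV"

    # Defragment to return the freed space to the filesystem; the member is
    # blocked (unavailable) while this runs
    etcdctl --endpoints=http://127.0.0.1:2379 defrag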

Expected Behavior

I hope Milvus can operate stably for a long time

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

@hou718945431 hou718945431 added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 25, 2024
@yanliang567
Contributor

/assign @LoveEachDay

/unassign

@yanliang567 yanliang567 added help wanted Extra attention is needed and removed kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 25, 2024
@tosone

tosone commented Dec 26, 2024

@hou718945431 Based on my experience, you need to back up Milvus first and then re-import it.

@xiaofan-luan
Collaborator

Please upgrade to 2.4.19. The older version leaked some meta information that was never cleaned up.

It's recommended to upgrade to 2.3.20 first, run it for a while, and then upgrade to 2.4.19.

@xiaofan-luan
Collaborator

Fragmentation might not be the only reason. You can use birdwatcher to check which directories are left in etcd.
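
If birdwatcher is not at hand, a rough substitute (my own suggestion, not part of the comment above) is to count keys per prefix directly with etcdctl; this sketch assumes the default Milvus rootPath "by-dev", so adjust the prefix and endpoint if yours differ:

    # List all keys under the Milvus root path and count them per second-level prefix
    etcdctl --endpoints=http://127.0.0.1:2379 get "by-dev/" --prefix --keys-only \
      | awk -F/ 'NF >= 3 {print $1"/"$2"/"$3}' \
      | sort | uniq -c | sort -rn | head -n 20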

@xiaofan-luan
Collaborator

Also, you can defrag your etcd members one by one. @hou718945431 May I know how much storage you saved after the etcd defrag?
Defrag does help, but I guess this is not the main cause.

@hou718945431
Author

[{"Endpoint":"http://172.29.235.44:2378","Status":{"header":{"cluster_id":14841639068965178418,"member_id":10276657743932975437,"revision":14557492,"raft_term":2},"version":"3.5.5","dbSize":2139557888,"leader":10276657743932975437,"raftIndex":4399568,"raftTerm":2,"raftAppliedIndex":4399568,"dbSizeInUse":1486848}}]

After applying etcd's compaction policy, I can see that etcd's dbSizeInUse is much smaller than dbSize, but the db file is still 2 GB unless I use defrag to repeatedly rebuild the db file.
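
As a quick check of how much of that file is reclaimable, the gap between dbSize and dbSizeInUse can be computed from the same status output (a sketch assuming jq is available; the endpoint is the one shown above):

    # dbSize is the bolt db file size on disk; dbSizeInUse is the live data.
    # The difference is space that only defrag can return to the filesystem.
    etcdctl --endpoints=http://172.29.235.44:2378 endpoint status --write-out=json \
      | jq '.[0].Status | {dbSize, dbSizeInUse, reclaimable: (.dbSize - .dbSizeInUse)}'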

@hou718945431
Author

I used an etcd visualization tool and, comparing against the source code, the content in etcd looks normal, but there are many keys with multiple revisions. I guess the growth may be due to too many revisions. Can etcd reuse the space freed by compaction?

@xiaofan-luan
Collaborator

xiaofan-luan commented Dec 30, 2024

    etcd:
      container_name: milvus-etcd
      image: quay.io/coreos/etcd:v3.5.0
      environment:
        - ETCD_AUTO_COMPACTION_MODE=revision
        - ETCD_AUTO_COMPACTION_RETENTION=1000
        - ETCD_QUOTA_BACKEND_BYTES=4294967296
        - ETCD_SNAPSHOT_COUNT=50000
In the etcd configuration, consider adjusting the value of ETCD_AUTO_COMPACTION_RETENTION.

Additionally, does your use case involve a high frequency of creating tables or partitions?

There is nothing we can do if it's due to etcd fragmentation; the only thing we can do is probably defrag one etcd server at a time to reduce bolt fragmentation.
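
A minimal sketch of that rolling defrag, assuming a hypothetical three-member cluster (a standalone deployment with a single etcd would simply pause briefly during its one defrag call):

    # Defragment one member at a time so the others keep serving requests;
    # each member is blocked only while its own defrag runs.
    for ep in http://etcd-0:2379 http://etcd-1:2379 http://etcd-2:2379; do
      etcdctl --endpoints="$ep" defrag
      etcdctl --endpoints="$ep" endpoint status --write-out=table
    done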

@xiaofan-luan
Collaborator

Maybe we can look at which keys are updated frequently and see if there is a way to improve.
