[Bug]: Query after Insertion timed out in v2.5.0-beta #38585

Open
1 task done
Andy6132024 opened this issue Dec 19, 2024 · 2 comments
Assignees
Labels
kind/bug Issues or changes related a bug triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@Andy6132024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: v2.5.0-beta
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): kafka
- SDK version(e.g. pymilvus v2.0.0rc2): v2.5.0
- OS(Ubuntu or CentOS): RockyLinux8
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

I ran tasks that insert around 2k entities into a collection concurrently (concurrency of 12). Each task inserted only one entity. In the Stage environment, where Milvus has been upgraded to v2.5.0-beta, a query issued after the insertion timed out after roughly 10 minutes. In the Prod environment, where Milvus is still at v2.4.15, the same query returned results almost immediately.
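The workload described above can be approximated with a small driver like the one below. This is only a sketch, not the reporter's actual script: `run_single_row_inserts` and the `insert_one` callable are hypothetical names, and the row layout is made up. With pymilvus, `insert_one` could wrap something like `client.insert(collection_name=..., data=[row])`, but here it is left as an injected callable so the sketch runs without a Milvus deployment.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_single_row_inserts(
    insert_one: Callable[[Dict], None],
    rows: List[Dict],
    concurrency: int = 12,
) -> float:
    """Send every row as its own insert request; return elapsed seconds."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # list() drains the iterator so any worker exception is re-raised here.
        list(pool.map(insert_one, rows))
    return time.monotonic() - start


if __name__ == "__main__":
    # Stand-in sink (assumption): replace with a real single-row insert call.
    sent: List[Dict] = []
    rows = [{"id": i, "vector": [0.0, 0.0, 0.0, 0.0]} for i in range(2000)]
    elapsed = run_single_row_inserts(sent.append, rows)
    print(f"{len(sent)} single-row inserts in {elapsed:.3f}s")
```

Issuing a query right after this driver returns, and timing it, would mirror the comparison between the two environments.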

I enabled TimeTick protection in the Stage environment and noticed that the TimeTick lag rose above 3 minutes during the insertion and stayed elevated for about 10 minutes before gradually subsiding to its normal level. At the same time, the TimeTick lag recorded at the QueryNode (for consumed inserts) also rose to a couple of minutes. The evidence seems to point to slower consumption from the DML channel in the QueryNode.

(Screenshot: tt-delay)

I would appreciate anyone looking into this issue, since it could block our upgrade to v2.5+ in the Prod environment.

Expected Behavior

TimeTick lag should not increase noticeably during insertion.

Steps To Reproduce

No response

Milvus Log

[2024/12/18 10:57:54.092 +00:00] [WARN] [querynodev2/handlers.go:227] ["failed to query on delegator"] [traceID=62bcacd674186cd5d570ed04369a103f] [msgID=454692378587327825] [collectionID=454692378588586550] [channel=by-dev-rootcoord-dml_12_454692378588586550v0] [scope=All] [error="context canceled"]
[2024/12/18 10:57:54.092 +00:00] [WARN] [delegator/delegator.go:563] ["delegator query failed to wait tsafe"] [traceID=62bcacd674186cd5d570ed04369a103f] [collectionID=454692378588586550] [channel=by-dev-rootcoord-dml_12_454692378588586550v0] [replicaID=454692378786922498] [error="context canceled"]
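When collecting more QueryNode logs for triage, it can help to pull out the bracketed fields (channel, collectionID, error) so that lines can be grouped per channel. The parser below is a minimal sketch that assumes every line follows the layout of the two lines above: time, level, caller, quoted message, then `key=value` fields. `parse_log_line` is a hypothetical helper, not part of Milvus.

```python
import re
from typing import Dict

# Leading four bracketed groups: time, level, caller, message.
HEADER = re.compile(
    r'^\[(?P<time>[^\]]+)\] \[(?P<level>[^\]]+)\] '
    r'\[(?P<caller>[^\]]+)\] \[(?P<message>"[^"]*"|[^\]]*)\]'
)
# Trailing [key=value] groups; values may be double-quoted.
FIELD = re.compile(r'\[(?P<key>\w+)=(?P<value>"[^"]*"|[^\]]*)\]')


def parse_log_line(line: str) -> Dict[str, str]:
    """Split one log line into a flat dict of header parts and fields."""
    header = HEADER.match(line)
    if header is None:
        raise ValueError("unrecognized log line layout")
    record = {k: v.strip('"') for k, v in header.groupdict().items()}
    # Scan only the text after the header for key=value fields.
    for field in FIELD.finditer(line, header.end()):
        record[field.group("key")] = field.group("value").strip('"')
    return record
```

Filtering the parsed records on `message == "delegator query failed to wait tsafe"` and grouping by `channel` would show whether the waits are confined to one DML channel.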

Anything else?

No response

@Andy6132024 Andy6132024 added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 19, 2024
@yanliang567
Contributor

yanliang567 commented Dec 20, 2024

@Andy6132024 May I ask why you insert only 1 entity per insert request? I am asking because Milvus would generate many small segments if you did that, which keeps the system busy with compaction and tt sync.

/assign @Andy6132024
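One way to avoid single-entity requests is to buffer rows on the client and send them in batches. The helper below is a sketch under assumptions: `batched_rows`, `insert_in_batches`, and the default batch size of 500 are illustrative choices, and `insert_many` stands in for whatever call sends a list of rows (with pymilvus, for example, a wrapper around `client.insert(collection_name=..., data=batch)`).

```python
from typing import Callable, Dict, Iterable, Iterator, List


def batched_rows(rows: Iterable[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Yield consecutive lists of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[Dict] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the final, possibly shorter, batch.
        yield batch


def insert_in_batches(
    insert_many: Callable[[List[Dict]], None],
    rows: Iterable[Dict],
    batch_size: int = 500,
) -> int:
    """Send rows in batches; return the number of insert requests made."""
    requests = 0
    for batch in batched_rows(rows, batch_size):
        insert_many(batch)
        requests += 1
    return requests
```

With the 2k entities from the report and a batch size of 500, this makes 4 insert requests instead of 2000.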

@yanliang567 yanliang567 added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 20, 2024
@yanliang567 yanliang567 removed their assignment Dec 24, 2024
@xiaofan-luan
Collaborator

I think this is definitely a potential problem.
Could you provide logs so we can investigate? Especially those from the QueryNode that has this tt log.
