
[Bug]: [benchmark][cluster] Load collection(100million data) raises error show collection failed: memory limit exceeded[predict=3.9875015e+10][limit=3.435974e+10] #38839

Status: Open
wangting0128 opened this issue Dec 30, 2024 · 3 comments
Assignees: liliu-z
Labels: kind/bug (Issues or changes related to a bug) · test/benchmark (benchmark test) · triage/accepted (Indicates an issue or PR is ready to be actively worked on.)
Milestone: 2.4.20
@wangting0128
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:2.4-20241227-2f208ebc-amd64
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):pulsar    
- SDK version(e.g. pymilvus v2.0.0rc2):2.4.10rc9
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

argo task: weekly-stab-1735430400
test case name: test_concurrent_locust_100m_diskann_ddl_dql_filter_cluster

server:

NAME                                                              READY   STATUS                   RESTARTS          AGE     IP              NODE         NOMINATED NODE   READINESS GATES
weekly-stab-17330400-2-37-9397-etcd-0                             1/1     Running                  0                 7h28m   10.104.34.85    4am-node37   <none>           <none>
weekly-stab-17330400-2-37-9397-etcd-1                             1/1     Running                  0                 7h28m   10.104.16.183   4am-node21   <none>           <none>
weekly-stab-17330400-2-37-9397-etcd-2                             1/1     Running                  0                 7h28m   10.104.21.123   4am-node24   <none>           <none>
weekly-stab-17330400-2-37-9397-milvus-datanode-79b7dc95c9-bc5q8   1/1     Running                  3 (7h27m ago)     7h28m   10.104.13.194   4am-node16   <none>           <none>
weekly-stab-17330400-2-37-9397-milvus-indexnode-795d7cbbc89m4lz   1/1     Running                  3 (7h27m ago)     7h28m   10.104.9.107    4am-node14   <none>           <none>
weekly-stab-17330400-2-37-9397-milvus-mixcoord-756bf4f5b7-9ch75   1/1     Running                  3 (7h27m ago)     7h28m   10.104.13.197   4am-node16   <none>           <none>
weekly-stab-17330400-2-37-9397-milvus-proxy-6d974df669-g24vn      1/1     Running                  3 (7h27m ago)     7h28m   10.104.14.59    4am-node18   <none>           <none>
weekly-stab-17330400-2-37-9397-milvus-querynode-54fb48c8d4rnl9t   1/1     Running                  3 (7h27m ago)     7h28m   10.104.14.60    4am-node18   <none>           <none>
weekly-stab-17330400-2-37-9397-minio-0                            1/1     Running                  0                 7h28m   10.104.34.87    4am-node37   <none>           <none>
weekly-stab-17330400-2-37-9397-minio-1                            1/1     Running                  0                 7h28m   10.104.26.253   4am-node32   <none>           <none>
weekly-stab-17330400-2-37-9397-minio-2                            1/1     Running                  0                 7h28m   10.104.16.188   4am-node21   <none>           <none>
weekly-stab-17330400-2-37-9397-minio-3                            1/1     Running                  0                 7h28m   10.104.21.126   4am-node24   <none>           <none>
weekly-stab-17330400-2-37-9397-pulsarv3-bookie-0                  1/1     Running                  0                 7h28m   10.104.26.254   4am-node32   <none>           <none>
weekly-stab-17330400-2-37-9397-pulsarv3-bookie-1                  1/1     Running                  0                 7h28m   10.104.16.189   4am-node21   <none>           <none>
weekly-stab-17330400-2-37-9397-pulsarv3-bookie-2                  1/1     Running                  0                 7h28m   10.104.34.97    4am-node37   <none>           <none>
weekly-stab-17330400-2-37-9397-pulsarv3-bookie-init-zmxp9         0/1     Completed                0                 7h28m   10.104.6.109    4am-node13   <none>           <none>
weekly-stab-17330400-2-37-9397-pulsarv3-broker-0                  1/1     Running                  0                 7h28m   10.104.13.198   4am-node16   <none>           <none>
weekly-stab-17330400-2-37-9397-pulsarv3-broker-1                  1/1     Running                  0                 7h28m   10.104.6.112    4am-node13   <none>           <none>
weekly-stab-17330400-2-37-9397-pulsarv3-proxy-0                   1/1     Running                  0                 7h28m   10.104.6.110    4am-node13   <none>           <none>
weekly-stab-17330400-2-37-9397-pulsarv3-proxy-1                   1/1     Running                  0                 7h28m   10.104.13.196   4am-node16   <none>           <none>
weekly-stab-17330400-2-37-9397-pulsarv3-pulsar-init-rlkwh         0/1     Completed                0                 7h28m   10.104.13.195   4am-node16   <none>           <none>
weekly-stab-17330400-2-37-9397-pulsarv3-recovery-0                1/1     Running                  0                 7h28m   10.104.6.111    4am-node13   <none>           <none>
weekly-stab-17330400-2-37-9397-pulsarv3-zookeeper-0               1/1     Running                  0                 7h28m   10.104.34.84    4am-node37   <none>           <none>
weekly-stab-17330400-2-37-9397-pulsarv3-zookeeper-1               1/1     Running                  0                 7h28m   10.104.16.187   4am-node21   <none>           <none>
weekly-stab-17330400-2-37-9397-pulsarv3-zookeeper-2               1/1     Running                  0                 7h28m   10.104.21.122   4am-node24   <none>           <none>
(screenshot 2024-12-30 12:02:01)

client log:

[2024-12-29 01:13:30,768 -  INFO - fouram]: [Base] Start inserting, ids: 99900000 - 99949999, data size: 100,000,000 (base.py:366)
[2024-12-29 01:13:32,660 -  INFO - fouram]: [Time] Collection.insert run in 1.8916s (api_request.py:49)
[2024-12-29 01:13:32,664 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_InZpr2ms): 99900000 (base.py:535)
[2024-12-29 01:13:32,678 -  INFO - fouram]: [Base] Start inserting, ids: 99950000 - 99999999, data size: 100,000,000 (base.py:366)
[2024-12-29 01:13:34,835 -  INFO - fouram]: [Time] Collection.insert run in 2.1557s (api_request.py:49)
[2024-12-29 01:13:34,838 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_InZpr2ms): 99900000 (base.py:535)
[2024-12-29 01:13:34,842 -  INFO - fouram]: [Base] Total time of insert: 3816.4014s, average number of vector bars inserted per second: 26202.6945, average time to insert 50000 vectors per time: 1.9082s (base.py:422)
[2024-12-29 01:13:34,853 -  INFO - fouram]: [Base] Start flush collection fouram_InZpr2ms (base.py:313)
[2024-12-29 01:13:37,461 -  INFO - fouram]: [Time] Collection.flush run in 2.6076s (api_request.py:49)
[2024-12-29 01:13:37,464 -  INFO - fouram]: [Base] Index params of fouram_InZpr2ms:[{'float_vector': {'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}}}] (base.py:491)
[2024-12-29 01:13:37,464 -  INFO - fouram]: [Base] Start build index of DISKANN for field:float_vector collection:fouram_InZpr2ms, params:{'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}}, kwargs:{} (base.py:472)
[2024-12-29 07:17:45,884 -  INFO - fouram]: [Time] Index run in 21848.4186s (api_request.py:49)
[2024-12-29 07:17:45,884 -  INFO - fouram]: [CommonCases] RT of build index DISKANN: 21848.4186s (common_cases.py:168)
[2024-12-29 07:17:45,884 -  INFO - fouram]: [CommonCases] Prepare index DISKANN done. (common_cases.py:170)
[2024-12-29 07:17:45,885 -  INFO - fouram]: [CommonCases] No scalar and vector fields need to be indexed. (common_cases.py:189)
[2024-12-29 07:17:45,902 -  INFO - fouram]: [Base] Index params of fouram_InZpr2ms:[{'float_vector': {'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}}}] (base.py:491)
[2024-12-29 07:17:45,912 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_InZpr2ms): 100000000 (base.py:535)
[2024-12-29 07:17:45,912 -  INFO - fouram]: [Base] Start load collection fouram_InZpr2ms,replica_number:1,kwargs:{} (base.py:319)
[2024-12-29 07:30:44,367 - ERROR - fouram]: RPC error: [get_loading_progress], <MilvusException: (code=3, message=show collection failed: memory limit exceeded[predict=3.9875015e+10][limit=3.435974e+10])>, <Time:{'RPC start': '2024-12-29 07:30:44.365278', 'RPC error': '2024-12-29 07:30:44.367694'}> (decorators.py:147)
[2024-12-29 07:30:44,368 - ERROR - fouram]: RPC error: [wait_for_loading_collection], <MilvusException: (code=3, message=show collection failed: memory limit exceeded[predict=3.9875015e+10][limit=3.435974e+10])>, <Time:{'RPC start': '2024-12-29 07:17:45.933423', 'RPC error': '2024-12-29 07:30:44.368890'}> (decorators.py:147)
[2024-12-29 07:30:44,369 - ERROR - fouram]: RPC error: [load_collection], <MilvusException: (code=3, message=show collection failed: memory limit exceeded[predict=3.9875015e+10][limit=3.435974e+10])>, <Time:{'RPC start': '2024-12-29 07:17:45.913372', 'RPC error': '2024-12-29 07:30:44.369042'}> (decorators.py:147)
[2024-12-29 07:30:44,372 - ERROR - fouram]: (api_response) : [Collection.load] <MilvusException: (code=3, message=show collection failed: memory limit exceeded[predict=3.9875015e+10][limit=3.435974e+10])>, [requestId: 00740a92-c5b5-11ef-90af-c6a13b199de6] (api_request.py:57)
[2024-12-29 07:30:44,372 - ERROR - fouram]: [CheckFunc] load request check failed, response:<MilvusException: (code=3, message=show collection failed: memory limit exceeded[predict=3.9875015e+10][limit=3.435974e+10])> (func_check.py:106)
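For reference, the two numbers in the error decode as follows (assuming both are byte counts, which is consistent with the queryNode memory limit of 32Gi in the server config below):

```python
# Decode "memory limit exceeded[predict=3.9875015e+10][limit=3.435974e+10]".
# Assumption: both values are bytes.
GIB = 2 ** 30

predict = 3.9875015e10   # predicted memory needed to load the collection
limit = 3.435974e10      # query node memory limit reported in the error

print(f"predict ~= {predict / GIB:.1f} GiB")   # ~37.1 GiB
print(f"limit   ~= {limit / GIB:.1f} GiB")     # ~32.0 GiB

# 32Gi (the queryNode memory limit in the server config) is exactly
# 34359738368 bytes, which rounds to the 3.435974e+10 printed in the log.
assert round(32 * GIB / 1e10, 6) == 3.435974
```

So the load is predicted to need roughly 37 GiB against a 32 GiB query node, which is why loading is rejected up front.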

Expected Behavior

The same case on the 2.5 branch succeeds:
image: 2.5-20241227-ef400227-amd64
(screenshot 2024-12-30 12:04:29)

Steps To Reproduce

1. create a collection with fields: 'id' (INT64, primary key), 'float_vector' (128-dim), 'float_1'
2. build a DISKANN index on field 'float_vector'
3. insert 100M rows
4. flush the collection
5. rebuild the index
6. load the collection <- raises the error
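A minimal pymilvus sketch of the steps above (hypothetical: the collection name, URI, and random data are illustrative; the pymilvus 2.4.x ORM API is assumed; the insert loop mirrors the 50,000-row batching shown in the client log):

```python
import random

# Parameters taken from the client config below.
DIM = 128
DATASET_SIZE = 100_000_000
NI_PER = 50_000                      # rows per insert batch (ni_per)
N_BATCHES = DATASET_SIZE // NI_PER   # 2000 insert batches
INDEX_PARAMS = {"index_type": "DISKANN", "metric_type": "L2", "params": {}}

def reproduce(uri="http://localhost:19530"):
    """Run steps 1-6 against a live Milvus cluster (requires pymilvus)."""
    from pymilvus import (
        connections, Collection, CollectionSchema, FieldSchema, DataType,
    )
    connections.connect(uri=uri)

    # 1. create the collection
    schema = CollectionSchema([
        FieldSchema("id", DataType.INT64, is_primary=True),
        FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=DIM),
        FieldSchema("float_1", DataType.FLOAT),
    ])
    coll = Collection("repro_100m", schema, shards_num=2)  # hypothetical name

    # 2. build the DISKANN index
    coll.create_index("float_vector", INDEX_PARAMS)

    # 3. insert 100M rows in 50k batches (the real test uses the sift
    #    dataset; random vectors here purely for illustration)
    for start in range(0, DATASET_SIZE, NI_PER):
        ids = list(range(start, start + NI_PER))
        vectors = [[random.random() for _ in range(DIM)] for _ in ids]
        floats = [float(i) for i in ids]
        coll.insert([ids, vectors, floats])

    coll.flush()                                      # 4. flush
    coll.create_index("float_vector", INDEX_PARAMS)   # 5. rebuild index
    coll.load(replica_number=1)                       # 6. load <- fails on 2.4
```

On the 2.4 image above, step 6 fails during `wait_for_loading_collection` with the memory-limit error; only the module-level constants run without a server.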

Milvus Log

No response

Anything else?

server config:

{
     "queryNode": {
          "resources": {
               "limits": {
                    "cpu": "8",
                    "memory": "32Gi",
                    "ephemeral-storage": "100Gi"
               },
               "requests": {
                    "cpu": "8",
                    "memory": "32Gi"
               }
          },
          "replicas": 1,
          "disk": {
               "size": {
                    "enabled": true
               }
          }
     },
     "indexNode": {
          "resources": {
               "limits": {
                    "cpu": "8",
                    "memory": "32Gi",
                    "ephemeral-storage": "100Gi"
               },
               "requests": {
                    "cpu": "8",
                    "memory": "32Gi"
               }
          },
          "replicas": 1,
          "disk": {
               "size": {
                    "enabled": true
               }
          }
     },
     "dataNode": {
          "resources": {
               "limits": {
                    "cpu": "2.0",
                    "memory": "8Gi"
               },
               "requests": {
                    "cpu": "2.0",
                    "memory": "5Gi"
               }
          },
          "replicas": 1
     },
     "cluster": {
          "enabled": true
     },
     "pulsarv3": {
          "enabled": true
     },
     "kafka": {
          "enabled": false
     },
     "minio": {
          "metrics": {
               "podMonitor": {
                    "enabled": true
               }
          }
     },
     "etcd": {
          "metrics": {
               "enabled": true,
               "podMonitor": {
                    "enabled": true
               }
          }
     },
     "metrics": {
          "serviceMonitor": {
               "enabled": true
          }
     },
     "log": {
          "level": "debug"
     },
     "image": {
          "all": {
               "repository": "harbor.milvus.io/milvus/milvus",
               "tag": "2.4-20241227-2f208ebc-amd64"
          }
     }
}

client config:

{
     "dataset_params": {
          "metric_type": "L2",
          "dim": 128,
          "dataset_name": "sift",
          "dataset_size": 100000000,
          "ni_per": 50000
     },
     "collection_params": {
          "other_fields": [
               "float_1"
          ],
          "shards_num": 2
     },
     "index_params": {
          "index_type": "DISKANN",
          "index_param": {}
     },
     "concurrent_params": {
          "concurrent_number": [
               20
          ],
          "during_time": "12h",
          "interval": 20
     },
     "concurrent_tasks": [
          {
               "type": "search",
               "weight": 20,
               "params": {
                    "nq": 10,
                    "top_k": 10,
                    "search_param": {
                         "search_list": 30
                    },
                    "expr": {
                         "float_1": {
                              "GT": -1,
                              "LT": 50000000
                         }
                    },
                    "timeout": 60,
                    "random_data": true
               }
          },
          {
               "type": "query",
               "weight": 10,
               "params": {
                    "ids": [
                         0,
                         1,
                         2,
                         3,
                         4,
                         5,
                         6,
                         7,
                         8,
                         9
                    ],
                    "timeout": 60
               }
          },
          {
               "type": "load",
               "weight": 1,
               "params": {
                    "replica_number": 1,
                    "timeout": 30
               }
          },
          {
               "type": "scene_test",
               "weight": 2,
               "params": {
                    "dim": 128,
                    "data_size": 3000,
                    "nb": 3000,
                    "index_type": "IVF_SQ8",
                    "index_param": {
                         "nlist": 2048
                    },
                    "metric_type": "L2"
               }
          }
     ]
}
@wangting0128 wangting0128 added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. test/benchmark benchmark test labels Dec 30, 2024
@wangting0128 wangting0128 added this to the 2.4.20 milestone Dec 30, 2024
@yanliang567
Contributor

/assign @liliu-z
/unassign

@sre-ci-robot sre-ci-robot assigned liliu-z and unassigned yanliang567 Dec 30, 2024
@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 30, 2024
@xiaofan-luan
Collaborator

@wangting0128
please check if #38793 fixes the issue

@wangting0128
Contributor Author

> @wangting0128 please check if #38793 fixes the issue

This issue is on the 2.4 branch.
Has this PR been cherry-picked to the 2.4 branch?
