import into task cannot start after replacing PD member #48740

Closed
aytrack opened this issue Nov 21, 2023 · 6 comments
Labels
affects-7.5 This bug affects the 7.5.x(LTS) versions. affects-8.1 This bug affects the 8.1.x(LTS) versions. component/ddl This issue is related to DDL of TiDB. severity/moderate type/bug The issue is confirmed as a bug.

Comments

@aytrack
Contributor

aytrack commented Nov 21, 2023

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. Deploy a v7.5 cluster with tiup cluster: 3 PD (pd-0, pd-1, pd-2), 2 TiDB, 3 TiKV.
  2. set global tidb_enable_dist_task = 1;
  3. Scale out PD to 6 replicas, adding pd-3, pd-4, pd-5.
  4. Use tiup cluster scale-in to remove pd-0, pd-1, pd-2.
  5. Run ADD INDEX and IMPORT INTO.

2. What did you expect to see? (Required)

Both the ADD INDEX and IMPORT INTO tasks succeed.

3. What did you see instead (Required)

ADD INDEX succeeds.
The IMPORT INTO task cannot start:

[2023/11/21 15:28:49.036 +08:00] [INFO] [s3.go:406] ["succeed to get bucket region from s3"] ["bucket region"=Beijing]
[2023/11/21 15:28:49.109 +08:00] [INFO] [region_request.go:1532] ["send request meet region error without retry"] [conn=721420312] [session_alias=] [req-ts=445789432473714689] [req-type=Cop] [region="{ region id: 1685, ver: 532, confVer: 23 }"] [replica-read-type=leader]
[stale-read=false] [request-sender="{rpcError:<nil>,replicaSelector: replicaSelector{selectorStateStr: accessKnownLeader, cacheRegionIsValid: false, replicaStatus: [peer: 1686, store: 7, isEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 1687, store: 6, isEpochStale: false, attempts: 1, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 1688, store: 8, isEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable]}}"] [retry-times=0] [total-backoff-ms=0] [total-backoff-times=0] [max-exec-timeout-ms=60000] [total-region-errors=1687-epoch_not_match:1]
[2023/11/21 15:29:00.112 +08:00] [INFO] [domain.go:2830] ["refreshServerIDTTL succeed"] [serverID=344] ["lease id"=6f6b8bed66fd3e2f]
[2023/11/21 15:29:25.964 +08:00] [INFO] [coprocessor.go:1330] ["[TIME_COP_PROCESS] resp_time:302.405682ms txnStartTS:18446744073709551615 region_id:298661 store_addr:tikv-3-peer:20160 kv_process_ms:295 kv_wait_ms:0 kv_read_ms:0 processed_versions:7052 total_versions:7053 rocksdb_delete_skipped_count:0 rocksdb_key_skipped_count:14103 rocksdb_cache_hit_count:21 rocksdb_read_count:3530 rocksdb_read_byte:72657070"]
[2023/11/21 15:29:49.067 +08:00] [WARN] [expensivequery.go:145] [expensive_query] [cost_time=60.030312153s] [conn=721420312] [user=root] [database=test] [txn_start_ts=0] [mem_max="0 Bytes (0 Bytes)"] [sql="IMPORT INTO `test`.`xxx` FROM 's3://xxx.*.csv?access-key=xxxxxx&endpoint=http%3A%2F%2Fks3-cn-beijing-internal.ksyuncs.com&force-path-style=false&provider=ks&region=Beijing&secret-access-key=xxxxxx' WITH __max_engine_size=_UTF8MB4'50g', thread=16, detached"] [session_alias=]
[2023/11/21 15:30:22.741 +08:00] [INFO] [coprocessor.go:1330] ["[TIME_COP_PROCESS] resp_time:300.225976ms txnStartTS:18446744073709551615 region_id:299301 store_addr:tikv-2-peer:20160 kv_process_ms:293 kv_wait_ms:0 kv_read_ms:0 processed_versions:6991 total_versions:6992 rocksdb_delete_skipped_count:0 rocksdb_key_skipped_count:13981 rocksdb_cache_hit_count:20 rocksdb_read_count:3500 rocksdb_read_byte:72038979"]
[2023/11/21 15:30:24.384 +08:00] [INFO] [coprocessor.go:1330] ["[TIME_COP_PROCESS] resp_time:310.922599ms txnStartTS:18446744073709551615 region_id:299317 store_addr:tikv-2-peer:20160 kv_process_ms:308 kv_wait_ms:0 kv_read_ms:0 processed_versions:7387 total_versions:7388

If I scale pd-0, pd-1, pd-2 back out into the cluster, the import task can start:

[2023/11/21 15:36:59.283 +08:00] [INFO] [pd_service_discovery.go:294] ["[pd] close pd service discovery client"]
[2023/11/21 15:36:59.287 +08:00] [INFO] [pd_service_discovery.go:606] ["[pd] update member urls"] [old-urls="[http://pd-0-peer:2379,http://pd-1-peer:2379,http://pd-2-peer:2379]"] [new-urls="[http://pd-0-peer:2379,http://pd-1-peer:2379,http://pd-2-peer:2379,http://pd-3-peer:2379,http://pd-4-peer:2379,http://pd-5-peer:2379]"]
[2023/11/21 15:36:59.287 +08:00] [INFO] [pd_service_discovery.go:632] ["[pd] switch leader"] [new-leader=http://pd-5-peer:2379] [old-leader=]
[2023/11/21 15:36:59.287 +08:00] [INFO] [pd_service_discovery.go:197] ["[pd] init cluster id"] [cluster-id=7303570552685293974]
[2023/11/21 15:36:59.288 +08:00] [INFO] [client.go:600] ["[pd] changing service mode"] [old-mode=UNKNOWN_SVC_MODE] [new-mode=PD_SVC_MODE]
[2023/11/21 15:36:59.288 +08:00] [INFO] [tso_client.go:230] ["[tso] switch dc tso global allocator serving address"] [dc-location=global] [new-address=http://pd-5-peer:2379]
[2023/11/21 15:36:59.288 +08:00] [INFO] [tso_dispatcher.go:318] ["[tso] tso dispatcher created"] [dc-location=global]
[2023/11/21 15:36:59.288 +08:00] [INFO] [client.go:648] ["[pd] service mode changed"] [old-mode=UNKNOWN_SVC_MODE] [new-mode=PD_SVC_MODE]
[2023/11/21 15:36:59.296 +08:00] [INFO] [local.go:698] ["multi ingest support"]
[2023/11/21 15:36:59.296 +08:00] [INFO] [table_import.go:597] ["use 0.8 of the storage size as default disk quota"] [table=xxx] [quota=374.2GB]
[2023/11/21 15:36:59.296 +08:00] [INFO] [scheduler.go:109] ["index writer memory size limit"] [type=ImportInto] [task-id=6] [step=import] [limit=54.86MiB]
[2023/11/21 15:36:59.302 +08:00] [INFO] [scheduler.go:116] ["run subtask start"] [type=ImportInto] [task-id=6] [step=import] [subtask-id=18]
[2023/11/21 15:36:59.304 +08:00] [INFO] [backend.go:246] ["open engine"] [engineTag=`test`.`xxx`:0] [engineUUID=1ccf9b40-64eb-5c3d-8efc-ee6a4ab85137]
[2023/11/21 15:36:59.305 +08:00] [INFO] [backend.go:246] ["open engine"] [engineTag=`test`.`xxx`:-1] [engineUUID=72296603-b7da-50b4-a77a-0f307e7cec37]
[2023/11/21 15:36:59.305 +08:00] [INFO] [subtask_executor.go:71] ["execute chunk"] [type=ImportInto] [table-id=110]
[2023/11/21 15:36:59.306 +08:00] [INFO] [subtask_executor.go:71] ["execute chunk"] [type=ImportInto] [table-id=110]

Before replacing the PD members (before reproduce step 3), IMPORT INTO works, but there are some error logs while importing data:

[2023/11/20 23:45:30.180 +08:00] [WARN] [pd_service_discovery.go:452] ["[pd] failed to get cluster id"] [url=http://pd-0-peer:2379,pd-1-peer:2379,pd-2-peer:2379] [error="[PD:client:ErrClientGetMember]error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp: address pd-0-peer:2379,pd-1-peer:2379,pd-2-peer:2379: too many colons in address\" target:pd-0-peer:2379,pd-1-peer:2379,pd-2-peer:2379 status:TRANSIENT_FAILURE: error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp: address pd-0-peer:2379,pd-1-peer:2379,pd-2-peer:2379: too many colons in address\" target:pd-0-peer:2379,pd-1-peer:2379,pd-2-peer:2379 status:TRANSIENT_FAILURE"]
[2023/11/20 23:45:30.787 +08:00] [INFO] [chunk_process.go:243] ["process chunk start"] [type=ImportInto] [table-id=102] [key=pinterest/10T/data3/test.item_core.104.csv:0]
[2023/11/20 23:45:31.180 +08:00] [WARN] [pd_service_discovery.go:452] ["[pd] failed to get cluster id"] [url=http://pd-0-peer:2379,pd-1-peer:2379,pd-2-peer:2379] [error="[PD:client:ErrClientGetMember]error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp: address pd-0-peer:2379,pd-1-peer:2379,pd-2-peer:2379: too many colons in address\" target:pd-0-peer:2379,pd-1-peer:2379,pd-2-peer:2379 status:TRANSIENT_FAILURE: error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp: address pd-0-peer:2379,pd-1-peer:2379,pd-2-peer:2379: too many colons in address\" target:pd-0-peer:2379,pd-1-peer:2379,pd-2-peer:2379 status:TRANSIENT_FAILURE"]
[2023/11/20 23:45:32.268 +08:00] [WARN] [pd_service_discovery.go:452] ["[pd] failed to get cluster id"] [url=http://pd-0-peer:2379,pd-1-peer:2379,pd-2-peer:2379] [error="[PD:client:ErrClientGetMember]error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp: address pd-0-peer:2379,pd-1-peer:2379,pd-2-peer:2379: too many colons in address\" target:pd-0-peer:2379,pd-1-peer:2379,pd-2-peer:2379 status:TRANSIENT_FAILURE: error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp: address pd-0-peer:2379,pd-1-peer:2379,pd-2-peer:2379: too many colons in address\" target:pd-0-peer:2379,pd-1-peer:2379,pd-2-peer:2379 status:TRANSIENT_FAILURE"]
[2023/11/20 23:45:33.180 +08:00] [WARN] [pd_service_discovery.go:452] ["[pd] failed to get cluster id"] [url=http://pd-0-peer:2379,pd-1-peer:2379,pd-2-peer:2379] [error="[PD:client:ErrClientGetMember]error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp: address pd-0-peer:2379,pd-1-peer:2379,pd-2-peer:2379: too many colons in address\" target:pd-0-peer:2379,pd-1-peer:2379,pd-2-peer:2379 status:TRANSIENT_FAILURE: error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp: address pd-0-peer:2379,pd-1-peer:2379,pd-2-peer:2379: too many colons in address\" target:pd-0-peer:2379,pd-1-peer:2379,pd-2-peer:2379 status:TRANSIENT_FAILURE"]

4. What is your TiDB version? (Required)

MySQL [test]> select tidb_version();
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tidb_version()                                                                                                                                                                                                                                                 |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Release Version: v7.5.0
Edition: Community
Git Commit Hash: 603b47c9917af415264f3de70359abadba2cd5bb
Git Branch: heads/refs/tags/v7.5.0
UTC Build Time: 2023-11-20 13:29:38
GoVersion: go1.21.3
Race Enabled: false
Check Table Before Drop: false
Store: tikv |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
@aytrack aytrack added type/bug The issue is confirmed as a bug. severity/major component/ddl This issue is related to DDL of TiDB. affects-7.5 This bug affects the 7.5.x(LTS) versions. labels Nov 21, 2023
@ti-chi-bot ti-chi-bot bot added may-affects-5.3 This bug maybe affects 5.3.x versions. may-affects-5.4 This bug maybe affects 5.4.x versions. may-affects-6.1 may-affects-6.5 may-affects-7.1 labels Nov 21, 2023
@aytrack
Contributor Author

aytrack commented Nov 21, 2023

Related: #48680

@lance6716
Contributor

url=http://pd-0-peer:2379,pd-1-peer:2379,pd-2-peer:2379

TiDB's PD path in the configuration is pd-0-peer:2379,pd-1-peer:2379,pd-2-peer:2379; I guess this is the cause. We will fix it when we integrate tikv/pd#7300 with lightning, so that lightning can use tidb-server's PD client rather than creating its own client by dialing the address.
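A minimal Go sketch of that failure mode, assuming the comma-joined PD path is handed to the dialer as a single host:port: net.SplitHostPort rejects it with the same "too many colons in address" error seen in the logs above, while splitting on commas first yields valid endpoints.

package main

import (
	"fmt"
	"net"
	"strings"
)

func main() {
	// The PD path as configured: three endpoints joined by commas.
	raw := "pd-0-peer:2379,pd-1-peer:2379,pd-2-peer:2379"

	// Treating the whole string as one host:port fails, matching the log above.
	if _, _, err := net.SplitHostPort(raw); err != nil {
		fmt.Println("dial as single address:", err) // "... too many colons in address"
	}

	// Splitting on commas first gives three endpoints a client could dial individually.
	for _, ep := range strings.Split(raw, ",") {
		host, port, err := net.SplitHostPort(ep)
		if err != nil {
			fmt.Println("bad endpoint:", err)
			continue
		}
		fmt.Printf("valid endpoint: host=%s port=%s\n", host, port)
	}
}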

@D3Hunter
Contributor

This is a config issue; the PD address should be a load-balancer address in this case. IMHO there's no need to fix this.

@CabinfeverB
Contributor

CabinfeverB commented Nov 24, 2023

This is a config issue; the PD address should be a load-balancer address in this case. IMHO there's no need to fix this.

Yes. However, I think it is necessary to ensure that the LB no longer contains the replaced PD addresses, because TiDB may access an invalid server when using the etcd client.
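A minimal sketch of what that looks like with a plain go.etcd.io/etcd/client/v3 client (an assumption for illustration; TiDB wraps its own client): endpoints that still point at the removed PD nodes keep being dialed until the client is re-synced against the live member list, and re-syncing only helps while at least one configured endpoint (for example the LB) still reaches a live member.

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Endpoints as originally configured; pd-0/1/2 no longer exist after the scale-in.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://pd-0-peer:2379", "http://pd-1-peer:2379", "http://pd-2-peer:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Sync replaces the client's endpoint list with the members the cluster reports,
	// so requests stop going to the removed PD nodes. If every configured endpoint is
	// already gone (or the LB still resolves to the removed nodes), this cannot recover.
	if err := cli.Sync(ctx); err != nil {
		fmt.Println("sync failed:", err)
		return
	}
	fmt.Println("endpoints after sync:", cli.Endpoints())
}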

@D3Hunter
Contributor

A load balancer is a single address, unless you have multiple load balancers.

@lance6716 lance6716 removed may-affects-5.3 This bug maybe affects 5.3.x versions. may-affects-5.4 This bug maybe affects 5.4.x versions. may-affects-6.1 may-affects-6.5 may-affects-7.1 labels Dec 20, 2023
ti-chi-bot bot pushed a commit that referenced this issue Dec 21, 2023
@ti-chi-bot ti-chi-bot added affects-7.1 This bug affects the 7.1.x(LTS) versions. affects-6.5 This bug affects the 6.5.x(LTS) versions. labels Mar 11, 2024
@lance6716 lance6716 removed affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. labels Mar 27, 2024
@ti-chi-bot ti-chi-bot added the affects-8.1 This bug affects the 8.1.x(LTS) versions. label Apr 9, 2024
@D3Hunter
Contributor

D3Hunter commented Apr 28, 2024

Users should make sure that the configured PD address, which might be an LB address, covers the current and any later PD addresses; otherwise, on a network issue TiDB might not be able to discover any PD and won't run correctly after a restart.
