[Help] Legacy network issue: after upgrading to 3.11.5, VMs cannot obtain an IP address via DHCP #21015

Open
saltfishh opened this issue Aug 14, 2024 · 17 comments
Labels
question Further information is requested stale state/awaiting processing

Comments

@saltfishh

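  • Version:
    3.11.5

  • Problem

After upgrading from 3.10.x to 3.11.5, newly created VMs cannot obtain an IP address via DHCP. If the VM's IP address is set manually, networking works normally.

The DNS server on previously created machines changed to 10.1.1.251. That address was the host's IP a long time ago and has since been replaced.

Below is a screenshot of the host in the management console:
2024-08-14_10-19
The host's management IP is actually 10.13.1.2. The IP was changed at some point, but the change was never synced to this page. It did not affect usage, so I left it alone.

Below is a screenshot of the IP subnet:

2024-08-14_10-22

How should this situation be handled?
Thanks.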

@saltfishh saltfishh added the question Further information is requested label Aug 14, 2024
@saltfishh
Author

All VMs currently use this VPC network:

2024-08-14_10-28

DHCP requests get no response:
2024-08-14_10-37

The VPC gateway is reachable.

@saltfishh
Author

Below is the yunion configuration:

 cat /etc/yunion/host.conf | grep 13
- em1/br0/10.13.1.2

@saltfishh
Author

climc host-list

[root@cloud-stack ~]# climc host-network-list
+--------------------------------------+--------------------------------------+------------+-------------------+
|             Baremetal_ID             |              Network_ID              |  IP_addr   |     Mac_addr      |
+--------------------------------------+--------------------------------------+------------+-------------------+
| 11673fb2-8208-4900-8949-ef51c7475a51 | 55a0a8dd-d39e-4f3d-82f7-bc93bd511e4d | 10.13.1.48 | 0c:c4:7a:15:de:67 |
| 3a63610b-acf6-48eb-8b35-99d682e73024 | ff45d55e-a921-4d4e-855b-f6ed25312ee6 | 10.13.1.41 | 0c:c4:7a:15:d8:21 |
| f41c02b6-c26f-4195-87d9-fe795fcdb647 | 5c992e96-3472-4109-88f6-6c0b84ed9bdf | 10.13.1.43 | 0c:c4:7a:15:de:6b |
| 674eccc0-73cb-44d9-8f72-7ad683bb53be | 032c9093-db2c-42c2-82d0-3bae3393aba1 | 10.13.1.46 | 0c:c4:7a:15:d6:d3 |
| f72dae90-f1e5-4c7d-850b-60543af4eb56 | 77959d3c-2f94-40b9-884c-6eeb17a6c533 | 10.13.1.47 | 0c:c4:7a:15:de:5f |
| 125b4f68-cc32-4727-8183-b57957f51cbb | 1d1d595c-b9b7-448a-83c7-d0b58b7163ba | 10.13.1.42 | 0c:c4:7a:15:de:61 |
| 0a3bd92c-4179-46ff-8d72-258230d6ae32 | 2918c3a8-ba07-489c-8543-8ccfda799bcd | 10.13.1.44 | 0c:c4:7a:15:d7:95 |
| c91af114-da81-4f7e-872b-cea3343b64a1 | 20a11415-dfd9-404f-8d88-ba9b209acefb | 10.1.1.251 | 18:66:da:eb:06:b0 |
+--------------------------------------+--------------------------------------+------------+-------------------+
***  Total: 8 Pages: 1 Limit: 20 Offset: 0 Page: 1  ***
  • The last row is the host with the wrong IP address.
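
A stale entry like this can be picked out of the table mechanically. The following is a minimal sketch, assuming the table layout shown above (four pipe-delimited columns with the IP in the third) and that every host is expected to sit under the prefix `10.13.1.`; the function name `hosts_outside_prefix` is hypothetical and not part of climc.

```shell
# Print "<Baremetal_ID> <IP_addr>" for every data row of a
# `climc host-network-list` table whose IP does not start with the prefix.
# Assumed usage: climc host-network-list | hosts_outside_prefix 10.13.1.
hosts_outside_prefix() {
  awk -F'|' -v prefix="$1" '
    # Data rows split into 6 fields on "|" (empty first and last field);
    # border lines and the "Total" footer contain no "|" and are skipped.
    NF == 6 {
      id = $2; gsub(/ /, "", id)
      ip = $4; gsub(/ /, "", ip)
      if (ip == "IP_addr") next            # skip the header row
      if (index(ip, prefix) != 1) print id, ip
    }
  '
}
```

Fed the table above with the prefix `10.13.1.`, this would list only the row whose IP is 10.1.1.251.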

@zexi
Member

zexi commented Aug 19, 2024

@saltfishh Please provide the logs of the default-vpcagent-xxxx pod.

@saltfishh
Author

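> @saltfishh Please provide the logs of the default-vpcagent-xxxx pod.

@zexi Thanks for the reply.

kubectl logs -n onecloud default-vpcagent-86596df55b-fg6k5 --tail 30 -f
While tailing the log, I had the VM send DHCP requests with dhclient. Below is the default-vpcagent log from that period: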

[info 2024-08-20 06:06:43 apihelper.(*APIHelper).doSync.func1(apihelper.go:156)] sync data done, changed: false, elapsed: 158.906037ms
[info 2024-08-20 06:06:53 apihelper.(*APIHelper).doSync.func1(apihelper.go:156)] sync data done, changed: false, elapsed: 164.933038ms
[info 2024-08-20 06:07:03 apihelper.(*APIHelper).doSync.func1(apihelper.go:156)] sync data done, changed: true, elapsed: 178.775684ms
[info 2024-08-20 06:07:03 ovn.(*Worker).Start(worker.go:87)] ovn: got new data from api helper
[info 2024-08-20 06:07:13 apihelper.(*APIHelper).doSync.func1(apihelper.go:156)] sync data done, changed: true, elapsed: 176.872726ms
[info 2024-08-20 06:07:13 ovn.(*Worker).Start(worker.go:87)] ovn: got new data from api helper
[info 2024-08-20 06:07:23 apihelper.(*APIHelper).doSync.func1(apihelper.go:156)] sync data done, changed: true, elapsed: 167.161157ms
[info 2024-08-20 06:07:23 ovn.(*Worker).Start(worker.go:87)] ovn: got new data from api helper
[info 2024-08-20 06:07:34 apihelper.(*APIHelper).doSync.func1(apihelper.go:156)] sync data done, changed: true, elapsed: 190.606661ms
[info 2024-08-20 06:07:34 ovn.(*Worker).Start(worker.go:87)] ovn: got new data from api helper
[info 2024-08-20 06:07:44 apihelper.(*APIHelper).doSync.func1(apihelper.go:156)] sync data done, changed: false, elapsed: 162.952171ms
[info 2024-08-20 06:07:54 apihelper.(*APIHelper).doSync.func1(apihelper.go:156)] sync data done, changed: true, elapsed: 171.311275ms
[info 2024-08-20 06:07:54 ovn.(*Worker).Start(worker.go:87)] ovn: got new data from api helper
[info 2024-08-20 06:08:04 apihelper.(*APIHelper).doSync.func1(apihelper.go:156)] sync data done, changed: true, elapsed: 185.706185ms
[info 2024-08-20 06:08:04 ovn.(*Worker).Start(worker.go:87)] ovn: got new data from api helper
[info 2024-08-20 06:08:14 apihelper.(*APIHelper).doSync.func1(apihelper.go:156)] sync data done, changed: true, elapsed: 177.506055ms
[info 2024-08-20 06:08:14 ovn.(*Worker).Start(worker.go:87)] ovn: got new data from api helper
[info 2024-08-20 06:08:24 apihelper.(*APIHelper).doSync.func1(apihelper.go:156)] sync data done, changed: false, elapsed: 166.105949ms
[info 2024-08-20 06:08:35 apihelper.(*APIHelper).doSync.func1(apihelper.go:156)] sync data done, changed: true, elapsed: 165.902545ms
[info 2024-08-20 06:08:35 ovn.(*Worker).Start(worker.go:87)] ovn: got new data from api helper
[info 2024-08-20 06:08:40 ovn.(*Worker).Start(worker.go:94)] ovn: tick check
[info 2024-08-20 06:08:45 apihelper.(*APIHelper).doSync.func1(apihelper.go:156)] sync data done, changed: false, elapsed: 155.524245ms
[info 2024-08-20 06:08:55 apihelper.(*APIHelper).doSync.func1(apihelper.go:156)] sync data done, changed: true, elapsed: 164.549536ms
[info 2024-08-20 06:08:55 ovn.(*Worker).Start(worker.go:87)] ovn: got new data from api helper
[info 2024-08-20 06:09:05 apihelper.(*APIHelper).doSync.func1(apihelper.go:156)] sync data done, changed: true, elapsed: 164.532903ms
[info 2024-08-20 06:09:05 ovn.(*Worker).Start(worker.go:87)] ovn: got new data from api helper
[info 2024-08-20 06:09:15 apihelper.(*APIHelper).doSync.func1(apihelper.go:156)] sync data done, changed: true, elapsed: 155.034944ms
[info 2024-08-20 06:09:15 ovn.(*Worker).Start(worker.go:87)] ovn: got new data from api helper
[info 2024-08-20 06:09:25 apihelper.(*APIHelper).doSync.func1(apihelper.go:156)] sync data done, changed: false, elapsed: 161.924653ms

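dhclient in the VM ran to completion without obtaining an IP, and the pod log showed no other output during that time.

Below are the log lines obtained by filtering for the keyword err: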

kubectl logs -n onecloud default-vpcagent-86596df55b-fg6k5 | grep -C10 -Ei 'err' 

[info 240814 02:42:02 options.parseOptions(options.go:336)] Use configuration file: /etc/yunion/vpcagent.conf
[warning 240814 02:42:02 structarg.(*ArgumentParser).parseJSONKeyValue(structarg.go:1215)] Cannot find argument api-sync-interval
[info 240814 02:42:02 options.parseOptions(options.go:359)] Set log level to "info"
[error 2024-08-14 02:42:02 auth.(*authManager).startRefreshRevokeTokens(auth.go:193)] refreshRevokeTokens: No valid admin token credential
[info 2024-08-14 02:42:02 service.StartService.func1(service.go:59)] auth finished ok
[info 2024-08-14 02:42:02 policy.(*SPolicyManager).init(policy.go:160)] policy fetch worker count 1
[info 2024-08-14 02:42:02 consts.SetNonDefaultDomainProjects(consts.go:109)] set non_default_domain_projects to false
[info 2024-08-14 02:42:02 options.StartOptionManagerWithSessionDriver(manager.go:68)] OptionManager start to fetch service configs with interval 30m0s ...
[info 2024-08-14 02:42:02 watcher.(*SInformerSyncManager).startWatcher(watcher.go:83)]EndpointChangeManager: Start resource informer watcher for endpoint
[info 2024-08-14 02:42:02 informer.(*EtcdBackendForClient).StartClientWatch(etcd_client.go:84)] /onecloud/informer watched
[info 2024-08-14 02:42:02 informer.NewWatchManagerBySessionBg.func1(watcher.go:51)] callback with watchMan success.
[info 2024-08-14 02:42:02 options.optionsEquals(manager.go:120)] Options added: {"api_server":"https://10.13.1.2"}
[info 2024-08-14 02:42:02 app.InitApp(app.go:32)] RequestWorkerCount: 8
[info 2024-08-14 02:42:02 watcher.(*SInformerSyncManager).startWatcher(watcher.go:83)]ServiceConfigManager: Start resource informer watcher for service

@zexi
Member

zexi commented Aug 20, 2024

Could you restart vpcagent and then try obtaining an IP again from inside the VM?

@saltfishh
Author

> Could you restart vpcagent and then try obtaining an IP again from inside the VM?

I believe I already tried that the same day. I'll try again.

@saltfishh
Author

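> Could you restart vpcagent and then try obtaining an IP again from inside the VM?

No luck. While dhclient was running in the VM, the vpcagent pod log looked the same as the log above, with nothing unusual.
Below is the /etc/yunion/vpcagent.conf file from that pod: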

/ # cat /etc/yunion/vpcagent.conf 
address: 0.0.0.0
admin_domain: Default
admin_password: MY_PASS
admin_project: system
admin_project_domain: Default
admin_user: vpcagentadmin
api_list_batch_size: 1024
api_sync_interval: 10
application_id: vpcagent
auth_token_cache_size: 2048
auth_url: https://default-keystone:30357/v3
calculate_quota_usage_interval_seconds: 900
config_sync_period_seconds: 1800
cron_job_worker_count: 4
debug_client: false
default_quota_value: default
domainized_namespace: false
enable_quota_check: false
enable_ssl: false
help: false
ignore_nonrunning_guests: true
is_slave_node: false
log_level: info
log_verbose_level: 0
non_default_domain_projects: false
notify_admin_users:
- sysadmin
ovn_north_database: tcp:default-ovn-north:32241
ovn_underlay_mtu: 1500
ovn_worker_check_interval: 180
platform_name: Cloudpods
port: 0
rbac_debug: false
rbac_policy_refresh_interval_seconds: 30
region: region0
request_worker_count: 8
temp_path: /opt/yunion/tmp
tenant_cache_expire_seconds: 900
time_zone: Asia/Shanghai
version: false
vpc_provider: ovn

@saltfishh
Author

Would rolling back fix this?

@zexi
Member

zexi commented Aug 30, 2024

This doesn't look like a version issue. Could you correct that host's IP and then try again?

The corresponding commands are:

climc host-remove-netif $host_id $mac
climc host-add-netif --ip-addr a.b.c.d --type admin $host_id $wire_id $mac 0

@saltfishh
Author

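> This doesn't look like a version issue. Could you correct that host's IP and then try again?
>
> The corresponding commands are:
>
> climc host-remove-netif $host_id $mac
> climc host-add-netif --ip-addr a.b.c.d --type admin $host_id $wire_id $mac 0

Will this affect running workloads?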

@saltfishh
Author

> climc

Which one is $wire_id?

@zexi
Copy link
Member

zexi commented Aug 30, 2024

> climc
>
> Which one is $wire_id?

Run climc wire-list to find it. In the web console it corresponds to the layer-2 network.

@saltfishh
Copy link
Author

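Still not working. The host's IP has now been corrected:
2024-08-30_11-25
After correcting the host IP, I restarted the vpcagent pod.
When a new VM was created, the vpcagent pod logged the following: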

[info 2024-08-30 03:16:12 ovnutil.(*OvnNbCtl).must(ovn_nbctl.go:115)] ClaimGuestnetwork:
ovn-nbctl
-- "--id=@iface-249c0bc2-c1fd-45a4-8fc3-c8bb69af39ff-LANVPCSub-15" "create" "Logical_Switch_Port"
"addresses=["00:22:70:ac:35:59 10.11.1.143"]"
"name="iface-249c0bc2-c1fd-45a4-8fc3-c8bb69af39ff-LANVPCSub-15""
"port_security=["00:22:70:ac:35:59 10.11.1.143/25"]"
"type="""
-- "--id=@dhcp-opt-8938d1ad-dfae-4447-88ff-915c001e0674-LANVPCSub-15" "create" "DHCP_Options"
"cidr="10.11.1.128/25""
"external_ids:"oc-ref"="dhcp/8938d1ad-dfae-4447-88ff-915c001e0674/LANVPCSub-15""
"options:"domain_name"="\"cloud.onecloud.io\"""
"options:"ntp_server"="{\"ntp2.aliyun.com\"}""
"options:"server_id"="10.11.1.129""
"options:"server_mac"="9a:79:4c:76:af:32""
"options:"lease_time"="100663296""
"options:"T1"="67108864""
"options:"classless_static_route"="{169.254.169.254,0.0.0.0,0.0.0.0/0,10.11.1.129}""
"options:"router"="10.11.1.129""
"options:"mtu"="1440""
"options:"T2"="33554432""
"options:"dns_server"="{223.5.5.5,223.6.6.6}""
-- "add" "Logical_Switch_Port" "iface-249c0bc2-c1fd-45a4-8fc3-c8bb69af39ff-LANVPCSub-15" "dhcpv4_options" "@dhcp-opt-8938d1ad-dfae-4447-88ff-915c001e0674-LANVPCSub-15"
-- "add" "Logical_Switch" "subnet/249c0bc2-c1fd-45a4-8fc3-c8bb69af39ff" "ports" "@iface-249c0bc2-c1fd-45a4-8fc3-c8bb69af39ff-LANVPCSub-15"
-- "--id=@gnrDefault" "create" "Logical_Router_Static_Route"
"external_ids:"oc-ref"="gnrDefault/f22f0f11-22b9-4508-86bd-cf37c0a7dfb4/8938d1ad-dfae-4447-88ff-915c001e0674/LANVPCSub-15""
"ip_prefix="10.11.1.143/32""
"nexthop="100.64.0.129""
"output_port="vpc-rh/f22f0f11-22b9-4508-86bd-cf37c0a7dfb4""
"policy="src-ip""
-- "add" "Logical_Router" "vpc-ext-r/f22f0f11-22b9-4508-86bd-cf37c0a7dfb4" "static_routes" "@gnrDefault"
-- "--id=@acl0" "create" "ACL"
"action="drop""
"direction="to-lport""
"external_ids:"oc-ref"="acl/249c0bc2-c1fd-45a4-8fc3-c8bb69af39ff/8938d1ad-dfae-4447-88ff-915c001e0674/LANVPCSub-15""
"log=false"
"match="outport == \"iface-249c0bc2-c1fd-45a4-8fc3-c8bb69af39ff-LANVPCSub-15\"""
"priority=1"
-- "add" "Logical_Switch" "subnet/249c0bc2-c1fd-45a4-8fc3-c8bb69af39ff" "acls" "@acl0"
-- "--id=@Acl1" "create" "ACL"
"action="allow-related""
"direction="to-lport""
"external_ids:"oc-ref"="acl/249c0bc2-c1fd-45a4-8fc3-c8bb69af39ff/8938d1ad-dfae-4447-88ff-915c001e0674/LANVPCSub-15""
"log=false"
"match="outport == \"iface-249c0bc2-c1fd-45a4-8fc3-c8bb69af39ff-LANVPCSub-15\" && arp""
"priority=2"
-- "add" "Logical_Switch" "subnet/249c0bc2-c1fd-45a4-8fc3-c8bb69af39ff" "acls" "@Acl1"
-- "--id=@acl2" "create" "ACL"
"action="allow-related""
"direction="to-lport""
"external_ids:"oc-ref"="acl/249c0bc2-c1fd-45a4-8fc3-c8bb69af39ff/8938d1ad-dfae-4447-88ff-915c001e0674/LANVPCSub-15""
"log=false"
"match="outport == \"iface-249c0bc2-c1fd-45a4-8fc3-c8bb69af39ff-LANVPCSub-15\" && ip4 && ip4.src == 0.0.0.0/0""
"priority=101"
-- "add" "Logical_Switch" "subnet/249c0bc2-c1fd-45a4-8fc3-c8bb69af39ff" "acls" "@acl2"
-- "--id=@qosVif0" "create" "QoS"
"bandwidth:"rate"=1000000"
"bandwidth:"burst"=2000000"
"direction="from-lport""
"external_ids:"oc-ref"="qos/249c0bc2-c1fd-45a4-8fc3-c8bb69af39ff/8938d1ad-dfae-4447-88ff-915c001e0674/LANVPCSub-15""
"match="inport == \"iface-249c0bc2-c1fd-45a4-8fc3-c8bb69af39ff-LANVPCSub-15\"""
"priority=2000"
-- "add" "Logical_Switch" "subnet/249c0bc2-c1fd-45a4-8fc3-c8bb69af39ff" "qos_rules" "@qosVif0"
-- "--id=@qosVif1" "create" "QoS"
"bandwidth:"rate"=1000000"
"bandwidth:"burst"=2000000"
"direction="to-lport""
"external_ids:"oc-ref"="qos/249c0bc2-c1fd-45a4-8fc3-c8bb69af39ff/8938d1ad-dfae-4447-88ff-915c001e0674/LANVPCSub-15""
"match="outport == \"iface-249c0bc2-c1fd-45a4-8fc3-c8bb69af39ff-LANVPCSub-15\"""
"priority=1000"
-- "add" "Logical_Switch" "subnet/249c0bc2-c1fd-45a4-8fc3-c8bb69af39ff" "qos_rules" "@qosVif1"

@saltfishh
Author

saltfishh commented Sep 3, 2024

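I tried rebooting the host, but it still doesn't work. Of the VMs created before the update to 3.11.5, some can obtain an IP via DHCP and some cannot.
None of the VMs created after the update can.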

@saltfishh
Author

arp -an
? (10.11.1.143) at 00:22:ee:73:e2:92 [ether] on eth0
? (10.11.1.169) at <incomplete> on eth0

? (10.11.1.172) at 00:22:ca:89:89:28 [ether] on eth0

Of these, 172 is a VM that cannot obtain an IP via DHCP. Oddly, a host on the same subnet can still resolve its correct MAC address via ARP...

@saltfishh
Copy link
Author

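This issue has been resolved.

  • ovn-controller log:

2024-09-05_13-42

The log shows that, among the DHCPv4 options, ntp_server requires a numeric value, whereas the NTP server previously configured for this VPC subnet was a domain name.
After changing the subnet's NTP server to an IP address and restarting the host, the VM obtained an IP via DHCP.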
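
To catch this kind of misconfiguration before it reaches OVN, the configured value can be checked for being a literal IPv4 address. This is a minimal sketch in POSIX sh; the helper names `is_ipv4` and `non_ip_ntp_servers` are hypothetical, and the comma-separated input format is an assumption about how the subnet's NTP servers are written.

```shell
# Return 0 if $1 is a dotted-quad IPv4 literal, non-zero otherwise.
is_ipv4() {
  case "$1" in
    # Reject empty input, any character other than digits and dots,
    # and leading, trailing, or doubled dots.
    ''|*[!0-9.]*|.*|*.|*..*) return 1 ;;
  esac
  old_ifs=$IFS; IFS=.
  set -- $1                      # split the address into its octets
  IFS=$old_ifs
  [ "$#" -eq 4 ] || return 1
  for octet in "$@"; do
    [ "${#octet}" -le 3 ] && [ "$octet" -le 255 ] || return 1
  done
  return 0
}

# Print every entry of a comma-separated server list that is not an IPv4 literal.
non_ip_ntp_servers() {
  for server in $(printf '%s' "$1" | tr ',' ' '); do
    is_ipv4 "$server" || printf '%s\n' "$server"
  done
}
```

With the value from the vpcagent log above, `non_ip_ntp_servers 'ntp2.aliyun.com'` prints `ntp2.aliyun.com`, flagging it as a hostname rather than an address.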

Definition of the ovn-controller container in default-host:

   ovn-controller:
    Image:      registry.cn-beijing.aliyuncs.com/yunion/openvswitch:2.12.4-2
    Port:       <none>
    Host Port:  <none>
    Command:
      /start.sh
      controller
    Environment:  <none>
    Mounts:
      /var/log/openvswitch from var-log-openvswitch (rw)
      /var/run/openvswitch from var-run-openvswitch (rw)

Did the new version change something here?
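
For anyone hitting the same symptom, the ovn-controller log can be read from the host pod and narrowed to DHCP-related lines. This is a rough sketch, not a confirmed procedure: the pod name in `HOST_POD` is a placeholder to fill in, the `onecloud` namespace and `ovn-controller` container name are taken from the outputs above, and the filter keywords are a guess because the actual message appears only in the screenshot.

```shell
# Keep only log lines that mention DHCP or the ntp_server option
# (case-insensitive); reads from stdin.
dhcp_log_lines() {
  grep -iE 'dhcp|ntp_server'
}

# Assumed usage on a live cluster (HOST_POD is a placeholder for the
# default-host pod running on the affected node):
#   kubectl logs -n onecloud "$HOST_POD" -c ovn-controller | dhcp_log_lines
```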

@github-actions github-actions bot added the stale label Oct 6, 2024