azuredisk-node-win returns wrong diskID and mounts the wrong disk (e.g. drive D:) #2661

Open
mweibel opened this issue Nov 22, 2024 · 6 comments · May be fixed by #2662

mweibel commented Nov 22, 2024

What happened:

  • We upgraded to Kubernetes 1.31.
  • Nodes run Windows ltsc2022 VMs.
  • Pods mount an emptyDir.
  • Nodes are VMSS VMs (Uniform mode)
  • with Ephemeral disk set to Temp disk placement.
  • Instance type is Standard_D96ads_v5.
  • VMs are generation 2.
  • We use azuredisk-csi in HostProcess mode.

When the pod starts up and the driver tries to stage the volume, the Azure Disk Windows node plugin looks up the disk by LUN. It reaches the following code:

func ListDiskLocations() (map[uint32]Location, error) {
// sample response
// [{
// "number": 0,
// "location": "PCI Slot 3 : Adapter 0 : Port 0 : Target 1 : LUN 0"
// }, ...]
cmd := fmt.Sprintf("ConvertTo-Json @(Get-Disk | select Number, Location)")

Because it looks the disk up by LUN alone, if two disks have the same LUN it takes whichever disk comes first. That may be, for example, drive D: instead of the disk it should actually mount.

The issue does not always occur, because it depends on the order of the response from Get-Disk. It happens very often, though: in a recent reproduction attempt, 3 out of 13 nodes had the issue.
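
The order dependence described above can be sketched roughly as follows. This is not the driver's code: the `disk` type and `firstByLUN` are hypothetical names, the disks are assumed to be already parsed, and the sample values are illustrative. It only shows that a first-match lookup keyed on the LUN alone returns a different disk when the same entries arrive in a different order.

```go
package main

import "fmt"

// disk is a hypothetical, already-parsed view of one Get-Disk entry.
type disk struct {
	Number  uint32
	Adapter uint32
	LUN     uint32
}

// firstByLUN returns the number of the first disk in the slice that has the
// given LUN. The adapter is deliberately ignored to mirror the failure mode.
func firstByLUN(disks []disk, lun uint32) (uint32, bool) {
	for _, d := range disks {
		if d.LUN == lun {
			return d.Number, true
		}
	}
	return 0, false
}

func main() {
	// The same three disks in two different (illustrative) response orders.
	orderA := []disk{
		{Number: 3, Adapter: 1, LUN: 1},
		{Number: 0, Adapter: 0, LUN: 0},
		{Number: 1, Adapter: 0, LUN: 1},
	}
	orderB := []disk{
		{Number: 1, Adapter: 0, LUN: 1},
		{Number: 0, Adapter: 0, LUN: 0},
		{Number: 3, Adapter: 1, LUN: 1},
	}

	a, _ := firstByLUN(orderA, 1)
	b, _ := firstByLUN(orderB, 1)
	fmt.Println(a, b) // prints "3 1"
}
```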

In an affected node:

Powershell Cmdlet Get-Disk returns the following information:

[
    {
        "Number":  3,
        "Location":  "Integrated : Bus 0 : Device 63667 : Function 30747 : Adapter 1 : Port 0 : Target 0 : LUN 1"
    },
    {
        "Number":  0,
        "Location":  "Integrated : Bus 0 : Device 63667 : Function 30746 : Adapter 0 : Port 0 : Target 0 : LUN 0"
    },
    {
        "Number":  1,
        "Location":  "Integrated : Bus 0 : Device 63667 : Function 30746 : Adapter 0 : Port 0 : Target 0 : LUN 1"
    }
]
With all properties
> Get-Disk | select *

DiskNumber            : 3
PartitionStyle        : GPT
ProvisioningType      : Thin
OperationalStatus     : Offline
HealthStatus          : Healthy
BusType               : SAS
UniqueIdFormat        : FCPH Name
OfflineReason         : Collision
ObjectId              : {1}\\dev-mx25100000E\root/Microsoft/Windows/Storage/Providers_v2\WSP_Disk.ObjectId="{bec7662e-a827-11ef-b633-806e6f6e6963}:DI:\\?\scsi#disk&ven_msft&prod_virtual_disk#5&23db5f40&0&000001#{53f56307-b6bf-11d0-94f2-00a0c91efb8b}"
PassThroughClass      :
PassThroughIds        :
PassThroughNamespace  :
PassThroughServer     :
UniqueId              : 60022480F48D294F40F18727DF9A6349
AdapterSerialNumber   :
AllocatedSize         : 549755813888
BootFromDisk          : False
FirmwareVersion       : 1.0
FriendlyName          : Msft Virtual Disk
Guid                  : {fd0e7613-2bbe-40ed-98bf-0426fc3bb7ad}
IsBoot                : False
IsClustered           : False
IsHighlyAvailable     : False
IsOffline             : True
IsReadOnly            : False
IsScaleOut            : False
IsSystem              : False
LargestFreeExtent     : 0
Location              : Integrated : Bus 0 : Device 63667 : Function 30747 : Adapter 1 : Port 0 : Target 0 : LUN 1
LogicalSectorSize     : 512
Manufacturer          : Msft
Model                 : Virtual Disk
Number                : 3
NumberOfPartitions    : 1
Path                  : \\?\scsi#disk&ven_msft&prod_virtual_disk#5&23db5f40&0&000001#{53f56307-b6bf-11d0-94f2-00a0c91efb8b}
PhysicalSectorSize    : 4096
SerialNumber          :
Signature             :
Size                  : 549755813888
PSComputerName        :
CimClass              : ROOT/Microsoft/Windows/Storage:MSFT_Disk
CimInstanceProperties : {ObjectId, PassThroughClass, PassThroughIds, PassThroughNamespace...}
CimSystemProperties   : Microsoft.Management.Infrastructure.CimSystemProperties

DiskNumber            : 0
PartitionStyle        : GPT
ProvisioningType      : Thin
OperationalStatus     : Online
HealthStatus          : Healthy
BusType               : SAS
UniqueIdFormat        : FCPH Name
OfflineReason         :
ObjectId              : {1}\\dev-mx25100000E\root/Microsoft/Windows/Storage/Providers_v2\WSP_Disk.ObjectId="{bec7662e-a827-11ef-b633-806e6f6e6963}:DI:\\?\scsi#disk&ven_msft&prod_virtual_disk#5&394b69d0&0&000000#{53f56307-b6bf-11d0-94f2-00a0c91efb8b}"
PassThroughClass      :
PassThroughIds        :
PassThroughNamespace  :
PassThroughServer     :
UniqueId              : 6002248068A1DF141A25EC72F5B697E0
AdapterSerialNumber   :
AllocatedSize         : 2146409906176
BootFromDisk          : True
FirmwareVersion       : 1.0
FriendlyName          : Msft Virtual Disk
Guid                  : {99bce706-5b33-4d9f-8d26-1645e715fcab}
IsBoot                : True
IsClustered           : False
IsHighlyAvailable     : False
IsOffline             : False
IsReadOnly            : False
IsScaleOut            : False
IsSystem              : True
LargestFreeExtent     : 0
Location              : Integrated : Bus 0 : Device 63667 : Function 30746 : Adapter 0 : Port 0 : Target 0 : LUN 0
LogicalSectorSize     : 512
Manufacturer          : Msft
Model                 : Virtual Disk
Number                : 0
NumberOfPartitions    : 4
Path                  : \\?\scsi#disk&ven_msft&prod_virtual_disk#5&394b69d0&0&000000#{53f56307-b6bf-11d0-94f2-00a0c91efb8b}
PhysicalSectorSize    : 4096
SerialNumber          :
Signature             :
Size                  : 2146409906176
PSComputerName        :
CimClass              : ROOT/Microsoft/Windows/Storage:MSFT_Disk
CimInstanceProperties : {ObjectId, PassThroughClass, PassThroughIds, PassThroughNamespace...}
CimSystemProperties   : Microsoft.Management.Infrastructure.CimSystemProperties

DiskNumber            : 1
PartitionStyle        : MBR
ProvisioningType      : Thin
OperationalStatus     : Online
HealthStatus          : Healthy
BusType               : SAS
UniqueIdFormat        : FCPH Name
OfflineReason         :
ObjectId              : {1}\\dev-mx25100000E\root/Microsoft/Windows/Storage/Providers_v2\WSP_Disk.ObjectId="{bec7662e-a827-11ef-b633-806e6f6e6963}:DI:\\?\scsi#disk&ven_msft&prod_virtual_disk#5&394b69d0&0&000001#{53f56307-b6bf-11d0-94f2-00a0c91efb8b}"
PassThroughClass      :
PassThroughIds        :
PassThroughNamespace  :
PassThroughServer     :
UniqueId              : 60022480113BB3262A95ADDEEDEC998D
AdapterSerialNumber   :
AllocatedSize         : 1719060660224
BootFromDisk          : False
FirmwareVersion       : 1.0
FriendlyName          : Msft Virtual Disk
Guid                  :
IsBoot                : False
IsClustered           : False
IsHighlyAvailable     : False
IsOffline             : False
IsReadOnly            : False
IsScaleOut            : False
IsSystem              : False
LargestFreeExtent     : 0
Location              : Integrated : Bus 0 : Device 63667 : Function 30746 : Adapter 0 : Port 0 : Target 0 : LUN 1
LogicalSectorSize     : 512
Manufacturer          : Msft
Model                 : Virtual Disk
Number                : 1
NumberOfPartitions    : 1
Path                  : \\?\scsi#disk&ven_msft&prod_virtual_disk#5&394b69d0&0&000001#{53f56307-b6bf-11d0-94f2-00a0c91efb8b}
PhysicalSectorSize    : 4096
SerialNumber          :
Signature             : 1638780968
Size                  : 1719060660224
PSComputerName        :
CimClass              : ROOT/Microsoft/Windows/Storage:MSFT_Disk
CimInstanceProperties : {ObjectId, PassThroughClass, PassThroughIds, PassThroughNamespace...}
CimSystemProperties   : Microsoft.Management.Infrastructure.CimSystemProperties

As you can see, Disk 3 and 1 share the same LUN but are on different adapters. Disk 3 would be the correct one, while Disk 1 is incorrect.

When trying to take disk 1, the following logs show up in azuredisk-win:

I1122 08:48:23.186935   13024 nodeserver.go:157] NodeStageVolume: formatting 1 and mounting at \var\lib\kubelet\plugins\kubernetes.io\csi\disk.csi.azure.com\bc19d48e2a45fbacda32ff2f2914ff1ed48432852ee2abbee8a8b5968be59a4c\globalmount with mount options([])
E1122 08:48:24.687834   13024 safe_mounter_host_process_windows.go:162] SetDiskState on disk(1) failed with error setting disk attach state. cmd: Set-Disk -Number 1 -IsOffline $false;Set-Disk -Number 1 -IsReadOnly $false, output: Set-Disk : Operation not supported on a critical disk.

The mount in the end still works, but it mounts the wrong disk (D: instead of the emptyDir).

We haven't been able to find dedicated documentation on the Azure disk drivers, but judging from the information we gathered on that node, we determined that disks on Adapter 0 are critical disks and should be avoided.
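
One way to act on that observation while still using the Location string is sketched below. It rests on the assumption stated above, which we inferred from this node and could not confirm in documentation, that Adapter 0 only hosts critical disks. `parseLocation` and `findDataDisk` are hypothetical helpers, not functions from the driver.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// locationRe captures the adapter and LUN numbers at the end of a Get-Disk
// Location string such as "... : Adapter 1 : Port 0 : Target 0 : LUN 1".
var locationRe = regexp.MustCompile(`Adapter (\d+) : Port \d+ : Target \d+ : LUN (\d+)$`)

// parseLocation (hypothetical) extracts the adapter and LUN from a Location string.
func parseLocation(loc string) (adapter, lun uint64, err error) {
	m := locationRe.FindStringSubmatch(loc)
	if m == nil {
		return 0, 0, fmt.Errorf("unexpected location format: %q", loc)
	}
	if adapter, err = strconv.ParseUint(m[1], 10, 32); err != nil {
		return 0, 0, err
	}
	if lun, err = strconv.ParseUint(m[2], 10, 32); err != nil {
		return 0, 0, err
	}
	return adapter, lun, nil
}

// findDataDisk (hypothetical) returns the disk number with the given LUN,
// skipping every disk on Adapter 0 (assumed to be critical disks).
func findDataDisk(locations map[uint32]string, lun uint64) (uint32, error) {
	for number, loc := range locations {
		adapter, l, err := parseLocation(loc)
		if err != nil {
			return 0, err
		}
		if adapter != 0 && l == lun {
			return number, nil
		}
	}
	return 0, fmt.Errorf("no data disk found for LUN %d", lun)
}

func main() {
	// Location values from the affected node, keyed by disk number.
	locations := map[uint32]string{
		3: "Integrated : Bus 0 : Device 63667 : Function 30747 : Adapter 1 : Port 0 : Target 0 : LUN 1",
		0: "Integrated : Bus 0 : Device 63667 : Function 30746 : Adapter 0 : Port 0 : Target 0 : LUN 0",
		1: "Integrated : Bus 0 : Device 63667 : Function 30746 : Adapter 0 : Port 0 : Target 0 : LUN 1",
	}
	number, err := findDataDisk(locations, 1)
	fmt.Println(number, err) // prints "3 <nil>"
}
```

The regular expression depends on the exact wording of Location, which is the fragility that motivates the alternative described next.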

Because Location is just a free-form string, we determined it might be better to use the Get-CimInstance cmdlet instead, which returns the SCSI fields as separate properties:

> ConvertTo-Json @(Get-CimInstance win32_diskdrive|where-object -FilterScript {$_.SCSIPort -Ne 0}|select Index,SCSILogicalUnit,SCSITargetId,SCSIPort,SCSIBus)

[
    {
        "Index":  3,
        "SCSILogicalUnit":  1,
        "SCSITargetId":  0,
        "SCSIPort":  1,
        "SCSIBus":  0
    }
]
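
On the Go side, consuming this output could look roughly like the sketch below. The `scsiDisk` type and `diskIndexForLUN` are hypothetical names, not the driver's API. The `SCSIPort != 0` check is repeated in Go only as a safeguard, since the PowerShell filter above already removes those disks.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// scsiDisk mirrors the Win32_DiskDrive properties selected in the command above.
type scsiDisk struct {
	Index           uint32 `json:"Index"`
	SCSILogicalUnit uint32 `json:"SCSILogicalUnit"`
	SCSITargetID    uint32 `json:"SCSITargetId"`
	SCSIPort        uint32 `json:"SCSIPort"`
	SCSIBus         uint32 `json:"SCSIBus"`
}

// diskIndexForLUN (hypothetical) decodes the JSON array and returns the index
// of the disk with the given LUN, ignoring anything on SCSI port 0.
func diskIndexForLUN(raw []byte, lun uint32) (uint32, error) {
	var disks []scsiDisk
	if err := json.Unmarshal(raw, &disks); err != nil {
		return 0, fmt.Errorf("decoding disk list: %w", err)
	}
	for _, d := range disks {
		if d.SCSIPort != 0 && d.SCSILogicalUnit == lun {
			return d.Index, nil
		}
	}
	return 0, fmt.Errorf("no disk with LUN %d outside SCSI port 0", lun)
}

func main() {
	// Output of the Get-CimInstance command on the affected node.
	raw := []byte(`[
		{"Index": 3, "SCSILogicalUnit": 1, "SCSITargetId": 0, "SCSIPort": 1, "SCSIBus": 0}
	]`)
	index, err := diskIndexForLUN(raw, 1)
	fmt.Println(index, err) // prints "3 <nil>"
}
```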

What you expected to happen:
Mounting works and mounts the correct disk.

How to reproduce it:
Create several pods, each running on a dedicated node configured as outlined above, and mount an emptyDir. Check which disk the pods have mounted and look for the error logs in azuredisk-win.

Anything else we need to know?:

Environment:

  • CSI Driver version: v1.31.1
  • Kubernetes version (use kubectl version): v1.31.1
  • OS (e.g. from /etc/os-release): windows ltsc2022
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
mweibel added a commit to helio/azuredisk-csi-driver that referenced this issue Nov 22, 2024
@andyzhangx
Member

@mweibel thanks for the PR. Do you know whether the 3 disks are all data disks? What are the LUN numbers of the 3 disks?

@mweibel
Author

mweibel commented Nov 22, 2024

@andyzhangx thanks for the fast reply! There's a collapsed section in the initial message ("With all properties") which has more detail. Does that answer your question?

I also have some more detailed output from get-ciminstance:

> Get-CimInstance win32_diskdrive|select-object *


ConfigManagerErrorCode      : 0
LastErrorCode               :
NeedsCleaning               :
Status                      : OK
DeviceID                    : \\.\PHYSICALDRIVE1
StatusInfo                  :
Partitions                  : 1
BytesPerSector              : 512
ConfigManagerUserConfig     : False
DefaultBlockSize            :
Index                       : 1
InstallDate                 :
InterfaceType               : SCSI
MaxBlockSize                :
MaxMediaSize                :
MinBlockSize                :
NumberOfMediaSupported      :
SectorsPerTrack             : 63
Size                        : 1719058844160
TotalCylinders              : 208997
TotalHeads                  : 255
TotalSectors                : 3357536805
TotalTracks                 : 53294235
TracksPerCylinder           : 255
Caption                     : Microsoft Virtual Disk
Description                 : Disk drive
Name                        : \\.\PHYSICALDRIVE1
Availability                :
CreationClassName           : Win32_DiskDrive
ErrorCleared                :
ErrorDescription            :
PNPDeviceID                 : SCSI\DISK&VEN_MSFT&PROD_VIRTUAL_DISK\5&394B69D0&0&000001
PowerManagementCapabilities :
PowerManagementSupported    :
SystemCreationClassName     : Win32_ComputerSystem
SystemName                  : dev-mx25100000E
Capabilities                : {3, 4}
CapabilityDescriptions      : {Random Access, Supports Writing}
CompressionMethod           :
ErrorMethodology            :
FirmwareRevision            : 1.0
Manufacturer                : (Standard disk drives)
MediaLoaded                 : True
MediaType                   : Fixed hard disk media
Model                       : Microsoft Virtual Disk
SCSIBus                     : 0
SCSILogicalUnit             : 1
SCSIPort                    : 0
SCSITargetId                : 0
SerialNumber                :
Signature                   : 1638780968
PSComputerName              :
CimClass                    : root/cimv2:Win32_DiskDrive
CimInstanceProperties       : {Caption, Description, InstallDate, Name...}
CimSystemProperties         : Microsoft.Management.Infrastructure.CimSystemProperties

ConfigManagerErrorCode      : 0
LastErrorCode               :
NeedsCleaning               :
Status                      : OK
DeviceID                    : \\.\PHYSICALDRIVE0
StatusInfo                  :
Partitions                  : 3
BytesPerSector              : 512
ConfigManagerUserConfig     : False
DefaultBlockSize            :
Index                       : 0
InstallDate                 :
InterfaceType               : SCSI
MaxBlockSize                :
MaxMediaSize                :
MinBlockSize                :
NumberOfMediaSupported      :
SectorsPerTrack             : 63
Size                        : 2146403266560
TotalCylinders              : 260952
TotalHeads                  : 255
TotalSectors                : 4192193880
TotalTracks                 : 66542760
TracksPerCylinder           : 255
Caption                     : Microsoft Virtual Disk
Description                 : Disk drive
Name                        : \\.\PHYSICALDRIVE0
Availability                :
CreationClassName           : Win32_DiskDrive
ErrorCleared                :
ErrorDescription            :
PNPDeviceID                 : SCSI\DISK&VEN_MSFT&PROD_VIRTUAL_DISK\5&394B69D0&0&000000
PowerManagementCapabilities :
PowerManagementSupported    :
SystemCreationClassName     : Win32_ComputerSystem
SystemName                  : dev-mx25100000E
Capabilities                : {3, 4}
CapabilityDescriptions      : {Random Access, Supports Writing}
CompressionMethod           :
ErrorMethodology            :
FirmwareRevision            : 1.0
Manufacturer                : (Standard disk drives)
MediaLoaded                 : True
MediaType                   : Fixed hard disk media
Model                       : Microsoft Virtual Disk
SCSIBus                     : 0
SCSILogicalUnit             : 0
SCSIPort                    : 0
SCSITargetId                : 0
SerialNumber                :
Signature                   :
PSComputerName              :
CimClass                    : root/cimv2:Win32_DiskDrive
CimInstanceProperties       : {Caption, Description, InstallDate, Name...}
CimSystemProperties         : Microsoft.Management.Infrastructure.CimSystemProperties

ConfigManagerErrorCode      : 0
LastErrorCode               :
NeedsCleaning               :
Status                      : OK
DeviceID                    : \\.\PHYSICALDRIVE3
StatusInfo                  :
Partitions                  : 1
BytesPerSector              : 512
ConfigManagerUserConfig     : False
DefaultBlockSize            :
Index                       : 3
InstallDate                 :
InterfaceType               : SCSI
MaxBlockSize                :
MaxMediaSize                :
MinBlockSize                :
NumberOfMediaSupported      :
SectorsPerTrack             : 63
Size                        : 549753039360
TotalCylinders              : 66837
TotalHeads                  : 255
TotalSectors                : 1073736405
TotalTracks                 : 17043435
TracksPerCylinder           : 255
Caption                     : Microsoft Virtual Disk
Description                 : Disk drive
Name                        : \\.\PHYSICALDRIVE3
Availability                :
CreationClassName           : Win32_DiskDrive
ErrorCleared                :
ErrorDescription            :
PNPDeviceID                 : SCSI\DISK&VEN_MSFT&PROD_VIRTUAL_DISK\5&23DB5F40&0&000001
PowerManagementCapabilities :
PowerManagementSupported    :
SystemCreationClassName     : Win32_ComputerSystem
SystemName                  : dev-mx25100000E
Capabilities                : {3, 4}
CapabilityDescriptions      : {Random Access, Supports Writing}
CompressionMethod           :
ErrorMethodology            :
FirmwareRevision            : 1.0
Manufacturer                : (Standard disk drives)
MediaLoaded                 : True
MediaType                   : Fixed hard disk media
Model                       : Microsoft Virtual Disk
SCSIBus                     : 0
SCSILogicalUnit             : 1
SCSIPort                    : 1
SCSITargetId                : 0
SerialNumber                :
Signature                   :
PSComputerName              :
CimClass                    : root/cimv2:Win32_DiskDrive
CimInstanceProperties       : {Caption, Description, InstallDate, Name...}
CimSystemProperties         : Microsoft.Management.Infrastructure.CimSystemProperties

@andyzhangx
Member

I think the issue exists when you set Temp disk placement. What happens if you don't use that setting?

@mweibel
Author

mweibel commented Nov 22, 2024

To reproduce this without Temp disk placement, I ran the following pod:

apiVersion: v1
kind: Pod
metadata:
  name: debug-windows
  namespace: default
spec:
  containers:
    - name: main
      image: mcr.microsoft.com/windows/servercore:ltsc2022
      imagePullPolicy: IfNotPresent
      command: [ "cmd", "/c" ]
      args: [ "ping /t 127.0.0.1" ]
      volumeMounts:
        - name: test
          mountPath: /test
  volumes:
    - name: test
      ephemeral:
        volumeClaimTemplate:
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 1Gi
            storageClassName: ssd-ntfs
            volumeMode: Filesystem
  nodeSelector:
    kubernetes.io/os: windows

Storage class:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ssd-ntfs
parameters:
  cachingMode: ReadOnly
  fsType: ntfs
  kind: managed
  skuName: Premium_LRS
provisioner: disk.csi.azure.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

Running Get-Disk on the node the pod got assigned to:

> ConvertTo-Json @(Get-Disk | select Number, Location)
[
    {
        "Number":  2,
        "Location":  "Integrated : Bus 0 : Device 63667 : Function 30747 : Adapter 1 : Port 0 : Target 0 : LUN 0"
    },
    {
        "Number":  0,
        "Location":  "Integrated : Bus 0 : Device 63667 : Function 30746 : Adapter 0 : Port 0 : Target 0 : LUN 0"
    },
    {
        "Number":  1,
        "Location":  "Integrated : Bus 0 : Device 63667 : Function 30746 : Adapter 0 : Port 0 : Target 0 : LUN 1"
    }
]

Here disks 2 and 0 both report LUN 0, so without this fix the issue would still occur, even though Temp disk placement was not set.

@andyzhangx
Member

I see, that's related to NVMe disk controller VM sku, will take a look next week

@mweibel
Author

mweibel commented Nov 26, 2024

> I see, that's related to NVMe disk controller VM sku, will take a look next week

Can you elaborate, please? I'm not sure what you mean.
