
Failed to collect metrics: nvml: Not Supported #3

Open
Cherishty opened this issue Nov 29, 2018 · 4 comments

Comments

@Cherishty

Hi @BugRoger

When starting the exporter in Kubernetes, the log always says:

Failed to collect metrics: nvml: Not Supported

Below is the output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.59                 Driver Version: 390.59                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000460E:00:00.0 Off |                    0 |
| N/A   37C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00006180:00:00.0 Off |                    0 |
| N/A   33C    P8    33W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

This error does not occur on another GPU machine that uses a GTX 1080.

Any clues or suggestions?

@Cherishty
Author

Cherishty commented Nov 29, 2018

Additionally, I compared this with a similar gpu-exporter: it hits the same issue on Tesla cards, yet it still works.

It also seems NVIDIA has acknowledged this officially:

NVIDIA/nvidia-docker#40
ComputationalRadiationPhysics/cuda_memtest#16

So can we unblock it?

@jackpgao

+1

nvml: Not Supported

@auto-testing

+1
Tesla: 2019/10/25 11:24:19 Failed to collect metrics: nvml: Not Supported

| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|

GTX 1060 and 1070 work fine.

@ashleyprimo

I will be submitting a PR shortly; however, to quickly explain the issue: not all metrics are supported on Tesla graphics cards via NVML (and there are likely other GPUs affected too), but the exporter handles an unsupported metric by returning early instead of continuing to collect the remaining metrics.

Example:

		fanSpeed, err := device.FanSpeed()
		if err != nil {
			return nil, err
		}

Instead of return nil, err we should just log (catch) the event and continue, so the collection routine is not interrupted.

So currently, as seen in your nvidia-smi output, any metric reported as N/A would cause the above error and interrupt collection.
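
For illustration, here is a minimal, self-contained sketch of that change. The fanSpeeder interface and unsupportedDevice type are hypothetical stand-ins for the exporter's real NVML bindings; the point is only that an unsupported metric gets logged and skipped rather than aborting the whole collection pass.

    package main

    import (
        "errors"
        "log"
    )

    // fanSpeeder models just the one method used in the snippet above;
    // the exporter's real NVML device handle would satisfy it.
    type fanSpeeder interface {
        FanSpeed() (uint, error)
    }

    // collectFanSpeed logs and skips an unsupported metric instead of
    // returning the error and aborting the whole collection pass.
    func collectFanSpeed(dev fanSpeeder) {
        fanSpeed, err := dev.FanSpeed()
        if err != nil {
            log.Printf("skipping fan speed metric: %v", err)
            return
        }
        log.Printf("fan speed: %d%%", fanSpeed)
    }

    // unsupportedDevice simulates a GPU (e.g. a Tesla K80) where the
    // fan speed query returns "nvml: Not Supported".
    type unsupportedDevice struct{}

    func (unsupportedDevice) FanSpeed() (uint, error) {
        return 0, errors.New("nvml: Not Supported")
    }

    func main() {
        collectFanSpeed(unsupportedDevice{})
    }

With that pattern applied to each per-metric call, GPUs that report N/A for some fields would still export the metrics they do support.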
