
Failed to collect metrics: nvml: Not Supported #3

Open
Cherishty opened this issue Nov 29, 2018 · 4 comments

Comments

@Cherishty

Hi @BugRoger

When starting the exporter in Kubernetes, the log always says:

Failed to collect metrics: nvml: Not Supported

Below is the output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.59                 Driver Version: 390.59                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000460E:00:00.0 Off |                    0 |
| N/A   37C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00006180:00:00.0 Off |                    0 |
| N/A   33C    P8    33W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

This error does not occur on another GPU machine that uses a GTX 1080.

Any clues or suggestions?

@Cherishty
Author

Cherishty commented Nov 29, 2018

Additionally, I compared this with a similar gpu-exporter: it hits the same issue on Tesla cards, yet it still works.

It also seems NVIDIA has acknowledged this officially:

NVIDIA/nvidia-docker#40
ComputationalRadiationPhysics/cuda_memtest#16

So can we unblock it?

@jackpgao

+1

nvml: Not Supported

@auto-testing

+1
Tesla: 2019/10/25 11:24:19 Failed to collect metrics: nvml: Not Supported

| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|

GTX 1060 and 1070 work fine.

@ashleyprimo

I will be submitting a PR shortly; however, to quickly explain the issue: not all metrics are supported on Tesla graphics cards via NVML (and there are likely other GPUs affected too), but the exporter handles an unsupported metric by returning early instead of continuing to collect the remaining metrics.

Example:

		fanSpeed, err := device.FanSpeed()
		if err != nil {
			return nil, err
		}

Instead of return nil, err we should just log (catch) the event and continue, so the collection routine is not interrupted.

So currently, as seen in your nvidia-smi output, any metric reported as N/A would cause the above error and interrupt collection.
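
For illustration, here is a minimal, self-contained sketch of that change. The fanSpeeder interface and unsupportedDevice type are hypothetical stand-ins for the exporter's real NVML bindings; the point is only that an unsupported metric gets logged and skipped rather than aborting the whole collection pass.

    package main

    import (
        "errors"
        "log"
    )

    // fanSpeeder models just the one method used in the snippet above;
    // the exporter's real NVML device handle would satisfy it.
    type fanSpeeder interface {
        FanSpeed() (uint, error)
    }

    // collectFanSpeed logs and skips an unsupported metric instead of
    // returning the error and aborting the whole collection pass.
    func collectFanSpeed(dev fanSpeeder) {
        fanSpeed, err := dev.FanSpeed()
        if err != nil {
            log.Printf("skipping fan speed metric: %v", err)
            return
        }
        log.Printf("fan speed: %d%%", fanSpeed)
    }

    // unsupportedDevice simulates a GPU (e.g. a Tesla K80) where the
    // fan speed query returns "nvml: Not Supported".
    type unsupportedDevice struct{}

    func (unsupportedDevice) FanSpeed() (uint, error) {
        return 0, errors.New("nvml: Not Supported")
    }

    func main() {
        collectFanSpeed(unsupportedDevice{})
    }

With that pattern applied to each per-metric call, GPUs that report N/A for some fields would still export the metrics they do support.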
