Failed to collect metrics: could not load NVML library #1

Open
zh168654 opened this issue Aug 15, 2018 · 3 comments

Comments

@zh168654

This is my deployment:

apiVersion: apps/v1beta1
kind: Deployment

metadata:
  name: nvidia-exporter
  namespace: monitoring
spec:
  replicas: 1
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: nvidia-exporter
    spec:
      containers:
        - name: nvidia-exporter
          securityContext:
            privileged: true
          image: bugroger/nvidia-exporter:latest
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 9401
          volumeMounts:
            - mountPath: /usr/local/nvidia
              name: nvidia
      volumes:
        - name: nvidia
          hostPath:
            path: /home/zy/cuda

When I exec into the nvidia-exporter container and run

ls /usr/local/nvidia/lib64

libnvidia-ml.so.1 is there, but the container logs always show

Failed to collect metrics: could not load NVML library

@Cherishty

@zh168654 have you found any workaround or clues?
I am facing a similar error, which says:

Failed to collect metrics: nvml: Not Supported

My driver version is 390.59 and the GPU is a Tesla K80.

This error does NOT occur in another environment whose GPU is a GTX 1080.

@SjhZju

SjhZju commented Mar 5, 2019

Hi,
I have the same problem. I think this error is why the exporter cannot get metrics.
My driver version is 390.48, with two GTX 980 cards. The server OS is Ubuntu 16.04.

@bmerry

bmerry commented Jun 23, 2022

I'm running into the same problem. I suspect it's because the Docker image is built with Alpine (and hence musl libc) while Nvidia's NVML library (libnvidia-ml.so) depends on glibc.
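
A rough way to check this suspicion from inside the container is sketched below. It is only a sketch: `classify_libc` is a hypothetical helper name, and it assumes that the banner printed by `ldd --version` mentions either "musl" or "GLIBC"/"GNU libc", which holds on the Alpine and Debian/Ubuntu images I know of but is not guaranteed everywhere. The library path in the comments is the one from the deployment above.

```shell
#!/bin/sh
# Sketch: guess which libc a container uses from the text printed by `ldd --version`.
# classify_libc is a hypothetical helper, not part of the exporter or of NVML.
classify_libc() {
  case "$1" in
    *musl*) echo "musl" ;;
    *GLIBC*|*"GNU libc"*|*"GNU C Library"*) echo "glibc" ;;
    *) echo "unknown" ;;
  esac
}

# Usage inside the container. musl's ldd may print its banner on stderr,
# so both streams are merged before classifying:
#   classify_libc "$(ldd --version 2>&1)"
#
# Listing the dependencies of the mounted library shows anything the
# loader cannot resolve:
#   ldd /usr/local/nvidia/lib64/libnvidia-ml.so.1
```

If the first command reports `musl` while the second shows unresolved glibc dependencies, that would be consistent with the explanation above, though it does not prove that this is the only cause.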
