monitoring training costs #3

Open
jorgtied opened this issue Jan 10, 2023 · 2 comments
Labels: enhancement (New feature or request)

Comments

@jorgtied

add functionality to

  • monitor GPU utilisation
  • collect stats about power consumption
jorgtied added the enhancement label on Jan 10, 2023
@jelmervdl
Contributor

jelmervdl commented Jan 12, 2023

As I mentioned on Mattermost, I think these are all available through nvidia-smi, e.g. nvidia-smi stats gives you a CSV stream of stats.
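A minimal polling sketch of what that could look like, assuming nvidia-smi is on PATH (the query fields are the standard ones from nvidia-smi --help-query-gpu; the interval and output format here are just placeholders):

#!/usr/bin/python3
# Sketch: poll nvidia-smi periodically and print one CSV line per GPU.
import subprocess
import time

QUERY = "index,utilization.gpu,power.draw,memory.used"

def poll_gpus():
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    # each line looks like "0, 35, 62.31, 1024"
    return [line.split(", ") for line in out.strip().splitlines()]

if __name__ == "__main__":
    while True:
        ts = time.strftime("%Y-%m-%dT%H:%M:%S")
        for idx, util, power, mem in poll_gpus():
            print(f"{ts},{idx},{util},{power},{mem}")
        time.sleep(5)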

I also quickly checked LUMI, which has a similar utility called rocm-smi, and something like rocm-smi -fPtu --showmemuse --showvoltage --json seems to give you a snapshot at the moment it is called.
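On the ROCm side the same idea would look roughly like this (a sketch only; the flags are the ones above, but the JSON layout differs between rocm-smi versions, so no specific keys are assumed):

#!/usr/bin/python3
# Sketch: take a single rocm-smi snapshot and parse the JSON output.
# The JSON structure varies between rocm-smi versions, so we just re-emit it per card.
import json
import subprocess

out = subprocess.run(
    ["rocm-smi", "-fPtu", "--showmemuse", "--showvoltage", "--json"],
    capture_output=True, text=True, check=True,
).stdout
snapshot = json.loads(out)          # typically a dict keyed per card, e.g. "card0"
for card, stats in snapshot.items():
    print(card, json.dumps(stats))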

Edit: in terms of how to integrate this with the rest… I was thinking of some sort of general event database/log, into which we would also store things like "N lines passed on to trainer", "Restarted reading dataset X", or "Marian validation score is now X BLEU". In the AWS/cloud world the ELK stack is commonly used for this. I don't think we'd need that kind of scale (and ideally I'd want it all built into this repo…), but it might be a source of inspiration.
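A rough sketch of what such a built-in event log could look like (sqlite3 is just a placeholder backend here; the schema and event names are illustrative, not an agreed-on design):

#!/usr/bin/python3
# Sketch: one append-only table of timestamped events from the different pipeline parts.
import json
import sqlite3
import time

db = sqlite3.connect("training-events.db")
db.execute("""CREATE TABLE IF NOT EXISTS events (
    ts REAL, source TEXT, event TEXT, payload TEXT)""")

def log_event(source, event, **payload):
    db.execute("INSERT INTO events VALUES (?, ?, ?, ?)",
               (time.time(), source, event, json.dumps(payload)))
    db.commit()

# e.g. the kinds of events mentioned above:
log_event("reader", "dataset_restarted", dataset="X")
log_event("trainer", "lines_passed", n=100000)
log_event("marian", "validation", metric="bleu", value=30.1)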

@jorgtied
Author

jorgtied commented Mar 9, 2023

For NVIDIA GPUs there is an energy consumption counter (via NVML) that can be read before and after the training process:

#!/usr/bin/python3
# Read the cumulative NVML energy counter for each GPU.
from pynvml import (
    nvmlInit, nvmlDeviceGetCount, nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetTotalEnergyConsumption, nvmlShutdown
)

nvmlInit()

deviceCount = nvmlDeviceGetCount()
for i in range(deviceCount):
    handle = nvmlDeviceGetHandleByIndex(i)
    # Total energy consumed since the driver was last reloaded, in millijoules.
    energy = nvmlDeviceGetTotalEnergyConsumption(handle)
    print(f"GPU {i}: {energy} mJ")
nvmlShutdown()

requires:

pip install nvidia-ml-py --user
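
To get per-run numbers, one would read the counter before and after the run and take the difference, roughly like this (a sketch only; the sleep is a stand-in for whatever the pipeline actually runs in between):

#!/usr/bin/python3
# Sketch: energy used by one training run = difference of the NVML counter
# before and after.  The sleep below is a placeholder for the actual run.
import time
from pynvml import (
    nvmlInit, nvmlDeviceGetCount, nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetTotalEnergyConsumption, nvmlShutdown
)

def read_energy_mj():
    return [nvmlDeviceGetTotalEnergyConsumption(nvmlDeviceGetHandleByIndex(i))
            for i in range(nvmlDeviceGetCount())]

nvmlInit()
before = read_energy_mj()
time.sleep(60)                      # placeholder for the actual training run
after = read_energy_mj()
for i, (b, a) in enumerate(zip(before, after)):
    print(f"GPU {i}: {(a - b) / 3.6e9:.4f} kWh")   # 1 kWh = 3.6e9 mJ
nvmlShutdown()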
