As I mentioned on Mattermost, I think these are all available through nvidia-smi, e.g. nvidia-smi stats gives you a csv stream of stats.
I also quickly checked LUMI, which has a similar utility called rocm-smi, and something like rocm-smi -fPtu --showmemuse --showvoltage --json seems to give you a snapshot at the moment of calling.
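A minimal sketch of what taking such a snapshot from Python could look like, assuming the tool prints a single JSON document to stdout. The rocm-smi flags are copied as-is from the command above and are not verified here; `gpu_snapshot` and `ROCM_SMI_CMD` are hypothetical names, and no assumption is made about the keys in the returned JSON.

```python
import json
import subprocess
from typing import Any, Sequence

# Flags copied verbatim from the comment above; not verified here.
ROCM_SMI_CMD = ["rocm-smi", "-fPtu", "--showmemuse", "--showvoltage", "--json"]


def gpu_snapshot(cmd: Sequence[str] = ROCM_SMI_CMD, timeout: float = 10.0) -> Any:
    """Run `cmd` once and return its stdout parsed as JSON.

    Raises subprocess.CalledProcessError if the tool exits non-zero and
    json.JSONDecodeError if the output is not valid JSON.
    """
    result = subprocess.run(
        list(cmd),
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode bytes to str
        check=True,           # raise on a non-zero exit code
        timeout=timeout,
    )
    return json.loads(result.stdout)
```

Passing the command in as a parameter keeps the polling code independent of the vendor tool, so another command that emits JSON could be swapped in without touching the caller.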
Edit: in terms of how to integrate this with the rest… I was thinking of some sort of general event database/log that also stores things like "N lines passed onto trainer", "Restarted reading dataset X", "marian validation score is now X BLEU". In the AWS/Cloud world I think the ELK stack is commonly used for this. Don't think we'd need that kind of scale (and I just want to have it all built-in in this repo ideally…) but it might be a source of inspiration.
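One way such a built-in event log could look, as a rough sketch only: an append-only stream with one JSON object per line. `log_event`, `read_events` and the field names are hypothetical, not something that exists in this repo.

```python
import json
import time
from typing import Any, Dict, List, TextIO


def log_event(stream: TextIO, kind: str, **fields: Any) -> Dict[str, Any]:
    """Append one event to `stream` as a single JSON line and return it."""
    event = {"time": time.time(), "kind": kind, **fields}
    stream.write(json.dumps(event) + "\n")
    stream.flush()  # make the event visible to anyone tailing the log
    return event


def read_events(stream: TextIO) -> List[Dict[str, Any]]:
    """Parse every non-empty line of `stream` back into an event dict."""
    return [json.loads(line) for line in stream if line.strip()]
```

Usage could be something like `log_event(f, "dataset_restart", dataset="X")` or `log_event(f, "gpu_stats", stats=snapshot)`, so GPU stats and trainer events end up in the same log and can be filtered by `kind` later.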
add functionality to