AI-Driven Science on Supercomputers @ ALCF 2022
Contact: Sam Foreman ([email protected])
- Accompanying slides:
This section of the workshop will introduce you to some of the tools that we use to run large-scale distributed training on supercomputers.
Note
Additional examples (+ resources) can be found at:
- Clone / update repo:
    # if not already cloned:
    git clone https://github.com/argonne-lcf/ai-science-training-series
    cd ai-science-training-series/
    git pull
- Navigate into 07_largeScaleTraining/src/ai4sci
- Run (with batch_size=512, for example):
    export BS=512; ./main.sh "batch_size=${BS}" > "main-bs-${BS}.log" 2>&1 &
- View output:
    tail -f "main-bs-${BS}.log" $(tail -1 logs/latest)
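The launch-and-follow pattern in the steps above can be tried without a compute node. What follows is a minimal sketch, not the workshop's script: `fake_main` is a hypothetical stand-in for `./main.sh`, and it assumes (the source does not confirm this) that `logs/latest` is a text file whose last line is the path of the most recent run log.

```shell
# Sketch only: fake_main is a hypothetical stand-in for ./main.sh.
set -eu

WORKDIR="$(mktemp -d)"   # scratch directory so nothing real is overwritten
cd "${WORKDIR}"
mkdir -p logs

# Assumption: each run appends the path of its own log to logs/latest.
fake_main() {
    echo "training with $1" > logs/run-001.log
    echo "logs/run-001.log" >> logs/latest
    echo "launched with $1"
}

export BS=512
LOG="main-bs-${BS}.log"

# Send stdout and stderr to the log file, then run the job in the background.
fake_main "batch_size=${BS}" > "${LOG}" 2>&1 &
wait $!                  # block until the background job exits

# tail -1 prints the last line of logs/latest, i.e. the newest run log path.
LATEST="$(tail -1 logs/latest)"
cat "${LOG}" "${LATEST}" # the real steps use tail -f to keep following both
```

Because the job runs in the background, the shell prompt returns immediately; `tail -f` (or `cat`, as in this sketch) is what lets you inspect progress while the run continues.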
Warning
If you run into issues with packages, try the following (directly from a compute node):
    module load conda
    conda activate base
    cd 07_largeScaleTraining/
    python3 -m venv venv --system-site-packages
    source venv/bin/activate
    python3 -m pip install --upgrade pip
    python3 -m pip install hydra-core hydra_colorlog
    python3 -m pip install -e .
    python3 -c 'import ai4sci; print(ai4sci.__file__)'
- To run:
    cd src/ai4sci
    ./main.sh > main.log 2>&1 &
- To view output:
    tail -f main.log $(tail -1 logs/latest)
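Before relaunching, it can help to confirm that the virtual environment created above is the one in use. The sketch below relies only on the `VIRTUAL_ENV` variable that `venv/bin/activate` exports; `venv_status` is a hypothetical helper name and is not part of the workshop repo.

```shell
# Sketch only: venv_status is a hypothetical helper, not part of the repo.
# The venv activate script exports VIRTUAL_ENV; it is unset or empty otherwise.
venv_status() {
    if [ -n "${VIRTUAL_ENV:-}" ]; then
        echo "venv active: ${VIRTUAL_ENV}"
    else
        echo "no venv active"
    fi
}

venv_status
```

If this reports that no venv is active, re-run `source venv/bin/activate` from `07_largeScaleTraining/` before launching `./main.sh`.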