Large Scale Training

AI-Driven Science on Supercomputers @ ALCF 2022

Contact: Sam Foreman ([email protected])

Accompanying slides:

This section of the workshop will introduce you to some of the tools that we use to run large-scale distributed training on supercomputers.

Note
Additional examples (+ resources) can be found at:

ALCF: Computational Performance Workshop

ALCF: Simulation, Data, and Learning Workshop for AI

Running

Clone / update repo:

```
# if not already cloned:
git clone https://github.com/argonne-lcf/ai-science-training-series
cd ai-science-training-series/
git pull
```

Navigate into 07_largeScaleTraining/src/ai4sci

Run (with batch_size=512, for example):

```
export BS=512; ./main.sh "batch_size=${BS}" > "main-bs-${BS}.log" 2>&1 &
```
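To compare several batch sizes, the same command can be wrapped in a small helper. This is only a sketch: `log_name`, `run_one`, and the `LAUNCHER` variable are hypothetical names that are not part of the repository, and the only interface assumed for `./main.sh` is the `batch_size=...` argument shown above. Runs are kept in the foreground so they execute one after another instead of competing for the same devices.

```shell
#!/usr/bin/env bash

# Hypothetical helper: build the log file name used above from a batch size.
log_name() {
  printf 'main-bs-%s.log' "$1"
}

# Hypothetical helper: run the launcher once for a given batch size.
# LAUNCHER defaults to ./main.sh; stdout and stderr both go to the log file.
run_one() {
  local bs="$1"
  local launcher="${LAUNCHER:-./main.sh}"
  "${launcher}" "batch_size=${bs}" > "$(log_name "${bs}")" 2>&1
}

# Example usage (uncomment inside 07_largeScaleTraining/src/ai4sci):
# for bs in 128 256 512; do run_one "${bs}"; done
```

Each run writes to its own `main-bs-<size>.log`, so the logs can be compared side by side afterwards.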

View output:

```
tail -f "main-bs-${BS}.log" $(tail -1 logs/latest)
```
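The `$(tail -1 logs/latest)` part of that command suggests that `logs/latest` is a plain-text index whose last line is the path to the most recent run's log. That reading is an assumption based on the command alone. Under it, a minimal sketch of a guard that fails cleanly when the index does not exist yet (for example, before the first run has started) could look like this, where `latest_log` is a hypothetical helper name:

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the last line of an index file (default: logs/latest).
# Assumes each line of the index is the path to one run's log, newest last.
latest_log() {
  local index="${1:-logs/latest}"
  # Return non-zero if the index is missing or empty.
  [ -s "${index}" ] || return 1
  tail -n 1 "${index}"
}

# Example usage (from 07_largeScaleTraining/src/ai4sci):
# run_log="$(latest_log)" && tail -f "main-bs-${BS}.log" "${run_log}"
```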

Warning
If you run into issues with missing or mismatched packages, try the following directly from a compute node:

```
module load conda
conda activate base
cd 07_largeScaleTraining/
python3 -m venv venv --system-site-packages
source venv/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install hydra-core hydra_colorlog
python3 -m pip install -e .
python3 -c 'import ai4sci; print(ai4sci.__file__)'
```

To run:

```
cd src/ai4sci
./main.sh > main.log 2>&1 &
```

To view output:
```
tail -f main.log $(tail -1 logs/latest)
```
