Large Scale Training

AI-Driven Science on Supercomputers @ ALCF 2022

Contact: Sam Foreman ([email protected])

This section of the workshop will introduce you to some of the tools that we use to run large-scale distributed training on supercomputers.

Note
Additional examples (+ resources) can be found at:

Running

  1. Clone / update repo:
    # if not already cloned:
    git clone https://github.com/argonne-lcf/ai-science-training-series
    cd ai-science-training-series/
    git pull
  2. Navigate into 07_largeScaleTraining/src/ai4sci
  3. Run (with batch_size=512, for example):
    export BS=512; ./main.sh "batch_size=${BS}" > "main-bs-${BS}.log" 2>&1 &
  4. View output:
    tail -f "main-bs-${BS}.log" $(tail -1 logs/latest)
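Steps 3 and 4 can be wrapped in two small shell helpers, sketched below. The function names `log_name` and `latest_log` are hypothetical and not part of the repo. The sketch also assumes, based only on the `tail -1 logs/latest` call above, that `logs/latest` is a text file whose last line is the path of the most recent run's log.

```shell
#!/usr/bin/env bash

# Hypothetical helper: build the log file name used in step 3 from a batch size.
log_name() {
  printf 'main-bs-%s.log' "$1"
}

# Hypothetical helper: print the last line of an index file (default: logs/latest).
# Prints nothing if the file does not exist yet, e.g. before the first run.
latest_log() {
  local index="${1:-logs/latest}"
  if [ -f "$index" ]; then
    tail -n 1 "$index"
  fi
}

# Example usage (commented out so that sourcing this file launches nothing):
#   BS=512
#   ./main.sh "batch_size=${BS}" > "$(log_name "$BS")" 2>&1 &
#   tail -f "$(log_name "$BS")" $(latest_log)
```

Guarding on the existence of `logs/latest` avoids an error from `tail` when no run has written that file yet.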

Warning
If you run into package issues, try the following directly from a compute node:

module load conda
conda activate base
cd 07_largeScaleTraining/
python3 -m venv venv --system-site-packages
source venv/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install hydra-core hydra_colorlog
python3 -m pip install -e .
python3 -c 'import ai4sci; print(ai4sci.__file__)'
  • To run:
    cd src/ai4sci
    ./main.sh > main.log 2>&1 &
  • To view output:
    tail -f main.log $(tail -1 logs/latest)
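Before launching, it can help to confirm that the virtual environment created above is actually active. The check below is a minimal sketch: `in_venv` is a hypothetical helper name, and it relies only on the fact that sourcing a venv's `activate` script sets the `VIRTUAL_ENV` environment variable.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeed only if a Python virtual environment is active.
# `source venv/bin/activate` exports VIRTUAL_ENV; it is unset otherwise.
in_venv() {
  [ -n "${VIRTUAL_ENV:-}" ]
}

# Example usage (commented out so that sourcing this file launches nothing):
#   if in_venv; then
#     ./main.sh > main.log 2>&1 &
#   else
#     echo "No active venv; run: source venv/bin/activate" >&2
#   fi
```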