A package for speeding up inference of PyTorch modules with low-memory GPUs.
GPUs can make your PyTorch deep learning applications run significantly faster, but as models grow, it becomes harder and harder to fit them on this specialized hardware. Traditionally, if your model is larger than the GPU's memory (vRAM), you have to run the entire model on the much slower CPU. The goal of PyTorch Optimem is to take advantage of GPUs of any size, even when they cannot hold the entire model, through paging or chunk loading.
This currently focuses only on inference, but future versions may include similar techniques for training. PyTorch Optimem also supports paging and chunk loading on other devices (TPU, MPS, etc.), but has not been benchmarked on them.
Install the package using pip:
```
pip install --upgrade pytorch-optimem
```
PyTorch Optimem offers two modes, `page` or `chunk`. For both modes, the model must be in evaluation mode (run `model.eval()`) and all parameters must currently be on the CPU.
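For example, preparing a model so it satisfies both requirements looks like this (standard PyTorch/torchvision calls; the assertion is only an illustrative check):

```python
import torchvision

# Load a model; torchvision places the weights on the CPU by default.
model = torchvision.models.resnet101(pretrained=True)
model.eval()  # Optimem requires evaluation mode

# Illustrative check that every parameter is still on the CPU.
assert all(p.device.type == "cpu" for p in model.parameters())
```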
Paging mode pages the input model between RAM and vRAM during inference. At a customizable granularity, it moves chunks of the model to the GPU (or other device) on the fly, while the data stays on the GPU throughout. An illustration of the process is shown below.
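As a rough sketch of the idea (not Optimem's actual implementation), paging can be emulated with forward hooks that move each parameterised module to the GPU just before it runs and back to the CPU right after:

```python
import torch

def naive_paging(model: torch.nn.Module, device: torch.device = torch.device("cuda")) -> None:
    """Illustrative only: move each parameterised module to the GPU right
    before it runs and back to the CPU right after it finishes."""

    def to_device(module, inputs):            # forward pre-hook
        module.to(device)

    def back_to_cpu(module, inputs, output):  # forward hook
        module.to("cpu")

    for module in model.modules():
        # Only hook modules that own parameters directly (leaf-like modules).
        if any(True for _ in module.parameters(recurse=False)):
            module.register_forward_pre_hook(to_device)
            module.register_forward_hook(back_to_cpu)
```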
Usage:
```python
optimem.page(
    model: torch.nn.Module,                       # model to apply paging to
    device: torch.device = torch.device('cuda'),  # device to page from the CPU to
    max_layer_size: int = -1                      # maximum number of parameters a module may have before recursing into its children to determine paging granularity
) -> None
```
Example:
```python
import torchvision
import optimem

resnet = torchvision.models.resnet101(pretrained=True).eval()
optimem.page(resnet)
```
NOTE: The input data tensor must start on the GPU.

NOTE: `max_layer_size` exists to customize how much GPU memory is used. For example, ResNet has "block" modules containing multiple conv layers plus max pooling; if `max_layer_size` is set high enough, each block is paged all at once instead of each layer being paged individually.
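Putting both notes into practice, a hedged sketch: page at a coarser granularity via `max_layer_size` and call the model with its input already on the GPU (the threshold value and input shape below are only illustrative):

```python
import torch
import torchvision
import optimem

resnet = torchvision.models.resnet101(pretrained=True).eval()
optimem.page(resnet, max_layer_size=int(5e6))  # illustrative threshold: page whole blocks at once

x = torch.randn(1, 3, 512, 512, device="cuda")  # input must start on the GPU
with torch.no_grad():
    out = resnet(x)
```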
Performance results for passing 512x512 images through ResNet-101 are shown below:
Chunking mode identifies the largest chunk of the model that can be loaded onto the GPU and permanently stores it in vRAM. It also adds the pipelining that moves data to the right device as it crosses between the GPU-resident and CPU-resident parts of the model.
The first few layers are loaded onto the GPU so that part of the model can run faster.
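As a rough sketch of the idea (again, not Optimem's actual code), chunk loading amounts to moving the leading stages of a sequential pipeline (e.g. `list(model.children())`) onto the GPU until a parameter budget is used up, then transferring activations at the CPU/GPU boundary:

```python
import torch

def naive_chunking(stages, device=torch.device("cuda"), max_capacity=int(1e9)):
    """Illustrative only: move the leading stages of a sequential pipeline
    onto the GPU until the parameter budget is exhausted."""
    used = 0
    for stage in stages:
        n_params = sum(p.numel() for p in stage.parameters())
        if used + n_params > max_capacity:
            break
        stage.to(device)
        used += n_params


def run_pipeline(stages, x):
    """Run the pipeline, moving the activation to whichever device the next
    stage lives on (the boundary transfer the pipelining takes care of)."""
    for stage in stages:
        param = next(stage.parameters(), None)
        if param is not None:
            x = x.to(param.device)
        x = stage(x)
    return x
```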
Usage:
```python
optimem.chunk(
    model: torch.nn.Module,                       # model to apply chunking to
    device: torch.device = torch.device('cuda'),  # device to load the chunk onto from the CPU
    max_capacity: int = 1e9                       # total number of parameters that can be placed onto the GPU
) -> None
```
Example:
```python
import torchvision
import optimem

resnet = torchvision.models.resnet101(pretrained=True).eval()
optimem.chunk(resnet)
```
NOTE: The input data tensor must start on the CPU.

NOTE: A higher `max_capacity` increases performance but also increases GPU memory utilization.
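For instance, a smaller parameter budget can be requested at chunking time, and inference then starts from a CPU tensor (the budget value here is only an illustrative figure, roughly half of ResNet-101's parameters):

```python
import torch
import torchvision
import optimem

resnet = torchvision.models.resnet101(pretrained=True).eval()
optimem.chunk(resnet, max_capacity=int(2.2e7))  # illustrative budget (~half the parameters)

x = torch.randn(1, 3, 512, 512)  # input stays on the CPU
with torch.no_grad():
    out = resnet(x)
```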
Performance results for passing 512x512 images through ResNet-101, with half of the model chunk loaded onto the GPU, are shown below: