From bc567dde16217397fda927235ff86866bf610853 Mon Sep 17 00:00:00 2001
From: Mark Saroufim
Date: Mon, 29 Apr 2024 09:55:44 -0700
Subject: [PATCH] louder warning + docs for custom cuda extensions (#186)

* louder warning for missing cudatoolkit

* docs for custom ops
---
 README.md              |  7 ++++---
 setup.py               |  7 ++++++-
 torchao/csrc/README.md | 29 +++++++++++++++++++++++++++++
 3 files changed, 39 insertions(+), 4 deletions(-)
 create mode 100644 torchao/csrc/README.md

diff --git a/README.md b/README.md
index 275f9a5887..31a0e5e81e 100644
--- a/README.md
+++ b/README.md
@@ -21,7 +21,7 @@ pip install torchao
 ```Shell
 git clone https://github.com/pytorch-labs/ao
 cd ao
-pip install -e .
+python setup.py develop
 ```
 
 ## Key Features
@@ -35,17 +35,18 @@ The library provides
    * High level `autoquant` API and kernel auto tuner targeting SOTA performance across varying model shapes on consumer/enterprise GPUs.
 3. [Sparsity algorithms](./torchao/sparsity) such as Wanda that help improve accuracy of sparse networks
 4. Integration with other PyTorch native libraries like [torchtune](https://github.com/pytorch/torchtune) and [ExecuTorch](https://github.com/pytorch/executorch)
+5. [Custom C++/CUDA Extension support](./torchao/csrc/)
 
 ## Our Goals
 torchao embodies PyTorch’s design philosophy [details](https://pytorch.org/docs/stable/community/design.html), especially "usability over everything else". Our vision for this repository is the following:
 
-* Composability: Native solutions for optimization techniques that compose with both `torch.compile` and `FSDP` 
+* Composability: Native solutions for optimization techniques that compose with both `torch.compile` and `FSDP`
    * For example, for QLoRA for new dtypes support
 * Interoperability: Work with the rest of the PyTorch ecosystem such as torchtune, gpt-fast and ExecuTorch
 * Transparent Benchmarks: Regularly run performance benchmarking of our APIs across a suite of Torchbench models and across hardware backends
 * Heterogeneous Hardware: Efficient kernels that can run on CPU/GPU based server (w/ torch.compile) and mobile backends (w/ ExecuTorch).
-* Infrastructure Support: Release packaging solution for kernels and a CI/CD setup that runs these kernels on different backends. 
+* Infrastructure Support: Release packaging solution for kernels and a CI/CD setup that runs these kernels on different backends.
 
 ## Interoperability with PyTorch Libraries
diff --git a/setup.py b/setup.py
index eb5ba9e3be..3972cb2c76 100644
--- a/setup.py
+++ b/setup.py
@@ -37,7 +37,12 @@ def get_extensions():
     if debug_mode:
         print("Compiling in debug mode")
 
-    # TODO: And cudatoolkit is available
+    if not torch.cuda.is_available():
+        print("PyTorch GPU support is not available. Skipping compilation of CUDA extensions")
+    if CUDA_HOME is None and torch.cuda.is_available():
+        print("CUDA toolkit is not available. Skipping compilation of CUDA extensions")
+        print("If you'd like to compile CUDA extensions locally, please install the cudatoolkit from https://anaconda.org/nvidia/cuda-toolkit")
+
     use_cuda = torch.cuda.is_available() and CUDA_HOME is not None
     extension = CUDAExtension if use_cuda else CppExtension
diff --git a/torchao/csrc/README.md b/torchao/csrc/README.md
new file mode 100644
index 0000000000..b08a0a06ba
--- /dev/null
+++ b/torchao/csrc/README.md
@@ -0,0 +1,29 @@
+# Custom C++/CUDA Extensions
+
+This folder is an example of how to integrate your own custom kernels into ao so that
+1. They work on as many devices and operating systems as possible
+2. They compose with `torch.compile()` without graph breaks
+
+The goal is for you to focus on just writing your custom CUDA or C++ kernel while we handle packaging it up so it's available as `torchao.ops.your_custom_kernel`.
+
+## How to add your own kernel in ao
+
+We've integrated a test kernel which implements a non-maximum suppression (NMS) op that you can use as a template for your own kernels.
+
+1. Install the cudatoolkit from https://anaconda.org/conda-forge/cudatoolkit
+2. In `csrc/cuda`, author your custom kernel and register it via a `TORCH_LIBRARY_IMPL` so it's exposed as `torchao::your_custom_kernel`
+3. In `csrc/`, author a `cpp` stub containing a `TORCH_LIBRARY_FRAGMENT` that places your custom kernel in the `torchao.ops` namespace and exposes a public function with the right arguments
+4. In `torchao/ops.py`, expose the Python API that your end users will call (see the sketch after this list)
+5. Write a new test in `test/test_ops.py` that, most importantly, passes `opcheck()`; this ensures that your custom kernel composes out of the box with `torch.compile()`
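+
+The following is a minimal sketch of what steps 4 and 5 might look like for a hypothetical `torchao::my_custom_kernel` (the name, shapes, and abstract impl below are illustrative assumptions, not the actual NMS op). The abstract-impl decorator and the `opcheck` import path vary across PyTorch versions, so treat this as a sketch rather than the canonical implementation:
+
+```python
+# torchao/ops.py (sketch): a thin Python wrapper plus an abstract/"fake"
+# impl so torch.compile can trace the op without running the real kernel.
+# Assumes the C++/CUDA side has already registered torchao::my_custom_kernel.
+import torch
+from torch import Tensor
+
+def my_custom_kernel(input: Tensor) -> Tensor:
+    return torch.ops.torchao.my_custom_kernel(input)
+
+@torch.library.impl_abstract("torchao::my_custom_kernel")
+def _(input):
+    # Only describes output metadata (shape/dtype/device); no computation.
+    return torch.empty_like(input)
+
+# test/test_ops.py (sketch): opcheck runs schema, fake-tensor and
+# aot_autograd tests that catch most torch.compile composability bugs.
+from torch.testing._internal.optests import opcheck
+
+def test_my_custom_kernel():
+    x = torch.randn(32, device="cuda")
+    opcheck(torch.ops.torchao.my_custom_kernel.default, (x,), {})
+```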
+
+And that's it! Once CI passes and your code is merged, you'll be able to point people to `torchao.ops.your_custom_kernel`. If you're working on an interesting kernel and would like someone else to handle the release and package management, please feel free to open an issue. A usage sketch for the NMS test kernel appears at the end of this README.
+
+If you'd like to learn more, please check out [torch.library](https://pytorch.org/docs/main/library.html)
+
+## Required dependencies
+
+The important dependencies are already taken care of in our CI, so feel free to test in CI directly.
+
+1. The cudatoolkit, so you can build your own custom extensions locally. We highly recommend using https://anaconda.org/conda-forge/cudatoolkit for installation
+2. manylinux with CUDA support. In your own GitHub Actions workflows you can integrate this support using `uses: pytorch/test-infra/.github/workflows/linux_job.yml@main`
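+
+## Example usage
+
+To make the flow concrete, here is a hedged usage sketch for the NMS test kernel mentioned above. It assumes the torchvision-style signature (boxes, scores, IoU threshold); check `torchao/ops.py` for the authoritative one:
+
+```python
+import torch
+import torchao
+
+# Build well-formed (x1, y1, x2, y2) boxes with random scores.
+xy1 = torch.rand(64, 2, device="cuda") * 100
+boxes = torch.cat([xy1, xy1 + torch.rand(64, 2, device="cuda") * 10], dim=-1)
+scores = torch.rand(64, device="cuda")
+
+keep = torchao.ops.nms(boxes, scores, 0.5)  # indices of the boxes to keep
+
+# Because the op is registered through torch.library, it should also run
+# under torch.compile without graph breaks.
+compiled_nms = torch.compile(torchao.ops.nms)
+keep_compiled = compiled_nms(boxes, scores, 0.5)
+```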