lots gpu gosl updates
rcoreilly committed Sep 3, 2024
1 parent 20c7d23 commit 25a52cc
Showing 69 changed files with 380 additions and 291 deletions.
10 changes: 5 additions & 5 deletions GPU.md
@@ -1,6 +1,6 @@
# GPU: graphical processing unit implementation

This document provides detailed info about the GPU implementation of axon, which allows the same Go codebase to run on CPU and GPU. [gosl](https://github.com/goki/gosl) converts the existing Go code into HLSL shader code, along with hand-written HLSL glue code in `gpu_hlsl`, all of which ends up in the `shaders` directory. The `go generate` command in the `axon` subdirectory, or equivalent `make all` target in the `shaders` directory, must be called whenever the main codebase changes. The `.hlsl` files are compiled via `glslc` into SPIR-V `.spv` files that are embedded into the axon library and loaded by the [vgpu](https://github.com/goki/vgpu) Vulkan GPU framework.
This document provides detailed info about the GPU implementation of axon, which allows the same Go codebase to run on CPU and GPU. [gosl](https://github.com/goki/gosl) converts the existing Go code into WGSL shader code, along with hand-written WGSL glue code in `gpu_wgsl`, all of which ends up in the `shaders` directory. The `go generate` command in the `axon` subdirectory, or the equivalent `make all` target in the `shaders` directory, must be called whenever the main codebase changes. The generated `.wgsl` files are embedded into the axon library and loaded onto the GPU at runtime.
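As a rough illustration of this workflow (a minimal sketch based on the gosl directives that appear in the diff below; the file, package, and region names here are hypothetical), a Go source file marks which regions gosl should translate, pass through verbatim, or keep CPU-only:

```go
package example

// Hypothetical example.go showing the gosl region directives used in axon.

//gosl:start example

// SpikeThreshold and anything else between start/end markers is translated
// into shaders/example.wgsl.
const SpikeThreshold = 0.5

//gosl:end example

//gosl:nowgsl example

// CPUOnly is compiled into the Go build but excluded from the generated
// .wgsl file -- used where CPU and GPU need different implementations.
func CPUOnly() float32 { return SpikeThreshold }

//gosl:end example

//gosl:wgsl example
// Commented code in a wgsl block is emitted verbatim into the .wgsl file,
// e.g. #include lines or hand-written WGSL variants of the CPU-only code:
// #include "chans.wgsl"
//gosl:end example
```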

To add GPU support to an existing simulation, add these lines to the end of the `ConfigGUI` method to run in GUI mode:

@@ -141,12 +141,12 @@ Set: 3 Syns

The following Layer and Path level types contain most of the core algorithm specific code, and are used as a `uniform` constant data structure in the GPU shader code:

* `LayerParams` in `layerparams.go` has all the core algorithm parameters and methods that run on both the GPU and the CPU. This file is converted to `shaders/layerparams.hlsl` by [gosl](https://github.com/goki/gosl). All the methods must have args providing all of the state that is needed for the computation, which is supplied either by the GPU or CPU. The overall layer-level parameters are further defined in:
* `LayerParams` in `layerparams.go` has all the core algorithm parameters and methods that run on both the GPU and the CPU. This file is converted to `shaders/layerparams.wgsl` by [gosl](https://github.com/goki/gosl). All the methods must have args providing all of the state that is needed for the computation, which is supplied either by the GPU or CPU. The overall layer-level parameters are further defined in:
+ `ActParams` in `act.go` -- for computing spiking neural activation.
+ `InhibParams` in `inhib.go` -- for simulated inhibitory interneuron inhibition.
+ `LearnNeurParams` in `learn.go` -- learning-related functions at the neuron level.

* `PathParams` in `pathparams.go` has all the core algorithm parameters and methods that run on both the GPU and CPU, likewise converted to `shaders/pathparams.hlsl`. The specific params are in:
* `PathParams` in `pathparams.go` has all the core algorithm parameters and methods that run on both the GPU and CPU, likewise converted to `shaders/pathparams.wgsl`. The specific params are in:
+ `SynComParams` at the bottom of `act.go` -- synaptic communication params used in computing spiking activation.
+ `PathScaleParams` also at end of `act.go` -- pathway scaling params, for `GScale` overall value.
+ `SWtParams` in `learn.go` -- for initializing the slow and regular weight values -- most of the initial weight variation goes into SWt.
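A minimal, hypothetical sketch (not the actual axon types or method set) of the convention described above, where every method receives all the state it needs as explicit arguments so the same body can run on CPU and, after gosl translation, on GPU:

```go
package main

import "fmt"

// Simplified stand-ins for the real axon types; illustration only.
type Context struct {
	CyclesTotal int32
}

type ActParams struct {
	Gbar float32 // maximal conductance (illustrative)
}

// LayerParams methods take everything they need -- the Context, neuron index
// ni, data-parallel index di, and any current values -- as explicit arguments,
// rather than reaching into other objects.
type LayerParams struct {
	Acts ActParams
}

func (ly *LayerParams) GeInteg(ctx *Context, ni, di uint32, geRaw float32) float32 {
	// a real method would also read per-cycle state from ctx via ni, di
	return ly.Acts.Gbar * geRaw
}

func main() {
	ly := &LayerParams{Acts: ActParams{Gbar: 1.2}}
	ctx := &Context{}
	fmt.Println(ly.GeInteg(ctx, 0, 0, 0.5)) // prints 0.6
}
```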
@@ -175,14 +175,14 @@ Most of the relevant limits in Vulkan are listed in the [PhysicalDeviceLimits](h

## Compute threads

In HLSL, a compute shader is parameterized by `Dispatch` indexes (equivalent to the `thread block` concept in CUDA), which determine the total number and shape of parallel compute threads that run the given compute shader (kernel in CUDA). The threads are grouped together into a *Warp*, which shares memory access and is the minimum chunk of computation. Each HLSL shader has a `[numthreads(x, y, z)]` directive right before the `main` function specifying how many threads per dimension are executed in each warp: [HLSL numthreads doc](https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/sm5-attributes-numthreads). According to this [reddit/r/GraphicsProgramming post](https://www.reddit.com/r/GraphicsProgramming/comments/aeyfkh/for_compute_shaders_is_there_an_ideal_numthreads/),
In HLSL, a compute shader is parameterized by `Dispatch` indexes (equivalent to the `thread block` concept in CUDA), which determine the total number and shape of parallel compute threads that run the given compute shader (kernel in CUDA). The threads are grouped together into a *Warp*, which shares memory access and is the minimum chunk of computation. Each HLSL shader has a `[numthreads(x, y, z)]` directive right before the `main` function specifying how many threads per dimension are executed in each warp: [HLSL numthreads doc](https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/sm5-attributes-numthreads). According to this [reddit/r/GraphicsProgramming post](https://www.reddit.com/r/GraphicsProgramming/comments/aeyfkh/for_compute_shaders_is_there_an_ideal_numthreads/),
the hardware typically has 32 (NVIDIA, M1, M2) or 64 (AMD) hardware threads per warp, so 64 is typically used as the default total number of threads per warp (the product across all dimensions). Here are more [HLSL docs on dispatch](https://learn.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12graphicscommandlist-dispatch).
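For a 1D dispatch this means the number of warps (workgroups) to launch is just a ceiling division of the total work items by the threads per warp; a minimal sketch, assuming 64 threads per warp and that each kernel guards against out-of-range indexes (illustrative names, not the vgpu API):

```go
package main

import "fmt"

const threadsPerWarp = 64 // product of numthreads(x, y, z); 64 is the default used here

// numWarps returns how many warps (workgroups) must be dispatched so that
// numWarps*threadsPerWarp >= n; the kernel itself skips indexes >= n.
func numWarps(n int) int {
	return (n + threadsPerWarp - 1) / threadsPerWarp
}

func main() {
	fmt.Println(numWarps(1000)) // 16 -> 1024 threads; the extra 24 return immediately in the kernel
}
```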

Because of this lower hardware limit, the upper bound on threads per warp (numthreads `x*y*z`) is not that relevant, but it is given by `maxComputeWorkGroupInvocations`, and is typically 1024 for relevant hardware: [vulkan gpuinfo browser](https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxComputeWorkGroupInvocations&platform=all).

The limit on the *total number of threads* in any invocation is the main relevant limit, and is given by `maxComputeWorkGroupCount[x,y,z] * numthreads`. The 1D `x` dimension is generally larger than the other two (`y, z`), which are almost always 2^16-1 (64k), and it varies widely across platforms: [vulkan gpuinfo browser](https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxComputeWorkGroupCount[0]&platform=all), with ~2^16 (64k) or ~2^31 (2 Gi) being the modal values. It appears to be a largely software-defined value, as all Macs with a variety of discrete GPU types, including the M1, have 2^30 (1 Gi), whereas the same chips on other platforms can have a lower value (e.g., 64k). Modern desktop NVIDIA chips generally have 2 Gi.

Given that this limit is specified per dimension, it remains unclear exactly how all the dimensions add up into an overall total limit. Empirically, for both the Mac M1 and NVIDIA A100, the actual hard limit was 2 Gi for a 1D case -- invoking with more than that many threads resulted in a failure on the `gpu_test_synca.hlsl` shader run in the `TestGPUSynCa` test in `bench_lvis` varying `ndata` to push the memory and compute limits, without any diagnostic warning from the `vgpu.Debug = true` mode that activates Vulkan validation. [vgpu](https://github.com/goki/vgpu) now has a MaxComputeWorkGroupCount1D for the max threads when using just 1D (typical case) -- it is set to 2 Gi for Mac and NVIDIA.
Given that this limit is specified per dimension, it remains unclear exactly how the dimensions combine into an overall total limit. Empirically, for both the Mac M1 and NVIDIA A100, the actual hard limit was 2 Gi for a 1D case -- invoking with more than that many threads resulted in a failure on the `gpu_test_synca.wgsl` shader, run in the `TestGPUSynCa` test in `bench_lvis` while varying `ndata` to push the memory and compute limits, without any diagnostic warning from the `vgpu.Debug = true` mode that activates Vulkan validation. [vgpu](https://github.com/goki/vgpu) now has a `MaxComputeWorkGroupCount1D` limit for the max threads when using just 1D (the typical case) -- it is set to 2 Gi for Mac and NVIDIA.

To work around the limit, we simply launch multiple kernels, each with a push constant starting offset, to cover the full index space.
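A minimal sketch of that work-around, assuming a hypothetical `dispatch1D(offset, n)` launcher (not the actual vgpu API): the index space is covered by repeated launches, each given its starting offset as a push constant, so a thread's global index is the offset plus its local invocation index.

```go
package main

import "fmt"

const maxThreadsPerLaunch = 1 << 31 // 2 Gi, the empirical per-dispatch limit noted above

// dispatchAll covers totalThreads indexes with as many launches as needed,
// passing each launch its starting offset (as a push constant on the GPU).
func dispatchAll(totalThreads uint64, dispatch1D func(offset uint64, n uint32)) {
	for offset := uint64(0); offset < totalThreads; offset += maxThreadsPerLaunch {
		n := totalThreads - offset
		if n > maxThreadsPerLaunch {
			n = maxThreadsPerLaunch
		}
		dispatch1D(offset, uint32(n))
	}
}

func main() {
	dispatchAll(5_000_000_000, func(offset uint64, n uint32) {
		fmt.Printf("launch at offset %d with %d threads\n", offset, n)
	})
}
```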

8 changes: 4 additions & 4 deletions axon/act.go
@@ -15,10 +15,10 @@ import (
///////////////////////////////////////////////////////////////////////
// act.go contains the activation params and functions for axon

//gosl:hlsl act
// #include "chans.hlsl"
// #include "minmax.hlsl"
// #include "neuron.hlsl"
//gosl:wgsl act
// #include "chans.wgsl"
// #include "minmax.wgsl"
// #include "neuron.wgsl"
//gosl:end act

//gosl:start act
2 changes: 1 addition & 1 deletion axon/avgmax.go
@@ -132,7 +132,7 @@ func (am *AvgMaxI32) Calc(refIndex int32) {

//gosl:end avgmaxi

//gosl:hlsl avgmaxi
//gosl:wgsl avgmaxi
/*
// // AtomicUpdateAvgMaxI32 provides an atomic update using atomic ints
// // implemented by InterlockedAdd HLSL intrinsic.
42 changes: 21 additions & 21 deletions axon/context.go
@@ -26,12 +26,12 @@ var (
Networks []*Network
)

// note: the following nohlsl is included for the Go type inference processing
// but is then excluded from the final .hlsl file.
// note: the following nowgsl is included for the Go type inference processing
// but is then excluded from the final .wgsl file.
// this is key for cases where there are alternative versions of functions
// in GPU vs. CPU.

//gosl:nohlsl context
//gosl:nowgsl context

// NeuronVars

@@ -245,14 +245,14 @@ func (ctx *Context) CopyNetStridesFrom(srcCtx *Context) {

//gosl:end context

//gosl:hlsl context
// #include "etime.hlsl"
// #include "axonrand.hlsl"
// #include "neuron.hlsl"
// #include "synapse.hlsl"
// #include "globals.hlsl"
// #include "neuromod.hlsl"
//gosl:endhlsl context
//gosl:wgsl context
// #include "etime.wgsl"
// #include "axonrand.wgsl"
// #include "neuron.wgsl"
// #include "synapse.wgsl"
// #include "globals.wgsl"
// #include "neuromod.wgsl"
//gosl:endwgsl context

//gosl:start context

@@ -463,7 +463,7 @@ func (ctx *Context) CycleInc() {
ctx.CyclesTotal++
ctx.Time += ctx.TimePerCycle
ctx.SynCaCtr += 1
ctx.RandCtr.Add(uint32(RandFunIndexN))
// ctx.RandCtr.Add(uint32(RandFunIndexN)) TODO: gosl
}

// SlowInc increments the Slow counter and returns true if time
@@ -518,26 +518,26 @@ func (ctx *Context) GlobalVNFloats() uint32 {

// note: following is real code, uncommented by gosl

//gosl:hlsl context
//gosl:wgsl context

/*
// // NeuronVars
float NrnV(in Context ctx, uint ni, uint di, NeuronVars nvar) {
return Neurons[ctx.NeuronVars.Index(ni, di, nvar)];
fn NrnV(ctx: ptr<function,Context>, ni: u32, di: u32, nvar:NeuronVars) -> f32 {
return Neurons[NeuronVars_Index(ctx.NeuronVars, ni, di, nvar)];
}
void SetNrnV(in Context ctx, uint ni, uint di, NeuronVars nvar, float val) {
Neurons[ctx.NeuronVars.Index(ni, di, nvar)] = val;
fn SetNrnV(ctx: ptr<function,Context>, ni: u32, di: u32, nvar:NeuronVars, val: f32) {
Neurons[NeuronVars_Index(ctx.NeuronVars, ni, di, nvar)] = val;
}
void AddNrnV(in Context ctx, uint ni, uint di, NeuronVars nvar, float val) {
Neurons[ctx.NeuronVars.Index(ni, di, nvar)] += val;
fn AddNrnV(ctx: ptr<function,Context>, ni: u32, di: u32, nvar:NeuronVars, val: f32) {
Neurons[NeuronVars_Index(ctx.NeuronVars, ni, di, nvar)] += val;
}
void MulNrnV(in Context ctx, uint ni, uint di, NeuronVars nvar, float val) {
Neurons[ctx.NeuronVars.Index(ni, di, nvar)] *= val;
fn MulNrnV(ctx: ptr<function,Context>, ni: u32, di: u32, nvar:NeuronVars, val: f32) {
Neurons[NeuronVars_Index(ctx.NeuronVars, ni, di, nvar)] *= val;
}
bool NrnHasFlag(in Context ctx, uint ni, uint di, NeuronFlags flag) {
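For comparison with the WGSL accessors above, a hedged sketch of the corresponding CPU-side pattern in Go -- a flat `float32` buffer indexed by strides over neuron, data-parallel, and variable dimensions (hypothetical stride values and names; the real code lives in the `nowgsl` region of `context.go`):

```go
package main

import "fmt"

// NeuronVarStrides is a hypothetical stand-in for the stride structure used
// to index the flat Neurons buffer by neuron index, data index, and variable.
type NeuronVarStrides struct {
	Neuron, Var, Data uint32 // stride (in floats) for each dimension
}

func (ns *NeuronVarStrides) Index(ni, di, nvar uint32) uint32 {
	return ni*ns.Neuron + nvar*ns.Var + di*ns.Data
}

func main() {
	neurons := make([]float32, 4*3*2) // 4 neurons, 3 vars, 2 data-parallel copies
	ns := NeuronVarStrides{Neuron: 3 * 2, Var: 2, Data: 1}
	// SetNrnV / NrnV equivalents on the CPU are just strided writes / reads:
	neurons[ns.Index(2, 1, 0)] = 0.25
	fmt.Println(neurons[ns.Index(2, 1, 0)])
}
```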
4 changes: 2 additions & 2 deletions axon/globals.go
@@ -328,8 +328,8 @@ const (

//gosl:end globals

//gosl:hlsl globals
//gosl:wgsl globals
/*
static const GlobalVars GlobalVarsN = GvVSMatrixPoolGated + 1;
const GlobalVarsN: GlobalVars = GvVSMatrixPoolGated + 1;
*/
//gosl:end globals
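The `GlobalVarsN` constant above follows the usual pattern of defining the enum count as the last value plus one; a minimal hypothetical Go sketch (the real `GlobalVars` list is much longer):

```go
package main

import "fmt"

// GlobalVars is a shortened, hypothetical version of the axon enum; the real
// list ends with GvVSMatrixPoolGated, so GlobalVarsN = GvVSMatrixPoolGated + 1.
type GlobalVars int32

const (
	GvRew GlobalVars = iota
	GvVSMatrixPoolGated
	GlobalVarsN // total number of global variables
)

func main() {
	fmt.Println(int(GlobalVarsN)) // 2 in this shortened sketch
}
```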
88 changes: 57 additions & 31 deletions axon/gpu.go
@@ -18,56 +18,82 @@ import (
vk "github.com/goki/vulkan"
)

//go:embed shaders/*.spv
//go:embed shaders/*.wgsl
var content embed.FS

//go:generate gosl -exclude=Update,UpdateParams,Defaults,AllParams,ShouldDisplay cogentcore.org/core/math32/fastexp.go cogentcore.org/core/math32/minmax ../chans/chans.go ../chans ../kinase ../fsfffb/inhib.go ../fsfffb github.com/emer/emergent/v2/etime github.com/emer/emergent/v2/ringidx rand.go avgmax.go neuromod.go globals.go context.go neuron.go synapse.go pool.go layervals.go act.go act_path.go inhib.go learn.go layertypes.go layerparams.go deep_layers.go rl_layers.go rubicon_layers.go pcore_layers.go pathtypes.go pathparams.go deep_paths.go rl_paths.go rubicon_paths.go pcore_paths.go hip_paths.go gpu_hlsl
//go:generate gosl -exclude=Update,UpdateParams,Defaults,AllParams,ShouldDisplay cogentcore.org/core/math32/fastexp.go cogentcore.org/core/math32/minmax ../chans/chans.go ../chans ../kinase ../fsfffb/inhib.go ../fsfffb github.com/emer/emergent/v2/etime github.com/emer/emergent/v2/ringidx rand.go avgmax.go neuromod.go globals.go context.go neuron.go synapse.go pool.go layervals.go act.go act_path.go inhib.go learn.go layertypes.go layerparams.go deep_layers.go rl_layers.go rubicon_layers.go pcore_layers.go pathtypes.go pathparams.go deep_paths.go rl_paths.go rubicon_paths.go pcore_paths.go hip_paths.go gpu_wgsl/gpu_applyext.wgsl

// Full vars code -- each gpu_*.hlsl uses a subset
// Full vars code -- each gpu_*.wgsl uses a subset

/*
// note: binding is var, set
// Set 0: uniform layer params -- could not have paths also be uniform..
[[vk::binding(0, 0)]] StructuredBuffer<LayerParams> Layers; // [Layer]
[[vk::binding(1, 0)]] StructuredBuffer<PathParams> Paths; // [Layer][SendPaths]
@group(0) @binding(0)
var<storage, read_write> Layers: array<LayerParams>;
@group(0) @binding(1)
var<storage, read_write> Paths: array<PathParams>;
// Set 1: effectively uniform indexes and path params as structured buffers in storage
[[vk::binding(0, 1)]] StructuredBuffer<uint> NeuronIxs; // [Neurons][Indexes]
[[vk::binding(1, 1)]] StructuredBuffer<uint> SynapseIxs; // [Layer][SendPaths][SendNeurons][Syns]
[[vk::binding(2, 1)]] StructuredBuffer<StartN> SendCon; // [Layer][SendPaths][SendNeurons]
[[vk::binding(3, 1)]] StructuredBuffer<uint> RecvPathIndexes; // [Layer][RecvPaths]
[[vk::binding(4, 1)]] StructuredBuffer<StartN> RecvCon; // [Layer][RecvPaths][RecvNeurons]
[[vk::binding(5, 1)]] StructuredBuffer<uint> RecvSynIndexes; // [Layer][RecvPaths][RecvNeurons][Syns]
@group(1) @binding(0)
var<storage, read_write> NeuronIxs: array<u32>; // [Neurons][Indexes]
@group(1) @binding(1)
var<storage, read_write> SynapseIxs: array<u32>; // [Layer][SendPaths][SendNeurons][Syns]
@group(1) @binding(2)
var<storage, read_write> SendCon: array<StartN>; // [Layer][SendPaths][SendNeurons]
@group(1) @binding(3)
var<storage, read_write> RecvPathIndexes: array<u32>; // [Layer][RecvPaths]
@group(1) @binding(4)
var<storage, read_write> RecvCon: array<StartN>; // [Layer][RecvPaths][RecvNeurons]
@group(1) @binding(5)
var<storage, read_write> RecvSynIndexes: array<u32>; // [Layer][RecvPaths][RecvNeurons][Syns]
// Set 2: main network structs and vals -- all are writable
[[vk::binding(0, 2)]] RWStructuredBuffer<Context> Ctx; // [0]
[[vk::binding(1, 2)]] RWStructuredBuffer<float> Neurons; // [Neurons][Vars][Data]
[[vk::binding(2, 2)]] RWStructuredBuffer<float> NeuronAvgs; // [Neurons][Vars]
[[vk::binding(3, 2)]] RWStructuredBuffer<Pool> Pools; // [Layer][Pools][Data]
[[vk::binding(4, 2)]] RWStructuredBuffer<LayerValues> LayValues; // [Layer][Data]
[[vk::binding(5, 2)]] RWStructuredBuffer<float> Globals; // [NGlobals]
[[vk::binding(6, 2)]] RWStructuredBuffer<float> Exts; // [In / Out Layers][Neurons][Data]
@group(2) @binding(0)
var<storage, read_write> Ctx: array<Context>; // [0]
@group(2) @binding(1)
var<storage, read_write> Neurons: array<f32>; // [Neurons][Vars][Data]
@group(2) @binding(2)
var<storage, read_write> NeuronAvgs: array<f32>; // [Neurons][Vars]
@group(2) @binding(3)
var<storage, read_write> Pools: array<f32>; // [Layer][Pools][Data]
@group(2) @binding(4)
var<storage, read_write> LayValues: array<LayerValues>; // [Layer][Data]
@group(2) @binding(5)
var<storage, read_write> Globals: array<f32>; // [NGlobals]
@group(2) @binding(6)
var<storage, read_write> Exts: array<f32>; // [In / Out Layers][Neurons][Data]
// There might be a limit of 8 buffers per set -- can't remember..
// Set 3: synapse vars
[[vk::binding(0, 3)]] RWStructuredBuffer<int> GBuf; // [Layer][RecvPaths][RecvNeurons][MaxDel+1][Data]
[[vk::binding(1, 3)]] RWStructuredBuffer<float> GSyns; // [Layer][RecvPaths][RecvNeurons][Data]
[[vk::binding(2, 3)]] RWStructuredBuffer<float> Synapses; // [Layer][SendPaths][SendNeurons][Syns]
@group(3) @binding(0)
var<storage, read_write> GBuf: array<i32>; // [Layer][RecvPaths][RecvNeurons][MaxDel+1][Data]
@group(3) @binding(1)
var<storage, read_write> GSyns: array<f32>; // [Layer][RecvPaths][RecvNeurons][Data]
@group(3) @binding(2)
var<storage, read_write> Synapses: array<f32>; // [Layer][SendPaths][SendNeurons][Syns]
// todo: future expansion to add more tranches of Synapses
// Set 4: SynCa -- can only access in 2^31 chunks
[[vk::binding(0, 4)]] RWStructuredBuffer<float> SynapseCas; // [Layer][SendPaths][SendNeurons][Syns][Data]
[[vk::binding(1, 4)]] RWStructuredBuffer<float> SynapseCas1; // [Layer][SendPaths][SendNeurons][Syns][Data]
[[vk::binding(2, 4)]] RWStructuredBuffer<float> SynapseCas2; // [Layer][SendPaths][SendNeurons][Syns][Data]
[[vk::binding(3, 4)]] RWStructuredBuffer<float> SynapseCas3; // [Layer][SendPaths][SendNeurons][Syns][Data]
[[vk::binding(4, 4)]] RWStructuredBuffer<float> SynapseCas4; // [Layer][SendPaths][SendNeurons][Syns][Data]
[[vk::binding(5, 4)]] RWStructuredBuffer<float> SynapseCas5; // [Layer][SendPaths][SendNeurons][Syns][Data]
[[vk::binding(6, 4)]] RWStructuredBuffer<float> SynapseCas6; // [Layer][SendPaths][SendNeurons][Syns][Data]
[[vk::binding(7, 4)]] RWStructuredBuffer<float> SynapseCas7; // [Layer][SendPaths][SendNeurons][Syns][Data]
@group(4) @binding(0)
var<storage, read_write> SynapseCas: array<f32>; // [Layer][SendPaths][SendNeurons][Syns][Data]
@group(4) @binding(1)
var<storage, read_write> SynapseCas1: array<f32>;
@group(4) @binding(2)
var<storage, read_write> SynapseCas2: array<f32>;
@group(4) @binding(3)
var<storage, read_write> SynapseCas3: array<f32>;
@group(4) @binding(4)
var<storage, read_write> SynapseCas4: array<f32>;
@group(4) @binding(5)
var<storage, read_write> SynapseCas5: array<f32>;
@group(4) @binding(6)
var<storage, read_write> SynapseCas6: array<f32>;
@group(4) @binding(7)
var<storage, read_write> SynapseCas7: array<f32>;
Set: 0
Role: Storage
@@ -163,7 +189,7 @@ type GPU struct {
// for sequencing commands
Semaphores map[string]vk.Semaphore `display:"-"`

// number of warp threads -- typically 64 -- must update all hlsl files if changed!
// number of warp threads -- typically 64 -- must update all wgsl files if changed!
NThreads int `display:"-" inactive:"-" default:"64"`

// maximum number of bytes per individual storage buffer element, from GPUProps.Limits.MaxStorageBufferRange