Git revision #5a9dfdd7 Written by alex_s168
Modified at 26. August 2025 21:13
GPU Architecture
GPUs consist of multiple (these days at least 32) compute units (= CU).
Each compute unit has multiple SIMD units, also called “wave”, “wavefront” or “warp”. Compute units also have some fast local memory (tens of kilobytes), main memory access queues, texture units, a scalar unit, and other features.
The main memory (graphics memory) is typically outside of the GPU, and is slow, but high-bandwidth memory.
Waves
A wave is a SIMD processing unit consisting of typically 32 “lanes” (sometimes called threads).
Each wave in a CU has its own control flow, and the waves don’t have to be related to each other.
Instructions that waves support:
Since only the whole wave can do control flow, and not each lane, all operations can be masked so that they only apply to specific lanes.
=> waves are really similar to SIMD on modern CPUs
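As a rough C model of that idea (just an illustration, not how any real GPU implements it), masking a single operation over one 32-lane wave looks like this:

// Every lane computes both candidate values; the mask decides which one each lane keeps.
void masked_select(float out[32], const int mask[32],
                   const float if_true[32], const float if_false[32])
{
    for (int lane = 0; lane < 32; ++lane)
        out[lane] = mask[lane] ? if_true[lane] : if_false[lane];
}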
In modern GPUs, instruction execution in waves is superscalar: there are multiple execution units for different kinds of instructions, and several instructions can execute at once if there are free execution units and the instructions don’t depend on each other.
We’ll be exploring that in a future article.
Local memory
The local memory inside GPUs is banked, typically into 32 banks. The memory word size is typically 32 bits.
The addresses are interleaved, so for two banks, all even-numbered words would go to bank 0 and all odd-numbered words to bank 1.
Each bank has a dedicated access port, so for 32 banks, you get 32 access ports.
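More generally, with 32 banks the word at index i lives in bank i mod 32. Assuming byte addressing and 4-byte words (an assumption; we haven’t pinned the addressing down), that means:

bank = (byte_address / 4) % 32;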
The lanes of the waves inside a CU get routed to the local memory banks magically. Each memory word holds exactly one f32, so when a wave performs some_f32_array[lane_id()], all 32 banks can be used at the same time.
Why multiple waves share the same local memory
A wave doesn’t access memory on every instruction; it also spends cycles on computation. This means that there are cycles where the memory isn’t doing anything.
By making multiple waves share the same local memory and access ports, you save resources.
Global memory
Since global memory reads/writes are really slow, they happen asynchronously.
This means that a wave requests an access, then can continue executing, and then eventually waits for that access to finish.
Because of this, modern compilers automagically start the access before the data is needed, and then wait for the data later on.
Scalar unit
Most newer GPUs also have a scalar unit for saving energy when performing simple operations.
When the controller sees a scalar instruction in the code running on a wave, it automatically makes the code run on the scalar unit.
The scalar unit can be used for values that are the same across all lanes, for example loop counters, base addresses, and branch conditions.
GPU Programming Terminology
All work items in a work group run on the same CU and can share its local memory; __local memory in OpenCL maps to this.
OpenCL and other APIs let you specify both the total number of work items (the global size) and the number of work items per work group (the local size).
Since a program might specify more work items per work group than we have SIMD lanes available, the compiler needs to be able to put multiple work items onto one SIMD lane.
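For example, the tiled matmul kernel later in this article is launched from the host roughly like this (queue and kernel are assumed to have been created already):

size_t global[2] = { 128, 128 }; // total work items per dimension
size_t local[2]  = { 8, 8 };     // work items per work group -> 128/8 * 128/8 = 256 work groups
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);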
Our own architecture
We’ll go with these specs for now:
Note that we won’t specify the exact instruction encoding.
Predefined Constants
We will pre-define 16 constants (as virtual vector registers):
zero: 0,0,0,…
one: 1,1,1,…
sid: 0,1,2,3,4,5,6,…
wave: the ID of the wave in the compute task, broadcasted to all elements
u8_max: 255,255,…
n2nd: 1,2,1,2,…
n3rd: 1,2,4,1,…
n4th: 1,2,4,8,1,…
lo16: 1,1,1,… (x16) 0,0,0,… (x16)
ch2: 1,1,0,0,1,1,…
ch4: 1,1,1,1,0,0,0,0,1,…
alo8: 1 (x8) 0 (x8) 1 (x8) 0 (x8)
Operands
We define the following instruction operands:
Vreg: vector register
M: (read only) vector gp reg as mask (1b). Only the first 32 registers can be used as a mask. The operand consists of two masks and-ed together, each of which can conditionally be inverted first. This means that this operand takes up 12 bits.
Vany: Vreg or M
Simm: immediate scalar value
Sreg: the first element of a vector register, as scalar
Sany: a Simm or an Sreg
dist: a Vany, or a Sany broadcasted to each element
sig: one of the 16 completion signal slots
Data Movement
fn mov(out out: Vreg, in wrmask: M, in val: dist)
fn select(out out: Vreg, in select: M, in false: dist, in true: dist)
fn first_where_true(out out: Sreg, in where: M, in values: dist): if none of the elements are true, it doesn’t overwrite the previous value in out.
Mathematics
u32, i32, and f32 elementwise arithmetic and logic operations:
fn add<u32>(out out: Vreg, in left: Vany, in right: dist)
fn add<u32>(out out: Sreg, in left: Sany, in right: Sany)
Local memory
fn local_load32(out out: Vreg, in mask: M, in addr: Vreg)
fn local_store32(in addr: Vreg, in mask: M, in val: Vany)
Global (async) memory
fn global_load32(out sig: sig, out out: Vreg, in mask: M, in addr: Vreg)
local_store32
fn global_store32(out sig: sig, in addr: Vreg, in mask: M, in val: Vany)
fn sig_done1(out r: Sreg, in sig: sig)
fn sig_done2(out r: Sreg, in sig1: sig, in sig2: sig)
fn sig_wait(out r: Sreg, in sig: sig)
fn sig_waitall2(out r: Sreg, in sig1: sig, in sig2: sig)
fn sig_waitall3(out r: Sreg, in sig1: sig, in sig2: sig, in sig3: sig)
fn sig_waitall4(out r: Sreg, in sig1: sig, in sig2: sig, in sig3: sig, in sig4: sig)
As a future extension, we could add an instruction that waits for any of the given signals to complete, and then jumps to a specific location, depending on which of those completed.
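To make the asynchronous global accesses concrete, here is a sketch of how a hand-compiled (or compiler-scheduled) sequence might use them; the register, mask and signal names (v0, v1, v2, mask, s0, sig0) and the call-like syntax are made up, since we don’t specify an assembly syntax:

global_load32(sig0, v1, mask, v0) // start the load as early as possible
add<u32>(v2, v2, one)             // unrelated work while the load is in flight
sig_wait(s0, sig0)                // wait until the loaded data has actually arrived
add<u32>(v2, v2, v1)              // only now is it safe to use v1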
Control flow (whole wave)
fn brz(in dest: Simm, in val: Sany)
fn brnz(in dest: Simm, in val: Sany)
fn br_all(in dest: Simm, in cond: M)
fn br_any(in dest: Simm, in cond: M)
Hand-compiling code
Now that we’ve decided on a simple compute-only GPU architecture, we can try hand-compiling an OpenCL program.
I asked an LLM to produce an N*N matmul example (comments written manually):
// convenient number for our specific hardware
#define TILE_SIZE 8
// this kernel will be launched with dimensions:
// global[2] = { 128,128 } = { N, N };
// local[2] = { 8,8 } = { TILE_SIZE, TILE_SIZE };
__kernel void matmul_tiled(
    __global float* A,
    __global float* B,
    __global float* C,
    const int N)
{
    int row = get_global_id(1);      // y
    int col = get_global_id(0);      // x
    int local_row = get_local_id(1); // y
    int local_col = get_local_id(0); // x

    __local float Asub[TILE_SIZE][TILE_SIZE];
    __local float Bsub[TILE_SIZE][TILE_SIZE];

    float sum = 0.0f;

    for (int t = 0; t < N / TILE_SIZE; ++t) {
        // load tiles into local
        int tiledRow = row;
        int tiledCol = t * TILE_SIZE + local_col;

        float av;
        if (tiledRow < N && tiledCol < N)
            av = A[tiledRow * N + tiledCol];
        else
            av = 0.0f;
        Asub[local_row][local_col] = av;

        tiledRow = t * TILE_SIZE + local_row;
        tiledCol = col;

        float bv;
        if (tiledRow < N && tiledCol < N)
            bv = B[tiledRow * N + tiledCol];
        else
            bv = 0.0f;
        Bsub[local_row][local_col] = bv;

        // sync local access across local grp
        barrier(CLK_LOCAL_MEM_FENCE);

        for (int k = 0; k < TILE_SIZE; ++k)
            sum += Asub[local_row][k] * Bsub[k][local_col];

        // sync local access across local grp
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (row < N && col < N)
        C[row * N + col] = sum;
}
First, we have to decide on how we want to map the kernel to the hardware.
Since the local dimension of the kernel is 8*8, which is 64, we can map each local group to one CU, by mapping 32 work items to one wave and using both waves available on one CU for the local group.
Our global dimension is 128*128, which means we would need 256 compute units to run every work group at the same time. Since we probably don’t have 256 compute units, GPUs, including ours, have an on-hardware task scheduler for scheduling work groups onto compute units.
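As a taste of what the hand-compiled kernel might start with, here is a sketch of the per-lane index setup. It assumes that and<u32>, shr<u32> and mul<u32> exist among the elementwise arithmetic/logic operations (only add was spelled out above), that a local group’s two waves get consecutive wave IDs, and that v0, v1, v2 are ordinary vector registers:

and<u32>(v0, wave, 1)   // which of the group's two waves this is (0 or 1)
mul<u32>(v0, v0, 32)    // offset of this wave's lanes within the 8*8 group
add<u32>(v0, v0, sid)   // local_id = 32 * (wave within group) + lane id
and<u32>(v1, v0, 7)     // local_col = local_id & 7
shr<u32>(v2, v0, 3)     // local_row = local_id >> 3

From there, row, col and the addresses into A and B are more adds and multiplies, the bounds checks become masks, and the tile loads become global_load32 plus a sig_wait, as sketched earlier.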
Outro
Modern GPUs are really complex, but designing a simple GPU is not that hard.
Subscribe to the Atom feed to get notified of future articles.