(This is the second attempt at laying out the circuit.)
Step 1: Sketching
While starting on this, I realised that one wire always overlaps with one node triangle unless I cheated. Here is a (rather inaccurate) visual representation of this:
This means that I had to modify the layout from step 0 a bit, which is unfortunate, but in retrospect I think it makes the result look better:
That, however, takes up too much space, so I did another variation:
I also did yet another variation here, but decided not to use it.
Step 2: Preparation for coloring
I colored the back side of the piece of paper containing the sketches with a pencil, put a white piece of paper behind it, and then re-traced the lines to get a lighter version of the sketch onto the white paper.
Then I used modern technology (a copier) to copy that piece of paper multiple times, and also to scale it up (to allow for more detail).
Step 3: Coloring
It was a disaster…
Some variants actually look nice, but only in parts.
Step 4: Outsourcing the coloring
After some time, I just gave up, and decided to ask my sister for help…
I only told her (translated):
Can you please color this? It's supposed to be a circuit, and it will be a small logo for a website. The website is mainly black and white, but this (context: persian blue) blue would work too.
And less than half a minute later, she came up with this:
We considered that the logo would end up being quite small, so “we” wanted it to look good when zoomed out. Her coloring is a pretty nice idea for that, because the differently colored wires end up blending together nicely.
I put that into the scanner, and meanwhile she experimented with different filling styles.
Then she came up with this (the final version):
Filling the drawing only took her about 20 seconds!
Step 5: Digital Modifications
As a last step, I digitally removed some of the sketch lines and some minor imperfections.
Conclusion
I like the final result a lot (as a small logo), but it’s a bit too detailed for a favicon.
I will revisit this topic in the future.
Note that the PDF version of this page might look a bit better styling-wise.
Introduction
In this article, we’ll be looking into the hardware of GPUs, and then designing our own. Specifically, GPUs with a unified shader architecture.
Comparison with CPUs
GPUs focus on operating on a lot of data at once (triangles, vertices, pixels, …), while CPUs focus on high single-core performance and low latency.
GPU Architecture
GPUs consist of multiple (these days at least 32) compute units (CUs).
Each compute unit has multiple SIMD units, also called “waves”, “wavefronts” or “warps”. Compute units also have some fast local memory (tens of kilobytes), main memory access queues, texture units, a scalar unit, and other features.
The main memory (graphics memory) is typically outside of the GPU and is slow but high-bandwidth.
Waves
A wave is a SIMD processing unit consisting of typically 32 “lanes” (sometimes called threads).
Each wave in a CU has its own control flow, and the waves don’t have to be related to each other.
Instructions that waves support:
arithmetic operations
cross-lane data movement
CU-local and global memory access: each SIMD lane can access a completely different address, similar to CPU gather/scatter.
synchronization with other CUs in the work group (see future article)
Since only the whole wave can do control flow, and not each lane, all operations can be masked so that they only apply to specific lanes.
=> waves are really similar to SIMD on modern CPUs
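To make the masking idea concrete, here is a tiny Python emulation (just lists standing in for a 4-lane wave; not the real ISA) of how a per-lane if/else becomes “compute both sides, then select by mask”:

# Per-lane condition on a 4-lane wave: negate negative values.
x = [5, -3, 7, -1]
mask = [v < 0 for v in x]          # per-lane condition: x < 0

negated = [-v for v in x]          # "then" side, computed for every lane
unchanged = x                      # "else" side

# The mask picks which result each lane keeps (like the select instruction defined later).
result = [n if m else u for m, n, u in zip(mask, negated, unchanged)]
print(result)  # [5, 3, 7, 1]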
In modern GPUs, instruction execution in waves is superscalar, so there are multiple different execution units for executing different kinds of instructions, and multiple instructions can be executed at once, if there are free execution units, and they don’t depend on each other.
We’ll be exploring that in a future article.
Local memory
The local memory inside GPUs is banked, typically into 32 banks. The memory word size is typically 32 bits.
The addresses are interleaved, so for two banks:
addr 0 => bank 0
addr 1 => bank 1
addr 2 => bank 0
addr 3 => bank 1
…
Each bank has a dedicated access port, so for 32 banks, you get 32 access ports.
The lanes of the waves inside a CU get routed to the local memory banks magically.
Why are the banks interleaved?
When the whole wave wants to read a contiguous array of f32 values, i.e. when each lane performs some_f32_array[lane_id()], all 32 banks can be used at the same time.
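A minimal Python sketch of this bank mapping (32 banks, 32-bit words, as above; bank_of is a made-up helper, not part of any real ISA):

NUM_BANKS = 32

def bank_of(word_addr):
    # With interleaved addressing, consecutive word addresses land in consecutive banks.
    return word_addr % NUM_BANKS

# Each lane reads some_f32_array[lane_id()]: word addresses base+0 .. base+31.
base = 0
addrs = [base + lane for lane in range(32)]
banks = [bank_of(a) for a in addrs]
assert len(set(banks)) == 32  # every access hits a different bank, so no conflicts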
Why multiple waves share the same local memory
A wave doesn’t access memory on every instruction; it also does computations. This means that there are cycles where the memory isn’t doing anything.
By making multiple waves share the same local memory and access ports, you save resources.
Global memory
Since global memory reads/writes are really slow, they happen asynchronously.
This means that a wave requests an access, then can continue executing, and then eventually waits for that access to finish.
Because of this, modern compilers automagically start the access before the data is needed, and then wait for the data later on.
Scalar unit
Most newer GPUs also have a scalar unit for saving energy when performing simple operations.
When the controller sees a scalar instruction in the code running on a wave, it automatically executes that instruction on the scalar unit.
The scalar unit can be used for:
address calculation
partial reductions
execution of expensive operations not implemented on SIMD because of costs
GPU Programming Terminology
“work item”: typically maps to a SIMD lane
“kernel”: the code for a work item
“work group”: consists of multiple work items. Typically maps to a CU. The __local memory in OpenCL applies to this.
“compute task”: a set of work groups
OpenCL and other APIs let you specify both the number of work groups and work items.
Since a program might specify a higher number of work items per work group than we have available, the compiler needs to be able to put multiple work items onto one SIMD lane.
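Here is a rough Python sketch of that mapping, using the lane and wave counts of the architecture defined in the next section (32 lanes per wave, 2 waves per CU); the function name and the ceiling division are just for illustration:

LANES_PER_WAVE = 32
WAVES_PER_CU = 2
LANES_PER_CU = LANES_PER_WAVE * WAVES_PER_CU  # 64 work items fit "natively" per CU

def iterations_per_lane(work_items_per_group):
    # If the program asks for more work items per group than we have lanes,
    # each lane has to process several work items one after another.
    return -(-work_items_per_group // LANES_PER_CU)  # ceiling division

print(iterations_per_lane(64))   # 1
print(iterations_per_lane(256))  # 4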
Our own architecture
We’ll go with these specs for now:
N compute units
2 waves per CU
32 lanes per wave.
1 KiB local memory per lane => 64 KiB per CU
48 vector registers of 16x32b per wave
one scalar unit per CU
128 global memory ports
16 async task completion “signal” slots per wave
no fancy out of order or superscalar execution
support standard 32 bit floating point, without exceptions.
Note that we won’t specify the exact instruction encoding.
Predefined Constants
We will pre-define 16 constants (as virtual vector registers):
zero
one
sid: 0,1,2,3,4,5,6
wave: the ID of the wave in the compute task, broadcasted to all elements.
u8_max: 255,255,…
n2nd: 1,2,1,2,…
n3rd: 1,2,4,1,…
n4th: 1,2,4,8,1,…
lo16: 1,1,1,… (x16) 0,0,0,… (x16)
ch2: 1,1,0,0,1,1,…
ch4: 1,1,1,1,0,0,0,0,1,…
alo8: 1 (x8) 0 (x8) 1 (x8) 0 (x8)
a few reserved ones
Operands
We define the following instruction operands:
Vreg: vector register
M: (read-only) vector GP register used as a mask (1 bit per element). Only the first 32 registers can be used as a mask. The operand consists of two masks AND-ed together, each of which can conditionally be inverted first; this means that this operand takes up 12 bits
Vany: Vreg or M
Simm: immediate scalar value
Sreg: the first element of a vector register, as scalar
Sany: a Simm or an Sreg
dist: Vany, or a Sany broadcasted to each element
sig: one of the 16 completion signal slots
Instructions
We will add more instructions in future articles.
Data Movement
fn mov(out out: Vreg, in wrmask: M, in val: dist)
fn select(out out: Vreg, in select: M, in false: dist, in true: dist)
fn first_where_true(out out: Sreg, in where: M, in values: dist): if none of the elements are true, it doesn’t overwrite the previous value in out.
cross-lane operations: not important for this article
Mathematics
simple (unmasked) u32, i32, and f32 elementwise arithmetic and logic operations: fn add<u32>(out out: Vreg, in left: Vany, in right: dist)
scalar arithmetic and logic operations: fn add<u32>(out out: Sreg, in left: Sany, in right: Sany)
partial reduction operations: “chunks” the input with a size of 8, reduces each chunk, and stores the result in the first element of the chunk. This means that every 8th element will contain a partial result (a small sketch follows after this list).
and operations to finish that reduction into the first element of the vector
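As referenced in the list above, here is a small Python emulation of that two-step reduction, with a chunk size of 8 and addition as the (assumed) reduction operator:

CHUNK = 8

def partial_reduce_add(v):
    # Reduce each chunk of 8 elements into the first element of that chunk.
    out = list(v)
    for start in range(0, len(v), CHUNK):
        out[start] = sum(v[start:start + CHUNK])
    return out

def finish_reduce_add(v):
    # Combine the per-chunk partial results (every 8th element) into a single value.
    return sum(v[start] for start in range(0, len(v), CHUNK))

vec = list(range(32))
print(finish_reduce_add(partial_reduce_add(vec)))  # 496 == sum(range(32))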
Local memory
load 32 bit value at each elem where mask is true: fn local_load32(out out: Vreg, in mask: M, in addr: Vreg)
store 32 bit value at each elem where mask is true: fn local_store32(in addr: Vreg, in mask: M, in val: Vany)
Global (async) memory
start an async global load, and make the given signal correspond to the completion of the access: load 32 bit value at each elem where mask is true: fn global_load32(out sig: sig, out out: Vreg, in mask: M, in addr: Vreg)
see above and local_store32: fn global_store32(out sig: sig, in addr: Vreg, in mask: M, in val: Vany)
fn sig_done1(out r: Sreg, in sig: sig)
fn sig_done2(out r: Sreg, in sig1: sig, in sig2: sig)
fn sig_wait(out r: Sreg, in sig: sig)
fn sig_waitall2(out r: Sreg, in sig1: sig, in sig2: sig)
fn sig_waitall3(out r: Sreg, in sig1: sig, in sig2: sig, in sig3: sig)
fn sig_waitall4(out r: Sreg, in sig1: sig, in sig2: sig, in sig3: sig, in sig4: sig)
As a future extension, we could add an instruction that waits for any of the given signals to complete, and then jumps to a specific location, depending on which of those completed.
Control flow (whole wave)
branch if scalar is zero: fn brz(in dest: Simm, in val: Sany)
branch if scalar is not zero: fn brnz(in dest: Simm, in val: Sany)
branch on the whole wave if each element has a true value for the mask: fn br_all(in dest: Simm, in cond: M)
branch on the whole wave if any element has a true value for the mask: fn br_any(in dest: Simm, in cond: M)
Hand-compiling code
Now that we’ve decided on a simple compute-only GPU architecture, we can try hand-compiling an OpenCL program.
I asked an LLM to produce an N*N matmul example (comments written manually):
// convenient number for our specific hardware
#define TILE_SIZE 8

// this kernel will be launched with dimensions:
//   global[2] = { 128, 128 } = { N, N };
//   local[2]  = { 8, 8 }     = { TILE_SIZE, TILE_SIZE };
__kernel void matmul_tiled(
    __global float* A,
    __global float* B,
    __global float* C,
    const int N)
{
    int row = get_global_id(1);       // y
    int col = get_global_id(0);       // x
    int local_row = get_local_id(1);  // y
    int local_col = get_local_id(0);  // x

    __local float Asub[TILE_SIZE][TILE_SIZE];
    __local float Bsub[TILE_SIZE][TILE_SIZE];
    float sum = 0.0f;

    for (int t = 0; t < N / TILE_SIZE; ++t) {
        // load tiles into local
        int tiledRow = row;
        int tiledCol = t * TILE_SIZE + local_col;
        float av;
        if (tiledRow < N && tiledCol < N) av = A[tiledRow * N + tiledCol];
        else av = 0.0f;
        Asub[local_row][local_col] = av;

        tiledRow = t * TILE_SIZE + local_row;
        tiledCol = col;
        float bv;
        if (tiledRow < N && tiledCol < N) bv = B[tiledRow * N + tiledCol];
        else bv = 0.0f;
        Bsub[local_row][local_col] = bv;

        // sync local access across local grp
        barrier(CLK_LOCAL_MEM_FENCE);

        for (int k = 0; k < TILE_SIZE; ++k)
            sum += Asub[local_row][k] * Bsub[k][local_col];

        // sync local access across local grp
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (row < N && col < N) C[row * N + col] = sum;
}
First, we have to decide on how we want to map the kernel to the hardware.
Since the local dimension of the kernel is 8*8, which is 64 work items, we can map each work group to one CU by mapping 32 work items to each wave and using both waves available on one CU for the work group.
Our global dimension is 128*128, which means we would need 256 compute units to run all work groups at once. But since we probably don’t have 256 compute units, GPUs, including ours, have an on-hardware task scheduler for scheduling tasks onto compute units.
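To get a feel for the numbers, here is a small Python sketch that counts the work groups of this launch and hands them out round-robin to a made-up number of CUs (both the 32 CUs and the round-robin policy are just assumptions for illustration):

GLOBAL = (128, 128)
LOCAL = (8, 8)    # = (TILE_SIZE, TILE_SIZE)

num_groups = (GLOBAL[0] // LOCAL[0]) * (GLOBAL[1] // LOCAL[1])
print(num_groups)  # 256 work groups, one CU-sized task each

NUM_CUS = 32  # assumed hardware size
schedule = {cu: [] for cu in range(NUM_CUS)}
for group_id in range(num_groups):
    schedule[group_id % NUM_CUS].append(group_id)
# Each CU ends up with 256 / 32 = 8 work groups, executed one after another.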
Outro
Modern GPUs are really complex, but designing a simple GPU is not that hard.
Subscribe to the Atom feed to get notified of future articles.
Note that the PDF version of this page might look a bit better styling-wise.
Introduction
If you are any kind of programmer, you’ve probably heard of RegEx.
RegEx (regular expressions) is kind of like a small programming language used to define string search-and-replace patterns.
RegEx might seem overwhelming at first, but you can learn the most important features of RegEx very quickly.
It is important to mention that there is not a single standard for RegEx syntax; instead, each “implementation” has its own quirks and additional features. Most common features, however, behave identically on most RegEx “engines”/implementations.
Syntax
The behavior of RegEx expressions / patterns depends on the match options passed to the RegEx engine.
Common match options:
Anchored at start and end of line
Case insensitive
multi-line matching, or matching against the whole string instead
“Atoms”
In this article, we will refer to single expression parts as “atoms”.
Characters
Just use the character that you want to match, for example a to match an a. This, however, does not work for all characters, because many are part of the special RegEx syntax.
Escaped Characters
The previously mentioned special characters like [ can be matched by putting a backslash in front of them: \[
Character Groups
RegEx engines already define some groups of characters that can make writing RegEx expressions quicker.
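For example, most engines support shorthands like \d (digit), \w (word character) and \s (whitespace). Illustrated here with Python’s re module as one concrete engine:

import re

print(re.findall(r"\d", "a1b22"))   # ['1', '2', '2']  (digits)
print(re.findall(r"\w", "a_1 !"))   # ['a', '_', '1']  (word characters)
print(re.findall(r"\s", "a b\tc"))  # [' ', '\t']      (whitespace)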
Anchors
^ is used to assert the beginning of a line in multi-line mode, or the beginning of the string in whole-string mode.
$ is used to assert the end of a line in multi-line mode, or the end of the string in whole-string mode.
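For example, with Python’s re module (re.MULTILINE enables the multi-line behavior described above):

import re

text = "first line\nsecond line"
# In multi-line mode, ^ anchors at the start of every line...
print(re.findall(r"^\w+", text, re.MULTILINE))  # ['first', 'second']
# ...without it, ^ anchors only at the start of the whole string.
print(re.findall(r"^\w+", text))                # ['first']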
Some combinators will match either “lazily” or “greedily”.
Lazy matching is when the engine only matches as many characters as required to get to the next step. This should almost always be used.
Greedy matching is when the engine tries to match as many characters as possible. The problem with this is that it might cause “backtracking”, which happens when the engine goes back in the pattern multiple times to ensure that as many characters as possible were matched. This can cause big performance issues.
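A quick Python illustration of the difference (the HTML-ish input string is made up):

import re

html = "<b>bold</b> and <i>italic</i>"
# Greedy: .* grabs as much as possible, so the match spans to the last '>'.
print(re.findall(r"<.*>", html))   # ['<b>bold</b> and <i>italic</i>']
# Lazy: .*? stops as early as possible, matching each tag separately.
print(re.findall(r"<.*?>", html))  # ['<b>', '</b>', '<i>', '</i>']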
Combinators
Multiple atoms can be combined together to form more complex patterns.
Chain
When two expressions are next to each other, they will be chained together, which means that both will be evaluated in-order.
Example: x\d matches an x and then a digit, for example x9
Or
Two expressions separated by a | cause the RegEx engine to first try to match the left side, and only if it fails, it tries the right side instead.
Note that “or” has a long left and right scope, which means that ab|cd will match either ab or cd
Or-Not
Tries to match the expression to the left of the ?, but won’t fail if it doesn’t succeed.
Note that “or-not” has a short left scope, which means that ab? will always match a, and then try to match b
Repeated
An expression followed by either a * for greedy repetition, or a *? for lazy repetition.
This matches as many times as possible, but can also match the pattern zero times.
Note that this has a short left scope.
Repeated At Least Once
An expression followed by either a + for greedy repetition, or a +? for lazy repetition.
This matches as many times as possible, and at least one time.
Note that this has a short left scope.
(Non-Capture) Group
Groups multiple expressions together for scoping.
Example: (?:abc) will just match abc
Capture Group
Similar to Non-Capture Groups except that they capture the matched text. This allows the matched text of the inner expression to be extracted later.
Capture group IDs are enumerated from left to right, starting with 1.
Example: (abc)de will match abcde, and store abc in group 1.
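The same example, checked with Python’s re:

import re

m = re.match(r"(abc)de", "abcde")
print(m.group(0))  # 'abcde' (the whole match)
print(m.group(1))  # 'abc'   (capture group 1)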
Character Set
By surrounding multiple characters in square brackets, the engine will match any of them. Special characters or expressions won’t be parsed inside them, which means that this can also be used to escape characters.
For example: [abc] will match either a, b or c.
and [ab(?:c)] will match either a, b, (, ?, :, c, or ).
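Both examples, checked with Python’s re:

import re

print(re.findall(r"[abc]", "cabbage"))    # ['c', 'a', 'b', 'b', 'a']
print(re.findall(r"[ab(?:c)]", "a?c)x"))  # ['a', '?', 'c', ')'] (the set is taken literally)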
Note that the PDF version of this page might look a bit better styling-wise.
Introduction
Function calls have some overhead, which can sometimes be a big obstacle for other optimizations. Because of that, compiler backends (should) inline function calls. There are, however, many issues with just greedily inlining calls…
Greedy inlining with heuristics
This is the most obvious approach. We can just inline all functions that have only one call site, and then inline calls where the called function does not have many instructions.
Example:
function f32 $square(f32 %x) {
@entry:
    // this is stupid, but I couldn't come up with a better example
    f32 %e = add %x, 0
    f32 %out = add %e, %x
    ret %out
}
The second option is to inline the $hypot call into $tri_hypot. (There are also some other options)
Now in this case, it seems obvious to prefer inlining $square into $hypot.
Issue 2: ABI requirements on argument passing
If we assume the target ABI only has one f32 register for passing arguments, then we would have to generate additional instructions for passing the second argument of $hypot, and then it might actually be more efficient to inline $hypot instead of $square.
This example is not realistic, but this issue actually occurs when compiling lots of code.
Another related issue is that having more arguments arranged in a fixed way will require lots of moving data around at the call site.
A solution to this is to make the heuristic depend not just on code size, but also on the number of arguments and outputs passed to the function.
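To make that concrete, here is a toy Python version of such a heuristic; the weights, the spill penalty and the threshold are completely made up, real compilers tune these per target:

def should_inline(callee_instr_count, call_site_count, num_args, num_results,
                  arg_regs_available=1):
    # Cost of keeping the call: argument/result shuffling at every call site,
    # plus extra spill code if the ABI runs out of argument registers.
    spill_penalty = max(0, num_args - arg_regs_available) * 2
    call_cost = (num_args + num_results + spill_penalty) * call_site_count

    # Cost of inlining: the callee body gets duplicated at every call site
    # (minus one copy if the out-of-line version can be dropped afterwards).
    inline_cost = callee_instr_count * max(call_site_count - 1, 0)

    return inline_cost <= call_cost

print(should_inline(callee_instr_count=3, call_site_count=1, num_args=1, num_results=1))   # True
print(should_inline(callee_instr_count=40, call_site_count=8, num_args=1, num_results=1))  # False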
If the target has an efficient hypot operation, then that operation will only be used if we inline $myfunc into $callsite.
This means that inlining is now dependent on… instruction selection??
This is not the only optimization prevented by not inlining the call. If $callsite were to be called in a loop, then not inlining would prevent vectorization.
Function outlining
A related optimization is “outlining”. It’s the opposite of inlining: it moves duplicated code into a separate function, to reduce code size, and sometimes to increase performance (because of instruction caching).
If we do inlining separately from outlining, we often get suboptimal code.
A better approach
We can instead first inline all inlinable calls, and then perform more aggressive outlining.
Step 1: inlining
We inline all function calls, except for:
self recursion (obviously)
functions explicitly marked as no-inline by the user
Step 2: detect duplicate code
There are many algorithms for doing this.
The goal of this step is to both:
maximize size of outlinable section
minimize size of code
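As a very naive illustration of how such detection could work, here is a Python sketch that hashes sliding windows of instructions and reports the windows that occur more than once (real outliners use much smarter algorithms, e.g. suffix-tree based candidate search):

from collections import defaultdict

def find_duplicate_windows(instrs, window=3):
    # Map each window of `window` consecutive instructions to the positions where it occurs.
    seen = defaultdict(list)
    for i in range(len(instrs) - window + 1):
        seen[tuple(instrs[i:i + window])].append(i)
    # Only windows that appear at least twice are candidates for outlining.
    return {w: pos for w, pos in seen.items() if len(pos) > 1}

code = ["load a", "add a 1", "store a", "mul b 2",
        "load a", "add a 1", "store a", "ret"]
print(find_duplicate_windows(code))
# {('load a', 'add a 1', 'store a'): [0, 4]}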
Step 3: slightly reduce size of outlinable section
The goal is to reduce the size of the outlinable sections a bit where that makes the resulting code more optimal.
This should be ABI- and instruction-dependent, and have the goals of:
reducing the argument shuffles required at all call sites
reducing register pressure
not preventing good isel choices and optimizations.
This is also dependent on the targeted code size.
Step 4: perform outlining
This is obvious.
Issue 1: high compile-time memory usage
Inlining all function calls first will increase the memory usage during compilation by A LOT.
I’m sure that there is a smarter way to implement this method, without actually performing the inlining…
Conclusion
Function inlining is much more complex than one might think.
PS: No idea how to implement this…
Subscribe to the Atom feed to get notified about future compiler-related articles.
+
+
+
+
+
+
+
diff --git a/build/compiler-inlining.typ.desktop.html.d b/build/compiler-inlining.typ.desktop.html.d
new file mode 100644
index 0000000..eec6e00
--- /dev/null
+++ b/build/compiler-inlining.typ.desktop.html.d
@@ -0,0 +1 @@
+build/compiler-inlining.typ.desktop.html: common.typ pages/compiler-inlining.typ core-page-style.typ components/header.typ simple-page-layout.typ
diff --git a/build/compiler-inlining.typ.git_rev.txt b/build/compiler-inlining.typ.git_rev.txt
new file mode 100644
index 0000000..2356470
--- /dev/null
+++ b/build/compiler-inlining.typ.git_rev.txt
@@ -0,0 +1 @@
+--input git_rev=9c2913af189b62c028f6f773370f50f9e6c13307 --input git_commit_date="11. August 2025 16:38"
diff --git a/build/compiler-inlining.typ.git_rev.txt.iso b/build/compiler-inlining.typ.git_rev.txt.iso
new file mode 100644
index 0000000..12faf08
--- /dev/null
+++ b/build/compiler-inlining.typ.git_rev.txt.iso
@@ -0,0 +1 @@
+2025-08-11T16:38:10+02:00
diff --git a/build/compiler-inlining.typ.min.html b/build/compiler-inlining.typ.min.html
new file mode 100644
index 0000000..68fc3e3
--- /dev/null
+++ b/build/compiler-inlining.typ.min.html
@@ -0,0 +1,358 @@
+
+
+
+ Automatically inlining functions is not easy
+
+
+
+
+
+
+
+
+
Note that the PDF version of this page might look a bit better styling-wise.
Introduction
Compilers often have to deal with pattern matching and rewriting (find-and-replace) inside the compiler IR (intermediate representation).
Common use cases for pattern matching in compilers:
“peephole optimizations”: the most common kind of optimization in compilers. They find a short sequence of code and replace it with some other code. For example replacing x & (1 << b) with a bit test operation.
finding a sequence of operations for complex optimization passes to operate on: advanced compilers have complex optimizations that can’t really be performed with simple IR operation replacements, and instead require complex logic. Patterns are used here to find operation sequences where those optimizations are applicable, and also to extract details inside that sequence.
code generation: converting the IR to machine code / VM bytecode. A compiler needs to find operations (or sequences of operations) inside the IR, and “replace” them with machine code.
Simplest Approach
Currently, most compilers do most of this directly inside their source code. For example, in MLIR, most (but not all) pattern matches are performed in C++ code.
The only advantage to this approach is that it doesn’t require a complex pattern matching system.
I only recommend doing this for small toy compiler projects.
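To show what “in the source code” means in practice, here is a hedged toy version in Python over an invented tuple IR (real compilers such as MLIR do this in C++ over their own IR classes; none of the names below come from a real compiler). Even this tiny matcher already has to handle commutativity by hand, which hints at the disadvantages listed next.

# An operation is ("op_name", operand0, operand1, ...); leaves are variable names.
def match_bit_test(op):
    """Recognize x & (1 << b) and rewrite it to ("bittest", x, b)."""
    if isinstance(op, tuple) and op[0] == "and":
        _, lhs, rhs = op
        for x, shift in ((lhs, rhs), (rhs, lhs)):          # & is commutative
            if isinstance(shift, tuple) and shift[0] == "shl" and shift[1] == 1:
                return ("bittest", x, shift[2])
    return op

print(match_bit_test(("and", "x", ("shl", 1, "b"))))   # ('bittest', 'x', 'b')
print(match_bit_test(("and", ("shl", 1, "b"), "x")))   # ('bittest', 'x', 'b')
print(match_bit_test(("add", "x", "y")))               # unchanged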
Disadvantages
Doing pattern matching this way has many disadvantages.
Some (but not all):
debugging pattern match rules can be hard
IR rewrites need to be tracked manually (for debugging)
source locations and debug information also need to be tracked manually, which often isn’t implemented very well.
verbose and barely readable pattern matching code
overall error-prone
I myself did pattern matching this way in my old compiler backend, and I speak from experience when I say that this approach sucks (in most cases).
Pattern Matching DSLs
A custom language for describing IR patterns and IR transformations (aka rewrites).
I will put this into the category of “structured pattern matching”.
An example is Cranelift’s ISLE DSL:
;; x ^ x == 0.
(rule (simplify (bxor (ty_int ty) x x))
      (subsume (iconst_u ty 0)))
Fun fact: tinygrad describes its rewrites as (pattern, Python function) pairs, and it actually decompiles the Python code inside the second element of the pair and runs multiple optimization passes on that.
This approach is used by many popular compilers such as LLVM, GCC, and Cranelift for peephole optimizations and code generation.
Advantages
debugging and tracking of rewrites, source locations, and debug information can be done properly
patterns themselves can be inspected and modified programmatically.
they are easier to use and read than manual pattern matching in the source code.
There is however an even better alternative:
Pattern Matching Dialects
I will also put this method into the category of “structured pattern matching”.
The main example of this is MLIR, with the pdl and the transform dialects. Sadly, few projects/people use these dialects and instead do pattern matching in C++ code, probably because the dialects aren’t documented very well.
What are compiler dialects?
Modern compilers, especially multi-level compilers, such as MLIR, have their operations grouped in “dialects”.
Each dialect either represents specific kinds of operations, like arithmetic operations, or a specific backend’s/frontend’s operations, such as the llvm, emitc, and the spirv dialects in MLIR.
Dialects commonly contain operations, data types, as well as optimization and dialect conversion passes.
Core Concept
The IR patterns and transformations are represented using the compiler’s IR. This is mostly done in a separate dialect, with dedicated operations for operating on IR.
Examples
MLIR’s pdl dialect can be used to replace arith.addi with my.add like this:
the pattern matching infrastructure can optimize its own patterns: The compiler can operate on patterns and rewrite rules as if they were normal operations. This removes the need for special infrastructure regarding pattern matching DSLs.
the compiler could AOT compile patterns
the compiler could optimize, analyze, and combine patterns to reduce compile time.
IR (de-)serialization infrastructure in the compiler can also be used to exchange peephole optimizations.
bragging rights: your compiler represents its patterns in its own IR
Combining with a DSL
I recommend having a pattern matching / rewrite DSL that transpiles to pattern matching / rewrite dialect operations.
The advantage of this over just having a rewrite dialect is that it makes patterns even more readable (and maintainable!)
E-Graphs
E-Graphs are magical data structures that can be used to efficiently encode all possible transformations, and then select the best one.
Even though E-Graphs solve most problems, I still recommend using a pattern matching dialect, especially in multi-level compilers, to be more flexible and have more future-proof pattern matching, or in case you decide that you want to match some complex patterns manually.
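For readers who haven't seen them, here is a very small sketch of the core e-graph idea (hashconsing plus union-find); congruence closure, rebuilding, and cost-based extraction are left out, and libraries like egg implement the full machinery. Everything below is illustrative, not any particular compiler's code.

class EGraph:
    def __init__(self):
        self.parent = []        # union-find over e-class ids
        self.hashcons = {}      # canonical e-node -> e-class id

    def find(self, c):
        while self.parent[c] != c:
            self.parent[c] = self.parent[self.parent[c]]   # path halving
            c = self.parent[c]
        return c

    def add(self, op, *children):
        node = (op,) + tuple(self.find(c) for c in children)
        if node in self.hashcons:
            return self.find(self.hashcons[node])
        cid = len(self.parent)
        self.parent.append(cid)
        self.hashcons[node] = cid
        return cid

    def union(self, a, b):
        a, b = self.find(a), self.find(b)
        if a != b:
            self.parent[b] = a
        return a

eg = EGraph()
x = eg.add("x")
mul = eg.add("mul", x, eg.add("2"))   # x * 2
shl = eg.add("shl", x, eg.add("1"))   # x << 1
eg.union(mul, shl)                    # record that both forms are equivalent
print(eg.find(mul) == eg.find(shl))   # True: one e-class, two representations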
More Advantages of Structured Pattern Matching
Smart Pattern Matchers
Instead of brute-forcing all peephole optimizations (of which there can be a LOT in advanced compilers), the compiler can organize all the patterns to provide more efficient matching. I haven’t investigated how to do this yet. If you have any ideas regarding this, please contact me.
There are other ways to speed up the pattern matching and rewrite process using this too.
Reversible Transformations
I don’t think that there currently is any compiler that does this. If you do know one, again, please contact me.
Optimizing compilers typically deal with code (mostly written by people) that is on a lower level than the compiler theoretically supports. For example, humans tend to write code like this for extracting a bit: x & (1 << b), but compilers tend to have a high-level bit test operation (with exceptions). A reason for having higher-level primitives is that it allows the compiler to do more high-level optimizations, but also some target architectures have a bit test operation that is more efficient.
This is not just the case for “low-level” things like bit tests, but also for high-level concepts, like a reduction over an array, or even the implementation of a whole algorithm. For example, LLVM has recently gained the ability to detect implementations of CRC.
LLVM actually doesn’t have many dedicated operations like a bit-test operation, and instead canonicalizes all bit-test patterns to x & (1 << b) != 0, and matches for that in compiler passes that expect bit test operations.
Now let’s go back to the x & (1 << b) (bit test) example. Optimizing compilers should be able to detect that, and other bit test patterns (like x & (1 << b) > 0), and then replace those with a bit-test operation. But they also have to be able to convert bit-test operations back to their implementation for compilation targets that don’t have a bit-test instruction. Currently, compiler backends do this by having separate patterns for converting bit-test to its dedicated operation, and back.
A better solution is to associate a set of implementations with the bit test operation, and make the compiler automatically reverse those to generate the best implementation (in the instruction selector for example).
Implementing pattern/transformation reversal can be challenging, but it provides many benefits, and all “big” compilers should definitely do this, in my opinion.
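A sketch of what such a reversible rule could look like: a single declarative pattern pair that is applied left-to-right to raise x & (1 << b) into a bit-test operation, and right-to-left to lower it again for targets without a bit-test instruction. The tiny tuple IR and the ?-prefixed pattern variables are made up for illustration; this is not how any existing compiler spells it.

BIT_TEST_RULE = (("and", "?x", ("shl", 1, "?b")),   # implementation form
                 ("bittest", "?x", "?b"))           # high-level form

def matches(pattern, term, env):
    if isinstance(pattern, str) and pattern.startswith("?"):    # pattern variable
        env[pattern] = term
        return True
    if isinstance(pattern, tuple) and isinstance(term, tuple) and len(pattern) == len(term):
        return all(matches(p, t, env) for p, t in zip(pattern, term))
    return pattern == term

def substitute(pattern, env):
    if isinstance(pattern, str) and pattern in env:
        return env[pattern]
    if isinstance(pattern, tuple):
        return tuple(substitute(p, env) for p in pattern)
    return pattern

def rewrite(term, rule, forward=True):
    src, dst = rule if forward else (rule[1], rule[0])
    env = {}
    return substitute(dst, env) if matches(src, term, env) else term

low = ("and", "q", ("shl", 1, "n"))
high = rewrite(low, BIT_TEST_RULE, forward=True)    # raise for high-level optimizations
back = rewrite(high, BIT_TEST_RULE, forward=False)  # lower for targets without bit test
print(high, back)  # ('bittest', 'q', 'n') ('and', 'q', ('shl', 1, 'n'))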
Runtime Library
Compilers typically come with a runtime library that implements more complex operations that aren’t supported by most processors or architectures.
The implementation of those functions should also use that pattern matching dialect. This allows your backend to detect code written by users with a similar implementation as in the runtime library, giving you some additional optimizations for free.
I don’t think any compiler currently does this either.
Problems with Pattern Matching
The main problem is ordering the patterns.
As an example, consider these three patterns:
;; A (add x:Const y) => (add y x)
;; B (sub (add x y:Const) z:Const) => (lea x y (const_neg z))
;; C (add x 1) => (inc x)
Now what should the compiler do when it sees this:
(sub (add 5 1) 2)
All three patterns would match:
;; apply A
(sub (add 5 1) 2) => (sub (add 1 5) 2)
;; only B applies now
(sub (add 1 5) 2) => (lea 1 5 (const_neg 2))
;; nothing applies anymore
Now which of those transformations should be performed?
This is not as easy to solve as it seems, especially in the context of instruction selection (specifically scheduling), where performance on processors depends on the sequence of instructions rather than on each instruction in isolation.
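To make the ordering problem tangible, here is a toy comparison of the rewrite choices from the example above under an invented per-operation cost table. The point is not the winner, but that the answer depends entirely on a cost model that a purely local pattern matcher does not have.

COST = {"add": 1, "sub": 1, "inc": 1, "lea": 1, "const_neg": 0}

def cost(term):
    if not isinstance(term, tuple):
        return 0                      # constants / leaves are free in this toy model
    return COST[term[0]] + sum(cost(t) for t in term[1:])

candidates = {
    "keep as is":     ("sub", ("add", 5, 1), 2),
    "apply C":        ("sub", ("inc", 5), 2),
    "apply A then B": ("lea", 1, 5, ("const_neg", 2)),
}
for name, term in candidates.items():
    print(name, cost(term))
# Under this table the lea form wins, but a different table (or a port/latency
# model) could easily flip the answer.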
Superscalar CPUs
Modern processor architecture features like superscalar execution make this even more complicated.
As a simple, unrealistic example, let’s imagine a CPU (core) that has one bit-operations execution unit and two ALU execution units / ports. This means that the CPU can execute two instructions in the ALU units and one instruction in the bit-ops unit at the same time.
One might think that always optimizing a & (1 << b) to a bit test operation is good for performance. But in this example, that is not the case.
If we have a function that does a lot of bitwise operations next to each other, and the compiler replaces all bit tests with bit-test operations, then suddenly all of those operations compete for the single bit-ops unit, which means that instead of executing 3 instructions at a time (ignoring pipelining), the CPU can only execute one instruction at a time.
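A back-of-the-envelope model of this hypothetical CPU (one bit-ops unit, two ALU ports), ignoring dependencies and pipelining; all numbers are purely illustrative. A bit test is either one bittest instruction on the bit-ops unit or an and+shl pair on the ALUs.

def cycles(num_tests, as_bittest):
    bit_uops = as_bittest                      # one uop per test on the bit-ops unit
    alu_uops = (num_tests - as_bittest) * 2    # and + shl per remaining test
    return max(bit_uops / 1, alu_uops / 2)     # whichever unit is the bottleneck

for as_bittest in (0, 8, 4):
    print(as_bittest, "as bittest ->", cycles(8, as_bittest), "cycles")
# 0 -> 8.0, 8 -> 8.0, 4 -> 4.0: a mix beats "always use the bit-test op".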
This shows that we won’t know if an optimization is actually good until we are at a late point in the compilation process, where we can simulate the CPU’s instruction scheduling.
This does not only apply to instruction selection, but also to higher-level optimizations, such as loop and control-flow related optimizations.
Conclusion
One can see how pattern matching dialects are the best option for approaching pattern matching.
Someone wanted me to insert a takeaway here, but I won’t.
PS: I’ll hunt down everyone who still decides to do pattern matching in their compiler source after reading this article.
diff --git a/build/compiler-pattern-matching.typ.min.html.d b/build/compiler-pattern-matching.typ.min.html.d
new file mode 100644
index 0000000..74176f8
--- /dev/null
+++ b/build/compiler-pattern-matching.typ.min.html.d
@@ -0,0 +1 @@
+build/compiler-pattern-matching.typ.min.html: common.typ pages/compiler-pattern-matching.typ core-page-style.typ components/header.typ simple-page-layout.typ
diff --git a/build/compiler-pattern-matching.typ.min.pdf b/build/compiler-pattern-matching.typ.min.pdf
new file mode 100644
index 0000000..7208fa1
Binary files /dev/null and b/build/compiler-pattern-matching.typ.min.pdf differ
diff --git a/build/compiler-pattern-matching.typ.min.pdf.d b/build/compiler-pattern-matching.typ.min.pdf.d
new file mode 100644
index 0000000..da24db9
--- /dev/null
+++ b/build/compiler-pattern-matching.typ.min.pdf.d
@@ -0,0 +1 @@
+build/compiler-pattern-matching.typ.min.pdf: common.typ pages/compiler-pattern-matching.typ core-page-style.typ components/header.typ simple-page-layout.typ
diff --git a/build/compiler-pattern-matching.typ.nano.html b/build/compiler-pattern-matching.typ.nano.html
new file mode 100644
index 0000000..76511c3
--- /dev/null
+++ b/build/compiler-pattern-matching.typ.nano.html
@@ -0,0 +1,347 @@
+
+
+
+ Approaches to Compiler Pattern Matching
+
+
+
+
+
+
Approaches to pattern matching in compilers
+
Last modified: 19. August 2025 09:55 (Git #34fd6adb)
diff --git a/build/compiler-pattern-matching.typ.nano.html.d b/build/compiler-pattern-matching.typ.nano.html.d
new file mode 100644
index 0000000..2f89232
--- /dev/null
+++ b/build/compiler-pattern-matching.typ.nano.html.d
@@ -0,0 +1 @@
+build/compiler-pattern-matching.typ.nano.html: common.typ pages/compiler-pattern-matching.typ core-page-style.typ components/header.typ simple-page-layout.typ
diff --git a/build/deploy/article-favicon.typ.desktop.html b/build/deploy/article-favicon.typ.desktop.html
new file mode 100644
index 0000000..dec1507
--- /dev/null
+++ b/build/deploy/article-favicon.typ.desktop.html
@@ -0,0 +1 @@
+The making of the favicon
\ No newline at end of file
diff --git a/build/deploy/article-favicon.typ.min.html b/build/deploy/article-favicon.typ.min.html
new file mode 100644
index 0000000..3d026b1
--- /dev/null
+++ b/build/deploy/article-favicon.typ.min.html
@@ -0,0 +1 @@
+The making of the favicon
\ No newline at end of file
diff --git a/build/deploy/article-favicon.typ.min.pdf b/build/deploy/article-favicon.typ.min.pdf
new file mode 100644
index 0000000..2bcb75b
Binary files /dev/null and b/build/deploy/article-favicon.typ.min.pdf differ
diff --git a/build/deploy/article-favicon.typ.nano.html b/build/deploy/article-favicon.typ.nano.html
new file mode 100644
index 0000000..694c040
--- /dev/null
+++ b/build/deploy/article-favicon.typ.nano.html
@@ -0,0 +1 @@
+The making of the favicon
The making of the favicon
Last modified: 26. July 2025 15:04 (Git #c80eb6ef)
\ No newline at end of file
diff --git a/build/deploy/article-gpu-arch-1.typ.desktop.html b/build/deploy/article-gpu-arch-1.typ.desktop.html
new file mode 100644
index 0000000..db79ef2
--- /dev/null
+++ b/build/deploy/article-gpu-arch-1.typ.desktop.html
@@ -0,0 +1 @@
+Designing a GPU architecture: Waves
Note that the PDF version of this page might look a bit better styling-wise.
Introduction
In this article, we’ll be looking into the hardware of GPUs, and then designing our own. Specifically, GPUs with a unified shader architecture.
Comparison with CPUs
GPUs focus on operating on a lot of data at once (triangles, vertices, pixels, …), while CPUs focus on high single-core performance and low latency.
GPU Architecture
GPUs consist of multiple (these days at least 32) compute units (CUs).
Each compute unit has multiple SIMD units, also called a “wave”, “wavefront” or “warp”. Compute units also have some fast local memory (tens of kilobytes), main-memory access queues, texture units, a scalar unit, and other features.
The main memory (graphics memory) typically sits outside of the GPU and is slow but high-bandwidth.
Waves
A wave is a SIMD processing unit consisting of typically 32 “lanes” (sometimes called threads).
Each wave in a CU has its own control flow, and the waves don’t have to be related to each other.
Instructions that waves support:
arithmetic operations
cross-lane data movement
CU local and global memory access: each SIMD lane can access a completely different address. similar to CPU gather / scatter.
synchronization with other CUs in the work group (see future article)
Since only the whole wave can do control flow, and not each lane, all operations can be masked so that they only apply to specific lanes.
=> waves are really similar to SIMD on modern CPUs
In modern GPUs, instruction execution in waves is superscalar, so there are multiple different execution units for executing different kinds of instructions, and multiple instructions can be executed at once, if there are free execution units, and they don’t depend on each other.
We’ll be exploring that in a future article.
Local memory
The local memory inside GPUs is banked, typically into 32 banks. The memory word size is typically 32 bits.
The addresses are interleaved, so for two banks:
addr 0 => bank 0
addr 1 => bank 1
addr 2 => bank 0
addr 3 => bank 1
…
Each bank has a dedicated access port, so for 32 banks, you get 32 access ports.
The lanes of the waves inside a CU get routed to the local memory banks magically.
Why are the banks interleaved?
When the whole wave wants to read a contiguous array of f32, so when each lane performs some_f32_array[lane_id()], all 32 banks can be used at the same time.
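A small, purely illustrative sketch of this interleaving (32 banks, 32-bit words): a contiguous per-lane access touches every bank exactly once, while a stride-32 access serializes on a single bank.

NUM_BANKS, WORD_BYTES = 32, 4

def bank(byte_addr):
    return (byte_addr // WORD_BYTES) % NUM_BANKS

def conflicts(addresses):
    """Worst-case serialization: how many lanes target the most-used bank."""
    per_bank = [0] * NUM_BANKS
    for a in addresses:
        per_bank[bank(a)] += 1
    return max(per_bank)

contiguous = [lane * 4 for lane in range(32)]        # some_f32_array[lane_id()]
strided    = [lane * 32 * 4 for lane in range(32)]   # some_f32_array[lane_id() * 32]
print(conflicts(contiguous))  # 1  -> all 32 banks used in parallel
print(conflicts(strided))     # 32 -> every lane hits bank 0, fully serialized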
Why multiple waves share the same local memory
A wave doesn’t access memory on every instruction; it also spends cycles on computation. This means that there are cycles where the memory isn’t doing anything.
By making multiple waves share the same local memory and access ports, you save resources.
Global memory
Since global memory reads/writes are really slow, they happen asynchronously.
This means that a wave requests an access, then can continue executing, and then eventually waits for that access to finish.
Because of this, modern compilers automagically start the access before the data is needed, and then wait for the data later on.
Scalar unit
Most newer GPUs also have a scalar unit for saving energy when performing simple operations.
When the controller sees a scalar instruction in the code running on a wave, it automatically runs that instruction on the scalar unit.
The scalar unit can be used for:
address calculation
partial reductions
execution of expensive operations not implemented on SIMD because of costs
GPU Programming Terminology
“work item”: typically maps to a SIMD lane
“kernel”: the code for a work item
“work group”: consists of multiple work items. typically maps to a CU. the __local memory in OpenCL applies to this.
“compute task”: a set of work groups
OpenCL and other APIs let you specify both the number of work groups and work items.
Since a program might specify a higher number of work items per work group than we have available, the compiler needs to be able to put multiple work items onto one SIMD lane.
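As a rough illustration (assuming 32 lanes per wave and 2 waves per CU, which matches the spec chosen below), that mapping boils down to a ceiling division:

LANES_PER_WAVE, WAVES_PER_CU = 32, 2
lanes_per_cu = LANES_PER_WAVE * WAVES_PER_CU         # 64

def items_per_lane(work_group_size):
    # ceiling division: a group of 256 work items on 64 lanes means 4 items per lane
    return -(-work_group_size // lanes_per_cu)

for group in (64, 100, 256):
    print(group, "->", items_per_lane(group), "work item(s) per lane")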
Our own architecture
We’ll go with these specs for now:
N compute units
2 waves per CU
32 lanes per wave.
1KiB local memory per lane => 64 KiB
48 vector registers of 16x32b per wave
one scalar unit per CU
128 global memory ports
16 async task completion “signal” slots per wave
no fancy out of order or superscalar execution
support standard 32 bit floating point, without exceptions.
Note that we won’t specify the exact instruction encoding.
Predefined Constants
We will pre-define 16 constants (as virtual vector registers):
zero
one
sid: 0,1,2,3,4,5,6
wave: the ID of the wave in the compute task, broadcasted to all elements.
u8_max: 255,255,…
n2nd: 1,2,1,2,…
n3rd: 1,2,4,1,…
n4th: 1,2,4,8,1,…
lo16: 1,1,1,… (x16) 0,0,0,… (x16)
ch2: 1,1,0,0,1,1,…
ch4: 1,1,1,1,0,0,0,0,1,…
alo8: 1 (x8) 0 (x8) 1 (x8) 0 (x8)
a few reserved ones
Operands
We define the following instruction operands:
Vreg: vector register
M: (read only) vector gp reg as mask (1b). only first 32 registers can be used as mask. the operand consists of two masks and-ed together, each of which can conditionally be inverted first. this means that this operand takes up 12 bits
Vany: Vreg or M
Simm: immediate scalar value
Sreg: the first element of a vector register, as scalar
Sany: a Simm or an Sreg
dist: Vany, or a Sany broadcasted to each element
sig: one of the 16 completion signal slots
Instructions
We will add more instructions in future articles.
Data Movement
fn mov(out out: Vreg, in wrmask: M, in val: dist)
fn select(out out: Vreg, in select: M, in false: dist, in true: dist)
fn first_where_true(out out: Sreg, in where: M, in values: dist): if none of the elements are true, it doesn’t overwrite the previous value in out.
cross-lane operations: not important for this article
Mathematics
simple (unmasked) u32, i32, and f32 elementwise arithmetic and logic operations: fn add<u32>(out out: Vreg, in left: Vany, in right: dist)
scalar arithmetic and logic operations: fn add<u32>(out out: Sreg, in left: Sany, in right: Sany)
partial reduction operations: “chunks” the input with a size of 8, reduces each chunk, and stores it in the first element of the chunk. this means that every 8th element will contain a partial result.
and operations to finish that reduction into the first element of the vector
Local memory
load 32 bit value at each elem where mask is true: fn local_load32(out out: Vreg, in mask: M, in addr: Vreg)
store 32 bit value at each elem where mask is true: fn local_store32(in addr: Vreg, in mask: M, in val: Vany)
Global (async) memory
start an async global load, and make the given signal correspond to the completion of the access: load 32 bit value at each elem where mask is true: fn global_load32(out sig: sig, out out: Vreg, in mask: M, in addr: Vreg)
see above and local_store32: fn global_store32(out sig: sig, in addr: Vreg, in mask: M, in val: Vany)
fn sig_done1(out r: Sreg, in sig: sig)
fn sig_done2(out r: Sreg, in sig1: sig, in sig2: sig)
fn sig_wait(out r: Sreg, in sig: sig)
fn sig_waitall2(out r: Sreg, in sig1: sig, in sig2: sig)
fn sig_waitall3(out r: Sreg, in sig1: sig, in sig2: sig, in sig3: sig)
fn sig_waitall4(out r: Sreg, in sig1: sig, in sig2: sig, in sig3: sig, in sig4: sig)
As a future extension, we could add an instruction that waits for any of the given signals to complete, and then jumps to a specific location, depending on which of those completed.
Control flow (whole wave)
branch if scalar is zero: fn brz(in dest: Simm, in val: Sany)
branch if scalar is not zero: fn brnz(in dest: Simm, in val: Sany)
branch on the whole wave if each element has a true value for the mask: fn br_all(in dest: Simm, in cond: M)
branch on the whole wave if any element has a true value for the mask: fn br_any(in dest: Simm, in cond: M)
Hand-compiling code
Now that we decided on a simple compute-only GPU architecture, we can try hand-compiling an OpenCL program.
I asked an LLM to produce an N*N matmul example (comments written manually):
// convenient number for our specific hardware
#define TILE_SIZE 8

// this kernel will be launched with dimensions:
// global[2] = { 128, 128 } = { N, N };
// local[2]  = { 8, 8 }     = { TILE_SIZE, TILE_SIZE };
__kernel void matmul_tiled(
        __global float* A,
        __global float* B,
        __global float* C,
        const int N)
{
    int row = get_global_id(1);       // y
    int col = get_global_id(0);       // x
    int local_row = get_local_id(1);  // y
    int local_col = get_local_id(0);  // x

    // the tile buffers and the accumulator were missing from the listing as
    // copied here; these declarations are the obvious reconstruction
    __local float Asub[TILE_SIZE][TILE_SIZE];
    __local float Bsub[TILE_SIZE][TILE_SIZE];
    float sum = 0.0f;

    for (int t = 0; t < N / TILE_SIZE; ++t) {
        // load tiles into local memory
        int tiledRow = row;
        int tiledCol = t * TILE_SIZE + local_col;
        float av;
        if (tiledRow < N && tiledCol < N) av = A[tiledRow * N + tiledCol];
        else av = 0.0f;
        Asub[local_row][local_col] = av;

        tiledRow = t * TILE_SIZE + local_row;
        tiledCol = col;
        float bv;
        if (tiledRow < N && tiledCol < N) bv = B[tiledRow * N + tiledCol];
        else bv = 0.0f;
        Bsub[local_row][local_col] = bv;

        // sync local access across the local group
        barrier(CLK_LOCAL_MEM_FENCE);

        for (int k = 0; k < TILE_SIZE; ++k)
            sum += Asub[local_row][k] * Bsub[k][local_col];

        // sync local access across the local group
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (row < N && col < N) C[row * N + col] = sum;
}
First, we have to decide on how we want to map the kernel to the hardware.
Since the local dimension of the kernel is 8*8, which is 64 work items, we can map each local group to one CU by mapping 32 work items to one wave and using both waves available on one CU for the group.
Our global dimension is 128*128, which means that we would need 256 compute units. But since we probably don’t have 256 compute units, GPUs, including ours, will have an on-hardware task scheduler for scheduling tasks onto compute units.
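A quick sanity check of that mapping, using the launch dimensions from the kernel (global 128x128, local 8x8) and our spec of 2 waves of 32 lanes per CU; the arithmetic is purely illustrative.

GLOBAL, LOCAL = (128, 128), (8, 8)
LANES_PER_WAVE, WAVES_PER_CU = 32, 2

work_groups = (GLOBAL[0] // LOCAL[0]) * (GLOBAL[1] // LOCAL[1])   # 16 * 16 = 256
items_per_group = LOCAL[0] * LOCAL[1]                             # 64
waves_per_group = items_per_group // LANES_PER_WAVE               # 2 -> exactly one CU
print(work_groups, "work groups,", waves_per_group, "waves each")
# With fewer than 256 CUs, the hardware scheduler streams the 256 groups onto
# whatever CUs become free.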
Outro
Modern GPUs are really complex, but designing a simple GPU is not that hard either.
Subscribe to the Atom feed to get notified of future articles.
\ No newline at end of file
diff --git a/build/deploy/article-gpu-arch-1.typ.min.html b/build/deploy/article-gpu-arch-1.typ.min.html
new file mode 100644
index 0000000..20f1ab6
--- /dev/null
+++ b/build/deploy/article-gpu-arch-1.typ.min.html
@@ -0,0 +1 @@
+Designing a GPU architecture: Waves
In this article, we’ll be looking into the hardware of GPUs, and then designing our own. Specifically GPUs with unified shader architecture.
Comparison with CPUs
GPUs focus on operating on a lot of data at once (triangles, vertices, pixels, …), while CPUs focus on high performance on a single core, and low compute delay.
GPU Architecture
GPUs consists of multiple (these days at least 32) compute units (= CU).
Each compute unit has multiple SIMD units, also called “wave”, “wavefront” or “warp”. Compute units also have some fast local memory (tens of kilobytes), main memory access queues, texture units, a scalar unit, and other features. Subscribe to the Atom feed
to get notified of future articles.
The main memory (graphics memory) is typically outside of the GPU, and is slow, but high-bandwidth memory.
Waves
A wave is a SIMD processing unit consisting of typically 32 “lanes” (sometimes called threads).
Each wave in a CU has separate control flow, and doesn’t have to be related.
Instructions that waves support:
arithmetic operations
cross-lane data movement
CU local and global memory access: each SIMD lane can access a completely different address. similar to CPU gather / scatter.
synchronization with other CUs in the work group (see future article)
Since only the whole wave can do control flow, and not each lane, all operations can be masked so that they only apply to specific lanes.
=> waves are really similar to SIMD on modern CPUs
In modern GPUs, instruction execution in waves is superscalar, so there are multiple different execution units for executing different kinds of instructions, and multiple instructions can be executed at once, if there are free execution units, and they don’t depend on each other.
We’ll be exploring that in a future article.
Local memory
The local memory inside GPUs is banked, typically into 32 banks. The memory word size is typically 32 bits.
The addresses are interleaved, so for two banks:
addr 0 => bank 0
addr 1 => bank 1
addr 2 => bank 0
addr 3 => bank 1
…
Each bank has a dedicated access port, so for 32 banks, you get 32 access ports.
The lanes of the waves inside a CU get routed to the local memory banks magically.
Why are the banks interleaved?
When the whole wave wants to read a contiguous array of f32, i.e. when each lane performs some_f32_array[lane_id()], all 32 banks can be used at the same time.
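As a small illustration (plain C, assuming word-granular interleaving across 32 banks of 32-bit words), the bank an address falls into is just its word index modulo the number of banks, so consecutive lanes hit consecutive banks:

    #include <stdio.h>

    #define NUM_BANKS 32
    #define WORD_SIZE 4  // bytes per 32-bit memory word

    // bank that a byte address maps to with interleaved banking
    static int bank_of(unsigned addr) { return (addr / WORD_SIZE) % NUM_BANKS; }

    int main(void) {
        // each lane reads some_f32_array[lane_id()]: lane i -> word i -> bank i
        for (int lane = 0; lane < 32; ++lane)
            printf("lane %2d -> bank %2d\n", lane, bank_of(lane * WORD_SIZE));
        return 0;
    }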
Why multiple waves share the same local memory
A wave doesn’t do memory accesses every instruction, but also does computations. This means that there are cycles where the memory isn’t doing anything.
By making multiple waves share the same local memory and access ports, you save resources.
Global memory
Since global memory reads/writes are really slow, they happen asynchronously.
This means that a wave requests an access, then can continue executing, and then eventually waits for that access to finish.
Because of this, modern compilers automagically start the access before the data is needed, and then wait for the data later on.
Scalar unit
Most newer GPUs also have a scalar unit for saving energy when performing simple operations.
When the controller sees a scalar instruction in the code running on a wave, it automatically makes the code run on the scalar unit.
The scalar unit can be used for:
address calculation
partial reductions
execution of expensive operations not implemented on SIMD because of costs
GPU Programming Terminology
“work item”: typically maps to a SIMD lane
“kernel”: the code for a work item
“work group”: consists of multiple work items. typically maps to a CU. the __local memory in OpenCL applies to this.
“compute task”: a set of work groups
OpenCL and other APIs let you specify both the number of work groups and work items.
Since a program might specify a higher number of work items per work group than we have available, the compiler needs to be able to put multiple work items onto one SIMD lane.
Our own architecture
We’ll go with these specs for now:
N compute units
2 waves per CU
32 lanes per wave.
1KiB local memory per lane => 64 KiB
48 vector registers of 16x32b per wave
one scalar unit per CU
128 global memory ports
16 async task completion “signal” slots per wave
no fancy out of order or superscalar execution
support standard 32 bit floating point, without exceptions.
Note that we won’t specify the exact instruction encoding.
Predefined Constants
We will pre-define 16 constants (as virtual vector registers):
zero
one
sid: 0,1,2,3,4,5,6
wave: the ID of the wave in the compute task, broadcasted to all elements.
u8_max: 255,255,…
n2nd: 1,2,1,2,…
n3rd: 1,2,4,1,…
n4th: 1,2,4,8,1,…
lo16: 1,1,1,… (x16) 0,0,0,… (x16)
ch2: 1,1,0,0,1,1,…
ch4: 1,1,1,1,0,0,0,0,1,…
alo8: 1 (x8) 0 (x8) 1 (x8) 0 (x8)
a few reserved ones
Operands
We define the following instruction operands:
Vreg: vector register
M: (read only) vector gp reg as mask (1b). only first 32 registers can be used as mask. the operand consists of two masks and-ed together, each of which can conditionally be inverted first. this means that this operand takes up 12 bits
Vany: Vreg or M
Simm: immediate scalar value
Sreg: the first element of a vector register, as scalar
Sany: a Simm or an Sreg
dist: Vany, or a Sany broadcasted to each element
sig: one of the 16 completion signal slots
Instructions
We will add more instructions in future articles.
Data Movement
fn mov(out out: Vreg, in wrmask: M, in val: dist)
fn select(out out: Vreg, in select: M, in false: dist, in true: dist)
fn first_where_true(out out: Sreg, in where: M, in values: dist): if none of the elements are true, it doesn’t overwrite the previous value in out.
cross-lane operations: not important for this article
Mathematics
simple (unmasked) u32, i32, and f32 elementwise arithmetic and logic operations: fn add<u32>(out out: Vreg, in left: Vany, in right: dist)
scalar arithmetic and logic operations: fn add<u32>(out out: Sreg, in left: Sany, in right: Sany)
partial reduction operations: “chunks” the input with a size of 8, reduces each chunk, and stores it in the first element of the chunk. this means that every 8th element will contain a partial result.
and operations to finish that reduction into the first element of the vector
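To make the semantics of the 8-wide partial reduction concrete, here is a scalar C sketch of what it computes for a 32-element vector, assuming add as the reduction operation:

    // v has 32 elements (one per lane); afterwards v[0], v[8], v[16] and v[24]
    // hold the sums of their 8-element chunks.
    void partial_reduce_add8(float v[32]) {
        for (int chunk = 0; chunk < 32; chunk += 8) {
            float acc = v[chunk];
            for (int i = 1; i < 8; ++i)
                acc += v[chunk + i];
            v[chunk] = acc;
        }
    }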
Local memory
load 32 bit value at each elem where mask is true: fn local_load32(out out: Vreg, in mask: M, in addr: Vreg)
store 32 bit value at each elem where mask is true: fn local_store32(in addr: Vreg, in mask: M, in val: Vany)
Global (async) memory
start an async global load, and make the given signal correspond to the completion of the access: load 32 bit value at each elem where mask is true: fn global_load32(out sig: sig, out out: Vreg, in mask: M, in addr: Vreg)
see above and local_store32: fn global_store32(out sig: sig, in addr: Vreg, in mask: M, in val: Vany)
fn sig_done1(out r: Sreg, in sig: sig)
fn sig_done2(out r: Sreg, in sig1: sig, in sig2: sig)
fn sig_wait(out r: Sreg, in sig: sig)
fn sig_waitall2(out r: Sreg, in sig1: sig, in sig2: sig)
fn sig_waitall3(out r: Sreg, in sig1: sig, in sig2: sig, in sig3: sig)
fn sig_waitall4(out r: Sreg, in sig1: sig, in sig2: sig, in sig3: sig, in sig4: sig)
As a future extension, we could add an instruction that waits for any of the given signals to complete, and then jumps to a specific location, depending on which of those completed.
Control flow (whole wave)
branch if scalar is zero: fn brz(in dest: Simm, in val: Sany)
branch if scalar is not zero: fn brnz(in dest: Simm, in val: Sany)
branch on the whole wave if each element has a true value for the mask: fn br_all(in dest: Simm, in cond: M)
branch on the whole wave if any element has a true value for the mask: fn br_any(in dest: Simm, in cond: M)
Hand-compiling code
Now that we decided on a simple compute-only GPU architecture, we can try hand-compiling an OpenCL program.
I asked an LLM to produce an N*N matmul example (comments written manually):
// convenient number for our specific hardware
#define TILE_SIZE 8

// this kernel will be launched with dimensions:
// global[2] = { 128, 128 } = { N, N };
// local[2]  = { 8, 8 }     = { TILE_SIZE, TILE_SIZE };
__kernel void matmul_tiled(__global float* A, __global float* B,
                           __global float* C, const int N)
{
    int row = get_global_id(1);       // y
    int col = get_global_id(0);       // x
    int local_row = get_local_id(1);  // y
    int local_col = get_local_id(0);  // x

    // tile buffers in fast per-CU local memory
    __local float Asub[TILE_SIZE][TILE_SIZE];
    __local float Bsub[TILE_SIZE][TILE_SIZE];
    float sum = 0.0f;

    for (int t = 0; t < N / TILE_SIZE; ++t) {
        // load tiles into local
        int tiledRow = row;
        int tiledCol = t * TILE_SIZE + local_col;
        float av;
        if (tiledRow < N && tiledCol < N) av = A[tiledRow * N + tiledCol]; else av = 0.0f;
        Asub[local_row][local_col] = av;

        tiledRow = t * TILE_SIZE + local_row;
        tiledCol = col;
        float bv;
        if (tiledRow < N && tiledCol < N) bv = B[tiledRow * N + tiledCol]; else bv = 0.0f;
        Bsub[local_row][local_col] = bv;

        // sync local access across local grp
        barrier(CLK_LOCAL_MEM_FENCE);

        for (int k = 0; k < TILE_SIZE; ++k)
            sum += Asub[local_row][k] * Bsub[k][local_col];

        // sync local access across local grp
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (row < N && col < N) C[row * N + col] = sum;
}
First, we have to decide on how we want to map the kernel to the hardware.
Since the local dimension of the kernel is 8*8, which is 64 work items, we can map each local group to one CU by mapping 32 work items to one wave and using both waves available on the CU for the local group.
Our global dimension is 128*128, which means that we would need 256 compute units to run all work groups at once. Since we probably don’t have 256 compute units, GPUs, including ours, have an on-hardware task scheduler for scheduling work groups onto compute units.
Outro
Modern GPUs are really complex, but designing a simple GPU is not that hard either.
\ No newline at end of file
diff --git a/build/deploy/article-gpu-arch-1.typ.min.pdf b/build/deploy/article-gpu-arch-1.typ.min.pdf
new file mode 100644
index 0000000..96a9f05
Binary files /dev/null and b/build/deploy/article-gpu-arch-1.typ.min.pdf differ
diff --git a/build/deploy/article-gpu-arch-1.typ.nano.html b/build/deploy/article-gpu-arch-1.typ.nano.html
new file mode 100644
index 0000000..8620f59
--- /dev/null
+++ b/build/deploy/article-gpu-arch-1.typ.nano.html
@@ -0,0 +1 @@
+Designing a GPU architecture: Waves
Designing a GPU architecture: Waves
Last modified: 26. August 2025 21:13 (Git #5a9dfdd7)
In this article, we’ll be looking into the hardware of GPUs, and then designing our own. Specifically GPUs with unified shader architecture.
Comparison with CPUs
GPUs focus on operating on a lot of data at once (triangles, vertices, pixels, …), while CPUs focus on high performance on a single core, and low compute delay.
GPU Architecture
GPUs consist of multiple (these days at least 32) compute units (= CU).
Each compute unit has multiple SIMD units, also called “wave”, “wavefront” or “warp”. Compute units also have some fast local memory (tens of kilobytes), main memory access queues, texture units, a scalar unit, and other features.
The main memory (graphics memory) is typically outside of the GPU, and is slow, but high-bandwidth memory.
Waves
A wave is a SIMD processing unit consisting of typically 32 “lanes” (sometimes called threads).
Each wave in a CU has its own control flow, and the waves don’t have to be related to each other.
Instructions that waves support:
arithmetic operations
cross-lane data movement
CU local and global memory access: each SIMD lane can access a completely different address. similar to CPU gather / scatter.
synchronization with other CUs in the work group (see future article)
Since only the whole wave can do control flow, and not each lane, all operations can be masked so that they only apply to specific lanes.
=> waves are really similar to SIMD on modern CPUs
In modern GPUs, instruction execution in waves is superscalar, so there are multiple different execution units for executing different kinds of instructions, and multiple instructions can be executed at once, if there are free execution units, and they don’t depend on each other.
We’ll be exploring that in a future article.
Local memory
The local memory inside GPUs is banked, typically into 32 banks. The memory word size is typically 32 bits.
The addresses are interleaved, so for two banks:
addr 0 => bank 0
addr 1 => bank 1
addr 2 => bank 0
addr 3 => bank 1
…
Each bank has a dedicated access port, so for 32 banks, you get 32 access ports.
The lanes of the waves inside a CU get routed to the local memory banks magically.
Why are the banks interleaved?
When the whole wave wants to read a contiguous array of f32, i.e. when each lane performs some_f32_array[lane_id()], all 32 banks can be used at the same time.
Why multiple waves share the same local memory
A wave doesn’t do memory accesses every instruction, but also does computations. This means that there are cycles where the memory isn’t doing anything.
By making multiple waves share the same local memory and access ports, you save resources.
Global memory
Since global memory reads/writes are really slow, they happen asynchronously.
This means that a wave requests an access, then can continue executing, and then eventually waits for that access to finish.
Because of this, modern compilers automagically start the access before the data is needed, and then wait for the data later on.
Scalar unit
Most newer GPUs also have a scalar unit for saving energy when performing simple operations.
When the controller sees a scalar instruction in the code running on a wave, it automatically makes the code run on the scalar unit.
The scalar unit can be used for:
address calculation
partial reductions
execution of expensive operations not implemented on SIMD because of costs
GPU Programming Terminology
“work item”: typically maps to a SIMD lane
“kernel”: the code for a work item
“work group”: consists of multiple work items. typically maps to a CU. the __local memory in OpenCL applies to this.
“compute task”: a set of work groups
OpenCL and other APIs let you specify both the number of work groups and work items.
Since a program might specify a higher number of work items per work group than we have available, the compiler needs to be able to put multiple work items onto one SIMD lane.
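A rough sketch of what that folding could look like (plain C; run_work_item is a hypothetical helper standing in for the generated kernel body): each lane simply loops over the work items assigned to it.

    // hypothetical helper: executes the kernel body for one work item of a group
    void run_work_item(int group_id, int item_id);

    // one SIMD lane covering several work items of its work group
    void run_lane(int group_id, int lane_id, int items_per_group, int num_lanes) {
        for (int item = lane_id; item < items_per_group; item += num_lanes)
            run_work_item(group_id, item);
    }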
Our own architecture
We’ll go with these specs for now:
N compute units
2 waves per CU
32 lanes per wave.
1KiB local memory per lane => 64 KiB
48 vector registers of 16x32b per wave
one scalar unit per CU
128 global memory ports
16 async task completion “signal” slots per wave
no fancy out of order or superscalar execution
support standard 32 bit floating point, without exceptions.
Note that we won’t specify the exact instruction encoding.
Predefined Constants
We will pre-define 16 constants (as virtual vector registers):
zero
one
sid: 0,1,2,3,4,5,6
wave: the ID of the wave in the compute task, broadcasted to all elements.
u8_max: 255,255,…
n2nd: 1,2,1,2,…
n3rd: 1,2,4,1,…
n4th: 1,2,4,8,1,…
lo16: 1,1,1,… (x16) 0,0,0,… (x16)
ch2: 1,1,0,0,1,1,…
ch4: 1,1,1,1,0,0,0,0,1,…
alo8: 1 (x8) 0 (x8) 1 (x8) 0 (x8)
a few reserved ones
Operands
We define the following instruction operands:
Vreg: vector register
M: (read only) vector gp reg as mask (1b). only first 32 registers can be used as mask. the operand consists of two masks and-ed together, each of which can conditionally be inverted first. this means that this operand takes up 12 bits
Vany: Vreg or M
Simm: immediate scalar value
Sreg: the first element of a vector register, as scalar
Sany: a Simm or an Sreg
dist: Vany, or a Sany broadcasted to each element
sig: one of the 16 completion signal slots
Instructions
We will add more instructions in future articles.
Data Movement
fn mov(out out: Vreg, in wrmask: M, in val: dist)
fn select(out out: Vreg, in select: M, in false: dist, in true: dist)
fn first_where_true(out out: Sreg, in where: M, in values: dist): if none of the elements are true, it doesn’t overwrite the previous value in out.
cross-lane operations: not important for this article
Mathematics
simple (unmasked) u32, i32, and f32 elementwise arithmetic and logic operations: fn add<u32>(out out: Vreg, in left: Vany, in right: dist)
scalar arithmetic and logic operations: fn add<u32>(out out: Sreg, in left: Sany, in right: Sany)
partial reduction operations: “chunks” the input with a size of 8, reduces each chunk, and stores it in the first element of the chunk. this means that every 8th element will contain a partial result.
and operations to finish that reduction into the first element of the vector
Local memory
load 32 bit value at each elem where mask is true: fn local_load32(out out: Vreg, in mask: M, in addr: Vreg)
store 32 bit value at each elem where mask is true: fn local_store32(in addr: Vreg, in mask: M, in val: Vany)
Global (async) memory
start an async global load, and make the given signal correspond to the completion of the access: load 32 bit value at each elem where mask is true: fn global_load32(out sig: sig, out out: Vreg, in mask: M, in addr: Vreg)
see above and local_store32: fn global_store32(out sig: sig, in addr: Vreg, in mask: M, in val: Vany)
fn sig_done1(out r: Sreg, in sig: sig)
fn sig_done2(out r: Sreg, in sig1: sig, in sig2: sig)
fn sig_wait(out r: Sreg, in sig: sig)
fn sig_waitall2(out r: Sreg, in sig1: sig, in sig2: sig)
fn sig_waitall3(out r: Sreg, in sig1: sig, in sig2: sig, in sig3: sig)
fn sig_waitall4(out r: Sreg, in sig1: sig, in sig2: sig, in sig3: sig, in sig4: sig)
As a future extension, we could add an instruction that waits for any of the given signals to complete, and then jumps to a specific location, depending on which of those completed.
Control flow (whole wave)
branch if scalar is zero: fn brz(in dest: Simm, in val: Sany)
branch if scalar is not zero: fn brnz(in dest: Simm, in val: Sany)
branch on the whole wave if each element has a true value for the mask: fn br_all(in dest: Simm, in cond: M)
branch on the whole wave if any element has a true value for the mask: fn br_any(in dest: Simm, in cond: M)
Hand-compiling code
Now that we decided on a simple compute-only GPU architecture, we can try hand-compiling an OpenCL program.
I asked an LLM to produce an N*N matmul example (comments written manually):
// convenient number for our specific hardware
#define TILE_SIZE 8

// this kernel will be launched with dimensions:
// global[2] = { 128, 128 } = { N, N };
// local[2]  = { 8, 8 }     = { TILE_SIZE, TILE_SIZE };
__kernel void matmul_tiled(__global float* A, __global float* B,
                           __global float* C, const int N)
{
    int row = get_global_id(1);       // y
    int col = get_global_id(0);       // x
    int local_row = get_local_id(1);  // y
    int local_col = get_local_id(0);  // x

    // tile buffers in fast per-CU local memory
    __local float Asub[TILE_SIZE][TILE_SIZE];
    __local float Bsub[TILE_SIZE][TILE_SIZE];
    float sum = 0.0f;

    for (int t = 0; t < N / TILE_SIZE; ++t) {
        // load tiles into local
        int tiledRow = row;
        int tiledCol = t * TILE_SIZE + local_col;
        float av;
        if (tiledRow < N && tiledCol < N) av = A[tiledRow * N + tiledCol]; else av = 0.0f;
        Asub[local_row][local_col] = av;

        tiledRow = t * TILE_SIZE + local_row;
        tiledCol = col;
        float bv;
        if (tiledRow < N && tiledCol < N) bv = B[tiledRow * N + tiledCol]; else bv = 0.0f;
        Bsub[local_row][local_col] = bv;

        // sync local access across local grp
        barrier(CLK_LOCAL_MEM_FENCE);

        for (int k = 0; k < TILE_SIZE; ++k)
            sum += Asub[local_row][k] * Bsub[k][local_col];

        // sync local access across local grp
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (row < N && col < N) C[row * N + col] = sum;
}
First, we have to decide on how we want to map the kernel to the hardware.
Since the local dimension of the kernel is 8*8, which is 64 work items, we can map each local group to one CU by mapping 32 work items to one wave and using both waves available on the CU for the local group.
Our global dimension is 128*128, which means that we would need 256 compute units to run all work groups at once. Since we probably don’t have 256 compute units, GPUs, including ours, have an on-hardware task scheduler for scheduling work groups onto compute units.
Outro
Modern GPUs are really complex, but designing a simple GPU is not that hard either.
Subscribe to the Atom feed to get notified of future articles.
\ No newline at end of file
diff --git a/build/deploy/article-make-regex-engine-1.typ.desktop.html b/build/deploy/article-make-regex-engine-1.typ.desktop.html
new file mode 100644
index 0000000..b86985f
--- /dev/null
+++ b/build/deploy/article-make-regex-engine-1.typ.desktop.html
@@ -0,0 +1 @@
+Introduction to RegEx
Note that the PDF Version of this page might look a bit better styling wise.
Introduction
If you are any kind of programmer, you’ve probably heard of RegEx
RegEx (Regular expression) is kind of like a small programming language used to define string search and replace patterns.
RegEx might seem overwhelming at first, but you can learn the most important features of RegEx very quickly.
It is important to mention that there is not a single standard for RegEx syntax, but instead each “implementation” has its own quirks and additional features. Most common features however behave identically on most RegEx “engines”/implementations.
Syntax
The behavior of RegEx expressions / patterns depends on the match options passed to the RegEx engine.
Common match options:
Anchored at start and end of line
Case insensitive
multi-line or instead whole string
“Atoms”
In this article, we will refer to single expression parts as “atoms”.
Characters
Just use the character that you want to match. For example a to match an a. This however does not work for all characters, because many are part of special RegEx syntax.
Escaped Characters
The previously mentioned special characters like [ can be matched by putting a backslash in front of them: \[
Character Groups
RegEx engines already define some groups of characters that can make writing RegEx expressions quicker.
Anchors
^ is used to assert the beginning of a line in multi-line mode, or the beginning of the string in whole-string mode.
$ is used to assert the end of a line in multi-line mode, or the end of the string in whole-string mode.
Some combinators will match either “lazy” or “greedy”.
Lazy matching is when the engine only matches as many characters as required to get to the next step. This should almost always be used.
Greedy matching is when the engine tries to match as many characters as possible. The problem with this is that it might cause “backtracking”, which happens when the engine goes back in the pattern multiple times to ensure that as many characters as possible were matched. This can cause big performance issues.
Combinators
Multiple atoms can be combined together to form more complex patterns.
Chain
When two expressions are next to each other, they will be chained together, which means that both will be evaluated in-order.
Example: x\d matches an x and then a digit, like for example x9
Or
Two expressions separated by a | cause the RegEx engine to first try to match the left side, and only if it fails, it tries the right side instead.
Note that “or” has a long left and right scope, which means that ab|cd will match either ab or cd
Or-Not
A ? tries to match the expression to the left of it, but won’t fail the overall match if it doesn’t succeed.
Note that “or-not” has a short left scope, which means that ab? will always match a, and then try to match b
Repeated
An expression followed by either a * for greedy repeat, or a *? for lazy repeat.
This matches as many times as possible, but can also match the pattern zero times.
Note that this has a short left scope.
Repeated At Least Once
An expression followed by either a + for greedy repeat, or a +? for lazy repeat.
This matches as many times as possible, and at least one time.
Note that this has a short left scope.
(Non-Capture) Group
Groups multiple expressions together for scoping.
Example: (?:abc) will just match abc
Capture Group
Similar to Non-Capture Groups except that they capture the matched text. This allows the matched text of the inner expression to be extracted later.
Capture group IDs are enumerated from left to right, starting with 1.
Example: (abc)de will match abcde, and store abc in group 1.
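As an illustration of how capture groups are consumed from a host program (using POSIX regcomp/regexec here purely as one example engine; details differ between engines), this prints the text captured by group 1:

    #include <regex.h>
    #include <stdio.h>

    int main(void) {
        regex_t re;
        regmatch_t m[2];  // m[0] = whole match, m[1] = capture group 1
        regcomp(&re, "(abc)de", REG_EXTENDED);

        const char *text = "xyabcdez";
        if (regexec(&re, text, 2, m, 0) == 0)
            printf("group 1: %.*s\n", (int)(m[1].rm_eo - m[1].rm_so), text + m[1].rm_so);

        regfree(&re);
        return 0;
    }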
Character Set
By surrounding multiple characters in square brackets, the engine will match any of them. Special characters or expressions won’t be parsed inside them, which means that this can also be used to escape characters.
For example: [abc] will match either a, b or c.
and [ab(?:c)] will match either a, b, (, ?, :, c, or ).
Character sets can also contain ranges. For example: [0-9a-z] will match either any digit, or any lowercase letter.
Conclusion
RegEx is perfect for when you just want to match some patterns, but the syntax can make patterns very hard to read or modify.
In the next article, we will start to dive into implementing RegEx.
Stay tuned!
\ No newline at end of file
diff --git a/build/deploy/article-make-regex-engine-1.typ.min.html b/build/deploy/article-make-regex-engine-1.typ.min.html
new file mode 100644
index 0000000..027611c
--- /dev/null
+++ b/build/deploy/article-make-regex-engine-1.typ.min.html
@@ -0,0 +1 @@
+Introduction to RegEx
Making a simple RegEx engine: Part 1: Introduction to RegEx
If you are any kind of programmer, you’ve probably heard of RegEx
RegEx (Regular expression) is kind of like a small programming language used to define string search and replace patterns.
RegEx might seem overwhelming at first, but you can learn the most important features of RegEx very quickly.
It is important to mention that there is not a single standard for RegEx syntax, but instead each “implementation” has its own quirks and additional features. Most common features however behave identically on most RegEx “engines”/implementations.
Syntax
The behavior of RegEx expressions / patterns depends on the match options passed to the RegEx engine.
Common match options:
Anchored at start and end of line
Case insensitive
multi-line or instead whole string
“Atoms”
In this article, we will refer to single expression parts as “atoms”.
Characters
Just use the character that you want to match. For example a to match an a. This however does not work for all characters, because many are part of special RegEx syntax.
Escaped Characters
The previously mentioned special characters like [ can be matched by putting a backslash in front of them: \[
Character Groups
RegEx engines already define some groups of characters that can make writing RegEx expressions quicker.
Anchors
^ is used to assert the beginning of a line in multi-line mode, or the beginning of the string in whole-string mode.
$ is used to assert the end of a line in multi-line mode, or the end of the string in whole-string mode.
Some combinators will match either “lazy” or “greedy”.
Lazy matching is when the engine only matches as many characters as required to get to the next step. This should almost always be used.
Greedy matching is when the engine tries to match as many characters as possible. The problem with this is that it might cause “backtracking”, which happens when the engine goes back in the pattern multiple times to ensure that as many characters as possible were matched. This can cause big performance issues.
Combinators
Multiple atoms can be combined together to form more complex patterns.
Chain
When two expressions are next to each other, they will be chained together, which means that both will be evaluated in-order.
Example: x\d matches an x and then a digit, like for example x9
Or
Two expressions separated by a | cause the RegEx engine to first try to match the left side, and only if it fails, it tries the right side instead.
Note that “or” has a long left and right scope, which means that ab|cd will match either ab or cd
Or-Not
A ? tries to match the expression to the left of it, but won’t fail the overall match if it doesn’t succeed.
Note that “or-not” has a short left scope, which means that ab? will always match a, and then try to match b
Repeated
An expression followed by either a * for greedy repeat, or a *? for lazy repeat.
This matches as many times as possible, but can also match the pattern zero times.
Note that this has a short left scope.
Repeated At Least Once
An expression followed by either a + for greedy repeat, or a +? for lazy repeat.
This matches as many times as possible, and at least one time.
Note that this has a short left scope.
(Non-Capture) Group
Groups multiple expressions together for scoping.
Example: (?:abc) will just match abc
Capture Group
Similar to Non-Capture Groups except that they capture the matched text. This allows the matched text of the inner expression to be extracted later.
Capture group IDs are enumerated from left to right, starting with 1.
Example: (abc)de will match abcde, and store abc in group 1.
Character Set
By surrounding multiple characters in square brackets, the engine will match any of them. Special characters or expressions won’t be parsed inside them, which means that this can also be used to escape characters.
For example: [abc] will match either a, b or c.
and [ab(?:c)] will match either a, b, (, ?, :, c, or ).
Character sets can also contain ranges. For example: [0-9a-z] will match either any digit, or any lowercase letter.
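For example, a quick check with POSIX regcomp/regexec (again, just one concrete engine) confirms that [0-9a-z] accepts a digit or a lowercase letter and rejects everything else:

    #include <regex.h>
    #include <stdio.h>

    int main(void) {
        regex_t re;
        regcomp(&re, "^[0-9a-z]$", REG_EXTENDED);

        const char *inputs[] = {"7", "q", "Q", "!"};
        for (int i = 0; i < 4; ++i)
            printf("%s -> %s\n", inputs[i],
                   regexec(&re, inputs[i], 0, NULL, 0) == 0 ? "match" : "no match");

        regfree(&re);
        return 0;
    }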
Conclusion
RegEx is perfect for when you just want to match some patterns, but the syntax can make patterns very hard to read or modify.
In the next article, we will start to dive into implementing RegEx.
Stay tuned!
\ No newline at end of file
diff --git a/build/deploy/article-make-regex-engine-1.typ.min.pdf b/build/deploy/article-make-regex-engine-1.typ.min.pdf
new file mode 100644
index 0000000..7a2974a
Binary files /dev/null and b/build/deploy/article-make-regex-engine-1.typ.min.pdf differ
diff --git a/build/deploy/article-make-regex-engine-1.typ.nano.html b/build/deploy/article-make-regex-engine-1.typ.nano.html
new file mode 100644
index 0000000..cc004ca
--- /dev/null
+++ b/build/deploy/article-make-regex-engine-1.typ.nano.html
@@ -0,0 +1 @@
+Introduction to RegEx
Making a simple RegEx engine: Part 1: Introduction to RegEx
Last modified: 26. July 2025 14:20 (Git #fee2a364)
If you are any kind of programmer, you’ve probably heard of RegEx
RegEx (Regular expression) is kind of like a small programming language used to define string search and replace patterns.
RegEx might seem overwhelming at first, but you can learn the most important features of RegEx very quickly.
It is important to mention that there is not a single standard for RegEx syntax, but instead each “implementation” has its own quirks and additional features. Most common features however behave identically on most RegEx “engines”/implementations.
Syntax
The behavior of RegEx expressions / patterns depends on the match options passed to the RegEx engine.
Common match options:
Anchored at start and end of line
Case insensitive
multi-line or instead whole string
“Atoms”
In this article, we will refer to single expression parts as “atoms”.
Characters
Just use the character that you want to match. For example a to match an a. This however does not work for all characters, because many are part of special RegEx syntax.
Escaped Characters
The previously mentioned special characters like [ can be matched by putting a backslash in front of them: \[
Character Groups
RegEx engines already define some groups of characters that can make writing RegEx expressions quicker.
Anchors
^ is used to assert the beginning of a line in multi-line mode, or the beginning of the string in whole-string mode.
$ is used to assert the end of a line in multi-line mode, or the end of the string in whole-string mode.
Some combinators will match either “lazy” or “greedy”.
Lazy matching is when the engine only matches as many characters as required to get to the next step. This should almost always be used.
Greedy matching is when the engine tries to match as many characters as possible. The problem with this is that it might cause “backtracking”, which happens when the engine goes back in the pattern multiple times to ensure that as many characters as possible were matched. This can cause big performance issues.
Combinators
Multiple atoms can be combined together to form more complex patterns.
Chain
When two expressions are next to each other, they will be chained together, which means that both will be evaluated in-order.
Example: x\d matches an x and then a digit, like for example x9
Or
Two expressions separated by a | cause the RegEx engine to first try to match the left side, and only if it fails, it tries the right side instead.
Note that “or” has a long left and right scope, which means that ab|cd will match either ab or cd
Or-Not
A ? tries to match the expression to the left of it, but won’t fail the overall match if it doesn’t succeed.
Note that “or-not” has a short left scope, which means that ab? will always match a, and then try to match b
Repeated
An expression followed by either a * for greedy repeat, or a *? for lazy repeat.
This matches as many times as possible, but can also match the pattern zero times.
Note that this has a short left scope.
Repeated At Least Once
An expression followed by either a + for greedy repeat, or a +? for lazy repeat.
This matches as many times as possible, and at least one time.
Note that this has a short left scope.
(Non-Capture) Group
Groups multiple expressions together for scoping.
Example: (?:abc) will just match abc
Capture Group
Similar to Non-Capture Groups except that they capture the matched text. This allows the matched text of the inner expression to be extracted later.
Capture group IDs are enumerated from left to right, starting with 1.
Example: (abc)de will match abcde, and store abc in group 1.
Character Set
By surrounding multiple characters in square brackets, the engine will match any of them. Special characters or expressions won’t be parsed inside them, which means that this can also be used to escape characters.
For example: [abc] will match either a, b or c.
and [ab(?:c)] will match either a, b, (, ?, :, c, or ).
Note that the PDF Version of this page might look a bit better styling wise.
Introduction
Function calls have some overhead, which can sometimes be a big issue for other optimizations. Because of that, compiler backends (should) inline function calls. There are however many issues with just greedily inlining calls…
Greedy inlining with heuristics
This is the most obvious approach: we can just inline all functions that are called from only one place, and then inline calls where the called function does not have many instructions.
Example:
function f32 $square(f32 %x) {
@entry:
    // this is stupid, but I couldn't come up with a better example
    f32 %e = add %x, 0
    f32 %out = add %e, %x
    ret %out
}
The second option is to inline the $hypot call into $tri_hypot. (There are also some other options)
Now in this case, it seems obvious to prefer inlining $square into $hypot.
Issue 2: ABI requirements on argument passing
If we assume the target ABI only has one f32 register for passing arguments, then we would have to generate additional instructions for passing the second argument of $hypot, and then it might actually be more efficient to inline $hypot instead of $square.
This example is not realistic, but this issue actually occurs when compiling lots of code.
Another related issue is that having more arguments arranged in a fixed way will require lots of moving data around at the call site.
A solution to this is to make the heuristics depend not just on code size, but also on the number of arguments / outputs passed to the function.
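A minimal sketch of such a heuristic (plain C; the weights and names are made up for illustration and not taken from any real compiler):

    // rough "benefit of inlining this call" score: positive means inlining looks worthwhile
    int inline_benefit(int callee_insts,   // number of instructions in the callee body
                       int num_args,       // arguments passed at this call site
                       int num_results,    // values returned to this call site
                       int other_callers)  // how many other call sites remain
    {
        int call_overhead = 2 + num_args + num_results;              // call/ret plus argument shuffling
        int size_growth   = (other_callers > 0) ? callee_insts : 0;  // body gets duplicated
        return call_overhead - size_growth / 4;                      // arbitrary weighting
    }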
If the target has a efficient hypot operation, then that operation will only be used if we inline $myfunc into $callsite.
This means that inlining is now dependent on… instruction selection??
This is not the only optimization prevented by not inlining the call. If $callsite were to be called in a loop, then not inlining would prevent vectorization.
Function outlining
A related optimization is “outlining”. It’s the opposite to inlining. It moves duplicate code into a function, to reduce code size, and sometimes increase performance (because of instruction caching)
If we do inlining separately from outlining, we often get suboptimal code.
A better approach
We can instead first inline all inlinable calls, and then perform more aggressive outlining.
Step 1: inlining
We inline all function calls, except for:
self recursion (obviously)
functions explicitly marked as no-inline by the user
Step 2: detect duplicate code
There are many algorithms for doing this.
The goal of this step is to both:
maximize size of outlinable section
minimize size of code
Step 3: slightly reduce size of outlinable section
The goal is to reduce size of outlinable sections, to make the code more optimal.
This should be ABI- and instruction-dependent, and have the goal of:
reducing argument shuffles required at all call sites
reducing register pressure
not preventing good isel choices and optimizations.
this is also dependent on the targeted code size.
Step 4: perform outlining
This is obvious.
Issue 1: high compile-time memory usage
Inlining all function calls first will increase the memory usage during compilation by A LOT
I’m sure that there is a smarter way to implement this method, without actually performing the inlining…
Conclusion
Function inlining is much more complex than one might think.
PS: No idea how to implement this…
Subscribe to the Atom feed to get notified about future compiler-related articles.
\ No newline at end of file
diff --git a/build/deploy/compiler-inlining.typ.min.html b/build/deploy/compiler-inlining.typ.min.html
new file mode 100644
index 0000000..3336242
--- /dev/null
+++ b/build/deploy/compiler-inlining.typ.min.html
@@ -0,0 +1 @@
+Automatically inlining functions is not easy
Function calls have some overhead, which can sometimes be a big issue for other optimizations. Because of that, compiler backends (should) inline function calls. There are however many issues with just greedily inlining calls…
Greedy inlining with heuristics
This is the most obvious approach: we can just inline all functions that are called from only one place, and then inline calls where the called function does not have many instructions.
Example:
function f32 $square(f32 %x) {
@entry:
    // this is stupid, but I couldn't come up with a better example
    f32 %e = add %x, 0
    f32 %out = add %e, %x
    ret %out
}
The second option is to inline the $hypot call into $tri_hypot. (There are also some other options)
Now in this case, it seems obvious to prefer inlining $square into $hypot.
Issue 2: ABI requirements on argument passing
If we assume the target ABI only has one f32 register for passing arguments, then we would have to generate additional instructions for passing the second argument of $hypot, and then it might actually be more efficient to inline $hypot instead of $square.
This example is not realistic, but this issue actually occurs when compiling lots of code.
Another related issue is that having more arguments arranged in a fixed way will require lots of moving data around at the call site.
A solution to this is to make the heuristics depend not just on code size, but also on the number of arguments / outputs passed to the function.
If the target has a efficient hypot operation, then that operation will only be used if we inline $myfunc into $callsite.
This means that inlining is now dependent on… instruction selection??
This is not the only optimization prevented by not inlining the call. If $callsite were to be called in a loop, then not inlining would prevent vectorization.
Function outlining
A related optimization is “outlining”. It’s the opposite to inlining. It moves duplicate code into a function, to reduce code size, and sometimes increase performance (because of instruction caching)
If we do inlining separately from outlining, we often get suboptimal code.
A better approach
We can instead first inline all inlinable calls, and then perform more aggressive outlining.
Step 1: inlining
We inline all function calls, except for:
self recursion (obviously)
functions explicitly marked as no-inline by the user
Step 2: detect duplicate code
There are many algorithms for doing this.
The goal of this step is to both:
maximize size of outlinable section
minimize size of code
Step 3: slightly reduce size of outlinable section
The goal is to reduce size of outlinable sections, to make the code more optimal.
This should be ABI- and instruction-dependent, and have the goal of:
reducing argument shuffles required at all call sites
reducing register pressure
not preventing good isel choices and optimizations.
this is also dependent on the targeted code size.
Step 4: perform outlining
This is obvious.
Issue 1: high compile-time memory usage
Inlining all function calls first will increase the memory usage during compilation by A LOT
I’m sure that there is a smarter way to implement this method, without actually performing the inlining…
Conclusion
Function inlining is much more complex than one might think.
Subscribe to the Atom feed to get notified about future compiler-related articles.
\ No newline at end of file
diff --git a/build/deploy/compiler-inlining.typ.min.pdf b/build/deploy/compiler-inlining.typ.min.pdf
new file mode 100644
index 0000000..5354ffc
Binary files /dev/null and b/build/deploy/compiler-inlining.typ.min.pdf differ
diff --git a/build/deploy/compiler-inlining.typ.nano.html b/build/deploy/compiler-inlining.typ.nano.html
new file mode 100644
index 0000000..d1125bd
--- /dev/null
+++ b/build/deploy/compiler-inlining.typ.nano.html
@@ -0,0 +1 @@
+Automatically inlining functions is not easy
Automatically inlining functions is not easy
Last modified: 11. August 2025 16:38 (Git #9c2913af)
Function calls have some overhead, which can sometimes be a big issue for other optimizations. Because of that, compiler backends (should) inline function calls. There are however many issues with just greedily inlining calls…
Greedy inlining with heuristics
This is the most obvious approach: we can just inline all functions that are called from only one place, and then inline calls where the called function does not have many instructions.
Example:
function f32 $square(f32 %x) {
@entry:
    // this is stupid, but I couldn't come up with a better example
    f32 %e = add %x, 0
    f32 %out = add %e, %x
    ret %out
}
The second option is to inline the $hypot call into $tri_hypot. (There are also some other options)
Now in this case, it seems obvious to prefer inlining $square into $hypot.
Issue 2: ABI requirements on argument passing
If we assume the target ABI only has one f32 register for passing arguments, then we would have to generate additional instructions for passing the second argument of $hypot, and then it might actually be more efficient to inline $hypot instead of $square.
This example is not realistic, but this issue actually occurs when compiling lots of code.
Another related issue is that having more arguments arranged in a fixed way will require lots of moving data around at the call site.
A solution to this is to make the heuristics depend not just on code size, but also on the number of arguments / outputs passed to the function.
If the target has a efficient hypot operation, then that operation will only be used if we inline $myfunc into $callsite.
This means that inlining is now dependent on… instruction selection??
This is not the only optimization prevented by not inlining the call. If $callsite were to be called in a loop, then not inlining would prevent vectorization.
Function outlining
A related optimization is “outlining”. It’s the opposite to inlining. It moves duplicate code into a function, to reduce code size, and sometimes increase performance (because of instruction caching)
If we do inlining separately from outlining, we often get suboptimal code.
A better approach
We can instead first inline all inlinable calls, and then perform more aggressive outlining.
Step 1: inlining
We inline all function calls, except for:
self recursion (obviously)
functions explicitly marked as no-inline by the user
Step 2: detect duplicate code
There are many algorithms for doing this.
The goal of this step is to both:
maximize size of outlinable section
minimize size of code
Step 3: slightly reduce size of outlinable section
The goal is to reduce size of outlinable sections, to make the code more optimal.
This should be ABI- and instruction-dependent, and have the goal of:
reducing argument shuffles required at all call sites
reducing register pressure
not preventing good isel choices and optimizations.
this is also dependent on the targeted code size.
Step 4: perform outlining
This is obvious.
Issue 1: high compile-time memory usage
Inlining all function calls first will increase the memory usage during compilation by A LOT
I’m sure that there is a smarter way to implement this method, without actually performing the inlining…
Conclusion
Function inlining is much more complex than one might think.
PS: No idea how to implement this…
Subscribe to the Atom feed to get notified about future compiler-related articles.
\ No newline at end of file
diff --git a/build/deploy/compiler-pattern-matching.typ.desktop.html b/build/deploy/compiler-pattern-matching.typ.desktop.html
new file mode 100644
index 0000000..b89a97f
--- /dev/null
+++ b/build/deploy/compiler-pattern-matching.typ.desktop.html
@@ -0,0 +1 @@
+Approaches to Compiler Pattern Matching
Note that the PDF Version of this page might look a bit better styling wise.
Introduction
Compilers often have to deal with pattern matching and rewriting (find-and-replace) inside the compiler IR (intermediate representation).
Common use cases for pattern matching in compilers:
“peephole optimizations”: the most common kind of optimization in compilers. They find a short sequence of code and replace it with some other code. For example replacing x & (1 << b) with a bit test operation.
finding a sequence of operations for complex optimization passes to operate on: advanced compilers have complex optimizations that can’t really be performed with simple IR operation replacements, and instead require complex logic. Patterns are used here to find operation sequences where those optimizations are applicable, and also to extract details inside that sequence.
code generation: converting the IR to machine code / VM bytecode. A compiler needs to find operations (or sequences of operations) inside the IR, and “replace” them with machine code.
Simplest Approach
Currently, most compilers mostly do this inside their source code. For example, in MLIR, most (but not all) pattern matches are performed in C++ code.
The only advantage to this approach is that it doesn’t require a complex pattern matching system.
I only recommend doing this for small compiler toy projects.
Disadvantages
Doing pattern matching this way has many disadvantages.
Some (but not all):
debugging pattern match rules can be hard
IR rewrites need to be tracked manually (for debugging)
source locations and debug information also need to be tracked manually, which often isn’t implemented very well.
verbose and barely readable pattern matching code
overall error-prone
I myself did pattern matching this way in my old compiler backend, and I speak from experience when I say that this approach sucks (in most cases).
Pattern Matching DSLs
A custom language for describing IR patterns and IR transformations (aka rewrites).
I will put this into the category of “structured pattern matching”.
An example is Cranelift’s ISLE DSL:
;; x ^ x == 0.
(rule (simplify (bxor (ty_int ty) x x))
      (subsume (iconst_u ty 0)))
Fun fact: tinygrad actually decompiles the Python code inside the second element of its pattern/rewrite pairs, and runs multiple optimization passes on that.
This approach is used by many popular compilers such as LLVM, GCC, and Cranelift for peephole optimizations and code generation.
Advantages
debugging and tracking of rewrites, source locations, and debug information can be done properly
patterns themselves can be inspected and modified programmatically.
they are easier to use and read than manual pattern matching in the source code.
There is however an even better alternative:
Pattern Matching Dialects
I will also put this method into the category of “structured pattern matching”.
The main example of this is MLIR, with the pdl and the transform dialects. Sadly few projects/people use these dialects, and instead do pattern matching in C++ code. Probably because the dialects aren’t documented very well.
What are compiler dialects?
Modern compilers, especially multi-level compilers, such as MLIR, have their operations grouped in “dialects”.
Each dialect either represents specific kinds of operations, like arithmetic operations, or a specific backend’s/frontend’s operations, such as the llvm, emitc, and the spirv dialects in MLIR.
Dialects commonly contain operations, data types, as well as optimization and dialect conversion passes.
Core Concept
The IR patterns and transformations are represented using the compiler’s IR. This is mostly done in a separate dialect, with dedicated operations for operating on IR.
Examples
MLIR’s pdl dialect can be used to replace arith.addi with my.add like this:
the pattern matching infrastructure can optimize its own patterns: the compiler can operate on patterns and rewrite rules as if they were normal operations. This removes the need for special infrastructure regarding pattern matching DSLs.
the compiler could AOT compile patterns
the compiler could optimize, analyze, and combine patterns to reduce compile time.
IR (de-)serialization infrastructure in the compiler can also be used to exchange peephole optimizations.
bragging rights: your compiler represents its patterns in its own IR
Combining with a DSL
I recommend having a pattern matching / rewrite DSL, that transpiles to pattern matching / rewrite dialect operations.
The advantage of this over just having a rewrite dialect is that it makes patterns even more readable (and maintainable!)
E-Graphs
E-Graphs are magical datastructures that can be used to efficiently encode all possible transformations, and then select the best transformation.
Even though E-Graphs solve most problems, I still recommend using a pattern matching dialect, especially in multi-level compilers, to be more flexible and have more future-proof pattern matching, or in case you decide that you want to match some complex patterns manually.
More Advantages of Structured Pattern Matching
Smart Pattern Matchers
Instead of brute-forcing all peephole optimizations (of which there can be a LOT in advanced compilers), the compiler can organize all the patterns to provide more efficient matching. I didn’t yet investigate how to do this. If you have any ideas regarding this, please contact me.
There are other ways to speed up the pattern matching and rewrite process using this too.
Reversible Transformations
I don’t think that there currently is any compiler that does this. If you do know one, again, please contact me.
Optimizing compilers typically deal with code (mostly written by people) that is on a lower level than the compiler theoretically supports. For example, humans tend to write code like this for extracting a bit: x & (1 << b), but compilers tend to have a high-level bit test operation (with exceptions). A reason for having higher-level primitives is that it allows the compiler to do more high-level optimizations, but also some target architectures have a bit test operation that is more optimal.
This is not just the case for “low-level” things like bit tests, but also high level concepts, like a reduction over an array, or even the implementation of a whole algorithm. For example LLVM, since recently, can detect implementations of CRC.
LLVM actually doesn’t have many dedicated operations like a bit-test operation, and instead canonicalizes all bit-test patterns to x & (1 << b) != 0, and matches for that in compiler passes that expect bit test operations.
Now let’s go back to the x & (1 << b) (bit test) example. Optimizing compilers should be able to detect that, and other bit test patterns (like x & (1 << b) > 0), and then replace those with a bit-test operation. But they also have to be able to convert bit-test operations back to their implementation for compilation targets that don’t have a bit-test instruction. Currently, compiler backends do this by having separate patterns for converting bit-test to its dedicated operation, and back.
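For illustration, these are two common source-level spellings of “test bit b of x” that a backend would want to treat as the same thing (and either canonicalize or map to a bit-test instruction):

    // both return nonzero exactly when bit b of x is set
    int bit_test_ne(unsigned x, unsigned b) { return (x & (1u << b)) != 0; }
    int bit_test_gt(unsigned x, unsigned b) { return (x & (1u << b)) > 0; }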
A better solution is to associate a set of implementations with the bit test operation, and make the compiler automatically reverse those to generate the best implementation (in the instruction selector for example).
Implementing pattern/transformation reversion can be challenging however, but it provides many benefits, and all “big” compilers should definitely do this, in my opinion.
Runtime Library
Compilers typically come with a runtime library that implement more complex operations that aren’t supported by most processors or architectures.
The implementation of those functions should also use that pattern matching dialect. This allows your backend to detect code written by users with a similar implementation as in the runtime library, giving you some additional optimizations for free.
I don’t think any compiler currently does this either.
Problems with Pattern Matching
The main problem is ordering the patterns.
As an example, consider these three patterns:
;; A
(add x:Const y) => (add y x)
;; B
(sub (add x y:Const) z:Const) => (lea x y (const_neg z))
;; C
(add x 1) => (inc x)
Now what should the compiler do when it sees this:
(sub (add 5 1) 2)
All three patterns would match:
;; apply A
(sub (add 5 1) 2) => (sub (add 1 5) 2)
;; only B applies now
(sub (add 1 5) 2) => (lea 1 5 (const_neg 2))
;; nothing applies anymore
Now which of those transformations should be performed?
This is not as easy to solve as it seems, especially in the context of instruction selection (specifically scheduling), where the performance on processors depends on a sequence of instructions, instead of just a single instruction.
Superscalar CPUs
Modern processor architecture features like superscalar execution make this even more complicated.
As a simple, unrealistic example, let’s imagine a CPU (core) that has one bit operations execution unit, and two ALU execution units / ports. This means that the CPU can execute two instructions in the ALU unit and one instruction in the bit ops unit at the same time.
One might think that always optimizing a & (1 << b) to a bit test operation is good for performance. But in this example, that is not the case.
If we have a function that does a lot of bitwise operations next to each other, and the compiler replaces all bit tests with bit test operations, suddenly all operations depend on the bit ops unit, which means that instead of executing 3 instructions at a time (ignoring pipelining), the CPU can only execute one instruction at a time.
This shows that we won’t know if an optimization is actually good, until we are at a late point in the compilation process where we can simulate the CPU’s instruction scheduling.
This does not only apply to instruction selection, but also to more higher-level optimizations, such as loop and control flow related optimizations.
Conclusion
One can see how pattern matching dialects are the best option to approach pattern matching.
Someone wanted me to insert a takeaway here, but I won’t.
PS: I’ll hunt down everyone who still decides to do pattern matching in their compiler source after reading this article.
\ No newline at end of file
diff --git a/build/deploy/compiler-pattern-matching.typ.min.html b/build/deploy/compiler-pattern-matching.typ.min.html
new file mode 100644
index 0000000..01224c7
--- /dev/null
+++ b/build/deploy/compiler-pattern-matching.typ.min.html
@@ -0,0 +1 @@
+Approaches to Compiler Pattern Matching
Compilers often have to deal with pattern matching and rewriting (find-and-replace) inside the compiler IR (intermediate representation).
Common use cases for pattern matching in compilers:
“peephole optimizations”: the most common kind of optimization in compilers. They find a short sequence of code and replace it with some other code. For example replacing x & (1 << b) with a bit test operation.
finding a sequence of operations for complex optimization passes to operate on: advanced compilers have complex optimizations that can’t really be performed with simple IR operation replacements, and instead require complex logic. Patterns are used here to find operation sequences where those optimizations are applicable, and also to extract details inside that sequence.
code generation: converting the IR to machine code / VM bytecode. A compiler needs to find operations (or sequences of operations) inside the IR, and “replace” them with machine code.
Simplest Approach
Currently, most compilers mostly do this inside their source code. For example, in MLIR, most (but not all) pattern matches are performed in C++ code.
The only advantage to this approach is that it doesn’t require a complex pattern matching system.
I only recommend doing this for small compiler toy projects.
Disadvantages
Doing pattern matching this way has many disadvantages.
Some (but not all):
debugging pattern match rules can be hard
IR rewrites need to be tracked manually (for debugging)
source locations and debug information also need to be tracked manually, which often isn’t implemented very well.
verbose and barely readable pattern matching code
overall error-prone
I myself did pattern matching this way in my old compiler backend, and I speak from experience when I say that this approach sucks (in most cases).
Pattern Matching DSLs
A custom language for describing IR patterns and IR transformations (aka rewrites).
I will put this into the category of “structured pattern matching”.
An example is Cranelift’s ISLE DSL:
;; x ^ x == 0.
(rule (simplify (bxor (ty_int ty) x x))
      (subsume (iconst_u ty 0)))
Fun fact: tinygrad actually decompiles the Python code inside the second element of its pattern/rewrite pairs, and runs multiple optimization passes on that.
This approach is used by many popular compilers such as LLVM, GCC, and Cranelift for peephole optimizations and code generation.
Advantages
debugging and tracking of rewrites, source locations, and debug information can be done properly
patterns themselves can be inspected and modified programmatically.
they are easier to use and read than manual pattern matching in the source code.
There is however an even better alternative:
Pattern Matching Dialects
I will also put this method into the category of “structured pattern matching”.
The main example of this is MLIR, with the pdl and the transform dialects. Sadly few projects/people use these dialects, and instead do pattern matching in C++ code. Probably because the dialects aren’t documented very well.
What are compiler dialects?
Modern compilers, especially multi-level compilers, such as MLIR, have their operations grouped in “dialects”.
Each dialect either represents specific kinds of operations, like arithmetic operations, or a specific backend’s/frontend’s operations, such as the llvm, emitc, and the spirv dialects in MLIR.
Dialects commonly contain operations, data types, as well as optimization and dialect conversion passes.
Core Concept
The IR patterns and transformations are represented using the compiler’s IR. This is mostly done in a separate dialect, with dedicated operations for operating on IR.
Examples
MLIR’s pdl dialect can be used to replace arith.addi with my.add like this:
the pattern matching infrastructure can optimize it’s own patterns: The compiler can operate on patterns and rewrite rules like they are normal operations. This removes the need for special infrastructure regarding pattern matching DSLs.
the compiler could AOT compile patterns
the compiler could optimize, analyze, and combine patterns to reduce compile time.
IR (de-)serialization infrastructure in the compiler can also be used to exchange peephole optimizations.
bragging rights: your compiler represents its patterns in it’s own IR
Combining with a DSL
I recommend having a pattern matching / rewrite DSL, that transpiles to pattern matching / rewrite dialect operations.
The advantage of this over just having a rewrite dialect is that it makes patterns even more readable (and maintainable!)
Even though E-Graphs solve most problems, I still recommend using a pattern matching dialect, especially in multi-level compilers, to be more flexible, and have more future-proof pattern matching, or you decide that you want to match some complex patterns manually.
More Advantages of Structured Pattern Matching
Smart Pattern Matchers
Instead of brute-forcing all peephole optimizations (of which there can be a LOT in advanced compilers), the compiler can organize all the patterns to provide more efficient matching. I didn’t yet investigate how to do this. If you have any ideas regarding this, please contact me.
There are other ways to speed up the pattern matching and rewrite process using this too.
Reversible Transformations
I don’t think that there currently is any compiler that does this. If you do know one, again, please contact me.
Optimizing compilers typically deal with code (mostly written by people) that is on a lower level than the compiler theoretically supports. For example, humans tend to write code like this for extracting a bit: x & (1<< b), but compilers tend to have a high-level bit test operation (with exceptions). A reason for having higher-level primitives is that it allows the compiler to do more high-level optimizations, but also some target architectures have a bit test operation, that is more optimal.
This is not just the case for “low-level” things like bit tests, but also high level concepts, like a reduction over an array, or even the implementation of a whole algorithm. For example LLVM, since recently, can detect implementations of CRC.
LLVM actually doesn’t have many dedicated operations like a bit-test operation, and instead canonicalizes all bit-test patterns to x & (1<< b) !=0, and matches for that in compiler passes that expect bit test operations.
Now let’s go back to the x & (1<< b) (bit test) example. Optimizing compilers should be able to detect that, and other bit test patterns (like x & (1<< b) >0), and then replace those with a bit-test operation. But they also have to be able to convert bit-test operations back to their implementation for compilation targets that don’t have a bit-test instruction. Currently, compiler backends do this by having separate patterns for converting bit-test to it’s dedicated operation, and back.
A better solution is to associate a set of implementations with the bit test operation, and make the compiler automatically reverse those to generate the best implementation (in the instruction selector for example).
Implementing pattern/transformation reversion can be challenging however, but it provides many benefits, and all “big” compilers should definitely do this, in my opinion.
Runtime Library
Compilers typically come with a runtime library that implement more complex operations that aren’t supported by most processors or architectures.
The implementation of those functions should also use that pattern matching dialect. This allows your backend to detect code written by users with a similar implementation as in the runtime library, giving you some additional optimizations for free.
I don’t think any compiler currently does this either.
Problems with Pattern Matching
The main problem is ordering the patterns.
As an example, consider these three patterns:
;; A (add x:Const y) => (add y x)
;; B (sub (add x y:Const) z:Const) => (lea x y (const_neg z))
;; C (add x 1) => (inc x)
Now what should the compiler do when it sees this:
(sub (add51) 2)
All three patterns would match:
;; apply A (sub (add51) 2) => (sub (add15) 2) ;; only B applies now (sub (add15) 2) => (lea15 (const_neg2)) ;; nothing applies anymore
Now which of those transformations should be performed?
This is not as easy to solve as it seems, especially in the context of instruction selection (specifically scheduling), where the performance on processors depends on a sequence of instructions, instead of just a single instruction.
Superscalar CPUs
Modern processor architecture features like superscalar execution make this even more complicated.
As a simple, unrealistic example, let’s imagine a CPU (core) that has one bit operations execution unit, and two ALU execution units / ports. This means that the CPU can execute two instructions in the ALU unit and one instruction in the bit ops unit at the same time.
One might think that always optimizing a & (1 << b) to a bit test operation is good for performance. But in this example, that is not the case.
If we have a function that does a lot of bitwise operations next to each other, and the compiler replaces all bit tests with bit test operations, suddenly all operations depend on the bit ops unit, which means that instead of executing 3 instructions at a time (ignoring pipelining), the CPU can only execute one instruction at a time.
This shows that we won’t know if an optimization is actually good, until we are at a late point in the compilation process where we can simulate the CPU’s instruction scheduling.
This does not only apply to instruction selection, but also to more higher-level optimizations, such as loop and control flow related optimizations.
Conclusion
One can see how pattern matching dialects are the best option to approach pattern matching.
Someone wanted me to insert a takeaway here, but I won’t.
PS: I’ll hunt down everyone who still decides to do pattern matching in their compiler source after reading this article.
\ No newline at end of file
diff --git a/build/deploy/compiler-pattern-matching.typ.min.pdf b/build/deploy/compiler-pattern-matching.typ.min.pdf
new file mode 100644
index 0000000..7208fa1
Binary files /dev/null and b/build/deploy/compiler-pattern-matching.typ.min.pdf differ
diff --git a/build/deploy/compiler-pattern-matching.typ.nano.html b/build/deploy/compiler-pattern-matching.typ.nano.html
new file mode 100644
index 0000000..ed6a486
--- /dev/null
+++ b/build/deploy/compiler-pattern-matching.typ.nano.html
@@ -0,0 +1 @@
+Approaches to Compiler Pattern Matching
Approaches to pattern matching in compilers
Last modified: 19. August 2025 09:55 (Git #34fd6adb)
Compilers often have to deal with pattern matching and rewriting (find-and-replace) inside the compiler IR (intermediate representation).
Common use cases for pattern matching in compilers:
“peephole optimizations”: the most common kind of optimization in compilers. They find a short sequence of code and replace it with some other code. For example, replacing x & (1 << b) with a bit test operation (a rule-style sketch of this follows the list).
finding a sequence of operations for complex optimization passes to operate on: advanced compilers have complex optimizations that can’t really be performed with simple IR operation replacements, and instead require complex logic. Patterns are used here to find operation sequences where those optimizations are applicable, and also to extract details inside that sequence.
code generation: converting the IR to machine code / VM bytecode. A compiler needs to find operations (or sequences of operations) inside the IR, and “replace” them with machine code.
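To make the peephole case concrete, the bit-test example from the first item can be written as a rewrite rule in the informal notation also used later in this article (bit_test is an assumed high-level operation, not one from a specific compiler):
;; hypothetical peephole rule: turn the masked-shift idiom into a bit test
(ne (and x (shl 1 b)) 0) => (bit_test x b)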
Simplest Approach
Currently, most compilers do this directly inside their source code. For example, in MLIR, most (but not all) pattern matches are performed in C++ code.
The only advantage to this approach is that it doesn’t require a complex pattern matching system.
I only recommend doing this for small toy compiler projects.
Disadvantages
Doing pattern matching this way has many disadvantages.
Some (but not all):
debugging pattern match rules can be hard
IR rewrites need to be tracked manually (for debugging)
source locations and debug information also need to be tracked manually, which often isn’t implemented very well.
verbose and barely readable pattern matching code
overall error-prone
I myself did pattern matching this way in my old compiler backend, and I speak from experience when I say that this approach sucks (in most cases).
Pattern Matching DSLs
A custom language for describing IR patterns and IR transformations (aka rewrites).
I will put this into the category of “structured pattern matching”.
An example is Cranelift’s ISLE DSL:
;; x ^ x == 0.
(rule (simplify (bxor (ty_int ty) x x))
      (subsume (iconst_u ty 0)))
Fun fact: tinygrad's rewrite rules are (pattern, Python function) pairs, and tinygrad actually decompiles the Python code inside the second element of the pair and runs multiple optimization passes on it.
This approach is used by many popular compilers such as LLVM, GCC, and Cranelift for peephole optimizations and code generation.
Advantages
debugging and tracking of rewrites, source locations, and debug information can be done properly
patterns themselves can be inspected and modified programmatically.
they are easier to use and read than manual pattern matching in the source code.
There is however an even better alternative:
Pattern Matching Dialects
I will also put this method into the category of “structured pattern matching”.
The main example of this is MLIR, with the pdl and the transform dialects. Sadly, few projects/people use these dialects and instead do pattern matching in C++ code, probably because the dialects aren't documented very well.
What are compiler dialects?
Modern compilers, especially multi-level compilers, such as MLIR, have their operations grouped in “dialects”.
Each dialect either represents specific kinds of operations, like arithmetic operations, or a specific backend’s/frontend’s operations, such as the llvm, emitc, and the spirv dialects in MLIR.
Dialects commonly contain operations, data types, as well as optimization and dialect conversion passes.
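For example, integer addition in MLIR lives in the arith dialect (the value names here are placeholders):
%sum = arith.addi %a, %b : i32   // an operation from the arith dialect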
Core Concept
The IR patterns and transformations are represented using the compiler’s IR. This is mostly done in a separate dialect, with dedicated operations for operating on IR.
Examples
MLIR’s pdl dialect can be used to replace arith.addi with my.add like this:
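A minimal sketch of such a pattern, written with the documented pdl operations (my.add stands for a custom dialect op; the pattern name and benefit value are arbitrary):
pdl.pattern @replace_addi : benefit(1) {
  %type = pdl.type
  %lhs = pdl.operand
  %rhs = pdl.operand
  // match any arith.addi
  %root = pdl.operation "arith.addi"(%lhs, %rhs : !pdl.value, !pdl.value) -> (%type : !pdl.type)
  pdl.rewrite %root {
    // build the replacement my.add and swap it in
    %new = pdl.operation "my.add"(%lhs, %rhs : !pdl.value, !pdl.value) -> (%type : !pdl.type)
    pdl.replace %root with %new
  }
}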
This representation has several advantages:
the pattern matching infrastructure can optimize its own patterns: the compiler can operate on patterns and rewrite rules as if they were normal operations. This removes the need for special infrastructure around pattern matching DSLs.
the compiler could AOT compile patterns
the compiler could optimize, analyze, and combine patterns to reduce compile time.
IR (de-)serialization infrastructure in the compiler can also be used to exchange peephole optimizations.
bragging rights: your compiler represents its patterns in its own IR
Combining with a DSL
I recommend having a pattern matching / rewrite DSL that transpiles to pattern matching / rewrite dialect operations.
The advantage of this over just having a rewrite dialect is that it makes patterns even more readable (and maintainable!).
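As a sketch of what that could look like (the surface syntax here is made up, not an existing DSL):
;; hypothetical DSL rule
(rule (replace (arith.addi x (constant 0)) with x))
;; the DSL frontend would transpile this into a pdl.pattern similar to the one
;; shown above, so the rule itself is ordinary IR that the compiler can inspect,
;; optimize, and serialize.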
E-Graphs
E-Graphs are magical data structures that can be used to efficiently encode all possible transformations, and then select the best transformation.
Even though E-Graphs solve most of these problems, I still recommend using a pattern matching dialect, especially in multi-level compilers: it is more flexible, makes pattern matching more future-proof, and still helps if you decide to match some complex patterns manually.
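The rough intuition (informal sketch, not tied to a specific E-Graph library): applying a rule does not destroy the old form, it only adds the new form to an equivalence class, and extraction later picks the cheapest member:
;; after applying (mul x 2) => (shl x 1) on an E-Graph:
e-class { (mul x 2), (shl x 1) }   ;; both forms are kept; extraction picks one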
More Advantages of Structured Pattern Matching
Smart Pattern Matchers
Instead of brute-forcing all peephole optimizations (of which there can be a LOT in advanced compilers), the compiler can organize all the patterns to provide more efficient matching. I haven't investigated yet how best to do this. If you have any ideas regarding this, please contact me.
There are other ways to speed up the pattern matching and rewrite process using this too.
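One obvious starting point (just a sketch of a common technique, not something investigated here) is to index the rules by the opcode of their root operation, so the matcher only tries rules that can possibly apply at a given operation:
;; using rules A, B, C from the "Problems with Pattern Matching" section below:
add -> { A, C }
sub -> { B }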
Reversible Transformations
I don’t think that there currently is any compiler that does this. If you do know one, again, please contact me.
Optimizing compilers typically deal with code (mostly written by people) that is on a lower level than the compiler theoretically supports. For example, humans tend to write code like this for extracting a bit: x & (1 << b), but compilers tend to have a high-level bit test operation (with exceptions). One reason for having such higher-level primitives is that they allow the compiler to do more high-level optimizations; another is that some target architectures have a dedicated bit test instruction that is faster.
This is not just the case for “low-level” things like bit tests, but also for high-level concepts, like a reduction over an array, or even the implementation of a whole algorithm. For example, LLVM recently gained the ability to detect implementations of CRC.
LLVM actually doesn't have many dedicated operations of this kind (there is no bit-test operation, for example); instead it canonicalizes all bit-test patterns to x & (1 << b) != 0 and matches that form in compiler passes that expect bit tests.
Now let's go back to the x & (1 << b) (bit test) example. Optimizing compilers should be able to detect that, and other bit test patterns (like x & (1 << b) > 0), and replace them with a bit-test operation. But they also have to be able to convert bit-test operations back to their implementation for compilation targets that don't have a bit-test instruction. Currently, compiler backends do this by having separate patterns for converting the bit-test pattern to its dedicated operation, and back.
A better solution is to associate a set of implementations with the bit test operation, and make the compiler automatically reverse those to generate the best implementation (in the instruction selector for example).
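In the informal rule notation from above, this could be sketched as a single bidirectional declaration (bit_test and the <=> arrow are assumptions, not an existing compiler feature):
;; a single declaration, usable in both directions
(bit_test x b) <=> (ne (and x (shl 1 b)) 0)
;; forward:  canonicalize user code to the high-level bit_test operation
;; backward: lower bit_test again on targets without a bit-test instruction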
Implementing pattern/transformation reversal can be challenging, but it provides many benefits, and all “big” compilers should definitely do this, in my opinion.
Runtime Library
Compilers typically come with a runtime library that implements more complex operations that aren't natively supported by most processors or architectures.
The implementation of those functions should also use the pattern matching dialect. This allows your backend to detect user code whose implementation is similar to the one in the runtime library, giving you some additional optimizations for free.
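A sketch of the idea (informal notation again; bswap16 and its expansion are illustrative assumptions): if the runtime-library definition of an operation is itself expressed as a rewrite rule, the same rule can recognize user code that open-codes the operation:
;; runtime implementation of a 16-bit byte swap, written as a reversible rule
(bswap16 x) <=> (or (shl (and x 0xff) 8) (shr (and x 0xff00) 8))
;; backward: lowers bswap16 on targets without a byte-swap instruction
;; forward:  user code that hand-writes the shift/mask version is detected
;;           and replaced with bswap16, so it benefits from the same optimizations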
I don’t think any compiler currently does this either.
Problems with Pattern Matching
The main problem is ordering the patterns.
As an example, consider these three patterns:
;; A
(add x:Const y) => (add y x)
;; B
(sub (add x y:Const) z:Const) => (lea x y (const_neg z))
;; C
(add x 1) => (inc x)
Now what should the compiler do when it sees this:
(sub (add 5 1) 2)
All three patterns would match:
;; apply A
(sub (add 5 1) 2) => (sub (add 1 5) 2)
;; only B applies now
(sub (add 1 5) 2) => (lea 1 5 (const_neg 2))
;; nothing applies anymore
Now which of those transformations should be performed?
This is not as easy to solve as it seems, especially in the context of instruction selection (specifically scheduling), where performance depends on the whole sequence of instructions rather than on a single instruction.
Superscalar CPUs
Modern processor architecture features like superscalar execution make this even more complicated.
As a simple, unrealistic example, let's imagine a CPU (core) that has one bit-operations execution unit and two ALU execution units / ports. This means that the CPU can execute two instructions in the ALU units and one instruction in the bit ops unit at the same time.
One might think that always optimizing a & (1 << b) to a bit test operation is good for performance. But in this example, that is not the case.
If we have a function that does a lot of bitwise operations next to each other, and the compiler replaces all bit tests with bit test operations, suddenly all of those operations compete for the bit ops unit, which means that instead of executing 3 instructions at a time (ignoring pipelining), the CPU can only execute one instruction at a time.
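A rough illustration with the example machine from above (two ALU ports plus one bit-ops port; popcnt stands in for any other bit-ops-unit work):
;; everything lowered to bit_test queues up behind the single bit-ops port:
(bit_test a 1) (bit_test b 2) (popcnt m)      ;; serialized, roughly one per cycle
;; kept as (ne (and x (shl 1 b)) 0), the tests can run on the two ALU ports
;; in parallel with the popcnt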
This shows that we won't know whether an optimization is actually good until we are at a late point in the compilation process, where we can simulate the CPU's instruction scheduling.
This does not only apply to instruction selection, but also to higher-level optimizations, such as loop and control-flow related optimizations.
Conclusion
One can see why pattern matching dialects are the best way to approach pattern matching in a compiler.
Someone wanted me to insert a takeaway here, but I won’t.
PS: I’ll hunt down everyone who still decides to do pattern matching in their compiler source after reading this article.
\ No newline at end of file
diff --git a/build/deploy/index.html b/build/deploy/index.html
new file mode 100644
index 0000000..c7e04a4
--- /dev/null
+++ b/build/deploy/index.html
@@ -0,0 +1 @@
+Alexander Nutz
Latest version of my badge: https://alex.vxcc.dev/res/badge.png
Check out these websites:
\ No newline at end of file
diff --git a/build/deploy/index.typ.min.pdf b/build/deploy/index.typ.min.pdf
new file mode 100644
index 0000000..e6d11f8
Binary files /dev/null and b/build/deploy/index.typ.min.pdf differ
diff --git a/build/deploy/project-etc-nand.typ.desktop.html b/build/deploy/project-etc-nand.typ.desktop.html
new file mode 100644
index 0000000..0d66ef4
--- /dev/null
+++ b/build/deploy/project-etc-nand.typ.desktop.html
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/build/deploy/project-etc-nand.typ.min.html b/build/deploy/project-etc-nand.typ.min.html
new file mode 100644
index 0000000..66528d3
--- /dev/null
+++ b/build/deploy/project-etc-nand.typ.min.html
@@ -0,0 +1 @@
+etc-nand
etc-nand
Overview
etc-nand is a real-world ETC.A CPU built almost entirely from quad NAND gate ICs (74HC00).
It will probably be finished in a few months.
Estimates
Estimated gate count:
2800 NAND gates
320 tristate buffers
Estimated component counts:
700x 74hc00 quad NAND gates
40x 74HC54 octal tristate buffers
a few simple resistors
Planned Specifications
ETC.A base instruction set + byte operations + S&F + Von Neumann
The CPU will communicate with peripherals over a 16 bit data + 15 bit address memory bus
Purchase
You will be able to purchase one in the future.
Stay tuned!
Images
Images of PCBs that are either already manufactured or currently being manufactured by JLCPCB.
\ No newline at end of file
diff --git a/build/deploy/project-etc-nand.typ.min.pdf b/build/deploy/project-etc-nand.typ.min.pdf
new file mode 100644
index 0000000..31a38f7
Binary files /dev/null and b/build/deploy/project-etc-nand.typ.min.pdf differ
diff --git a/build/deploy/project-etc-nand.typ.nano.html b/build/deploy/project-etc-nand.typ.nano.html
new file mode 100644
index 0000000..bf50b01
--- /dev/null
+++ b/build/deploy/project-etc-nand.typ.nano.html
@@ -0,0 +1 @@
+etc-nand
etc-nand
Overview
etc-nand is a real-world ETC.A CPU built almost entirely from quad NAND gate ICs (74HC00).
It will probably be finished in a few months.
Estimates
Estimated gate count:
2800 NAND gates
320 tristate buffers
Estimated component counts:
700x 74hc00 quad NAND gates
40x 74HC54 octal tristate buffers
a few simple resistors
Planned Specifications
ETC.A base instruction set + byte operations + S&F + Von Neumann
The CPU will communicate with peripherals over a 16 bit data + 15 bit address memory bus
Purchase
You will be able to purchase one in the future.
Stay tuned!
Images
Images of PCBs that are either already manufactured or currently being manufactured by JLCPCB.