diff --git a/pages/article-gpu-arch-1.typ b/pages/article-gpu-arch-1.typ
index f6faba5..588e4c3 100644
--- a/pages/article-gpu-arch-1.typ
+++ b/pages/article-gpu-arch-1.typ
@@ -45,7 +45,8 @@
   Each compute unit has multiple SIMD units, also called "wave", "wavefront" or "warp".
   Compute units also have some fast local memory (tens of kilobytes),
-  main memory access queues, texture units, a scalar unit, and other features. (see future article)
+  main memory access queues, texture units, a scalar unit, and other features.
+  Subscribe to the #flink("atom.xml")[Atom feed] to get notified of future articles.
   The main memory (graphics memory) is typically outside of the GPU, and is slow, but high-bandwidth memory.
 ]
@@ -67,6 +68,15 @@
   => waves are really similar to SIMD on modern CPUs
 ]
 
+#section[
+  In modern GPUs, instruction execution in waves is superscalar:
+  there are multiple execution units for different kinds of instructions,
+  and multiple instructions can be executed at once, if there are free execution units
+  and they don't depend on each other.
+
+  We'll be exploring that in a future article.
+]
+
 #section[
   == Local memory
   The local memory inside GPUs is banked, typically into 32 banks.
@@ -144,6 +154,7 @@
   - 48 vector registers of 16x32b per wave
   - one scalar unit per CU
   - 128 global memory ports
+  - 16 async task completion "signal" slots per wave
   - no fancy out of order or superscalar execution
   - support standard 32 bit floating point, without exceptions.
@@ -181,6 +192,7 @@
   - `Sreg`: the first element of a vector register, as scalar
   - `Sany`: a `Simm` or an `Sreg`
   - `dist`: `Vany`, or a `Sany` broadcasted to each element
+  - `sig`: one of the 16 completion signal slots
 ]
 
 #section[
@@ -210,19 +222,120 @@
 ]
 
 #section[
-  === Memory
-  `fn local_load`
-  TODO
+  === Local memory
+  - load a 32 bit value at each element where the mask is true:
+    `fn local_load32(out out: Vreg, in mask: M, in addr: Vreg)`
+  - store a 32 bit value at each element where the mask is true:
+    `fn local_store32(in addr: Vreg, in mask: M, in val: Vany)`
+]
+
+#section[
+  === Global (async) memory
+  - start an async global load, and make the given signal correspond to the completion of the access;
+    loads a 32 bit value at each element where the mask is true:
+    `fn global_load32(out sig: sig, out out: Vreg, in mask: M, in addr: Vreg)`
+  - the store counterpart of the above (also async, with a completion signal); operands as in `local_store32`:
+    `fn global_store32(out sig: sig, in addr: Vreg, in mask: M, in val: Vany)`
+  - check whether the given signal has completed:
+    `fn sig_done1(out r: Sreg, in sig: sig)`
+  - check whether both given signals have completed:
+    `fn sig_done2(out r: Sreg, in sig1: sig, in sig2: sig)`
+  - wait until the given signal has completed:
+    `fn sig_wait(out r: Sreg, in sig: sig)`
+  - wait until both given signals have completed:
+    `fn sig_waitall2(out r: Sreg, in sig1: sig, in sig2: sig)`
+  - wait until all three given signals have completed:
+    `fn sig_waitall3(out r: Sreg, in sig1: sig, in sig2: sig, in sig3: sig)`
+  - wait until all four given signals have completed:
+    `fn sig_waitall4(out r: Sreg, in sig1: sig, in sig2: sig, in sig3: sig, in sig4: sig)`
+
+  As a future extension, we could add an instruction that waits for any of the
+  given signals to complete, and then jumps to a specific location, depending on which of those completed.
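+
+  As a rough sketch of how these fit together (the assembly-style syntax and the
+  `v*` / `m*` / `sig*` names below are just illustrative placeholders, not finalized mnemonics),
+  a masked global load can overlap with independent work before the wave waits on its signal:
+
+  ```
+  global_load32(sig0, v1, m0, v0)  // start async load: addresses in v0, results go to v1
+  // ... independent instructions can run here, hiding the global memory latency ...
+  sig_wait(v2, sig0)               // block the wave until the load has completed
+  local_store32(v3, m0, v1)        // e.g. stage the loaded values into local memory
+  ```
+
+  This is the point of making global accesses asynchronous:
+  the wave can keep doing useful work between issuing an access and waiting on its signal.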
 ]
 
 #section[
   === Control flow (whole wave)
-  TODO
+  - branch if the scalar is zero:
+    `fn brz(in dest: Simm, in val: Sany)`
+  - branch if the scalar is not zero:
+    `fn brnz(in dest: Simm, in val: Sany)`
+  - branch the whole wave if every element has a true value for the mask:
+    `fn br_all(in dest: Simm, in cond: M)`
+  - branch the whole wave if any element has a true value for the mask:
+    `fn br_any(in dest: Simm, in cond: M)`
 ]
 
 #section[
   = Hand-compiling code
-  TODO
+  Now that we have decided on a simple compute-only GPU architecture,
+  we can try hand-compiling an OpenCL program.
+
+  I asked an LLM to produce an N*N matmul example (comments written manually):
+  ```c
+  // convenient number for our specific hardware
+  #define TILE_SIZE 8
+
+  // this kernel will be launched with dimensions:
+  // global[2] = { 128,128 } = { N, N };
+  // local[2] = { 8,8 } = { TILE_SIZE, TILE_SIZE };
+  __kernel void matmul_tiled(
+      __global float* A,
+      __global float* B,
+      __global float* C,
+      const int N)
+  {
+    int row = get_global_id(1); // y
+    int col = get_global_id(0); // x
+
+    __local float Asub[TILE_SIZE][TILE_SIZE];
+    __local float Bsub[TILE_SIZE][TILE_SIZE];
+
+    float sum = 0.0f;
+
+    for (int t = 0; t < N / TILE_SIZE; ++t) {
+      // load tiles into local memory
+      int tiledRow = row;
+      int tiledCol = t * TILE_SIZE + get_local_id(0);
+      if (tiledRow < N && tiledCol < N)
+        Asub[get_local_id(1)][get_local_id(0)] = A[tiledRow * N + tiledCol];
+      else
+        Asub[get_local_id(1)][get_local_id(0)] = 0.0f;
+
+      tiledRow = t * TILE_SIZE + get_local_id(1);
+      tiledCol = col;
+      if (tiledRow < N && tiledCol < N)
+        Bsub[get_local_id(1)][get_local_id(0)] = B[tiledRow * N + tiledCol];
+      else
+        Bsub[get_local_id(1)][get_local_id(0)] = 0.0f;
+
+      // sync local access across the local group
+      barrier(CLK_LOCAL_MEM_FENCE);
+
+      for (int k = 0; k < TILE_SIZE; ++k)
+        sum += Asub[get_local_id(1)][k] * Bsub[k][get_local_id(0)];
+
+      // sync local access across the local group
+      barrier(CLK_LOCAL_MEM_FENCE);
+    }
+
+    if (row < N && col < N)
+      C[row * N + col] = sum;
+  }
+  ```
+]
+
+#section[
+  First, we have to decide how we want to map the kernel to the hardware.
+
+  Since the local dimension of the kernel is 8*8, which is 64 kernel instances (work-items),
+  we can map each local group to one CU, by mapping 32 kernel instances to one wave,
+  and using both waves available on one CU for the local group.
+
+  Our global dimension is 128*128, which is 256 local groups of 64 kernel instances,
+  so we would need 256 compute units to run them all at once.
+  But since we probably don't have 256 compute units,
+  GPUs, including ours, will have an on-hardware task scheduler
+  for scheduling tasks onto compute units.
+]
+
+#section[
+  = Outro
+  Modern GPUs are really complex, but designing a simple GPU is not that hard.
+
+  Subscribe to the #flink("atom.xml")[Atom feed] to get notified of future articles.
 ]
 ]