# Designing a GPU architecture: Waves

Git revision [#5a9dfdd7](https://github.com/alex-s168/website/tree/5a9dfdd720bcba1e5d1562279e09c674a30a174b)
Modified at 26. August 2025 21:13
Written by [alex_s168](https://alex.vxcc.dev)

## Introduction

In this article, we'll be looking into the hardware of GPUs, and then designing our own; specifically, GPUs with a unified shader architecture.

### Comparison with CPUs

GPUs focus on operating on a lot of data at once (triangles, vertices, pixels, …), while CPUs focus on high performance on a single core and low compute latency.

## GPU Architecture

GPUs consist of multiple (these days at least 32) compute units (= CU).

Each compute unit has multiple SIMD units, also called "waves", "wavefronts", or "warps". Compute units also have some fast local memory (tens of kilobytes), main memory access queues, texture units, a scalar unit, and other features.

The main memory (graphics memory) is typically outside of the GPU, and is slow, but high-bandwidth memory.

### Waves

A wave is a SIMD processing unit consisting of typically 32 "lanes" (sometimes called threads).

Each wave in a CU has separate control flow, and the waves don't have to be related to each other.

Instructions that waves support:
- arithmetic operations
- cross-lane data movement
- CU-local and global memory access: each SIMD lane can access a completely different address, similar to CPU gather/scatter
- synchronization with other CUs in the work group (see future article)

Since only the whole wave can do control flow, and not each lane, all operations can be masked so that they only apply to specific lanes.

=> waves are really similar to SIMD on modern CPUs

In modern GPUs, instruction execution in waves is superscalar: there are multiple different execution units for different kinds of instructions, and multiple instructions can be executed at once, if there are free execution units and the instructions don't depend on each other.

We'll be exploring that in a future article.

### Local memory

The local memory inside GPUs is banked, typically into 32 banks. The memory word size is typically 32 bits.

The addresses are interleaved, so for two banks:
- addr 0 => bank 0
- addr 1 => bank 1
- addr 2 => bank 0
- addr 3 => bank 1
- …

Each bank has a dedicated access port, so for 32 banks, you get 32 access ports.

The lanes of the waves inside a CU get routed to the local memory banks magically.

#### Why are the banks interleaved?

When the whole wave wants to read a contiguous array of `f32`, i.e. when each lane performs `some_f32_array[lane_id()]`, all 32 banks can be used at the same time.
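As a concrete worked example (assuming the usual mapping of word address `a` to bank `a mod 32`): when lane `i` reads `some_f32_array[i]`, lane 0 hits bank 0, lane 1 hits bank 1, and so on, so all 32 accesses happen in parallel. If every lane instead read `some_f32_array[32 * i]`, all 32 addresses would fall into bank 0, and the accesses would have to be serialized (a bank conflict).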
#### Why multiple waves share the same local memory

A wave doesn't do memory accesses every instruction; it also does computations. This means that there are cycles where the memory isn't doing anything.

By making multiple waves share the same local memory and access ports, you save resources.

### Global memory

Since global memory reads/writes are really slow, they happen asynchronously.

This means that a wave requests an access, can then continue executing, and eventually waits for that access to finish.

Because of this, modern compilers automagically start the access before the data is needed, and then wait for the data later on.

### Scalar unit

Most newer GPUs also have a scalar unit for saving energy when performing simple operations.

When the controller sees a scalar instruction in the code running on a wave, it automatically makes that code run on the scalar unit.

The scalar unit can be used for:
- address calculation
- partial reductions
- execution of expensive operations not implemented on SIMD because of costs
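For example (a typical case, not tied to any specific GPU): when every lane loads from `base_ptr + lane_id() * 4`, the `base_ptr` part is identical across all lanes, so it can be computed once on the scalar unit, and only the small per-lane offset needs to go through the vector datapath.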
the <span style="border:1pt solid #000;border-radius:2pt;padding:1.6pt;display:inline-block"><code style=white-space:pre-wrap>__local</code></span> memory in OpenCL applies to this.<li>“compute task”: a set of work groups</ul></div><div><p><br>OpenCL and other APIs let you specify both the number of work groups and work items.<p>Since a program might specify a higher number of work items per work group than we have available, the compiler needs to be able to put multiple work items onto one SIMD lane.</div><div><p><br><span style=text-decoration:underline><h2>Our own architecture</h2></span> Well go with these specs for now:<ul><li>N compute units<li>2 waves per CU<li>32 lanes per wave.<li>1KiB local memory per lane => 64 KiB<li>48 vector registers of 16x32b per wave<li>one scalar unit per CU<li>128 global memory ports<li>16 async task completion “signal” slots per wave<li>no fancy out of order or superscalar execution<li>support standard 32 bit floating point, without exceptions.</ul><p>Note that we wont specify the exact instruction encoding.</div><div><p><br><span style=text-decoration:underline><h3>Predefined Constants</h3></span> We will pre-define 16 constants (as virtual vector registers):<ul><li><span style="border:1pt solid #000;border-radius:2pt;padding:1.6pt;display:inline-block"><code style=white-space:pre-wrap>zero</code></span><li><span style="border:1pt solid #000;border-radius:2pt;padding:1.6pt;display:inline-block"><code style=white-space:pre-wrap>one</code></span><li><span style="border:1pt solid #000;border-radius:2pt;padding:1.6pt;display:inline-block"><code style=white-space:pre-wrap>sid</code></span>: 0,1,2,3,4,5,6<li><span style="border:1pt solid #000;border-radius:2pt;padding:1.6pt;display:inline-block"><code style=white-space:pre-wrap>wave</code></span>: the ID of the wave in the compute task, broadcasted to all elements.<li><span style="border:1pt solid #000;border-radius:2pt;padding:1.6pt;display:inline-block"><code style=white-space:pre-wrap>u8_max</code></span>: 255,255,…<li><span style="border:1pt solid #000;border-radius:2pt;padding:1.6pt;display:inline-block"><code style=white-space:pre-wrap>n2nd</code></span>: 1,2,1,2,…<li><span style="border:1pt solid #000;border-radius:2pt;padding:1.6pt;display:inline-block"><code style=white-space:pre-wrap>n3rd</code></span>: 1,2,4,1,…<li><span style="border:1pt solid #000;border-radius:2pt;padding:1.6pt;display:inline-block"><code style=white-space:pre-wrap>n4th</code></span>: 1,2,4,8,1,…<li><span style="border:1pt solid #000;border-radius:2pt;padding:1.6pt;display:inline-block"><code style=white-space:pre-wrap>lo16</code></span>: 1,1,1,… (x16) 0,0,0,… (x16)<li><span style="border:1pt solid #000;border-radius:2pt;padding:1.6pt;display:inline-block"><code style=white-space:pre-wrap>ch2</code></span>: 1,1,0,0,1,1,…<li><span style="border:1pt solid #000;border-radius:2pt;padding:1.6pt;display:inline-block"><code style=white-space:pre-wrap>ch4</code></span>: 1,1,1,1,0,0,0,0,1,…<li><span style="border:1pt solid #000;border-radius:2pt;padding:1.6pt;display:inline-block"><code style=white-space:pre-wrap>alo8</code></span>: 1 (x8) 0 (x8) 1 (x8) 0 (x8)<li>a few reserved ones</ul></div><div><p><br><span style=text-decoration:underline><h3>Operands</h3></span> We define the following instruction operands:<ul><li><span style="border:1pt solid #000;border-radius:2pt;padding:1.6pt;display:inline-block"><code style=white-space:pre-wrap>Vreg</code></span>: vector register<li><span style="border:1pt solid 
## Our own architecture

We'll go with these specs for now:
- N compute units
- 2 waves per CU
- 32 lanes per wave
- 1 KiB local memory per lane => 64 KiB per CU
- 48 vector registers of 16x32b per wave
- one scalar unit per CU
- 128 global memory ports
- 16 async task completion "signal" slots per wave
- no fancy out-of-order or superscalar execution
- support for standard 32-bit floating point, without exceptions

Note that we won't specify the exact instruction encoding.

### Predefined Constants

We will pre-define 16 constants (as virtual vector registers):
- `zero`
- `one`
- `sid`: 0,1,2,3,4,5,6,…
- `wave`: the ID of the wave in the compute task, broadcast to all elements
- `u8_max`: 255,255,…
- `n2nd`: 1,2,1,2,…
- `n3rd`: 1,2,4,1,…
- `n4th`: 1,2,4,8,1,…
- `lo16`: 1,1,1,… (x16) 0,0,0,… (x16)
- `ch2`: 1,1,0,0,1,1,…
- `ch4`: 1,1,1,1,0,0,0,0,1,…
- `alo8`: 1 (x8) 0 (x8) 1 (x8) 0 (x8)
- a few reserved ones

### Operands

We define the following instruction operands:
- `Vreg`: vector register
- `M`: (read-only) vector GP register used as a mask (1 bit per element); only the first 32 registers can be used as a mask. The operand consists of two masks AND-ed together, each of which can conditionally be inverted first. This means that this operand takes up 12 bits.
- `Vany`: a `Vreg` or an `M`
- `Simm`: immediate scalar value
- `Sreg`: the first element of a vector register, as a scalar
- `Sany`: a `Simm` or an `Sreg`
- `dist`: a `Vany`, or a `Sany` broadcast to each element
- `sig`: one of the 16 completion signal slots

### Instructions

We will add more instructions in future articles.

#### Data Movement

- `fn mov(out out: Vreg, in wrmask: M, in val: dist)`
- `fn select(out out: Vreg, in select: M, in false: dist, in true: dist)`
- `fn first_where_true(out out: Sreg, in where: M, in values: dist)`: if none of the elements are true, it doesn't overwrite the previous value in `out`.
- cross-lane operations: not important for this article
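To get a feel for these, here is a rough sketch (in the same notation, with hypothetical register names) of how a per-lane `x = cond ? a : b` from a kernel could be lowered; the wave never branches, it just selects per lane:

```
// hypothetical register assignment: v0 = a, v1 = b, m4 = cond (as a mask), v2 = x
select(v2, m4, v1, v0)   // out=v2, select=m4, false=v1, true=v0  =>  per lane: x = cond ? a : b
mov(v2, m4, v0)          // alternative: only write x on the lanes where cond is true;
                         // lanes where cond is false keep their previous value of x
```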
#### Mathematics

- simple (unmasked) `u32`, `i32`, and `f32` elementwise arithmetic and logic operations: `fn add<u32>(out out: Vreg, in left: Vany, in right: dist)`
- scalar arithmetic and logic operations: `fn add<u32>(out out: Sreg, in left: Sany, in right: Sany)`
- partial reduction operations: these "chunk" the input with a chunk size of 8, reduce each chunk, and store the result in the first element of the chunk. This means that every 8th element will contain a partial result.
- and operations to finish that reduction into the first element of the vector

#### Local memory

- load a 32-bit value at each element where the mask is true: `fn local_load32(out out: Vreg, in mask: M, in addr: Vreg)`
- store a 32-bit value at each element where the mask is true: `fn local_store32(in addr: Vreg, in mask: M, in val: Vany)`

#### Global (async) memory

- start an async global load, and make the given signal correspond to the completion of the access; loads a 32-bit value at each element where the mask is true: `fn global_load32(out sig: sig, out out: Vreg, in mask: M, in addr: Vreg)`
- see above and `local_store32`: `fn global_store32(out sig: sig, in addr: Vreg, in mask: M, in val: Vany)`
- `fn sig_done1(out r: Sreg, in sig: sig)`
- `fn sig_done2(out r: Sreg, in sig1: sig, in sig2: sig)`
- `fn sig_wait(out r: Sreg, in sig: sig)`
- `fn sig_waitall2(out r: Sreg, in sig1: sig, in sig2: sig)`
- `fn sig_waitall3(out r: Sreg, in sig1: sig, in sig2: sig, in sig3: sig)`
- `fn sig_waitall4(out r: Sreg, in sig1: sig, in sig2: sig, in sig3: sig, in sig4: sig)`

As a future extension, we could add an instruction that waits for any of the given signals to complete, and then jumps to a specific location depending on which of them completed.
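To show how the signal slots hide global memory latency, here is a rough sketch in the same notation (register, mask, and signal names are hypothetical): the load is issued as early as possible, independent work runs in the meantime, and the wait happens only right before the loaded data is used.

```
// m_all is assumed to hold an all-true mask; sig0 is one of the 16 signal slots
global_load32(sig0, v8, m_all, v3)   // start loading from the addresses in v3 into v8
add<f32>(v5, v4, v4)                 // independent math that doesn't touch v8
add<f32>(v6, v5, v5)                 // ... more independent work ...
sig_wait(v15, sig0)                  // only now block until the load has completed
                                     // (the wait result lands in the first element of v15)
add<f32>(v7, v6, v8)                 // safe to use the loaded data in v8
```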
#### Control flow (whole wave)

- branch if a scalar is zero: `fn brz(in dest: Simm, in val: Sany)`
- branch if a scalar is not zero: `fn brnz(in dest: Simm, in val: Sany)`
- branch on the whole wave if every element has a true value in the mask: `fn br_all(in dest: Simm, in cond: M)`
- branch on the whole wave if any element has a true value in the mask: `fn br_any(in dest: Simm, in cond: M)`
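Since branches always apply to the whole wave, a data-dependent loop keeps running for as long as any lane still has work to do. A small sketch (the label stands in for the immediate branch destination; the mask register is hypothetical):

```
loop:
  // ... masked loop body: only lanes where m_active is true do useful work ...
  // ... each lane clears its bit in m_active once it is done ...
  br_any(loop, m_active)   // repeat while at least one lane is still active
```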
## Hand-compiling code

Now that we have decided on a simple compute-only GPU architecture, we can try hand-compiling an OpenCL program.

I asked an LLM to produce an N*N matmul example (comments written manually):

```c
// convenient number for our specific hardware
#define TILE_SIZE 8

// this kernel will be launched with dimensions:
//   global[2] = { 128,128 } = { N, N };
//   local[2]  = { 8,8 }     = { TILE_SIZE, TILE_SIZE };
__kernel void matmul_tiled(
    __global float* A,
    __global float* B,
    __global float* C,
    const int N)
{
    int row = get_global_id(1); // y
    int col = get_global_id(0); // x
    int local_row = get_local_id(1); // y
    int local_col = get_local_id(0); // x

    __local float Asub[TILE_SIZE][TILE_SIZE];
    __local float Bsub[TILE_SIZE][TILE_SIZE];

    float sum = 0.0f;

    for (int t = 0; t < N / TILE_SIZE; ++t) {
        // load tiles into local
        int tiledRow = row;
        int tiledCol = t * TILE_SIZE + local_col;
        float av;
        if (tiledRow < N && tiledCol < N)
            av = A[tiledRow * N + tiledCol];
        else
            av = 0.0f;
        Asub[local_row][local_col] = av;

        tiledRow = t * TILE_SIZE + local_row;
        tiledCol = col;
        float bv;
        if (tiledRow < N && tiledCol < N)
            bv = B[tiledRow * N + tiledCol];
        else
            bv = 0.0f;
        Bsub[local_row][local_col] = bv;

        // sync local access across local grp
        barrier(CLK_LOCAL_MEM_FENCE);

        for (int k = 0; k < TILE_SIZE; ++k)
            sum += Asub[local_row][k] * Bsub[k][local_col];

        // sync local access across local grp
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (row < N && col < N)
        C[row * N + col] = sum;
}
```
First, we have to decide on how we want to map the kernel to the hardware.

Since the local dimension of the kernel is 8*8, which is 64, we can map each local group to one CU, by mapping 32 work items to one wave and using both waves available on one CU for the local group.

Our global dimension is 128*128, which means that we would need 256 compute units. But since we probably don't have 256 compute units, GPUs, including ours, have an on-hardware task scheduler for scheduling tasks onto compute units.

## Outro

Modern GPUs are really complex, but designing a simple GPU is not that hard either.

Subscribe to the [Atom feed](atom.xml) to get notified of future articles.