u/New-Juggernaut4693

What even is the point of smol-GPU with this many simplifications?
▲ 4 r/gpu+1 crossposts


https://github.com/Grubre/smol-gpu

The designer says it's for educational purposes, but the amount of stuff stripped away makes me question how much it actually teaches about real GPU architecture.

Here's what's been simplified away:

  1. Sequential warp scheduling: one warp runs to completion, then the next. No latency hiding at all.

  2. No warp-level parallelism within a core: only one warp occupies resources at a time.

  3. No cache hierarchy: cores talk directly to global memory.

  4. Separate program and data memory: Harvard style, not unified.

  5. No shared memory / scratchpad: so no cooperative algorithms between threads.

  6. No barrier / synchronization primitives: no __syncthreads() equivalent.

  7. No reconvergence stack in hardware: divergence is handled purely through manual masking.

  8. No memory coalescing: each thread issues its own memory request.

  9. No FPU, no special function units: integer only.

  10. No atomics, no fences: only a subset of RV32I is implemented.

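To make point 1 concrete, here's a toy cycle-count model (the traces, latency, and function names are all mine, not anything from the repo) contrasting run-to-completion scheduling with the warp interleaving real GPUs use to hide memory latency:

```python
MEM_LATENCY = 100  # made-up memory latency in cycles

def sequential(warps):
    """Run-to-completion (smol-GPU style): a memory op stalls the core."""
    cycles = 0
    for trace in warps:
        for op in trace:
            cycles += MEM_LATENCY if op == "mem" else 1
    return cycles

def interleaved(warps):
    """Latency hiding: on a memory op, switch to another ready warp."""
    n = len(warps)
    ready = [0] * n        # cycle at which each warp can issue again
    pc = [0] * n           # next instruction index per warp
    cycles = 0
    while any(pc[i] < len(warps[i]) for i in range(n)):
        runnable = [i for i in range(n)
                    if pc[i] < len(warps[i]) and ready[i] <= cycles]
        if not runnable:   # every warp is waiting on memory: fast-forward
            cycles = min(ready[i] for i in range(n) if pc[i] < len(warps[i]))
            continue
        w = runnable[0]
        op = warps[w][pc[w]]
        pc[w] += 1
        cycles += 1        # issuing any instruction takes one cycle
        if op == "mem":
            ready[w] = cycles + MEM_LATENCY  # warp sleeps; others run

    return cycles

warps = [["compute", "mem", "compute"]] * 8
print(sequential(warps), interleaved(warps))   # 816 vs 117 here
```

With eight identical warps each doing one load, the sequential scheduler eats the full memory latency eight times, while the interleaved one pays it roughly once. That gap is exactly what point 1 throws away.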
At this point it's basically executing one warp after another on each core. If you squint, this is just a multicycle processor that happens to run 32 threads in lockstep. Yes, the SIMT model and execution masking are there, but without pipelining, warp interleaving, or caches, you're not really seeing what makes GPUs fast.
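For anyone unfamiliar, the lockstep-plus-masking model described above fits in a few lines (purely illustrative Python, not the repo's ISA; all names here are hypothetical):

```python
# Every lane executes the same instruction; a per-lane execution mask
# decides which lanes commit. Divergence is handled by flipping the
# mask by hand, mirroring the lack of a hardware reconvergence stack.

WARP_SIZE = 32
tid = list(range(WARP_SIZE))      # per-lane thread id
r0 = [0] * WARP_SIZE              # per-lane destination register

def masked_op(mask, dst, fn):
    """Lane-wise op that only commits where the mask bit is set."""
    for lane in range(WARP_SIZE):
        if mask[lane]:
            dst[lane] = fn(lane)

# if (tid < 16) r0 = tid * 2; else r0 = -1;
taken = [t < 16 for t in tid]                        # branch predicate
masked_op(taken, r0, lambda l: tid[l] * 2)           # "then" side
masked_op([not t for t in taken], r0, lambda l: -1)  # "else" side
# "reconvergence" is just restoring the full mask before the next op

print(r0[:4], r0[16:20])   # [0, 2, 4, 6] [-1, -1, -1, -1]
```

So the SIMT semantics are genuinely there; the criticism is that everything around this core loop (pipelining, interleaving, caches) is what's missing.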

Is there any deeper reasoning behind stripping this much out? And more importantly, I've gone through the RTL and spotted what look like potential race conditions in a few places. Is this repo even a legit baseline to build a more advanced GPU on top of, or would you be better off starting from scratch?

u/New-Juggernaut4693 — 16 hours ago
▲ 0 r/RISCV

What even is the point of making smol-GPU?

The designer says it's for educational purposes, but why simplify this much?

https://github.com/Grubre/smol-gpu

These are some of the simplifications:

  1. Sequential warp scheduling

  2. No warp-level parallelism within a core

  3. No cache hierarchy

  4. Separated program and data memory

  5. No shared memory / scratchpad

  6. No barrier / synchronization primitives

  7. No reconvergence stack in hardware

and many more...

Is there any reasoning behind these simplifications?

I have also checked the RTL; there were a few cases of possible race conditions. Is this repo even a legit baseline to build a more advanced GPU on top of?
