u/New-Juggernaut4693

What even is the point of smol-GPU with this many simplifications?
▲ 4 r/gpu+1 crossposts


https://github.com/Grubre/smol-gpu

The designer says it's for educational purposes, but the amount of stuff stripped away makes me question how much it actually teaches about real GPU architecture.

Here's what's been simplified away:

  1. Sequential warp scheduling: one warp runs to completion, then the next. No latency hiding at all.

  2. No warp-level parallelism within a core: only one warp occupies resources at a time.

  3. No cache hierarchy: cores talk directly to global memory.

  4. Separate program and data memory: Harvard style, not unified.

  5. No shared memory / scratchpad: so no cooperative algorithms between threads.

  6. No barrier / synchronization primitives: no __syncthreads() equivalent.

  7. No reconvergence stack in hardware: divergence is handled purely through manual masking.

  8. No memory coalescing: each thread issues its own memory request.

  9. No FPU, no special function units: integer only.

  10. No atomics, no fences: only a subset of RV32I is implemented.

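To make point 1 concrete, here's a toy cycle-count model (the traces, latency, and function names are all mine, not anything from the repo) contrasting run-to-completion scheduling with the warp interleaving real GPUs use to hide memory latency:

```python
MEM_LATENCY = 100  # made-up memory latency in cycles

def sequential(warps):
    """Run-to-completion (smol-GPU style): a memory op stalls the core."""
    cycles = 0
    for trace in warps:
        for op in trace:
            cycles += MEM_LATENCY if op == "mem" else 1
    return cycles

def interleaved(warps):
    """Latency hiding: on a memory op, switch to another ready warp."""
    n = len(warps)
    ready = [0] * n        # cycle at which each warp can issue again
    pc = [0] * n           # next instruction index per warp
    cycles = 0
    while any(pc[i] < len(warps[i]) for i in range(n)):
        runnable = [i for i in range(n)
                    if pc[i] < len(warps[i]) and ready[i] <= cycles]
        if not runnable:   # every warp is waiting on memory: fast-forward
            cycles = min(ready[i] for i in range(n) if pc[i] < len(warps[i]))
            continue
        w = runnable[0]
        op = warps[w][pc[w]]
        pc[w] += 1
        cycles += 1        # issuing any instruction takes one cycle
        if op == "mem":
            ready[w] = cycles + MEM_LATENCY  # warp sleeps; others run

    return cycles

warps = [["compute", "mem", "compute"]] * 8
print(sequential(warps), interleaved(warps))   # 816 vs 117 here
```

With eight identical warps each doing one load, the sequential scheduler eats the full memory latency eight times, while the interleaved one pays it roughly once. That gap is exactly what point 1 throws away.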
At this point it's basically executing one warp after another on each core. If you squint, this is just a multicycle processor that happens to run 32 threads in lockstep. Yes, the SIMT model and execution masking are there, but without pipelining, warp interleaving, or caches, you're not really seeing what makes GPUs fast.
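For anyone unfamiliar, the lockstep-plus-masking model described above fits in a few lines (purely illustrative Python, not the repo's ISA; all names here are hypothetical):

```python
# Every lane executes the same instruction; a per-lane execution mask
# decides which lanes commit. Divergence is handled by flipping the
# mask by hand, mirroring the lack of a hardware reconvergence stack.

WARP_SIZE = 32
tid = list(range(WARP_SIZE))      # per-lane thread id
r0 = [0] * WARP_SIZE              # per-lane destination register

def masked_op(mask, dst, fn):
    """Lane-wise op that only commits where the mask bit is set."""
    for lane in range(WARP_SIZE):
        if mask[lane]:
            dst[lane] = fn(lane)

# if (tid < 16) r0 = tid * 2; else r0 = -1;
taken = [t < 16 for t in tid]                        # branch predicate
masked_op(taken, r0, lambda l: tid[l] * 2)           # "then" side
masked_op([not t for t in taken], r0, lambda l: -1)  # "else" side
# "reconvergence" is just restoring the full mask before the next op

print(r0[:4], r0[16:20])   # [0, 2, 4, 6] [-1, -1, -1, -1]
```

So the SIMT semantics are genuinely there; the criticism is that everything around this core loop (pipelining, interleaving, caches) is what's missing.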

Is there any deeper reasoning behind stripping this much out? And more importantly, I've gone through the RTL and spotted what look like potential race conditions in a few places. Is this repo even a legit baseline to build a more advanced GPU on top of, or would you be better off starting from scratch?

u/New-Juggernaut4693 — 16 hours ago
▲ 0 r/RISCV

What even is the point of making smol-GPU?

The designer says it's for educational purposes, but why simplify this much?

https://github.com/Grubre/smol-gpu

These are some of the simplifications:

  1. Sequential warp scheduling

  2. No warp-level parallelism within a core

  3. No cache hierarchy

  4. Separated program and data memory

  5. No shared memory / scratchpad

  6. No barrier / synchronization primitives

  7. No reconvergence stack in hardware

and many more...

Is there any reasoning behind these simplifications?

I have also checked the RTL; there were a few cases of possible race conditions. Is this repo even a legit baseline to build a more advanced GPU on top of?
