u/Remarkable_Garage_40

GitHub: github.com/Deepesh1024/NVMirror

NVMirror compiles LLVM IR all the way down to custom GPU assembly: instruction selection, register allocation, and instruction scheduling, all built from scratch as an out-of-tree LLVM backend.

https://preview.redd.it/87560vtrl6tg1.png?width=1738&format=png&auto=webp&s=00ae36a214d136a6d6f1d64f446394628728a36f

The scheduler's job is simple: don't let the GPU sit idle waiting 20 cycles for memory. It does this by finding independent instructions and filling that wait window with useful work. On matrix multiply, this eliminates 47.6% of all cycles. On vector add, where there's almost no independent work to fill the window, it eliminates only 31.7%. The numbers tell you exactly where ILP exists and where it doesn't.
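To make the latency-hiding idea concrete, here is a toy sketch in Python. It is not NVMirror's actual scheduler, just a greedy list scheduler over a made-up basic block. Assumptions that are mine and not from the repo: single-issue in-order hardware, 1-cycle ALU ops, SSA-style values (each register written once, so only true dependencies matter), and the hypothetical names `Instr`, `simulate`, and `list_schedule`. Only the 20-cycle load latency comes from the post.

```python
from dataclasses import dataclass

LOAD_LATENCY = 20  # cycles, the figure quoted above
ALU_LATENCY = 1    # assumption for this sketch


@dataclass(frozen=True)
class Instr:
    name: str
    dst: str
    srcs: tuple
    latency: int


def simulate(order):
    """Cycle count on a single-issue, in-order machine.

    An instruction issues once every source value is available;
    otherwise the pipeline stalls.
    """
    ready_at = {}  # value name -> cycle at which it becomes available
    cycle = 0
    for ins in order:
        start = max([cycle] + [ready_at.get(s, 0) for s in ins.srcs])
        ready_at[ins.dst] = start + ins.latency
        cycle = start + 1
    return max([cycle] + list(ready_at.values()))


def list_schedule(instrs):
    """Greedy list scheduling: each cycle, issue a ready instruction,
    preferring the longest latency so loads start as early as possible.
    Assumes an acyclic dependency graph."""
    produced_here = {i.dst for i in instrs}
    ready_at = {}
    remaining = list(instrs)
    order = []
    cycle = 0

    def available(src):
        if src not in produced_here:
            return True  # live-in value, ready from cycle 0
        return src in ready_at and ready_at[src] <= cycle

    while remaining:
        ready = [i for i in remaining if all(available(s) for s in i.srcs)]
        if ready:
            pick = max(ready, key=lambda i: i.latency)  # ties keep program order
            ready_at[pick.dst] = cycle + pick.latency
            order.append(pick)
            remaining.remove(pick)
        cycle += 1  # either one instruction issued or one stall cycle
    return order


# Two independent load/load/multiply chains feeding one add,
# written in naive program order.
program = [
    Instr("ld a0", "a0", ("pA0",), LOAD_LATENCY),
    Instr("ld b0", "b0", ("pB0",), LOAD_LATENCY),
    Instr("mul m0", "m0", ("a0", "b0"), ALU_LATENCY),
    Instr("ld a1", "a1", ("pA1",), LOAD_LATENCY),
    Instr("ld b1", "b1", ("pB1",), LOAD_LATENCY),
    Instr("mul m1", "m1", ("a1", "b1"), ALU_LATENCY),
    Instr("add s", "s", ("m0", "m1"), ALU_LATENCY),
]

scheduled = list_schedule(program)
print(simulate(program), "cycles in program order")
print(simulate(scheduled), "cycles after scheduling")
```

In this toy block, hoisting the second pair of loads above the first multiply lets their latency overlap with the first pair's. A block with a single dependency chain would give the scheduler nothing to move, which is the "no ILP" case.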

One design question I'd love input on: I chose Linear Scan over Graph Coloring for register allocation. With 256 physical registers, spills almost never happen, so the compile-time cost of Graph Coloring never felt justified. Has anyone actually benchmarked this tradeoff on a large-register-file GPU backend?
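For anyone who hasn't looked at Linear Scan in a while, here is a minimal Python sketch of the textbook algorithm (in the style of Poletto and Sarkar), not the code in the repo. Assumptions that are mine: live intervals are already computed as simple `[start, end]` ranges with no holes, `num_regs` is at least 1, and the names `Interval`, `linear_scan`, and `make_intervals` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interval:
    vreg: str
    start: int
    end: int
    phys: Optional[int] = None
    spilled: bool = False


def linear_scan(intervals, num_regs):
    """Assign physical registers in one pass over intervals sorted by start.

    When no register is free, spill whichever live interval ends last
    (possibly the current one). Assumes num_regs >= 1.
    """
    free = list(range(num_regs))
    active = []  # intervals currently holding a register
    for cur in sorted(intervals, key=lambda iv: iv.start):
        # Expire intervals that ended before the current one starts.
        still_active = []
        for iv in active:
            if iv.end < cur.start:
                free.append(iv.phys)
            else:
                still_active.append(iv)
        active = still_active

        if free:
            cur.phys = free.pop(0)
            active.append(cur)
        else:
            victim = max(active, key=lambda iv: iv.end)
            if victim.end > cur.end:
                # Steal the register from the interval that lives longest.
                cur.phys = victim.phys
                victim.phys = None
                victim.spilled = True
                active.remove(victim)
                active.append(cur)
            else:
                cur.spilled = True
    return intervals


def make_intervals():
    # Made-up live ranges for five virtual registers.
    return [
        Interval("a", 0, 10),
        Interval("b", 1, 3),
        Interval("c", 2, 8),
        Interval("d", 4, 6),
        Interval("e", 5, 9),
    ]


tight = linear_scan(make_intervals(), num_regs=2)
roomy = linear_scan(make_intervals(), num_regs=256)
print([iv.vreg for iv in tight if iv.spilled])
print([iv.vreg for iv in roomy if iv.spilled])
```

With only 2 registers the toy input is forced to spill, while with 256 the spill path is never taken. That second case is where Linear Scan and Graph Coloring should produce equally spill-free code, leaving compile time as the main difference.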

Built a complete out-of-tree LLVM backend for a custom 32-bit SIMT GPU ISA