Built a complete out-of-tree LLVM backend for a custom 32-bit SIMT GPU ISA
GitHub: github.com/Deepesh1024/NVMirror
NVMirror compiles LLVM IR all the way down to custom GPU assembly: instruction selection, register allocation, and instruction scheduling, all built from scratch as an out-of-tree LLVM backend.
The scheduler's job is simple: don't let the GPU sit idle waiting 20 cycles for memory. It does this by finding independent instructions and filling that wait window with useful work. On matrix multiply, this eliminates 47.6% of all cycles. On vector add, where there's almost no independent work to fill the window, it eliminates only 31.7%. The numbers tell you exactly where ILP exists and where it doesn't.
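To make the latency-hiding idea concrete, here is a minimal C++ sketch of a greedy list scheduler on a toy single-issue machine. It is not NVMirror's actual scheduler: the `Instr` struct, both function names, the 1-cycle ALU latency, and the "longest latency first" priority are assumptions made for illustration. Only the 20-cycle memory latency comes from the numbers above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical instruction record (illustration only, not NVMirror's types).
struct Instr {
    int latency;            // cycles until the result can be consumed
    std::vector<int> deps;  // indices of earlier instructions this one reads
};

// Baseline: issue strictly in program order, stalling until operands are ready.
int cyclesInOrder(const std::vector<Instr>& prog) {
    std::vector<int> ready(prog.size(), 0);  // cycle at which each result is available
    int nextSlot = 0;                        // next free issue cycle
    int finish = 0;
    for (std::size_t i = 0; i < prog.size(); ++i) {
        int issue = nextSlot;
        for (int d : prog[i].deps) {
            assert(d >= 0 && static_cast<std::size_t>(d) < i);  // deps must point backwards
            issue = std::max(issue, ready[d]);
        }
        ready[i] = issue + prog[i].latency;
        nextSlot = issue + 1;
        finish = std::max(finish, ready[i]);
    }
    return finish;
}

// Greedy list scheduling: each cycle, issue any instruction whose operands are
// already available, preferring the longest latency so loads start early.
int cyclesListScheduled(const std::vector<Instr>& prog) {
    const std::size_t n = prog.size();
    std::vector<int> ready(n, -1);  // -1 means "not issued yet"
    std::size_t issued = 0;
    int cycle = 0;
    int finish = 0;
    while (issued < n) {
        int best = -1;
        for (std::size_t i = 0; i < n; ++i) {
            if (ready[i] != -1) continue;
            bool operandsReady = true;
            for (int d : prog[i].deps) {
                if (ready[d] == -1 || ready[d] > cycle) { operandsReady = false; break; }
            }
            if (operandsReady && (best == -1 || prog[i].latency > prog[best].latency)) {
                best = static_cast<int>(i);
            }
        }
        if (best != -1) {
            ready[best] = cycle + prog[best].latency;
            finish = std::max(finish, ready[best]);
            ++issued;
        }
        ++cycle;  // if nothing was ready, this cycle is a stall
    }
    return finish;
}
```

For a toy program of two independent load/add pairs (`load a; add; load b; add`, with 20-cycle loads and 1-cycle adds), the in-order baseline finishes at cycle 42 and the list-scheduled version at cycle 22, because the second load is issued inside the first load's wait window. A kernel with only one such chain would see no gain at all, which is the same effect that separates the matrix multiply and vector add numbers.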
One design question I'd love input on: I used Linear Scan over Graph Coloring for register allocation. With 256 physical registers, spills almost never happen, so the compile-time cost of Graph Coloring never felt justified. Has anyone actually benchmarked this tradeoff on a large-register-file GPU backend?
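For anyone who wants something concrete to compare against, here is a minimal C++ sketch of textbook linear scan in the Poletto and Sarkar style. It is not the allocator in NVMirror: `LiveInterval`, `linearScan`, and the "spill the interval that ends last" heuristic are assumptions for illustration, and a real backend would also handle register classes, fixed registers, and spill code insertion.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical live interval: the value is live on [start, end], inclusive.
struct LiveInterval {
    int start;
    int end;
};

// Returns one entry per interval: a physical register number, or -1 if spilled.
std::vector<int> linearScan(const std::vector<LiveInterval>& ivs, int numRegs) {
    assert(numRegs > 0);
    const std::size_t n = ivs.size();

    // Visit intervals in order of increasing start point.
    std::vector<std::size_t> order(n);
    for (std::size_t i = 0; i < n; ++i) order[i] = i;
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return ivs[a].start < ivs[b].start; });

    std::vector<int> assigned(n, -1);
    std::vector<int> freeRegs;
    for (int r = numRegs - 1; r >= 0; --r) freeRegs.push_back(r);  // back() hands out r0 first
    std::vector<std::size_t> active;  // intervals currently holding a register

    for (std::size_t cur : order) {
        // Expire intervals that ended before cur starts and reclaim their registers.
        for (std::size_t k = 0; k < active.size();) {
            if (ivs[active[k]].end < ivs[cur].start) {
                freeRegs.push_back(assigned[active[k]]);
                active.erase(active.begin() + static_cast<std::ptrdiff_t>(k));
            } else {
                ++k;
            }
        }
        if (!freeRegs.empty()) {
            assigned[cur] = freeRegs.back();
            freeRegs.pop_back();
            active.push_back(cur);
            continue;
        }
        // No register free: find the active interval that ends last.
        std::size_t victimPos = 0;
        for (std::size_t k = 1; k < active.size(); ++k) {
            if (ivs[active[k]].end > ivs[active[victimPos]].end) victimPos = k;
        }
        const std::size_t victim = active[victimPos];
        if (ivs[victim].end > ivs[cur].end) {
            // Steal the victim's register for cur and spill the victim instead.
            assigned[cur] = assigned[victim];
            assigned[victim] = -1;
            active[victimPos] = cur;
        }
        // Otherwise cur itself is spilled (assigned[cur] stays -1).
    }
    return assigned;
}
```

With the intervals `[0,10], [1,3], [2,8], [4,6]` and only 2 registers, this sketch spills the first interval and returns `{-1, 1, 0, 1}`. With 256 registers nothing spills, so the spill path is dead code for that input. That is the crux of the question: if the spill path almost never runs, the better spill decisions Graph Coloring can make have little to act on, though I don't have benchmark data either way.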