r/CUDA

▲ 62 r/CUDA+1 crossposts

Writing an LLM compiler from scratch [Part 2]: Lowering to a GPU Schedule

The modern ML (LLM) compiler stack is brutal. TVM is 500K+ lines of C++. PyTorch piles Dynamo, Inductor, and Triton on top of each other. I built a hackable LLM compiler from scratch and am documenting the process. It takes a small model (TinyLlama, Qwen2.5-7B) and lowers it to a sequence of CUDA kernels through six IRs.

Currently, on RTX 5090, the emitted FP32 kernels run at geomean 1.11× vs PyTorch eager and 1.20× vs torch.compile, with full-block parity on TinyLlama-128 and Qwen2.5-7B at seq=128. Wins on small reductions / SDPA / kv-projections (up to 4.7×); losses on dense matmul at seq=512.

Part 1 took an RMSNorm layer end-to-end and walked the upper half of that pipeline in detail. This second part closes the gap and explains Tile IR, Kernel IR, and associated lowering rules in depth.

Full article: A Principled ML Compiler Stack in 5,000 Lines of Python. Repo: deplodock

The article focuses on producing a GPU schedule for an operation written in loop-nest form (Loop IR). Example for RMSNorm:

v0 = reciprocal(2048)
for a0 in 0..32:  # free
    for a1 in 0..2048:  # reduce
        in2 = load x[0, a0, a1]
        v1 = multiply(in2, in2)
        acc0 <- add(acc0, v1)
    v2 = multiply(acc0, v0)
    v3 = add(v2, 1e-06)
    v4 = rsqrt(v3)
    for a2 in 0..2048:  # free
        in3 = load x[0, a0, a2]
        in4 = load p_weight[a2]
        v5 = multiply(in3, v4)
        v6 = multiply(v5, in4)
        merged_n0[0, a0, a2] = v6
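
Read as dataflow, the nest computes, for each row a0,

rms_norm[0, a0, a2] = x[0, a0, a2] * rsqrt((1/2048) * Σ_a1 x[0, a0, a1]^2 + 1e-06) * p_weight[a2]

i.e. textbook RMSNorm: a sum-of-squares reduction over the hidden axis, one scalar rsqrt, then an elementwise rescale.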

The pass stack mimics the sequence of steps a CUDA engineer would perform when hand-optimizing a kernel: stage inputs to smem, reduce bank conflicts, increase occupancy, and so on.

LoopOp
  │
  ▼
[001] tileify                 — lift outer free Loops to thread axes
[002] chunk_matmul_k          — chunk the K reduce into K-outer × K-inner (intra-CTA)
[003] split_matmul_k          — promote the K-outer chunk loop into a grid dimension
[004] cooperative_reduce      — let multiple threads share one reduce; tree-merge with Combine
[005] blockify_launch         — pick block extents; partition free axes into BLOCK and THREAD
[006] chunk_reduce            — chunk non-matmul reduces so their Loads fit in shared memory
[007] stage_inputs            — hoist hot input slabs into Stage nodes
[008] register_tile           — replicate the inner tile so each thread owns a register block
[009] permute_register_tile   — reorder the register strip so bank-conflicting loads land on far columns
[010] double_buffer           — promote K-outer Stages to BufferedStage (ping-pong)
[011] tma_copy                — narrow eligible BufferedStages to TmaBufferedStage (sm_90+)
[012] split_inner_for_swizzle — split the inner cache axis of a TmaBufferedStage for swizzle
[013] async_copy              — narrow the rest to AsyncBufferedStage (cp.async, sm_80+)
[014] pad_smem                — pad shared-memory strides to break bank conflicts
[015] pipeline_k_outer        — rotate the K-outer loop into prologue/steady-state/epilogue (cp.async + TMA)
[016] mark_unroll             — annotate small inner loops for #pragma unroll
  │
  ▼
TileOp (fully scheduled)
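
Many of these passes are direct counterparts of classic hand-tuning idioms. As one illustration of what pad_smem ([014]) targets, here is a minimal hand-written 32×32 transpose (my own sketch, not compiler output) where widening the shared-memory row stride from 32 to 33 words breaks the bank conflict on the column read:

// Hand-written sketch of the idea behind pad_smem, not compiler output.
// Launch as transpose_padded<<<1, dim3(32, 32)>>>(in, out).
__global__ void transpose_padded(const float* in, float* out) {
    // Row stride padded from 32 to 33 words: the column read below maps
    // lane x to bank (33*x + y) % 32 = (x + y) % 32, i.e. 32 distinct
    // banks, instead of every lane hitting bank y (a 32-way conflict).
    __shared__ float tile[32][33];
    int x = threadIdx.x, y = threadIdx.y;
    tile[y][x] = in[y * 32 + x];     // coalesced load, row-wise store
    __syncthreads();
    out[y * 32 + x] = tile[x][y];    // column read; conflict-free via pad
}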

Each stage can be reproduced with a CLI command. For example, the stage_inputs pass stages input buffers into smem when it is legal and profitable to do so (i.e., the inputs are read multiple times within a CTA). To see it in isolation:

deplodock compile \
  -c "torch.nn.RMSNorm(2048)(torch.randn(1,32,2048))" \
  --ir tile -vv \
  | awk '/^>>> t:007/,/^<<< t:007/'
>>> t:007_stage_inputs
@@ matched at rms_norm (in-place) @@
@@ -2,6 +2,7 @@
   v0 = reciprocal(2048)
   Tile(axes=(a0:256=THREAD, a1:32=BLOCK)):
+      x_smem = Stage(x, origin=(0, a1, 0), slab=(a2:2048@2))
       StridedLoop(a2 = a0; < 2048; += 256):  # reduce
-          in2 = load x[0, a1, a2]
+          in2 = load x_smem[a2]
           v1 = multiply(in2, in2)
           acc0 <- add(acc0, v1)
@@ -11,5 +12,5 @@
       v4 = rsqrt(v3)
       StridedLoop(a2 = a0; < 2048; += 256):  # free
-          in3 = load x[0, a1, a2]
+          in3 = load x_smem[a2]
           in4 = load p_weight[a2]
           v5 = multiply(in3, v4)
<<< t:007_stage_inputs
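
Note that both the reduce loop and the second free loop switch from global x to x_smem: the slab is read twice per CTA, which is exactly the reuse condition the pass checks before staging.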

The final CUDA kernel for the RMSNorm layer:

deplodock compile \
  -c "torch.nn.RMSNorm(2048)(torch.randn(1,32,2048))" \
  --target sm_120 --ir cuda
extern "C" __global__
__launch_bounds__(256) void k_rms_norm_reduce(
    const float* x, const float* p_weight, float* rms_norm) {
    float v0 = 1.0f / 2048.0f;
    int a1 = blockIdx.x;
    int a0 = threadIdx.x;
    int lane = threadIdx.x & 31;
    int warp = threadIdx.x >> 5;
    float acc0 = 0.0f;
    __shared__ float x_smem[2048];
    for (int x_smem_flat = a0; x_smem_flat < 2048; x_smem_flat += 256) {
        float x_smem_v = x[a1 * 2048 + x_smem_flat];
        x_smem[x_smem_flat] = x_smem_v;
    }
    __syncthreads();
    for (int a2 = a0; a2 < 2048; a2 += 256) {
        float in2 = x_smem[a2];
        float v1 = in2 * in2;
        acc0 += v1;
    }
    float acc0_w = acc0;
    acc0_w = acc0_w + __shfl_xor_sync(0xffffffff, acc0_w, 16);
    acc0_w = acc0_w + __shfl_xor_sync(0xffffffff, acc0_w, 8);
    acc0_w = acc0_w + __shfl_xor_sync(0xffffffff, acc0_w, 4);
    acc0_w = acc0_w + __shfl_xor_sync(0xffffffff, acc0_w, 2);
    acc0_w = acc0_w + __shfl_xor_sync(0xffffffff, acc0_w, 1);
    __shared__ float acc0_smem[8];
    if (lane == 0) {
        acc0_smem[warp] = acc0_w;
    }
    __syncthreads();
    for (int s = 4; s > 0; s >>= 1) {
        if (warp < s) {
            acc0_smem[warp] = acc0_smem[warp] + acc0_smem[warp + s];
        }
        __syncthreads();
    }
    float acc0_b = acc0_smem[0];
    float v2 = acc0_b * v0;
    float v3 = v2 + 1e-06f;
    float v4 = rsqrtf(v3);
    for (int a2 = a0; a2 < 2048; a2 += 256) {
        float in3 = x_smem[a2];
        float in4 = p_weight[a2];
        float v5 = in3 * v4;
        float v6 = v5 * in4;
        rms_norm[a1 * 2048 + a2] = v6;
    }
}
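
The kernel is a readable composite of the passes above: stage_inputs produced the x_smem copy loop, cooperative_reduce the __shfl_xor_sync tree plus the acc0_smem combine, and blockify_launch the 32-block × 256-thread shape. The dump omits the host side; a hedged harness sketch (buffer names are mine, assuming the kernel above is in the same translation unit):

#include <cuda_runtime.h>

int main() {
    float *x, *w, *y;
    cudaMalloc(&x, 32 * 2048 * sizeof(float));  // input, shape (1, 32, 2048)
    cudaMalloc(&w, 2048 * sizeof(float));       // p_weight
    cudaMalloc(&y, 32 * 2048 * sizeof(float));  // output
    // ... initialize x and w ...
    // grid.x = a1 (32 BLOCK rows), block.x = a0 (256 THREAD lanes),
    // matching __launch_bounds__(256) on the kernel.
    k_rms_norm_reduce<<<32, 256>>>(x, w, y);
    cudaDeviceSynchronize();
    return 0;
}
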
u/NoVibeCoding — 2 days ago
▲ 61 r/CUDA+3 crossposts

Link to the Job

I’ve been actively leetcoding for some time (500+ problems) and would love any insights on how to prep for this role, as the process is still unclear. What I’ve gathered from the email is that there could be C++ coding, ML background questions, etc.

Would love to hear any input.

u/gradschoolai2023 — 9 days ago
▲ 20 r/CUDA+1 crossposts

Hi everyone,

I’m an independent developer with a background in algorithms, HPC, and robotics infrastructure. Recently I’ve been working on a lightweight inference engine built around hand-written CUDA kernels, focusing on small-batch and real-time performance (especially for VLA and robotics workloads).

Here are some recent results on Thor and Blackwell:

  • Pi0.5 — Jetson AGX Thor (SM110): 44 ms (23 Hz)
  • Pi0 — Jetson AGX Thor (SM110): 46 ms (22 Hz)
  • Pi0.5 — RTX 5090 (SM120): 17.58 ms (57 Hz)
  • Pi0 — RTX 5090 (SM120): 18.43 / 21.16 / 24.48 ms (54 / 47 / 41 Hz)
  • GROOT N1.6 — Jetson AGX Thor: 45 ms (T=50) / 41 ms (T=16) → 22 / 24 Hz
  • GROOT N1.6 — RTX 5090: 13.08 ms (T=50) / 12.53 ms (T=16) → 76 / 80 Hz
  • Pi0-FAST (token)
    • Thor: 8.1 ms/token (123 tok/s)
    • RTX 5090: 2.39 ms/token (418 tok/s)

The focus is on pushing true real-time inference in small-batch settings, which tend to be underserved by typical large-batch-optimized stacks.

Still early, but happy to share more details or discuss if anyone is working on similar workloads 🙂

Feedback welcome! https://github.com/LiangSu8899/FlashRT

u/Diligent-End-2711 — 3 days ago
▲ 51 r/CUDA

I'm starting a Master's focused around AI systems, GPU computing, and HPC, and I'm trying to better understand where the genuinely important problems are in AI infrastructure and inference engineering.

My background so far has mostly been applied ML systems work:

  • production LLM serving with vLLM
  • real-time ASR → LLM → TTS pipelines
  • LoRA fine-tuning/merging
  • latency-sensitive voice agents
  • benchmarking and pipeline optimization

Over time I realized many of the hardest problems weren't really model problems, but systems problems:

  • GPU underutilization
  • irregular batching
  • memory movement
  • scheduling/backpressure
  • latency propagation through pipelines
  • inference efficiency under real-time constraints

Long term, I’m much more interested in the systems/infrastructure side of AI than pure modeling. Things like:

  • inference runtimes
  • GPU systems
  • CUDA/Triton
  • compiler/runtime optimization
  • distributed inference
  • memory efficiency
  • scheduling for irregular workloads
  • HPC for AI workloads
  • AI-assisted systems optimization

Right now I’m trying to figure out what problems in this space are:

  • genuinely important
  • underexplored
  • likely to matter over the next 5 years
  • realistic for a strong Master’s thesis

A few questions I’d really love practitioner/researcher opinions on:

  1. What problems in AI infrastructure or inference engineering still feel painfully unsolved in practice?
  2. What research directions seem overhyped vs genuinely valuable?
  3. Which areas do you think will matter most over the next few years: kernel optimization, distributed inference, memory systems, compiler/runtime work, scheduling, networking, etc?
  4. What’s a technically deep but realistic Master’s-level research problem in this space?

Happy to hear “your framing is wrong because X” too — that’s probably the most useful feedback I can get right now.

u/Quirky-Guide-762 — 7 days ago
▲ 6 r/CUDA

Nvidia Interview Help

I am interviewing for System Software - Autonomous Vehicles. Gave my 2nd round 2 weeks ago. Followed up by phone multiple times; HR said feedback is positive. Haven't heard back for a week about the F2F interview timeline. Is this normal?

Also, how many rounds do they generally take? And what would they ask in the F2F interviews?

u/Gullible_Stomach6765 — 5 days ago
▲ 19 r/CUDA

For edge inference, when do you drop below TensorRT/ONNX and write custom CUDA kernels?

Question for people who do CUDA work on production inference paths.

For large vision / multimodal models running on edge devices, the first pass is usually export/compile/quantize with TensorRT, ONNX Runtime, vendor SDKs, etc. But sometimes a small set of operators or pre/post-processing steps dominates the latency trace enough that custom CUDA kernels become worth it.

Recent datapoint from a Jetson Orin NX deployment I worked on: multimodal classifier, 111 ms cold start, 100% of decisions inside a 150 ms budget, zero cloud calls.

Curious how CUDA folks decide when custom kernels are worth the maintenance cost:

- What trace/profile signs make you reach for custom CUDA?

- Do you usually target model ops, preprocessing, memory layout/conversion, or batching?

- How do you keep custom kernels portable across Jetson vs larger NVIDIA GPUs?

- Any profiling workflow you trust for this kind of edge latency work?

u/Hairy_Strawberry7028 — 5 days ago
▲ 7 r/CUDA

I'm trying to efficiently sum the output of a 16x16 interaction along each axis. I recently discovered this magic intrinsic, which is not only much faster at summing along the x axis than my previous manual loop with synchronizations, but also deterministic:
__shfl_down_sync

As I understand it, this only works within warps, and to sum along y I'd need a stride larger than the warp size. So what do I do? Is there a trick to do reductions along dimensions other than x?
Can I "transpose" the thread-local data in the threadblock, so I can do another reduction along x, but effectively on y data?

Here's the semi-pseudocode of what I'm trying to do:

// blockdim=16,16,1
__global__ void NbInteractionKernel(const Data* const superClusters, ForceEnergy* const results) {
  static_assert(SuperCluster::maxParticles == 16, "This kernel relies on SuperCluster::maxParticles being 16");
  //__shared__ PData pqd[SuperCluster::maxParticles * 2];
  __shared__ Data data;

  __shared__ ForceEnergy feSc0[SuperCluster::maxParticles];
  __shared__ ForceEnergy feSc1[SuperCluster::maxParticles];

  // Set up input data //


  const ForceEnergy forceEnergy = LJ::ComputeParticleParticleNB(data);

  // warpreduce along x-dim
  ForceEnergy forceEnergyX = forceEnergy;
  for (int offset = 8; offset > 0; offset >>= 1)
  {
    forceEnergyX.force.x += __shfl_down_sync(0xffffffff, forceEnergyX.force.x, offset, 16);
  // repeat for other components..
  }
  if (threadIdx.x == 0) {
    feSc1[threadIdx.y] = forceEnergyX;
  }



  // warpreduce along y-dim
  ForceEnergy forceEnergyY = forceEnergy;
  for (/*??*/)
  {
    // ????
    // repeat for other components..
  }
  if (threadIdx.y == 0) {
    feSc0[threadIdx.x] = forceEnergyY;
  }


  // Push results to Global Mem //
}

# Criteria

  1. The output must be deterministic
  2. I can't put the data in shared memory: 16 bytes × 16² = 4 KB, which is quite a lot above the budget for this kernel
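
One direction that seems to fit both criteria (a hedged, untested sketch; it would replace the y-dim section above): with blockDim (16,16,1), each 32-lane warp holds two consecutive y-rows, so a single full-width shuffle folds the row pair before anything touches shared memory, leaving only 8 partials per column to stage — 512 B per scalar component instead of the full 4 KB tile:

// Untested sketch: deterministic y-reduction for blockDim (16,16,1).
// Shown for one component (force.x); repeat per component, re-syncing
// before reusing the buffer. smem cost: 8*16 floats = 512 B per pass.
float vy = forceEnergy.force.x;
// Lanes 0..15 of each warp are row y (even), lanes 16..31 are row y+1,
// so delta 16 at full width 32 folds the odd row into the even one.
vy += __shfl_down_sync(0xffffffff, vy, 16);
__shared__ float colPartial[8][16];   // one partial per warp per column
if ((threadIdx.y & 1) == 0)
    colPartial[threadIdx.y >> 1][threadIdx.x] = vy;
__syncthreads();
if (threadIdx.y == 0) {
    float s = 0.0f;                   // fixed summation order: deterministic
    #pragma unroll
    for (int w = 0; w < 8; ++w)
        s += colPartial[w][threadIdx.x];
    feSc0[threadIdx.x].force.x = s;   // column sum along y for particle x
}
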
u/ElectronGoBrrr — 11 days ago
▲ 8 r/CUDA+2 crossposts

I applied to the role below and have an interview lined up with NVIDIA.

https://jobs.nvidia.com/careers/job/893393807731

I have no idea what to expect in the interview process or how best to prepare. The interview is a 60-minute screening coding interview on HackerRank. I want to give this my best shot given how much I want the job, so I'd appreciate any help.

u/Complete-Resolve-201 — 8 days ago