r/Compilers

Looking for People Interested in LLVM/MLIR and Compiler Development

I have been working on compilers for the past 6 months and have explored and learned a lot about areas like the middle-end, SelectionDAG, GlobalISel, register allocation, instruction selection, scheduling, etc. I’ve also contributed a few LLVM patches spanning these topics.

Recently, I’ve been diving deeper into backend development, especially AMDGPU and NVPTX, since I’m very interested in GPU compilers and code generation.

Lately, it has started getting difficult to learn and keep up entirely alone, so I’m looking for a few people who are interested in compilers/LLVM/MLIR/GPU backends to connect with, discuss ideas, learn together, or maybe even work on projects/contributions together.

reddit.com
u/Jumpy-Fox-3177 — 18 hours ago

What Should I Read After Crafting Interpreters?

Hello everyone,

I’m not a native English speaker, and I’ve only recently started becoming more comfortable with English. I’ve been interested in programming for a long time, since I was around 15 years old. Back then, one of the first questions I remember asking myself was: “How is a programming language made?”

Unfortunately, because of financial difficulties and some mental health struggles, I wasn’t always able to dedicate as much time to programming as I wanted. Over time, with treatment and a strong desire to keep learning, I’ve been able to return to it. I don’t have a university degree, and I’m mostly doing this as a hobby, but it is something I care about deeply.

My goal is not necessarily to build a huge programming language, but I genuinely want to understand compilers and interpreters deeply, both from a theoretical and practical perspective.

So my question is: after Crafting Interpreters, which books or resources would you recommend?

Also, I’m not sure how much mathematics is required for studying compilers. Apart from basic arithmetic, my math knowledge has become quite rusty because I haven’t practiced it in a long time. Do you think I should study mathematics again? If so, which topics would be most useful?

reddit.com
u/Jumpy-Win-2973 — 20 hours ago
▲ 56 r/Compilers+3 crossposts

Phase — a statically-typed bytecode-interpreted language in C, with an essay on implementation

Phase is a statically-typed bytecode-interpreted programming language written in ~4,800 lines of C with zero external dependencies. It features a 25-opcode stack-based VM, 21 error types with source-mapped diagnostics, 5 primitive types, and a standard interpreter pipeline (lexer, parser, type checker, bytecode generator, VM).

I also wrote a technical piece on how it works by following out("Hello world!") end-to-end through every stage.

Writing: williamalexakis.com/interpreter-in-c

Repo: github.com/williamalexakis/phase

u/williamalexakis — 1 day ago

How to remove left recursion from a prefix translation scheme? Urgent, please

I already wrote a C++ parser for a translation scheme by removing its left recursion. Now I want to know how to remove left recursion from the prefix translation scheme below, so that I can build a parser from it. Could you please remove the left recursion from it?

exp -> {cout<<"+"} exp + term
     | {cout<<"-"} exp - term
     | term
term -> digit
digit -> {cout<<"0"} 0 | {cout<<"1"} 1 | ... | {cout<<"9"} 9

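I did this for the postfix translation scheme and wrote the C++ parser you can see below. I would also appreciate help with how to build the prefix version.

Code for the postfix translation scheme, written from the scheme with left recursion removed. The prefix code must be structured the same way as this postfix code: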

#include <iostream>
#include <cstdlib>
using namespace std;

char input[50];
int i = 0;
char lookahead;

void exp();
void term();
void factor();
void paren();

void rest1();
void rest2();
void rest3();

void digit();

void match(char);
void eror();

int main() {
    cout << "Enter String of Token: ";
    cin >> input;

    lookahead = input[i];

    cout << "Postfix: ";

    exp();

    if (lookahead == '\0') {
        cout << "\nValid Expression";
    }
    else {
        eror();
    }

    return 0;
}

void exp() {
    if ((lookahead >= '0' && lookahead <= '9') || lookahead == '(') {
        term();
        rest1();
    }
    else {
        eror();
    }
}

void rest1() {
    if (lookahead == '+') {
        match('+');
        term();
        cout << '+';   // postfix
        rest1();
    }
    else if (lookahead == '-') {
        match('-');
        term();
        cout << '-';   // postfix
        rest1();
    }
}

void term() {
    if ((lookahead >= '0' && lookahead <= '9') || lookahead == '(') {
        factor();
        rest2();
    }
    else {
        eror();
    }
}

void rest2() {
    if (lookahead == '*') {
        match('*');
        factor();
        cout << '*';   // postfix
        rest2();
    }
    else if (lookahead == '/') {
        match('/');
        factor();
        cout << '/';   // postfix
        rest2();
    }
}

void factor() {
    if ((lookahead >= '0' && lookahead <= '9') || lookahead == '(') {
        paren();
        rest3();
    }
    else {
        eror();
    }
}

void rest3() {
    if (lookahead == '^') {
        match('^');
        paren();
        cout << '^';   // postfix
        rest3();
    }
}

void paren() {
    if (lookahead == '(') {
        match('(');
        exp();
        match(')');
    }
    else if (lookahead >= '0' && lookahead <= '9') {
        digit();
    }
    else {
        eror();
    }
}

void digit() {
    if (lookahead >= '0' && lookahead <= '9') {
        cout << lookahead;   // operand in postfix
        match(lookahead);
    }
    else {
        eror();
    }
}

void match(char t) {
    if (lookahead == t) {
        lookahead = input[++i];
    }
    else {
        eror();
    }
}

void eror() {
    cout << "Syntax Error";
    exit(0);
}


reddit.com
u/Radiant-Aspect-2345 — 24 hours ago
▲ 8 r/Compilers+1 crossposts

Writing an LLM compiler from scratch [Part 3]: Autotuning — A Search Loop Over Tile-IR Rewrites

The third and final article on building a hackable ML compiler from scratch. The previous parts built a six-IR pipeline (Torch → Tensor → Loop → Tile → Kernel → CUDA) and lowered TinyLlama / Qwen2.5-7B through it.

Block sizes, register tiles, staging decisions, etc., were determined by a heuristic that didn't generalize beyond the matmul shapes it was fitted on.

This part swaps those heuristics for a search loop: an SP-MCTS that explores the cross-product of rule parameters, benchmarks each candidate, and persists winners in a SQLite cache keyed by structural op hash. The cache replays on subsequent compiles.

On RTX 5090, the tuned stack lands at geomean 0.96× vs PyTorch eager (vs 0.87× for the heuristic and 0.91× for torch.compile), with 32 of 84 kernel shapes faster than PyTorch hand-optimized kernels. Best kernels are 5.6× faster than PyTorch (tall-skinny matmuls).

Passes

Pass                 Forks
tileify              —
chunk_matmul_k       one per legal K-chunk size (divisors of K, 16..128)
split_matmul_k       apply or skip — turn K into a parallel reduction
cooperative_reduce   —
blockify_launch      one per threads-per-block ∈ {64,128,256,512}
chunk_reduce         —
stage_inputs         which inputs to stage in smem (2^k combinations)
register_tile        one per (F_M, F_N) divisor pair
permute_reg_tile     inner-loop order ∈ {km, mk}
double_buffer        apply or skip — split stage buffers for overlap
tma_copy             apply or skip on sm_90+
async_copy           apply or skip on sm_80+ (cp.async)
pad_smem             —
pipeline_k_outer     apply or skip
mark_unroll          —

A dense matmul with six staging-relevant inputs, three legal K-chunks, four threads-per-block values, eight register-tile shapes, two pipelining choices, and two double-buffering choices spans 2^6 × 3 × 4 × 8 × 2 × 2 ≈ 24,000 terminals.
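The figure above is just the product of the per-pass fork counts. A quick check in Python, using only the numbers quoted in that paragraph (the dictionary keys are illustrative labels borrowed from the pass table, not a data structure from the repo):

```python
from math import prod

# Fork counts per decision, taken from the paragraph above.
# The mapping itself is a hypothetical illustration, not code from the project.
forks = {
    "stage_inputs": 2 ** 6,    # six staging-relevant inputs, each staged or not
    "chunk_matmul_k": 3,       # three legal K-chunk sizes
    "blockify_launch": 4,      # four threads-per-block values
    "register_tile": 8,        # eight register-tile shapes
    "pipeline_k_outer": 2,     # apply or skip
    "double_buffer": 2,        # apply or skip
}

terminals = prod(forks.values())
print(terminals)  # 24576, i.e. roughly 24,000
```

At the rate implied by the tuning run quoted further down (207 variants in 67.7 s, about 0.33 s each), benchmarking all 24,576 terminals for a single op would take on the order of two hours, which is presumably why a guided search is used instead of exhaustive enumeration.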

Search loop

SP-MCTS with max-Q propagation, normalized UCB1, and a patience termination criterion (stop after N consecutive measured terminals without a new best):

def sp_mcts(root, patience, c):
    best_reward = 0.0
    visits_at_best = 0
    while root.visits - visits_at_best < patience:
        # SELECT
        # descend to a frontier node by UCB1 over normalized max-Q
        node = root
        while node.children and node.has_unfinished_descendant():
            node = max(
                (ch for ch in node.children if ch.has_unfinished_descendant()),
                key=lambda ch: ucb(ch, node, c),
            )

        # SIMULATE / EXPAND — advance one rule
        # spawn forks or bench a terminal
        result = advance_one_rule(node.candidate)
        if result.forks:
            node.children = [Node(c, parent=node) for c in result.forks]
            continue
        reward = 1.0 / bench_latency(result.cuda_op)

        # BACKPROP — walk parent links
        # bump visits, max-update best_reward
        n = node
        while n is not None:
            n.visits += 1
            n.best_reward = max(n.best_reward, reward)
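            n = n.parent

        # PATIENCE: reset the counter whenever a terminal beats the global best,
        # otherwise the loop condition never sees an improvement
        if reward > best_reward:
            best_reward = reward
            visits_at_best = root.visits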
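The loop calls a `ucb` helper that is not shown in the post. One plausible shape for a UCB1 score over normalized max-Q is sketched below. The min-max normalization across siblings, the handling of unvisited children, and the minimal `Node` class are all assumptions made for illustration, not the article's implementation.

```python
import math


class Node:
    """Minimal stand-in for the search-tree node used above (assumed fields)."""

    def __init__(self, candidate, parent=None):
        self.candidate = candidate
        self.parent = parent
        self.children = []
        self.visits = 0
        self.best_reward = 0.0  # max-Q: best reward seen anywhere in this subtree


def ucb(child, parent, c):
    """UCB1 over max-Q, with Q min-max normalized across the parent's children."""
    if child.visits == 0:
        return math.inf  # assumption: unvisited children are always tried first
    rewards = [ch.best_reward for ch in parent.children if ch.visits > 0]
    lo, hi = min(rewards), max(rewards)
    # Map Q into [0, 1]; if all visited siblings tie, score them equally.
    q = (child.best_reward - lo) / (hi - lo) if hi > lo else 0.5
    explore = c * math.sqrt(math.log(max(parent.visits, 1)) / child.visits)
    return q + explore


# Tiny usage example with made-up visit counts and rewards.
root = Node("root")
a, b = Node("a", root), Node("b", root)
root.children = [a, b]
root.visits, a.visits, b.visits = 10, 7, 3
a.best_reward, b.best_reward = 0.04, 0.02
print(ucb(a, root, 1.0) > ucb(b, root, 1.0))  # True: a's better reward outweighs b's exploration bonus
```

Normalizing matters here because the reward is `1.0 / latency`, whose scale changes from kernel to kernel; mapping it into [0, 1] keeps a single exploration constant `c` meaningful across ops.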

Structural keys

The entire cache is keyed by structural digests that describe the kernel's structure. To produce a structural key, eight normalization passes are used: drop size-1 free axes, sequential SSA rename, sort commutative args, canonicalize external buffer names, collapse op clusters: sub ↔ add (FMA), mod ↔ divide (SFU), the compare family; then hash the result.

Under this transformation, the following ops become identical and the same scheduling decisions will be applied:

# Op A
for i in range(M):
    for j in range(1):
        tmp = load(X[i])
        result = tmp + bias[i]
        Y[i, j] = result

# Op B
# different names and '-' instead of '+'
for i in range(M):
    a = load(input0[i])
    b = load(input1[i]) 
    c = a - b
    output0[i] = c
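To make the equivalence concrete, here is a toy version of the idea in Python. It covers only some of the passes named above (dropping size-1 axes, sequential SSA rename, sorting commutative args, canonical buffer names, and the add/sub and div/mod clusters) over a made-up tuple IR. The IR encoding, the helper names, and the choice of SHA-256 are assumptions for illustration, and `bias[i]` in Op A is modeled as an explicit load.

```python
import hashlib

# Op clusters: members of one cluster are treated as the same op when hashing.
CLUSTER = {"add": "addsub", "sub": "addsub", "div": "divmod", "mod": "divmod"}
COMMUTATIVE = {"add", "mul"}


def structural_key(axes, body):
    """Hash a toy op after normalizing names, size-1 axes, and op clusters."""
    # Drop size-1 axes and rename the surviving ones positionally.
    axis_names, kept_sizes = {}, []
    for name, size in axes:
        if size != 1:
            axis_names[name] = f"ax{len(axis_names)}"
            kept_sizes.append(size)

    def idx(index):
        return tuple(axis_names[a] for a in index if a in axis_names)

    values, buffers = {}, {}

    def val(name):  # sequential SSA rename, in definition order
        return values.setdefault(name, f"%{len(values)}")

    def buf(name):  # canonical external buffer names, in first-use order
        return buffers.setdefault(name, f"buf{len(buffers)}")

    canon = []
    for ins in body:
        kind = ins[0]
        if kind == "load":
            _, dest, buffer, index = ins
            canon.append(("load", val(dest), buf(buffer), idx(index)))
        elif kind == "store":
            _, buffer, index, value = ins
            canon.append(("store", buf(buffer), idx(index), val(value)))
        else:  # binary arithmetic
            _, dest, lhs, rhs = ins
            args = [val(lhs), val(rhs)]
            if kind in COMMUTATIVE:
                args.sort()
            canon.append((CLUSTER.get(kind, kind), val(dest), *args))
    text = repr((kept_sizes, canon))
    return hashlib.sha256(text.encode()).hexdigest()[:16]


op_a = structural_key(
    [("i", "M"), ("j", 1)],
    [("load", "tmp", "X", ("i",)),
     ("load", "bias_i", "bias", ("i",)),
     ("add", "result", "tmp", "bias_i"),
     ("store", "Y", ("i", "j"), "result")],
)
op_b = structural_key(
    [("i", "M")],
    [("load", "a", "input0", ("i",)),
     ("load", "b", "input1", ("i",)),
     ("sub", "c", "a", "b"),
     ("store", "output0", ("i",), "c")],
)
op_mul = structural_key(
    [("i", "M")],
    [("load", "a", "input0", ("i",)),
     ("load", "b", "input1", ("i",)),
     ("mul", "c", "a", "b"),
     ("store", "output0", ("i",), "c")],
)
print(op_a == op_b, op_a == op_mul)  # True False
```

Op A and Op B collapse to the same key, so a cache keyed this way would reuse one tuning result for both, while the multiply variant falls outside the add/sub cluster and gets its own entry.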

Run CLI example from the repo:

# Eager 25 µs, Deplodock 38.9 µs (0.64× eager)
deplodock run --bench -c \
  "a=torch.randn(1,32,2048);b=torch.randn(2048,5632);torch.matmul(a,b)"

# Tune (default patience 60). 207 variants explored in 67.7s,
# best 22.54 µs at BM=32, BN=64, F_M=8, F_N=2 (worst was 293.75 µs).
deplodock tune -v -c \
  "a=torch.randn(1,32,2048);b=torch.randn(2048,5632);torch.matmul(a,b)"

# Re-run with the cached knobs — 22.7 µs (1.10× eager)
deplodock run --bench -c \
  "a=torch.randn(1,32,2048);b=torch.randn(2048,5632);torch.matmul(a,b)"
open.substack.com
u/NoVibeCoding — 1 day ago

Antlr is very very very slow

Am I the only one who finds ANTLR's Java target very slow? I generated Java code from a Java grammar.

Parsing just a package name takes 500 ms, and a 50-line file takes 3 min.

My laptop has 12 GB of RAM.

reddit.com
u/shyakaSoft — 2 days ago
▲ 48 r/Compilers+1 crossposts

Progress on my C compiler in C.

Hey everybody,

I've been making a C11 compiler in C, following the book "Writing a C Compiler". It's my first compiler.

Here's the link: https://github.com/stjmm/CinC

It has:
- An on demand lexer
- Parser with Pratt expression parsing
- Semantic Analysis/Type system with a couple of passes
- Three Address IR
- x86_64 code emission

Right now it supports:
- expressions
- if/else/break/continue
- for/while/dowhile/break/continue/goto
- switch/case/default
- arithmetic, bitwise, logical operations
- int (and voids for functions) types
- extern/static/auto storage classes
- functions, function calls
- it assembles/links via gcc so you can already call functions like `putchar()`

Now I'm going to start implementing the rest of the types, and maybe a preprocessor. Eventually I want to implement a large enough subset of C11 to compile more real-world projects.

I'd be grateful for input on the code. It has been a great meta-learning experience about my favorite language.

u/shetrynajerkme — 3 days ago

Need help with semantic actions & type checking for my LL(1) Mini-Pascal compiler (following professor’s slides)

Hi everyone,

I’m working on a compiler for a small Pascal-like language. I have already finished the grammar, removed left recursion and left factoring, and turned it into an LL(1) grammar with embedded semantic actions (translation scheme style).

The problem is that I’m struggling with semantic actions and type checking. My professor uses a very specific style from his slides (Chapter 5 & 6):

Every Statement has a .type attribute (void or type_error)

Expressions use inherited attribute .in and synthesized .type

Declarations use addtype() and list handling

I have the full grammar with some actions already written, but many parts still feel confusing to me (especially the tail productions like Expression', SimpleExpression', Term' and how to pass the left operand type using .in).

Here is my current grammar (with embedded actions):

Program → Header Declarations Block . { } ... (I can paste the full grammar if needed)

I would really appreciate any help, examples, or explanations on how to correctly implement the semantic actions and type checking according to the standard Dragon Book / professor’s slide convention.

Any guidance, small examples, or links to similar student projects would be very helpful.

Thank you!

reddit.com
u/WittyResearcher8525 — 3 days ago

Help: Writing a Python to C transpiler

I'm thinking about embarking on a journey of writing a Python to C transpiler. It'll provide an interesting challenge and will also be useful, considering I am targeting an environment that can only take a subset of C as input. Given that I have never written a compiler, but I wrote an interpreter about a decade ago and have forgotten most of the process, what are some things I'd need to familiarize myself with in order to write this transpiler? Also, what intermediate representation would be wise for such a project?

reddit.com
u/nanoman1 — 4 days ago
▲ 3 r/Compilers+1 crossposts

IR for my compiler

For the past few days, I have been working on a compiler written in Python. I have successfully implemented a frontend, and I now want a backend to generate IR for. I have decided to design my own IR, but I don't know a lot about IRs. Can y'all please provide me with information about the types of IR, the optimization tricks they utilize, and how they are implemented or work? I would be thankful for any type of material.

reddit.com
u/juicyroaster — 4 days ago

Compiler implementation language

Currently starting “Writing a C Compiler” by Nora Sandler. I initially wanted to start the project in C, but she herself suggests doing it in another language. Since I like it and she also suggests a language with pattern matching, Rust seems like a very good alternative.

Hoping to get some thoughts on this, particularly from people who’ve gone through this book before.

I like doing projects in C if they need speed and low-level detail (though more often it’s because coding like a caveman is fun), but the repetitive boilerplate, weak generic system and standard library make larger projects a pain to work through.

Rust is much more convenient here, though I’m aware it may get somewhat verbose or constricting for low-level work (at least from my limited experience).

reddit.com
u/Big-Rub9545 — 5 days ago

Advice before getting started

I just finished every challenge in the excellent game Turing Complete (https://store.steampowered.com/app/1444480/Turing_Complete/) which involves creating an 8-bit processor and writing assembly for it. As for my background I've written a fair bit of assembly in my career for Microchip PIC compilers, written a NES and GB emulator, so I've got a solid background. But I don't have a CS degree and I never took a compiler course.

I'd like to try building a 16-bit processor in Turing Complete's sandbox mode, and then I'd like to write C code on my computer and cross-compile it to the new processor. I haven't decided on an instruction set yet, it might be fun to spin my own, but also I could take an existing one.

How would I port something like gcc or llvm or something over to my new processor? I'd like to get advice up front before I select/design the instruction set. I'm not looking to run any sort of OS, just bare metal C code.

I'm not looking to write a compiler (at least not yet), I just want to use something existing.

EDIT: I'm not looking to run the compiler on the game's processor. I want to cross-compile small programs targeted for my game's processor. The simulation still runs at least at 1-10 MHz, so we're talking 80's level computer here, so I'm not expecting anything impressive, but I still want to write in C. I'm also happy to write in a limited subset of C.

u/StaticMoose — 4 days ago

Yet another tensor graph compiler

Hello, I've spent the last few years learning ML and this is the result - an ML library that spans the whole pytorch stack - from backends (fully complete CUDA, OpenCL, WGPU + partially implemented PTX, HIP and SPIRV) all the way to neural network modules in zyx-nn.

For lovers of python, python bindings are also available and don't differ from rust. Python wheel is 4 MB. Supports all pytorch ops across more hardware than pytorch. The current drawback is speed, but on small models, it should not be that bad. On my 2060, MNIST example takes 0.9ms per step, while compiled torch is 0.7ms.

Zyx was inspired by tinygrad to use a minimal opset. Therefore zyx uses only these nodes in the graph: leaf, unary, binary, cast, expand, permute, pad, reshape, reduce.

Most of the design decisions have finally stabilized. I've spent years trying different lowering approaches and finally got to the point where it seems to click.

So what is the secret sauce? Zyx is fully DYNAMIC. The graph is built dynamically and supports all branching, but execution is lazy. Currently the graph is chopped into kernels by heuristics and autotune searches over possible optimizations on each kernel. The novel part here is that all optimization passes are both optional and can be stacked in any order. There is no complex lowering. Optimizations include register tiling, hierarchical local memory reduce, LICM, CSE, DCE, etc. Some like tensor cores and load/store vectorization are partially implemented, while others like local memory tiling are not started yet.

Typical compilers like XLA and TVM and tinygrad instead use a multi-stage lowering pipeline where optimizations have to be applied in a certain order to be valid.

Zyx has probably the smallest (by number of ops) unified IR you can see in this space. It makes pattern matching more complex, but is fully expressive and complexity of optimizations is very low, while their orthogonality produces interesting results.

Why not use zyx

  1. performance - Currently mainly due to unfinished tensor core support and local memory tiling. The other part is non-ideal graph partitioning.
  2. if there is some other reason, please tell me. I believe zyx can improve faster than others thanks to the stackability of ops and its small codebase, and can become the most ergonomic, support the most hardware, and be the most correct in terms of numerical stability guarantees across dtypes and kernels.

Next steps

Performance, performance, performance. Other than adding those few obligatory optimization passes and improving autotune with a better cost function (btw. with the cost function and fast IR, zyx can search through tens of thousands of kernel variants per second per core - yes, autotune is multithreaded), the main next step is a rewrite of the graph splitting part. There are no user API changes and the rewrite is partially finished.

The new kernelizer will generate a hypergraph of both kernel fusion strategies AND device allocation strategies. Zyx already has automatic parallel pipelining with heuristics, but after the rewrite, it'll have proper search for this.

The part that I am most excited about is that, since the graph in zyx is made up of such simple nodes, I can write a pattern matcher that will map parts of the graph to kernels in existing stacks - cuDNN, oneDNN, NPU kernels, etc. With hypergraph search, zyx will be able to select the best path among these, and custom zyx kernels will fill in the blanks. Expect this to take a few months before it is ready.

Minimalism

Zyx is not only minimal in its graph, but also in its dependencies. All of zyx + dependencies is <50k LOC of pure Rust. I also wrote simple onnx bindings. The goal is to make zyx both the tiny runtime that runs all models on every hardware correctly, albeit with varying performance, as well as making zyx the nicest library to use for the following reasons:

  1. I carefully crafted every user facing function to be as intuitive as possible
  2. Zyx will not wish you good luck like pytorch if you mutate tensors before backprop, instead zyx tensors are immutable
  3. Zyx will not run out of memory if you fill your VRAM, instead zyx will fall back to RAM
  4. Zyx will not take 30s to import in python, instead it takes 10ms
  5. Zyx won't tell you that some ops are unsupported for some dtypes, instead all dtypes are supported with all ops, except for some ops that require float dtypes for mathematical meaningfulness
  6. Zyx won't complain that the graph is too large, too dynamic or too complex, a single node is 16 bytes, so huge graphs run just fine
  7. Zyx will fuse your kernels and won't run out of recompilation passes, recompilation happens at kernel level, not graph level
  8. Zyx will fuse ALL graphs, no matter how complex
  9. And perhaps a bit slowly, but zyx will keep running your code correctly even on the oldest of hardware (e.g. GT 710, RX 480, or even CPUs without AVX)

I wish you pleasant experimentation.

https://github.com/zk4x/zyx

https://crates.io/crates/zyx

https://crates.io/crates/zyx-nn

https://crates.io/crates/zyx-optim

https://pypi.org/project/zyx-py

reddit.com
u/zk4x — 4 days ago
▲ 24 r/Compilers+1 crossposts

Compile time evaluation

I want to include compile time evaluation in my language.

For context I have most of the language design and compiler planned out and am currently implementing it. Currently writing the parser that turns a stream of tokens into the ast. The plan is to target llvm ir for the moment.

While I have enough to do before I have to address it, I want to get some information about executing code at compile time. I am explicitly not talking about macros; for them I already have an idea. My questions are:

  1. How to decide whether to execute a function at compile time or keep the call in the code

  2. Security concerns about access to the system during compile-time evaluation

  3. How to execute it without writing a separate interpreter

  4. Does llvm provide any supporting features for this

Thanks

reddit.com
u/RedCrafter_LP — 5 days ago

A suitable name for my compiler

Hello guys, I have been writing a compiler for a few days. I have completed the frontend basics like a lexer, parser, and semantic analysis. It doesn't have an IR yet and no backend. I want the name to have a fast, quick and lightweight feel. It should have a no-libc feel, as I will write the standard library myself. Any type of suggestion is welcome.

reddit.com
u/juicyroaster — 5 days ago

Standard Optimizations using SSA form

I was trying to look at all optimizations that use SSA. I could find many, such as DCE, LICM, GVN, and PRE; however, the standard books such as the Dragon Book or Muchnick don't present the algorithms or examples explicitly on SSA programs. I could only find PRE and GVN given in SSA.

Where can I find explicit SSA-based implementations of these optimizations?

Both GCC and LLVM have these implemented in SSA form; however, I could not find the explicit implementation details.

reddit.com
u/Usual_Structure2818 — 5 days ago

Are there any junior roles out there right now?

Sorry if this kind of post isn't allowed, but I figure it's the best place to ask this. I've got 3 years of experience doing GPU compiler work (specifically Torch-MLIR and IREE) and I got laid off at the end of 2024. Since then my life has been a cycle of applying for senior roles because they're all I can find, impressing technical interviewers, and getting rejected anyway because someone with 5 more years of experience showed up.

Am I not looking in the right places, or are junior roles in GPU compilers that few and far between right now? Does anyone have any advice for the current job market? I'm kind of at a loss for what I can do other than pivot out of compilers entirely. I don't think building projects would do much to add to my on-paper experience and I'm already doing well on technical interviews.

reddit.com
u/Gapmeister — 5 days ago

What's the interview process like for ML/AI compiler intern roles?

Hi everyone,

Apologies if this has been asked before. I'm trying to get a clearer picture of what the interview process actually looks like for ML/AI compiler intern roles, and most of what I can find online is either generic SWE prep or NVIDIA-specific.

If you've interviewed (offer or no offer) for an ML/AI compiler intern, kernel engineer, or compiler engineer intern role within the last ~2 years, I'd really appreciate any insight on the following:

  • Which company did you interview with?
  • How many rounds total, and what did each focus on (coding / system design / project deep-dive / compiler theory / behavioral)?
  • Was the coding LeetCode-style or compiler-flavored (topo sort, graph coloring, dominators, SSA construction, IR traversal)? What difficulty?
  • How deep did they go on C++? Object lifetime, templates, memory model, UB corners, reading generated assembly?
  • For compiler theory, was it standard topics (SSA, dataflow analysis, register allocation, loop opts) or more domain-specific (MLIR dialects, kernel fusion, quantization, autotuning)?
  • Did they test ML/domain knowledge — transformers, autodiff, kernel fusion, tensor layouts, quantization arithmetic?
  • What surprised you, and what do you wish you'd prepared more for?

Companies I'm especially curious about:

  • NVIDIA
  • Modular
  • Tenstorrent
  • Apple (ANE)
  • Google (XLA / MLIR teams)
  • Smaller AI chip startups (Groq, Cerebras, d-Matrix, SambaNova, SiMa, FuriosaAI, Tiny Corp, etc.)

Even one data point on any company would be hugely helpful. Thanks in advance!

reddit.com
u/redd3moon — 5 days ago
▲ 8 r/Compilers+2 crossposts

I built a lightweight VM/runtime for AI-generated scripts from scratch

Most runtimes used by AI agents today are designed for humans, not disposable AI-generated code.

I’ve been experimenting with a small scripting runtime called Autolang focused on:

  • low startup latency,
  • strict static restrictions,
  • restartable arena memory,
  • opcode execution limits,
  • and lightweight orchestration around existing ecosystems (Python/C++/JS).

The goal isn’t replacing Python, but creating a safer intermediary layer for short-lived AI-generated scripts.

I’m curious whether people here think this direction makes sense for modern AI-agent systems, especially compared to approaches like Wasm, Lua, or sandboxed Python.

I’d also genuinely appreciate feedback on runtime/compiler design and possible performance improvements if the project sounds interesting.

autolang.adagroup.com.vn
u/TomatoKindly7082 — 5 days ago