
I built a self-hosting x86-64 toolchain from scratch. Part 1: The compiler
A while ago I wrote about the overall development journey of building a self-hosting toolchain from scratch. Today I want to get into the details of the compiler specifically — how it works, what it produces, and some metrics on how it performs. If you're interested, the source code is here: bjornc
What it is
bjornc is a single-pass, IR-free compiler for Björn, a statically typed systems language I designed alongside the toolchain. It takes .bjo source files and emits x86-64 assembly. No LLVM, no libc, no intermediate representation. The assembly goes to my own assembler, which produces a custom object format (.cub), which goes to my own linker, which produces an ELF executable. But that's for future posts — today is about the compiler.
Single pass, no IR
Most serious compilers go source → IR → optimised IR → assembly. The IR is where all the interesting work happens — dead code elimination, register coalescing, loop transformations, inlining.
bjornc goes source → AST → assembly. One pass, direct emission, nothing in between.
The obvious question is: why? Partly philosophy — I wanted to understand the direct relationship between source constructs and machine instructions without any abstraction layer in the way. Partly practicality — the compiler was co-designed alongside the assembler and linker, and the language evolved constantly as I built those tools. An IR schema that needed updating every time I added a language feature would have slowed everything down.
The cost is real though. No IR means no optimisation pipeline. The two optimisations I have — constant folding and addressing mode optimisation — are applied opportunistically during code generation, limited to what's locally visible at each AST node.
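To make "opportunistically during code generation" concrete, here is a toy sketch (Python, not the actual bjornc source) of what folding looks like when there is no IR: the emitter can only fold what it sees locally at each node, so an addition of two literal children collapses to a single immediate move, while anything else falls back to the full emit sequence.

```python
# Toy model of single-pass emission with opportunistic constant folding.
# Illustrative only: bjornc is not written in Python and its AST differs.
from dataclasses import dataclass

@dataclass
class Lit:
    value: int

@dataclass
class Add:
    left: object
    right: object

def emit(node, out):
    """Emit toy x86-64 for `node`, leaving the result in eax."""
    if isinstance(node, Lit):
        out.append(f"mov eax, {node.value}")
    elif isinstance(node, Add):
        # Fold only what is locally visible: both children are literals.
        if isinstance(node.left, Lit) and isinstance(node.right, Lit):
            out.append(f"mov eax, {node.left.value + node.right.value}")
        else:
            emit(node.left, out)
            out.append("mov ebx, eax")  # park left operand (real code pins ebx)
            emit(node.right, out)
            out.append("add eax, ebx")

out = []
emit(Add(Lit(2), Lit(3)), out)   # folds to a single `mov eax, 5`
```

Notice the limitation: the fold only fires when both operands are literals at that exact node, so a constant hiding one level deeper in the tree is only caught when recursion reaches it. That is the price of having no IR-level pass with a whole-function view.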
The source-to-assembly expansion factor for a non-trivial program is roughly 4-5:1. That's in the same ballpark as GCC at -O0 for comparable C code. At -O2 GCC closes that gap significantly. I don't.
Register allocation without live ranges
This is the part that required the most original thinking.
Graph coloring and linear scan — the two standard approaches — both require an IR to compute live ranges. No IR means no live ranges, which means neither algorithm was available to me.
What I ended up with is a tree-scoped pinning strategy. When the code generator needs a register for an AST node, it requests one from a central allocator, which marks it as pinned. The register stays pinned for the duration of that subtree and is released on exit, either explicitly once the value has been consumed or automatically at a safe exit point. Knowing which exit points are safe for freeing a register is a problem in its own right.
When no registers are available, one is spilled to the stack, and a spill record is created and linked to the AST node that pushed it, ensuring every node restores what it spilled once it's done. The live value being computed is moved between safe registers as needed, so no register gets clobbered mid-calculation and what survives at the end is the final computed value rather than a stale intermediate.
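A minimal sketch of the idea (illustrative Python, not bjornc's actual implementation; the eviction policy and data structures here are my simplifications):

```python
# Minimal sketch of a tree-scoped pinning allocator. Registers are pinned
# on request; when the pool runs dry, a pinned register is spilled and the
# spill record remembers which AST node caused it, so that node can
# restore the value when it exits.
class PinningAllocator:
    def __init__(self, registers):
        self.free = list(registers)   # registers available for pinning
        self.pinned = []              # registers currently pinned
        self.spills = []              # (register, owner_node) spill records

    def pin(self, node):
        if not self.free:
            victim = self.pinned.pop(0)         # oldest pinned register
            self.spills.append((victim, node))  # real code emits `push victim`
            self.free.append(victim)            # victim is reusable now
        reg = self.free.pop()
        self.pinned.append(reg)
        return reg

    def release(self, reg):
        self.pinned.remove(reg)
        self.free.append(reg)

    def restore(self, node):
        # On exiting `node`, undo every spill it created, in LIFO order.
        while self.spills and self.spills[-1][1] is node:
            reg, _ = self.spills.pop()          # real code emits `pop reg`

alloc = PinningAllocator(["rax", "rbx", "rcx", "rdx"])
regs = [alloc.pin(node=i) for i in range(5)]    # fifth pin forces a spill
```

The key property is that spill records are owned by AST nodes, not by live ranges: cleanup is tied to tree structure, which is exactly what a single traversal can see.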
To put some numbers on this — I analysed the 31K LOC assembly output for the assembler source itself:
| Register | Spill Count |
|---|---|
| rdi | 388 |
| rsi | 362 |
| rax | 98 |
| rbx | 16 |
864 total spills across 31K LOC — roughly one per 36 lines. Two things are important to address:
- rdi and rsi dominate because they're the argument registers: they get pinned at every call site and spilled whenever calls are nested or chained. Around 80-90% of the rdi/rsi spills come from my assembler source's use of the builder pattern to construct the mnemonic encoding lookup table. Code such as `new()->func1(a1)->func2(b1)->func3(c1);` may push rdi three times, which goes a long way toward explaining the metrics.
- rax and rbx aren't spilled often, but their spill frequency could be reduced further, either by using more registers — my compiler uses only 4 general-purpose registers (rax, rbx, rcx, rdx) besides the 6 System V ABI parameter registers — or by adding a heuristic that avoids rax for general computation near a division node (rax is explicitly required for division as per x86-64 requirements).
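A rough way to see why the builder chain hammers rdi — this is a simulation under simplified assumptions, not a trace of the real compiler: each call in the chain occupies rdi with its first argument, so every subsequent chained call finds rdi already pinned and must push it first.

```python
# Simplified simulation of rdi pressure in a chained builder call like
# new()->func1(a1)->func2(b1)->func3(c1). Assumption: each call pins rdi
# for its first argument, and every chained call finds rdi already live.
def simulate_chain(depth):
    """Count rdi pushes for `depth` calls in a single chain
    (depth=4 models new() plus three chained builder calls)."""
    rdi_pinned = False
    pushes = 0
    for _ in range(depth):
        if rdi_pinned:
            pushes += 1       # spill the live receiver before reusing rdi
        rdi_pinned = True     # this call's argument now occupies rdi
    return pushes
```

Under these assumptions a four-call chain yields three rdi pushes, matching the "may push rdi 3 times" estimate above; hundreds of such chains in the lookup-table builder add up to the spill counts in the table.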
I decided to stick with 4 registers to stress-test my allocation algorithm: if it works under these constraints, it'll work with more registers. And once I got it working, I never bothered to change it. It's important to note that not all of those 864 spills are unnecessary — a significant fraction are genuine saves across boundaries where the register is actually live. A liveness-aware allocator could eliminate the redundant ones, but that requires an IR I don't have.
What the compiler actually produces
Here's a concrete example. This Björn function:
```
func int32 add(int32 a, int32 b){
    return a + b;
}
```
Produces this assembly:
```
_add_i32_i32:
    push rbp
    mov rbp, rsp
    sub rsp, 8
    mov dword [rbp - 4], edi    ; save a
    mov dword [rbp - 8], esi    ; save b
    mov eax, dword [rbp - 4]    ; load a
    mov ebx, dword [rbp - 8]    ; load b
    add eax, ebx                ; a + b, result in eax
.ret_from_add_i32_i32:
    add rsp, 8
    pop rbp
    ret
```
The name mangling — _add_i32_i32 — incorporates the parameter types, which is how function overloading is resolved at compile time without runtime dispatch.
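The scheme can be sketched like this — my reconstruction from the example above, where only the int32 → i32 abbreviation is confirmed; the other table entries are hypothetical stand-ins:

```python
# Sketch of type-based name mangling as seen in _add_i32_i32.
# Only int32 -> i32 is confirmed by the example above; the other
# abbreviations are hypothetical placeholders.
TYPE_ABBREV = {
    "int32": "i32",   # confirmed by the add example
    "int64": "i64",   # hypothetical
    "bool":  "b",     # hypothetical
}

def mangle(name, param_types):
    """Build the assembly-level symbol for a (possibly overloaded) function."""
    parts = [TYPE_ABBREV[t] for t in param_types]
    return "_" + "_".join([name] + parts)

assert mangle("add", ["int32", "int32"]) == "_add_i32_i32"
```

Because the parameter types are baked into the symbol, two overloads of the same function link as distinct symbols and each call site simply references the right one.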
Field access folds the offset directly into the memory operand:
```
mov eax, dword [rax + 4]    ; v->y where y is at offset 4
```
No intermediate register needed. This was a deliberate optimisation — the naive approach materialises the offset in a register and adds it to the base, which wastes a register and extra instructions.
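As a toy comparison (a Python sketch of the two lowerings, not the real emitter), the naive sequence versus the folded one:

```python
# Toy comparison of naive vs folded field-access lowering.
# Assumption: the struct base pointer is already in rax.
def naive_field_load(offset):
    """Materialise the offset in a scratch register, add, then load."""
    return [
        f"mov rbx, {offset}",       # offset into a register
        "add rbx, rax",             # compute the field address
        "mov eax, dword [rbx]",     # load through the computed address
    ]

def folded_field_load(offset):
    """Fold the constant offset straight into the memory operand."""
    return [f"mov eax, dword [rax + {offset}]"]
```

The folded form leans on x86-64's base + displacement addressing mode, so the address arithmetic happens inside the load itself and rbx stays free for other work.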
Compilation performance
I measured compilation time against AST node count rather than raw lines of code — a single line can contain a trivially simple expression or a deeply nested one, so LOC is a poor proxy for compiler workload. Six files from the assembler source were selected across a wide range of AST counts, measured with hyperfine over 300 runs:
| File | AST Nodes | Compilation Time (ms) |
|---|---|---|
| arena.bjo | 558 | 1.1 ± 0.4 |
| assembler_ops.bjo | 1,364 | 2.6 ± 0.8 |
| register.bjo | 1,701 | 2.1 ± 0.4 |
| main.bjo | 2,347 | 4.8 ± 0.6 |
| ast.bjo | 4,283 | 5.9 ± 0.6 |
| analyzer.bjo | 9,249 | 23.3 ± 2.0 |
Linear regression gives R² = 0.9537 and p-value = 0.000815 — approximately linear behaviour, statistically significant.
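The fit itself is easy to reproduce from the table with plain least squares over the six points (the p-value comes from the regression t-test and is omitted here):

```python
# Reproduce the linear fit of compilation time vs AST node count
# from the table above, stdlib only.
nodes = [558, 1364, 1701, 2347, 4283, 9249]
times = [1.1, 2.6, 2.1, 4.8, 5.9, 23.3]   # mean times in ms

n = len(nodes)
mx = sum(nodes) / n
my = sum(times) / n
sxx = sum((x - mx) ** 2 for x in nodes)
syy = sum((y - my) ** 2 for y in times)
sxy = sum((x - mx) * (y - my) for x, y in zip(nodes, times))

slope = sxy / sxx                     # ms per AST node
r_squared = sxy ** 2 / (sxx * syy)
print(f"{slope * 1000:.2f} ms per 1000 AST nodes, R^2 = {r_squared:.4f}")
# ≈ 2.55 ms per 1000 nodes, R^2 ≈ 0.9537
```

The slope works out to roughly 2.5 ms per thousand AST nodes, which is a more honest throughput figure for this compiler than anything LOC-based.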
The notable outlier is analyzer.bjo. That file contains deeply nested template builder calls and nested foreach loops whose lowering to for requires AST copying, generating disproportionately complex AST structures relative to raw node count. It's not a random spike — it's exactly the file you'd expect to be slow given what it does.
The full assembler — 6228 LOC of Björn across 10 files — compiles in roughly 47ms. The single-pass IR-free design means there's no IR to construct, no optimisation passes to run, and no backend lowering phase. It's a single tree traversal and it shows.
Closing thoughts
It was never my intention to create a production-ready, optimised, competitive compiler. It was, however, my intention to learn how to develop a compiler, own every line of code, ponder design decisions, and face challenges that required solutions of my own. Along the way, I also wanted to get more familiar with x86-64 assembly. At the end of the day, I did this out of curiosity and personal drive rather than to grind out a project entry for my CV — which probably explains how I managed to stick with this project for 1.5 years, roughly 10 months of which were spent on the compiler. Curiosity was also the reason I avoided the obvious shortcuts — LLVM for the compiler backend, Bison for parsing, NASM for assembling, GNU ld for linking, libc for the runtime. Every one of those would have been the sensible choice. None of them would have taught me what I wanted to know. To me this was like finally watching a show you've wanted to see for a long time: sure, you can look up highlights, a synopsis, and a couple of clips online, and you more or less get the plot and how it ends — but if you really wanted to watch it, you just would have. I was in it for the ride, not the destination.
The next post will cover the runtime libraries — malloc, printf, variadic arguments, all implemented from scratch on top of direct Linux syscalls with no libc. After that, I'll post about the assembler, the custom object format, and my own linker. If you have questions about anything here, I'm happy to go into more detail.