
Prysma: Anatomy of an LLVM Compiler Built from Scratch in 8 Weeks
Prysma: https://github.com/prysma-llvm/prysma
This is a compiler development project I started about 8 weeks ago. I’m a CEGEP student, and since systems engineering of this scale isn’t taught at my level, I decided to build my own low-level ecosystem from scratch. Prysma isn’t just a student project; it’s a complete language and a modular infrastructure designed with the constraints of industrial production tools in mind. This document is a technical dissection of the architecture, my engineering choices, and the effort invested in the project.
1. Meta-generation and automation of the frontend
Developing a compiler normally requires manually coding hundreds of classes for the Abstract Syntax Tree (AST) and its visitors, which generates a lot of technical debt. To avoid this, I created a compiler generator in Python.
Prysma’s grammar is defined in an ast.yaml file. My Python engine (engine_generation.py), which uses Jinja2, reads this specification and generates all the C++ code for the frontend (classes, virtual methods, interfaces). This strategy is inspired by LLVM’s TableGen. It allows me to add a new operator in 30 seconds. Without this technique, it would take me about an hour to add a single node, because I would have to manually modify the token, the lexer, the parser, and the visitors, with a high risk of errors. Now, everything is handled by automated templates.
2. Parallel Orchestration with llvm::ThreadPool
A modern compiler needs to be fast, so I architected the orchestrator around llvm::ThreadPool. Prysma processes files in parallel for the lexing, parsing, and IR generation phases. The technical challenge was that LLVM contexts are not thread-safe: I had to isolate each compilation unit in its own LLVMContext and in-memory module before the final merge by the linker. Avoiding race conditions on global symbols required strict discipline around object lifetimes.
3. Native Object Model and V-Tables
Prysma implements a class model directly in LLVM IR, including encapsulation (public, private, protected). Implementing polymorphism was one of the most complex aspects. I modeled navigation through virtual method tables (v-tables) at the binary level using llvm::StructType. Call resolution happens at runtime: GetElementPtr (GEP) instructions index into the v-table to retrieve function pointers. Because a single-byte offset error causes segfaults, this part of the compiler is still unstable.
4. Memory Management: Arena and Heap
Memory allocation is crucial for speed. For the AST nodes, I use a memory arena (llvm::BumpPtrAllocator). The compiler reserves a massive block and simply advances a pointer for each allocation in O(1). Everything is freed at once at the end, as in Clang.
For the Prysma language itself, I implemented dynamic allocation with the new and delete keywords, which communicate with libc’s malloc and free. Loops also manage their stack via LLVM’s alloca instruction.
5. Unit and Functional Testing System
To ensure the reliability of the backend, I implemented a robust pipeline. I use Catch2 for C++ tests of the AST and the symbol registry. I also developed a test orchestrator in Python (orchestrator_test.py) that uses templates to compile and execute hundreds of files simultaneously. This allows testing recursion, variable shadowing, and thread collisions. GitHub Actions blocks deployment if a single test fails.
6. Execution Volume and Work Methodology
Systems engineering demands a significant amount of execution time. To make this much progress in 8 weeks, I worked 14 hours a day, 7 days a week. Designing an LLVM backend requires reading thousands of pages of documentation and debugging complex memory errors.
AI was a great help in understanding this complexity. My method was iterative: I generated LLVM IR (version 18) from C++ code to inspect and understand each line. I combined Doxygen’s technical documentation with questions posed to the AI to master everything. To maintain this pace, I managed my fatigue with caffeine (a maximum of three times a week, to avoid building tolerance), accepting the impact on my mental health to achieve my goals. I was completely absorbed by the project.
7. Data-Oriented Design (Work by Félix-Olivier Dumas)
Félix-Olivier Dumas joined the Prysma team to restructure the project’s algorithmic foundation. He implemented a Data-Oriented Design (DOD) architecture for managing the AST, which improves cache locality and traversal speed.
In his system (currently being finalized), a node is a simple integer (node_id_t). Data (name, type) is stored in sparse sets backed by flat arrays. The goal is to maximize L1/L2 cache utilization: by traversing contiguous arrays, the CPU can prefetch data and avoid cache misses. He also uses tag dispatching in C++ to link components at no runtime cost (zero-cost abstraction), without v-tables or switch statements.
8. Current State of the Language
Prysma is currently a functional language with stable capabilities:
Syntax: Primitive types (int32, float, bool), full arithmetic, and operator precedence.
Structures: If-else conditions and while loops.
Functions: Recursion support and passing arguments by value.
Memory & OOP: Native arrays, classes, inheritance, and heap allocation.
Tools: Error diagnostics (line/column), Graphviz export of the AST, and a VS Code extension for syntax highlighting.
9. Roadmap and Future Vision
The project is evolving, and here are the planned objectives:
Short term (v1.1): Development of the Standard Library (lists, stacks, queues) and an import system for linking C libraries.
Medium term (v1.2): Support for Generics (templates), addition of Namespaces, and stricter semantic analysis for type checking.
Long term: Just-In-Time (JIT) compilation, integration of the inline assembler (asm {}), and custom SSA optimization passes.
The project is open source, and anyone interested in LLVM or Data-Oriented Design can contribute to the project on GitHub. The code is the only judge.