Art

Posted on May 11

We Wrote a Compiler in the Language It Compiles. It Works.

#showdev #opensource #cpp #compiling

Four sprints. One afternoon. 62 tests, all green.

ForgeIL is the compiler layer of Forge - an open-source UI framework built around
radical simplicity: one language, one binary, no runtime installation.

The Idea

There is a test every language eventually faces. Not a benchmark. Not a syntax
comparison. A question:

Can the language compile itself?

A self-hosted compiler is the closest thing software has to a proof of maturity.
If the language can express a lexer, a parser, and a code generator - and the
result of running that code is a working binary - then the language has crossed
a threshold. It is no longer a demo. It is a tool.

SMS is the scripting language at the heart of Forge. It is interpreted at
development time and compiled to LLVM IR at release time. We decided to test
whether SMS could implement its own compiler front-end: a lexer, a parser, and
an LLVM IR emitter - all written in SMS itself.

Four sprints. One afternoon. It works.

What We Built

The self-hosted compiler lives in the forgeil/ directory of sms-cpp. Its three
SMS source files together form a complete compilation pipeline:

SMS Source
    |
    v
tokenize(src)        sprint1-lexer.sms     ->  array of Token
    |
    v
parse(tokens)        sprint2-parser.sms    ->  AST of Node
    |
    v
codegen(ast)         sprint3-codegen.sms   ->  LLVM IR text
    |
    v
clang                                      ->  native binary

A single convenience function wraps it all:

fun compile(src) {
    return codegen(parse(tokenize(src)))
}

That is the entire public API. Give it SMS source text. Get back LLVM IR.

Sprint 1: Lexer

The lexer turned out to be the place where SMS's integer-character API shaped
the design instead of limiting it. str.charAt(i) returns the integer char code
of the character at position i. So the entire lexer is built on integer
comparisons - no regex, no character class library.

fun isAlpha(code) {
    return (code >= 65 && code <= 90) || (code >= 97 && code <= 122) || code == 95
}

fun isDigit(code) {
    return code >= 48 && code <= 57
}

Two-character operators (==, !=, <=, >=, &&, ||) are handled with
one character of lookahead:

if (ch == 61) {  // '='
    if (i + 1 < len && src.charAt(i + 1) == 61) {
        tokens.add(Token("OP", "==", line))
        i = i + 2
    } else {
        tokens.add(Token("ASSIGN", "=", line))
        i = i + 1
    }
}

The result: tokenize(src) returns an array of Token(type, value, line) data
class instances. 12 tests, all green on the first run.
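
For readers who do not have an SMS interpreter at hand, here is a rough C++ sketch of the same idea: character codes compared as plain integers, plus one character of lookahead for "==". It is not the ForgeIL lexer. The Token struct only mirrors the Token(type, value, line) shape described above, the "IDENT" and "INT" type names are assumptions made for this sketch, and everything except identifiers, integers, "=" and "==" is simply skipped.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical C++ mirror of the article's Token(type, value, line) data class.
struct Token {
    std::string type;
    std::string value;
    int line;
};

// Same integer range checks as the SMS helpers: A-Z, a-z, underscore.
bool isAlpha(int code) {
    return (code >= 65 && code <= 90) || (code >= 97 && code <= 122) || code == 95;
}

// '0' through '9'.
bool isDigit(int code) {
    return code >= 48 && code <= 57;
}

// Tokenizes identifiers, integer literals, "=" and "==" only.
// "IDENT" and "INT" are assumed type names, not taken from ForgeIL.
std::vector<Token> tokenize(const std::string& src) {
    std::vector<Token> tokens;
    int line = 1;
    std::size_t i = 0;
    const std::size_t len = src.size();
    while (i < len) {
        int ch = static_cast<unsigned char>(src[i]);
        if (ch == 10) {                       // '\n': count lines
            ++line;
            ++i;
        } else if (isAlpha(ch)) {
            std::size_t start = i;
            while (i < len) {
                int c = static_cast<unsigned char>(src[i]);
                if (!isAlpha(c) && !isDigit(c)) break;
                ++i;
            }
            tokens.push_back({"IDENT", src.substr(start, i - start), line});
        } else if (isDigit(ch)) {
            std::size_t start = i;
            while (i < len && isDigit(static_cast<unsigned char>(src[i]))) ++i;
            tokens.push_back({"INT", src.substr(start, i - start), line});
        } else if (ch == 61) {                // '='
            if (i + 1 < len && src[i + 1] == '=') {
                tokens.push_back({"OP", "==", line});   // one character of lookahead
                i += 2;
            } else {
                tokens.push_back({"ASSIGN", "=", line});
                i += 1;
            }
        } else {
            ++i;                              // whitespace and unsupported characters
        }
    }
    return tokens;
}
```

Calling tokenize("x1 = 42") on this sketch yields three tokens: an IDENT, an ASSIGN, and an INT, all on line 1.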

Sprint 2: Parser

A recursive descent parser, written in SMS, for SMS.

The interesting constraint here: SMS has no forward declarations. Instead, all
top-level definitions are registered before main() runs, so a function can call
another one that is defined later in the file. Mutual recursion between
parseExpr and parseStmt therefore works, because both are defined at the
module level before any call site is reached.

The precedence chain follows the standard pattern, listed here from the
tightest-binding level to the loosest:

parsePrimary -> parsePostfix -> parseUnary -> parseMul ->
parseAdd -> parseCompare -> parseEquality -> parseAnd -> parseOr -> parseExpr
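
To show how such a chain gives operators their precedence, here is a small C++ sketch with only four of the levels (primary, unary, multiplication, addition). It is an illustration, not the ForgeIL parser: it works directly on a character string without whitespace, it evaluates the expression instead of building an AST, and evalExpr is a name invented for this example.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Forward declaration: unlike SMS, C++ needs one for the recursion through parentheses.
long parseAdd(const std::string& s, std::size_t& pos);

// Tightest level: an integer literal or a parenthesized expression.
long parsePrimary(const std::string& s, std::size_t& pos) {
    if (pos < s.size() && s[pos] == '(') {
        ++pos;                                        // consume '('
        long v = parseAdd(s, pos);                    // restart at the loosest level
        if (pos < s.size() && s[pos] == ')') ++pos;   // consume ')'
        return v;
    }
    long v = 0;
    while (pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos]))) {
        v = v * 10 + (s[pos] - '0');
        ++pos;
    }
    return v;
}

// Unary minus binds tighter than any binary operator.
long parseUnary(const std::string& s, std::size_t& pos) {
    if (pos < s.size() && s[pos] == '-') {
        ++pos;
        return -parseUnary(s, pos);
    }
    return parsePrimary(s, pos);
}

// '*' is handled one level below '+', so it groups first.
long parseMul(const std::string& s, std::size_t& pos) {
    long left = parseUnary(s, pos);
    while (pos < s.size() && s[pos] == '*') {
        ++pos;
        left = left * parseUnary(s, pos);
    }
    return left;
}

// Loosest level in this sketch: left-associative '+' and '-'.
long parseAdd(const std::string& s, std::size_t& pos) {
    long left = parseMul(s, pos);
    while (pos < s.size() && (s[pos] == '+' || s[pos] == '-')) {
        char op = s[pos++];
        long right = parseMul(s, pos);
        left = (op == '+') ? left + right : left - right;
    }
    return left;
}

// Hypothetical entry point for this example.
long evalExpr(const std::string& s) {
    std::size_t pos = 0;
    return parseAdd(s, pos);
}
```

Each level calls the next tighter one for its operands, which is why evalExpr("1+2*3") gives 7 and evalExpr("(1+2)*3") gives 9.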

One non-obvious detail: the parser cursor is a single-element array used as a
mutable integer box. SMS arrays are passed by shared reference - element
assignments inside helper functions are visible to the caller. This is how the
cursor advances across the recursive descent without needing a global variable.

fun curTok(cur, tokens) { return tokens[cur[0]] }
fun advance(cur)         { cur[0] = cur[0] + 1 }
fun consume(cur, tokens, expected) {
    var tok = curTok(cur, tokens)
    if (tok.type != expected) { ... }
    advance(cur)
    return tok
}
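
The same cursor trick can be approximated in C++ with a shared one-element vector. This is a sketch under assumptions, not a port: the Cursor alias is invented for the example, and throwing std::runtime_error on a mismatch is a choice made here, since the error branch in the SMS code above is elided.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical C++ mirror of the article's Token(type, value, line) data class.
struct Token {
    std::string type;
    std::string value;
    int line;
};

// One-element vector behind a shared_ptr: copies of the handle share the same box.
using Cursor = std::shared_ptr<std::vector<std::size_t>>;

const Token& curTok(Cursor cur, const std::vector<Token>& tokens) {
    return tokens.at((*cur)[0]);
}

// The handle is passed by value, yet the caller sees the increment.
void advance(Cursor cur) {
    (*cur)[0] = (*cur)[0] + 1;
}

// Assumed error handling: throw on a type mismatch and leave the cursor in place.
Token consume(Cursor cur, const std::vector<Token>& tokens, const std::string& expected) {
    Token tok = curTok(cur, tokens);
    if (tok.type != expected) {
        throw std::runtime_error("line " + std::to_string(tok.line) +
                                 ": expected " + expected + ", got " + tok.type);
    }
    advance(cur);
    return tok;
}
```

In idiomatic C++ a plain std::size_t& parameter would do the same job; the shared box is used here only because it is closer to how the SMS version behaves.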

24 tests. All green.

Sprint 3: LLVM IR Emitter

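The code generator takes an AST and produces a complete LLVM IR text string.
Integer-only subset: every SMS value is i64, and every variable lives in an
alloca stack slot that is read and written with load and store, so the emitter
never has to construct phi nodes. LLVM's mem2reg and SROA passes can promote
those slots to SSA registers when optimization is enabled; the clang -O0 build
used in the tests leaves them in memory.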

The context object accumulates emitted instructions:

data class Ctx(tempCnt, labelCnt, code, terminated, loops)

Because SMS data class instances use shared references (shared_ptr under the
hood), every helper function that mutates ctx.code or ctx.tempCnt has its
changes visible to every other function holding the same ctx. This is the
foundation of the entire emitter design.
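
Since the runtime is described as using shared_ptr, the pattern can be sketched directly in C++. The field names follow the Ctx data class above (the loops field is left out), while CtxRef, newTemp, and emitLoad are names assumed for this illustration and may not match the helpers in sprint3-codegen.sms.

```cpp
#include <memory>
#include <string>

// Hypothetical C++ counterpart of the article's Ctx data class (loops omitted).
struct Ctx {
    int tempCnt = 0;
    int labelCnt = 0;
    std::string code;
    bool terminated = false;
};

// Every helper receives a copy of the handle, never a copy of the context.
using CtxRef = std::shared_ptr<Ctx>;

// Hands out %t0, %t1, ... by bumping the shared counter.
std::string newTemp(CtxRef ctx) {
    return "%t" + std::to_string(ctx->tempCnt++);
}

// Appends one indented instruction line to the shared code buffer.
void emit(CtxRef ctx, const std::string& line) {
    ctx->code += "    " + line + "\n";
}

// Assumed helper: loads a variable's stack slot into a fresh temporary.
std::string emitLoad(CtxRef ctx, const std::string& var) {
    std::string reg = newTemp(ctx);
    emit(ctx, reg + " = load i64, i64* %" + var);
    return reg;
}
```

Two consecutive emitLoad calls on the same context return %t0 and %t1, and both load instructions end up in the one code string, because all helpers mutate the same object.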

A comparison compiles to an icmp followed by a zext - because LLVM
comparisons produce i1 and SMS uses i64 everywhere:

if (op == "<") { emit(ctx, cmpReg $ " = icmp slt i64 " $ left $ ", " $ right) }
emit(ctx, reg $ " = zext i1 " $ cmpReg $ " to i64")
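
A C++ sketch of the same two-instruction pattern, extended to all six comparison operators. Only the "<" to slt mapping is shown in the article; the other predicates are the usual signed LLVM ones and are an assumption about what the emitter does. The function names are made up for this example.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Maps a comparison operator to a signed LLVM icmp predicate.
// Only "<" -> slt comes from the article; the rest are assumed.
std::string icmpPredicate(const std::string& op) {
    static const std::map<std::string, std::string> preds = {
        {"<", "slt"}, {"<=", "sle"}, {">", "sgt"},
        {">=", "sge"}, {"==", "eq"}, {"!=", "ne"},
    };
    auto it = preds.find(op);
    if (it == preds.end()) {
        throw std::invalid_argument("unsupported comparison: " + op);
    }
    return it->second;
}

// Returns the icmp line (i1 result) followed by the zext that widens it to i64.
std::string emitCompare(const std::string& op,
                        const std::string& left, const std::string& right,
                        const std::string& cmpReg, const std::string& reg) {
    return "    " + cmpReg + " = icmp " + icmpPredicate(op) + " i64 " + left + ", " + right + "\n" +
           "    " + reg + " = zext i1 " + cmpReg + " to i64\n";
}
```

Because zext fills the upper bits with zeros, a true comparison becomes the i64 value 1 and a false one becomes 0.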

For a function like fun add(a, b) { return a + b }, the emitter produces:

define i64 @sms_add(i64 %_p_a, i64 %_p_b) {
entry:
    %a = alloca i64
    store i64 %_p_a, i64* %a
    %b = alloca i64
    store i64 %_p_b, i64* %b
    %t0 = load i64, i64* %a
    %t1 = load i64, i64* %b
    %t2 = add  i64 %t0, %t1
    ret i64 %t2
}
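
The first half of that output is mechanical enough to sketch. The C++ function below rebuilds the signature and the alloca/store pair per parameter, following the naming visible in the listing (sms_ prefix, %_p_ parameter names). emitFunctionHeader is a hypothetical name, and the real emitter may structure this step differently.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Builds "define i64 @sms_<name>(...) {", the entry label, and one
// alloca + store per parameter, matching the listing's naming scheme.
std::string emitFunctionHeader(const std::string& name,
                               const std::vector<std::string>& params) {
    std::string ir = "define i64 @sms_" + name + "(";
    for (std::size_t i = 0; i < params.size(); ++i) {
        if (i > 0) ir += ", ";
        ir += "i64 %_p_" + params[i];
    }
    ir += ") {\nentry:\n";
    for (const std::string& p : params) {
        // Copy each incoming value into its own stack slot so that later
        // assignments to the parameter are ordinary stores.
        ir += "    %" + p + " = alloca i64\n";
        ir += "    store i64 %_p_" + p + ", i64* %" + p + "\n";
    }
    return ir;
}
```

For emitFunctionHeader("add", {"a", "b"}) this reproduces the listing up to and including the second store; the loads, the add, and the ret come from the function body.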

If the source contains a fun main(), a C-compatible entry point is appended:

define i32 @main() {
entry:
    %ret64 = call i64 @sms_main()
    %ret32 = trunc i64 %ret64 to i32
    ret i32 %ret32
}

19 tests. All green.

Sprint 4: End-to-End

The final sprint is the proof. A new C API function -
sms_native_execute_string_result - captures the string value produced by the
SMS interpreter, instead of the integer it previously returned. This lets the
host application receive the generated IR text directly.

The test then does exactly what you would do on the command line:

// 1. Run the SMS compiler in the interpreter, capture the IR string
std::string ir = get_ir(load_all(), "fun main() { return 42 }");

// 2. Write to a temp .ll file
// 3. Run: clang -O0 -o /tmp/test_bin /tmp/test.ll
// 4. Run the binary, check exit code == 42

Exit code 42. Not chosen at random. The compiler's first words are a nod to
the only question that ever mattered.

forgeil_sprint4_tests: all tests passed (7)

The pipeline holds for arithmetic, if/else branches, and while loops. The tests
skip gracefully when clang is not available, so the suite still passes on CI
machines that do not have clang installed.
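
Steps 2 to 4, together with the skip-when-clang-is-missing behaviour, can be sketched in C++ roughly as follows. This is not the code from forgeil_sprint4_tests.cpp. It assumes a POSIX system (std::system, /dev/null, sys/wait.h), the name compileAndRun and the negative return codes are inventions of this example, and stem should be a path with a directory part such as /tmp/forgeil_sketch.

```cpp
#include <cstdlib>
#include <fstream>
#include <string>
#include <sys/wait.h>   // WIFEXITED / WEXITSTATUS (POSIX)

// Writes the IR to <stem>.ll, builds <stem>_bin with clang -O0, runs it,
// and returns its exit code.
// Returns -1 when clang is not available (the caller should skip the test)
// and -2 when compiling or running fails.
int compileAndRun(const std::string& ir, const std::string& stem) {
    if (std::system("clang --version > /dev/null 2>&1") != 0) {
        return -1;                                   // no clang: skip
    }
    const std::string llPath = stem + ".ll";
    const std::string binPath = stem + "_bin";
    {
        std::ofstream out(llPath);
        if (!out) return -2;
        out << ir;
    }                                                // file is closed here
    const std::string cmd =
        "clang -O0 -o " + binPath + " " + llPath + " > /dev/null 2>&1";
    if (std::system(cmd.c_str()) != 0) {
        return -2;                                   // clang rejected the IR
    }
    const int status = std::system(binPath.c_str());
    if (status == -1 || !WIFEXITED(status)) {
        return -2;                                   // could not run, or crashed
    }
    return WEXITSTATUS(status);                      // the program's exit code
}
```

A test would pass the IR string produced by the SMS compiler, treat -1 as "skipped", and compare any other result with the expected exit code. Keep in mind that exit codes are truncated to 8 bits on POSIX systems, so only values from 0 to 255 survive the round trip.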

What the Numbers Look Like

Sprint 1 - Lexer        12 tests  ✓
Sprint 2 - Parser       24 tests  ✓
Sprint 3 - Code gen     19 tests  ✓
Sprint 4 - Self-host     7 tests  ✓
---------------------------------
Total                   62 tests  all green

What This Is Not

SMS is not trying to replace LLVM's front-end infrastructure. The self-hosted
compiler covers the integer-only subset of the language: functions, variables,
if/else, while, break/continue, arithmetic, and comparisons. Strings, arrays,
data classes, and the standard library are outside its current scope.

The point is not feature completeness. The point is that the language has
enough expressive power to reason about itself. That is a different claim - and
a meaningful one.

The Code

Everything is open source under GPL-3.0 (with a commercial option):

sms-cpp: codeberg.org/CrowdWare/sms-cpp
Lexer: forgeil/sprint1-lexer.sms
Parser: forgeil/sprint2-parser.sms
Code gen: forgeil/sprint3-codegen.sms
Tests: tests/forgeil_sprint{1..4}_tests.cpp

If you want to try it:

git clone https://codeberg.org/CrowdWare/sms-cpp.git
git clone https://codeberg.org/CrowdWare/sml-cpp.git
cd sms-cpp
cmake -B build -DBUILD_TESTING=ON -DSML_CPP_DIR=../sml-cpp
cmake --build build
cd build && ctest -R forgeil --output-on-failure

Forge is being built in public at crowdware.info.
SMS, ForgeIL, and the self-hosted compiler are part of a longer project:
a UI framework that runs anywhere without asking anything of the user's machine.

Top comments (4)

GoDaddy LLC • May 11

Building a compiler in the same language it compiles is one of those “either this works or the language gets exposed instantly” moments 😄. Really impressive how SMS went from interpreted scripting language to self-hosting compiler pipeline with lexer, parser, and LLVM IR generation fully implemented in itself. The recursive descent parser and shared-reference context design are especially elegant — simple ideas used very effectively. Also appreciated the honesty about scope: integer-only subset, focused goals, clean architecture. A lot of projects try to look smart with complexity, while this project looks smart because of its simplicity. And honestly, “62 tests, all green” might be the most beautiful love story in software engineering 😂. Self-hosting is a huge milestone — this feels less like a prototype and more like a language proving it belongs in the room.

Art • May 11

Just had a call with a dev friend today... He told me that we need a bit more than just a compiler. So we made it self-hosted. Tutorials will follow in the coming days. He also wanted some binaries for easy testing... That will all come soonish. A playground for the interpreter might follow soon too. It's similar to Kotlin for some reason 😉

GoDaddy LLC • May 11

That’s actually the point where a language starts feeling “real” 😄. Self-hosting changes the conversation completely because it proves the language can reason about its own tooling stack. The Kotlin-like feel is interesting too — lightweight syntax with LLVM underneath is a pretty strong combination. A playground and prebuilt binaries will definitely help adoption since developers love experimenting before compiling anything locally 😂. I’d genuinely like to follow the project as it evolves, especially the compiler/runtime architecture side. You seem deeply invested in language design and systems engineering — would love to connect and exchange ideas sometime if you’re open to it.
Can I get your contact info?

Art • May 11 • Edited

Join our Dojo then. I want to create a new OS based on Ahimsa.
dev.to/artanidos/forgeos-dojo-lear...