JIT Compilation in Neutron¶
Neutron implements a multi-tier JIT (Just-In-Time) compilation system based on the research paper "A Lightweight Method for Generating Multi-Tier JIT Compilation Virtual Machine". The JIT compiles frequently executed bytecode ("hot spots") into native machine code for significant performance gains.
Supported architectures: x86-64 (Intel/AMD) and ARM64/AArch64 (Apple Silicon, Linux ARM).
Architecture Overview¶
The JIT uses a three-tier execution model:
┌─────────────────────────────────────────────────────────┐
│ Execution Pipeline │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐│
│ │ Interpreter │──▶│ Tier-1 │──▶│ Tier-2 ││
│ │ (Bytecode) │ │ (Threaded) │ │ (Tracing JIT)││
│ └──────────────┘ └──────────────┘ └──────────────┘│
│ ▲ │ │
│ │ Guard Failure │ │
│ └──────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐│
│ │ Hot-Spot Profiler ││
│ │ Tracks execution counts per (method, bytecode_pc) ││
│ └─────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────┘
| Tier | Name | Strategy | Compile Speed | Code Quality |
|---|---|---|---|---|
| 0 | Interpreter | Bytecode dispatch | N/A | Baseline |
| 1 | Tier-1 | Subroutine-threaded code | Fast | Moderate |
| 2 | Tier-2 | Tracing JIT → native x86-64 / ARM64 | Slower | Aggressive |
Components¶
File Layout¶
include/jit/
├── jit_config.h # Compilation thresholds, cache sizes, feature flags
├── jit_manager.h # MultiTierJITManager orchestrator
├── jit_memory.h # Platform-specific executable memory (mmap/VirtualAlloc)
├── jit_profiler.h # HotSpotProfiler for detecting hot loops/methods
├── jit_tier1.h # Tier1Compiler (threaded code generation)
├── jit_tier2.h # Tier2Compiler (tracing JIT with IR + native codegen)
├── jit_codegen.h # X86_64CodeGen (raw machine code emitter)
└── jit_codegen_arm64.h # AArch64CodeGen (ARM64 machine code emitter)
src/jit/
├── jit_manager.cpp # Tier coordination, compilation requests, statistics
├── jit_memory.cpp # JIT memory allocation (mmap/MAP_JIT, mprotect, W^X)
├── jit_profiler.cpp # Execution counting, warmup detection
├── jit_tier1.cpp # Threaded code compilation and inline caching
├── jit_tier2.cpp # Trace recording, IR optimization, x86-64 codegen (~1800 lines)
├── jit_tier2_arm64.cpp # ARM64 native codegen (compiled only on AArch64 targets)
└── jit_codegen.cpp # x86-64 instruction encoding (REX, ModRM, SSE2)
Note:
jit_tier2_arm64.cpp is guarded by #if defined(__aarch64__) || defined(__arm64__) and only compiles on ARM64 targets. The x86-64 codegen in jit_tier2.cpp is untouched; both backends share the same IR and optimization pipeline.
Hot-Spot Profiler¶
Class: HotSpotProfiler (jit_profiler.h)
The profiler detects hot code by counting backward-jump executions (loop iterations) and method invocations.
How It Works¶
- Every backward jump (loop) and method call is recorded via recordExecution(method_id, bytecode_offset).
- When a location's execution count exceeds a threshold, it is marked "hot" and promoted to the next tier.
- The profiler tracks per-method statistics (call count, total execution time) and per-location profiles (execution count, current tier).
Thresholds¶
| Parameter | Value | Description |
|---|---|---|
| TIER1_COMPILATION_THRESHOLD | 50 | Backward jumps to trigger Tier-1 |
| TIER2_COMPILATION_THRESHOLD | 50 | Tier-1 executions to trigger Tier-2 |
| PROFILE_SAMPLE_INTERVAL | 100 | Profiling sample frequency |
| MAX_PROFILED_METHODS | 10,000 | Max tracked methods |
| MAX_PROFILED_LOOPS | 5,000 | Max tracked loops |
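The counting scheme above can be sketched as follows. This is a minimal illustration, not Neutron's actual HotSpotProfiler: the class name SketchProfiler, the key packing, the `>=` comparison, and the reset of the counter on promotion are all assumptions. Only recordExecution and the two threshold values come from this page.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Threshold values taken from the table above.
constexpr uint32_t TIER1_COMPILATION_THRESHOLD = 50;
constexpr uint32_t TIER2_COMPILATION_THRESHOLD = 50;

enum class Tier { Interpreter, Tier1, Tier2 };

// Hypothetical per-location profile: executions counted within the current tier.
struct LocationProfile {
    uint32_t count = 0;
    Tier tier = Tier::Interpreter;
};

class SketchProfiler {
public:
    // Records one backward jump or call and returns the location's tier afterwards.
    Tier recordExecution(uint32_t method_id, uint32_t bytecode_offset) {
        LocationProfile& p = profiles_[key(method_id, bytecode_offset)];
        ++p.count;
        if (p.tier == Tier::Interpreter && p.count >= TIER1_COMPILATION_THRESHOLD) {
            p.tier = Tier::Tier1;
            p.count = 0;  // assumption: Tier-1 executions are counted from zero
        } else if (p.tier == Tier::Tier1 && p.count >= TIER2_COMPILATION_THRESHOLD) {
            p.tier = Tier::Tier2;
        }
        return p.tier;
    }

private:
    // Packs (method, bytecode_pc) into one 64-bit map key.
    static uint64_t key(uint32_t method_id, uint32_t offset) {
        return (static_cast<uint64_t>(method_id) << 32) | offset;
    }
    std::unordered_map<uint64_t, LocationProfile> profiles_;
};
```

Calling recordExecution(1, 10) fifty times moves that location to Tier-1 in this sketch, and fifty further calls move it to Tier-2; other locations are unaffected.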
Warmup Detection¶
The profiler tracks a warmup phase (first 5 seconds or until Tier-2 compilations begin). During warmup, the JIT gathers type feedback and execution patterns before committing to aggressive optimizations.
Tier-1: Lightweight Threaded Code¶
Class: Tier1Compiler (jit_tier1.h)
Tier-1 provides a fast first compilation by converting bytecode into subroutine-threaded code — a sequence of direct CALL instructions to pre-existing bytecode handler functions.
Compilation Strategy¶
Bytecode: OP_GET_LOCAL 0 | OP_CONSTANT 1 | OP_ADD | OP_SET_LOCAL 0
Threaded:
MOV RAX, &handler_get_local ; OP_GET_LOCAL 0
CALL RAX
MOV RAX, &handler_constant ; OP_CONSTANT 1
CALL RAX
MOV RAX, &handler_add ; OP_ADD
CALL RAX
MOV RAX, &handler_set_local ; OP_SET_LOCAL 0
CALL RAX
Each bytecode instruction becomes a MOV RAX, imm64; CALL RAX pair that invokes the interpreter's handler function directly, eliminating the dispatch overhead.
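As a rough illustration of the byte sequence involved, the sketch below emits the MOV RAX, imm64; CALL RAX pair for a list of handler addresses. The function names are hypothetical and the sketch leaves out everything a real Tier-1 compiler needs around the calls (argument setup, stack alignment, operands); only the two instruction encodings are standard x86-64.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Appends MOV RAX, imm64 (48 B8 + 8-byte immediate) followed by CALL RAX (FF D0).
inline void emitHandlerCall(std::vector<uint8_t>& code, uint64_t handler_addr) {
    code.push_back(0x48);  // REX.W prefix
    code.push_back(0xB8);  // MOV RAX, imm64
    for (int i = 0; i < 8; ++i) {
        code.push_back(static_cast<uint8_t>(handler_addr >> (8 * i)));  // little-endian
    }
    code.push_back(0xFF);  // CALL r/m64
    code.push_back(0xD0);  // ModRM: mod=11, opcode extension /2, rm=RAX
}

// Emits one 12-byte call pair per handler address, then RET (C3).
inline std::vector<uint8_t> emitThreadedCode(const std::vector<uint64_t>& handlers) {
    std::vector<uint8_t> code;
    for (uint64_t h : handlers) {
        emitHandlerCall(code, h);
    }
    code.push_back(0xC3);
    return code;
}
```

Each bytecode instruction costs 12 bytes of native code in this scheme, so the four-instruction example above would occupy 48 bytes plus the final RET.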
Inline Caching¶
Tier-1 implements monomorphic inline caches for method dispatch:
- Each OP_CALL site gets a cache entry storing the expected method ID and handler address.
- On a cache hit (same method), dispatch is a direct CALL; no lookup is needed.
- On a cache miss, a dynamic lookup is performed and the cache is updated.
- Hit/miss ratios are tracked per cache entry for profiling.
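The hit/miss behavior listed above could look roughly like this. The struct layout, the Handler signature, and the MethodTable stand-in for the VM's slow-path lookup are assumptions made for the example, not Neutron's actual types.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

using Handler = int (*)(int);

// Hypothetical cache entry for a single OP_CALL site.
struct InlineCacheEntry {
    uint32_t expected_method_id = 0;
    Handler handler = nullptr;
    uint64_t hits = 0;
    uint64_t misses = 0;
};

// Stand-in for the VM's dynamic method lookup.
using MethodTable = std::unordered_map<uint32_t, Handler>;

inline Handler dispatch(InlineCacheEntry& cache, const MethodTable& table, uint32_t method_id) {
    if (cache.handler != nullptr && cache.expected_method_id == method_id) {
        ++cache.hits;  // hit: reuse the cached handler without a lookup
        return cache.handler;
    }
    ++cache.misses;  // miss: dynamic lookup, then refill the cache
    auto it = table.find(method_id);
    if (it == table.end()) {
        return nullptr;
    }
    cache.expected_method_id = method_id;
    cache.handler = it->second;
    return cache.handler;
}

// Two toy handlers used to exercise the cache.
inline int addOne(int x) { return x + 1; }
inline int twice(int x) { return x * 2; }
```

A monomorphic cache holds one target, so a call site that alternates between two methods misses on every call; the per-entry hit and miss counters make that pattern visible.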
Shallow Tracing¶
Tier-1 uses shallow tracing to safely profile both branches of conditionals without side effects. Instructions with side effects (OP_SET_LOCAL, OP_SET_GLOBAL, OP_CALL, OP_THROW) are wrapped in a we_are_jitted() guard that skips execution during trace recording, preventing state corruption.
Code Cache¶
| Parameter | Value |
|---|---|
| TIER1_CODE_CACHE_SIZE | 10 MB |
| TIER1_FUNCTION_CACHE_CAPACITY | 1,000 functions |
Tier-2: Tracing JIT Compiler¶
Class: Tier2Compiler (jit_tier2.h)
Tier-2 is the aggressive optimizing compiler. It records execution traces (linear sequences of actually-executed instructions through a loop) and compiles them to optimized native machine code for x86-64 or ARM64.
Compilation Pipeline¶
Bytecode → recordTrace() → IR (intermediate representation)
↓
optimizeTrace()
• Type specialization (guards)
• Loop unrolling
• Constant folding
• Dead code elimination
• Common subexpression elimination
• Store-load forwarding
↓
compileTrace() → Native x86-64 / ARM64
↓
executeTrace() → Direct execution
Phase 1: Trace Recording¶
recordTrace() walks through bytecode starting at a loop entry point and records all instructions until hitting an OP_LOOP (backward jump) or OP_RETURN. This captures the hot path through the loop body.
- Nested control flow (jumps, conditionals) is tracked via depth counting.
- The trace is bounded by MAX_TRACE_LENGTH (5,000 instructions).
Phase 2: Bytecode → IR Conversion¶
convertToIR() translates bytecode instructions into a typed intermediate representation. Key translations:
| Bytecode | IR Instruction | Notes |
|---|---|---|
| OP_GET_LOCAL / OP_LOAD_LOCAL_0..3 | LOAD_LOCAL | Slot number in operand1 |
| OP_SET_LOCAL | STORE_LOCAL | |
| OP_GET_GLOBAL | LOAD_GLOBAL | Resolved to direct Value* pointer at IR-build time |
| OP_SET_GLOBAL | STORE_GLOBAL | Also resolved to Value* |
| OP_CONSTANT / OP_CONST_INT8 / OP_CONST_ZERO / OP_CONST_ONE | LOAD_CONST | Value pointer stored in data field |
| OP_ADD / OP_ADD_INT | ADD | Both generic and specialized map to same IR |
| OP_LESS_JUMP | LESS + JUMP_IF_FALSE | Fused opcodes are expanded |
| OP_INC_LOCAL_INT | LOAD_LOCAL + LOAD_CONST(1) + ADD + STORE_LOCAL | Super-instructions are expanded |
| OP_INCREMENT_LOCAL | Same 4-op expansion | |
| OP_LOOP | LOOP_BACK | Marks end of trace |
Global variable resolution: At IR-build time, OP_GET_GLOBAL and OP_SET_GLOBAL look up the global variable name in the VM's globals map and store a direct Value* pointer in the IR instruction's data field. This eliminates hash-map lookups at runtime — the JIT-compiled code reads/writes the Value struct directly via its memory address.
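A minimal sketch of this resolution step, assuming the globals live in a std::unordered_map keyed by name. The Value and IRInstruction types here are simplified stand-ins, and resolveGlobal is a hypothetical helper rather than part of convertToIR().

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Simplified stand-in: the real Value also carries a type tag.
struct Value {
    double number = 0.0;
};

struct IRInstruction {
    enum Op { LOAD_GLOBAL, STORE_GLOBAL } op;
    Value* data;  // resolved global, mirroring the "data field" described above
};

using Globals = std::unordered_map<std::string, Value>;

// Looks the name up once at IR-build time. Returns false when the global is
// unknown, which corresponds to the "unresolved globals" bail-out.
inline bool resolveGlobal(Globals& globals, const std::string& name,
                          IRInstruction::Op op, IRInstruction& out) {
    auto it = globals.find(name);
    if (it == globals.end()) {
        return false;
    }
    out.op = op;
    out.data = &it->second;  // element addresses stay valid across rehashing
    return true;
}
```

The approach relies on the container never moving its elements: std::unordered_map keeps pointers to elements valid when it rehashes, but erasing a global would leave any compiled trace holding a dangling pointer, so a real implementation has to invalidate such traces or never erase.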
Phase 3: Optimization Passes¶
Optimizations are applied in order of impact:
1. Type Specialization (Guards)¶
Inserts GUARD_TYPE instructions at trace entry to verify runtime types match the profiled types. If a guard fails at runtime, execution falls back to the interpreter.
2. Loop Unrolling¶
For very hot loops (execution count > 2 × TIER2_COMPILATION_THRESHOLD), the loop body is duplicated N times with UNROLL_MARKER separators. The final LOOP_BACK jumps to the beginning of the unrolled body.
3. Constant Folding & Dead Code Elimination¶
- UNROLL_MARKER instructions are stripped.
- Consecutive LOAD_CONST pairs are candidates for folding.
4. Common Subexpression Elimination (CSE)¶
Instructions are hashed by (opcode, operand1, operand2). Duplicate pure computations (ADD, MULTIPLY, EQUAL, LESS) are identified for potential reuse.
5. Store-Load Forwarding¶
The pattern STORE(X) / POP / LOAD(X) is optimized by eliminating the POP and LOAD — the stored value remains available from the prior STORE.
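The rewrite can be sketched as a single pass over the IR. The IROp and IRInstr types are simplified assumptions, and the sketch assumes that no jump targets the removed POP or LOAD, which a real pass would have to check.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class IROp { LOAD_LOCAL, STORE_LOCAL, POP, ADD, LOOP_BACK };

struct IRInstr {
    IROp op;
    int operand1;  // local slot for LOAD_LOCAL / STORE_LOCAL, unused otherwise
};

// Rewrites STORE_LOCAL(X) / POP / LOAD_LOCAL(X) to just STORE_LOCAL(X).
// Assumes STORE_LOCAL leaves the stored value on the operand stack.
inline std::vector<IRInstr> forwardStores(const std::vector<IRInstr>& in) {
    std::vector<IRInstr> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out.push_back(in[i]);
        const bool match = in[i].op == IROp::STORE_LOCAL && i + 2 < in.size() &&
                           in[i + 1].op == IROp::POP &&
                           in[i + 2].op == IROp::LOAD_LOCAL &&
                           in[i + 2].operand1 == in[i].operand1;
        if (match) {
            i += 2;  // skip the POP and the reload of the same slot
        }
    }
    return out;
}
```

A reload of a different slot does not match, so STORE_LOCAL(1) / POP / LOAD_LOCAL(2) is left unchanged.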
Phase 4: Native Code Generation¶
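compileTrace() generates native machine code from the optimized IR. On x86-64 the generated code follows the System V AMD64 ABI; the ARM64 backend uses the equivalent conventions (first argument in X0, callee-saved X19-X24 and D8-D14).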
Validation¶
Before compilation, the trace is validated:
- Only numeric operations are supported (non-numeric constants → bail out)
- Nested loops are rejected (single LOOP_BACK only)
- Unresolved globals → bail out
- Failed traces are recorded to avoid retrying
Register Allocation¶
x86-64:
RBP = ExecutionFrame* (preserved across loop)
RBX = Value* locals (base pointer for local variable access)
R12-R15 = Cached global Value* addresses (top 4 most-accessed globals)
RAX = Scratch register / integer cache for bitwise ops
RCX = Second scratch for bitwise operand B
RDX = Scratch for constants (avoids clobbering RAX integer cache)
XMM0-14 = Operand stack (max 15 slots, doubles in SSE2)
XMM15 = Scratch register for modulo and negation
ARM64 (AArch64):
X20 = ExecutionFrame* (callee-saved, = RBP on x86-64)
X19 = Value* locals (callee-saved, = RBX on x86-64)
X21-X24 = Cached global Value* addresses (callee-saved, = R12-R15 on x86-64)
X9 = Scratch / integer cache for bitwise ops (= RAX on x86-64)
X1 = Temp for bitwise operand B and constant loading (= RCX on x86-64)
X0 = First argument (ExecutionFrame*), also temp
D0-D14 = Operand stack (D8-D14 are callee-saved, saved/restored in prologue/epilogue)
D15 = Scratch FP register (= XMM15 on x86-64)
Global Register Caching: The compiler analyzes all LOAD_GLOBAL/STORE_GLOBAL instructions in the trace, counts access frequency, and assigns the top 4 most-accessed global variables to callee-saved registers (R12–R15 on x86-64, X21–X24 on ARM64). This eliminates the MOV RAX, imm64 (10 bytes on x86-64) or MOVZ+MOVK sequence (up to 16 bytes on ARM64) for every access of these hot global variables.
Integer Register Cache: For bitwise operation chains (AND, OR, XOR, SHIFT), the compiler tracks whether the integer scratch register (RAX on x86-64, X9 on ARM64) already holds the int64 conversion of an operand-stack register. Consecutive bitwise operations skip the redundant CVTTSD2SI (x86-64) or FCVTZS (ARM64) conversion.
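The selection step of Global Register Caching amounts to a frequency count followed by a top-4 pick. The sketch below assumes each global access in the trace is represented by its resolved address; the function name and the tie-breaking rule are illustrative choices.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

constexpr std::size_t kCachedGlobalRegisters = 4;  // R12-R15 or X21-X24

// `accesses` holds one address per LOAD_GLOBAL / STORE_GLOBAL in the trace.
// Returns up to four addresses, most frequently accessed first.
inline std::vector<const void*> pickCachedGlobals(const std::vector<const void*>& accesses) {
    std::unordered_map<const void*, std::size_t> counts;
    for (const void* addr : accesses) {
        ++counts[addr];
    }

    using Entry = std::pair<const void*, std::size_t>;
    std::vector<Entry> ranked(counts.begin(), counts.end());
    std::sort(ranked.begin(), ranked.end(), [](const Entry& a, const Entry& b) -> bool {
        if (a.second != b.second) {
            return a.second > b.second;  // higher access count first
        }
        return std::less<const void*>()(a.first, b.first);  // deterministic tie-break
    });

    std::vector<const void*> result;
    for (std::size_t i = 0; i < ranked.size() && i < kCachedGlobalRegisters; ++i) {
        result.push_back(ranked[i].first);
    }
    return result;
}
```

Globals outside the returned set would still be reachable, but each access pays for materializing the address again.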
Code Shape¶
The generated native code has this structure:
x86-64
; ──── PROLOGUE ────
push rbx, rbp, r12, r13, r14, r15 ; save callee-saved
mov rbp, rdi ; RBP = ExecutionFrame*
mov rbx, [rbp+32] ; RBX = local_variables ptr
mov r12, <global_addr_1> ; cache hot globals
mov r13, <global_addr_2>
...
; ──── LOOP START ────
loop_top:
movsd xmm0, [rbx + slot*16 + 8] ; LOAD_LOCAL → SSE2 load
movsd xmm1, [rbx + slot*16 + 8] ; LOAD_LOCAL
ucomisd xmm0, xmm1 ; LESS comparison
jae exit_stub ; if not less, exit loop
addsd xmm2, xmm3 ; ADD → SSE2 add
movsd [rbx + slot*16 + 8], xmm2 ; STORE_LOCAL → SSE2 store
movsd [r12 + 8], xmm4 ; STORE_GLOBAL via cached register
jmp loop_top ; LOOP_BACK
; ──── EXIT STUB ────
exit_stub:
; locals already synced (STORE_LOCAL writes to memory every iteration)
pop r15, r14, r13, r12, rbp, rbx ; restore callee-saved
ret
ARM64 (AArch64)
; ──── PROLOGUE ────
stp x30, x19, [sp, #-64]! ; save LR + callee-saved GP regs (frame kept 16-byte aligned)
stp x20, x21, [sp, #16]
stp x22, x23, [sp, #32]
str x24, [sp, #48]
stp d8, d9, [sp, #-64]! ; save callee-saved FP regs (D8-D14)
stp d10, d11, [sp, #16]
stp d12, d13, [sp, #32]
str d14, [sp, #48]
mov x20, x0 ; X20 = ExecutionFrame*
ldr x19, [x20, #32] ; X19 = local_variables ptr
movz x21, #<global_addr_1_lo> ; cache hot globals
movk x21, #<global_addr_1_hi>, lsl #16
...
; ──── LOOP START ────
loop_top:
ldr d0, [x19, #slot*16 + 8] ; LOAD_LOCAL → FP load
ldr d1, [x19, #slot*16 + 8] ; LOAD_LOCAL
fcmp d0, d1 ; LESS comparison (sets NZCV)
b.ge exit_stub ; if not less, exit loop
fadd d2, d2, d3 ; ADD → FP add
str d2, [x19, #slot*16 + 8] ; STORE_LOCAL → FP store
str d4, [x21, #8] ; STORE_GLOBAL via cached register
b loop_top ; LOOP_BACK
; ──── EXIT STUB ────
exit_stub:
; restore callee-saved FP regs
ldr d14, [sp, #48]
ldp d12, d13, [sp, #32]
ldp d10, d11, [sp, #16]
ldp d8, d9, [sp], #64
; restore callee-saved GP regs
ldr x24, [sp, #48]
ldp x22, x23, [sp, #32]
ldp x20, x21, [sp, #16]
ldp x30, x19, [sp], #64
ret ; return via LR (X30)
Memory Layout¶
Locals are accessed as Value structs at [RBX + slot * 16 + 8] on x86-64 and [X19, #slot*16 + 8] on ARM64:
- Each Value is 16 bytes (8-byte type tag + 8-byte union)
- Offset +8 reaches the as.number field (the double payload)
Globals are accessed through direct Value* pointers, either cached in R12–R15 (X21–X24 on ARM64) or materialized per access via MOV RAX, imm64 (MOVZ + MOVK on ARM64).
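The addressing arithmetic can be checked against a struct with the described shape. The field names below are assumptions based on the as.number reference above; the real Value definition may differ in everything except the 16-byte size and the payload at offset 8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumed layout: 8-byte type tag followed by an 8-byte payload union.
struct Value {
    uint64_t type;
    union {
        double number;
        void* object;
        bool boolean;
    } as;
};

static_assert(sizeof(Value) == 16, "each Value is 16 bytes");
static_assert(offsetof(Value, as) == 8, "payload starts at offset 8");

// Byte displacement the generated code uses for a local slot: slot * 16 + 8.
constexpr std::size_t localNumberOffset(std::size_t slot) {
    return slot * sizeof(Value) + offsetof(Value, as);
}

// Reads a local's double the way the JIT does: base pointer plus displacement.
inline double loadLocalNumber(const Value* locals, std::size_t slot) {
    const unsigned char* base = reinterpret_cast<const unsigned char*>(locals);
    double result;
    std::memcpy(&result, base + localNumberOffset(slot), sizeof result);
    return result;
}
```

Because the generated code skips the type tag entirely, it is only valid while the slot actually holds a number, which is what the type guards at trace entry are for.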
Executable Memory¶
Generated code is allocated and made executable via the JITMemory class (jit_memory.h):
| Platform | Allocation | Execution | Cache Coherency |
|---|---|---|---|
| Linux x86-64 | mmap(PROT_READ\|PROT_WRITE) | mprotect(+PROT_EXEC) | Not needed (hardware keeps the instruction cache coherent) |
| Linux ARM64 | mmap(PROT_READ\|PROT_WRITE) | mprotect(+PROT_EXEC) | __builtin___clear_cache() |
| macOS ARM64 | mmap(MAP_JIT) | pthread_jit_write_protect_np() | __builtin___clear_cache() |
| Windows | VirtualAlloc(MEM_COMMIT) | VirtualProtect(PAGE_EXECUTE_READWRITE) | FlushInstructionCache() |
ARM64 note: ARM64 has separate instruction and data caches (Harvard-style) that the hardware does not keep coherent automatically. After writing machine code to memory, __builtin___clear_cache() must be called to ensure the instruction cache sees the new code. macOS additionally enforces W^X (write XOR execute) via MAP_JIT + pthread_jit_write_protect_np() toggling.
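The Linux rows of the table correspond to roughly the following sequence. This is a sketch for POSIX systems with GCC or Clang, not the JITMemory class: the function names are hypothetical, the macOS MAP_JIT path and the Windows path are left out, and the code reports failure instead of executing when the OS refuses the mapping or the protection change.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <sys/mman.h>

// Copies `code` into fresh pages, flips them from RW to RX, and calls the
// result as int(*)(). Returns false if the mapping or mprotect is refused.
inline bool runMachineCode(const uint8_t* code, std::size_t size, int& result) {
    void* mem = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) {
        return false;
    }
    std::memcpy(mem, code, size);
    if (mprotect(mem, size, PROT_READ | PROT_EXEC) != 0) {  // W^X: never RWX
        munmap(mem, size);
        return false;
    }
    char* begin = static_cast<char*>(mem);
    __builtin___clear_cache(begin, begin + size);  // required on ARM64, no-op on x86-64
    result = reinterpret_cast<int (*)()>(mem)();
    munmap(mem, size);
    return true;
}

// Runs a tiny "return 42" function on the architectures this page covers.
inline bool runReturn42(int& result) {
#if defined(__x86_64__)
    const uint8_t code[] = {0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3};  // mov eax, 42; ret
    return runMachineCode(code, sizeof code, result);
#elif defined(__aarch64__)
    const uint8_t code[] = {0x40, 0x05, 0x80, 0x52,   // mov w0, #42
                            0xC0, 0x03, 0x5F, 0xD6};  // ret
    return runMachineCode(code, sizeof code, result);
#else
    (void)result;
    return false;
#endif
}
```

Mapping the pages writable first and executable only afterwards keeps them from ever being writable and executable at the same time, which is the W^X property referred to above.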
Code Cache¶
| Parameter | Value |
|---|---|
| TIER2_CODE_CACHE_SIZE | 50 MB |
| MAX_TRACE_LENGTH | 5,000 instructions |
| MAX_LOOP_TRACES | 500 |
JIT Manager¶
Class: MultiTierJITManager (jit_manager.h)
The manager orchestrates the entire JIT pipeline:
- Compilation Requests: requestCompilation() decides whether to compile a method to Tier-1 or Tier-2 based on profiler data.
- Tier Transitions: transitionTier() handles the state transition when code moves between execution tiers.
- Execution Dispatch: executeCompiledCode() runs compiled code at the appropriate tier, with on-demand Tier-2 compilation if no trace exists.
- Monitoring: Optional compilation event listeners track TIER1_COMPILATION_TRIGGERED, TIER2_COMPILATION_COMPLETED, TIER_TRANSITION, COMPILATION_FAILED, etc.
- Statistics: Tracks compilation counts, compilation time, cache sizes, tier distribution (time spent in interpreter vs Tier-1 vs Tier-2).
Tier Promotion Criteria¶
| Transition | Condition |
|---|---|
| Interpreter → Tier-1 | Method call count ≥ TIER1_COMPILATION_THRESHOLD (50) |
| Tier-1 → Tier-2 | Call count ≥ TIER2_COMPILATION_THRESHOLD (50) AND total execution time > 10ms |
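The two promotion rules can be written as one small decision function. The MethodStats struct, the nextTier name, the TIER2_MIN_EXECUTION_MS constant, and the INTERPRETER and TIER2 enumerator names are assumptions; only the thresholds and conditions come from the table.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t TIER1_COMPILATION_THRESHOLD = 50;
constexpr uint64_t TIER2_COMPILATION_THRESHOLD = 50;
constexpr double TIER2_MIN_EXECUTION_MS = 10.0;  // hypothetical name for the 10 ms bound

enum class CompilationTier { INTERPRETER, TIER1, TIER2 };

// Hypothetical view of the per-method statistics kept by the profiler.
struct MethodStats {
    uint64_t call_count;
    double total_execution_ms;
};

// Returns the tier a method should run in next, given its current tier.
inline CompilationTier nextTier(CompilationTier current, const MethodStats& s) {
    switch (current) {
        case CompilationTier::INTERPRETER:
            return s.call_count >= TIER1_COMPILATION_THRESHOLD ? CompilationTier::TIER1
                                                               : current;
        case CompilationTier::TIER1:
            return (s.call_count >= TIER2_COMPILATION_THRESHOLD &&
                    s.total_execution_ms > TIER2_MIN_EXECUTION_MS)
                       ? CompilationTier::TIER2
                       : current;
        case CompilationTier::TIER2:
            return current;  // already at the top tier
    }
    return current;
}
```

The execution-time condition keeps frequently called but very cheap methods in Tier-1, where the cost of trace compilation would not pay off.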
Tier-1 Execution (Updated March 2026)¶
As of March 2026, Tier-1 execution is fully implemented with on-demand compilation:
- Code Caching: Compiled Tier-1 code is cached in tier1_code_cache_ for reuse
- On-Demand Compilation: When executeCompiledCode() is called for Tier-1:
  - Check cache for existing compiled code
  - If not found, compile on-demand using available bytecode handlers
  - Cache the result for future executions
- Fallback: If compilation fails, gracefully falls back to interpreter execution
- Reset Support: Tier-1 cache is properly cleared during JIT reset
// Simplified execution flow
case CompilationTier::TIER1: {
    auto it = tier1_code_cache_.find(method_id);
    if (it == tier1_code_cache_.end() && frame.chunk) {
        // Compile on-demand
        auto tier1_code = tier1_compiler_->compile(method_id, *frame.chunk, handlers);
        if (tier1_code) {
            tier1_code_cache_[method_id] = std::move(tier1_code);
            // Refresh the iterator so the new code runs on this call
            it = tier1_code_cache_.find(method_id);
        }
    }
    // Execute compiled code if available
    if (it != tier1_code_cache_.end() && it->second) {
        success = tier1_compiler_->execute(*it->second, &frame);
    } else {
        // Fallback to interpreter
        success = true;
    }
    break;
}
Code Generators¶
x86-64 Code Generator¶
Class: X86_64CodeGen (jit_codegen.h)
Low-level utility class that emits individual x86-64 instructions with correct encoding:
General Purpose:
- MOV r64, r64 / MOV r64, imm64 / MOV r64, [mem] / MOV [mem], r64
- ADD r64, r64 / CMP r64, r64 / TEST r64, r64
- PUSH r64 / POP r64 / LEA r64, [mem]
- CALL rel32 / CALL r64 / RET
- JMP rel32 / JE rel32
SSE2 Double Precision:
- ADDSD / SUBSD / MULSD / DIVSD (arithmetic)
- MOVSD xmm, [mem] / MOVSD [mem], xmm (load/store)
- UCOMISD xmm, xmm (comparison, sets CPU flags)
Encoding Utilities:
- emitRexPrefix() — REX prefix for 64-bit operations and extended registers
- encodeModRM() — ModRM byte encoding (mod, reg, rm fields)
- makeExecutable() — Platform-specific memory protection for W^X
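To make the REX and ModRM fields concrete, the sketch below encodes a 64-bit register-to-register MOV. The helper names are hypothetical and deliberately differ from emitRexPrefix() and encodeModRM(), whose real signatures are not shown on this page; the bit layouts themselves are standard x86-64.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Register numbers follow the hardware encoding: RAX=0 ... RDI=7, R8=8 ... R15=15.
enum Reg : uint8_t { RAX = 0, RCX = 1, RDX = 2, RBX = 3, RSP = 4, RBP = 5, RSI = 6, RDI = 7, R12 = 12 };

// REX prefix 0100WRXB: W selects 64-bit operands, R and B carry the fourth
// bit of the reg and rm fields. X stays 0 because no SIB byte is used here.
constexpr uint8_t rexPrefix(bool w, uint8_t reg, uint8_t rm) {
    return static_cast<uint8_t>(0x40 | (w ? 0x08 : 0x00) | ((reg >> 3) << 2) | (rm >> 3));
}

// ModRM byte: mod (2 bits), reg (3 bits), rm (3 bits).
constexpr uint8_t modRM(uint8_t mod, uint8_t reg, uint8_t rm) {
    return static_cast<uint8_t>((mod << 6) | ((reg & 7) << 3) | (rm & 7));
}

// MOV dst, src between registers: REX.W + 89 /r, with src in reg and dst in rm.
inline void emitMovRegReg(std::vector<uint8_t>& out, Reg dst, Reg src) {
    out.push_back(rexPrefix(true, src, dst));
    out.push_back(0x89);
    out.push_back(modRM(3, src, dst));  // mod=11 selects register-direct operands
}
```

With these helpers, mov rbp, rdi from the prologue encodes as 48 89 FD, and a move into R12 sets REX.B, giving 49 89 C4 for mov r12, rax.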
ARM64 (AArch64) Code Generator¶
Class: AArch64CodeGen (jit_codegen_arm64.h)
Header-only code generator that emits 32-bit fixed-width ARM64 instructions:
General Purpose:
- MOV Xd, Xn (via ORR Xd, XZR, Xn) / MOV Xd, #imm64 (via MOVZ + up to 3 MOVK)
- LDR Xd, [Xn, #imm] / STR Xd, [Xn, #imm] (unsigned scaled & unscaled offsets)
- STP/LDP pairs for prologue/epilogue (pre-index store, post-index load)
- ADD Xd, Xn, #imm12 / SUB Xd, Xn, #imm12
- RET (return via X30/LR)
FP Double Precision:
- FADD / FSUB / FMUL / FDIV / FNEG (D-register arithmetic)
- LDR Dd, [Xn, #imm] / STR Dd, [Xn, #imm] (FP load/store)
- FMOV Dd, Dn / FMOV Dd, Xn (FP-FP and GP-to-FP moves)
- FCMP Dn, Dm (comparison, sets NZCV flags)
Conversion:
- FCVTZS Xd, Dn — FP double → signed int64 (for bitwise operations)
- SCVTF Dd, Xn — signed int64 → FP double (convert back after bitwise)
Bitwise (GP):
- AND / ORR / EOR (XOR) / MVN (NOT)
- LSLV (variable shift left) / ASRV (arithmetic shift right)
Branch:
- B <offset> — unconditional branch (±128 MB range)
- B.cond <offset> — conditional branch with FP condition codes (GE, LT, GT, LE, EQ, NE)
- Patch helpers: patchB() / patchBCond() for back-patching forward branches
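A few of the encodings listed above, written out as a sketch. The function names are illustrative and do not reproduce the AArch64CodeGen interface (patchBranch here is a stand-in for the idea behind patchB()); the bit patterns are the standard A64 encodings.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// RET: return to the address in X30 (LR).
constexpr uint32_t encodeRet() { return 0xD65F03C0u; }

// FADD Dd, Dn, Dm (double precision): Rm in bits 16-20, Rn in 5-9, Rd in 0-4.
constexpr uint32_t encodeFaddD(uint32_t rd, uint32_t rn, uint32_t rm) {
    return 0x1E602800u | (rm << 16) | (rn << 5) | rd;
}

// B <offset>: signed 26-bit word offset relative to the branch itself.
// `byte_offset` must be a multiple of 4 and within +/-128 MB.
constexpr uint32_t encodeB(int32_t byte_offset) {
    return 0x14000000u | ((static_cast<uint32_t>(byte_offset) >> 2) & 0x03FFFFFFu);
}

// Back-patches the B at `branch_index` so that it jumps to `target_index`.
// Indices count 32-bit instruction words in `code`.
inline void patchBranch(std::vector<uint32_t>& code, std::size_t branch_index,
                        std::size_t target_index) {
    const int32_t delta =
        (static_cast<int32_t>(target_index) - static_cast<int32_t>(branch_index)) * 4;
    code[branch_index] = encodeB(delta);
}
```

Forward branches are emitted with a placeholder offset and patched once the target is known, which is straightforward on ARM64 because every instruction is exactly four bytes.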
Optimization Feature Flags¶
Configurable in jit_config.h:
| Flag | Default | Description |
|---|---|---|
| ENABLE_INLINE_CACHING | true | Monomorphic inline caches for Tier-1 dispatch |
| ENABLE_LOOP_UNROLLING | true | Loop body duplication in Tier-2 |
| ENABLE_METHOD_INLINING | true | Method inlining in Tier-2 |
| ENABLE_SHALLOW_TRACING | true | Side-effect-free tracing in Tier-1 |
| ENABLE_TYPE_SPECIALIZATION | true | Type guards for specialized code paths |
| MAX_INLINING_DEPTH | 3 | Maximum depth for recursive inlining |
Current Status¶
The JIT infrastructure is fully implemented and integrated with the VM execution loop. Each component is also available as a standalone, tested module:
- ✅ Hot-spot profiler with threshold-based tier promotion
- ✅ Tier-1 threaded code compiler with inline caching
- ✅ Tier-2 tracing JIT with IR, optimization passes, and native x86-64 codegen
- ✅ Tier-2 tracing JIT with native ARM64 (AArch64) codegen — same IR/optimization pipeline, separate backend
- ✅ x86-64 code generator with GPR + SSE2 support
- ✅ ARM64 code generator with GP + FP double-precision support
- ✅ Platform-specific executable memory management (Linux, macOS, Windows, including MAP_JIT for Apple Silicon)
- ✅ JIT manager with tier coordination and event monitoring
- ✅ VM integration - JIT connected to interpreter's OP_LOOP / backward-jump dispatch
- ✅ On-stack replacement (OSR) - Transitioning mid-execution from JIT to interpreter
- ✅ Deoptimization - Guard failure fallback to interpreter
- ✅ Unit tests (tests/jit/test_*.nt) - 10 comprehensive JIT tests
Supported Operations (Tier-2 Native Codegen)¶
Both the x86-64 and ARM64 backends support the same set of IR operations. The Tier-2 compiler can generate native code for traces containing:
- Locals: LOAD_LOCAL, STORE_LOCAL (all slot numbers)
- Globals: LOAD_GLOBAL, STORE_GLOBAL (resolved to direct Value* pointers)
- Constants: Numeric constants only (LOAD_CONST)
- Arithmetic: ADD, SUBTRACT, MULTIPLY, DIVIDE, MODULO
- Bitwise: AND, OR, XOR, NOT, LEFT_SHIFT, RIGHT_SHIFT
- Comparison: LESS, GREATER, EQUAL, NOT_EQUAL
- Control flow: JUMP_IF_FALSE (loop exit + internal conditionals), JUMP (forward), LOOP_BACK
- Stack: POP, NEGATE
Traces containing unsupported operations (method calls, string operations, object access, closures, exceptions) are rejected and marked as failed to avoid retry.
Performance Monitoring¶
When monitoring is enabled (initialize(true)), the JIT manager emits CompilationEvent records for:
| Event | Description |
|---|---|
| TIER1_COMPILATION_TRIGGERED | Tier-1 compilation started |
| TIER1_COMPILATION_COMPLETED | Tier-1 compilation finished |
| TIER2_COMPILATION_TRIGGERED | Tier-2 trace compilation started |
| TIER2_COMPILATION_COMPLETED | Tier-2 trace compilation finished |
| TIER_TRANSITION | Code moved from one tier to another |
| COMPILATION_FAILED | Compilation attempt failed |
Statistics available via getStatistics():
- Total compilations per tier, failed compilations
- Total compilation time, total execution time
- Code cache sizes and hit ratios
- Time distribution across interpreter / Tier-1 / Tier-2
- Warmup duration