Differentiable Meta-Circular Interpretation

Compile once,
differentiate everywhere.

Write a scientific model as a tiny Scheme program. A compiler turns a Scheme interpreter into a differentiable computation graph — once — and then gradient descent flows through it to recover the unknown constants inside any program you feed it. No custom gradient code. No per-program recompilation.

The dilemma

Known physics, unknown constants

Scientists usually know the shape of an equation — Coulomb's law, a pendulum's swing, how a population grows — but not its exact constants. Classical machine learning forces a bad trade-off.

Write fixed code

Fast and exact — but frozen. You cannot gradient-optimize the constants you don't know. Every unknown becomes a manual sweep.

exact · not learnable

Train a neural net

Trainable — but an opaque approximation of structure you already knew. Data-hungry, and it extrapolates badly outside the training set.

learnable · opaque

Compile a program

Keep the structure exactly as written and frozen; make only its numeric constants learnable. Hard physics, soft parameters — the best of both.

exact and learnable

The first insight

A program's structure (its control flow, recursion, composition) can stay frozen while its numeric constants become trainable parameters fit by gradient descent. This erases the boundary between fixed code and trainable network. In the original Neural Compiler, a compiled 1-D heat-equation model recovers the diffusivity to 0.0% error — where a physics-informed neural net (PINN) lands at 92.7%.

Background · 2 minutes

A program is just an expression

The system speaks a small subset of Scheme, a minimalist Lisp. A program is an S-expression — a parenthesized tree — and it evaluates to a number. Try it: edit the program or the inputs and run it. This little interpreter runs entirely in your browser.

The pieces you'll need

Everything in this tutorial is built from these forms. Click a chip to load it into the editor.

(+ - * /)arithmetic
(if t a b)conditional
(let ((x 1)) …)local binding
(lambda (x) …)function / closure
(define (f x) …)named definition
sin cos exp log √math primitives

Here's Coulomb's law as a program, with an unknown constant k: (/ (* k (* q1 q2)) (* r r)). Hold that thought — in a moment we'll learn k from data by descending through this exact expression.

The compiler · how a program becomes a graph

Parse → ANF → Compute Graph

To make a program differentiable, the compiler lowers it through three stages. Type a program below and step through each one — the highlighting links every stage to the next.

1

Parse

A hand-written tokenizer + recursive-descent parser turns text into an immutable AST. Surface forms like cond, let* and when desugar to a small core.

2

A-Normal Form

Every compound subexpression is hoisted into a named let-binding, so each binding maps 1:1 to a graph node. Think "SSA for functional languages" — showing your work line by line.

3

Compute Graph

Each binding becomes a node in a dataflow DAG. Walk it in PyTorch and, because every primitive is differentiable, autograd flows from a loss back to the constants.

The whole point · live

Gradient descent through a program

Here is the idea made tangible. We have data and a program with an unknown constant. Press play and watch real reverse-mode autodiff — running in your browser on the graph we just built — pull the constant to its true value. This is exactly what the paper does, scaled up.

Why this beats a neural network here

The program contributes zero approximation error on its valid domain — all the error lives in the one constant we're learning. That's why a model with 1–4 parameters routinely beats an 8,500-parameter MLP, and why it extrapolates. On the LLM-generated benchmark, the compiled model's loss is up to 4,100× lower than a pure MLP — which fails outright on Coulomb's 1/r² singularity.

The leap · the heart of paper 2

Don't compile the program.
Compile the interpreter.

The original Neural Compiler compiled each program on its own. But the compiler is expressive enough to compile something far more powerful: a Scheme interpreter written in Scheme. Compile that once, and every program becomes data you hand to it.

The central insight

"The model is not differentiable; the interpreter is." A scientific model expressed as a Scheme program is not itself a differentiable computation graph — it is input data to the compiled interpreter, which is a differentiable computation graph. Because the interpreter is compiled once and verified once, any program it evaluates automatically inherits differentiability. Gradients flow through the interpreter's dispatch logic, environment management, and heap operations to reach the learnable parameters θ inside P(θ).

This is a real, self-hosting interpreter

The evaluator below is the actual core of bootstrap/compiler.scm — 215 lines of Scheme, 25 primitives. The compiler turns this into the differentiable graph. Notice it's just a cond over expression shapes: a number evaluates to itself, a symbol looks up the environment, a list dispatches on its head.

Self-hosting is the proof of expressiveness: the compiler can compile its own evaluator, which can then evaluate Scheme — including itself.

215lines of Scheme
25primitives
~285 KBportable artifact
bootstrap/compiler.scmscheme
(define (scheme-eval expr env)
  (cond
    ((number? expr) expr)                 ; numbers self-evaluate
    ((symbol? expr) (env-lookup expr env)) ; variables → environment
    ((pair? expr)
     (let ((head (car expr)))
      (cond
        ((eq? head 'quote) (car (cdr expr)))
        ((eq? head 'if)
         (let ((t (scheme-eval (car (cdr expr)) env)))
           (if t (scheme-eval (car (cdr (cdr expr))) env)
                 (scheme-eval (car (cdr (cdr (cdr expr)))) env))))
        ((eq? head 'lambda)
         (list 'closure (car (cdr expr))   ; capture params,
               (car (cdr (cdr expr))) env)) ; body, & environment
        (#t (eval-apply head (cdr expr) env)))))
    (#t 0)))

Compile once. Ship it. Run any program.

compile-once.pypython
from neural_compiler import compile_interpreter, evaluate_program, save_compiled, load_compiled

interp = compile_interpreter()             # compile the evaluator ONCE
save_compiled(interp, "interpreter.ncg")   # ~285 KB portable JSON — ship it anywhere

interp = load_compiled("interpreter.ncg")  # consumer: no Scheme toolchain, no recompilation
y = evaluate_program(interp, "(* a (exp (* (- 0 b) x)))", {"x": 1.5, "a": 2.5, "b": 0.8})
z = evaluate_program(interp, "(+ (* a x) (* b (* x x)))", {"x": 2.0, "a": 3.0, "b": 1.5})
# Two different programs — same compiled graph. Gradients flow to a, b in both.
Under the hood · what makes it work

Three tricks make it differentiable

For autograd to flow through an interpreter, every value, every memory operation, and every branch has to stay on the differentiable path. Three mechanisms make that true.

01

Tagged values

Every runtime value — a number, a boolean, a pair, even a closure — is the same shape: a fixed 14-dimensional tensor. The first 10 dimensions are a one-hot type tag that routes dispatch; the last 4 are a payload. Crucially, gradients flow through the payload while the tag merely decides what to do. Click a type to inspect it.

02

A write-once heap

Pairs, lists and closures live on a dictionary-backed heap keyed by integer addresses. Because the language is pure, every slot is written exactly once and never mutated. PyTorch's autograd version-counter is never tripped, so the gradient chain survives arbitrarily many cons / car / cdr operations.

Step through building a list on both a write-once heap and a naive mutating buffer — watch the autograd chain stay intact on one and shatter on the other.

03

Soft control flow

For non-recursive conditionals whose branches are both total, if is a differentiable multiplexer: sel·then + (1−sel)·else (recursive conditionals use lazy evaluation to bound depth). Branch decisions are made by program structure, not by the constants we're learning — so the shape of the autograd tape stays fixed even as parameter values change. Slide the test value to see the gate blend continuously.

The honest caveat

If a learnable parameter sits inside a branch condition — say (< x α) — the comparison is a hard 0/1 step and α receives zero gradient. The compiler doesn't introduce this; it inherits the source program's own non-differentiability. This is a real boundary, not a bug, and the paper demonstrates it directly.

Guarantees · proven, not just observed

Compile the interpreter once, verify once

Because correctness is established for the interpreter, it holds for every program the interpreter runs. Three theorems, stated over a 13-primitive core language.

Theorem 1

Compilation correctness

Every supported program, run by the compiled graph, produces the same value as the source semantics — to floating-point precision. Proved by structural induction. Empirically, compiled-interpreter and direct runs are bit-for-bit identical.

Theorem 2

Gradient correctness a.e.

For learnable constants θ, the gradient through the compiled graph equals the true gradient for almost every θ — a set of full Lebesgue measure. The exceptions are a measure-zero union of branch boundaries, where the program inherits the source's own non-differentiability.

Theorem 3

Composition preservation

Compose two almost-everywhere-gradient-correct programs and the result is too. The boundary set stays measure zero. This is what makes runtime composition of models trustworthy.

"Almost everywhere" is load-bearing

Gradients are exact on the open trace-constant region Θtc — where the discrete execution trace (which branches are taken, which tags dispatch) doesn't change with θ. Drag the point around the parameter space: inside a region the gradient is well-defined; crossing a boundary, a branch flips and the gradient is undefined exactly there.

The reassuring part: this region has full measure. A random initialization lands inside it with probability 1, and gradient steps keep you there until you deliberately cross a decision boundary.

Correctness ≠ success. Compilation makes the gradients correct; recovering a constant still depends on the loss landscape and the optimizer. Gradient descent is not magic — but now it's available, everywhere the math allows.

Evidence · real data from the paper

What the experiments show

Ten experiments (A–J) stress the system from gradient fidelity to throughput at scale. Every chart below is drawn from the paper's committed result data. Hover for exact values.

① Gradient equivalence & convergence Exp A & C

DMCI, direct compilation, and a hand-coded PyTorch interpreter trace the same convergence curve — they overlap to within 7×10⁻⁷ final-loss difference. A pure MLP, given the same task, diverges by orders of magnitude.

Convergence — recursive model C08 (cascaded EMA)
log loss vs epoch · DMCI = direct = hand-coded, MLP ~1,100× worse
Wall-clock by method Exp A
seconds to converge, log scale · autograd methods vs gradient-free

② LLM-generated models: physics vs MLP Exp B

An LLM wrote 15 scientific models from natural-language descriptions. All 15 compiled unmodified. DMCI matched direct compilation on every one (75/75 zero loss difference). A pure MLP fails badly where structure matters — note the log scale.

Final loss per model — DMCI vs pure MLP
15 LLM-generated models · lower is better · log scale

③ The latency cost — and how batching erases it Exp H

A single DMCI evaluation is ~14× slower than direct compilation. But that's a latency cost, not a throughput cost: over 99.8% of sequential time is Python bookkeeping paid once per batch. Batching amortizes it into an 852× speedup at batch size 1024.

Throughput scaling DiffSoc-S
ms per evaluation & speedup vs batch size · log axes
Where the per-eval time goes
overhead decomposition · raw arithmetic is just 0.1%

④ Compile once, and the honest limits Exp E, I, J

Compiling the interpreter is O(1) in program size, while per-program approaches scale linearly — a flat-versus-linear crossover. DMCI is also the most robust optimizer as models grow — but it is deliberately not a discrete program-synthesis engine.

Compile time vs size Exp J
DMCI O(1) vs baselines O(N) · log scale
Recovery error vs model size Exp I
DMCI stays robust where L-BFGS diverges
Discrete operator recovery Exp E
structure search is NOT DMCI's job
0%
autograd convergence · 171/171 runs (Exp A·B·C)
0×
lower loss than MLP (model M14)
0×
population-optimization speedup
0
relative gradient error vs direct
When to reach for DMCI

The case is reach and assurance — not speed

Great fit

  • Fitting continuous constants inside a known program structure
  • Programs generated by an LLM — differentiable with zero extra work
  • Composing or hot-swapping models at runtime (20–35 ms, gradients flow through the composite)
  • Recursive / iterative scientific models JAX's vmap can't vectorize
  • One source → four backends (PyTorch, JAX, NumPy, CuPy), verified once

Reach for something else

  • Latency-critical single evaluations (≈14× overhead per call)
  • Discrete program synthesis — only ~11% operator recovery vs 100% exhaustive
  • Optimizing a constant that lives inside a branch condition (zero gradient)
  • Problems with no known structure at all — train a network
"The ability to generate programs via LLMs, swap programs at runtime, or compose them programmatically — with gradient flow through the composite — is the architectural advantage that justifies the latency cost."
Get started

Read it, run it, cite it

Two papers and one open-source repository. Everything you need to reproduce the results or build on the system.

Install & first run

terminalbash
git clone https://github.com/sheneman/nncompile.git
cd nncompile && pip install -e .

python -c "from neural_compiler.compiler import run_scheme; \
print(run_scheme('(+ (* 3 4) 5)'))"   # 17.0

Fit a program's constants

fit.pypython
graph = compile_program("(* a x)", inputs={"a": None, "x": None})
a = torch.nn.Parameter(torch.tensor(0.0))
opt = torch.optim.Adam([a], lr=0.1)
for _ in range(200):
    pred = unwrap_number(evaluate(graph, {"a": make_float(a), "x": make_float(x)}))
    loss = (pred - y) ** 2
    opt.zero_grad(); loss.backward(); opt.step()
print(a.item())   # ≈ 3.0  — recovered by descending through the program

Cite this work

@article{sheneman2026dmci,
  title  = {Compile Once, Differentiate Everywhere:
            A Differentiable Meta-Circular Interpreter},
  author = {Sheneman, Lucas},
  year   = {2026},
  note   = {code at https://github.com/sheneman/nncompile}
}

@article{sheneman2026neuralcompiler,
  title   = {The Neural Compiler: Program-to-Network Translation
             for Hybrid Scientific Machine Learning},
  author  = {Sheneman, Lucas},
  journal = {arXiv preprint arXiv:2605.22498},
  year    = {2026}
}