Apache 2.0 · Open Source

Verify LLM output with Energy-Based reasoning

Constraint verification that catches what LLMs miss. Adversarial-robust, self-learning, real-time guided decoding. Built in Rust + JAX. pip install carnot

160+
Experiments
4
Energy Tiers
0.006ms
Verify Latency
2,251
Tests

LLMs predict. They don't check.

Autoregressive models commit to tokens one by one. Carnot introduces a holistic verification layer that treats the entire output as a single energy state.

⚠️

Autoregressive Failure

Token-by-token generation has no global consistency check. One wrong token poisons the entire sequence with no way to backtrack.

Energy-Based Repair

Energy-Based Models (EBMs) assign energy to the whole output. Low energy = valid. High energy = violation. Repair is done via gradient descent.
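To make the idea concrete, here is a toy sketch (pure Python, not the carnot API): an energy function that is zero only for a valid output, and a wrong answer repaired by descending the energy gradient.

```python
# Toy energy: squared violation of the constraint 47 + 28 = x.
# The valid output (x = 75) sits at the global energy minimum E = 0.
def energy(x):
    return (x - (47 + 28)) ** 2

def grad(x):
    return 2 * (x - (47 + 28))

# Start from the LLM's wrong answer and descend the energy landscape.
x = 76.0
for _ in range(100):
    x -= 0.1 * grad(x)

print(round(x, 3))  # converges to 75.0
```

Real constraints are higher-dimensional, but the mechanism is the same: violations raise energy, and repair is a descent back to a low-energy (valid) state.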

Verify, Repair, and Guide LLM Output

🧠

Four Energy Tiers

KAN (default): learnable splines, 0.994 AUROC, 8.7x fewer params — best for verification.
Ising: fastest sampling, maps to FPGA/TSU hardware — best for real-time guided decoding.
Gibbs: MLP for complex patterns.
Boltzmann: deep residual for research.
Rule of thumb: KAN for accuracy, Ising for speed.
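Part of why the Ising tier is fast: its energy is a quadratic form over ±1 spins, cheap to evaluate and hardware-friendly. A minimal illustration with toy couplings (not the carnot API):

```python
import numpy as np

# Minimal Ising energy: E(s) = -1/2 * s^T J s - h^T s, spins s_i in {-1, +1}.
# Lower energy means the configuration better satisfies the couplings.
def ising_energy(s, J, h):
    return -0.5 * s @ J @ s - h @ s

J = np.array([[0.0, 1.0],
              [1.0, 0.0]])  # ferromagnetic coupling: aligned spins preferred
h = np.zeros(2)

aligned = np.array([1.0, 1.0])
opposed = np.array([1.0, -1.0])
print(ising_energy(aligned, J, h))  # -1.0: low energy, constraint satisfied
print(ising_energy(opposed, J, h))  #  1.0: high energy, violation
```

In carnot, constraints extracted from LLM output play the role of the couplings; here they are hand-set for illustration.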

🛠️

Verify-Repair Pipeline

VerifyRepairPipeline: extract constraints, verify via Ising, repair with LLM feedback. +15% on adversarial math, +6% on HumanEval. 5 lines of Python.

Energy-Guided Decoding

Constraint energy steers token generation in real-time. 0.006ms per check — 100x faster than a single token generation step. Kona-style reasoning.
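A conceptual sketch of the steering step (the vocabulary, energies, and `guided_logits` helper are illustrative, not the carnot API): subtract each candidate token's constraint energy from its logit so low-energy, valid continuations win.

```python
import numpy as np

# Energy-guided decoding sketch: penalize each candidate token's logit by its
# constraint energy before sampling, so valid continuations are favored.
def guided_logits(logits, energies, guidance=1.0):
    return logits - guidance * energies

vocab = ["75", "76", "77"]
logits = np.array([1.0, 2.0, 0.5])    # raw LLM preferences (prefers "76")
energies = np.array([0.0, 4.0, 4.0])  # constraint energy: only "75" is valid

steered = guided_logits(logits, energies)
print(vocab[int(np.argmax(steered))])  # "75"
```

Because the energy check is far cheaper than a forward pass, it can run on every decoding step without dominating latency.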

🛡️

Adversarial Robust

Apple researchers showed that LLM math accuracy drops 21% when irrelevant sentences are added to a problem. Carnot recovers +15% because Ising verification ignores irrelevant context entirely.

🤖

Agentic Verification

ConstraintStateMachine tracks verified facts across multi-step agent workflows. Rollback to last consistent state on violation. 60/60 violations caught in workflow benchmark.
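The mechanism can be sketched as follows (a simplified illustration, not the actual `ConstraintStateMachine` API): each agent step proposes new facts, and a contradiction with a previously verified fact triggers a rollback to the last consistent snapshot.

```python
import copy

# Illustrative constraint tracker with rollback (hypothetical, simplified).
class SimpleConstraintTracker:
    def __init__(self):
        self.facts = {}

    def step(self, new_facts):
        snapshot = copy.deepcopy(self.facts)
        for key, value in new_facts.items():
            if key in self.facts and self.facts[key] != value:
                self.facts = snapshot  # rollback to last consistent state
                return False           # violation caught
            self.facts[key] = value
        return True

tracker = SimpleConstraintTracker()
print(tracker.step({"invoice_total": 120}))  # True: fact verified and stored
print(tracker.step({"invoice_total": 95}))   # False: contradiction, rolled back
print(tracker.facts["invoice_total"])        # 120
```

The point is that verification state persists across steps, so a mid-workflow contradiction never silently propagates downstream.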

📈

Self-Learning

Gets smarter with use: 67.6% → 97.0% over 500 questions. Online constraint generation from memory patterns, persistent across sessions. 96% factual claim coverage via Wikidata.
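The core loop behind this can be sketched in a few lines (a hypothetical simplification, not the carnot API): each caught violation is stored as a reusable constraint, so repeated patterns are checked against a growing rule set that persists across sessions.

```python
# Illustrative online constraint learning: violations become reusable rules.
learned_rules = set()

def verify(claim, rules):
    # A claim fails if it matches any previously learned violation pattern.
    return claim not in rules

def observe_violation(claim):
    learned_rules.add(claim)  # persist the pattern for future checks

observe_violation("47 + 28 = 76")
print(verify("47 + 28 = 76", learned_rules))  # False: caught by learned rule
print(verify("47 + 28 = 75", learned_rules))  # True
```

Carnot generalizes patterns rather than memorizing literal strings, but the accumulate-and-reuse loop is the same.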

Measured Performance

GSM8K Full (1,319 Questions)

Math Reasoning at Scale

Baseline: 70-77% · +Repair: 84-88%
Adversarial Math (Apple GSM8K)

+24-28% on Number-Swapped

LLM drops to: 46-53% · +Repair: 74-78%
Self-Learning (Tier 1)

Gets Smarter Over Time

Fixed: 67.6% · Adaptive: 97.0%
KAN Energy Tier

8.7x Fewer Parameters, Same Accuracy

Ising: 20,100 params · KAN: 2,310 (0.994 AUROC)
Guided Decoding

Real-Time Constraint Steering

Latency: 0.006ms · 100% constraint satisfaction rate (CSR)
Agentic Verification

Multi-Step Workflow Constraint Catching

Violations: 60 · Caught: 60/60 (100%)

Implement in Minutes

from carnot.pipeline import VerifyRepairPipeline

# Verify any LLM output in 3 lines
pipeline = VerifyRepairPipeline()
result = pipeline.verify("What is 47 + 28?", "47 + 28 = 76")
print(result.verified)   # False
print(result.violations)  # [47 + 28 = 76 (correct: 75)]

# Auto-repair with LLM feedback loop
fixed = pipeline.verify_and_repair(
    "What is 47 + 28?", model="Qwen/Qwen3.5-0.8B"
)
print(fixed.final_response)  # "47 + 28 = 75"