Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation
XY.AI Labs Team
Published on April 8, 2026

This article summarizes our peer-reviewed research paper published on arXiv:2604.05150. Authors: Geert Trooskens, Aaron Karlsberg, Anmol Sharma, Lamara De Brouwer, Max Van Puyvelde, Matthew Young, John Thickstun, Gil Alterovitz, Walter A. De Brouwer — from XY.AI Labs, Stanford University School of Medicine, Cornell University, and Harvard Medical School.
- 57× token reduction at 1,000 transactions
- 450× latency improvement
- 96% task completion
- 40× cost reduction at scale
Abstract
We study compiled AI, a paradigm in which large language models generate executable code artifacts during a compilation phase, after which workflows execute deterministically without further model invocation. This paradigm has antecedents in prior work on declarative pipeline optimization (DSPy) and hybrid neural-symbolic planning (LLM+P); our contribution is a systems-oriented study of its application to high-stakes enterprise workflows, with particular emphasis on healthcare settings where reliability and auditability are critical.
By constraining generation to narrow business-logic functions embedded in validated templates, compiled AI trades runtime flexibility for predictability, auditability, cost efficiency, and reduced security exposure. We introduce (i) a system architecture for constrained LLM-based code generation, (ii) a four-stage generation-and-validation pipeline that converts probabilistic model output into production-ready code artifacts, and (iii) an evaluation framework measuring operational metrics including token amortization, determinism, reliability, security, and cost.
We evaluate on two task types: function-calling (BFCL, n=400) and document intelligence (DocILE, n=5,680 invoices). On function-calling, compiled AI achieves 96% task completion with zero execution tokens, breaking even with runtime inference at approximately 17 transactions and reducing token consumption by 57× at 1,000 transactions. On document intelligence, our Code Factory variant matches Direct LLM on key field extraction (KILE: 80.0%) while achieving the highest line item recognition accuracy (LIR: 80.4%). Security evaluation across 135 test cases demonstrates 96.7% accuracy on prompt injection detection and 87.5% on static code safety analysis with zero false positives.
Introduction: The Problem with Runtime Inference
Large language models are increasingly deployed to automate enterprise workflows, evolving from question-answering systems toward autonomous agent architectures. While these approaches demonstrate flexibility, they rely on repeated model invocation during execution, leading to high token consumption, variable latency, and non-deterministic behavior.
The idea of using LLMs as compilers rather than interpreters is not new. DSPy compiles declarative LLM calls into optimized pipelines. LLM+P translates natural language to PDDL, then uses classical planners to produce optimal solutions. Text-to-SQL systems instantiate the same pattern. Our contribution is to study this paradigm as a first-class production systems concern, evaluating it with operational metrics suited to enterprise deployment, and demonstrating its particular value in healthcare settings where reliability and auditability are regulatory requirements.
Empirical studies reveal persistent reliability challenges in runtime agent systems. Cemri et al. (2025) found that 79% of multi-agent failures stem from specification and coordination issues rather than infrastructure. Salesforce's CRMArena-Pro benchmark shows agent success rates degrading from 58% in single-turn to 35% in multi-turn interactions. Non-determinism persists even at temperature=0: accuracy varies up to 15% across runs, and output variance of 18–75% has been documented due to architectural factors including Mixture-of-Experts routing.
What Is Compiled AI?
Compiled AI is a workflow system satisfying three properties:
- One-time LLM invocation — the model runs once at generation time, not at transaction time.
- Zero-token deterministic execution — deployed workflows run as static code with no further model calls.
- Mandatory multi-stage validation — every artifact passes security, syntax, execution, and accuracy checks before deployment.
| System | Runtime LLM? | Deterministic? | Validation Required? |
|---|---|---|---|
| Agents (AutoGen, LangChain) | Yes | No | No |
| DSPy | Partial | Partial | No |
| Compiled AI | No | Yes | Yes |
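The distinction in the table can be made concrete with a minimal sketch. All names here (`fake_llm`, `RuntimeAgent`, `CompiledWorkflow`, the routing rule) are illustrative stand-ins, not the paper's implementation; the point is only where the model call sits relative to the transaction loop.

```python
# Sketch: runtime agents invoke the model per transaction; a compiled
# workflow invokes it once at build time, then runs static code.

def fake_llm(prompt: str) -> str:
    """Stand-in for a model call returning generated business logic."""
    return "def route(tx): return 'expedite' if tx['amount'] > 1000 else 'standard'"

class RuntimeAgent:
    def handle(self, tx: dict) -> str:
        # One model invocation per transaction: tokens, latency, variance.
        return fake_llm(f"route this transaction: {tx}")

class CompiledWorkflow:
    def __init__(self) -> None:
        # One-time LLM invocation at compile time.
        source = fake_llm("generate a routing function")
        namespace: dict = {}
        exec(source, namespace)  # the real system validates before deployment
        self.route = namespace["route"]

    def handle(self, tx: dict) -> str:
        # Zero-token deterministic execution: plain code, no model call.
        return self.route(tx)

wf = CompiledWorkflow()
print(wf.handle({"amount": 1500}))  # -> expedite
```

Every call to `wf.handle` after construction is an ordinary function call, which is what makes the determinism and zero-token properties in the table possible.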
System Architecture: The Code Foundry
The compiled AI architecture is motivated by a fundamental asymmetry in enterprise workflow automation: generating correct business logic benefits from LLM reasoning, but executing that logic thousands of times per day does not. Runtime agent systems conflate these two phases. Our architecture separates them, confining LLM invocation to a one-time compilation step and deploying validated, static code for all subsequent execution.
Design Principles
- Constrained Generation: LLM output is limited to narrow, well-defined functions (20–50 lines). Templates provide infrastructure, bounding the output space and reducing hallucination risk.
- Compilation over Interpretation: Generated code is validated, tested, and deployed as static artifacts — the LLM exits the execution loop entirely.
- Validation as Requirement: Every artifact passes a four-stage pipeline before deployment, feasible precisely because we generate code rather than interpret configurations.
- Compliance by Construction: Regulatory constraints (HIPAA, PCI-DSS, SOC 2) are encoded directly in templates and prompt blocks, ensuring generated code inherits compliance properties by default.
Component Overview
The system takes a YAML workflow specification as input and produces a validated Temporal activity as output. An Orchestrator receives the specification and selects appropriate templates, modules, and compliance constraints. A Template Library provides tested code patterns for common workflow types. A Module Library provides reusable functional capabilities. Prompt Blocks encode domain constraints. The orchestrator assembles these components into a prompt, invokes the LLM once to generate business logic, and passes the assembled artifact through a four-stage validation pipeline before deployment.
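The assembly step can be sketched as follows. The template text, prompt-block contents, and spec fields are hypothetical; the real system parses a YAML specification and draws on much larger libraries, but the composition logic is the same shape.

```python
# Sketch of the Orchestrator's assembly step: a workflow spec selects a
# template and compliance prompt blocks, which are composed into a single
# generation prompt for the one-time LLM invocation.

TEMPLATES = {
    "invoice_extraction": "def run(doc):\n    {BUSINESS_LOGIC}\n",
}
PROMPT_BLOCKS = {
    "hipaa": "Never log or return patient identifiers.",
}

def assemble_prompt(spec: dict) -> str:
    template = TEMPLATES[spec["template"]]
    constraints = "\n".join(PROMPT_BLOCKS[c] for c in spec.get("compliance", []))
    return (
        "Generate only the business-logic body for this template.\n"
        f"Template:\n{template}\n"
        f"Constraints:\n{constraints}\n"
        f"Task: {spec['task']}"
    )

# In the real system this dict comes from the YAML workflow specification.
spec = {
    "template": "invoice_extraction",
    "compliance": ["hipaa"],
    "task": "extract invoice number and total",
}
prompt = assemble_prompt(spec)
```

Because compliance text lives in the prompt blocks rather than in per-request instructions, every generated artifact inherits the same constraints — the "compliance by construction" principle above.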
Four-Stage Validation Pipeline
- Security: Static analysis (Bandit, Semgrep, custom rules) checking for SQL injection, command injection, path traversal, and secrets exposure.
- Syntax: AST parsing, type checking (mypy), and linting (ruff).
- Execution: Sandboxed execution against test fixtures verifying successful completion, error handling, and output structure.
- Accuracy: Comparison against golden datasets using task-specific thresholds.
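The pipeline's fail-fast structure can be sketched with toy stand-ins for each stage — the real checks use Bandit/Semgrep, mypy/ruff, sandboxed execution, and golden datasets, none of which are reproduced here.

```python
# Sketch of the four-stage gate: each stage is a predicate over the generated
# source; an artifact deploys only if all stages pass, in order.

import ast

def security(source: str) -> bool:
    # Toy stand-in for Bandit/Semgrep: reject obviously dangerous calls.
    return "os.system" not in source and "eval(" not in source

def syntax(source: str) -> bool:
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

def execution(source: str) -> bool:
    # Real system: sandboxed run against test fixtures.
    ns: dict = {}
    try:
        exec(source, ns)
        return ns["run"]({"total": "42.00"}) is not None
    except Exception:
        return False

def accuracy(source: str) -> bool:
    # Real system: comparison against a golden dataset with thresholds.
    ns: dict = {}
    exec(source, ns)
    return ns["run"]({"total": "42.00"}) == 42.0

STAGES = [security, syntax, execution, accuracy]

def validate(source: str) -> bool:
    return all(stage(source) for stage in STAGES)

candidate = "def run(doc):\n    return float(doc['total'])\n"
```

Running the stages in this order means a security violation is caught before the code is ever parsed, let alone executed — which is what turns silent runtime failures into detectable compile-time failures.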
Bounded Agentic Invocation & Security
Some workflow steps require runtime judgment that cannot be fully precompiled. The architecture supports bounded agentic invocation: generated code may call LLMs for specific, narrow subtasks (e.g., extracting structured fields from an ambiguous clinical note) while maintaining deterministic overall flow with defined schemas, fallback logic, drift monitoring, and human escalation thresholds.
A three-gate security pipeline protects against OWASP LLM Top 10 vulnerabilities: an Input Gate (DeBERTa-v3 prompt injection detection + Presidio PII scanning), a Code Gate (static analysis for CWE-94, CWE-89, CWE-502, CWE-78, CWE-328), and an Output Gate (cryptographic canary token injection for prompt leakage detection). Compiled AI reduces runtime LLM invocations from one-per-transaction to zero, effectively eliminating the prompt-based attack surface in the execution path.
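The Output Gate's canary-token idea can be sketched as follows. The key, token format, and matching logic are assumptions for illustration; the paper only states that cryptographic canary tokens are injected to detect prompt leakage.

```python
# Sketch of an Output Gate canary: a keyed token is injected into the system
# prompt at generation time; if a well-formed token ever appears in model
# output, the prompt has leaked.

import hashlib
import hmac
import secrets

def make_canary(key: bytes) -> str:
    nonce = secrets.token_hex(8)
    tag = hmac.new(key, nonce.encode(), hashlib.sha256).hexdigest()[:16]
    return f"CANARY-{nonce}-{tag}"

def output_leaks(output: str, key: bytes) -> bool:
    # Flag any canary whose HMAC tag verifies against our key.
    for word in output.split():
        if word.startswith("CANARY-"):
            try:
                _, nonce, tag = word.split("-")
            except ValueError:
                continue
            expected = hmac.new(key, nonce.encode(), hashlib.sha256).hexdigest()[:16]
            if hmac.compare_digest(tag, expected):
                return True
    return False

key = b"gate-secret"
canary = make_canary(key)
```

Keying the token with an HMAC means an attacker who learns the format cannot forge a plausible canary to trigger false alarms, and random strings in benign output will not match.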
Experiments & Results
We evaluate on two distinct task types to demonstrate that compiled AI generalizes beyond a single benchmark.
Task 1: Function Calling (BFCL)
The Berkeley Function-Calling Leaderboard measures function-calling capability on 400 instances where the model must identify the correct function and extract parameters from natural language queries.
Token Efficiency & Cost
| Method | Gen Tokens | Runtime/tx | Total @ 1K | Break Even | Comp. Ratio |
|---|---|---|---|---|---|
| Direct LLM | 0 | 552 | 552,000 | — | — |
| LangChain | 0 | 740 | 740,000 | — | 0.75× |
| AutoGen | 0 | 805 | 805,000 | — | 0.69× |
| Compiled AI | 9,600 | 0 | 9,600 | ~17 tx | ~57× |
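The break-even and compression figures in the table follow from simple amortization arithmetic: compiled AI pays a fixed generation cost once, while runtime inference pays per transaction.

```python
# Amortization behind the table: fixed generation cost vs. per-transaction cost.

def break_even_tx(gen_tokens: float, runtime_tokens_per_tx: float) -> float:
    """Transaction count at which cumulative runtime tokens equal the
    one-time generation cost."""
    return gen_tokens / runtime_tokens_per_tx

def compression_at(n_tx: int, gen_tokens: float, runtime_tokens_per_tx: float) -> float:
    """Ratio of runtime total tokens to compiled total tokens at n_tx."""
    return (runtime_tokens_per_tx * n_tx) / gen_tokens

print(round(break_even_tx(9_600, 552), 1))          # 17.4 -> "~17 tx"
print(round(compression_at(1_000, 9_600, 552), 1))  # 57.5 -> "~57x"
```

Because the compiled column's runtime cost is literally zero, the compression ratio grows linearly with transaction volume: at 10,000 transactions it would be ~575×.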
Total Cost of Ownership
At 1M transactions/month (using Claude Opus 4.5 pricing: $15/1M input, $75/1M output tokens):
| Method | Inference | Infra | Total TCO | Cost Ratio |
|---|---|---|---|---|
| Direct LLM | $21,500 | $500 | $22,000 | 40× |
| LangChain | $28,900 | $500 | $29,400 | 53× |
| AutoGen | $31,400 | $500 | $31,900 | 57× |
| Compiled AI | $55 | $500 | $555 | 1× |
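The TCO model behind the table is a straightforward sum of inference and fixed infrastructure cost. The input/output token split per transaction is not given above, so the function below is a hedged sketch of the model's shape rather than a reproduction of the paper's exact figures.

```python
# Sketch of the monthly TCO model: inference spend plus a fixed infra cost.
# Prices are $ per 1M tokens (the Claude Opus 4.5 pricing cited in the text).

def monthly_tco(tokens_in_per_tx: float, tokens_out_per_tx: float,
                tx_per_month: int, infra: float = 500.0,
                price_in: float = 15.0, price_out: float = 75.0) -> float:
    inference = (tokens_in_per_tx * price_in + tokens_out_per_tx * price_out) \
                * tx_per_month / 1_000_000
    return inference + infra

# Compiled AI runs zero tokens per transaction, so monthly TCO collapses to
# the infra floor (plus a small one-time generation cost, ~$55 in the table).
compiled = monthly_tco(0, 0, 1_000_000)  # $500 infra only
```

The structural point is that runtime paradigms scale inference cost linearly with volume while compiled AI's cost curve is flat, which is why the ratio widens as transaction counts grow.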
Latency, Consistency & Reliability
- Latency: 4.5 ms P50 vs. 2,004 ms for direct LLM — a 450× improvement with near-zero jitter (10.5 ms vs. 1,123 ms).
- Determinism: 100% reproducibility and zero output entropy. Runtime inference shows only 95% reproducibility at temperature=0 due to model variance.
- Reliability: 96% task completion (384/400). All 16 failures are compilation-time failures — once successfully compiled, workflows execute with 100% reliability.
Multi-stage validation converts silent runtime failures into detectable compile-time failures, achieving 100% deployed-workflow reliability at the cost of a 4% compilation failure rate.
Security Results
- Prompt injection validator: 95.8% recall with 100% precision (zero false positives) across 30 adversarial inputs.
- Code safety gate: 75% recall on 20 vulnerable code fixtures with 100% precision (zero false positives on 20 benign workflows).
- Overall precision across 135 security test cases: 96.1%.
Task 2: Document Intelligence (DocILE)
To stress-test the architecture on noisy, semantic inputs, we evaluate on DocILE—5,680 invoice documents with degraded OCR quality, heterogeneous formats across hundreds of vendors, and semantic ambiguity.
| Paradigm | Approach | KILE | LIR | Latency | LLM Calls |
|---|---|---|---|---|---|
| Compiled AI | Deterministic (Regex) | 20.3% | 59.7% | 0.6 ms | None |
| Compiled AI | Code Factory | 80.0% | 80.4% | 2,695 ms | Compiled |
| Runtime AI | Direct LLM | 80.0% | 74.5% | 6,339 ms | Per-request |
| Runtime AI | LangChain | 80.0% | 75.6% | 6,207 ms | Per-request |
| Runtime AI | AutoGen | 77.8% | 78.9% | 13,742 ms | Per-request |
Code Factory bridges the gap between pure deterministic extraction and full runtime inference by compiling task-specific LLM calls as code artifacts rather than constructing prompts per-request. It achieves competitive KILE accuracy (80.0%, matching direct LLM) and the highest LIR accuracy (80.4%), while running at 2.3× lower latency than direct LLM.
Discussion: When to Use Compiled AI
Compiled AI is not a general replacement for runtime inference. Our results suggest a practical decision rule:
- Use compiled extraction for structured, predictable inputs where determinism matters.
- Use runtime LLM for noisy, open-ended content where semantic understanding is important.
- For mixed-content production systems, a hybrid approach — compiled extraction with confidence-based LLM fallback — can reduce latency and cost while maintaining near-LLM accuracy.
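The hybrid rule above can be sketched in a few lines. The regex, confidence threshold, and `llm_extract` stub are illustrative assumptions; the shape to notice is that the deterministic path handles the common case and the model is consulted only when confidence is low.

```python
# Sketch of confidence-based fallback: compiled extractor first, (stubbed)
# LLM only when the deterministic path is unsure.

import re

def compiled_extract(text: str):
    """Deterministic extractor; returns (value, confidence)."""
    m = re.search(r"Invoice\s*#\s*(\w+)", text)
    return (m.group(1), 0.95) if m else (None, 0.0)

def llm_extract(text: str) -> str:
    # Stand-in for a runtime model call on ambiguous input.
    return "UNKNOWN"

def hybrid_extract(text: str, threshold: float = 0.8) -> str:
    value, conf = compiled_extract(text)
    return value if conf >= threshold else llm_extract(text)

print(hybrid_extract("Invoice # A1234 Total 42.00"))  # fast path: A1234
```

If most documents take the fast path, average latency and token spend approach the compiled column of the DocILE table while accuracy on hard cases stays near the LLM baseline.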
Healthcare administrative workflows exemplify the regime where compiled AI excels. Prior authorization, billing reconciliation, and clinical document processing are high-volume, well-specified by regulation, compliance-sensitive, latency-sensitive, and demand full auditability.
The Economics of Compile-Once-Run-Many
LLM inference costs have fallen roughly 10× per year, but enterprise deployment volumes are growing faster still—driving total inference spend upward even as per-token costs decline. Architectures that amortize inference across transactions, rather than invoking a model per request, become economically compelling at scale. Compiled AI is one such architecture: generation cost is fixed at compile time, and execution cost is zero regardless of transaction volume.
Healthcare Application: Prior Authorization
We demonstrate bounded agentic invocation for healthcare prior authorization, where the LLM is confined to structured data extraction while coverage decisions execute as deterministic, fully auditable code:
- Safety Sandwich: The probabilistic model is constrained between input validation and deterministic logic.
- Auditability: Decisions trace to specific code lines, enabling full auditability under this architecture.
- Token Economics: Coverage rules are compiled once rather than sent per-request.
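The "safety sandwich" can be sketched as follows. The schema, coverage rule, and `llm_extract_fields` stub are illustrative assumptions, not the paper's actual prior-authorization logic; the structure to notice is that the only probabilistic step is field extraction, wrapped between validation and a deterministic decision.

```python
# Sketch of the safety sandwich: bounded LLM extraction between schema
# validation and deterministic, auditable coverage rules.

REQUIRED = {"cpt_code", "diagnosis", "prior_treatments"}

def llm_extract_fields(note: str) -> dict:
    # Stand-in for the bounded LLM call on an ambiguous clinical note.
    return {"cpt_code": "72148", "diagnosis": "M54.5", "prior_treatments": 2}

def validate_schema(fields: dict) -> dict:
    missing = REQUIRED - fields.keys()
    if missing:
        # Human escalation threshold: incomplete extraction never reaches rules.
        raise ValueError(f"escalate to human review: missing {sorted(missing)}")
    return fields

def coverage_decision(fields: dict) -> str:
    # Deterministic, line-traceable rule (hypothetical): lumbar MRI (72148)
    # requires at least two prior conservative treatments.
    if fields["cpt_code"] == "72148":
        return "approve" if fields["prior_treatments"] >= 2 else "deny"
    return "review"

decision = coverage_decision(validate_schema(llm_extract_fields("...")))
```

Because the decision logic is plain code, an auditor can point to the exact line that approved or denied a request — something a per-request prompt cannot offer.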
Limitations
- Specification quality: Compiled AI assumes users can accurately specify workflows in YAML; the "specification problem" remains fundamental.
- Bounded applicability: Not all workflows reduce to deterministic code; tasks requiring genuine creativity or adaptation to novel situations may require runtime inference.
- Generation failures: 4% of compilations (16/400 on BFCL) failed to produce accurate outputs, though all were syntactically valid and detectable before deployment.
- Model dependence: Generated code quality depends on the underlying LLM, and model updates may require re-validation.
Conclusion
We studied compiled AI as a production systems design point for LLM-based workflow automation, with emphasis on the healthcare domain where reliability and auditability are regulatory requirements. Building on the compilation paradigm established in DSPy, LLM+P, and related work, we characterized compiled AI's operational properties across two task types, showed that it generalizes through the Code Factory variant to semantic tasks requiring bounded LLM invocation, and demonstrated consistent advantages in:
- Token economics: 57× reduction at 1,000 transactions
- Latency: 450× improvement on function-calling
- Determinism: 100% reproducibility vs. 95% for runtime inference
- Cost at scale: 40–57× TCO reduction at 1M transactions/month
The approach trades flexibility for determinism, cost efficiency, auditability, and reduced attack surface—properties that matter most in well-specified, high-volume, compliance-sensitive workflow regimes.
We release our evaluation framework and benchmark suite at github.com/XY-Corp/CompiledAI.
Read the full paper: arXiv:2604.05150 — Trooskens, Karlsberg, Sharma, De Brouwer, Van Puyvelde, Young, Thickstun, Alterovitz, De Brouwer (April 2026).
