The Code Factory Manifesto
XY.AI Labs Team
Published on January 6, 2026

"As long as you have a machine that's good at writing programs, you have a very powerful tool."
The Secret
There is a secret hiding in plain sight at the frontier of AI development:
Code generation is becoming the universal interface through which AI accomplishes any task.
Not interpreting YAML configurations. Not orchestrating conversations between agents. Agent conversations meander unpredictably; code executes the same way, every time.
This insight is profoundly mispriced by most companies building AI agents for enterprise workflows. They're building interpreters when they should be building compilers, and the difference isn't just architectural, it's economic. Copilot helps humans write code. Multi-agent platforms reason at runtime, burning tokens on every transaction. We do something different: the LLM generates code once, then that code executes forever with zero inference cost. The LLM isn't in the loop after deployment. It did its job, now it's done.
At XY.AI, we are building the code factory for healthcare.
Stop Throwing Tokens at the Problem
The naive approach is to throw tokens at the problem: more context, longer reasoning chains, elaborate multi-agent dialogues. But tokens are expensive and nondeterministic. Every token spent on agent negotiation is potential work, not actual work.
Code inverts this. A 50-line function replaces a 10,000-token conversation loop and runs in milliseconds, not minutes.
The best strategy isn't smarter agents talking more. It's compressing intent into executable form once and letting deterministic systems run without the model in the loop.
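To make that concrete, here is a minimal, hypothetical sketch (the task, function name, and fields are ours for illustration, not a real XY.AI artifact). The intent "normalize incoming claim records" is compressed once into code; after that, no model is in the loop.

```python
# Hypothetical illustration: intent compressed once into a deterministic function.
# An agent loop would re-reason about this on every record, burning tokens;
# the generated function runs in milliseconds with zero inference cost.

from datetime import date


def normalize_claim(record: dict) -> dict:
    """Business logic a code factory might emit once, then run forever."""
    return {
        "member_id": record["member_id"].strip().upper(),
        "service_date": date.fromisoformat(record["service_date"]),
        "amount_cents": round(float(record["billed_amount"]) * 100),
        "is_emergency": record.get("place_of_service") == "23",  # 23 = ER place-of-service code
    }


# Runs the same way every time, with no LLM call after generation.
print(normalize_claim({
    "member_id": " abc123 ",
    "service_date": "2025-11-02",
    "billed_amount": "412.50",
    "place_of_service": "23",
}))
```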
What Are Frontier Labs Betting On?
They're betting on code.
The major AI labs have made their priorities clear:
- Google announced that approximately one-third of its code is now AI-generated, and launched Gemini 3 with explicit focus on coding capabilities.
- OpenAI reports that "almost all new code" at the company is now written with AI assistance. Altman has stated their models improved from "millionth-best programmer to top-50" in under two years.
- Anthropic's Dario Amodei stated in October 2024: "90% of code written by AI within Anthropic and companies we work with, that is absolutely true now."
This isn't coincidental. It's rational. Code is where the commercial demand is, and code is where the scaling laws still work.
The Benchmarks Are Accelerating
The capability gains over the past twelve months are significant. On SWE-Bench Verified, a benchmark measuring AI's ability to resolve real GitHub issues:
| Model | SWE-Bench Verified | Date | Source |
|---|---|---|---|
| Claude Opus 4.5 | 80.9% | Nov 2025 | Anthropic |
| GPT-5.2 | 80.0% | Dec 2025 | OpenAI |
| Gemini 3 Pro Preview | 76.2% | Nov 2025 | Google |
| OpenAI o3 | 69.1-71.7% | Apr 2025 | OpenAI |
| DeepSeek V3.2 Reasoner | 67.8% | Dec 2025 | DeepSeek |
Year-over-year improvement:
| Provider | Late 2024 | Late 2025 | Gain |
|---|---|---|---|
| Anthropic | 49.0% (Claude 3.5 Sonnet) | 80.9% | +31.9 pts |
| OpenAI | 33.2% (GPT-4o) | 80.0% | +46.8 pts |
| Google | 51.8% (Gemini 2.0 Flash) | 76.2% | +24.4 pts |
Note: Scores vary based on scaffolding and evaluation setup. Anthropic's custom scaffold adds approximately 10 points versus standard SWE-agent.
Verifiability Is the Moat
Andrej Karpathy articulated why code advances faster than other AI capabilities in his 2017 essay "Software 2.0":
"Neural networks are not just another classifier, they represent the beginning of a fundamental shift in how we develop software."
Code is uniquely verifiable. Programs can be executed, tested against unit tests, and evaluated millions of times with unambiguous pass/fail feedback. This creates a reinforcement learning flywheel that doesn't exist for prose, images, or configuration files.
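To make that pass/fail flywheel concrete, here is a minimal sketch of automated verification (our own illustration, not any lab's training pipeline): run a candidate program against unit tests and get an unambiguous signal.

```python
# Code's verifiability, sketched: execute a candidate program against unit
# tests for an unambiguous pass/fail signal. Illustrative only; real pipelines
# sandbox the execution and run millions of such checks.

candidate = """
def median(xs):
    xs = sorted(xs)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2
"""

tests = [(([1, 3, 2],), 2), (([4, 1, 2, 3],), 2.5)]


def score(source: str) -> bool:
    namespace: dict = {}
    try:
        exec(source, namespace)            # run the candidate program
        fn = namespace["median"]
        return all(fn(*args) == expected for args, expected in tests)
    except Exception:
        return False                       # any crash is an unambiguous failure


print("PASS" if score(candidate) else "FAIL")   # -> PASS
```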
In his June 2025 YC AI Startup School keynote, Karpathy extended this framework:
"What's changed, and I think it's a fundamental change, is that neural networks became programmable with large language models... basically your prompts are now programs that program the LLM."
Models Reason Better When Tasks Look Like Code
Academic research confirms what we as practitioners observe:
- PAL (Program-aided Language Models) demonstrated improvements over chain-of-thought prompting on mathematical reasoning tasks.
- Program of Thoughts prompting showed improvements on financial reasoning benchmarks.
- Chain of Code achieved 84% on BIG-Bench Hard, improving over chain-of-thought baselines.
The mechanism: code disentangles reasoning (what LLMs are good at) from computation (what interpreters are good at). If you can turn any workflow automation problem into a "writing code" problem, you anchor the task to something familiar for the model.
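Here is a minimal sketch of that pattern in the spirit of PAL (our own toy example, not the paper's prompts): the model's job is to translate the problem into a program; the interpreter's job is to compute.

```python
# Program-aided reasoning, sketched: the LLM handles translation (reasoning),
# the interpreter handles arithmetic (computation). The "model output" below
# is hard-coded for illustration.

problem = (
    "A clinic schedules 14 patients per day, 5 days a week. "
    "If 12% of appointments no-show, how many visits happen in 4 weeks?"
)

# What a model might emit when asked to answer with a Python program:
model_generated_program = """
scheduled = 14 * 5 * 4
completed = scheduled * (1 - 0.12)
answer = round(completed)
"""

namespace: dict = {}
exec(model_generated_program, namespace)   # deterministic computation step
print(namespace["answer"])                 # -> 246
```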
Where Alternative Approaches Fit
The industry has tried many approaches, each with legitimate strengths:
LangChain/LangGraph excel at rapid prototyping and experimentation. The ecosystem makes it easy to connect tools, models, and data sources quickly. LinkedIn, Uber, Replit, and Elastic have used LangGraph in production. However, some teams find the abstractions constraining at scale. Octomind's engineering team reported: "Once we removed it... we could just code."
No-code platforms (Zapier, Make) shine for simple, linear automations where custom code is overkill. Zapier offers 8,000+ integrations and SOC 2 Type II compliance. They struggle with complex branching logic and custom compliance requirements.
Conversational agents (AutoGen, CrewAI) offer flexibility for open-ended tasks. Salesforce research found leading AI agents achieve 35% success rates in multi-turn business scenarios.
Our position: these approaches serve real needs. For prototypes and simple automations, use what works. Code generation complements them; when you need custom logic at scale with compliance guarantees, the code factory takes over.
What We Build
The XY Code Factory (Config Agent) acts as Architect and Assembler: it selects templates, modules, and prompt blocks based on YAML specifications, then generates production-ready Python code.
A Component Library of tested, validated building blocks:
- Templates: Base execution patterns (SimpleAgent, StreamingAgent, ValidatorAgent)
- Modules: Functional capabilities (DatabaseModule, HTTPModule, NotificationModule)
- Prompt Blocks: Domain logic and compliance constraints (HIPAA, SOC2, PCI-DSS)
A Four-Stage Validation Pipeline: Security → Syntax → Execution → Accuracy. Every generated artifact is validated before deployment, something runtime interpreters cannot offer (a sketch of this gate follows below).
Temporal-Native Orchestration: Generated activities integrate seamlessly with durable execution guarantees.
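To make the four-stage pipeline above concrete, here is a minimal sketch of such a gate. The checks and names are illustrative assumptions, not our production pipeline, which goes much deeper at every stage.

```python
# Illustrative gate: Security -> Syntax -> Execution -> Accuracy.
# An artifact must pass every stage before it is eligible for deployment.

import ast

BANNED_PATTERNS = {"os.system", "subprocess", "eval("}   # toy security policy


def security_check(source: str) -> bool:
    """Stage 1: reject obviously unsafe constructs."""
    return not any(p in source for p in BANNED_PATTERNS)


def syntax_check(source: str) -> bool:
    """Stage 2: the artifact must parse as valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def execution_check(source: str) -> bool:
    """Stage 3: the artifact must load and define the expected entry point.
    (A real pipeline runs this in a sandbox, not the host process.)"""
    namespace: dict = {}
    try:
        exec(compile(source, "<generated>", "exec"), namespace)
    except Exception:
        return False
    return callable(namespace.get("process_business_logic"))


def accuracy_check(source: str, cases) -> bool:
    """Stage 4: the entry point must reproduce known input/output pairs."""
    namespace: dict = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    fn = namespace["process_business_logic"]
    return all(fn(inp) == expected for inp, expected in cases)


def validate(source: str, cases) -> bool:
    """Fail fast at the first gate that rejects the artifact."""
    return (security_check(source) and syntax_check(source)
            and execution_check(source) and accuracy_check(source, cases))


artifact = "def process_business_logic(x):\n    return x * 2\n"
print(validate(artifact, cases=[(3, 6), (0, 0)]))   # -> True
```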
How the Code Factory Works
We don't ask the LLM to write entire applications. That's where hallucinations live. Instead, we deliberately constrain the universe the model operates in: a bounded sandbox of tested templates, validated modules, and explicit compliance rules. The model has one job: fill in the business logic. Everything else is fixed.
When a new workflow step is needed, the Config Agent:
- Selects the right template based on the task pattern (synchronous call, streaming response, validation check, etc.)
- Injects the right modules for required capabilities (database access, HTTP calls, notifications)
- Assembles the right prompt blocks for domain constraints (HIPAA rules, financial compliance, security policies)
- Hands the LLM one job: write the process_business_logic() method
That's it. The LLM writes 20-50 lines of business logic. Everything else (the Temporal decorators, the input/output schemas, the module initialization, the error boundaries) comes from deterministic, tested code.
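For concreteness, here is a hypothetical sketch of the shape a generated activity could take. The @activity.defn decorator is from Temporal's Python SDK (temporalio); the class names, fields, and rule are illustrative assumptions, not our actual templates.

```python
# Hypothetical shape of a generated activity. The fixed scaffolding (decorator,
# schemas, wiring) comes from tested templates; the LLM only fills in
# process_business_logic().

from dataclasses import dataclass

from temporalio import activity   # Temporal Python SDK


@dataclass
class EligibilityInput:            # generated from the YAML spec (illustrative)
    member_id: str
    plan_code: str


@dataclass
class EligibilityOutput:
    eligible: bool
    reason: str


def process_business_logic(inp: EligibilityInput) -> EligibilityOutput:
    """The only part the LLM writes: 20-50 lines of domain logic."""
    if not inp.member_id:
        return EligibilityOutput(eligible=False, reason="missing member id")
    eligible = inp.plan_code.startswith("PPO")   # toy rule for illustration
    return EligibilityOutput(eligible=eligible, reason="plan code check")


@activity.defn
async def check_eligibility(inp: EligibilityInput) -> EligibilityOutput:
    # Error boundaries, logging, and module initialization are injected here
    # from deterministic templates, not written by the model.
    return process_business_logic(inp)
```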
The Differentiator: Agentic Where It Matters
We're not building static workflows. Real business processes require judgment: interpreting ambiguous documents, handling edge cases, making decisions that can't be reduced to if/else.
Our generated activities invoke LLM reasoning when the situation demands it. The difference is where the agency lives.
In conversational frameworks, the agents are the workflow: they decide what to do next, how to route, when to stop. The structure is emergent.
In our architecture, the workflow is compiled code: deterministic, validated, auditable. LLM reasoning is a tool that the code invokes when needed. The agent works for the code, not the other way around.
How we validate dynamic invocations:
- Every LLM call has defined input/output schemas and acceptable response patterns
- Fallback logic handles cases where outputs don't validate
- We monitor output distributions for drift and route ambiguous cases to human review
This doesn't eliminate nondeterminism; it contains it within observable, recoverable boundaries.
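A minimal sketch of that containment pattern (illustrative; the schema, labels, and call_llm stand-in are our assumptions): every dynamic LLM call sits behind a validation check, with a deterministic fallback that routes to human review.

```python
# Contained nondeterminism: the LLM call sits behind a schema check and a
# fallback. call_llm is a stand-in for whatever client the generated code uses.

import json

ALLOWED_DISPOSITIONS = {"approve", "deny", "needs_review"}


def classify_document(text: str, call_llm) -> dict:
    raw = call_llm(
        "Classify this document. Reply as JSON with keys 'disposition' "
        "(approve | deny | needs_review) and 'reason'.\n\n" + text
    )
    try:
        parsed = json.loads(raw)
        if parsed.get("disposition") in ALLOWED_DISPOSITIONS and parsed.get("reason"):
            return parsed                  # output validates against the schema
    except (json.JSONDecodeError, TypeError, AttributeError):
        pass
    # Fallback: anything that doesn't validate is routed to human review.
    return {"disposition": "needs_review", "reason": "model output failed validation"}
```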
The Orchestration
We don't build our own orchestration. We use Temporal because it has already solved durable execution, state management, retries, observability, and fault tolerance.
- Netflix reported that Temporal reduced deployment failure rates from 4% to 0.0001% for transient cloud operation failures.
- Coinbase stated: "Development velocity has increased as developers can focus exclusively on writing code instead of maintaining a homegrown SAGA solution... Temporal has opened up use cases which weren't even imaginable with the homegrown system."
- Replit noted: "We've been able to scale up, and Temporal has never been the bottleneck. The agent has massively increased in its usage, and not having to rebuild our entire orchestration engine is great."
- Maersk reported reducing time to add new features from 60-80 days to 5-10 days.
Our innovation is above the execution layer: generating the activities that Temporal orchestrates.
Healthcare Is Where This Matters Most to Us
Before XY.AI, our team spent almost two decades building complex AI applications and systems for US healthcare. We know both domains well, with their promises and constraints.
Healthcare enterprises face unique pressures: regulatory complexity, data sensitivity, and the stakes of automation failures. Their perspectives inform our approach:
"We see [AI] as a strategic enabler of what we are trying to accomplish, which will really drive more affordable, accessible and personalized care. We've been embedding AI responsibly not just as an experiment but at scale."
"AI is only going to develop at the speed of trust in healthcare... we can't get too excited that we move faster than the market we have to move at the speed of trust."
"This is healthcare's opportunity to really leapfrog."
"AI allows us to take the stupid out of work and allows us to do the real value-added work."
"It's no surprise that one of the most exciting areas of generative AI in healthcare is not where you think it is. It's not treatment. It's not diagnosis. It's not personalized medicine. It's reducing the administrative tasks of our clinicians."
These leaders aren't betting on autonomous agents. They're betting on reliable automation of well-defined processes—exactly what code generation enables.
Regulatory Context
As of July 2025, the FDA has authorized over 1,250 AI-enabled medical devices. The January 2025 draft guidance on "Artificial Intelligence-Enabled Device Software Functions" introduces the Predetermined Change Control Plan (PCCP), allowing certain updates without new submissions.
Our workflow automation systems generally fall outside FDA device jurisdiction (they automate administrative processes, not clinical decisions), but we design with regulatory scrutiny in mind. The code generation approach supports this: generated artifacts can be validated, documented, and audited in ways that runtime agent systems cannot.
The XY.AI Manifesto
I. Code Is the Universal Substrate
Every AI capability eventually becomes a coding problem. Natural language understanding? Write code that parses intent. Data transformation? Write code that manipulates schemas. API integration? Write code that makes HTTP calls. Decision logic? Write code that evaluates conditions.
We generate code. That is our moat.
II. Let Coding Models Do Their Thing
The frontier insight is simple: let coding models do their thing at the lowest possible level, and get out of their way.
This means:
- Don't wrap code generation in abstraction layers that obscure what's happening
- Don't interpret configurations when you can compile code
- Don't build clever orchestration when you can generate straightforward programs
- Don't fight the model's training distribution; anchor to it
The models are optimized for code. Our architecture is optimized for letting them write it.
III. Validation Is Non-Negotiable
The difference between a demo and a product is validation. The difference between a prototype and production is reliability.
Config-interpreting systems cannot be validated before execution. Code-generating systems can.
Our four-stage pipeline (security checks, syntax validation, execution testing, accuracy verification) exists because enterprise workflows have consequences. Failures cost money. Compliance violations cost more.
We validate before we deploy. Always.
IV. Customer Isolation Is a Feature, Not a Constraint
Every customer has different business logic, different integrations, different compliance requirements. Generic agent frameworks force everyone into the same runtime.
We generate customer-specific code artifacts. Each deployment is isolated. Each agent library is fine-tuned. Each workflow reflects the customer's actual business.
This is more work. It's also the only way enterprise actually works.
V. Self-Serve Means Self-Validating
True self-serve isn't "users write YAML and hope it works." True self-serve is "users write YAML and the system guarantees it works."
Our Agent Factory is a compiler, not an interpreter. Compilers catch errors before runtime. Interpreters discover them during execution.
Self-serve workflow deployment requires compile-time guarantees. We provide them.
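As a sketch of what a compile-time guarantee can mean here (using PyYAML; the spec fields are hypothetical, while the template and module names come from the component library above): a workflow spec is rejected before any code is generated.

```python
# Compile-time spec checking, sketched: reject a bad workflow spec before any
# code is generated or deployed. Field names are illustrative assumptions.

import yaml   # PyYAML

REQUIRED_FIELDS = {"name", "template", "modules", "compliance"}
KNOWN_TEMPLATES = {"SimpleAgent", "StreamingAgent", "ValidatorAgent"}

spec_text = """
name: eligibility-check
template: ValidatorAgent
modules: [DatabaseModule, HTTPModule]
compliance: [HIPAA]
"""


def compile_check(text: str) -> list[str]:
    spec = yaml.safe_load(text)
    errors = [f"missing field: {field}" for field in REQUIRED_FIELDS - spec.keys()]
    if spec.get("template") not in KNOWN_TEMPLATES:
        errors.append(f"unknown template: {spec.get('template')}")
    return errors                          # empty list means the spec compiles


print(compile_check(spec_text) or "spec OK")   # -> spec OK
```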
Our Belief
Code generation will become the dominant paradigm for AI task execution. The economics are too favorable, the verifiability advantage too strong, the lab priorities too clear.
Caveat: enterprise adoption lags capability. Regulatory hurdles (FDA for healthcare workflows, SOC 2 for financial services) extend timelines in regulated industries. McKinsey reports that while 88% of organizations use AI in at least one business function, ~67% still struggle to scale beyond pilot projects.
Enterprise adoption requires compiled artifacts, not interpreted configurations. Compliance, auditability, and customer isolation demand it.
The self-serve future is code factories, not no-code platforms. The abstraction layer is YAML-to-code, not drag-and-drop interfaces.
If we're right, XY.AI becomes a major infrastructure layer for enterprise AI workflow automation, the compiler that turns business intent into reliable, validated, customer-specific code.
Andrej Karpathy wrote about verifiability. The frontier labs are optimizing for code. Amjad Masad is betting Replit's future on it.
While most companies are still building interpreters, we are building the code factory.
That is our strategy. That is our edge. That is our belief.
XY.AI Labs — Palo Alto, CA December 2025
Appendix: Competitive Positioning
| Approach | Examples | Strengths | Limitations | XY.AI Advantage |
|---|---|---|---|---|
| Conversational Agents | AutoGen, CrewAI | Flexible, open-ended tasks | No pre-deployment validation | Compiled artifacts, 4-stage validation |
| Graph Orchestration | LangGraph | Rapid prototyping, rich ecosystem | Runtime interpretation | Generated code, Temporal-native |
| No-Code Platforms | Zapier, Make | Simple automations, many integrations | Limited custom logic | Customer-specific code generation |
| Raw Temporal | DIY | Full control, proven reliability | Requires manual activity coding | Auto-generated activities from YAML |
Appendix: Key Quotes
"Coding will be a core competency for every AI agent—because writing code is how agents use computers best."
"We're seeing customers have three orders of magnitude of savings on apps."
"Your customer is no longer the developer—your customer is the agent that the developer is wielding."
"AI is only going to develop at the speed of trust in healthcare."
