The Code Factory Manifesto
XY.AI Labs Team
Published on January 6, 2026

"As long as you have a machine that's good at writing programs, you have a very powerful tool."
The Secret
There is a secret hiding in plain sight at the frontier of AI development:
Code generation is becoming the universal interface through which AI accomplishes any task.
Not interpreting YAML configurations. Not orchestrating conversations between agents. Agent conversations meander unpredictably; code executes the same way, every time.
This insight is profoundly mispriced by most companies building AI agents for enterprise workflows. They're building interpreters when they should be building compilers, and the difference isn't just architectural, it's economic. Copilot helps humans write code. Multi-agent platforms reason at runtime, burning tokens on every transaction. We do something different: the LLM generates code once, then that code executes forever with zero inference cost. The LLM isn't in the loop after deployment. It did its job, now it's done.
At XY.AI, we are building the code factory for healthcare.
Stop Throwing Tokens at the Problem
The naive approach is to throw tokens at the problem: more context, longer reasoning chains, elaborate multi-agent dialogues. But tokens are expensive and nondeterministic. Every token spent on agent negotiation is potential work, not actual work.
Code inverts this. A 50-line function replaces a 10,000-token conversation loop and runs in milliseconds, not minutes.
The best strategy isn't smarter agents talking more. It's compressing intent into executable form once and letting deterministic systems run without the model in the loop.
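To make that concrete, here is a minimal, hypothetical sketch (the task, function name, and fields are ours for illustration, not a real XY.AI artifact). The intent "normalize incoming claim records" is compressed once into code; after that, no model is in the loop.

```python
# Hypothetical illustration: intent compressed once into a deterministic function.
# An agent loop would re-reason about this on every record, burning tokens;
# the generated function runs in milliseconds with zero inference cost.

from datetime import date


def normalize_claim(record: dict) -> dict:
    """Business logic a code factory might emit once, then run forever."""
    return {
        "member_id": record["member_id"].strip().upper(),
        "service_date": date.fromisoformat(record["service_date"]),
        "amount_cents": round(float(record["billed_amount"]) * 100),
        "is_emergency": record.get("place_of_service") == "23",  # 23 = ER place-of-service code
    }


# Runs the same way every time, with no LLM call after generation.
print(normalize_claim({
    "member_id": " abc123 ",
    "service_date": "2025-11-02",
    "billed_amount": "412.50",
    "place_of_service": "23",
}))
```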
What Are Frontier Labs Betting On?
They're betting on code.
The major AI labs have made their priorities clear:
- Google announced that approximately one-third of its code is now AI-generated, and launched Gemini 3 with explicit focus on coding capabilities.
- OpenAI reports that "almost all new code" at the company is now written with AI assistance. Altman has stated their models improved from "millionth-best programmer to top-50" in under two years.
- Anthropic's Dario Amodei stated in October 2024: "90% of code written by AI within Anthropic and companies we work with, that is absolutely true now."
This isn't coincidental. It's rational. Code is where the commercial demand is, and code is where the scaling laws still work.
The Benchmarks Are Accelerating
The capability gains over the past twelve months are significant. On SWE-Bench Verified, a benchmark measuring AI's ability to resolve real GitHub issues:
| Model | SWE-Bench Verified | Date | Source |
|---|---|---|---|
| Claude Opus 4.5 | 80.9% | Nov 2025 | Anthropic |
| GPT-5.2 | 80.0% | Dec 2025 | OpenAI |
| Gemini 3 Pro Preview | 76.2% | Nov 2025 | Google |
| OpenAI o3 | 69.1-71.7% | Apr 2025 | OpenAI |
| DeepSeek V3.2 Reasoner | 67.8% | Dec 2025 | DeepSeek |
Year-over-year improvement:
| Provider | Late 2024 | Late 2025 | Gain |
|---|---|---|---|
| Anthropic | 49.0% (Claude 3.5 Sonnet) | 80.9% | +31.9 pts |
| OpenAI | 33.2% (GPT-4o) | 80.0% | +46.8 pts |
| Google | 51.8% (Gemini 2.0 Flash) | 76.2% | +24.4 pts |
Note: Scores vary based on scaffolding and evaluation setup. Anthropic's custom scaffold adds approximately 10 points versus standard SWE-agent.
Verifiability Is the Moat
Andrej Karpathy articulated why code advances faster than other AI capabilities in his 2017 essay "Software 2.0":
"Neural networks are not just another classifier, they represent the beginning of a fundamental shift in how we develop software."
Code is uniquely verifiable. Programs can be executed, tested against unit tests, and evaluated millions of times with unambiguous pass/fail feedback. This creates a reinforcement learning flywheel that doesn't exist for prose, images, or configuration files.
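To make that pass/fail flywheel concrete, here is a minimal sketch of automated verification (our own illustration, not any lab's training pipeline): run a candidate program against unit tests and get an unambiguous signal.

```python
# Code's verifiability, sketched: execute a candidate program against unit
# tests for an unambiguous pass/fail signal. Illustrative only; real pipelines
# sandbox the execution and run millions of such checks.

candidate = """
def median(xs):
    xs = sorted(xs)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2
"""

tests = [(([1, 3, 2],), 2), (([4, 1, 2, 3],), 2.5)]


def score(source: str) -> bool:
    namespace: dict = {}
    try:
        exec(source, namespace)            # run the candidate program
        fn = namespace["median"]
        return all(fn(*args) == expected for args, expected in tests)
    except Exception:
        return False                       # any crash is an unambiguous failure


print("PASS" if score(candidate) else "FAIL")   # -> PASS
```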
In his June 2025 YC AI Startup School keynote, Karpathy extended this framework:
"What's changed, and I think it's a fundamental change, is that neural networks became programmable with large language models... basically your prompts are now programs that program the LLM."
Models Reason Better When Tasks Look Like Code
Academic research confirms what we as practitioners observe:
- PAL (Program-aided Language Models) demonstrated improvements over chain-of-thought prompting on mathematical reasoning tasks.
- Program of Thoughts prompting showed improvements on financial reasoning benchmarks.
- Chain of Code achieved 84% on BIG-Bench Hard, improving over chain-of-thought baselines.
The mechanism: code disentangles reasoning (what LLMs are good at) from computation (what interpreters are good at). If you can turn any workflow automation problem into a "writing code" problem, you anchor the task to something familiar for the model.
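Here is a minimal sketch of that pattern in the spirit of PAL (our own toy example, not the paper's prompts): the model's job is to translate the problem into a program; the interpreter's job is to compute.

```python
# Program-aided reasoning, sketched: the LLM handles translation (reasoning),
# the interpreter handles arithmetic (computation). The "model output" below
# is hard-coded for illustration.

problem = (
    "A clinic schedules 14 patients per day, 5 days a week. "
    "If 12% of appointments no-show, how many visits happen in 4 weeks?"
)

# What a model might emit when asked to answer with a Python program:
model_generated_program = """
scheduled = 14 * 5 * 4
completed = scheduled * (1 - 0.12)
answer = round(completed)
"""

namespace: dict = {}
exec(model_generated_program, namespace)   # deterministic computation step
print(namespace["answer"])                 # -> 246
```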
Where Alternative Approaches Fit
The industry has tried many approaches, each with legitimate strengths:
LangChain/LangGraph excel at rapid prototyping and experimentation. The ecosystem makes it easy to connect tools, models, and data sources quickly. LinkedIn, Uber, Replit, and Elastic have used LangGraph in production. However, some teams find the abstractions constraining at scale. Octomind's engineering team reported: "Once we removed it... we could just code."
No-code platforms (Zapier, Make) shine for simple, linear automations where custom code is overkill. Zapier offers 8,000+ integrations and SOC 2 Type II compliance. They struggle with complex branching logic and custom compliance requirements.
Conversational agents (AutoGen, CrewAI) offer flexibility for open-ended tasks. Salesforce research found leading AI agents achieve 35% success rates in multi-turn business scenarios.
Our position: these approaches serve real needs. For prototypes and simple automations, use what works. Code generation complements them; when you need custom logic at scale with compliance guarantees, the code factory takes over.
What We Build
The XY Code Factory (Config Agent) acts as Architect and Assembler: it selects templates, modules, and prompt blocks based on YAML specifications, then generates production-ready Python code.
A Component Library of tested, validated building blocks:
- Templates: Base execution patterns (SimpleAgent, StreamingAgent, ValidatorAgent)
- Modules: Functional capabilities (DatabaseModule, HTTPModule, NotificationModule)
- Prompt Blocks: Domain logic and compliance constraints (HIPAA, SOC2, PCI-DSS)
A Four-Stage Validation Pipeline: Security → Syntax → Execution → Accuracy. Every generated artifact is validated before deployment, something runtime interpreters cannot offer (a sketch of this gate follows below).
Temporal-Native Orchestration: Generated activities integrate seamlessly with durable execution guarantees.
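To make the four-stage pipeline above concrete, here is a minimal sketch of such a gate. The checks and names are illustrative assumptions, not our production pipeline, which goes much deeper at every stage.

```python
# Illustrative gate: Security -> Syntax -> Execution -> Accuracy.
# An artifact must pass every stage before it is eligible for deployment.

import ast

BANNED_PATTERNS = {"os.system", "subprocess", "eval("}   # toy security policy


def security_check(source: str) -> bool:
    """Stage 1: reject obviously unsafe constructs."""
    return not any(p in source for p in BANNED_PATTERNS)


def syntax_check(source: str) -> bool:
    """Stage 2: the artifact must parse as valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def execution_check(source: str) -> bool:
    """Stage 3: the artifact must load and define the expected entry point.
    (A real pipeline runs this in a sandbox, not the host process.)"""
    namespace: dict = {}
    try:
        exec(compile(source, "<generated>", "exec"), namespace)
    except Exception:
        return False
    return callable(namespace.get("process_business_logic"))


def accuracy_check(source: str, cases) -> bool:
    """Stage 4: the entry point must reproduce known input/output pairs."""
    namespace: dict = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    fn = namespace["process_business_logic"]
    return all(fn(inp) == expected for inp, expected in cases)


def validate(source: str, cases) -> bool:
    """Fail fast at the first gate that rejects the artifact."""
    return (security_check(source) and syntax_check(source)
            and execution_check(source) and accuracy_check(source, cases))


artifact = "def process_business_logic(x):\n    return x * 2\n"
print(validate(artifact, cases=[(3, 6), (0, 0)]))   # -> True
```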
How the Code Factory Works
We don't ask the LLM to write entire applications. That's where hallucinations live. Instead, we deliberately constrain the universe the model operates in: a bounded sandbox of tested templates, validated modules, and explicit compliance rules. The model has one job: fill in the business logic. Everything else is fixed.
When a new workflow step is needed, the Config Agent:
- Selects the right template based on the task pattern (synchronous call, streaming response, validation check, etc.)
- Injects the right modules for required capabilities (database access, HTTP calls, notifications)
- Assembles the right prompt blocks for domain constraints (HIPAA rules, financial compliance, security policies)
- Hands the LLM one job: write the process_business_logic() method
That's it. The LLM writes 20-50 lines of business logic. Everything else (the Temporal decorators, the input/output schemas, the module initialization, the error boundaries) comes from deterministic, tested code.
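For concreteness, here is a hypothetical sketch of the shape a generated activity could take. The @activity.defn decorator is from Temporal's Python SDK (temporalio); the class names, fields, and rule are illustrative assumptions, not our actual templates.

```python
# Hypothetical shape of a generated activity. The fixed scaffolding (decorator,
# schemas, wiring) comes from tested templates; the LLM only fills in
# process_business_logic().

from dataclasses import dataclass

from temporalio import activity   # Temporal Python SDK


@dataclass
class EligibilityInput:            # generated from the YAML spec (illustrative)
    member_id: str
    plan_code: str


@dataclass
class EligibilityOutput:
    eligible: bool
    reason: str


def process_business_logic(inp: EligibilityInput) -> EligibilityOutput:
    """The only part the LLM writes: 20-50 lines of domain logic."""
    if not inp.member_id:
        return EligibilityOutput(eligible=False, reason="missing member id")
    eligible = inp.plan_code.startswith("PPO")   # toy rule for illustration
    return EligibilityOutput(eligible=eligible, reason="plan code check")


@activity.defn
async def check_eligibility(inp: EligibilityInput) -> EligibilityOutput:
    # Error boundaries, logging, and module initialization are injected here
    # from deterministic templates, not written by the model.
    return process_business_logic(inp)
```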
The Differentiator: Agentic Where It Matters
We're not building static workflows. Real business processes require judgment: interpreting ambiguous documents, handling edge cases, making decisions that can't be reduced to if/else.
Our generated activities invoke LLM reasoning when the situation demands it. The difference is where the agency lives.
In conversational frameworks, the agents are the workflow: they decide what to do next, how to route, when to stop. The structure is emergent.
In our architecture, the workflow is compiled code: deterministic, validated, auditable. LLM reasoning is a tool that the code invokes when needed. The agent works for the code, not the other way around.
How we validate dynamic invocations:
- Every LLM call has defined input/output schemas and acceptable response patterns
- Fallback logic handles cases where outputs don't validate
- We monitor output distributions for drift and route ambiguous cases to human review
This doesn't eliminate nondeterminism; it contains it within observable, recoverable boundaries.
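A minimal sketch of that containment pattern (illustrative; the schema, labels, and call_llm stand-in are our assumptions): every dynamic LLM call sits behind a validation check, with a deterministic fallback that routes to human review.

```python
# Contained nondeterminism: the LLM call sits behind a schema check and a
# fallback. call_llm is a stand-in for whatever client the generated code uses.

import json

ALLOWED_DISPOSITIONS = {"approve", "deny", "needs_review"}


def classify_document(text: str, call_llm) -> dict:
    raw = call_llm(
        "Classify this document. Reply as JSON with keys 'disposition' "
        "(approve | deny | needs_review) and 'reason'.\n\n" + text
    )
    try:
        parsed = json.loads(raw)
        if parsed.get("disposition") in ALLOWED_DISPOSITIONS and parsed.get("reason"):
            return parsed                  # output validates against the schema
    except (json.JSONDecodeError, TypeError, AttributeError):
        pass
    # Fallback: anything that doesn't validate is routed to human review.
    return {"disposition": "needs_review", "reason": "model output failed validation"}
```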
The Orchestration
We don't build our own orchestration. We use Temporal because it has already solved durable execution, state management, retries, observability, and fault tolerance.
- Netflix reported that Temporal reduced deployment failure rates from 4% to 0.0001% for transient cloud operation failures.
- Coinbase stated: "Development velocity has increased as developers can focus exclusively on writing code instead of maintaining a homegrown SAGA solution... Temporal has opened up use cases which weren't even imaginable with the homegrown system."
- Replit noted: "We've been able to scale up, and Temporal has never been the bottleneck. The agent has massively increased in its usage, and not having to rebuild our entire orchestration engine is great."
- Maersk reported reducing time to add new features from 60-80 days to 5-10 days.
Our innovation is above the execution layer: generating the activities that Temporal orchestrates.
Healthcare Is Where This Matters Most to Us
Before XY.AI, our team spent almost two decades building complex AI applications and systems for US healthcare. We know both domains well, with their promises and constraints.
Healthcare enterprises face unique pressures: regulatory complexity, data sensitivity, and the stakes of automation failures. Their perspectives inform our approach:
"We see [AI] as a strategic enabler of what we are trying to accomplish, which will really drive more affordable, accessible and personalized care. We've been embedding AI responsibly not just as an experiment but at scale."
"AI is only going to develop at the speed of trust in healthcare... we can't get too excited that we move faster than the market we have to move at the speed of trust."
"This is healthcare's opportunity to really leapfrog."
"AI allows us to take the stupid out of work and allows us to do the real value-added work."
"It's no surprise that one of the most exciting areas of generative AI in healthcare is not where you think it is. It's not treatment. It's not diagnosis. It's not personalized medicine. It's reducing the administrative tasks of our clinicians."
These leaders aren't betting on autonomous agents. They're betting on reliable automation of well-defined processes—exactly what code generation enables.
Regulatory Context
As of July 2025, the FDA has authorized over 1,250 AI-enabled medical devices. The January 2025 draft guidance on "Artificial Intelligence-Enabled Device Software Functions" introduces the Predetermined Change Control Plan (PCCP), allowing certain updates without new submissions.
Our workflow automation systems generally fall outside FDA device jurisdiction (they automate administrative processes, not clinical decisions), but we design with regulatory scrutiny in mind. The code generation approach supports this: generated artifacts can be validated, documented, and audited in ways that runtime agent systems cannot.
The XY.AI Manifesto
I. Code Is the Universal Substrate
Every AI capability eventually becomes a coding problem. Natural language understanding? Write code that parses intent. Data transformation? Write code that manipulates schemas. API integration? Write code that makes HTTP calls. Decision logic? Write code that evaluates conditions.
We generate code. That is our moat.
II. Let Coding Models Do Their Thing
The frontier insight is simple: let coding models do their thing at the lowest possible level, and get out of their way.
This means:
- Don't wrap code generation in abstraction layers that obscure what's happening
- Don't interpret configurations when you can compile code
- Don't build clever orchestration when you can generate straightforward programs
- Don't fight the model's training distribution; anchor to it
The models are optimized for code. Our architecture is optimized for letting them write it.
III. Validation Is Non-Negotiable
The difference between a demo and a product is validation. The difference between a prototype and production is reliability.
Config-interpreting systems cannot be validated before execution. Code-generating systems can.
Our four-stage pipeline (security checks, syntax validation, execution testing, accuracy verification) exists because enterprise workflows have consequences. Failures cost money. Compliance violations cost more.
We validate before we deploy. Always.
IV. Customer Isolation Is a Feature, Not a Constraint
Every customer has different business logic, different integrations, different compliance requirements. Generic agent frameworks force everyone into the same runtime.
We generate customer-specific code artifacts. Each deployment is isolated. Each agent library is fine-tuned. Each workflow reflects the customer's actual business.
This is more work. It's also the only way enterprise actually works.
V. Self-Serve Means Self-Validating
True self-serve isn't "users write YAML and hope it works." True self-serve is "users write YAML and the system guarantees it works."
Our Agent Factory is a compiler, not an interpreter. Compilers catch errors before runtime. Interpreters discover them during execution.
Self-serve workflow deployment requires compile-time guarantees. We provide them.
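As a sketch of what a compile-time guarantee can mean here (using PyYAML; the spec fields are hypothetical, while the template and module names come from the component library above): a workflow spec is rejected before any code is generated.

```python
# Compile-time spec checking, sketched: reject a bad workflow spec before any
# code is generated or deployed. Field names are illustrative assumptions.

import yaml   # PyYAML

REQUIRED_FIELDS = {"name", "template", "modules", "compliance"}
KNOWN_TEMPLATES = {"SimpleAgent", "StreamingAgent", "ValidatorAgent"}

spec_text = """
name: eligibility-check
template: ValidatorAgent
modules: [DatabaseModule, HTTPModule]
compliance: [HIPAA]
"""


def compile_check(text: str) -> list[str]:
    spec = yaml.safe_load(text)
    errors = [f"missing field: {field}" for field in REQUIRED_FIELDS - spec.keys()]
    if spec.get("template") not in KNOWN_TEMPLATES:
        errors.append(f"unknown template: {spec.get('template')}")
    return errors                          # empty list means the spec compiles


print(compile_check(spec_text) or "spec OK")   # -> spec OK
```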
Our Belief
Code generation will become the dominant paradigm for AI task execution. The economics are too favorable, the verifiability advantage too strong, the lab priorities too clear.
Caveat: enterprise adoption lags capability. Regulatory hurdles (FDA for healthcare workflows, SOC 2 for financial services) extend timelines in regulated industries. McKinsey reports that while 88% of organizations use AI in at least one business function, ~67% still struggle to scale beyond pilot projects.
Enterprise adoption requires compiled artifacts, not interpreted configurations. Compliance, auditability, and customer isolation demand it.
The self-serve future is code factories, not no-code platforms. The abstraction layer is YAML-to-code, not drag-and-drop interfaces.
If we're right, XY.AI becomes a major infrastructure layer for enterprise AI workflow automation, the compiler that turns business intent into reliable, validated, customer-specific code.
Andrej Karpathy wrote about verifiability. The frontier labs are optimizing for code. Amjad Masad is betting Replit's future on it.
While most companies are still building interpreters, we are building the code factory.
That is our strategy. That is our edge. That is our belief.
XY.AI Labs — Palo Alto, CA December 2025
Appendix: Competitive Positioning
| Approach | Examples | Strengths | Limitations | XY.AI Advantage |
|---|---|---|---|---|
| Conversational Agents | AutoGen, CrewAI | Flexible, open-ended tasks | No pre-deployment validation | Compiled artifacts, 4-stage validation |
| Graph Orchestration | LangGraph | Rapid prototyping, rich ecosystem | Runtime interpretation | Generated code, Temporal-native |
| No-Code Platforms | Zapier, Make | Simple automations, many integrations | Limited custom logic | Customer-specific code generation |
| Raw Temporal | DIY | Full control, proven reliability | Requires manual activity coding | Auto-generated activities from YAML |
Appendix: Key Quotes
"Coding will be a core competency for every AI agent—because writing code is how agents use computers best."
"We're seeing customers have three orders of magnitude of savings on apps."
"Your customer is no longer the developer—your customer is the agent that the developer is wielding."
"AI is only going to develop at the speed of trust in healthcare."
