
How XY.AI Labs Built Customer-Specific EOB Parsers with Serverless Fine-Tuning


XY.AI Labs Team

Published on February 6, 2026


  • 77% → 87% EOB parsing accuracy
  • 2–3× lower infrastructure costs
  • Multiple fine-tuning iterations per day

Executive Summary


XY.AI builds automated end-to-end workflows for mid-sized healthcare providers, targeting the friction that slows operations—including revenue cycle management. To operate at scale amid highly variable healthcare data formats, the team needed specialized models their existing engineers could train, evaluate, and maintain without standing up a dedicated AI infrastructure function.


Together AI's serverless fine-tuning platform gave XY.AI a plug-and-play workflow for training Qwen 2.5 14B models securely on customer-specific data using fully managed infrastructure. Iteration velocity accelerated from weekly experiments to multiple runs per day, infrastructure costs dropped 2–3×, and EOB parsing accuracy improved from 77% to 87%.


About XY.AI


XY.AI automates complex workflows for healthcare providers and operators, eliminating manual transcription, data entry, and cross-system reconciliation that slow operations including revenue cycle management. Lamara De Brouwer—who combines a background in psychology and computer science—founded the company alongside serial entrepreneur Sam De Brouwer. The company closed its seed round in mid-2024.


The platform combines multimodal browser-level automation with backend AI to bridge disconnected healthcare portals. A typical workflow extracts data from a payer portal, applies customer-specific business logic, and populates downstream systems for submission or reconciliation. XY.AI supports customers across the healthcare ecosystem, including small and medium-sized providers, RCM organizations, EHRs, billing systems, and more.


The Challenge


Building specialized models for healthcare data parsing required solving three interrelated problems:


Infrastructure Friction Throttled Iteration


Initially, XY.AI managed its own fine-tuning and serving stack using Unsloth and vLLM. Engineers provisioned GPU instances, tuned serving parameters, and debugged model-specific optimizations by hand. Each new model architecture required additional research and bespoke serving configuration. This overhead capped experimentation at one or two training runs per week—forcing a seed-stage team to spend scarce engineering time on infrastructure rather than evaluation, data design, and domain-specific improvements.


Healthcare Data Varies by Provider


Explanation of Benefits (EOB) documents are notoriously inconsistent across the healthcare system. Payers use wildly different formats: Some summarize claims at a high level, while others itemize every adjustment and payment detail. Large or merged healthcare organizations often operate multiple schemas simultaneously—sometimes five or more in parallel. While EOBs are a concrete example, this pattern of structural variance appears broadly across healthcare data. Parsing this variability into reliable structured JSON requires rapid iteration on prompts, data formatting, and hyperparameters.


Per-Customer Models Require Scalable Economics


XY.AI's product strategy depends on customer-specific fine-tuning rather than one-size-fits-all deployments. Scaling this approach on self-hosted infrastructure would require a dedicated AI platform team—economically impractical at seed stage. Although individual training runs were inexpensive, managing GPU clusters and serving endpoints for dozens of customers threatened to overwhelm both cost structure and operational capacity.


The Solution


Together AI turned fine-tuning, evaluation, and deployment into a single repeatable loop, from experimentation to a testable endpoint, significantly improving XY.AI's iteration cadence.


XY.AI migrated from its self-hosted stack to the Together Fine-Tuning Platform, replacing custom infrastructure work with a standard API workflow. The team standardized on Qwen 2.5 14B as the base model, training LoRA adapters for structured EOB extraction. On Together's managed infrastructure, training runs complete in 10–20 minutes at roughly $10 per run. Multiple times a day, the team submits a job, receives an endpoint serving the new adapter, and evaluates it.
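To give a concrete flavor of this workflow, the sketch below prepares supervised EOB examples in the chat-style JSONL format that serverless fine-tuning platforms typically accept (one example per line). This is a generic illustration, not XY.AI's actual pipeline; the system prompt, field names, and file path are all hypothetical.

```python
import json

def build_training_record(eob_text: str, parsed_json: dict) -> dict:
    """One supervised example: raw EOB text in, structured JSON out."""
    return {
        "messages": [
            {"role": "system",
             "content": "Extract the EOB into structured JSON."},
            {"role": "user", "content": eob_text},
            {"role": "assistant",
             "content": json.dumps(parsed_json, sort_keys=True)},
        ]
    }

def write_jsonl(records, path):
    """Serialize one training example per line, as JSONL requires."""
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
```

The resulting file is what gets uploaded when submitting a fine-tuning job; the exact schema accepted varies by platform and should be checked against the provider's API reference.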


Evaluation Metrics


Evaluation includes the following metrics:

  • Generation quality: exact match, response rate
  • Field accuracy: across 11 fields (plus overall field accuracy)
  • Calibration: ECE and accuracy–retention curves
  • Routing tables: mapping confidence → expected accuracy → retention

Engineers iterate on data formatting, prompts, hyperparameters, and post-processing in the same loop.
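The calibration metrics above can be computed in a few lines of NumPy. This is a generic sketch of ECE and an accuracy–retention table, not XY.AI's evaluation code; the bin count and thresholds are illustrative.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """ECE: bin-weighted gap between mean confidence and accuracy per bin."""
    conf = np.asarray(conf)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece

def accuracy_retention(conf, correct, thresholds):
    """For each threshold: fraction auto-processed and accuracy on that subset."""
    conf = np.asarray(conf)
    correct = np.asarray(correct, dtype=float)
    rows = []
    for t in thresholds:
        kept = conf >= t
        rows.append({
            "threshold": t,
            "retention": float(kept.mean()),
            "accuracy": float(correct[kept].mean()) if kept.any() else float("nan"),
        })
    return rows
```

The second function is exactly the shape of the "confidence → expected accuracy → retention" routing tables listed above: sweep thresholds on held-out data and pick the one that hits the target accuracy at acceptable retention.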


Confidence-Based Routing in Production


This loop supports a confidence-based routing layer in production. Using token-level log probabilities, the inference service flags predictions that clear a calibrated threshold—about 50% of cases—and processes them automatically with ~95% accuracy on that subgroup. Lower-confidence predictions route to human review as a safety check and generate feedback data for future training. Confidence thresholds are validated against previously unseen evaluation data, so the routing policy stays aligned with empirical accuracy over time.
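A minimal sketch of that routing decision, assuming the serving API returns per-token log probabilities (as many inference APIs do). Here sequence confidence is the geometric mean of token probabilities; the 0.9 threshold is purely illustrative, whereas XY.AI calibrates theirs against held-out data as described above.

```python
import math

def sequence_confidence(token_logprobs):
    """Geometric-mean token probability: exp(mean of token logprobs)."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def route(token_logprobs, threshold=0.9):
    """Auto-process high-confidence parses; send the rest to human review."""
    conf = sequence_confidence(token_logprobs)
    return ("auto", conf) if conf >= threshold else ("human_review", conf)
```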


Multi-Tenancy Architecture


This setup also handles multi-tenancy. Because EOB formats vary by provider, XY.AI trains and validates a LoRA adapter per customer. Together's managed multi-LoRA serverless capability makes loading and unloading adapters straightforward and keeps inference cost-effective, while isolating customer traffic. With this architecture, a three-person engineering team efficiently maintains distinct production models across many customers, without dedicated per-customer GPU deployments.
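Multi-LoRA serving typically exposes each adapter under its own model name, so per-customer routing reduces to a registry lookup at request time. The tenant IDs and adapter names below are hypothetical placeholders, not real deployments.

```python
# Hypothetical registry mapping tenant IDs to their promoted LoRA adapters.
ADAPTERS = {
    "customer_a": "qwen2.5-14b-eob-customer-a-v3",
    "customer_b": "qwen2.5-14b-eob-customer-b-v1",
}

def adapter_for(customer_id: str) -> str:
    """Resolve the serving model for a tenant; fail loudly on unknown tenants
    so one customer's traffic can never fall through to another's adapter."""
    try:
        return ADAPTERS[customer_id]
    except KeyError:
        raise ValueError(f"no promoted adapter for tenant {customer_id!r}")
```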


Results


Rapid Iteration Drives Model Quality


Training velocity increased from one or two experiments per week to multiple iterations per day. This pace enabled systematic optimization of data formats, prompts, and calibration logic, driving EOB-to-JSON parsing accuracy from 77% to 87% when measured against expert human baselines.


Production Deployment with Human-in-the-Loop Architecture


In production, high-confidence predictions are processed automatically while ambiguous cases route to human review. More than half of incoming EOBs clear the confidence threshold and bypass manual handling. Over time, human feedback continuously reduces review volume through retraining while maintaining strict accuracy guarantees.


Infrastructure Cost Avoidance


Together's fully managed training and serving eliminated the need for a dedicated AI infrastructure hire. XY.AI's existing team handles all model development, while infrastructure costs dropped 2–3× compared to the prior self-hosted setup—while enabling significantly higher experimentation throughput.


"Together AI does for fine-tuning and inference what Vercel does for LLM-based apps—it removes the infrastructure layer so we can focus on our product. We fine-tune and deploy customer-specific models through simple API calls. That lets our existing team move from weekly to daily iteration, cut costs by 2–3×, and improve accuracy from 77% to 87%."
— Lamara De Brouwer, Co-Founder & CTO, XY.AI Labs

Technical Architecture Overview


The system architecture spans three phases: training, offline evaluation & calibration, and production with live traffic monitoring.


Phase | Components | Key Features
Training | Raw EOB Data, Data Prep, Fine-Tuning | X12/835 claims; de-identification; Qwen 2.5 14B with LoRA adapters
Evaluation | Held-out Eval, Calibration & Routing Tables | ECE + accuracy–retention curves; confidence thresholds
Production | Serverless Endpoint, Confidence Routing, Monitoring | Auto vs. human review routing; drift detection; feedback loop

Key Workflow Steps


  1. Data Preparation: Raw EOB data (X12/835, claims) undergoes de-identification, formatting, and held-out split creation
  2. Fine-Tuning: Qwen 2.5 14B base model trained with customer-specific LoRA adapters on Together's managed infrastructure
  3. Evaluation: Staged endpoint tested against held-out data for exact match, field accuracy, and calibration metrics
  4. Promotion: Adapters that pass held-out thresholds are promoted to production
  5. Production Routing: Token logprobs generate confidence scores; high-confidence predictions auto-process, low-confidence routes to human review
  6. Feedback Loop: Human-reviewed cases generate curated feedback labels for future training iterations
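The promotion gate in step 4 can be expressed as a simple threshold check over held-out metrics. The metric names and floor values below are illustrative assumptions, not XY.AI's actual gate.

```python
def should_promote(metrics: dict, thresholds: dict) -> bool:
    """Promote an adapter only if every gated held-out metric meets or
    beats its floor (higher is assumed better); missing metrics fail."""
    return all(metrics.get(name, 0.0) >= floor
               for name, floor in thresholds.items())

# Illustrative gate: both floors must clear before production promotion.
GATE = {"field_accuracy": 0.85, "exact_match": 0.75}
```

Treating promotion as an explicit, versioned check like this keeps the routing policy honest: an adapter that regresses on held-out data never reaches production traffic.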

Key Takeaways


  • Serverless fine-tuning eliminates infrastructure overhead, enabling faster iteration cycles
  • Per-customer LoRA adapters allow multi-tenancy without dedicated GPU deployments
  • Confidence-based routing balances automation efficiency with accuracy guarantees
  • Human-in-the-loop architecture generates continuous improvement data
  • Calibrated thresholds ensure routing policies remain aligned with real-world performance

This case study was originally published by Together AI. Read the full case study at together.ai/customers/xy-ai-labs.


