When I started experimenting with a fun side project, I did what most people do: picked one model, hardcoded it everywhere, and shipped it. AsyncAnthropic() at every call site. One MODEL_NAME env var. Done.
It worked — until it didn't.
The failure wasn't dramatic. There was no production incident, no page at 3am. It was subtler: a slow accumulation of wrongness. Cost reports that didn't make sense. Latency numbers that were acceptable but not great. The creeping feeling that we were paying Sonnet prices for jobs that didn't need Sonnet quality, and paying for Sonnet speed on jobs where speed didn't matter at all.
The problem wasn't capability. Modern frontier models are genuinely remarkable across a wide range of tasks. The problem is that using one model for fundamentally different workloads is like using a Formula 1 car for your grocery run. Technically possible. Expensive. And completely wrong for the job.
Here is how we split them, and why it mattered.
the two jobs
The clearer I got about what was actually happening in our stack, the more obvious the split became. We had two distinct LLM-consuming layers with almost perfectly opposite requirements.
The first is the user-facing chat path — what we call the BFF, Backend for Frontend. A user types something, expects a response in under a second, and needs rich instruction-following: structured tool calls, SSE streaming, nuanced output that doesn't feel robotic. Latency is visible here. Quality is visible here. Users will tell you when either is off. This layer runs Claude Sonnet 4.6.
The second is the background processing pipeline — analysis jobs, draft generation, agent orchestration. These jobs land in a queue, run in seconds or minutes, and return JSON. No one is watching a spinner. No one is waiting. What you want is throughput, reasoning depth, and cost efficiency. This layer runs Grok 4 Fast Reasoning.
The insight, once you see it, is hard to unsee: user-facing completions are latency-sensitive and quality-visible. Batch completions are throughput-sensitive and cost-visible. They optimize for entirely different dimensions. Pretending they are the same problem — routing them to the same model with the same billing rate — costs you money, latency, or both. Usually both.
"The failure wasn't dramatic. It was a slow accumulation of wrongness — cost reports that didn't make sense, latency that was acceptable but not great, and the creeping feeling we were paying for things we didn't need."
what we found in the codebase
Before writing a single line of new code, we audited what already existed. Codebases have a way of encoding institutional knowledge that no one ever wrote down, and this one was no different.
The agents layer already had provider routing via a _make_llm_client() function. The logic was dead simple:
```python
def _make_llm_client(model: str):
    if model.startswith("claude-"):
        return ChatAnthropic(model=model)
    return ChatXAI(model=model)
```
Model name prefix determines the provider. No registry, no config file, no abstraction layer. Just a string check. The agents layer had been doing this for weeks, quietly, without fanfare. Someone on the team had made a pragmatic decision and moved on.
The repl module had ChatXAI instantiated directly. Two call sites, two different patterns — evidence that the codebase was already converging on xAI for backend work, just inconsistently. No one had connected the dots and made it deliberate.
The gap was in the analysis and drafts routers: both directly instantiated AsyncAnthropic() with no abstraction. Switching providers required code changes. The kind of code change that feels routine until you do it six times and realize you need a different approach.
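For context, those un-abstracted call sites looked roughly like the sketch below. The handler name and prompt are hypothetical; the point is the shape — provider hardcoded at the call site, model pulled straight from the single env var.

```python
import os

from anthropic import AsyncAnthropic


# Roughly the pre-refactor pattern (handler and prompt are illustrative).
async def analyze_document(document: str) -> str:
    client = AsyncAnthropic()  # provider choice baked into the router
    response = await client.messages.create(
        model=os.getenv("MODEL_NAME", "claude-haiku-4-5-20251001"),
        max_tokens=2048,
        system="You are an analysis assistant.",
        messages=[{"role": "user", "content": document}],
    )
    return response.content[0].text
```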
What the audit told us was not that the codebase was broken — it was that it was converging on the right answer through organic pressure. Our job was to make that convergence explicit and durable.
the implementation
We built two modules: llm_policy.py and llm_client.py. One decides which model to use. The other routes the completion to the right provider. Clean separation of concerns.
llm_policy.py resolves which model to use for a given call site:
```python
def resolve_model(env_key: str, logger: Logger) -> str:
    model = (
        os.getenv(env_key)
        or os.getenv("MODEL_NAME")
        or "claude-haiku-4-5-20251001"
    )
    if RESTRICT_MODELS and model in _DENIED_MODELS:
        logger.warning(
            f"Model {model} blocked by RESTRICT_MODELS (env_key={env_key})"
        )
        model = "claude-haiku-4-5-20251001"
    return model
```
Two named env vars drive the split: CGX_FRONTEND_MODEL_NAME for the BFF chat path, and PROCESSING_BACKEND_MODEL_NAME for analysis, drafts, and the agents orchestration layer. MODEL_NAME survives as a legacy fallback during migration — no need to break anything while you transition. Haiku is the cost-safe default if nothing is set.
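In practice, each path resolves its model with its own key. The call sites below are hypothetical, but the env var names match the ones above:

```python
# Chat/BFF path resolves against the frontend key; background jobs resolve
# against the processing key. Both fall back to MODEL_NAME, then to Haiku.
frontend_model = resolve_model("CGX_FRONTEND_MODEL_NAME", logger)
backend_model = resolve_model("PROCESSING_BACKEND_MODEL_NAME", logger)
```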
The RESTRICT_MODELS=true flag (on by default) blocks the entire Sonnet and Opus family in dev and staging environments. This is the guardrail I am most proud of. Dev environments accumulate expensive habits. You run a test, it hits a frontier model, the cost is invisible in the moment, and then you wonder why the bill is weird at the end of the month. Blocking by default makes the expensive choice deliberate.
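A minimal sketch of how that guardrail could be wired, assuming the names used in resolve_model above. The entries in the denylist are placeholders, not our real list; what matters is that blocking is the default and has to be switched off deliberately.

```python
import os

# Opt-out, not opt-in: restriction is on unless explicitly disabled.
RESTRICT_MODELS = os.getenv("RESTRICT_MODELS", "true").lower() == "true"

# Placeholder entries for the Sonnet/Opus family — illustrative, not exhaustive.
_DENIED_MODELS = {
    "claude-sonnet-4-6",
    "claude-opus-4-1",
}
```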
llm_client.py routes completions to the right provider:
```python
async def simple_complete(
    model: str, system: str, user_prompt: str,
    max_tokens: int, logger: Logger
) -> str:
    if model.startswith("claude-"):
        return await _anthropic_complete(
            model, system, user_prompt, max_tokens
        )
    return await _xai_complete(
        model, system, user_prompt, max_tokens, logger
    )
```
Same prefix heuristic as the agents layer, now applied uniformly. AsyncAnthropic for Claude models, ChatXAI via LangChain for everything else. xAI imports are deferred inside _xai_complete() so there is no import cost on the Anthropic path — a small thing, but the kind of small thing that matters in a long-running service.
Error handling is standardized across both paths: missing API key returns a 503, upstream failure returns a 502. Routers stay thin. The distinction between "we can't reach the provider" and "the provider rejected the request" is meaningful and now consistent.
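To make that concrete, here is a sketch of what the xAI branch could look like under those rules — assuming a FastAPI-style HTTPException for the 503/502 mapping and LangChain's ChatXAI client. It is a sketch of the pattern, not the exact implementation:

```python
import os
from logging import Logger

from fastapi import HTTPException


async def _xai_complete(
    model: str, system: str, user_prompt: str,
    max_tokens: int, logger: Logger
) -> str:
    # Deferred import: the Anthropic-only path never pays for this.
    from langchain_xai import ChatXAI

    if not os.getenv("XAI_API_KEY"):
        # "We can't reach the provider" — misconfiguration, not an upstream fault.
        raise HTTPException(status_code=503, detail="xAI API key not configured")
    try:
        client = ChatXAI(model=model, max_tokens=max_tokens)
        response = await client.ainvoke(
            [("system", system), ("human", user_prompt)]
        )
        return response.content
    except Exception as exc:
        # "The provider rejected the request" — upstream failure maps to 502.
        logger.error(f"xAI completion failed: {exc}")
        raise HTTPException(status_code=502, detail="Upstream completion failed") from exc
```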
One explicit carve-out: chat streaming is excluded from simple_complete. SSE keepalives and tool-use interleaving in streaming mode are complex enough to warrant their own path. chat/core.py stays Anthropic-native for now. This is intentional, not a gap. Streaming is a different contract with a different failure surface, and collapsing it into a general-purpose completion function would create more problems than it solves.
the cost tracking wrinkle
Multiple providers means multiple rate cards, and that creates a tracking problem that is easy to underestimate. When everything runs on one provider, cost attribution is straightforward. When you add a second, the math gets slippery — especially if some models are brand new and you don't have a rate card yet.
We solved this with a single function in the shared core:
```python
def compute_cost_usd(
    model: str, input_tokens: int, output_tokens: int
) -> tuple[float, bool]:
    rates = (
        RATES.get(model)
        or RATES.get(DEFAULT_MODEL)
        or FALLBACK_RATES
    )
    pricing_unknown = model not in RATES
    cost = (
        input_tokens * rates[0] + output_tokens * rates[1]
    ) / 1_000_000
    return cost, pricing_unknown
```
The pricing_unknown boolean is the key design choice. It lets you distinguish "this model costs $0.00" from "we have no rate card for this model." Both get stored in the database. Dashboard queries can filter on pricing_unknown=false to get accurate cost reports, and flag the rest for review.
This sounds like a small thing, but it is the difference between a cost dashboard you can trust and one you can only squint at. The moment you add a new model mid-quarter, all the rows without rate cards become noise unless you have a way to mark them as intentionally unknown. The flag makes the unknown explicit.
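To make the shape concrete, here is a hypothetical rate table and a call against a model that has no entry yet. The figures are placeholders, not actual rate cards:

```python
# RATES maps model name -> (input_rate, output_rate) in USD per million tokens.
# All numbers below are illustrative placeholders.
RATES = {
    "claude-sonnet-4-6": (3.00, 15.00),
    "grok-4-fast-reasoning": (0.20, 0.50),
}
DEFAULT_MODEL = "claude-haiku-4-5-20251001"
FALLBACK_RATES = (1.00, 5.00)

cost, pricing_unknown = compute_cost_usd("brand-new-model", 12_000, 3_500)
# pricing_unknown is True: "brand-new-model" has no rate card, so the cost
# came from a fallback and the row should be flagged for review, not trusted.
```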
what shipped
After the refactor, switching from Claude to Grok for backend processing is a one-line environment change:
```
PROCESSING_BACKEND_MODEL_NAME=grok-4-fast-reasoning
```
No code changes. No deploys beyond config. The production split looks like this:
| Layer | Model | Why |
|---|---|---|
| Chat (BFF) | claude-sonnet-4-6 | Latency, tool use, SSE streaming |
| Analysis + Drafts | grok-4-fast-reasoning | Throughput, reasoning depth, cost |
| Agents pipeline | grok-4-fast-reasoning | Same as above |
| Default (safe) | claude-haiku-4-5-20251001 | Cost floor for dev and staging |
The thing I keep coming back to is how invisible the split is once it is in place. The calling code does not know which provider it is hitting. The error handling is identical. The cost is tracked the same way. The abstraction is thin enough to see through when you need to debug, but thick enough to absorb provider changes without touching business logic.
That is the goal of any good infrastructure layer. Not to be clever. To be boring in the right ways.
when to do this
Not every AI product needs a model split. Most early-stage products should run on one model, one env var, and ship fast. Premature abstraction in infrastructure is a real tax.
But there are three signals that tell you it is time.
The first is when you have a user-facing completion path and a background processing path that have been running long enough to have visible cost and latency profiles. These jobs have different SLAs by definition. The moment you can articulate the difference, you are ready to split.
The second is when cost-per-request starts showing up in conversations that are not engineering conversations. When a finance review or a product prioritization meeting includes "our model costs are higher than expected," the split has already paid for itself in the time it buys you to investigate.
The third is when a new model ships and you want to evaluate it for specific workloads without a rewrite. The prefix-routing pattern costs almost nothing to implement and makes that evaluation a config change instead of a project. New models are shipping fast right now. The flexibility compounds.
The rule I would give anyone: do not do this prematurely. One model, one env var, ship it. Add the split when you can name the specific tradeoff you are trying to capture — not before. The architecture should follow the understanding, not precede it.
"The abstraction should be thin enough to see through when you need to debug, but thick enough to absorb provider changes without touching business logic. That is the goal of any good infrastructure layer. Not to be clever. To be boring in the right ways."