Introduction
AI features don’t fail because the model is bad. They fail because the bill shows up before the value does.
If you run a SaaS product, you already know the pattern:
- A prototype looks cheap in dev.
- Usage grows, prompts get longer, and latency creeps up.
- Finance asks why inference spend doubled.
- Engineering asks why the model is suddenly “worse” (it’s usually context bloat, not magic).
This article is a CTO level playbook for managing AI costs in SaaS without killing product quality. We’ll cover token budgets, caching, batching, and model routing. We’ll also talk about the parts teams skip: ownership, guardrails, and compliance.
Insight: Most AI cost problems are not model problems. They’re product and architecture problems that show up as a model invoice.
What you should get out of this:
- A way to forecast and cap spend before you scale usage
- Concrete patterns that reduce tokens and latency
- A routing approach that keeps quality where it matters
- A rollout plan that doesn’t torch trust with users or your own team
What we mean by “AI cost” (so we don’t argue later)
When people say “AI cost,” they usually mean the model invoice. For CTO planning, that’s too narrow.
Track cost as a stack:
- Inference cost: tokens in, tokens out, tool calls, embeddings, reranks
- Latency cost: timeouts, retries, user drop off, support tickets
- Engineering cost: prompt churn, brittle integrations, lack of evals
- Compliance cost: data retention, access controls, audit trails
A useful framing is cost per successful outcome, not cost per request. If the model is cheap but users abandon the flow, you didn’t save money.
Cost control checklist for CTOs
Print it, paste it in your backlog
- Add per endpoint token, latency, and outcome telemetry
- Create token budgets by tier and endpoint
- Version prompts and cache keys
- Implement tenant scoped caching with TTL
- Batch embeddings and background tasks
- Add routing with clear risk rules
- Build eval sets for critical flows
- Document data retention and access controls
- Add circuit breakers for retries and incidents
The cost curve: where SaaS teams get surprised
AI spend scales in ways typical SaaS infra does not. You can autoscale servers. You cannot autoscale your way out of sending 12k tokens of context on every request.
Common failure modes we see when teams ship AI features fast:
- Prompts grow with every edge case. Nobody deletes anything.
- “Temporary” logging becomes permanent. Suddenly you store sensitive data.
- One model is used for everything. The expensive one.
- Retries multiply spend during incidents.
- Product adds “just one more source” to RAG. Recall improves. Precision tanks.
Key Stat: 76% of consumers get frustrated when organizations fail to deliver personalized interactions.
That stat cuts both ways. Personalization pushes teams to add more context. More context pushes token bills up. You need a budget.
The guardrails that matter, at a glance:

| Feature | What it is | Why it matters |
|---|---|---|
| Cost guardrails | Hard caps per user, per workspace, per day | Prevents surprise invoices and abuse |
| Quality gates | Offline evals and golden sets for critical flows | Routing and compression can degrade output silently |
| Observability | Token, latency, and outcome metrics per endpoint | You can't optimize what you can't see |
| Incident controls | Backoff, circuit breakers, degraded mode | Retries can double spend during outages |
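The incident controls row deserves a sketch, because retries are the quietest way to double a bill during an outage. Here is a minimal backoff-plus-circuit-breaker pattern, assuming call wraps your model request; the thresholds, delays, and class names are placeholders, not a prescription.

import time

def call_with_retry_budget(call, max_attempts=3, base_delay=0.5, breaker=None):
    # Exponential backoff with a hard attempt cap so retries cannot
    # multiply spend unbounded during an incident.
    for attempt in range(max_attempts):
        if breaker is not None and not breaker.allow():
            raise RuntimeError("circuit open: serve degraded mode instead")
        try:
            result = call()
            if breaker is not None:
                breaker.record_success()
            return result
        except Exception:
            if breaker is not None:
                breaker.record_failure()
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...

class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=30):
        self._failures = 0
        self._threshold = failure_threshold
        self._cooldown = cooldown_seconds
        self._opened_at = None

    def allow(self):
        if self._opened_at is None:
            return True
        # Half-open after the cooldown: let a probe request through
        return time.monotonic() - self._opened_at >= self._cooldown

    def record_failure(self):
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = time.monotonic()

    def record_success(self):
        self._failures = 0
        self._opened_at = None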
Scalability is not just throughput. It’s prompt growth.
In SaaS, usage growth usually means more requests. In AI, it often means:
- More requests
- Longer requests
- Longer responses
- More tool calls per request
So your cost curve bends upward. If you don’t put a token budget in front of the system, the prompt becomes the backlog. And it grows forever.
Team reality: who owns cost and quality?
If “the AI team” owns everything, you get a bottleneck. If nobody owns it, you get chaos.
A workable split we’ve used on AI heavy internal tools (like the Mobegí Slack bot architecture we wrote about) is:
- Platform team owns: routing, caching, telemetry, compliance controls
- Product teams own: prompts, UX, acceptance criteria, eval cases
- Security owns: data classification, retention, vendor review
That gives you speed without creating an unreviewable prompt jungle.
What we track in production
Cost and quality signals that make optimization real:
- Hypothesized savings from routing, validated with A/B tests per endpoint
- The latency percentiles to watch, split by cache hit vs. miss
- Core metrics per endpoint: tokens, latency, outcome
A note on team and hiring
You don’t need a big AI team. You need clear ownership.
If you’re scaling post MVP, avoid the trap of hiring one “LLM person” and making them responsible for everything. What works better:
- One platform oriented engineer who owns routing, telemetry, and reliability
- Product engineers who own prompts and UX, with review guidelines
- A security partner who reviews data flows early
This mirrors how we’ve seen SaaS teams mature: move from ad hoc decisions to repeatable systems, without losing shipping speed.
Token budgets: the boring control that saves you
Token budgets sound like finance. They’re actually architecture.
A token budget is a set of limits and tradeoffs you decide upfront:
- Max input tokens per request (context window safety)
- Max output tokens per response (runaway verbosity)
- Max tool calls per request (agent loops)
- Max retrieval chunks (RAG bloat)
Then you enforce them. In code.
Insight: If you don’t set a budget, the model will. And it will set it by timing out or getting expensive.
The process, step by step:
- Define success for the endpoint (what counts as a "good" answer?)
- Measure baseline tokens and latency with real traffic samples
- Set hard caps (input, output, tool calls)
- Add soft strategies (summarize, compress, drop low value context)
- Create a degraded mode for when caps are hit (short answer, a clarifying question, or a handoff)
Here’s a simple pattern for enforcing budgets and logging what mattered.
def run_llm(request, user_tier, telemetry):
    # Look up the hard caps for this tier before building any context
    budget = budgets.for_tier(user_tier)

    # Build context, then enforce the input cap by truncating or summarizing
    context = build_context(request)
    context = truncate_or_summarize(context, max_tokens=budget.max_input_tokens)

    response = llm.generate(
        prompt=context,
        max_output_tokens=budget.max_output_tokens,
        max_tool_calls=budget.max_tool_calls,
        timeout_ms=budget.timeout_ms,
    )

    # Log what actually happened so you can see which cap is hurting quality
    telemetry.log({
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "tool_calls": response.usage.tool_calls,
        "endpoint": request.endpoint,
        "tier": user_tier,
        "outcome": classify_outcome(response),
    })
    return response
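The snippet above assumes a budgets lookup and a truncate_or_summarize helper that the article doesn't spell out. A minimal sketch of what they could look like follows; the tier names, caps, and whitespace token counter are illustrative placeholders, not recommendations.

from dataclasses import dataclass

@dataclass(frozen=True)
class Budget:
    max_input_tokens: int
    max_output_tokens: int
    max_tool_calls: int
    timeout_ms: int

class budgets:
    # Illustrative caps per tier; derive real numbers from baseline traffic
    _BY_TIER = {
        "free": Budget(4_000, 400, 1, 10_000),
        "pro": Budget(12_000, 800, 3, 20_000),
        "enterprise": Budget(24_000, 1_200, 5, 30_000),
    }

    @classmethod
    def for_tier(cls, tier: str) -> Budget:
        return cls._BY_TIER.get(tier, cls._BY_TIER["free"])

def count_tokens(text: str) -> int:
    # Placeholder: swap in your model's tokenizer (tiktoken or similar)
    return len(text.split())

def truncate_or_summarize(context: str, max_tokens: int) -> str:
    # Soft strategy: drop the oldest lines first; replace with a summarizer
    # for flows where long history actually matters
    if count_tokens(context) <= max_tokens:
        return context
    lines = context.splitlines()
    while lines and count_tokens("\n".join(lines)) > max_tokens:
        lines.pop(0)
    return "\n".join(lines)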
What to budget by (pick one, then add the rest):
- By endpoint (best starting point)
- By workspace (B2B fairness)
- By user tier (pricing alignment)
- By feature flag (safe rollout)
What a budget buys you:
- Predictable spend per customer segment
- Faster debugging when costs spike
- Cleaner prompts because you’re forced to choose
- Better product decisions because tradeoffs are explicit
Context compression: what works and what backfires
Compression is where teams get clever and then get burned.
What tends to work:
- Summarize conversation history into a running state
- Extract structured facts (entities, preferences, constraints)
- Keep a short “policy” prompt and move the rest into tools
What often backfires:
- Aggressive summarization without evals (hallucinations increase)
- Dropping “boring” system instructions (safety regressions)
- Over stuffing retrieval chunks (precision drops, tokens climb)
Hypothesis to validate in your product:
- Measure answer success rate vs input tokens. Many flows have a sweet spot where more context stops helping.
Track:
- success rate (human label or proxy)
- tokens per success
- p95 latency
- escalation rate to support or human handoff
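If you log the fields from the budget snippet earlier (input_tokens, output_tokens, outcome), the sweet-spot check is a small rollup. A sketch, where the event shape and the "success" label are assumptions carried over from that snippet:

from collections import defaultdict

def success_by_token_bucket(events, bucket_size=2_000):
    # Group logged requests into input-token buckets, then compute
    # success rate and tokens per success for each bucket.
    buckets = defaultdict(lambda: {"requests": 0, "successes": 0, "tokens": 0})
    for e in events:
        b = (e["input_tokens"] // bucket_size) * bucket_size
        buckets[b]["requests"] += 1
        buckets[b]["tokens"] += e["input_tokens"] + e["output_tokens"]
        if e["outcome"] == "success":
            buckets[b]["successes"] += 1
    report = {}
    for b, stats in sorted(buckets.items()):
        report[b] = {
            "success_rate": stats["successes"] / stats["requests"],
            "tokens_per_success": stats["tokens"] / max(stats["successes"], 1),
        }
    return report

If success rate flattens while tokens per success keeps climbing, extra context has stopped paying for itself.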
Caching and batching: make the model do less work
If you want a fast cost win, start here. Most SaaS workloads have repetition.
Two patterns matter:
- Caching: reuse a previous result instead of calling the model
- Batching: combine multiple small calls into one request
Example: In internal assistant style systems like Mobegí, the same questions come up every week. Office hours. Policies. Where to find a doc. Caching those answers (with an expiry) is not fancy, but it’s effective.
Caching strategies that hold up in production
Use caching where the answer is stable enough.
- Prompt response cache
- Key: normalized prompt plus model version plus system prompt hash
- Good for: deterministic tasks, repeated internal queries
- Risk: stale or wrong answers if policy changes
- Embedding cache
- Key: content hash
- Good for: RAG pipelines where docs don’t change often
- Risk: forgetting to invalidate on doc updates
- Tool result cache
- Key: tool name plus params
- Good for: expensive database lookups, CRM reads
- Risk: caching sensitive results across tenants
A practical rule:
- Cache within a tenant by default
- Cache across tenants only for public, non sensitive content
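One way to keep that rule from depending on discipline is to bake the tenant ID into the key builder itself. A sketch, with an in-memory store standing in for whatever cache you actually run; the key layout and TTL are assumptions to adapt.

import hashlib
import json
import time

def cache_key(tenant_id, prompt, model_version, system_prompt_hash):
    # Normalize the prompt so trivial whitespace or casing differences still hit
    normalized = " ".join(prompt.lower().split())
    payload = json.dumps([tenant_id, normalized, model_version, system_prompt_hash])
    return "llm:" + hashlib.sha256(payload.encode()).hexdigest()

class TenantCache:
    # In-memory stand-in; back this with Redis or similar in production
    def __init__(self, ttl_seconds=3600):
        self._store = {}
        self._ttl = ttl_seconds

    def get(self, key):
        entry = self._store.get(key)
        if entry is None or entry[0] < time.time():
            return None  # miss or expired
        return entry[1]

    def set(self, key, value):
        self._store[key] = (time.time() + self._ttl, value)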
Batching without breaking latency
Batching helps most when you do many small calls:
- embedding generation
- reranking
- classification
- moderation
But batching can hurt UX if you wait too long.
A typical approach:
- Batch on the server for a short window (for example 10 to 50 ms)
- Cap batch size
- Prioritize interactive requests over background jobs
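A bare-bones version of that short server-side window, aimed at background work like embedding ingestion: flush on whichever comes first, batch size or the wait window. The flush function is a placeholder for your provider's bulk call, and the defaults mirror the 10 to 50 ms guidance above.

import time

class MicroBatcher:
    def __init__(self, flush_fn, max_batch=64, max_wait_ms=50):
        self._flush_fn = flush_fn      # e.g. one bulk embeddings request
        self._max_batch = max_batch
        self._max_wait = max_wait_ms / 1000.0
        self._items = []
        self._first_added = None

    def add(self, item):
        if not self._items:
            self._first_added = time.monotonic()
        self._items.append(item)
        if len(self._items) >= self._max_batch:
            self.flush()

    def maybe_flush(self):
        # Call from a background loop so a lone item never waits forever
        if self._items and time.monotonic() - self._first_added >= self._max_wait:
            self.flush()

    def flush(self):
        if self._items:
            self._flush_fn(self._items)
            self._items = []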
Here’s a simple decision table.
| Technique | Best for | Main risk | Mitigation |
|---|---|---|---|
| Response caching | repeated questions | stale answers | TTL, versioned prompts, invalidation hooks |
| Embedding caching | stable documents | wrong retrieval | content hashing, re embed on change |
| Tool caching | expensive lookups | data leaks | tenant scoped keys, encryption, audit logs |
| Batching embeddings | high volume ingestion | added latency | micro batching window, async pipelines |
| Batching classifications | moderation, tagging | queue buildup | backpressure, drop low priority work |
FAQ
- Should we cache LLM outputs at all? Yes, but be strict about scope and expiry. Cache within a tenant first. Version by model and prompt. Add a kill switch.
- Won't caching hide model regressions? It can. Log cache hit rates and sample cache-bypass requests for evaluation.
- Is batching worth it for chat? Usually not for the main response. It is worth it for side tasks like embeddings, moderation, and intent classification.
What to measure (so caching doesn’t become a guessing game)
Caching and batching are only “wins” if you can see the tradeoff.
Track at minimum:
- cache hit rate by endpoint
- tokens saved per 1k requests
- p50 and p95 latency (cache hit vs miss)
- stale answer reports (user feedback tag)
- cost per successful outcome
Insight: A 60% cache hit rate is meaningless if the remaining 40% are the expensive edge cases that drive 90% of spend.
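Most of these roll up from the telemetry you are already logging, plus a cached flag on each event. A sketch of the rollup; the flat per-1k-token price is a stand-in for whatever your blended rate actually is.

def cache_report(events, price_per_1k_tokens=0.01):
    # Events are the same telemetry records as before, plus a "cached" flag
    hits = [e for e in events if e.get("cached")]
    misses = [e for e in events if not e.get("cached")]
    successes = [e for e in events if e["outcome"] == "success"]

    miss_tokens = sum(e["input_tokens"] + e["output_tokens"] for e in misses)
    avg_miss_tokens = miss_tokens / max(len(misses), 1)
    spend = miss_tokens / 1000 * price_per_1k_tokens  # hits cost roughly nothing

    return {
        "cache_hit_rate": len(hits) / max(len(events), 1),
        # Rough estimate: assume each hit would have cost about an average miss
        "tokens_saved_per_1k_requests": avg_miss_tokens * len(hits) / max(len(events), 1) * 1000,
        "cost_per_successful_outcome": spend / max(len(successes), 1),
    }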
Security and compliance guardrails that matter
AI cost work often touches sensitive data. Don’t bolt security on later.
- Data classification: label what can be sent to models and what cannot
- Tenant isolation: cache keys and vector stores must be tenant scoped
- Retention: log prompts and responses only as long as you need for debugging
- Access controls: restrict who can view traces and transcripts
- Audit trails: record model, prompt version, and routing decisions for investigations
If you operate in regulated environments, treat the AI layer like any other critical service. Zero trust principles still apply.
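An audit trail doesn't require a new system. One structured event per request, written wherever your other audit logs go, covers most investigations. A sketch of the fields worth capturing; the field names and logger wiring are assumptions, not a standard.

import json
import logging
import time
from uuid import uuid4

audit_log = logging.getLogger("ai.audit")

def record_ai_audit(tenant_id, user_id, endpoint, model, prompt_version,
                    route, cache_hit, outcome):
    # One structured line per request: enough to answer
    # "who saw what, through which model and prompt version"
    audit_log.info(json.dumps({
        "event_id": str(uuid4()),
        "ts": time.time(),
        "tenant_id": tenant_id,
        "user_id": user_id,            # or a pseudonymous ID if policy requires
        "endpoint": endpoint,
        "model": model,
        "prompt_version": prompt_version,
        "route": route,                # small / mid / large / handoff
        "cache_hit": cache_hit,
        "outcome": outcome,
    }))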
Model routing: pay for quality only where it matters
Routing is the CTO lever that aligns cost with product value.
Instead of “the model,” you run a portfolio:
- small model for classification and extraction
- mid model for most user facing responses
- large model for high stakes or complex reasoning
The mistake is routing on vibes. Use signals.
A routing matrix you can actually implement
Start with a few dimensions:
- Task type: classify, extract, generate, summarize
- Risk: user facing, compliance relevant, financial impact
- Complexity: context size, ambiguity, multi step tool use
- User tier: free vs paid
| Route | When to use | Why it saves money | What can go wrong |
|---|---|---|---|
| Small model | intent, tagging, moderation | cheap and fast | lower accuracy on edge cases |
| Mid model | standard chat, help content | good enough quality | can struggle with long context |
| Large model | escalations, complex reasoning | quality where it matters | can be slow and expensive |
| Human handoff | policy, legal, sensitive requests | avoids wrong answers | operational load |
Insight: Routing is not just cost control. It’s a quality control system. You’re deciding where mistakes are acceptable.
Implementation pattern: two stage with guardrails
A simple, robust approach:
- Stage 1: small model decides intent, risk, and complexity
- Stage 2: choose generation model and budget based on that decision
- Add fallbacks:
- if confidence is low, upgrade model
- if budget is exceeded, ask clarifying question
- if request is sensitive, refuse or hand off
type Route = "small" | "mid" | "large" | "handoff";

function routeRequest(intent: string, risk: number, complexity: number, tier: string): Route {
  // Highest-risk requests never go to a model at all
  if (risk >= 0.8) return "handoff";
  // Pay for the large model only when the task is genuinely complex
  if (complexity >= 0.7) return "large";
  // Free tier gets the small model for simple requests
  if (tier === "free" && complexity < 0.5) return "small";
  return "mid";
}

Where Apptension style delivery lessons show up
When we build SaaS products end to end, the routing logic ends up living next to other platform concerns: rate limits, tenancy, observability, and feature flags. That’s the right place for it.
It’s the same mindset as scaling a SaaS team post MVP. Early on you “do things that don’t scale.” Later you encode decisions into systems so the team doesn’t have to re argue them every sprint.
Routing failures you should expect
Routing is not free.
Expect these issues:
- Misroutes that degrade UX for power users
- Silent quality drops when you change prompts or models
- Hard to reproduce bugs because different users hit different paths
Mitigations that work in practice:
- Keep routing rules simple at first
- Log the route decision and inputs
- Maintain a golden set per endpoint
- Run A/B tests on routing changes
Hypothesis to validate:
- Routing can cut inference spend by 20 to 50% in many SaaS apps with mixed workloads. Measure it with a controlled experiment, not a spreadsheet estimate.
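A cheap way to start that measurement, before a full A/B test, is to replay the per-endpoint golden set through both routing configurations and compare quality and tokens side by side. A sketch, where the runner and grader callables are assumptions about your own eval harness:

def compare_routing(golden_set, run_with_config, grade):
    # golden_set: list of {"request": ..., "expected": ...}
    # run_with_config: callable(request, config_name) -> response with .usage
    # grade: callable(response, expected) -> bool
    results = {}
    for config in ("current", "candidate"):
        passed, tokens = 0, 0
        for case in golden_set:
            response = run_with_config(case["request"], config)
            tokens += response.usage.input_tokens + response.usage.output_tokens
            if grade(response, case["expected"]):
                passed += 1
        results[config] = {
            "pass_rate": passed / max(len(golden_set), 1),
            "tokens_per_case": tokens / max(len(golden_set), 1),
        }
    return results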
Conclusion
Managing AI costs in SaaS is not one trick. It’s a system.
You need budgets to stop prompt sprawl. You need caching and batching to avoid repeat work. You need routing so you only pay for the expensive model when the user benefit is real.
Insight: The best cost reduction is the one users never notice.
A practical next sprint plan:
- Instrument tokens, latency, and outcome per endpoint
- Set token budgets with hard caps and degraded modes
- Add tenant scoped caching for the top repeated requests
- Batch embeddings and background classifications
- Introduce two stage routing with a small model gate
- Add security controls: tenant isolation, retention rules, audit logs
Takeaways to share with your team:
- Cost is a product metric. Treat it like latency.
- Budgets force decisions. Prompts get cleaner.
- Caching and batching are the fastest wins. But measure staleness.
- Routing aligns spend with risk. Put your best model where mistakes are expensive.
If you already have an AI feature in production, start with observability. The rest gets easier once you can see what’s happening.


