Fikri Firman Fadilah
GenAI

Shipping GenAI Features Without Breaking Production

Concrete strategies for deploying LLM-powered features safely: evaluation frameworks, fallback patterns, cost controls, and monitoring approaches that actually work.

Building with LLMs feels different from traditional software. Your code is deterministic; the model outputs aren't. Your latency is predictable; API calls aren't. Your costs scale linearly with usage; token consumption can surprise you.

After shipping several GenAI features to production, I've learned that responsibility isn't about perfect accuracy—it's about understanding failure modes and building systems that degrade gracefully.

The Evaluation Problem

You can't deploy what you can't measure. But evaluating LLM outputs is harder than evaluating traditional code.

What actually works:

  • Task-specific metrics, not general ones. If you're generating summaries, BLEU scores are noise. Build a rubric: Is the summary factually accurate? Does it capture key decisions? A human reviewer scoring 50 examples beats 10,000 examples scored by another LLM.

  • Baseline comparisons. Compare against the simplest possible solution—templates, regex, rule-based fallbacks. Your fancy prompt engineering should outperform "just return the user's input formatted."

  • Failure categorization. Don't track "accuracy" as a single number. Track: hallucinations, truncations, refusals, latency timeouts, rate limits. Each requires different fixes.

  • Real traffic sampling. Your test set won't match production. Set up logging to sample 1-5% of real requests. Score them weekly. This catches distribution shifts you never anticipated.
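
Traffic sampling and failure categorization can be sketched in a few lines. This is a minimal illustration, not a full evaluation framework: the function names, the 2% default rate, and the category labels are assumptions chosen to mirror the list above.

```python
import hashlib
from collections import Counter
from typing import Optional

# Category names mirror the failure types listed above (labels are illustrative).
FAILURE_CATEGORIES = ("hallucination", "truncation", "refusal", "timeout", "rate_limit")

def should_sample(request_id: str, rate: float = 0.02) -> bool:
    """Deterministically pick roughly `rate` of requests for weekly review."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate

def failure_breakdown(labels: list[Optional[str]]) -> dict[str, float]:
    """Share of reviewed requests per failure category.

    `labels` has one entry per reviewed request: a category name,
    or None when the reviewer found no failure.
    """
    if not labels:
        return {category: 0.0 for category in FAILURE_CATEGORIES}
    counts = Counter(label for label in labels if label is not None)
    unknown = set(counts) - set(FAILURE_CATEGORIES)
    if unknown:
        raise ValueError(f"unknown failure categories: {sorted(unknown)}")
    return {category: counts[category] / len(labels) for category in FAILURE_CATEGORIES}
```

Hashing the request ID instead of calling a random number generator keeps the sample reproducible: the same request is always in or out, which makes re-scoring a past week possible.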

Fallback Patterns That Matter

The most reliable GenAI systems aren't the ones with the best prompts. They're the ones with the best fallbacks.

Pattern 1: Graceful degradation

Try: Use LLM for intelligent response
Catch timeout/error: Use template-based response
Catch rate limit: Return cached result or queue for async processing
Catch validation failure: Return user's input + gentle error message
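
In Python, that chain might look roughly like the sketch below. Everything here is hypothetical scaffolding: `RateLimited` and `ValidationFailed` stand in for whatever errors your client and validator raise, and `call_llm`, `validate`, `cache`, and `enqueue` are passed in so the example stays self-contained.

```python
class RateLimited(Exception):
    """Hypothetical error: the model API rejected the call due to rate limits."""

class ValidationFailed(Exception):
    """Hypothetical error: the model output did not pass validation."""

def template_response(user_input: str) -> str:
    # Simplest possible non-LLM answer; replace with a real template.
    return f"Thanks for your message. Here is what we received: {user_input}"

def respond(user_input, call_llm, validate, cache, enqueue):
    """Try the LLM first, then degrade step by step."""
    try:
        output = call_llm(user_input)
        validate(output)  # expected to raise ValidationFailed on bad output
        return {"source": "llm", "text": output}
    except RateLimited:
        cached = cache.get(user_input)
        if cached is not None:
            return {"source": "cache", "text": cached}
        enqueue(user_input)  # process later, outside the request path
        return {"source": "queued", "text": "We're working on this and will follow up shortly."}
    except ValidationFailed:
        return {"source": "echo", "text": f"{user_input}\n\n(We couldn't generate a response this time.)"}
    except Exception:
        # Timeouts and any other error fall back to the template.
        return {"source": "template", "text": template_response(user_input)}
```

Returning the `source` alongside the text is a design choice: it lets you log which branch served each request, which is where the fallback rate comes from.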

Pattern 2: Staged rollouts
Don't enable GenAI for all users on day one. Use feature flags:

  • 1% of users for 24 hours
  • 10% for 2 days
  • 50% for 1 week
  • 100%

Monitor error rates, latency, and cost at each stage. A 50% increase in processing cost that seems fine at 1% becomes a budget problem at 100%.
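
If you don't already have a feature-flag service, percentage rollouts can be approximated with a stable hash. A minimal sketch, assuming string user IDs; `in_rollout` and `projected_cost_at_full_rollout` are illustrative names, and the cost projection assumes spend scales linearly with the share of users.

```python
import hashlib

ROLLOUT_STAGES = [1, 10, 50, 100]  # percent of users, matching the schedule above

def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """True if this user falls inside the first `percent` of 100 stable buckets."""
    key = f"{feature}:{user_id}".encode("utf-8")
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % 100
    return bucket < percent

def projected_cost_at_full_rollout(observed_cost: float, percent: int) -> float:
    """Linear extrapolation of cost observed at `percent` rollout to 100%."""
    if percent <= 0:
        raise ValueError("percent must be positive")
    return observed_cost * 100 / percent
```

Because each user's bucket is fixed, anyone enabled at 1% stays enabled at 10% and 50%, so users don't flip in and out of the feature between stages. Including the feature name in the hash keeps the same users from being first in line for every rollout.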

Pattern 3: Circuit breaker for costs
Set hard limits on token spend per hour, per user, per feature. When you hit the limit, switch to fallback immediately. Don't wait for your bill.

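python
def respond(tokens_spent_this_hour, hourly_budget, call_llm, fallback_response):
    # Over the hourly token budget: skip the model call and serve the fallback.
    if tokens_spent_this_hour > hourly_budget:
        return fallback_response()
    return call_llm()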
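
The check above needs a counter behind it. Here is one way to track spend per hour and per key (a user, a feature, or both). It's a sketch under simplifying assumptions: a fixed one-hour window, in-process state only, and no thread safety. `TokenBudget` is a made-up name; a multi-instance service would keep these counters in shared storage instead.

```python
import time

class TokenBudget:
    """Fixed-window hourly token counter, keyed by user or feature."""

    def __init__(self, limit_per_hour: int, clock=time.time):
        self.limit_per_hour = limit_per_hour
        self._clock = clock   # injectable so tests can control time
        self._spent = {}      # key -> (hour_window, tokens_spent)

    def _current(self, key: str):
        window = int(self._clock() // 3600)
        stored_window, spent = self._spent.get(key, (window, 0))
        # Spend from an earlier hour no longer counts.
        return window, (spent if stored_window == window else 0)

    def allow(self, key: str) -> bool:
        """False once spend in the current hour exceeds the limit."""
        _, spent = self._current(key)
        return spent <= self.limit_per_hour

    def record(self, key: str, tokens: int) -> None:
        """Add tokens used by a completed call to the current hour's total."""
        window, spent = self._current(key)
        self._spent[key] = (window, spent + tokens)
```

Usage is two calls around the model request: check `budget.allow("user:42")` before calling, and `budget.record("user:42", input_tokens + output_tokens)` after. Since spend is recorded after the call, a single large request can overshoot the limit once before the breaker opens.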

Monitoring and Observability

You need visibility into three layers:

1. Model layer

  • Token counts (input + output, separately)
  • Latency percentiles (p50, p95, p99)
  • Error rates by error type
  • Cache hit rates (if using caching)

2. Application layer

  • Fallback rates (how often did the fallback trigger?)
  • User-facing latency
  • Quality metrics from your evaluation framework
  • Cost per request

3. Business layer

  • Feature adoption (% of users using GenAI features)
  • User satisfaction signals (if available)
  • Cost vs. benefit

Concrete setup:
Log structured data at every decision point:

json
{
  "feature": "email_summarization",
  "timestamp": "2024-01-15T14:23:45Z",
  "model": "gpt-4-turbo",
  "input_tokens": 450,
  "output_tokens": 120,
  "latency_ms": 1240,
  "fallback_triggered": false,
  "quality_score": 0.87
}

Query this data daily. Watch for trends, not just spikes.
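
A daily query doesn't need a dashboard to start with. The sketch below rolls up log lines shaped like the JSON record above into a handful of numbers. The per-1,000-token prices are parameters because they depend on your provider and model; `summarize` and `percentile` are illustrative names, not part of any library.

```python
import json
import math

def percentile(values, pct):
    """Nearest-rank percentile; `values` must be non-empty."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def summarize(log_lines, input_price_per_1k, output_price_per_1k):
    """Aggregate one day's JSON log lines into latency, fallback, and cost metrics."""
    records = [json.loads(line) for line in log_lines if line.strip()]
    if not records:
        return {}
    latencies = [r["latency_ms"] for r in records]
    total_cost = sum(
        r["input_tokens"] / 1000 * input_price_per_1k
        + r["output_tokens"] / 1000 * output_price_per_1k
        for r in records
    )
    return {
        "requests": len(records),
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "fallback_rate": sum(r["fallback_triggered"] for r in records) / len(records),
        "cost_per_request": total_cost / len(records),
    }
```

Storing one summary per day and per feature gives you the series you need to see trends rather than isolated spikes.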

Cost Is a Feature, Not a Bug

LLM costs scale with usage. You can't optimize your way out of fundamental economics.

Things that actually reduce costs:

  • Caching identical requests. If 30% of your requests repeat a question you've already answered, cache the response. That saves up to 30% of calls.

  • Shorter prompts. Every token costs money. Remove unnecessary context. "Summarize in 3 sentences" vs. "Summarize in 3 sentences or fewer if you can be concise but comprehensive." The latter costs more.

  • Batch processing. If you don't need real-time responses, queue requests and send them through a batch endpoint. Some providers discount batch jobs, which lowers the cost per token.

  • Model selection. Smaller or cheaper models (Claude 3 Haiku instead of a larger Claude model, GPT-4 Turbo instead of GPT-4) are usually faster and cheaper. Test whether they meet your quality bar before defaulting to the largest model.

  • User-facing limits. Tell users what they're getting. "Summarize up to 10 documents per day" is cheaper than unlimited.
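
Caching identical requests is the easiest of these to try. A minimal in-memory sketch, assuming responses for the same model and prompt are interchangeable; `ResponseCache` and the `call_llm(model, prompt)` signature are hypothetical, and a real deployment would add expiry and a size limit.

```python
import hashlib

class ResponseCache:
    """In-memory cache keyed by model name and whitespace-normalized prompt."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        normalized = " ".join(prompt.split())  # collapse whitespace only
        return hashlib.sha256(f"{model}\n{normalized}".encode("utf-8")).hexdigest()

    def get_or_call(self, model: str, prompt: str, call_llm):
        """Return the cached response, or call the model once and store the result."""
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = call_llm(model, prompt)
        self._store[key] = response
        return response

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

The `hit_rate()` value is the cache hit rate worth tracking at the model layer. Only exact repeats hit this cache; paraphrased questions still reach the model.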

Handling Hallucinations

Hallucinations are real. Your system needs to detect and handle them.

Detection strategies:

  • Ask the model to cite sources. "Provide the answer and quote the relevant passage." If it can't quote, it probably hallucinated. Check that the quote actually appears in the source, because a model can invent quotes too.

  • Fact-check against known data. If generating a product recommendation, verify the product exists before returning it.

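  • Confidence scoring. Some APIs expose token log probabilities, and a model can be prompted to rate its own confidence. Treat either as a rough signal for escalating to human review, not as proof the answer is right.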

  • User feedback loops. "Was this helpful?" feedback, especially negative feedback, is a hallucination detector.

When you detect a hallucination, don't return it. Return the fallback. The user won't notice the system was wrong—they'll just get a less-fancy response.
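
The quote-checking strategy is simple to automate. A rough sketch, assuming the model was asked to return an answer plus a supporting quote from a source document you hold; the function names and the five-word minimum are illustrative choices.

```python
def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and strip surrounding quote marks."""
    return " ".join(text.lower().split()).strip("\"'“”‘’")

def quote_is_grounded(quote: str, source: str, min_words: int = 5) -> bool:
    """True if the quote appears verbatim (after normalization) in the source."""
    normalized_quote = normalize(quote)
    if len(normalized_quote.split()) < min_words:
        return False  # too short to count as evidence
    return normalized_quote in normalize(source)

def answer_or_fallback(answer: str, quote: str, source: str, fallback: str) -> str:
    """Return the model's answer only when its supporting quote checks out."""
    return answer if quote_is_grounded(quote, source) else fallback
```

This catches invented quotes, not every hallucination: a real quote can still sit next to a wrong answer, and a lightly paraphrased quote will be rejected. Treat it as one signal alongside fact-checking and user feedback.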

The Honest Constraints

GenAI in production isn't magic. It's:

  • Slower than you want. LLM latency is measured in seconds, not milliseconds. If you need sub-100ms responses, this isn't the tool.

  • More expensive than alternatives. It's worth it for some problems (complex reasoning, nuanced text generation). It's not worth it for simple classification or templating.

  • Less reliable than traditional code. You can't guarantee specific outputs. You can only make failures graceful.

  • Harder to debug. When an LLM gives a bad response, "why?" is harder to answer. Invest in logging and sampling.

Putting It Together

A responsible production GenAI system looks like:

  1. Clear evaluation metrics that match your actual use case
  2. Fallback mechanisms for every failure mode
  3. Staged rollouts with cost and quality gates
  4. Continuous monitoring of model, application, and business metrics
  5. Cost controls that prevent runaway spending
  6. Hallucination detection that keeps bad outputs off your platform

None of this is novel. It's the same engineering discipline we've applied to databases, caches, and APIs for years. GenAI doesn't change the fundamentals—it just changes what you're monitoring and how you're thinking about failure.

Ship it responsibly.

Tags: genai, llm, production engineering, ai operations, evaluation