Shipping your first AI feature — what I got wrong

It was a Tuesday. The conference room had that particular Bandung afternoon light — orange through the blinds, everyone a little slow from lunch. Five of us around the table: the CEO, the CPO, two product managers, and me. I had a laptop open and a Loom recording queued up as backup, just in case the live demo stuttered.

It didn’t stutter. It was beautiful.

I’d built a feature that read a user’s support ticket — free text, whatever language they typed — and drafted a first response. The model picked up context, matched tone, suggested resolution steps. In the demo, it handled every case I threw at it with the kind of calm competence that makes a room go quiet.

The CEO leaned back. “Ship it.”

I smiled. I said I’d have it to staging by Thursday.

And I said nothing about the thing I’d been sitting with all week: that I had tested this on exactly 100 handpicked tickets. That I had no idea what it would do with the other 40,000.

What broke

The feature went live on a Friday. By Monday morning we had twelve escalations.

Not from the cases I’d tested. From a specific, embarrassing class of input I hadn’t thought to include: tickets where users mistyped the product name.

Our product was called “Solvr.” Users wrote “Solver,” “Solvir,” “Slvr,” “SOLVR,” “solvr.ai.” My system prompt said: “You are a support agent for Solvr. Respond only to issues related to Solvr products.” When the name didn’t match exactly, the model (GPT-4-turbo at the time, mid-2024) would occasionally decide the ticket was off-topic and respond with a polite out-of-scope message. To a user who’d just paid for a subscription and typed one wrong letter.

One ticket read: “Hi, I can’t log in to Solvir, been trying for 20 minutes.” The model responded: “It looks like your question may be about a different product. For Solvr support, please visit our help centre.” The user’s reply was three words. None of them printable.

I had tested 100 cases. Every test case used the exact product name. I had accidentally evaluated my own blind spot.
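Here is a minimal sketch of the kind of test-set expansion that would have exposed this blind spot. It is not code we ran at the time. The helper names (name_variants, expand_ticket) and the choice of typo types are my own assumptions for illustration. It takes a ticket that uses the exact product name and produces copies with plausible misspellings: case changes, a dropped letter, a doubled letter, swapped neighbours, and a stray vowel.

```python
PRODUCT = "Solvr"
VOWELS = "aeiou"


def name_variants(name: str) -> list[str]:
    """Return simple misspellings of `name` (illustrative, not exhaustive)."""
    variants = {name.lower(), name.upper()}
    for i in range(len(name)):
        variants.add(name[:i] + name[i + 1:])        # drop one letter
        variants.add(name[:i] + name[i] + name[i:])  # double one letter
    for i in range(len(name) - 1):
        # swap two neighbouring letters
        variants.add(name[:i] + name[i + 1] + name[i] + name[i + 2:])
    for i in range(len(name) + 1):
        for v in VOWELS:
            variants.add(name[:i] + v + name[i:])    # insert a stray vowel
    variants.discard(name)  # keep only spellings that differ from the original
    return sorted(variants)


def expand_ticket(ticket: str, name: str = PRODUCT) -> list[str]:
    """Produce one copy of the ticket per misspelled product name."""
    return [ticket.replace(name, variant) for variant in name_variants(name)]
```

Running every handpicked ticket through expand_ticket multiplies the test set with the inputs I never thought to write by hand. It still misses forms like “solvr.ai,” so it complements real production samples and does not replace them.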

The failure mode wasn’t catastrophic in a data-loss, security-breach sense. It was worse, in a way. It was quietly, consistently condescending. It made the product feel broken to the users who needed it most — the ones already frustrated, the ones least likely to type carefully.

That’s the thing about probabilistic systems. They don’t fail uniformly. They fail at the edges. And the edges are exactly where your users are when they’re most vulnerable.

The pivot

We rolled back the automated response. I spent a week rebuilding the feature with a human-review step in the loop.

The new architecture: the model drafted a response, but it went to a support agent’s queue instead of directly to the user. The agent saw the draft, edited if needed, and sent it. If the draft’s confidence score (produced by an eval step I’d added; more on that in a moment) fell below a threshold, the ticket bypassed the draft entirely and went straight to the queue flagged for manual handling.

This was slower. The original pitch was “reduce first-response time by 60%.” With the review step, we got to roughly 35%. The CEO asked why. I explained. He understood. But I had to say out loud, in a meeting, that I’d shipped without adequate testing and the number I’d promised wasn’t achievable at the quality bar we needed.

That conversation was uncomfortable in a specific way I want to name: I had confused demo velocity with production reliability. In the demo, I was the human in the loop. I was choosing which tickets to show, and I was reading the output before it went anywhere. I removed myself from the loop when I shipped. I forgot that I’d been doing that.

The eval step I added wasn’t sophisticated. I used Simon Willison’s approach — a second model call that scored the draft response on three criteria: relevance to the ticket, absence of hallucinated product details, and appropriate tone. Anything below a composite threshold of 0.75 went to the queue. It added 400–600ms of latency. It caught roughly 8% of drafts in the first month. Every single one of those was a draft I would not have wanted to send.
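The gate itself can be sketched in a few lines. This is an illustration under assumptions, not our production code. The three scores would come from the second model call, which is not shown. The names EvalScores, composite, and route are hypothetical. The unweighted mean is my assumption for how three scores might become one composite; only the 0.75 cutoff comes from what we actually ran.

```python
from dataclasses import dataclass
from typing import Optional

THRESHOLD = 0.75  # composite cutoff described above


@dataclass
class EvalScores:
    relevance: float     # does the draft address the ticket?
    groundedness: float  # 1.0 means no invented product details
    tone: float          # is the tone appropriate?


def composite(scores: EvalScores) -> float:
    # Assumption: an unweighted mean of the three criteria.
    return (scores.relevance + scores.groundedness + scores.tone) / 3


def route(draft: str, scores: EvalScores) -> dict:
    """Decide which queue a ticket lands in and whether the draft goes with it."""
    score = composite(scores)
    kept: Optional[str] = draft if score >= THRESHOLD else None
    queue = "review" if kept is not None else "manual"
    return {"queue": queue, "draft": kept, "score": score}
```

A draft scoring 0.9 on all three criteria goes to the agent’s review queue with the draft attached. One that scores 0.2 on groundedness drops the composite below the cutoff, so the ticket goes to manual handling with no draft at all.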

Demo path: Input → Model → OK.
Production path: Input → Eval → Model → Confidence check → Fallback → Logging → Output.

The demo path is linear. The production path is the demo path plus everything you didn’t build yet.

Four things I’d tell the earlier version of me

On demo velocity versus production reliability. The demo is not a proof of concept. The demo is a best-case scenario you constructed with selection bias. Your test set will always be easier than the real world, because you built it before you understood the real world. The question to ask before you ship isn’t “does it work?” — it’s “what does it do when it doesn’t work?” I had no answer to that question. I should have refused to ship until I did.

On “mostly works” being the dangerous answer in probabilistic systems. Traditional code usually either handles an input or throws an exception, so you learn about the failure. Probabilistic systems do something worse: they handle the input confidently and incorrectly, and they keep going. Andrej Karpathy called this out when he wrote about Software 2.0: the failure modes of learned systems are qualitatively different from the failure modes of written code. “Mostly works” in a deterministic system means 95% of users have a good experience and the other 5% hit an error you can see. “Mostly works” in a probabilistic system means 5% of users get a confident wrong answer, bad enough that they escalate, and nothing in the system flags it as a failure. The 5% are not randomly distributed. They cluster around the edges, the unusual inputs, the frustrated users. That’s where your product reputation actually lives.

On evaluation as a product surface, not a backend chore. I treated evals as something you do before you ship to convince yourself things work. That’s backwards. Evals are the ongoing heartbeat of an AI feature. Anthropic’s guidance on evaluating AI systems makes this point clearly: evaluation isn’t a pre-deployment gate, it’s a continuous discipline. Your eval suite should grow every time you find a new failure mode. Mine started at 100 handpicked cases and is now at 2,300, most of them contributed by real production failures. The feature is measurably better for it. I should have started this discipline before I shipped, not after.
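A growing eval suite does not need special tooling to get started. The sketch below is one way it could look, and it rests on assumptions. It stores each production failure as a JSON line with the ticket text and a phrase the draft must not contain. The function names, the file layout, and the “must not contain” check are all hypothetical simplifications. draft_fn stands in for whatever function calls the model.

```python
import json
from pathlib import Path
from typing import Callable


def add_case(suite: Path, ticket: str, must_not_contain: str) -> None:
    """Append one production failure to the suite as a JSON line."""
    case = {"ticket": ticket, "must_not_contain": must_not_contain}
    with suite.open("a", encoding="utf-8") as f:
        f.write(json.dumps(case) + "\n")


def run_suite(suite: Path, draft_fn: Callable[[str], str]) -> list[dict]:
    """Return every case whose draft still contains the forbidden phrase."""
    failures = []
    with suite.open(encoding="utf-8") as f:
        for line in f:
            case = json.loads(line)
            draft = draft_fn(case["ticket"])
            # Case-insensitive substring check: crude, but cheap to run often.
            if case["must_not_contain"].lower() in draft.lower():
                failures.append(case)
    return failures
```

The “Solvir” ticket would be the first entry, with “different product” as the forbidden phrase. Every new escalation adds one line, and the whole file reruns on every prompt or model change.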

On user-facing humility. The feature should have been able to say “I’m not sure.” It should have shown users when it was operating with low confidence, given them a clear path to a human, made the fallback obvious. Instead I shipped a feature that was equally confident whether it was correct or completely wrong. That’s not a minor UX issue — it’s a trust problem. When a user can’t tell when to trust the AI and when not to, they learn to distrust it entirely. Build uncertainty into the surface. Let the feature be honest about its limits. Users handle honest uncertainty far better than confident wrongness.

The unflashy lesson

Everything I got wrong, I got wrong because I treated this as a new kind of engineering problem that required new thinking. It wasn’t.

Shipping an AI feature is shipping a product. The same discipline applies: understand your failure modes before you’re in them; test at the edges, not just the center; put a human in the loop until you trust the system; build observability from day one. None of that is AI-specific advice. It’s just engineering.

The genuinely new thing is the failure mode shape. Probabilistic failure is quieter, more distributed, and harder to catch in review than an exception or a null pointer. That requires different testing strategies — more coverage, adversarial cases, eval suites that evolve continuously. But the underlying rigor is unchanged. I just forgot to apply it because the demo was so good and the CEO said “ship it” and I wanted to be the person who shipped it.

If I could go back to that conference room — Tuesday afternoon, orange light, everyone still a little slow from lunch — I wouldn’t kill the demo. The demo was genuinely good. I would wait for the room to settle, and then I would say:

“This works beautifully on the cases I tested. I need two more weeks to understand what it does on the cases I haven’t.”

The CEO would have said yes. They always say yes when you sound like you know what you’re doing. And knowing what you’re doing means knowing what you don’t know yet.

That’s the version of me I’m trying to ship.