A post-mortem, a trust gap, and the quiet cost of shipping slop.
For roughly two months, developers screamed into the void that Claude Code — Anthropic’s flagship agentic coding tool — had gotten noticeably worse. Threads multiplied on GitHub. Senior engineers at major chipmakers published forensic analyses. X timelines filled with the word “nerfed.” Anthropic employees, for weeks, pushed back: no, the model is fine, this is a UI change, here’s how to type /effort high.
On April 23, 2026, Anthropic finally published a post-mortem confirming what users had been saying all along. Claude Code had gotten worse. Not because of a conspiracy, not because of compute-throttling, but because Anthropic had quietly shipped three separate changes — a default downgrade, a caching bug, and a verbosity-limiting system prompt — each of which chipped away at the product’s intelligence. Stacked together, they produced exactly the erratic, forgetful, lazy coding assistant developers had been complaining about since early March.
The confirmation didn’t just vindicate frustrated customers. It exposed a deeper problem that no amount of usage-limit resets will quietly paper over: the trust gap between Anthropic and the developers who built their workflows on Claude.

The Three Bugs, in Anthropic’s Own Words
Anthropic’s engineering team laid it out cleanly in the April 23 post-mortem. Three independent changes, affecting three slightly different slices of users on three different dates, combined to produce what looked like broad, inconsistent degradation — but was actually three distinct regressions overlapping in time.
1. The default effort downgrade (March 4). Anthropic changed Claude Code’s default reasoning effort from high to medium. The stated reason was latency: high mode could occasionally think so long that the UI appeared frozen, and some users were burning through usage limits faster than expected. Anthropic’s internal evals suggested medium gave “slightly lower intelligence with significantly less latency for the majority of tasks.” In practice, the “slightly lower intelligence” was a lot more visible than expected. Users noticed. They complained. Anthropic reverted the change on April 7.
In the post-mortem, the team now calls this “the wrong tradeoff.”
2. The thinking-cache bug (March 26). Anthropic shipped what was supposed to be an efficiency optimization: if a Claude Code session had been idle for more than an hour, the system would clear old reasoning blocks once on resume, to avoid sending expensive cache-miss tokens back to the API. Sensible enough.
The bug was subtle and nasty. Instead of clearing thinking history once, the broken flag caused it to clear on every turn for the rest of the session. Once a session crossed the idle threshold, Claude progressively lost the record of the edits and tool calls it had already made, and why it had made them. It kept executing, but without context. That surfaced as exactly the symptoms users were describing: forgetfulness, repetition, weird tool choices, and — because each request became a cache miss — faster-than-expected usage burn.
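The failure mode is a classic one: a one-shot flag that is set but never reset. A minimal sketch of the pattern, using hypothetical names (the real Claude Code internals are not public):

```python
# Sketch of the "clear once on resume" logic and the described bug.
# Class and field names are invented for illustration.

class Session:
    IDLE_THRESHOLD_S = 3600  # clear old thinking after an hour idle

    def __init__(self):
        self.thinking_blocks = []
        self.last_active = 0.0
        self.needs_clear = False

    def resume(self, now):
        # Intended behavior: mark stale sessions for a one-time cleanup.
        if now - self.last_active > self.IDLE_THRESHOLD_S:
            self.needs_clear = True
        self.last_active = now

    def next_turn_buggy(self):
        # BUG: the flag is never reset, so history is wiped on *every*
        # subsequent turn, not just the first one after resuming.
        if self.needs_clear:
            self.thinking_blocks.clear()
        self.thinking_blocks.append("reasoning for this turn")

    def next_turn_fixed(self):
        # FIX: clear once, then reset the flag.
        if self.needs_clear:
            self.thinking_blocks.clear()
            self.needs_clear = False
        self.thinking_blocks.append("reasoning for this turn")
```

In the buggy path, the session's visible reasoning history never grows past the current turn after a resume, which matches the "kept executing, but without context" symptom. It also explains why dogfooding missed it: short, continuously active sessions never cross the idle threshold.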
Anthropic admits the bug “made it past multiple human and automated code reviews, as well as unit tests, end-to-end tests, automated verification, and dogfooding.” Fixed April 10.
3. The verbosity prompt (April 16). When Opus 4.7 launched, it was notably chatty — smarter on hard problems, but very wordy. Anthropic added a line to the Claude Code system prompt: “Length limits: keep text between tool calls to ≤25 words. Keep final responses to ≤100 words unless the task requires more detail.”
After launch, broader ablation testing showed this single instruction caused a roughly 3% drop on coding evaluations for both Opus 4.6 and 4.7. Reverted April 20.
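The ablation itself is conceptually simple: run the same eval suite with and without the candidate prompt line and compare scores. A toy sketch (run_eval and the numbers here are stand-ins, not Anthropic's harness):

```python
# Prompt-ablation check: measure the score delta a single system-prompt
# line causes on a fixed eval set. run_eval is a stand-in for whatever
# evaluation harness you actually use.

def ablate(run_eval, base_prompt, candidate_line, tasks):
    score_with = run_eval(base_prompt + "\n" + candidate_line, tasks)
    score_without = run_eval(base_prompt, tasks)
    return score_with - score_without  # negative => the line hurts

# Toy harness that mimics the reported effect of the length-limit line:
def toy_eval(prompt, tasks):
    return 0.70 if "Length limits" in prompt else 0.73

delta = ablate(toy_eval, "You are a coding agent.", "Length limits: ...", range(10))
print(f"{delta:+.2%}")  # prints -3.00%
```

The lesson from April 16 is less about the harness than about when it runs: this check happened after launch, not before.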
So: one bad default, one bug that shredded the model’s working memory, and one prompt line that measurably cut coding performance. All three shipped past Anthropic’s internal review processes. All three landed in production. All three hit paying users.

Clients Were Right. They Had Receipts.
The most uncomfortable part of the post-mortem isn’t the bugs. Bugs happen. The uncomfortable part is the timeline of denials that preceded the admission.
On April 2, Stella Laurenzo — Senior Director in AMD’s AI group — filed a meticulously documented GitHub issue titled “Claude Code is unusable for complex engineering tasks with the Feb updates.” This was not a vibe check. It was a forensic analysis of 6,852 Claude Code sessions, 234,760 tool calls, and 17,871 thinking blocks pulled from her team’s production logs, as The Register reported.
The data was brutal:
- Files read before editing dropped from 6.6 to 2.0 — a model that no longer reads the code before changing it.
- Stop-hook violations tracking laziness (ownership dodging, premature termination, permission-seeking) went from zero before March 8 to ~10 per day by end of month.
- Median visible thinking length collapsed from ~2,200 to ~600 characters, as detailed in this breakdown by Scortier on Substack.
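Metrics like the first one are straightforward to compute if you keep structured session logs. A hedged sketch of a read-before-edit measurement, using an invented log schema (this is in the spirit of the analysis, not Laurenzo's actual tooling):

```python
# Compute "files read before the first edit" per session from structured
# tool-call logs. The schema ({"tool_calls": [{"tool": ...}]}) is invented.

from statistics import mean

def reads_before_first_edit(session):
    """Count Read tool calls that precede the first Edit/Write call."""
    reads = 0
    for call in session["tool_calls"]:
        if call["tool"] in ("Edit", "Write"):
            return reads
        if call["tool"] == "Read":
            reads += 1
    return reads

sessions = [
    {"tool_calls": [{"tool": "Read"}, {"tool": "Read"}, {"tool": "Edit"}]},
    {"tool_calls": [{"tool": "Edit"}]},  # edited without reading first
]
print(mean(reads_before_first_edit(s) for s in sessions))  # prints 1.0
```

Tracked over months, a drop in this number from 6.6 to 2.0 is exactly the kind of signal that turns "the model feels lazier" into evidence.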
Laurenzo’s conclusion was unambiguous: her team had switched to a competing provider, and Anthropic was at risk of losing its coding crown.
Anthropic’s Claude Code lead, Boris Cherny, replied on the thread — thanking her for the analysis, but disputing the main conclusion. He said the redact-thinking-2026-02-12 header was a UI-only change that did “not impact thinking itself,” pointed to the February shift to adaptive thinking and the March 3 move to medium effort as the real cause of the perceived change, and reminded users they could type /effort high to restore deeper reasoning.
Elsewhere on X, Cherny was more direct, responding to accusations that Claude Code had been secretly nerfed with a blunt “This is false.” Colleague Thariq Shihipar publicly denied that Anthropic degrades models to manage demand.
Then, on April 23, Anthropic conceded that users had been seeing a real regression. Not the one they originally pointed at — the February redacted-thinking header really was UI-only, as Cherny had said — but a different one, or rather three different ones, that Anthropic hadn’t yet found when the public denials were going out.
Cherny acknowledged the severity himself. In his thread on X, he wrote:
“We take these reports incredibly seriously. In my time on the team, this has probably been the most complex investigation we’ve had. The root causes were not obvious, and there were many confounders.”
He also flagged that the team is “separately… hearing reports of issues with Opus 4.7 in Claude Code,” and that improvements would roll out “over the coming days.” And the company is resetting usage limits for all subscribers as a goodwill gesture.
The investigation was genuinely hard. Three orthogonal changes, overlapping schedules, internal dogfooding that didn’t reproduce the bugs, an unrelated experiment that actively masked the caching bug in CLI sessions, and user feedback that at first looked like normal variance. Nobody reading the post-mortem charitably should conclude Anthropic was lying. But charity and caution cut both ways. The same post-mortem also reveals how much confidence Anthropic placed in “we don’t see it in our evals” as a rebuttal — while thousands of paying developers were generating telemetry showing exactly the opposite.
The customers weren’t making it up. They weren’t being hysterical. They had receipts, and those receipts were, as it turned out, correct.
Shipped Slop
There’s a phrase that keeps surfacing in AI developer communities: shipped slop. It captures something specific — not a catastrophic failure, but the steady accumulation of small, plausibly reasonable product decisions that, together, make a good product meaningfully worse.
The April 23 post-mortem is a case study in shipped slop.
Look at each change on its own. Dropping the default to medium to protect users from UI freezes? Defensible. Clearing old thinking from stale sessions to cut latency on resume? Sensible. Telling a verbose model to be less verbose? A reasonable instinct, especially when Opus 4.7 was publicly acknowledged to be chatty.
But every one of those changes traded intelligence for something else — latency, cost, tokens, UX polish — and the internal review processes treated each trade as a local decision rather than a compound one. The caching bug passed every existing test. The system prompt passed every existing eval. The effort downgrade was justified by internal data that showed only “slightly” worse intelligence.
And then customers experienced the actual compound product. Which was broken.
The post-mortem is implicitly candid about this. Anthropic writes that going forward:
“For any change that could trade off against intelligence, we’ll add soak periods, a broader eval suite, and gradual rollouts so we catch issues earlier.”
Translated: the existing review process wasn’t calibrated to notice intelligence regressions. It optimized for what it could measure. The things customers noticed — memory, reasoning persistence, care with code — weren’t in the eval set.
That’s shipped slop. Not malice. Not incompetence. Just optimization pressure in the wrong direction, compounded across independent teams, delivered to customers who pay for the version that was advertised.
The Loss of Trust Problem
The hardest damage from this episode is not technical. It’s relational. And it was arguably forecast weeks before Anthropic published the post-mortem.
VentureBeat’s April 13 report got to the core of it:
“A model can feel worse to users even if the company believes it has not ‘nerfed’ the underlying weights.”
That line was written before the post-mortem existed. It correctly anticipated the shape of the resolution: Anthropic would acknowledge product changes but deny a model downgrade, users would continue to experience a worse product, and the gap between “what we changed” and “what you experienced” would keep widening.
Now layer on the context:
- Anthropic adjusted session limits during peak hours in late March, affecting ~7% of users.
- An OpenAI internal memo, reported by CNBC, alleged Anthropic was compute-constrained — “operating on a meaningfully smaller curve” than competitors.
- The Register fed Claude its own GitHub repo and the model agreed quality complaints had tripled month-over-month.
- A brief outage on April 13 added fuel.
- Fortune reported on the growing user backlash and transparency accusations.
Against that backdrop, Anthropic’s repeated public message — “we don’t degrade models for demand, this is a UI change, here’s a CLI flag” — was technically accurate about the things it addressed, but it wasn’t addressing the actual regressions. Three of them were sitting in production the entire time those denials were going out.
That is the trust-gap anatomy. When a vendor’s confident public posture outruns its own internal visibility, users eventually discover the gap — and then every future reassurance gets read through that lens.
Some of the sharpest responses aren’t even technical. In the comments on Scortier’s Substack breakdown, one reader asked simply: “At what point can we sue?” That’s not a serious legal question. It’s a temperature reading. When the paying customer is left running their own telemetry, debugging the vendor’s product at the binary level with Ghidra and a MITM proxy, and still being told their measurements don’t match the company’s evals, the relationship has already broken somewhere structural.
What the Post-Mortem Gets Right
It would be unfair to read the post-mortem only as a confession. It’s also, to Anthropic’s credit, one of the more honest engineering write-ups the big AI labs have produced. Worth noting:
- The bugs are explained concretely, with dates, versions (e.g. v2.1.101, v2.1.116), affected models, and the exact system prompt text at fault.
- Anthropic explicitly owns “this isn’t the experience users should expect from Claude Code” and resets usage limits as remediation.
- The “going forward” section commits to specific mitigations: more staff using the public build, tighter prompt-change controls, model-specific gating in CLAUDE.md, broader per-model eval suites, soak periods, gradual rollouts, and shipping their improved Code Review tool to customers.
- Interestingly, the post-mortem notes that Opus 4.7 caught the caching bug in back-testing while Opus 4.6 didn’t — a subtle but meaningful admission that the newer model would have been useful in finding the problem earlier.
- The team created @ClaudeDev on X as a dedicated channel for explaining product decisions, committing to deeper context than a typical status update.
Compared with the alternative — a quiet patch and no explanation — this is a grown-up response. The question is whether grown-up responses become the norm, or whether the next compound regression will also take eight weeks and a viral GitHub thread from an AMD director to surface.
Practical Takeaways for Developers
If you’re building on Claude Code (or any agentic coding tool), this episode has some durable lessons, regardless of how much goodwill you extend to Anthropic going forward:
- Keep your own telemetry. Laurenzo’s analysis worked because she had months of structured session logs to compare against. Vendors cannot fix what they cannot see, and they will not always see what you see. Track tool-call counts, read-before-edit ratios, retry rates, and session duration over time.
- Assume defaults will drift. Effort levels, verbosity settings, cache TTLs, thinking visibility — these are product knobs that vendors will tune, often without prominent disclosure. Build your workflow around explicit settings, not inherited defaults.
- Have a fallback. Laurenzo’s team migrated to a competing provider mid-regression. That kind of optionality isn’t cheap to maintain, but depending entirely on one vendor’s silent decisions is a structural risk.
- Weigh the post-mortem, not just the outage. Good engineering teams ship bugs. The signal worth tracking is what they do afterward — and how long it takes them to believe their own users.
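The first two takeaways can be made concrete with very little code. Here is a minimal local-telemetry sketch: append per-session metrics to a JSONL file so future behavior can be compared against your own baseline. The field names and the idea of calling this from a session-end hook are assumptions, not a Claude Code API:

```python
# Minimal local telemetry: one JSON line per session, appended to a log
# you control. Schema and file name are illustrative choices.

import json
import time
from pathlib import Path

LOG = Path("agent_metrics.jsonl")

def record_session(tool_calls):
    """Summarize one session's tool calls and append to the metrics log."""
    edits = sum(1 for c in tool_calls if c["tool"] in ("Edit", "Write"))
    reads = sum(1 for c in tool_calls if c["tool"] == "Read")
    entry = {
        "ts": time.time(),
        "tool_calls": len(tool_calls),
        # reads per edit: the metric that exposed the March regression
        "read_edit_ratio": reads / edits if edits else None,
    }
    with LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")
```

It is not sophisticated, and it does not need to be. The point is that when the vendor says "our evals look fine," you have a time series of your own to compare against.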
The Bottom Line
Anthropic’s April 23 post-mortem confirms the thing that mattered most: Claude Code got dumber, Claude (the underlying model) did not. Shipped slop — three small, well-intentioned changes that compounded into a meaningfully worse product — reached paying users, persisted for weeks, and was initially denied publicly before being acknowledged internally.
Boris Cherny’s own framing — “probably the most complex investigation we’ve had” — is honest. So is the commitment to gradual rollouts, per-model evals, and shipping an improved Code Review tool. Resetting usage limits is a reasonable goodwill move.
But the customers were right. They had the data. They published the data. And for weeks, the official response was that the data didn’t mean what they thought it meant.
The model weights may be intact. The trust account is overdrawn. Rebuilding it will take more than one good post-mortem — it will take a demonstrated pattern of catching the next regression before the AMD director has to.