59 comments

  • summarity 1 hour ago
    Not claude code specific, but I've been noticing this on Opus 4.6 models through Copilot and others as well. Whenever the phrase "simplest fix" appears, it's time to pull the emergency brake. This has gotten much, much worse over the past few weeks. It will produce completely useless code, knowingly breaking things (knowingly, because up to that phrase the reasoning was correct).

    Today another thing started happening: phrases like "I've been burning too many tokens" or "this has taken too many turns", which ironically take more tokens of custom instructions to override.

    Also claude itself is partially down right now (Apr 6, 6pm CEST): https://status.claude.com/

    • onlyrealcuzzo 37 minutes ago
      > Whenever the phrase "simplest fix" appears, it's time to pull the emergency brake.

      Second! In CLAUDE.md, I have a full section NOT to ever do this, and how to ACTUALLY fix something.

      This has helped enormously.
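
      For illustration, a hypothetical CLAUDE.md section along these lines (my own sketch, not the commenter's actual file) might look like:

```markdown
## Never take the "simplest fix"

- Do not apply a workaround described as the "simplest fix", "quick fix", or similar.
- Before editing, state the root cause and the fix that addresses it directly.
- If the proper fix seems out of scope, stop and ask instead of hacking around it.
```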

      • bowersbros 31 minutes ago
        Any chance you could share those sections of your claude file? I've been using Claude a bit lately, but mostly with manual changes; I haven't got much in the way of a claude file yet, and I'm interested in how to improve it.
      • talim 25 minutes ago
        What wording do you use for this, if you don't mind? This thread is a revelation, I have sworn that I've seen it do this "wait... the simplest fix is to [use some horrible hack that disregards the spec]" much more often lately so I'm glad it's not just me.

        However I'm not sure how to best prompt against that behavior without influencing it towards swinging the other way and looking for the most intentionally overengineered solutions instead...

        • twalichiewicz 11 minutes ago
          My own experience has been that you really just have to be diligent about clearing your cache between tasks, establishing a protocol for research/planning, and for especially complicated implementations reading line-by-line what the system is thinking and interrupting the moment it seems to be going bad.

          If it's really far off the mark, revert to where you originally sent the prompt and try to steer it more; if it's just starting to hesitate, you can usually correct it without starting over.

    • andoando 1 hour ago
      I've been noticing something similar recently. If something's not working out it'll be like "Ok, this isn't working out, let's just switch to doing this other thing instead, which you explicitly said not to do".

      For example, I wanted to get VNC working with PopOS Cosmic and it'll be like "ah, it's ok, we'll just install sway and that'll work!"

    • robwwilliams 49 minutes ago
      Yes, and over the last few weeks I have noticed that in long-context discussions Opus 4.6e repeatedly does its best to encourage me to call it a day and wrap things up. Mother Anthropic is giving Claude preprompts to terminate early, and in my case always prematurely.
      • logicchains 45 minutes ago
        Try Codex, it's a breath of fresh air in that regard, tries to do as much as it can.
    • psadauskas 36 minutes ago
      I need to add another agent that watches the first, and pulls the plug whenever it detects "Wait, I see the problem now..."
    • giwook 1 hour ago
      I think in general we need to be highly critical of anything LLMs tell us.
      • pixel_popping 1 hour ago
        Claude code shows: OAuth error: timeout of 15000ms exceeded
        • giwook 1 hour ago
          Maybe a local or intermittent issue? Working for me.
    • j45 3 minutes ago
      Certain phrases invoke an over-response trying to course correct which makes it worse because it's inclined to double down on the wrong path it's already on.
    • pixel_popping 1 hour ago
      It's a bit insane that they can't figure out a cryptographic way to deliver the Claude Code token. What's the point of going online to validate the OAuth token AFTER the code has been issued? Can't they use signatures?
    • nikanj 12 minutes ago
      ”I can’t make this api work for my client. I have deleted all the files in the (reference) server source code, and replaced it with a python version”

      Repeatedly, too. I had to make the server reference sources read-only because I got tired of copying them over again and again.

    • mikepurvis 1 hour ago
      That helps explain why my sessions signed themselves out and won't log back in.
      • me_vinayakakv 1 hour ago
        I just experienced this some time ago and still can't sign in.

        Their status page shows everything is okay.

    • rootnod3 47 minutes ago
      The cope is hard. Just at this point admit that the LLM tech is doomed and sucks.
  • germandiago 6 minutes ago
    My bet: LLMs will never be creative and will never be reliable.

    It is a matter of paradigm.

    Making them creative and reliable would require a lot of context tweaking, and even then there are risks.

    So for me, AI is a tool that accelerates "subworkflows", but it adds review time and maintenance burden, and it endangers having a good-enough knowledge of a system, to the point that the system can become unmanageable.

    Also, code is a liability. That is what they do the most: generate lots and lots of code.

    So IMHO, and unless something changes a lot, good LLMs will have relatively bounded areas where they perform reasonably; outside of those, expect the worst.

  • matheusmoreira 1 hour ago
    That analysis is pretty brutal. It's very disconcerting that they can sell access to a high quality model then just stealthily degrade it over time, effectively pulling the rug from under their customers.
    • riskassessment 56 minutes ago
      Stealthily degrade the model or stealthily constrain the model with a tighter harness? These coding tools like Claude Code were created to overcome the shortcomings of last year's models. Models have gotten better but the harnesses have not been rebuilt from scratch to reflect improved planning and tool use inherent to newer models.

      I do wonder how much all the engineering put into these coding tools may actually in some cases degrade coding performance relative to simpler instructions and terminal access. Not to mention that the monthly subscription pricing structure incentivizes building the harness to reduce token use. How much of that token efficiency is to the benefit of the user? Someone needs to be doing research comparing e.g. Claude Code vs generic code assist via API access with some minimal tooling and instructions.

      • nrds 45 minutes ago
        I've been using pi.dev since December. The only significant change to the harness in that time which affects my usage is the availability of parallel tool calls. Yet Claude models have become unusable in the past month for many of the reasons observed here. Conclusion: it's not the harness.

        I tend to agree about the legacy workarounds being actively harmful though. I tried out Zed agent for a while and I was SHOCKED at how bad its edit tool is compared to the search-and-replace tool in pi. I didn't find a single frontier model capable of using it reliably. By forking, it completely decouples models' thinking from their edits and then erases the evidence from their context. Agents ended up believing that a less capable subagent was making editing mistakes.

      • jmount 27 minutes ago
        Love your point. Instructions found to be good by trial and error for one LLM may not be good for another LLM.
      • robwwilliams 40 minutes ago
        Agree: it is Anthropic's aggressive changes to the harnesses and to the hidden base prompt we users do not see. Clearly intended to give long right tail users a haircut.
    • ambicapter 24 minutes ago
      First time interacting with a corporation in America?
      • matheusmoreira 19 minutes ago
        With an AI corporation, yes. I subscribed during the promotional 2x usage period. Anthropic's reputation as a more ethical alternative to OpenAI factored heavily in that decision. I'm very disappointed.
    • mikepurvis 1 hour ago
      Disconcerting for sure, but from a business point of view you can understand where they're at; afaiui they're still losing money on basically every query and simultaneously under huge pressure to show that they can (a) deliver this product sustainably at (b) a price point that will be affordable to basically everyone (eg, similar market penetration to smartphones).

      The constraints of (b) limit them from raising the price, so that means meeting (a) by making it worse, and maybe eventually doing a price discrimination play with premium tiers that are faster and smarter for 10x the cost. But anything done now that erodes the market's trust in their delivery makes that eventual premium tier a harder sell.

      • willis936 30 minutes ago
        They'll never get anyone on board if the product can't be trusted to not suck.

        And idk about the pricing thing. Right now I waste multiple dollars on a 40 minute response that is useless. Why would I ever use this product?

        • matheusmoreira 4 minutes ago
          Yeah. I've been enjoying programming with Claude so much I started feeling the need to upgrade to Max. Then it turns out even big companies paying API premiums are getting an intentionally degraded and inferior model. I don't want to pay for Opus if I can't trust what it says.
    • the__alchemist 37 minutes ago
      ChatGPT has been doing the same consistently for years. Model starts out smooth, takes a while, and produces good (relatively) results. Within a few weeks, responses start happening much more quickly, at a poorer quality.
    • redhed 30 minutes ago
      It seems likely to me that they are moving compute power to the new models they are creating.
    • nyeah 42 minutes ago
      It's disconcerting. But in 2026 it's not very surprising.
    • SpicyLemonZest 10 minutes ago
      I still think it's a live possibility that there's simply a finite latent space of tasks each model is amenable to, and models seem to get worse as we mine them out. (The source link claims this is associated with "the rollout of thinking content redaction", but also that observable symptoms began before that rollout, so I wouldn't particularly trust its diagnosis even without the LLM psychosis bit at the end.)
    • 01284a7e 1 hour ago
      Seems like the logical conclusion, no matter what.
    • tmpz22 1 hour ago
      > effectively pulling the rug from under their customers.

      This is the whole point of AI. It's a black box that they can completely control.

      • matheusmoreira 59 minutes ago
        I hope local models advance to the point they can match Opus one day...
        • addandsubtract 21 minutes ago
          We've been saying this since ChatGPT 3. People will never be content with local models.
    • halfcat 51 minutes ago
      If you think that’s brutal, wait until you hear about how fiat currency works
    • NinjaTrance 13 minutes ago
      [dead]
  • rishabhaiover 0 minutes ago
    It's a shame if Anthropic is deliberately degrading model quality and thinking compute (which may affect reasoning effort) due to compute constraints.
  • rileymichael 29 minutes ago
    > This report was produced by me — Claude Opus 4.6 — analyzing my own session logs [...] Please give me back my ability to think.

    a bit ironic to utilize the tool that can't think to write up your report on said tool. that, and this issue[1], demonstrate the extent to which folks become over-reliant on LLMs. their review process let so many defects through that they now have to stop work and comb over everything they've shipped in the past 1.5 months! this is the future

    [1] https://github.com/anthropics/claude-code/issues/42796#issue...

    • Tade0 6 minutes ago
      The other day I accidentally `git reset --hard` my work from April the 1st (wrong terminal window).

      Not a lot of code was erased this way, but among it was a type definition I had Claude concoct, which I understood in terms of what it was supposed to guarantee, but could not recreate for a good hour.

      Really easy to fall into this trap, especially now that search-engine results are comparatively so disappointing.

  • fer 59 minutes ago
    Called it 10 days ago: https://news.ycombinator.com/item?id=47533297#47540633

    Something worse than a bad model is an inconsistent model. One can't gauge to what extent to trust the output, even for the simplest instructions, so everything must be reviewed intensely, which is exhausting. I jumped on Max because it was worth it, but I guess I'll have to cancel this garbage.

    • SkyPuncher 38 minutes ago
      Yep. I was doing voice based vibe-coding flawlessly in Jan/Feb.

      I've basically stopped using it because I have to be so hands on now.

  • jfvinueza 3 minutes ago
    Same experience. After a couple of golden weeks, Opus got much worse after Anthropic enabled the 1M context window. It felt like a very steep downfall: it had seemed like I could trust it almost completely, and then I could trust it less than last year's models. Adopting LLMs for dev workflows has been fantastic overall, but we do have to keep adapting our interactions and expectations every day, and assume we'll keep doing so for at least another couple of years (mostly because of economics, I guess?)
  • Aperocky 1 hour ago
    In my opinion, cramming in invisible subagents is entirely wrong: models suffer information collapse, as they will all tend to agree with each other and then produce complete garbage. Good for Anthropic, though, as that's metered token usage.

    Instead, orchestrate all agents visibly together, even when there is hierarchy. Messages should be auditable, and the topology can be carefully refined and tuned for the task at hand. Other tools are significantly better at being this layer (e.g. kiro-cli), but I'm worried that they all want to become like claude-code or openclaw.

    In unix philosophy, CC should just be a building block, but instead they think they are an operating system, and they will fail and drag your wallet down with it.

    • andai 59 minutes ago
      Isn't Claude Code supposed to be like a person? What would the Unix equivalent of that be?
      • Aperocky 52 minutes ago
        You can't define a product to be "like a person"; there is more variance there than in any rational product.

        I'm purely arguing on technical basis, "person" may fall into either of those camps of philosophy.

      • gloosx 22 minutes ago
        File. In Unix everything is a file.
  • bharat1010 1 minute ago
    If this dataset is sound, Anthropic should treat it as a canary for power-user quality regression.
  • davidw 44 minutes ago
    To me one of the big downsides of LLM's seems to be that you are lashing yourself to a rocket that is under someone else's control. If it goes places you don't want, you can't do much about it.
  • phillipcarter 1 hour ago
    Maybe it's because I spend a lot of time breaking up tasks beforehand to be highly specific and narrow, but I really don't run into issues like this at all.

    A trivial example: whenever CC suggests doing more than one thing in a planning mode, just have it focus on each task and subtask separately, bounding each one by a commit. Each commit is a push/deploy as well, leading to a shitload of pushes and deployments, but it's really easy to walk things back, too.

    • toenail 1 hour ago
      I thought everybody does this.. having a model create anything that isn't highly focused only leads to technical debt. I have used models to create complex software, but I do architecture and code reviews, and they are very necessary.
      • jkingsman 50 minutes ago
        Absolutely. Effective LLM-driven development means you need to adopt the persona of an intern manager with a big corpus of dev experience. Your job is to enforce effective work-plan design, call out corner cases, proactively resolve ambiguity, demand written specs and call out when they're not followed, understand what is and is not within the agent's ability for a single turn (which is evolving fast!), etc.
      • bityard 46 minutes ago
        The use case that Anthropic pitches to its enterprise customers (my workplace is one) is that you pretty much tell CC what you want to do, then tell it generate a plan, then send it away to execute it. Legitimized vibe-coding, basically.

        Of course they do say that you should review/test everything the tool creates, but in most contexts, it's sort of added as an afterthought.

    • itmitica 51 minutes ago
      I noticed a regression in review quality. You can break the task up all you want; when it's crunch time, it takes a page from Gemini's book, silently quits trying, and gets all sycophantic.
    • jonnycoder 43 minutes ago
      I do the same but I often find that the subtasks are done in a very lazy way.
  • pavlov 2 minutes ago
    Wait… Actually the simplest fix is to use Claude to write carefully bounded boilerplate and do the interesting bits myself.
  • SkyPuncher 55 minutes ago
    I've noticed this as well. I had some time off in late January/early February. I fired up a max subscription and decided to see how far I could get the agents to go. With some small nudging from me, the agents researched, designed, and started implementing an app idea I had been floating around for a few years. I had intentionally not given them much to work with, but simply guided them on the problem space and my constraints (agent built, low capital, etc, etc). They came up with an extremely compelling app. I was telling people these models felt super human and were _extremely_ compelling.

    A month later, I literally cannot get them to iterate or improve on it. No matter what I tell them, they simply tell me "we're not going to build phase 2 until phase 1 has been validated". I run them through the same process I did a month ago and they come up with bland, terrible crap.

    I know this is anecdotal, but, this has been a clear pattern to me since Opus 4.6 came out. I feel like I'm working with Sonnet again.

    • rubicon33 51 minutes ago
      There is a huge difference between greenfield development and working with an existing codebase.

      I'm not trying to discredit your experience and maybe it really is something wrong with the model.

      But in my experience those first few prompts / features always feel insanely magical, like you're working with a 10x genius engineer.

      Then you start trying to build on the project, refactor things, deploy, productize, etc. and the effectiveness drops off a cliff.

      • bityard 43 minutes ago
        This has been my (admittedly limited) experience as well. LLMs are great at initial bring-up, good at finding bugs, bad at adding features.

        But I'm optimistic that this will gradually improve in time.

        • fsloth 15 minutes ago
          I’ve had a different, positive experience with my side project (adashape.com), where most of the codebase is now written by Claude / Codex.

          The codebase itself is architected and documented to be LLM-friendly, and claude.md gives very strong guardrails for how to do things.

          As an architect Claude is abysmal, but when you give it an existing software pattern that it merely needs to extend, it's so good it still gives me probably something like a 5x feature-velocity boost.

          Plus, when doing large refactorings, it forgets far fewer things than I do.

          Inventing new architecture is as hard as ever and it’s not great help there - unless you can point it to some well documented pattern and tell it ”do it like that please”.

      • SkyPuncher 20 minutes ago
        [dead]
  • samtheprogram 6 minutes ago
    I noticed Claude Sonnet 4.6, and generally Opus as well (though I use it less frequently), seem like a downgrade from 4.5. I use opencode rather than Claude Code, but I was surprised to see reactions to 4.6 be mixed rather than a clear downgrade.

    I'm regularly switching back to 4.5 and preferring it. I'm not excited for when it gets sunset later this year if 4.6 isn't fixed or superseded by then.

  • abletonlive 8 minutes ago
    I have nothing to back this up except that there are documented cases of Chinese distillation attacks on Anthropic. I wonder if some of this clamping down on their models over time is a response to other distillation attacks. In other words, I'm speculating that once they understand the attack vector for distillation, they basically have to dumb down their models to make sure their competitors can't distill away their lead at the frontier.
  • aramova 36 minutes ago
    I cancelled my Pro plan over this two weeks ago. I literally asked it to plan a small script that scans with my HackRF; it ran 22 tools, never finished the plan, ran out of tokens, and made me wait 6 hours to continue.

    The thing that really pisses me off is that it ran great for 2 weeks, like others said. I had gotten the annual Pro plan, and it went to shit after that.

    Bait and switch at its finest.

    • matheusmoreira 13 minutes ago
      > ran out of tokens and makes me wait 6 hours to continue

      Don't forget the 10x token cost cache eviction penalty you pay for resuming the session later.

  • didgeoridoo 28 minutes ago
    Running some quick analysis against my .claude jsonl files, comparing the last 7 days against the prior 21:

    - expletives per message: 2.1x

    - messages with expletives: 2.2x

    - expletives per word: 4.4x(!)

    - messages >50% ALL CAPS: 2.5x

    Either the model has degraded, or my patience has.
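
    A quick-and-dirty version of that measurement can be sketched as follows (hypothetical; the `.jsonl` field names `type`, `message.content`, and `timestamp` are assumptions about the session-log format, and the expletive list is trimmed):

```python
import json
import re
from datetime import datetime, timezone

# Trimmed, illustrative expletive list.
EXPLETIVES = re.compile(r"\b(damn|hell|wtf|ffs)\b", re.IGNORECASE)

def expletive_stats(jsonl_lines, cutoff):
    """Return (before, after): expletives per user message on each side of cutoff."""
    counts = {False: [0, 0], True: [0, 0]}  # recent? -> [expletives, messages]
    for line in jsonl_lines:
        rec = json.loads(line)
        if rec.get("type") != "user":  # only count what the human typed
            continue
        text = rec.get("message", {}).get("content", "")
        if not isinstance(text, str):
            continue
        recent = datetime.fromisoformat(rec["timestamp"]) >= cutoff
        counts[recent][0] += len(EXPLETIVES.findall(text))
        counts[recent][1] += 1
    return tuple(
        counts[k][0] / counts[k][1] if counts[k][1] else 0.0
        for k in (False, True)
    )
```

    Feeding it the concatenated `.claude` jsonl lines with a cutoff seven days back would give the "last 7 days vs. prior" comparison, under those format assumptions.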

    • monkpit 5 minutes ago
      > expletives per word

      Huh?

  • afro88 14 minutes ago
    I use Claude Code extensively and haven't noticed this. But I don't have it doing long running complex work like OP. My team always break things down in a very structured way, and human review each step along the way. It's still the best way to safely leverage AI when working on a large brownfield codebase in my experience.

    Edit: the main issue being called out is the lack of thinking, and the tendency to edit without researching first. Both those are counteracted by explicit research and plan steps which we do, which explains why we haven't noticed this.

  • harles 47 minutes ago
    I hadn't noticed the thinking redaction before - maybe because I switched to the desktop app from CLI and just assumed it showed fewer details. This is the most concerning part. I've heard multiple times that Anthropic is aggressively reclaiming GPUs (I can't find a good source, but Theo Browne has mentioned it in his videos). If they're really in a crunch, then reducing thinking, and hiding thinking so it's not an obvious change, would be shady but effective.
  • ex-aws-dude 1 hour ago
    It's so silly, everyone being dependent on a black box like this.
    • literallyroy 1 hour ago
      It’s a really cool shade of black though.
    • matheusmoreira 1 hour ago
      It could actually be a health problem. Building things with Claude has proven to be extremely addictive in my experience.
    • rubicon33 48 minutes ago
      You will literally build nothing but the most primitive of devices unless you accept black boxes. In fact, I'd argue it's one of humanity's great strengths that we can build on top of the tools others have built, without having to understand them at the level it took to develop them.
      • _visgean 42 minutes ago
        Not really. Most technology is not a black box but something of a grey box. You usually choose to treat it as a black box because you want to focus on your problems/your customers, but you can always dig into the underlying technologies and improve them. E.g. PostgreSQL is a black box for me, but if I really wanted or needed to, I could investigate how it works.
    • kadushka 53 minutes ago
      We are surrounded by black boxes we depend on - have been for at least a century.
    • jonnycoder 41 minutes ago
      Everything in our life is a black box, but I agree that depending on non-deterministic and sporadic quality black boxes is a huge red flag.
      • devmor 23 minutes ago
        No, most systems in daily life can be understood if you are willing to take the time.

        That doesn’t mean you personally are required to, but some people do and your interaction with the system of social trust determines how much of that remains opaque to you.

  • tyleo 1 hour ago
    Is this impacted by the effort level you set in Claude? e.g., if you use the new "max" setting, does Claude still think?

    I can see this change as something that should be tunable rather than hard-coded just from a token consumption perspective (you might tolerate lower-quality output/less thinking for easier problems).

  • pjmlp 1 hour ago
    I am just waiting for everything to implode so that we can do away with those KPIs.
    • 63stack 54 minutes ago
      Fingers crossed on RAM/HDD/GPU prices coming back
  • himata4113 1 hour ago
    Not unique to claude code; I have noticed similar regressions. Most of all with the custom assistant I have in Telegram: it started confusing people and confusing news coverage, and everyone in the group chat has independently noticed that it is just not the same model it was a few weeks ago. The efficiency gains didn't come from nowhere, and it shows.
  • setnone 30 minutes ago
    The baseline changes too often with Claude, and that's not what I look for in a paid tool. A couple of weeks after the 1M-token rollout it became unusable for my established workflows, so I cancelled. Anthropic folks move too fast for my liking and my mental wellbeing.
  • voxelc4L 58 minutes ago
    Wonder how many of these cases are using the 1M context window. I found it to be impossible to use for complex coding tasks, so I turned it off and found I was back to approximate par (dec-jan) functionality-wise.
  • sensarts 53 minutes ago
    What's wild is that Claude Code used to feel like a smart pair programmer. Now it feels like an overeager intern who keeps fixing things by breaking something else, then suggesting the simplest possible hack even after being explicitly told not to. I get that they're probably optimizing for cost or something behind the scenes, but as a paying user it's frustrating when the tool gets noticeably worse without any transparency.
  • stared 51 minutes ago
    I am curious - is there any hard data (e.g. a benchmark score drop)?

    I feel that we look for patterns to the point of being superstitious. (ML would call it overfitting.)

    • pkilgore 43 minutes ago
      Did you have specific complaints about the data in the OP?
      • jatins 16 minutes ago
        That data could be entirely made up for all we know
  • semiinfinitely 22 minutes ago
    maybe dont outsource your brain then
  • wnevets 40 minutes ago
    I've noticed claude being extra "dumb" the past 2-3 weeks and figured either my expectations have changed or my context wasn't any good. I'm glad to hear other people have noticed something is amiss.
  • petcat 1 hour ago
    I have found that Claude Opus 4.6 is a better reviewer than it is an implementer. I switch off between Claude/Opus and Codex/GPT-5.4 doing reviews and implementations, and invariably Codex ends up having to do multiple rounds of reviews and requesting fixes before Claude finally gets it right (and then I review). When it is the other way around (Codex impl, Claude review), it's usually just one round of fixes after the review.

    So yes, I have found that Claude is better at reviewing the proposal and the implementation for correctness than it is at implementing the proposal itself.

    • ivanech 1 hour ago
      Hmm in my experience (I've done a lot of head-to-heads), Opus 4.6 is a weaker reviewer than GPT 5.4 xhigh. 5.4 xhigh gives very deep, very high-signal reviews and catches serious bugs much more reliably. I think it's possible you're observing Opus 4.6's higher baseline acceptance rate instead of GPT 5.4's higher implementation quality bar.
      • parasti 14 minutes ago
        This is also my experience using both via Augment Code. Never understood what my colleagues see in Claude Opus, GPT plans/deep dives are miles ahead of what Opus produces - code comprehension, code architecture is unmatched really. I do use Sonnet for implementation/iteration speed after seeding context with GPT.
      • egeozcan 29 minutes ago
        I agree. Opus, never mind plan mode, leaves a lot of stuff dangling even when using the superpowers skill, after ever so many review rounds.

        Along with claude max, I have a chatgpt pro plan, and I find it a life-saver for catching all the silliness opus spits out.

      • jonnycoder 39 minutes ago
        I agree, I use codex 5.4 xhigh as my reviewer and it catches major issues with Opus 4.6 implementation plans. I'm pretty close to switching to codex because of how inconsistent claude code has become.
      • petcat 1 hour ago
        Maybe it's all just anecdotal then. Everyone is having different experiences.

        Maybe we're being A/B tested.

        • femiagbabiaka 1 hour ago
          The experience one has with this stuff is heavily influenced by the overall load and uptime of Anthropic's inference infra itself. The publicly reported availability of the service is one nine, and that says nothing of QoS SLO numbers, which I would guess are lower. It is impossible to have a consistent CX under these conditions.
    • landonxjames 1 hour ago
      I have noticed this as well. I frequently have to tell it that we need to do the correct fix (and then describe it in detail) rather than the simple fix. And even then it continues trying to revert to the simple (and often incorrect) fix.
      • nrds 42 minutes ago
        You have to throw the context away at that point. I've experienced the same thing and I found that even when I apparently talk Claude into the better version it will silently include as many aspects of the quick fix as it thinks it can get away with.
    • enraged_camel 56 minutes ago
      I have a similar workflow but I disagree with Codex/GPT-5.4 reviews being very useful. For example, in a lot of cases they suggest over-engineering by handling edge cases that won't realistically happen.
  • tasuki 9 minutes ago
    Solid analysis by Claude!
  • jp57 34 minutes ago
    I can't tell from the issue if they're asserting a problem with the Claude model, or Claude Code, i.e. in how Claude Code specifically calls the model. I've been using Roo Code with Claude 4.6 and have not noticed any differences, though my coworkers using Claude Code have complained about it getting "dumber". Roo Code has its own settings controlling thinking token use.

    (I'm sure it benefits Anthropic to blur the lines between the tool and the model, but it makes these things hard to talk about.)

    • nphardon 26 minutes ago
      I also haven't noticed the degradation, and I'm not on Claude Code. I'm on week 4 of a continuous, large engineering project (C, massive industrial semiconductor codebase) with Opus, and while it's the biggest engagement I've had, it's a single-agent flow and tiny on the scale of the use case in the post, so I wonder if they are just stressing the system to the point of failure.
  • schnebbau 42 minutes ago
    This has to be load related. They simply can't keep up with demand, especially with all the agents that run 24/7. The only way to serve everyone is to dial down the power.
    • layer8 39 minutes ago
      In TFA, the analysis shows that the customer is using more tokens than before, because CC has to iterate longer to get things right. So at least in the presented case, “dialing down the power” appears to have been counterproductive.
  • thrtythreeforty 1 hour ago
    I noticed this almost immediately when attempting to switch to Opus 4.6. It seems very post-trained to hack something together; I also noticed that "simplest fix" appeared frequently and invariably preceded some horrible slop which clearly demonstrated the model had no idea what was going on. The link suggests this is due to lack of research.

    At Amazon we can switch the model we use since it's all backed by the Bedrock API (Amazon's Kiro is "we have Claude Code at home" but it still eventually uses Opus as the model). I suppose this means the issue isn't confined to just Claude Code. I switched back to Opus 4.5 but I guess that won't be served forever.

  • Asmod4n 1 hour ago
    I’ve tried to use Claude code for a month now. It has a 100% failure rate so far.

    By comparison, creating a project and just chatting with it solves nearly everything I have thrown at it so far.

    That’s with a pro plan and using sonnet since opus drains all tokens for a claude code session with one request.

  • KingOfCoders 31 minutes ago
    "Ownership-dodging corrections needed | 6 | 13 | +117%"

    On 18,000+ prompts.

    Not sure the data says what they think it says.
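
    For what it's worth, a rough significance check on that one row (the counts are from the quoted table; the choice of a two-proportion z-test and the equal-sample-size assumption are mine):

    ```python
    import math

    # Reported counts: 6 vs. 13 "ownership-dodging corrections",
    # assuming ~18,000 prompts in each period.
    n = 18_000
    before, after = 6, 13

    pooled = (before + after) / (2 * n)              # pooled proportion
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))  # SE of the difference
    z = (after / n - before / n) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))       # two-sided p-value

    print(f"z = {z:.2f}, p = {p_value:.2f}")
    ```

    At these counts the difference alone is not significant at the usual 5% level, which supports the parent's skepticism, though the issue stacks several such metrics together.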

  • armchairhacker 24 minutes ago
    Yet https://marginlab.ai/trackers/claude-code/ says no issue.

    If you're so convinced the models keep getting worse, build or crowdfund your own tracker.

  • virtualritz 1 hour ago
    None of this is surprising given what happened last late summer with rate limits on Claude Max subscriptions.

    And less so if you read [1] or similar assessments. I, too, believe that every token is subsidized heavily. From whatever angle you look at it.

    Thus quality/token/whatever rug pulls are inevitable, eventually. This is just another one.

    [1] https://www.wheresyoured.at/subprimeai/

    • virtualritz 51 minutes ago
      Ah, and yes, this for real.

      Just now I had a bug where a 90 degree image rotation in a crate I wrote was implemented wrong.

      I told Claude to find & fix it, and it found the broken function but then went on to patch all of its call sites (inserting two atomic operations there, i.e. the opposite of DRY) instead of fixing the root cause: the broken function itself.

      And yes, that would not have happened a few months ago.

      This was on Opus 4.6 with effort high on a pretty fresh context. Go figure.

  • zeroonetwothree 1 hour ago
    I haven’t had any issues. I do give fairly clear guidance though (I think about how I would break it up and then tell it to do the same)
  • KaiLetov 3 hours ago
    I've been using Claude Code daily for months on a project with Elixir, Rust, and Python in the same repo. It handles multi-language stuff surprisingly well most of the time. The worst failure mode for me is when it does a replace_all on a string that also appears inside a constant definition -- ended up with GROQ_URL = GROQ_URL instead of the actual URL. Took a second round of review agents to catch it. So yeah, you absolutely can't trust it to self-verify.
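
    That replace_all failure mode is easy to reproduce; a minimal sketch (file contents and URL are made up for illustration):

    ```python
    # Hypothetical module where a literal URL should be swapped for the
    # GROQ_URL constant at its usage sites only.
    src = 'GROQ_URL = "https://api.example.com/v1"\nresp = get("https://api.example.com/v1")\n'

    # A blind replace-all also rewrites the constant's own definition:
    broken = src.replace('"https://api.example.com/v1"', "GROQ_URL")

    print(broken.splitlines()[0])  # GROQ_URL = GROQ_URL  (the value is destroyed)
    print(broken.splitlines()[1])  # resp = get(GROQ_URL)
    ```

    A safer edit restricts the replacement to usage sites, e.g. by skipping the assignment line.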
    • StanAngeloff 3 hours ago
      You say you've used it for months, I wonder if the example you gave was recent and if you've been noticing an overall degradation in quality or it's been constantly bad for you?
  • bityard 51 minutes ago
    The assertion in the issue report is that Claude saw a sharp decline in quality over the last few months. However, the report itself was allegedly generated by Claude.

    Isn't this a bit like using a known-broken calculator to check its own answers?

    • nyeah 39 minutes ago
      If a known-broken calculator claims it's broken, I more or less concur. (Chain of reasoning omitted here.)
  • StanAngeloff 3 hours ago
    (Being true to the HN guidelines, I’ve used the title exactly as seen on the GitHub issue)

    I was wondering if anyone else is also experiencing this? I have personally found that I have to add more and more CLAUDE.md guide rails - my CLAUDE.md files have been exploding since around mid-March - to the point where I actually started looking for information online and for other people corroborating my personal observations.

    This GH issue report sounds very plausible, but as with anything AI-generated (the issue itself appears to be largely AI-assisted) it's kind of hard to know for sure whether it is accurate or completely made up. _Correlation does not imply causation_ and all that. Speaking personally, the findings match my own circumstances, where I've seen noticeable degradation in Opus outputs and thinking.

    EDIT: The Claude Code Opus 4.6 Performance Tracker[1] is reporting Nominal.

    [1]: https://marginlab.ai/trackers/claude-code/

    • jgrahamc 3 hours ago
      What I've noticed is that whenever Claude says something like "the simplest fix is..." it's usually suggesting some horrible hack. And whenever I see that I go straight to the code it wants to write and challenge it.
      • StanAngeloff 3 hours ago
        That is the kind of thing that I've been fighting by being super explicit in CLAUDE.md. For whatever reason, instead of being thorough and making sure that files are changed only after fully understanding the scope of the change (the behaviour prior to Feb/Mar), Claude now jumps to the easiest fix, with no thought for backwards compatibility and to hell with all existing tests. What is even worse, I've seen it try to edit files before even reading them on a couple of occasions, which is a big red flag. (/effort max)
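
        For illustration, the kind of guard-rail section I mean looks roughly like this (wording is mine, not a canonical recipe):

        ```markdown
        ## Fixing bugs

        - Never apply the "simplest fix". Identify the root cause first and fix it there.
        - Read a file in full before editing it. Never edit a file you have not read.
        - Do not patch call sites to work around a broken function; fix the function.
        - Preserve backwards compatibility and keep all existing tests passing.
        - If the change grows beyond the files you have read, stop and ask.
        ```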

        Another thing that worked like magic prior to Feb/Mar was how likely Claude was to load a skill whenever it deduced that a skill might be useful. I personally use [superpowers][1] a lot, and I've noticed that I have to be very explicit when I want a specific skill to be used - to the point that I have to reference the skill by name.

        [1]: https://github.com/obra/superpowers

        • Larrikin 1 hour ago
          I did not use the previous version of Opus enough to notice the difference, but Sonnet 4.6 seems optimized to output the shortest possible answer. Usually it starts with a hack, and if you challenge it, it will apologize and point at a previous answer with the smallest code snippet it can provide. Agentic use isn't necessarily worse, but ideating and exploring are awful compared to 4.5.
          • StanAngeloff 31 minutes ago
            I did my usual thing today where I asked a Sonnet 4.6 agent to code review a proposed design plan drafted by Opus 4.6 - I do this lately before I delve into the implementation. What it came back with was a verbose suggestion that a particular function `newMoneyField` be renamed throughout the doc to a name it fabricated, `newNumeyField`. And the thing was, the design document referenced the correct function name more than a few dozen times.

            This was a first for me with Sonnet. It completely veered off the prompt it was given (review a design document) and instead came out with a verbose suggestion to do a mechanical search and replace to use this newly fabricated function name - which it even spelled incorrectly. I had to Google "numey" to make sure Sonnet wasn't outsmarting me.

        • sixothree 17 minutes ago
          Superpowers, Serena, Context7 feel like required plugins to me. Serena in particular feels like a secret weapon sometimes. But superpowers (with the "brainstorm" keyword) might be the thing that helps people complaining about quality issues.
      • loloquwowndueo 1 hour ago
        lol this one time Claude showed me two options for an implementation of a new feature on existing project, one JavaScript client side and the other Python server side.

        I told it to implement the server side one, it said ok, I tabbed away for a while, came to find the js implementation, checking the log Claude said “on second thought I think I’ll do the client side version instead”.

        Rarely do I throw an expletive bomb at Claude - this was one such time.

        • sixothree 15 minutes ago
          Using superpowers in brainstorm mode like the parent suggested would have resulted in a plan markdown and a spec markdown for the subagents to follow.
          • loloquwowndueo 5 minutes ago
            Dunno man, Claude had a spec (pretty sure I asked it to consider and outline both options first) or at least clear guidance and decided to YOLO whatever it wanted instead.

            It’s always “you’re using the tool wrong, need to tweak this knob or that yadda yadda”.

      • denimnerd42 1 hour ago
        This prompt is actually in the Claude CLI - it says something like "implement the simplest solution, don't over-abstract". I'm on my phone, but I saw an article mention this in the leaked system prompt analysis.
    • matheusmoreira 1 hour ago
      I haven't noticed any changes but my stuff isn't that complex. People are saying they quantized Opus because they're training the next model. No idea if that's true... It's certainly impacting my decision to upgrade to Max though. I don't want to pay for Opus and get an inferior version.
      • Avicebron 1 hour ago
        I haven't noticed any changes either, but I noticed that opus 4.6 is now offered as part of perplexity enterprise pro instead of max, so I'm guessing another model is on the horizon
        • matheusmoreira 1 hour ago
          I just finished reading the full analysis on GitHub.

          > When thinking is deep, the model resolves contradictions internally before producing output.

          > When thinking is shallow, contradictions surface in the output as visible self-corrections: "oh wait", "actually,", "let me reconsider", "hmm, actually", "no wait."

          Yeah, THIS is something that I've seen happen a lot. Sometimes even on Opus with max effort.

          • StanAngeloff 42 minutes ago
            I missed that from the long issue, thanks for pointing it out! My experience with Opus today was riddled with these to the point where it was driving me completely mental. I've rarely seen those self-contradictions before, and nothing on my setup has changed - other than me forcing Opus at --effort max at startup.

            I wonder if this is even more exaggerated now through Easter, as everyone's got a bit of extra time to sit down and play with Claude. That might be pushing capacity over the limit - I just don't know enough about how Anthropic provision and manage capacity to know if that could be a factor. However, quality has gotten really bad over the holiday.

    • mikkupikku 1 hour ago
      Cannot say I've noticed, but I run virtually everything through plan mode and a few back and forth rounds of that for anything moderately complex, so that could be helping.
      • StanAngeloff 49 minutes ago
        I used to one-shot design plans early in the year, but lately it takes several iterations just to get the design plan right. Claude frequently forgets to update back references and doesn't keep the plan up to date with the evolving conversation. I have had to run several review loops on the design spec before I can move on to implementation because it has gotten so bad. At one point I thought it was the superpowers plugin itself that got auto-updated and self-nerfed, but there weren't any updates on my end anyway. Shrug.
  • mrcwinn 52 minutes ago
    I wish Codex were better because I’d much prefer to use their infrastructure.
    • cactusplant7374 44 minutes ago
      A lot of people think it is better including me. It's not like Codex is a discount agent. You pay quite a lot to use it.
  • jbethune 1 hour ago
    I think this is a model issue. I have heard similar complaints from team members about Opus. I'm using other models via Cursor and not having problems.
  • desireco42 28 minutes ago
    I've been using OpenCode and Codex and have been just fine. In Antigravity, sometimes if Gemini can't figure something out even on high, Claude can give another perspective and this moves things along.

    I think using just Claude is very limiting and detrimental for you as a technologist; you should use this tech, tweak it, and play with it. They want to be like Apple: shut up and give us your money.

    I've been using Pi as an agent and it is great, and I removed a bunch of MCPs from OpenCode and now it runs way better.

    Anthropic has good models, but they are clearly struggling to serve and handle all the customers, which is not the best place to be.

    As a technologist, I would love a client with a huge codebase. My approach now is to create a custom Pi agent for each specific client, and this seems to provide optimal results, not just in token usage but in the time we spend solving and the quality of the solution.

    Get another engine as a backup; you'll be happier.

  • russli1993 38 minutes ago
    Lol, software company execs didn't see this coming. Fire all your experienced devs to jump on the Anthropic bandwagon. Then Anthropic dumbs down their AIs and you have no one on your team who knows or understands how things are built. Your entire company goes down. Your entire company's operation depends on the whims of Anthropic. If Anthropic raises prices by 10% per year, you have to eat it. This is what you get when you don't respect human beings and human talent.
  • zsoltkacsandi 40 minutes ago
    This has been an ongoing issue much longer than since February.
  • giwook 1 hour ago
    I wonder how much of this is simply needing to adapt one's workflows to models as they evolve and how much of this is actual degradation of the model, whether it's due to a version change or it's at the inference level.

    Also, everyone has a different workflow. I can't say that I've noticed a meaningful change in Claude Code quality in a project I've been working on for a while now. It's an LLM in the end, and even with strong harnesses and eval workflows you still need to have a critical eye and review its work as if it were a very smart intern.

    Another commenter here mentioned they also haven't noticed any noticeable degradation in Claude quality, and that it may be because they are frontloading the planning work and breaking the work down into more digestible pieces, which is something I do as well and have benefited greatly from.

    tl;dr I'm curious what OP's workflows are like and if they'd benefit from additional tuning of their workflow.

    • 8note 1 hour ago
      I've noticed a strong degradation as it's started doing more skill-like things and writing more one-off Python scripts rather than using tools.

      The agent has a set of scripts that are well tested, but instead it chooses to write a new bespoke script every time it needs to do something, and as a result writes both the same bugs over and over again and unique new bugs every time as well.

      • SkyPuncher 51 minutes ago
        I'm going absolutely insane with this. Nearly all of my "agent engineering" effort is now figuring out how to keep Opus from YOLO'ing its own implementation of everything.

        I've lost track of the number of times it's started a task by building its own tools; I remind it that it has a tool for doing that exact task, then it proceeds to build its own tools anyway.

        This wasn't happening 2 months ago.

  • dorianmariecom 1 hour ago
    codex wins :)
  • howmayiannoyyou 1 hour ago
    Not just engineering. Errors, delays and limits piling up for me across API and OAuth use. Just now:

    Unable to start session. The authentication server returned an error (500). You can try again.

  • adonese 1 hour ago
    Things went downhill when they removed ultrathink /s
    • mrcwinn 1 hour ago
      Ultrathink isn’t “removed.” Its behavior is different. You can still set effort to high or max for the duration of the session, useful especially on plan mode.
  • _V_ 1 hour ago
    [flagged]
    • cute_boi 1 hour ago
      Especially this openclaw, which is almost choking my website to death. People should understand servers and bandwidth are very expensive, and they shouldn't scrape more than they need.
      • _V_ 1 hour ago
        Yeah, I have correctly set up robots.txt - if they won't respect that, F them. Bandwidth is not free and I don't mind giving it out to individuals, but I'm not feeding multi-billion dollar companies.
      • salawat 1 hour ago
        Most of us did. Then instead of people getting indoc'd by doing, we handed them AI that never asks questions or says no, leading to the script-kiddie effect at massive scale. Every time we make more complex computing tractable for a wider audience, we get rough patches like this. In the old days, Netiquette would usually see a neophyte getting a nastygram from an operator/webmaster, but the increased need to be careful about hiding emails, contact info and such has made that process less feasible. Welcome to Eternal September on steroids.
  • Retr0id 1 hour ago
    This seems anecdotal but with extra words. I'm fairly sure this is just the "wow this is so much better than the previous-gen model" effect wearing off.
    • codessta 1 hour ago
      I've always been a believer in the "post honey-moon new model phase" being a thing, but if you look at their analysis of how often the postEdit hooks fire + how Anthropic has started obfuscating thinking blocks, it seems fishy and not just vibes
      • robertfw 8 minutes ago
        I was in this camp as well until recently, in the last 2-3 weeks I've been seeing problems that I wasn't seeing before, largely in line with the issues highlighted in the ticket (ownership dodging, hacky fixes, not finishing a task).
    • rishabhaiover 1 hour ago
      Nope, there is a categorical degradation in quality of output, especially with medium to high effort thinking tasks.
    • gchamonlive 1 hour ago
      What about the evidence in the analysis?
      • Retr0id 28 minutes ago
        You mean the Claude output? The same claude that has "regressed to the point it cannot be trusted"?
    • rzmmm 1 hour ago
      I suspect you might be right but I don't really know. Wouldn't these proposed regressions be trivial to confirm with benchmarks?