> Then a brick hits you in the face when it dawns on you that all of our tools are dumping crazy amounts of non-relevant context into stdout thereby polluting your context windows.
I've found that letting the agent write its own optimized script for dealing with some things can really help with this. Claude is now forbidden from using `gradlew` directly, and can only use a helper script we made. It clears, recompiles, publishes locally, tests, ... all with a few extra flags. And when a test fails, the stack trace is printed.
Before this, Claude had to do A TON of different calls, all messing up the context. And when tests failed, it started to read gradle's generated HTML/XML files, which damaged the context immensely, since they contain a bunch of inline javascript.
And I've also been implementing this "LLM=true"-like behaviour in most of my applications. When an LLM is using it, logging is less verbose, it's also deduplicated so it doesn't show the same line a hundred times, ...
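For anyone wondering what that deduplication could look like, here is a minimal sketch as a shell filter. The function name `compact_log` and the exact `LLM=true` check are my own assumptions, not the commenter's actual code: when the variable is set, consecutive identical lines are collapsed into one line with a count, otherwise the input passes through untouched.

```shell
# Hypothetical filter (name and LLM=true convention are assumptions).
# With LLM=true, runs of identical consecutive lines are collapsed into a
# single line plus a repeat count; otherwise input is passed through as-is.
compact_log() {
  if [ "${LLM:-false}" = "true" ]; then
    awk '
      function emit() {
        if (count > 1) printf "%s (repeated %d times)\n", prev, count
        else print prev
      }
      NR > 1 && $0 == prev { count++; next }   # same line again: just count it
      NR > 1 { emit() }                        # line changed: flush previous run
      { prev = $0; count = 1 }                 # start a new run
      END { if (NR > 0) emit() }               # flush the final run
    '
  else
    cat
  fi
}
```

It would be used as the last stage of a pipe, e.g. `some-noisy-command 2>&1 | compact_log`.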
> He sees something goes wrong, but now he cut off the stacktraces by using tail, so he tries again using a bigger tail. Not satisfied with what he sees HE TRIES AGAIN with a bigger tail, and … you see the problem. It’s like a dog chasing its own tail.
I've had the same issue. Claude was running the 5+ minute test suite MULTIPLE TIMES in succession, just with a different `| grep something` tacked at the end.
Now, the scripts I made always log the entire (simplified) output, and just print the path to the temporary file. This works so much better.
The way I've solved this issue with a long running build script is to have a logging script which redirects all output into a file and can be included with
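A rough sketch of that "log everything, print only the path" pattern, in case it helps. The function name `run_logged` and the log file prefix are made up for illustration; the real scripts aren't shown in this thread.

```shell
# Hypothetical wrapper (run_logged is an assumed name, not the poster's script).
# Runs the given command with stdout and stderr captured in a temp file, then
# prints only the exit status and the path, so an agent can grep or tail the
# file as often as it likes without re-running the command.
run_logged() {
  log_file=$(mktemp "${TMPDIR:-/tmp}/build-log.XXXXXX") || return 1
  "$@" >"$log_file" 2>&1
  rc=$?
  echo "exit status: $rc"
  echo "full output: $log_file"
  return "$rc"
}
```

Usage would be something like `run_logged ./gradlew test`, followed by `grep` on the printed path instead of a second test run.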
```
# Redirect all output to a log file (re-execs script with redirection)
source "$(dirname "$0")/common/logging.sh"
```
at the start of a script.
Then when the script runs the output is put into a file, and the LLM can search that. Works like a charm.
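The contents of `common/logging.sh` aren't shown above, so the following is only a guess at the idea, and it deliberately uses a simpler mechanism: instead of re-executing the script, it redirects the current shell's output in place with `exec`. The function name `start_logging` and the `LOG_FILE` variable are assumptions.

```shell
# Hypothetical stand-in for a sourced logging helper (not the original file).
# Unlike the re-exec approach mentioned above, this redirects the *current*
# shell's stdout and stderr to a log file using `exec` with only redirections.
start_logging() {
  LOG_FILE=$(mktemp "${TMPDIR:-/tmp}/script-log.XXXXXX") || return 1
  # Printed on the original stdout so the caller learns where to look.
  echo "logging to $LOG_FILE"
  # From here on, everything this shell and its children write goes to the file.
  exec >"$LOG_FILE" 2>&1
}
```

A build script would call `start_logging` near the top; the only line that reaches the terminal (or the agent) is the path.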
This has been my exact experience with agents using gradle and it’s beyond frustrating to watch. I’ve been meaning to set up my own low-noise wrapper script.
This post just inspired me to tackle this once and for all today.
> Then a brick hits you in the face when it dawns on you that all of our tools are dumping crazy amounts of non-relevant context into stdout thereby polluting your context windows.
Not just context windows. Lots of that crap is completely useless for humans too. It's not a rare occurrence for warnings to be hidden in so much irrelevant output that they're there for years before someone notices.
We’ve got a long way to go in optimising our environments for these models. Our perception of a terminal is much closer to feeding a video into Gemini than reading a textbook of logs. But we don’t make that AX (agent experience) affordance at the moment.
I wrote a small game for my dev team to experience what it’s like interacting through these painful interfaces over the summer www.youareanagent.app
Jump to the agentic coding level or the mcp level to experience true frustration (call it empathy). I also wrote up a lot more thinking here www.robkopel.me/field-notes/ax-agent-experience/
So frequently beginners in linux command lines complain about the irregularity or redundancy in command line tool conventions (sometimes actual command parameters -h --help or /h ? other times: man vs info; etc...)
When the first transformers that did more than poetry or rough translation appeared everybody noticed their flaws, but I observed that a dumb enough (or smart enough to be dangerous?) LLM could be useful in regularizing parameter conventions. I would ask an LLM how to do this or that, and it would "helpfully" generate non-functional command invocations that otherwise appeared very 'conformant' to the point that sometimes my opinion was that -even though the invocation was wrong given the current calling convention for a specific tool- it would actually improve the tool if it accepted that human-machine ABI or calling convention.
Now let us take the example of man vs info, I am not proposing to let AI decide we should all settle on man; nor do I propose to let AI decide we should all use info instead, but with AI we could have the documentation made whole in the missing half, and then it's up to the user if they prefer man or info to fetch the documentation of that tool.
Similarly for calling conventions, we could ask LLMs to assemble parameter styles and analyze command calling conventions / parameters and then find one or more canonical ways to communicate this, perhaps consulting an environment variable to figure out what calling convention the user declares to use.
> Indeed hallucinated cases are "better law." Drawing on Ronald Dworkin's theory of law as integrity, which posits that ideal legal decisions must "fit" existing precedents while advancing principled justice, this article argues that these hallucinations represent emergent normative ideals. AI models, trained on vast corpora of real case law, synthesize patterns to produce rulings that optimally align with underlying legal principles, filling gaps in the doctrinal landscape. Rather than errors, they embody the "cases that should exist," reflecting a Hercules-like judge's holistic interpretation.
Ngl.. I can see the merit and simultaneously recoil in horror as I am starting to understand what linux greybeards hate about windofication of linux ( and now proposed llm-ification of it :D).
Something related to this article, but not related to AI:
As someone who loves coding pet projects but is not a software engineer by profession, I find the paradigm of maintaining all these config files and environment variables exhausting, and there seem to be more and more of them for any non-trivial projects.
Not only do I find it hard to remember which is which or to locate any specific setting, their mechanisms often feel mysterious too: I often have to manually test them to see if they actually work or how exactly. This is not the case for actual code, where I can understand the logic just by reading it, since it has a clearer flow.
And I just can’t make myself blindly copy other people's config/env files without knowing what each switch is doing. This makes building projects, and especially copying or imitating other people's projects, a frustrating experience.
How do you deal with this better, my fellow professionals?
First of all, I read the documentation for the tools I'm trying to configure.
I know this is very 20th century, but it helps a lot to understand how everything fits together and to remember what each tool does in a complex stack.
Documentation is not always perfect or complete, but it makes it much easier to find parameters in config files and know which ones to tweak.
And when the documentation falls short, the old adage applies: "Use the source, Luke."
> As someone who loves coding pet projects but is not a software engineer by profession, I find the paradigm of maintaining all these config files and environment variables exhausting
Then don’t.
> How do you deal with this better, my fellow professionals?
By not doing it.
Look, it’s your project. Why are you frustrating yourself? What you do is you set up your environment, your configuration, what you need/understand/prefer and that’s it. You’ll find out what those are as you go along. If you need, document each line as you add it. Don’t complicate it.
Software folks love over-engineering things. If you look at the web coding craze of a few years ago, people started piling up tooling on top of tooling (frameworks, build pipelines, linting, generators etc.) for something that could also be zero-config, and just a handful of files for simple projects.
I guess this happens when you're too deep in a topic and forget that eventually the overhead of maintaining the tooling outweighs the benefits. It's a curse of our profession. We build and automate things, so we naturally want to build and automate tooling for doing the things we do.
I don’t think those web tooling piles are over-engineered per se, they address huge challenges at Google and Facebook, but the profession is way too driven by hype and fashion and the result is a lot of cargo culting of stuff from Big Dogs unquestioningly. Wrong tooling for the job creates that bubble of over complicated app development.
Inventing GraphQL and React and making your own PHP compiler are absolutely insane and obviously wrong decisions - for everyone who isn’t Facebook. With Facebook's revenue and Facebook's army of resume-obsessed PHP monkeys they strike me as elegant technological solutions to otherwise intractable organizational issues. Insane, but highly profitable and fast moving. Outside of that context using React should be addressing clear pain points, not a dogmatic default.
We’re seeing some active pushback on it now online, but so much damage has been done. Embracing progressive complexity of web apps/sites should leave the majority as barebones with minimal if any JavaScript.
Facebook solutions for Facebook problems. Most of us can be deeply happy our 99 problems don’t include theirs, and live a simpler easier life.
You start with the cleanest most minimal config you can get away with, but over the years you keep adding small additions and tweaks until it becomes a massive behemoth that only you will ever understand the reasoning behind.
Part of doing it well is adding comments as you add options. When I used vim, every line or block in the config had an accompanying comment explaining what it did, except if the config’s name was so obvious that a comment would just repeat it.
That's a good call. It's a big problem for JSON configs given pure JSON's strict no-comments policy. I like tools that let you use .js or better yet .ts files for config.
Don't fall for the "JS ecosystem" trap and use sane tools. If a floobergloob requires you to add a floobergloob.config.js to your project root that's a very good indicator floobergloob is not worth your time.
The only boilerplate files you need in a JS repo root are gitignore, package.json, package-lock.json and optionally tsconfig if you're using TS.
A node.js project shouldn't require a build step, and most websites can get away with a single build.js that calls your bundler (esbuild) and copies some static files to dist/.
Also an acceptable solution - create a "runner" subagent on a cheap model, that's tasked with running a command and relaying the important parts to the main agent.
Rather than an LLM=true, this is better handled with standardizing quiet/verbose settings, as this is a question of verbosity, where an LLM is one instance where you usually want it to be quieter, but not always.
Secondly, a helper to capture output and cache it, and frankly a tool or just options to the regular shell/bash tools to cache output and allow filtered retrieval of the cached output, as more so than context and tokens the frustration I have with the patterns shown is that often the agent will re-execute time-consuming tasks to retrieve a different set of lines from the output.
A lot of the time it might even be best to run the tool with verbose output, but it'd be nice if tools had a more uniform way of giving output that was easier to systematically filter to essentials on first run (while caching the rest).
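To make the "cache the output, then retrieve filtered slices" idea concrete, here is a small sketch. All names (`CACHE_DIR`, `cache_path`, `cached_run`, `cached_grep`) are invented for illustration, and there is deliberately no cache invalidation: you would have to delete the log yourself after changing the code.

```shell
# Hypothetical output cache (all names here are assumptions for illustration).
CACHE_DIR=${CACHE_DIR:-${TMPDIR:-/tmp}/cmd-output-cache}

# cache_path CMD [ARGS...]: map a command line to a log file name via cksum.
cache_path() {
  key=$(printf '%s\n' "$*" | cksum | cut -d' ' -f1)
  printf '%s/%s.log\n' "$CACHE_DIR" "$key"
}

# cached_run CMD [ARGS...]: run the command only if no cached log exists yet,
# then print the path of the log (which ends with an "exit status" line).
cached_run() {
  log=$(cache_path "$@")
  if [ ! -f "$log" ]; then
    mkdir -p "$CACHE_DIR" || return 1
    "$@" >"$log" 2>&1
    echo "exit status: $?" >>"$log"
  fi
  echo "$log"
}

# cached_grep PATTERN CMD [ARGS...]: filter the cached output of a command,
# running the command first only when nothing has been cached for it.
cached_grep() {
  pattern=$1
  shift
  grep -e "$pattern" "$(cached_run "$@")"
}
```

With this, `cached_grep ERROR ./gradlew test` followed by `cached_grep FAILED ./gradlew test` would run the slow command once and answer the second query from the file.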
Yes! After seeing a lot of discussions like this, I came up with a rule of thumb:
Any special accommodations you make for LLMs are either a) also good for humans, or b) more trouble than they're worth.
It would be nice for both LLMs and humans to have a tool that hides verbose tool output, but still lets you go back and inspect it if there's a problem. Although in practice as a human I just minimise the terminal and ignore the spam until it finishes. Maybe LLMs just need their own equivalent of that, rather than always being hooked up directly to the stdout firehose.
Given that most of the utility of Typescript is to make VSCode play nice for its human operator, _should_ we be using Typescript for systems that are written by machines?
I think about what I do in these verbose situations; I learn to ignore most of the output and only take forward the important piece. That may be a success message or error. I've removed most of the output from my context window / memory.
I see some good research being done on how to allow LLMs to manage their own context. Most importantly, to remove things from their context but still allow subsequent search/retrieval.
Huh. I've noticed CC running build or test steps piped into greps, to cull useless chatter. It did this all by itself, without my explicit instructions.
Also, I just restart when the context window starts filling up. Small focused changes work better anyway IMO than single god-prompts that try to do everything but eventually exceed context and capability...
LLMs (Claude Code in particular) will explicitly create token intensive steps, plans and responses - "just to be sure" - "need to check" - "verify no leftovers", will do git diff even tho not asked for, create python scripts for simple tasks, etc.
Absolutely no cache (except the memory which is meh) nor indexing whatsoever.
The Pro plan for 20 bucks per month is essentially worthless because of this, and we are entering a new era - the era of a $100+ monthly single subscription being something normal and natural.
>LLMs (Claude Code in particular) will explicitly create token intensive steps, plans and responses - "just to be sure" - "need to check" - "verify no leftovers", will do git diff even tho not asked for, create python scripts for simple tasks, etc. Absolutely no cache (except the memory which is meh) nor indexing whatsoever.
Most of these things can be avoided with a customized CLAUDE.md.
Not in my projects it seems. Perhaps you can share your best practices?
Moreover, avoiding these should be the default behaviour. Currently the default is to drain your pockets.
P.S
CLAUDE.md is sometimes useful, but it's yet another token drain, especially since it can grow exponentially.
On a lot of linux distros there is the `moreutils` package, which contains a command called `chronic`. Originally intended to be used in crontabs, it executes a command and only outputs its output if it fails.
I think this could find another use case here.
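With moreutils installed this is just `chronic some-command`. For machines without it, the behaviour can be roughly imitated with a few lines of shell. This is my own approximation with an invented name (`quiet_unless_failed`), not how `chronic` itself is implemented:

```shell
# Rough imitation of chronic-like behaviour (hypothetical helper, not the
# moreutils implementation). Output is buffered in a temp file and shown
# only when the command exits with a non-zero status.
quiet_unless_failed() {
  out_file=$(mktemp "${TMPDIR:-/tmp}/quiet.XXXXXX") || return 1
  "$@" >"$out_file" 2>&1
  rc=$?
  if [ "$rc" -ne 0 ]; then
    cat "$out_file"
  fi
  rm -f "$out_file"
  return "$rc"
}
```

A passing build then costs zero lines of context, and a failing one still shows everything.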
Make (or whatever) targets that direct output to a file and return a subset have helped me quite a bit. Then wrap that in an agent that also knows how and when to return cached and filtered data from the output vs. rerunning. Fewer tokens spent reading output details that usually won't matter, coupled with less context pollution in the main agent from figuring out what to do.
If you use a smaller model for the sub agent you get all three
Of course you can combine both approaches for even greater gains. But Claude Code and like five alternatives gaining an efficient tool-calling paradigm where console output is interpreted by Haiku instead of Opus seems like a much quicker win than adding an LLM env flag to every cli tool under the sun
Dunno about that. Having used the $20 claude plan, I ran out of tokens within 30 minutes if running 3-4 agents at the same time. Often times, all 3-4 will run a build command at the end to confirm that the changes are successful. Thus the loss of tokens quickly gets out of hand.
Edit: Just remembered that sometimes, I see claude running the build step in two terminals, side-by-side at nearly the same time :D
I noticed this on my spring boot side project. Successful test runs produce thousands of log lines in default mode because I like to e.g. log every executed SQL statement during development. It gives me visibility into what my orm is actually doing (yeh yeh I know I should just write SQL myself). For me it's just a bit of scrolling and cmd+f if I need to find something specific but Claude actually struggles a lot with this massive output.
Especially when it then tries to debug things finding the actual error message in the haystack of logs is suddenly very hard for the LLM. So I spent some time cleaning up my logs locally to improve the "agentic ergonomics" so to say.
In general I think good DevEx needs to be dialed to 11 for successful agentic coding. Clean software architecture and interfaces, good docs, etc. are all extremely valuable for LLMs because any bit of confusion, weird patterns or inconsistency can be learned by a human over time as a "quirk" of the code base. But for LLMs that don't have memory they are utterly confusing and will lead the agent down the wrong path eventually.
I wonder whether attention-free architectures like Mamba or Gated DeltaNet are distracted less by irrelevant context, because they don't recall every detail inside their context window in the first place. Theoretically it should be fairly easy to test this via a dedicated "context rot benchmark" (standard benchmarks but with/without irrelevant context).
great idea. thought about the waste of tokens dozens of times when I saw claude code increase the token count in the CLI after a build. I was wondering if there's a way to stop that, but not enough to actually look into it. I'd love for popular build tools to implement something along those lines!
Speaking of obvious questions. Why are you counting pennies instead of getting the LLM to do it? (Unless the idea was from an LLM and the executive decision was left to the operator, as well as posting the article)
So much content about furnishing the Markdown and the whatnot for your bots. But content is content?
This all seems like a lot of effort so that an agent can run `npm run build` for you.
I get the article's overall point, but if we're looking to optimise processing and reduce costs, then 'only using agents for things that benefit from using agents' seems like an immediate win.
You don't need an agent for simple, well-understood commands. Use them for things where the complexity/cost is worth it.
Feedback loops are important to agents. In the article, the agent runs this build command and notices an error. With that feedback loop, it can iterate on a solution without requiring human intervention. But the fact that the build command pollutes the context in this case is a double-edged sword.
https://x.com/ProfRobAnderson/status/2019078989348774129
> Indeed hallucinated cases are "better law." Drawing on Ronald Dworkin's theory of law as integrity, which posits that ideal legal decisions must "fit" existing precedents while advancing principled justice, this article argues that these hallucinations represent emergent normative ideals. AI models, trained on vast corpora of real case law, synthesize patterns to produce rulings that optimally align with underlying legal principles, filling gaps in the doctrinal landscape. Rather than errors, they embody the "cases that should exist," reflecting a Hercules-like judge's holistic interpretation.
It could depend on what you're doing, but if it's not for work the config hell is probably optional.
They do an excellent job of reading documentation and searching to pick and choose and filter config that you might care about.
After decades of maintaining them myself, this was a huge breath of fresh air for me.
Why was the output so verbose in the first place then?
Another thing that helps is using plan mode first, since you can more or less see how it's going to proceed and steer it beforehand.
1) It's only $100, well worth the money.
2) Surprisingly little value was provided for all that money.
this avoids having to update everything to support LLM=true and keep your current context window free of noise.
There :)
The best friend isn't a dog, but the family that you build. Wife/Husband/kids. Those are going to be your best friends for life.