LLMs Corrupt Your Documents When You Delegate

(arxiv.org)

94 points | by rbanffy 6 hours ago

11 comments

  • timacles 58 minutes ago
    Least shocking thing I've read about LLMs recently.

    They are essentially like that one JPEG meme, where each pass of saving as JPEG slightly degrades the quality until, by the end, it's unrecognizable.

    Except with LLMs, the starting point is intent. Each pass through an LLM degrades the intent: in the case of a precise scientific paper, a little bit of nuance and a little bit of precision is lost with a re-wording here and there.

    LLMs are mean-reversion machines: the further 'outside of their training' the context/workload they are currently dealing with, the more they will tend to gradually pull it toward some homogeneous abstract equilibrium.

    • ekidd 15 minutes ago
      Where this result is actually interesting and relevant is when a coding agent splits a large source file into multiple smaller files. Opus + Claude Code will try to recite long sections of source code from memory into each of the new files, instead of using some sort of copy/paste operation like a human would.

      Moving a file is a bit easier. LLMs may sometimes try to recite the file from memory. But if you tell them to use "git mv" and fix the compiler errors, they mostly will.

      Ordinary editing, on the other hand, generally works fine with any reasonable model and tool setup. Even Qwen3.6 27B is fine at this. And for in-place edits, you can review "git diff" for surprises.

    • Twirrim 25 minutes ago
      A coworker talks about LLMs as "bullshit" layers. Not exactly dismissing them or being derogatory about them, but emphasising that each time you feed something through an LLM, what comes out the other side may not be what you expect/want. Like that guy at the pub sharing what he'd seen online somewhere, after a few pints. Might be accurate, but carries notable risk it's not.

      So e.g., don't use an LLM to call an API to gather data and produce a report on it, as that's feeding deterministic data through a "bullshit" layer, meaning you can't trust what comes out the other side. Instead use the LLM to help you write the code that will produce a deterministic output from deterministic data.
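
      To make that concrete, here's a rough sketch of the shape of it (the endpoint and field name are made up for illustration): the LLM helps write something like this once, and every run after that is deterministic.

        import json
        from urllib.request import urlopen

        def build_report(url: str) -> str:
            # Deterministic data in, deterministic report out - no "bullshit" layer in the loop.
            with urlopen(url) as resp:
                records = json.load(resp)
            total = sum(r["amount"] for r in records)  # "amount" is a placeholder field
            return f"{len(records)} records, total amount: {total:.2f}"

        if __name__ == "__main__":
            print(build_report("https://example.com/api/orders"))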

      I've seen co-workers use LLMs to summarise deterministic data coming from APIs and have reports be wildly off the mark as often as they are accurate. Depending on what they're looking at that can have catastrophic risk.

      • ben_w 5 minutes ago
        Similar experience. I wouldn't say it even needs to be like some random person in the local pub: this behaviour is what you'd get from any game of telephone, book authors will say how you need to be blunt and direct about points in the book because readers will miss subtlety, anyone who has been quoted in a newspaper will have a story about the paper getting it wrong, etc.

        However, there's a reason pre-computing bureaucracy came with paper trails and written-up meeting minutes, and why courts are increasingly cautious about the reliability of eyewitnesses.

        It is ironic: the more AI becomes like us and the less it acts like a traditional computer program, the worse it is at many of the things we want to use it for. But because collectively we're oblivious to our own cognitive limitations, we race into completely avoidable failures like this.

    • threethirtytwo 20 minutes ago
      A human doing the same task as the LLM did in the paper would degrade the document even further than the LLM did. If the LLM degrades it by 25%, a human using the same technique would probably degrade it by 80%. I'm talking about a single pass.

      The fact of the matter is, humans don't edit things the way it was done in the paper, and neither do coding agents like Claude. Think about it: you do not ingest an entire paper and then regurgitate the whole thing just to make a single targeted edit... and neither do coding agents.

      Also, think carefully: a 25% degradation rate is unacceptable in the industry. The AI shift that's taking over SWE development would not exist if it caused 25% degradation... that's way too much.

      • lelanthran 9 minutes ago
        Are we comparing humans to LLMs or human written software to LLMs?

        The whole point of creating software to do things used to be getting things done more accurately and consistently.

  • simonw 45 minutes ago
    I'm suspicious of their results with regards to tool usage.

    It's unsurprising that round-tripping long content through an LLM results in corruption. Frequent LLM users already know not to do that.

    They claim that tool use didn't help, which surprised me... but they also said:

    > To test this, we implemented a basic agentic harness (Yao et al., 2022) with file reading, writing, and code execution tools (Appendix M). We note this is not an optimized state-of-the-art agent system; future work could explore more sophisticated harnesses.

    And yeah, their basic harness consists of read_file() and write_file() - that's just round-tripping with an extra step!

    The modern coding agent harnesses put a LOT of work into the design of their tools for editing files. My favorite current example of that is the Claude edit suite described here: https://platform.claude.com/docs/en/agents-and-tools/tool-us...

    The str_replace and insert commands are essential for avoiding risky round-trip edits of the whole file.
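
    For a sense of the difference, a str_replace-style tool is only a few lines - this is a rough sketch of the idea, not Anthropic's actual implementation:

      from pathlib import Path

      def str_replace(path: str, old_str: str, new_str: str) -> str:
          # The model supplies old_str/new_str, so the file contents never
          # round-trip through the context window.
          text = Path(path).read_text()
          count = text.count(old_str)
          if count == 0:
              return f"Error: old_str not found in {path}"
          if count > 1:
              return f"Error: old_str matches {count} times; provide a longer, unique snippet"
          Path(path).write_text(text.replace(old_str, new_str, 1))
          return f"Edited {path}"

    The uniqueness check is what keeps this safe: the model has to quote the exact text it wants to change, and everything else in the file stays byte-for-byte identical.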

    They do at least provide a run_python() tool, so it's possible the better models figured out how to run string replacement using that. I'd like to see their system prompt and if it encouraged Python-based manipulation over reading and then writing the file.

    Update: found that harness code here https://github.com/microsoft/delegate52/blob/main/model_agen...

    The relevant prompt fragment is:

      You can approach the task in whatever
      way you find most effective:
      programmatically or directly
      by writing files
    
    As with so many papers like this, the results of the paper reflect more on the design of the harness that the paper's authors used than on the models themselves.

    I'm confident an experienced AI engineer / prompt engineer / pick your preferred title could get better results on this test by iterating on the harness itself.

    • threethirtytwo 27 minutes ago
      People love to interpret the results in the most negative way possible because it's a threat to their occupation and identity. I'm referring to HN specifically.

      The fact of the matter is, if you want to edit a document by reading it and then regurgitating the entire document with said edits... a human will DO worse than 25% degradation. It's possible for a human to achieve 0% degradation, but the human would have to ingest the document hundreds of times to achieve a state called "memorization". The equivalent in an LLM is called training. If you train a document into an LLM, you can get parity with the memorized human edit in this case.

      But the above is irrelevant. The point is LLMs have certain similarities with humans. You need to design a harness such that an LLM edits a document the same way a human would: Search and surgical edits. All coding agents edit this way, so this paper isn't relevant.

  • causal 2 hours ago
    Yeah I've been saying this for a while: AI-washing any text will degrade it, compounding with each pass.

    "Semantic ablation" is my favorite term for it: https://www.theregister.com/software/2026/02/16/semantic-abl...

    • mohamedkoubaa 1 hour ago
      I've been calling it meanwit reversion
    • polskibus 1 hour ago
      By „with each pass” do you mean within the same session, or with new session (context window) each time?
      • sebastiennight 1 hour ago
        In my experience, it happens with each edit of the document, whether or not you clear the context window.

        You can somewhat mitigate this, at the same moment you ask for the new edit, by adding new info or specifying the lost meaning you want to add back. But other things will still get washed out.

        Nuances will drift, sharp corners will be ablated. You're doing a Xerox copy of your latest Xerox copy, so even if you add your comments with a sharpie, anything that was there right before will be slightly blurrier in the next version.

      • adampunk 1 hour ago
        Each edit, even with unrelated edits. I had a README referring to something as "the cathedral of s*t" (some HN commentators don't care for the swearing, which is systemically bad news but w/e) and the robot would lift that phrase out in drive-bys, repeatedly.

        Occasionally it would report the action, sometimes it would not bother to report it. It never reached into the README on an unrelated doc edit, but if it was touching the README, that line was getting excised.

  • meander_water 35 minutes ago
    > We find that models are not failing due to “death by a thousand cuts” (i.e., many small errors). Instead, they maintain near-perfect reconstruction in some rounds, and experience critical failures in a few rounds, typically losing 10-30+ points in a single round trip

    > We find that weaker models’ degradation originates primarily from content deletion, while frontier models’ degradation is attributable to corruption of content.

    I think we largely already knew this. This is why we fudge around with harnesses and temperature etc.

  • jonmoore 2 hours ago
    I really liked the evaluation method here - testing fidelity by round-tripping through chains of invertible steps. It was striking how even frontier models accumulated errors on seemingly computer-friendly tasks.

    It would be interesting to know if the stronger results on Python are not just an artefact of the Python-specific evaluation, if they carry over to other common general-purpose languages, and if they are driven by something specific in the training processes.
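
    For anyone who wants the flavour of the method, it's conceptually something like the sketch below. This is not the paper's code: `ask` stands in for whatever model client you use, and the Markdown/JSON pair is just a guess at one invertible step.

      import difflib
      from typing import Callable

      def round_trip_scores(doc: str, ask: Callable[[str, str], str], rounds: int = 5) -> list[float]:
          # Repeatedly apply a forward and inverse conversion, scoring how much of the original survives.
          scores, current = [], doc
          for _ in range(rounds):
              encoded = ask("Convert this Markdown document to JSON.", current)
              current = ask("Convert this JSON back to the original Markdown.", encoded)
              scores.append(difflib.SequenceMatcher(None, doc, current).ratio())
          return scores

    A perfectly faithful model would keep every score at 1.0; the striking result is that even frontier models don't.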

  • carterschonwald 11 minutes ago
    this is literally just "leave a child at the work computer with a real doc open playing office". otoh it is good to design benchmarks to ground these things.

    on the flip side, if you're literally just using a bare-bones harness on top of a stochastic parrot, of course stochastic errors accumulate.

    there's a lot of ways to improve text faithfulness through harness tool design, and my incremental experiments seem promising.

    but unless work is gated on shit like "the script used must be type-checked ghc haskell or lean4", unsupervised stuff is gonna decay

  • woeirua 1 hour ago
    It's an interesting paper, but I'd like to see a lot more about the types of errors that the LLM makes. Are they happening in the forward pass or the inverse pass? My guess is the inverse pass.
  • threethirtytwo 33 minutes ago
    This experiment needs to be put in perspective. Let me explain. IF you did this SAME experiment with a human, and had the human read an ENTIRE document and then reproduce it with edits, the DOCUMENT would DEGRADE even more.

    The way this experiment is conducted is not in line with how current agentic AI is used, OR with how humans edit documents.

    Here's how agentic AI tools typically do edits:

    1. They read the whole document.
    2. They come up with a patch: a diff of the section they want to edit.
    3. They change THAT section only.
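
    As a toy illustration of a surgical patch versus re-emitting the whole file (this is a sketch, not any particular agent's implementation):

      import difflib

      original = [
          "Q3 summary\n",
          "Revenue grew 8% year over year.\n",
          "Headcount was flat.\n",
      ]
      edited = original.copy()
      edited[1] = "Revenue grew 12% year over year.\n"  # the one targeted change

      # The agent only has to get this one hunk right; everything else is untouched.
      print("".join(difflib.unified_diff(original, edited, "a/report.md", "b/report.md")))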

    This is NOT what that experiment was doing. A 25% degradation rate would render the whole industry dead. No one would be using Claude Code because of that. The reality is... everyone is using Claude Code.

    AI is alien to the human brain, but in many ways it is remarkably similar. This is one aspect of that similarity: we cannot holistically re-emit a whole document just to produce one edit either. It has to be targeted surgical edits rather than a regurgitation of the entire document with said edit.

  • cyanydeez 2 hours ago
    I played around with a local LLM to try to build a wiki-like DAG. It made a lot of stupid errors, from vague generic things like interpreting content based on file names, to not following redirects and placing the redirect response in the files instead.

    I've also had them convert to markdown something like an Excel-formatted document. It worked pretty well as long as I was examining the output. But the longer it ran in context, the more likely it was to slip things in that seemed related but weren't part of the breakdown.

    The only way I've found to mitigate some of it is to make every file a small, purpose-built doc. That way you can use git to revert changes, and you also limit the damage to that small context every time they touch a file.

    Anyone who thinks they're a genius creating docs or updating them isn't actually reading the output.

    • sebastiennight 1 hour ago
      > I've also had them convert to markdown something like an excel formatted document.

      This looks like a task where the LLM would be best used to write a deterministic script or program that then does the conversion.

      Trusting an LLM to make the change without tools is like telling the smartest person you know to just recite the converted document out loud from memory. At some point they'll get distracted, get it wrong, or unwittingly inject their own biases and ideas whenever the source data is counter-intuitive to them.
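
      For the Excel-to-markdown case above, the deterministic version is tiny. A minimal sketch, assuming openpyxl is installed and with "budget.xlsx" standing in for whatever workbook you have:

        from openpyxl import load_workbook

        def sheet_to_markdown(xlsx_path: str) -> str:
            # Read the active worksheet and emit a markdown table, deterministically.
            ws = load_workbook(xlsx_path, data_only=True).active
            rows = [["" if cell is None else str(cell) for cell in row]
                    for row in ws.iter_rows(values_only=True)]
            header, body = rows[0], rows[1:]
            lines = ["| " + " | ".join(header) + " |",
                     "| " + " | ".join("---" for _ in header) + " |"]
            lines += ["| " + " | ".join(r) + " |" for r in body]
            return "\n".join(lines)

        print(sheet_to_markdown("budget.xlsx"))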

      • trollbridge 48 minutes ago
        I see people cut and paste from Excel into a chat, as an image, and ask it to sum up numbers.
      • cyanydeez 46 minutes ago
        It was, but the formatting was garbage, so it ran again to fix the format.
  • adampunk 1 hour ago
    LLMs will make mistakes on every turn. The mistakes will have little to no apparent connection to "difficulty" or to what may or may not be prevalent in the training data. They will occur at all levels of operation, from planning to code writing to reporting. Whether those mistakes matter and whether you catch them is mostly up to you.

    I have yet to find a model that does not make mistakes each turn. I suspect that this kind of error is fundamentally incorrigible.

    The most interesting thing about LLMs is that despite the above (and its non-determinism) they're still useful.

    • simonw 34 minutes ago
      > I have yet to find a model that does not make mistakes each turn

      What kind of mistakes are you talking about here?

    • pyrolistical 1 hour ago
      As a human I make typos all the time
      • dangus 53 minutes ago
        A human can sit down and say “I’m going to make sure this is correct on the first pass and make sure I make an exact copy.”

        They have cognitive awareness of which tasks are highly critical and need more checking and re-checking without being prompted to think that way.

        For a human, time doesn’t stop when the first pass of the prompt and response is over. An LLM effectively wipes its memory of what it just did, unless something keeps track of it in a highly resource-constrained context window.

        An LLM is like an author of a book that immediately closes its eyes and wipes its memory after writing a chapter. Sure, it can pull some of that back in the next query via context, and it can regain context very quickly, but it effectively has no memory of the exact thing it just did.

        When a human is doing these tasks there is a lot of room for mistakes but there’s also a wildly higher capacity for flowing through time.

        • adampunk 47 minutes ago
          Ok, and?
          • simonh 38 minutes ago
            Humans understand what mistakes are and can reason about what constitutes a mistake and what doesn’t. LLMs can’t do that.

            It’s for the same reason that they will invent bullshit instead of saying “I don’t know”, when they don’t know. They don’t have a concept of accuracy of facts.

          • dangus 43 minutes ago
            And that’s why I’m paid six figures and my LLM is paid $20/month.
      • adampunk 1 hour ago
        I do too! I also make higher level design errors and get too enthusiastic about projects before code is written.

        We are, in a sense, fallible machines who have designed a planet-wide computational fabric around that fact.
