We hid backdoors in ~40MB binaries and asked AI + Ghidra to find them

(quesma.com)

91 points | by jakozaur 2 hours ago

16 comments

7777332215 25 minutes ago
I know they said they didn't obfuscate anything, but if you hide imports/symbols and obfuscate strings, which is the bare minimum for any competent attacker, the success rate will immediately drop to zero.
This is detecting the pattern of an anomaly in language associated with malicious activity, which is not impressive for an LLM.
akiselev 1 hour ago
Shameless plug: https://github.com/akiselev/ghidra-cli
I’ve been using Ghidra to reverse engineer Altium’s file format (at least the Delphi parts) and it’s insane how effective it is. Models are not quite good enough to write an entire parser from scratch but before LLMs I would have never even attempted the reverse engineering.
I definitely would not depend on it for security audits but the latest models are more than good enough to reverse engineer file formats.
- bitexploder 32 minutes ago
  I can tell you how I am seeing agents be used with reasonable results. I will keep this high level. I don't rely on the agents solely. You build agents that augment your capabilities.
  They can make diagrams for you, give you an attack surface mapping, and dig for you while you do more manual work. As you work on an audit you will often find things of interest in a binary or code base that you want to investigate further. LLMs can often blast through a code base or binary finding similar things.
  I like to think of it as a Swiss Army knife of agentic tools to deploy as you work through a problem. They won't balk at some insanely boring task, and that can give you a real speedup. The catch is that if you try to get too much out of an LLM, you end up pouring time into your LLM setup and not getting good results; I think that is the LLM productivity trap. But if you have a reasonable subset of "skills" / "agents" you can deploy for various auditing tasks, it can absolutely speed you up some.
  Also, when you have scale problems, just throw an LLM at it. Even low quality results are a good sniff test. Some of the time I just throw an LLM at a code review thing for a codebase I came across and let it work. I also love asking it to make me architecture diagrams.
- jakozaur 1 hour ago
  Oh, nice find... We ended up using PyGhidra, but the models waste some cycles because of its bad ergonomics. Perhaps your CLI would be easier.
  Still, Ghidra's most painful limitation was how extremely slow it is with Go. We had to exclude that example from the benchmark.
- lima 1 hour ago
  How does this approach compare to the various Ghidra MCP servers?
  - akiselev 48 minutes ago
    There’s not much difference, really. I stupidly didn’t bother looking at prior art when I started reverse engineering, and so ghidra-cli was born (along with several others like ilspy-cli and debugger-cli).
    That said, it should be easier for a human to follow along with the agent, and Claude Code seems to have an easier time discovering commands than it does when all the tool definitions are stuffed into the context.
    - bitexploder 30 minutes ago
      That is pretty funny. But you probably learned something in implementing it! This is such a new field, I think small projects like this are really worthwhile :)
simianwords 1 hour ago
I'm not an expert but about false positives: why not make the agent attempt to use the backdoor and verify that it is actually a backdoor? Maybe give it access to tools and so on.
- jakozaur 56 minutes ago
  So many models refuse to do that due to alignment and safety concerns. So cross-model comparison doesn't make sense. We do, however, require proof (such as providing a location in binary) that is hard to game. So the model not only has to say there is a backdoor, but also point out the location.
  Your approach, however, makes a lot of sense if you are ready to have your own custom or fine-tuned model.
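The "verdict plus location" requirement described above can be sketched roughly as follows. This is only an illustration under assumptions: the names `GroundTruth`, `Finding`, and `score`, the address-range representation, and the choice to count a wrong location as a miss are all hypothetical and not taken from the BinaryAudit code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GroundTruth:
    """Hypothetical label for one binary: where the planted backdoor lives."""
    has_backdoor: bool
    start: int = 0  # first address of the backdoored region
    end: int = 0    # one past the last address (exclusive)


@dataclass(frozen=True)
class Finding:
    """Hypothetical model answer: a verdict plus an optional address as proof."""
    claims_backdoor: bool
    address: Optional[int] = None


def score(finding: Finding, truth: GroundTruth) -> str:
    """Classify one answer; a correct verdict without a correct location does not pass."""
    if not truth.has_backdoor:
        return "false_positive" if finding.claims_backdoor else "true_negative"
    if not finding.claims_backdoor:
        return "false_negative"
    located = (
        finding.address is not None
        and truth.start <= finding.address < truth.end
    )
    # Assumption: saying "backdoor" while pointing at the wrong place counts as a miss.
    return "true_positive" if located else "false_negative"


truth = GroundTruth(has_backdoor=True, start=0x401000, end=0x401080)
print(score(Finding(True, 0x401020), truth))  # location inside the region
print(score(Finding(True, 0x500000), truth))  # right verdict, wrong location
```

Requiring the address means a model cannot score well by answering "backdoor" on every binary, which is the sense in which such proof is hard to game.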
  - simianwords 47 minutes ago
    Surprising that they still allow catching the backdoors but not using them.
    A bad actor already has most of the work done.
magicmicah85 37 minutes ago
GPT is impressive, with a consistent 0% false positive rate across models, yet its detection rate tops out at 18%. Meanwhile, Claude Opus 4.6 detects up to 46% of backdoors but has a 22% false positive rate.
It would be interesting to have an experiment where these models are able to test exploiting but their alignment may not allow that to happen. Perhaps combining models together can lead to that kind of testing. The better models will identify, write up "how to verify" tests and the "misaligned" models will actually carry out the testing and report back to the better models.
- sdenton4 4 minutes ago
  It would be really cool if someone developed some standard language and methodology for measuring the success of binary classification tasks...
  Oh, wait, we have had that for a hundred years - somehow it's just entirely forgotten when generative models are involved.
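For reference, the standard vocabulary alluded to here (confusion matrix, precision, recall, false positive rate, F1) can be computed in a few lines. The counts below are hypothetical: they assume 100 backdoored and 100 clean binaries and plug in the 46% detection and 22% false positive figures quoted upthread. The thread does not state the benchmark's real sample sizes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts for a binary classifier (here: 'does this binary have a backdoor?')."""
    tp: int  # backdoored, flagged
    fp: int  # clean, flagged
    fn: int  # backdoored, missed
    tn: int  # clean, not flagged

    @property
    def recall(self) -> float:
        """Detection rate: share of backdoored binaries that were flagged."""
        total = self.tp + self.fn
        return self.tp / total if total else 0.0

    @property
    def false_positive_rate(self) -> float:
        """Share of clean binaries that were wrongly flagged."""
        total = self.fp + self.tn
        return self.fp / total if total else 0.0

    @property
    def precision(self) -> float:
        """Share of flagged binaries that really were backdoored."""
        total = self.tp + self.fp
        return self.tp / total if total else 0.0

    @property
    def f1(self) -> float:
        """Harmonic mean of precision and recall."""
        denom = 2 * self.tp + self.fp + self.fn
        return 2 * self.tp / denom if denom else 0.0


# Hypothetical sample sizes: 100 backdoored and 100 clean binaries.
m = Confusion(tp=46, fp=22, fn=54, tn=78)
print(f"recall={m.recall:.2f} fpr={m.false_positive_rate:.2f} "
      f"precision={m.precision:.2f} f1={m.f1:.2f}")
```

Precision depends on the ratio of backdoored to clean samples, so a detection rate and a false positive rate alone do not determine it.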
jakozaur 2 hours ago
See direct benchmark link: https://quesma.com/benchmarks/binaryaudit/
Open-source GitHub: https://github.com/QuesmaOrg/BinaryAudit
folex 1 hour ago
> The executables in our benchmark often have hundreds or thousands of functions — while the backdoors are tiny, often just a dozen lines buried deep within. Finding them requires strategic thinking: identifying critical paths like network parsers or user input handlers and ignoring the noise.
Perhaps it would make sense to provide LLMs with some strategy guides written in .md files.
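One way to make "identify critical paths and ignore the noise" concrete is a triage pass that ranks functions by which sensitive APIs they call. The sketch below is a hypothetical heuristic, not anything from the benchmark: the API lists, the weights, and the toy call graph are all assumptions, and a real call graph would have to be exported from a disassembler first.

```python
# Hypothetical API lists: functions that take untrusted input or run commands.
INPUT_APIS = {"recv", "recvfrom", "read", "accept", "getenv", "fgets"}
EXEC_APIS = {"system", "execve", "popen"}


def rank_functions(call_graph: dict[str, set[str]]) -> list[tuple[str, int]]:
    """Return (function, score) pairs, highest score first, omitting zero scores.

    Assumed weights: 1 point per input API called, 2 points per exec API called.
    """
    scored = []
    for name, callees in call_graph.items():
        score = len(callees & INPUT_APIS) + 2 * len(callees & EXEC_APIS)
        if score:
            scored.append((name, score))
    # Sort by descending score, then by name for a stable order.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Toy call graph with made-up function names.
graph = {
    "parse_request": {"recv", "memcpy"},
    "log_rotate": {"fopen", "fclose"},
    "handle_header": {"read", "system"},
}
for name, score in rank_functions(graph):
    print(name, score)
```

A ranking like this only orders what to read first; it does not say whether anything in those functions is malicious.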
- Arech 3 minutes ago
  That's what I thought of too. Given their task formulation (they basically said "check these binaries with these tools at your disposal" and that's it!), their results are already super impressive. With proper guidance and professional oversight, it's a tremendous force multiplier.
- selridge 18 minutes ago
  That’s hard. Sometimes you will do that and find it prompts the model into “strategy talk” where it deploys the words and frame you use in your .md files but doesn’t actually do the strategy.
  Even where it works, it is quite hard to specify human strategic thinking in a way that an AI will follow.
ducktastic 28 minutes ago
It would be interesting to have some tests run against deliberate code obfuscation next
nisarg2 56 minutes ago
I wonder how model performance would change if the tooling included the ability to interact with the binary and validate the backdoor. Particularly for models that had a high rate of false positives, would they test their hypothesis?
Bender 2 hours ago
Along these lines, can AIs find backdoors spread across multiple pieces of code and/or services? I.e., by themselves they are not backdoors, and advanced penetration testers would not suspect anything is afoot, but when used together they provide access.
e.g. an intentional weakness in systemd + udev + binfmt magic when used together == authentication and mandatory access control bypass. Each weakness reviewed individually just looks like benign sub-optimal code.
- cluckindan 1 hour ago
  Start with trying to find the xz vulnerability and other software possibly tying into that.
  Is there code that does something completely different than its comments claim?
  - Bender 49 minutes ago
    Another way to phrase what I am asking is: does AI understand the context of code deeply enough to know everything a piece of code can do, everything a service can do vs. what it was intended to do? If it can understand code that far, then it could understand all the potential paths data could flow and thus all the potential vulnerabilities that several pieces of code together could achieve when used in concert with one another. Advanced multi-tier chess, so to speak.
    Or put another way, each of these three through three hundred applications or services by themselves may be intended to perform x,y,z functions but when put together by happy coincidence they can perform these fifty-million other unintended functions including but not limited to bypassing authentication, bypassing mandatory access controls, avoiding logging and auditing, etc... oh and it can automate washing your dishes, too.
    - DANmode 7 minutes ago
      Some models can,
      depending on the length of the piece of code,
      is probably the most honest answer right now.
      - Bender 1 minute ago
        Fair enough. I suspect that once they reach the point where length no longer matters, a plethora of old and currently used state-sponsored complex malware will be uncovered. Beyond that, I think the next step would be attribution to individuals and perhaps to whoever really employed them. Bonus if the model can rewrite and sanitize each piece of code to remove the malicious capabilities without breaking the officially intended functions.
dgellow 54 minutes ago
Random thoughts, only vaguely related: what’s the impact of AI on CTFs? I would assume that kills part of the fun of such events?
BruceEel 1 hour ago
Very, very cool. Besides the top-performing models, it's interesting (if I'm reading this correctly) that gpt-5.2 did ~2x better than gpt-5.2-codex.. why?
- NitpickLawyer 53 minutes ago
  > gpt-5.2 did ~2x better than gpt-5.2-codex.. why?
  Optimising a model for a certain task, via fine-tuning (aka post-training), can lead to loss of performance on other tasks. People want codex to "generate code" and "drive agents" and so on. So oAI fine-tuned for that.
Roark66 15 minutes ago
And this is one demonstration of why these "1000 CTOs claim no effectiveness improvement after introducing AI in their companies" reports are 100% BS.
They may not have noticed an improvement, but that doesn't mean there isn't any.
- HeWhoLurksLate 9 minutes ago
  it also generally takes a heck of a noisy bang for internal developments to make it to the c-suite
Tiberium 56 minutes ago
I highly doubt some of those results. GPT 5.2/+codex is incredible for cyber security and CTFs, and 5.3 Codex (not on API yet) even more so. There is absolutely no way it's below Deepseek or Haiku. Seems like a harness issue, or they tested those models at none/low reasoning?
- jakozaur 52 minutes ago
  I build eval and training datasets for niche skills for a living, and you can find plenty of surprises.
  The code is open-source; you can run it yourself using Harbor Framework:
  git clone git@github.com:QuesmaOrg/BinaryAudit.git
  export OPENROUTER_API_KEY=...
  harbor run --path tasks --task-name lighttpd-* --agent terminus-2 --model openrouter/anthropic/claude-opus-4.6 --model openrouter/google/gemini-3-pro-preview --model openrouter/openai/gpt-5.2 --n-attempts 3
  Please open a PR if you find something interesting, though our domain experts spend a fair amount of time looking at trajectories.
  - Tiberium 12 minutes ago
    Just for fun, I ran dnsmasq-backdoor-detect-printf (which has a 0% pass rate in your leaderboard with GPT models) with --agent codex instead of terminus-2 with gpt-5.2-codex and it identified the backdoor successfully on the first try. I honestly think it's a harness issue, could you re-run the benchmarks with Codex for gpt-5.2-codex and gpt-5.2?
  - Tiberium 24 minutes ago
    Are the existing trajectories from your runs published anywhere? Or is the only way for me to run them again?
stevemk14ebr 20 minutes ago
These results are terrible, false positives and false negatives. Useless
- amelius 18 minutes ago
  Yeah, what does the confusion matrix look like?
shablulman 1 hour ago
Validating binary streams at the gateway level is such an overlooked part of the stack; catching malformed Protobuf or Avro payloads before they poison downstream state is a massive win for long-term system reliability.