It's an important question! If you are paying a lot of money to use AI models, you care that you are using the best one for your task. And it turns out that figuring out which AI model is best for your task is not trivial and requires some expertise.
That was too nice of a reply, I apologize. I just can't understand the thought process: what exactly are we optimizing for? If you are paying a lot of money to use AI models, you already have so much overhead that a precise ranking in an eval is not gonna make much difference between equally "frontier" models, especially since models are sensitive to the input. So the eval is just gonna evaluate the eval with very high accuracy. It might be equivalent to the illusion-of-safety thing applied to financial risk.
It's not just for choice of model, you can use it for your prompting as well. And yes, running evals is expensive and mostly of use to people with serious spend.
(Author) It's short for "evaluation", a test for an AI model. Specifically, an AI evaluation comprises (1) a dataset of prompts (questions, tasks, or queries), (2) some way to score model performance on each prompt, like a set of correct answers or a grading rubric that you can use with an LLM autograder, and (3) a metric, such as accuracy¹. (If you're already familiar with the term "benchmark", it's the same thing; for some reason "eval" has become the term of art in the past few years.)
For example, a simple eval is a dataset of multiple-choice questions, each with one correct answer, scored by accuracy. An example of this kind of eval is the Massive Multitask Language Understanding benchmark (2020) (https://arxiv.org/abs/2009.03300).
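To make the three parts concrete, here is a minimal Python sketch of a multiple-choice eval scored by accuracy. Everything in it is an assumption for illustration: the `Question` class, the toy dataset, the prompt wording, and the `always_b` stand-in for a model are all made up, and this is not how MMLU itself is implemented.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Question:
    prompt: str
    choices: list[str]
    answer: str  # letter of the correct choice, e.g. "B"


def accuracy(model: Callable[[str], str], dataset: list[Question]) -> float:
    """Fraction of questions where the model's first letter matches the key."""
    if not dataset:
        raise ValueError("dataset is empty")
    letters = "ABCDEFGH"
    correct = 0
    for q in dataset:
        options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(q.choices))
        reply = model(f"{q.prompt}\n{options}\nAnswer with a single letter.")
        # Scoring rule: compare the first character of the reply to the key.
        if reply.strip().upper()[:1] == q.answer:
            correct += 1
    return correct / len(dataset)


# (1) Dataset of prompts, with (2) one correct answer each. Toy data only.
DATASET = [
    Question("What is 2 + 2?", ["3", "4", "5", "22"], "B"),
    Question("Which planet is closest to the Sun?",
             ["Venus", "Earth", "Mercury", "Mars"], "C"),
]


def always_b(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return "B"


# (3) The metric: accuracy over the dataset.
print(accuracy(always_b, DATASET))  # 0.5
```

In practice `always_b` would be replaced by a call to whatever model you are testing; the dataset and the scoring rule stay fixed so that results are comparable across models.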
A more complex eval is FrontierCode (2026). Questions in FrontierCode represent coding tasks needed for real-world repos and are evaluated against rubrics scoring for correctness, code quality, cleanliness, and other factors. https://cognition.com/blog/frontier-code.
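A rubric-graded eval can be sketched the same way. The snippet below is only an illustration of the general idea: the criteria names echo the ones listed above, but the weights, the `Grader` signature, and the `stub_grader` function are hypothetical and are not taken from FrontierCode. In a real setup the grader would typically be an LLM autograder or a human reviewer.

```python
from typing import Callable

# Hypothetical rubric: criterion -> weight. The weights are invented here.
RUBRIC = {"correctness": 0.5, "code_quality": 0.3, "cleanliness": 0.2}

# A grader takes (task, submission, criterion) and returns a score in [0, 1].
Grader = Callable[[str, str, str], float]


def rubric_score(task: str, submission: str, grader: Grader,
                 rubric: dict[str, float] = RUBRIC) -> float:
    """Weighted average of per-criterion scores, normalized by total weight."""
    total_weight = sum(rubric.values())
    weighted = 0.0
    for criterion, weight in rubric.items():
        score = grader(task, submission, criterion)
        score = min(1.0, max(0.0, score))  # clamp out-of-range grader output
        weighted += weight * score
    return weighted / total_weight


def stub_grader(task: str, submission: str, criterion: str) -> float:
    """Hypothetical stand-in for an LLM autograder."""
    return 1.0 if criterion == "correctness" else 0.5


score = rubric_score("find the bug", "def f(): ...", stub_grader)
print(round(score, 2))
```

The per-task scores would then be averaged over the dataset to get the eval's headline metric, just as accuracy is in the multiple-choice case.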
¹Note that this is slightly different from the definition we used in [0], which defined an eval as a fixed set of input-output pairs combined with a metric. What's different from 2021 is that models are now given more open-ended inputs (prompts like "find the bug" plus a codebase, rather than a simple question), generate freeform output (rather than choosing a fixed answer), and are graded in a more complex manner (e.g. beyond correctness, a coding eval might also grade adherence to coding guidelines, test coverage, etc.).
[0] Liao, T., Taori, R., Raji, I. D., & Schmidt, L. (2021, January). Are we learning yet? A meta review of evaluation failures across machine learning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). https://thomasliao.com/are_we_learning_yet.pdf
The problem with evals is that the information does not go stale fast enough to keep you coming back for the latest model performance benchmarks. Bloomberg succeeded because it sells info that expires in the next hour.
Except regex, no matter how technologically advanced your company, somewhere someone is slapping regex on something that has no business being regexed.
If you look at the history of software engineering, the ones that made the most money were usually not the companies that built the applications themselves, but the ones that built the tools to verify, deploy, and build them, such as CI/CD, static analysis tools, and testing frameworks.
Personally, I agree with the Goodhart problem, but don't eval startups fail because they try to sell an 'evaluation service' rather than a 'verification toolchain'? The problem, it seems, is that AI verification toolchains require a model in the end, because they internalize AI and sell it under the name of a 'harness.'
So an AI verification (eval) toolchain would have to be structurally different. Verifying AI code isn't about whether it compiles. AI code can always be made to compile. The real issues are semantic, such as overfitting to existing designs and tests. To catch those issues, you ultimately need to build an AI. But building that AI is expensive. So in the end, AI verification companies depend on external model providers for the core components of their verification engine. I think this is a bad business decision.
I thought about mentioning Atlassian (Jira) and JetBrains, but come to think of it, they aren't really testing frameworks. They cover the entire development workflow. I guess I was thinking too narrowly.
I do agree that the author does not do a good job of introducing the term.
What kind of stupid business is this? Though nothing can beat SEO in that spirit.
Aha.
> built the tools to verify, deploy, and build them, such as CI/CD, static analysis tools, and testing frameworks.
Curious. Which company made money with testing frameworks?