Somehow, for AI agents, taking longer is getting praised under the framing of “maintaining attention over long-time horizons”?
Have we collectively gone down to room temperature IQs with COVID?
Why would the time dimension matter for a tool that is limited by its context window? It doesn’t matter whether you fill up the window in 1 second or 60 minutes. Also, it’s super easy to game. Insert random lags, reduce tokens/sec, there you have a model that maintains attention over “long-time horizons”
Maybe more importantly, how do people in this field buy into these easily gameable non-indicators? How did they not develop the instinct to call out metrics like lines of code, number of tokens burned, or time taken to process a task as BS the instant they hear them?
How do they benchmark their code? The longer running the better? Number of CPU cycles spent?
This is not "how long does AI take to do ${thing}", it is "how long does a *human* take to do ${thing}", where ${thing} is from the set of things that AI has probability = n of getting right, and n happens to be 50% or 80% in the METR studies.
At least, that's the short answer, here's a video with more depth: https://www.youtube.com/watch?v=evSFeqTZdqs
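For concreteness, here's a toy sketch of that definition (made-up data and a hand-rolled logistic fit, purely illustrative, not METR's actual methodology or numbers): fit P(success) as a logistic function of the log of the *human* completion time, then read off the human-time at which the fitted curve crosses 50% or 80%.

```python
import math

# Hypothetical (human_minutes, model_succeeded) observations.
# Each task's duration is how long a *human* took to do it; the model's
# success probability is modeled as logistic in log(human time).
observations = [
    (2, 1), (5, 1), (10, 1), (15, 1), (30, 1), (30, 0),
    (60, 1), (60, 0), (120, 0), (120, 1), (240, 0), (480, 0),
]

def fit_logistic(data, lr=0.1, steps=20000):
    """Fit P(success) = sigmoid(a - b*log(minutes)) by gradient ascent
    on the log-likelihood."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for minutes, y in data:
            x = math.log(minutes)
            p = 1 / (1 + math.exp(-(a - b * x)))
            grad_a += y - p          # d(log-lik)/da
            grad_b += (y - p) * -x   # d(log-lik)/db
        a += lr * grad_a / len(data)
        b += lr * grad_b / len(data)
    return a, b

def horizon(a, b, p):
    """Human-minutes at which the fitted success probability equals p."""
    return math.exp((a - math.log(p / (1 - p))) / b)

a, b = fit_logistic(observations)
print(f"50% time horizon: {horizon(a, b, 0.5):.0f} human-minutes")
print(f"80% time horizon: {horizon(a, b, 0.8):.0f} human-minutes")
```

Note the wall-clock time the model itself spends never appears anywhere: the x-axis is human time, and the "horizon" is a point on that axis.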
My experience is that the AI actually completes the task in a few minutes, when it was a 2-ish hour task and the AI has a time horizon of 2 hours at P(correct) = 0.8. It is I, the human, not the AI, who would have taken 2 hours.
All I see now is celebration of how agents run for hours and handle “long-time horizons.”
Although the original definition is also flawed for coding. How do you estimate the time it takes to complete a coding task in hours? If we had that formula, why have we been playing estimation poker or resorting to fibonacci series for predicting software tasks? Because you can’t. It’s a made up metric.
Then why did you write "Also, it’s super easy to game. Insert random lags, reduce tokens/sec, there you have a model that maintains attention over “long-time horizons”"?
The wall-clock time the LLM spends per task isn't the metric. How long you can leave the LLM alone (wall-clock time, without intervention) isn't "long time horizons" either; it's more like "I gave it a list of tasks and it worked through them". Which is neat when it works, but different.
> All I see now is celebration of how agents run for hours and handle “long-time horizons.”
Yes? And? The long time horizon is with reference *to how long it would take humans to do the task*. Of course this is celebrated. When I've experimented with them, quite often after finishing one task from the plan, they'll go right on to the next task. Each task may take minutes, but the plan can have hundreds of items in it, and hundreds of minute-by-the-clock tasks do indeed add up to hours.
You're literally, in your opening sentence, complaining about 2 + 2 taking longer to solve; this isn't even close to the point of the "time horizons" metric.
> How do you estimate the time it takes to complete a coding task in hours? If we had that formula, why have we been playing estimation poker or resorting to fibonacci series for predicting software tasks? Because you can’t. It’s a made up metric.
Mostly it wasn't estimated, but rather *measured*:
- https://arxiv.org/html/2503.14499v3

As with all the other metrics, this is now basically saturated, as nobody seems to want to pay METR $4M to hire a statistically significant number of engineers to spend 4h-1w on each of another 800 baselines for longer tasks. Or if they are, it's being kept very quiet.
Not sure how you’d measure software engineering tasks in an isolated manner like that. There are things I need to look up docs for, and others I don’t need to. And that depends on the person. There are tedious tasks that I sometimes get right with my first try, other times I have to look away for a minute and look back at it to get right. There is internet speed. Task evolves or architecture changes mid-task.
I wouldn’t consider anything well-defined and repetitively measurable a “long-time horizon task” - adding a new HTTP handler isn’t one, adding a new React route isn’t one.
Edit: Apparently there are people who care to be precise about this. See: https://subq.ai and how they describe it as "long‑context tasks."
To quote the researchers who coined the term:
- https://metr.org/time-horizons/

If by "these people" you mean people like you, who conflate "long time horizon" with "long wall-clock time" like you did, then yes, that's why I replied to you.
Conversely, when a researcher says "I can leave my LLM running for hours, because it has a long time horizon", this is *causality*. Car analogy: if time horizon is fuel efficiency, the LLM working by itself for hours at a time is like driving your car for thousands of miles. The latter can obviously be gamed by having a bigger fuel tank, but also comes automatically from having a more efficient engine. Max range != Engine efficiency, but more efficient engines increase range. "Long wall clock time without intervention" != "Long time horizon", but longer time horizons increase wall clock time without intervention.
In fact, another relevant quote from the researchers who coined the term:
- https://metr.org/time-horizons/

> Not sure how you’d measure software engineering tasks in an isolated manner like that. There are things I need to look up docs for, and others I don’t need to. And that depends on the person. There are tedious tasks that I sometimes get right with my first try, other times I have to look away for a minute and look back at it to get right. There is internet speed. Task evolves or architecture changes mid-task.
Are you unfamiliar with how statistics deal with such things? Even the quote I gave you in the previous comment had some of the humans failing to complete some of the tasks.
Also, to quote the researchers who coined the term:
- https://metr.org/time-horizons/

> I wouldn’t consider anything well-defined and repetitively measurable a “long-time horizon task” - adding a new HTTP handler isn’t one, adding a new React route isn’t one.
First, "long" is a relative statement, not absolute. The early models could *only* reliably help with things that take a human a few seconds, e.g. stubbing out a function. Now they're up to 1.5 hours at P(success)=80%, or 11h59m at P(success)=50%. That is what "long time horizons" means in these cases: https://metr.org/time-horizons/
Second, the entire point of the METR study I linked you to is to put those tasks you are dismissive of on the same chart as frontier models and early models, in order to find out what kinds of things each model can do. I suggest reading it or watching the video; both explain this point.
> Edit: Apparently there are people who care to be precise about this. See: https://subq.ai and how they describe it as "long‑context tasks."
Incorrect. "Long context" is a third thing, "long context" != "Long time horizon" != "Long wall clock time without intervention".
In the car analogy, where "time horizon" maps to fuel efficiency, wall clock time maps to range, context maps to how good your field of view is from the driving seat.