While I understand why they used the METR data, a cleaner look would be against the current cost-optimal frontier of open models (e.g. GLM-5.1 and MiniMax-M2.7). That paints a very different picture. Comparing just the frontier models at the time of the METR report invariably leads to looking at providers who are pushing the limits of cost at the time of the report.
GPT-5 was shown as being on the costly end, surpassed by o3 at over $100/hr. I can't directly compare to METR's metrics, but a good proxy is the cost of the Artificial Analysis suite. GLM-5.1 is less than half the cost to complete the suite of GPT-5 and is dramatically more capable than both GPT-5 and o3.
So while their analysis is interesting, it points towards the frontier continuing to test the limits of acceptable pricing (as Mythos is clearly reinforcing), while distillation and refinement, lagging 6-12 months behind, continue to bring the cost of comparable capabilities down to much more reasonable levels.
Calculating hourly costs for these models makes me think that the decision of when to hire an SWE vs. increase use of AI may follow a similar pattern to the decision to use cloud compute vs. on-premises. I don’t cost $120/hr (incl. fringe), but my employer pays my salary all year long, no matter if I am working or on vacation. Whereas if they use an AI model to do the same work, they may be happy to pay $120/hr or more, since they may only use the model for a small fraction of 2080 hours per year, so they’d still save money, and not have a messy human to deal with.
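The break-even logic here is easy to sketch. A minimal back-of-envelope calculation, using the comment's illustrative numbers ($120/hr fully loaded, a 2080-hour work year); the $150/hr agent rate is an assumption for illustration:

```python
# Back-of-envelope: salaried SWE vs. pay-per-use agent.
# Dollar figures are illustrative, taken from the comment above;
# the $150/hr agent rate is an assumption.

SWE_HOURLY = 120       # fully loaded cost in $/hr (incl. fringe)
HOURS_PER_YEAR = 2080  # standard full-time year

def annual_swe_cost():
    # The employer pays for the whole year regardless of utilization.
    return SWE_HOURLY * HOURS_PER_YEAR

def annual_agent_cost(agent_hourly, hours_used):
    # The agent is billed only for hours actually consumed.
    return agent_hourly * hours_used

def breakeven_hours(agent_hourly):
    # Agent hours per year at which the agent matches the SWE's annual cost.
    return annual_swe_cost() / agent_hourly

# Even at a premium $150/hr, the agent stays cheaper until utilization
# approaches 1664 hours per year.
print(breakeven_hours(150))  # 1664.0
```

The asymmetry is exactly the cloud-vs-on-prem one: a fixed annual cost against a metered one that only wins at low utilization.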
The framing of AI vs SWE cost assumes you know what the AI is actually spending. Most teams don't. They see a monthly total, not per-agent/per-step attribution. The decision math only works if both sides of the equation are real numbers. That is the gap Traeco closes. traeco.dev
I remain convinced that in software engineering we won’t keep time as the primary basis for project cost estimates, and this transition will happen rapidly. We’re going to shift to a capex/token-spend model for project estimates, where the business will say “ok, I do want that feature for $1000 in tokens”.
I agree with you directionally that project estimates are and will be affected by this, but I don't see a scenario in which time is completely removed from the equation with respect to projects and the estimates to execute on them. We're all constrained by time, a finite resource. It's always a factor in business.
Once a model is stable and good enough (for example Sonnet 4.6 or GPT 5.4, or something else in future), it can be burned into hardware like a Talaas chip, reducing the cost many times over and increasing the speed. At some point we can rely on an old model while being productive with it.
> On many task lengths (including those near their plateau) they cost 10 to 100 times as much per hour. For instance, Grok 4 is at $0.40 per hour at its sweet spot, but $13 per hour at the start of its final plateau. GPT-5 is about $13 per hour for tasks that take about 45 minutes, but $120 per hour for tasks that take 2 hours. And o3 actually costs $350 per hour (more than the human price) to achieve tasks at its full 1.5 hour task horizon. This is a lot of money to pay for an agent that fails at the task you’ve just paid for 50% of the time — especially in cases where failure is much worse than not having tried at all.
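One way to read the quoted 50% failure rate: if you pay for every attempt, the expected spend per successful completion doubles. A minimal sketch, assuming independent, identically priced retries (the retry model is my assumption, not METR's):

```python
def expected_cost_per_success(hourly_rate, task_hours, success_rate):
    """Expected spend to get ONE success when every independent
    attempt is paid for, pass or fail (geometric number of retries)."""
    cost_per_attempt = hourly_rate * task_hours
    expected_attempts = 1.0 / success_rate  # mean of a geometric distribution
    return cost_per_attempt * expected_attempts

# o3 at its full 1.5-hour horizon: $350/hr and 50% success (figures from
# the quote). One attempt costs $525; the expected spend per success is $1050.
print(expected_cost_per_success(350, 1.5, 0.5))  # 1050.0
```

And that still ignores the cases Ord flags where a failed attempt leaves you worse off than not trying.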
The crazy part about this is if you compare it not to US wages but european, for instance in the UK where the median software hourly wage is somewhere around $35-40 an hour, then humans are already cheaper than the best models.
The sweet spot thing is the real insight here and nobody seems to be talking about it.
Frontier models get hyped for their maximum task horizon, but that's also where they're 10-30x more expensive per hour than their optimal range. You're paying a massive premium for the hardest tasks and still failing half the time.
Honestly the practical takeaway is pretty boring: just break your work into smaller chunks. Not because the models can't handle longer tasks, but because the economics at shorter task lengths are just way better. The labs are racing to push the horizon out; the smart move for anyone actually paying the bills is to stay near the sweet spot and orchestrate from there.
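The chunking argument can be put in rough numbers using the GPT-5 figures quoted above (~$13/hr on ~45-minute tasks vs. ~$120/hr on ~2-hour tasks). The split into three chunks and the 30% orchestration overhead are assumptions for illustration:

```python
# Illustrative comparison using the GPT-5 figures quoted above:
# ~$13/hr on ~45-minute tasks vs. ~$120/hr on ~2-hour tasks.
# Assumes (hypothetically) the 2-hour task splits into three
# 45-minute chunks, with 30% orchestration/glue overhead.

monolith_cost = 120 * 2.0        # one 2-hour task at plateau pricing
chunked_cost = 3 * (13 * 0.75)   # three 45-minute tasks at sweet-spot pricing
overhead = 1.3                   # assumed 30% overhead on the chunked route

print(monolith_cost)              # 240.0
print(chunked_cost * overhead)    # ~38: roughly 6x cheaper even with overhead
```

Even generous overhead assumptions don't close a gap that large, which is why staying near the sweet spot wins on the bill.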
No, but the AI labs would love to frame it this way so they can continue to nerf models and increase prices while they use the cheap, highly performant, highly powerful models internally to replace all of your businesses.
Interesting read. I don't know if I quite buy the evidence, but it's definitely enough to warrant further investigation. It also matches up with my personal experience, which is that tools like Claude Code are burning through more and more tokens as we push them to do bigger and bigger work. But we all know the frontier model companies are burning through money in an unsustainable race to get you and your company hooked on their tools.
So: I buy that the cost of frontier performance is going up exponentially, but that doesn't mean there is a fundamental link. We also know that benchmark performance of much smaller/cheaper models has been increasing (as far as I know METR only looks at frontier models), so that makes me wonder if the exponential cost/time horizon relationship is only for the frontier models.
Until there is some drastically new hardware, we are going to see a situation similar to proof of work, where a small group hoards the hardware and can collude on prices.
Difference is that the current prices have a lot of subsidies from OPM
Once the narrative changes to something more realistic, I can see prices increasing across the board. I mean, forget $200/month for Codex Pro; expect $1000/month or something similar.
So it's a race between new supply of hardware, with new paradigm shifts that can hit the market, and the tide going out in the financial markets.
This is an interesting analysis, but "are the costs of AI agents also rising exponentially?" is a very bad question that this doesn't answer.
What's rising exponentially is the price of the most ambitious thing cutting edge agents can do.
But to answer whether the cost of AI agents is rising in general, you would take a fixed set of problems, and for each of them, ask "once it's solvable, how does the price change?"
For that latter question, there isn't a lot of data in these charts because there aren't enough curves for models of the same family over time, but it does look like there are a number of points where newer models solve the same problems at lower prices. Look at GPT5 vs. the older GPT models--the curve for GPT5 is shifted left.
AI feels more like a gamble.
People like gambling.
From casinos (win-lose), to lootboxes (uncertainty), or even extramarital sex (whose baby is it?).
This way - AI work is like a slot machine - will this work or not?
Either way - casino gets paid and casino always wins.
Nevertheless - if the idea or product is very good (filling high market pain) and not that difficult to build - it can enable non-coders to "gamble" for the outcome with AI for $.
Sadly - from my experience hiring devs - hiring people is also a gamble...
My expectation: demand going up, prices will rise, supply will saturate to the point of ubiquitous "utility" status, and prices will drop, probably a bell curve shape with sine-wave undulations along the way.
You could model more of the process: the dev's work as well as the model's, and the cost of catching a bug later or deploying it live. Those factors push me further towards smaller tasks in general. (And they make the Gas Town type stuff seem more baffling.)
- Smaller chunks make review much easier and more effective at finding bugs, as we've known since long before LLMs.
- Greater certainty provides a better development experience. I've heard people talk about how LLM development can be tiring. One way that happens, I think, is the win-or-lose drama of feeding in huge tasks with a substantial chance of failure. I think if you're succeeding 95% of the time instead of 70%, and the 5% are easier to deal with (smaller chunks to debug), it's a better experience.
- Everything is harder about real-world tasks because they aren't clean verifiable-reward benchmarks. Developers have context that models don't, so it's common that a problem traces to a detail not in the spec where the model guessed wrong. For real-world tasks, "failures" are also sometimes "that UI is bad" or "that way of coding it is hard to maintain." And it's possible to have problems the dev simply doesn't notice. The benchmarks' fully computer-checkable outcomes are 'easy mode' compared to the real world.
- Fixing agents' messes becomes more work as task sizes increase. (Like the certainty thing, but about cost in hours rather than the experience.) Again, if the model has spat out 1000 lines and stumped itself debugging a failure, it'll take you some time to figure out: more time than debugging a 250-line patch, and the larger patch is more likely to have bugs. And if a bug makes it out to peer review, you can add communication and context-switching cost (point out the bug, fix, re-review) on top of that.
- Bugs that reach prod are really expensive. This is more of a problem when a prod bug can lose you customers than on, say, most hobby projects. Ord's post gestures at it: there are "cases where failure is much worse than not having tried at all." That magnifies how important it is that the review be good, and how much of a problem bugs that sneak through are, which points towards doing smaller chunks.
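The certainty point above (succeeding 95% of the time instead of 70%) is easy to quantify, assuming independent retries:

```python
def expected_attempts(success_rate):
    # Mean number of tries until one success, with independent retries.
    return 1.0 / success_rate

def failures_per_100_tasks(success_rate):
    # Average number of debugging sessions that 100 tasks hand you.
    return 100 * (1 - success_rate)

# 95% vs. 70% success, the rates mentioned above:
print(expected_attempts(0.95), failures_per_100_tasks(0.95))  # ~1.05, ~5
print(expected_attempts(0.70), failures_per_100_tasks(0.70))  # ~1.43, ~30
```

Going from 70% to 95% cuts the failure pile by a factor of six, and each failure is a smaller chunk to debug, so the experience compounds in the right direction.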
How significant each factor is depends on details: how easy the task is to verify, how well-specified it is (and more generally how much it's in the models' wheelhouse, and how much in mine), how bad a bug would be (fun thing? internal tool? user facing? can lose data?).
I think the dynamics above apply across a range of model strengths, but that doesn't mean the changes from, say, Sonnet 3.7 to Opus 4.5 didn't mean anything; the machine getting better at gathering the info it needs and checking itself still helps at shorter task lengths. Harness improvements can help too, e.g. they could keep models out of the 'too much context, model got silly' zone (which may be less severe than it once was, but I suspect will remain a thing), build better context, and clean up code as well as spitting results out.
Besides taking more of your time up front, involving yourself more also tends to drift towards you making more of the lower-level decisions about how the code will look, which I find double-edged. You have better broad context, and you know what you find maintainable. But the implementer, model or another person, is closer to the code, which helps it make some mid-to-low-level decisions well.
Plan modes and Spec-Kit type things can help with the balance of getting involved but letting the model do its thing. I've liked asking the LLM to ask a lot of questions and surface doubts. A colleague messed with Spec-Kit so it would pick one change on its fine-grained to-do list at a time, which is a neat hack I'd like to try sometime.
If they can do a task that takes 1 unit of computation for 1 dollar they will cost 100 dollars for a 10 unit task and 10,000 for a 100 unit task.
Project costs from Claude Code bear this out in the real world.
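The three data points in the parent (1 unit for $1, 10 units for $100, 100 units for $10,000) imply cost growing with the square of task size. A sketch of that implied relationship; the exponent is read off those points, not a measured fact:

```python
def implied_cost(units, base_cost=1.0, exponent=2):
    # Fits the three points above (1 -> $1, 10 -> $100, 100 -> $10,000):
    # cost grows with the square of task size.
    return base_cost * units ** exponent

for units in (1, 10, 100):
    print(units, implied_cost(units))  # 1 1.0 / 10 100.0 / 100 10000.0
```

Strictly that curve is polynomial rather than exponential, but the practical consequence is the same: cost per unit of work climbs steeply as tasks get bigger.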
the first model to outcompete its competitors while using less compute would be purchased more than anything else
It’s true that at a given point in time the cost to achieve a certain task follows an exponential curve against the time taken by a human. But... so what?