Can the AGI race become self-funding?
Will inference pay the bills?
If you've spent any time reading about the AI industry in the last year, you've probably come away thinking it's a house of cards. HSBC estimates that OpenAI faces a $207 billion funding shortfall by 2030. The Economist called AI cash burn one of the defining bubble questions of 2026. Bill Gurley, the Benchmark partner who watched Uber's subsidy cycle from the inside, has been warning that AI burn rates are "more than Uber ever lost, more than Amazon ever lost." The natural assumption is that AI is running the same play Uber did: subsidize now to create dependence, spike prices later.
When I actually looked into the numbers, the Uber comparison didn't hold up. In January, Epoch AI published a detailed analysis of OpenAI's finances during GPT-5's four-month tenure as the flagship model. The GPT-5 bundle brought in $6 billion in revenue against roughly $4 billion in inference compute costs, leaving a gross margin around 30%. Anthropic has reported roughly 40%. Uber needed to raise prices because the ride itself lost money, but AI inference already makes money.
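The margin arithmetic behind those figures is simple enough to check directly. A minimal sketch, using the rounded numbers from Epoch's analysis (the exact financials are not public):

```python
# Rough gross-margin check using the rounded figures reported above.
# These are approximations from public reporting, not audited numbers.
revenue = 6.0e9          # GPT-5 bundle revenue over its four-month run
inference_cost = 4.0e9   # estimated inference compute cost, same period

gross_profit = revenue - inference_cost
gross_margin = gross_profit / revenue

print(f"gross profit: ${gross_profit / 1e9:.1f}B")  # $2.0B
print(f"gross margin: {gross_margin:.0%}")          # 33%
```

The exact result is 33%, which the reporting rounds down to "around 30%"; either way, the ride itself is profitable.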
Open-weight models confirm the pricing trend from the other side, and they've closed the quality gap alarmingly fast, partly because several allegedly distilled their training from frontier models. It's worth noting that distillation copies a model's outputs to use as training data, not its architecture, so the inference cost of running these models has nothing to do with what it costs to run Claude or GPT-5. The frontier today sits at Gemini 3.1 Pro and GPT-5.4, both scoring around 57 on Artificial Analysis's composite intelligence index, although on vibes alone I find Opus 4.6 wins on every actual use case for me. Meanwhile GLM-5, an open-weight model from Zhipu AI released in February, scores 49.8 at $1.55 per million tokens, which would have matched the best model in the world three months ago. Chinese open-weight models are often accused of benchmaxxing, and the skepticism is fair, but even heavily discounted scores would still beat previous-generation frontier models from less than a year ago.
Pricing is interesting, but it's not the question that matters most. The people running these labs are not optimizing for margins; they believe they are building something approaching general intelligence, and they need tens of billions in compute, talent, and data centers to get there. For a true AGI believer, profitability is not a goal but a constraint. The nightmare is having investors, creditors, or governments tell you to stop burning cash. Profitability matters only insofar as it buys independence from anyone who might pump the brakes. So the question that actually organizes everything else is: can the business generate enough internal cash to protect the people building AGI from outside constraint?
Right now it can't, and the reason is economic obsolescence and too-rapid depreciation. A lab spends billions training a frontier model. For a few months, it commands premium pricing from customers doing work valuable enough to merit the premium. Then a newer model arrives and the previous flagship slides down a tier. The pricing for that now second-tier model was set at a healthy 30-40% margin, but the price shown to the customer usually stays the same or decreases, because the newer model arrives at a lower price point. We saw this in the move from Opus 4.1 to 4.5, where the blended API price actually dropped from $30 to $15 per million tokens. The old model doesn't get aggressively repriced to capture new volume. It just sits there serving slower-moving enterprises, where any model update requires new assessments, new benchmarks, and new safety tests, or sensitive workflows pinned to a specific version.
Notwithstanding the above, the margin trajectory is moving in the right direction. Anthropic swung from negative 94% to roughly 40% in a single year, and projects 77% by 2028. The cash burn is eye-watering, with Fortune reporting that Anthropic consumed $2.8 billion more cash than it took in during 2025, but investor documents show the burn is projected to drop sharply toward break-even in 2028.
What's more, Epoch found that OpenAI spent roughly $5 billion on R&D in the four months before GPT-5 launched, and GPT-5 generated only about $2 billion in gross profit during its life before Gemini 3 Pro arguably surpassed it. The front-loaded capex is enormous and the pricing power evaporates in months. The premium tier earns healthy margins while it's actually deployed, but it can't recoup training costs before the next model makes it obsolete. If you believe you are building a technology that will 1000x research, extend human life, and catapult us into a new era, the costs are worth it. But if you are an investor lending to fund data-center buildouts and multibillion-dollar training runs, and you are not convinced of AGI, it starts to look a bit insane. There is only so long the skeptics among the investors will keep funding a race where every generation costs more to train than the last one earned.
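Putting those two Epoch figures side by side makes the recoupment gap explicit. A sketch under the same rounded assumptions:

```python
# Did GPT-5's gross profit cover the R&D spent to build it?
# Figures are rough public estimates from Epoch's analysis, not audited.
r_and_d = 5.0e9        # R&D spend in the four months before launch
gross_profit = 2.0e9   # gross profit earned before Gemini 3 Pro surpassed it

shortfall = r_and_d - gross_profit
recoup_ratio = gross_profit / r_and_d

print(f"never recouped: ${shortfall / 1e9:.1f}B")        # $3.0B
print(f"recouped: {recoup_ratio:.0%} of training-era R&D")  # 40%
```

On these numbers the flagship earned back well under half of what it cost to build, before its pricing power expired.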
The recoupment problem
A frontier model can't recover its training cost before the next one makes it obsolete
So what could change this? What if, instead of letting last-generation models quietly lose their pricing power, someone aggressively drove the cost of serving them toward zero, on hardware purpose-built to do exactly that? William Stanley Jevons observed in 1865 that more efficient coal engines didn't reduce coal consumption but actually increased it, because cheaper energy opened up industrial applications that were previously uneconomical.
Now, if the cost of inference dropped far enough, the volume of demand could explode, even more than it already has, and the aggregate revenue from millions of cheap requests could exceed what the model ever earned at premium prices. The commodity tier, not the premium tier, is where the money to fund the next training run might actually come from.
The demand side of Jevons is already visible. OpenRouter, which routes developer traffic across hundreds of models, processed roughly 12 trillion tokens in a single week in late February 2026, about 13 times the volume from a year earlier. Google's CEO told investors on the Q3 2025 earnings call that the company processes over 1.3 quadrillion tokens monthly, 20 times what it handled a year before. That growth came from more users doing more things at prices that kept coming down.
In June 2025, the only model scoring above 40 on Artificial Analysis's intelligence index was o3-pro at $35 per million tokens. By February 2026, MiMo-V2-Flash matched that level for $0.15 per million tokens, a 233x price collapse in eight months. Each intelligence tier follows the same path: it debuts as frontier, commands premium for a few months, then gets matched by cheaper alternatives a few months after that.
The hardware to make intentional cascading possible is already being built. In February, a Toronto startup called Taalas came out of stealth with a chip that has Meta's Llama 3.1 8B physically etched into silicon. The model doesn't have to be loaded into memory; the weights are wired directly into the transistors. Taalas claims 17,000 tokens per second per user, roughly 73 times faster than Nvidia's H200, at 20 times lower cost and a tenth of the power consumption. The tradeoffs are steep, of course. For the moment it's a small, heavily quantized model that sacrifices quality for speed, and it is inflexible by design, since a new model means fabricating a new chip. Go try the demo at chatjimmy.ai and you'll feel both what works and what's missing. Taalas says it wants a "frontier" model running on this hardware by next winter.
At 17,000 tokens per second, the response doesn't stream in the way we've gotten used to; it arrives essentially fully formed. That enables application categories that don't currently exist or don't yet work well: real-time voice agents that respond without perceptible delay, agentic systems where a model makes hundreds of tool calls per second, more powerful AI embedded directly in edge devices where streaming from a data center isn't an option. The Jevons effect works on two axes here. Cheaper unlocks more use cases, and faster unlocks entirely new ones. If you could have Opus 4.6 running 1,000x faster and 10x cheaper, would you really care about a marginal improvement in the next model? I sure wouldn't, and would be fine holding out until an actual stepwise improvement.
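A quick back-of-the-envelope shows why that speed feels qualitatively different. The response length and the comparison speeds are my own illustrative assumptions, not Taalas or vendor figures:

```python
# Time to generate a complete answer at different serving speeds.
response_tokens = 500  # a medium-length chat answer (assumed)

serving_speeds = [
    ("slow cloud API (assumed)", 80),       # tokens/sec, illustrative
    ("fast cloud API (assumed)", 250),      # tokens/sec, illustrative
    ("Taalas (claimed)", 17_000),           # figure claimed by Taalas
]

for name, tokens_per_sec in serving_speeds:
    seconds = response_tokens / tokens_per_sec
    print(f"{name}: {seconds * 1000:.0f} ms")
```

At 17,000 tokens per second the whole 500-token answer takes about 29 milliseconds, below the threshold where a human perceives any delay at all, versus multiple seconds of visible streaming at typical API speeds.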
The quality ceiling on document summarization and translation is not infinite. Coding and scientific research will keep getting meaningfully better for a long time, and that's where frontier R&D still makes sense. For the bulk of ordinary workloads, though, we may be closer to "good enough" than people realize, and the real gains from here might come from making the intelligence we already have faster and cheaper rather than marginally smarter.
If this kind of hardware proliferates, the economics could flip. The premium tier earns healthy margins from enterprises doing hard tasks, but as I've already shown, it can't recoup training costs before the next model makes it obsolete. The commodity tier is where Jevons could change the equation: previous-generation models served on optimized hardware to an ever-expanding user base, generating massive aggregate revenue at thin per-unit margins, could be what actually funds the next training run. There's even a floor below that. In February, OpenAI began running ads on ChatGPT's free and low-cost plans, and ChatGPT has over 800 million weekly active users, the vast majority of whom don't pay. The cascade would run from premium to cheap to ad-supported to free, with each level generating revenue through a different mechanism.
I should be precise about what Jevons needs to do here, because volume growth alone is not enough. Massive token growth with brutal price compression can still leave mediocre gross profit. The question is not whether volume explodes. It's whether revenue explodes enough, despite falling price per unit, to outpace the cost of the next training run. And that cost is not growing linearly. If GPT-5 cost roughly $5 billion in R&D and the next generation costs $10 billion or more, global token demand has to scale fast enough to close an exponentially widening gap. Even if it does, there's a deeper question: cheap abundant inference can create a huge market, but whether that market finances AGI depends on who owns the surplus. If the margin flows to chip fabricators, cloud providers, or open-weight hosting companies rather than the labs doing the frontier research, Jevons creates a thriving commodity economy that funds everyone except the people trying to reach AGI. If margins keep improving toward the 77% Anthropic projects by 2028, and if the labs rather than third parties capture enough of the commodity revenue, the math could work. That is a real possibility, not a certainty.
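The condition in that paragraph can be written down directly. A toy sketch, under the simplifying assumption that revenue is just price times volume and that the next training run costs roughly double the last (all numbers illustrative):

```python
# Toy model of the Jevons condition: does commodity revenue outgrow
# the cost of the next training run? All figures are illustrative.
def revenue_after_cut(revenue_now, price_cut, volume_multiplier):
    """New revenue when the per-token price falls by `price_cut`
    (a fraction, e.g. 0.8 for an 80% cut) and volume grows."""
    return revenue_now * (1 - price_cut) * volume_multiplier

revenue_now = 6.0e9          # current-generation revenue (illustrative)
next_training_cost = 10.0e9  # assumed roughly 2x the prior R&D bill

for volume_multiplier in (2, 5, 10, 20):
    r = revenue_after_cut(revenue_now, 0.8, volume_multiplier)
    covers = r >= next_training_cost
    print(f"{volume_multiplier:>2}x volume -> ${r / 1e9:.1f}B, "
          f"covers next run: {covers}")
```

On these toy numbers, an 80% price cut needs roughly a 10x volume explosion before the commodity tier even matches the next training bill; a mere doubling or quintupling of volume leaves the lab still dependent on outside capital. That is the quantitative bar Jevons has to clear.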
Several things could prevent it. Jevons might just not kick in hard enough in revenue terms: if cutting prices by 80% only doubles gross profit rather than increasing it tenfold, the labs stay dependent on investor patience. The open-weight ecosystem could close the frontier gap too fast: GLM-5 already scores 49.8 on Artificial Analysis's index while the frontier sits at 57, only about 13% behind at a sixth of the price, and if that gap compresses further, the premium window could shrink from six months to two, leaving even less time to recoup training costs. Competition could thin to just two or three surviving labs with locked-in customers, killing the incentive to cascade prices down at all and potentially raising them, though I find this unlikely given how well distillation into open-weight models seems to work. There's also an elasticity ceiling to consider: even at near-zero token prices, human attention and enterprise workflows are finite (as some already say, humans are the bottleneck), and the only way past that constraint is agentic AI, where models talk to models, which is still a nascent field.
I don't think most of these failure modes describe where things stand today. Coding and research tasks still show genuine improvement with each model generation, and those are the tasks enterprise customers currently pay the most for. A law firm might pay premium rates for a frontier model today because the work it displaces costs $500 per hour in associate time, but that calculus could shift if a frozen open-weight alternative at a tenth of the price turns out to handle 90% of the same work. The premium only holds as long as the capability gap justifies it, and that gap is compressing fast. Still, the Jevons data so far is encouraging: Anthropic's revenue more than doubled in three months while prices fell.
If the cascade works, the labs get what they actually want, which is enough internally generated cash to keep racing without asking anyone's permission. If it doesn't, the intelligence still gets cheaper and the consumers still win. The only thing that stops is the race at the very top, and the AGI dreamers lose the independence they need to keep running it on their own terms.