The Road to Declining Marginal Intelligence
The performance of AI models follows a roughly log-scale relationship with training cost, meaning the marginal return on additional investment is constantly declining
The quick gist of this essay is:
AI model performance increases like the logarithm of the FLOPs used to produce it, so there are constantly declining marginal returns.
Performance saturates at the complexity of the environment the model is trained in.
We will inevitably seek out new model architectures and new computing substrates, and these will undermine current predictions and assumptions about data center build-out and energy demand.
That’s all okay, because LLM-based AI will still be super useful and valuable for everyone.
The long march of technological progress and infrastructure investments has, for the last couple hundred years, followed an incredibly consistent and predictable pattern. This cycle has applied equally well to things like steam engines, trains, radio communications, computers, and software. It goes like this:
Main Stages of the Infrastructure Investment Cycle
Installation Period: This initial half of the cycle involves introducing and installing the new technological paradigm, attracting heavy investment but often leading to instability.
Irruption Phase: Early breakthrough innovations emerge, drawing initial financial capital to fund new industries and infrastructure. This sets the stage for a "new economy" but is not yet a full boom.
Frenzy Phase: A speculative boom ensues as financial capital floods in, chasing quick gains. This leads to asset inflation, real estate bubbles, reckless debt, and overinvestment, decoupling finance from productive realities.
Turning Point: The bust phase, often a financial crash or recession (lasting 2-5 years), where the bubble bursts. It exposes excesses from the Frenzy, causing economic downturns. Government intervention is crucial here to regulate finance, recouple it with production, and prevent prolonged stagnation. Without effective policies, this can lead to inequality or a "gilded age" instead of broad prosperity.
Deployment Period: The latter half (20-30 years) spreads the paradigm's benefits economy-wide, enabling sustainable growth if the turning point is navigated well.
Synergy Phase: A productive boom follows, with innovation across sectors, positive-sum games between finance and production, and widespread prosperity (e.g., "golden ages" like the post-WWII boom).
Maturity Phase: Growth stabilizes as the paradigm becomes the "new normal," but saturation sets in, paving the way for the next revolution's irruption.
I wrote about this pattern previously in the context of the SaaS buildout, from the early ’90s bubble through the scaling of infrastructure like AWS and Google Cloud, and how successive newcomers eat away at the margins, driving once high-performing markets to commodity-like returns. I argued that hardware-first industries enabled by intelligent compute are inevitably the next generation of high-performance investments (using the railroad buildout as an example).
Where do you think we are in the context of AGI? I’ll give you a hint with a chart showing projected investments by major AI companies into new data center construction.
The coming build-out in data centers is absolutely colossal, and even more so when looking at YoY buildout rates.
How does this compare with past bubbles in infrastructure build-out? We are slightly above the dot-com bubble, but nowhere near the railroad-building mania of a century earlier.
To read more on the financial-bubble component of the data center buildout, I highly recommend Noah Smith’s article here (from which I also nabbed the previous three plots):
But where do we land in terms of “AGI” - the long-promised, ever-improving ultimate general machine, the thing that is supposed to threaten humanity with apocalypse or perhaps usher in utopia?
The plowing of ever-more resources into data center buildout and model training is, unfortunately, already running into declining marginal returns. A couple of days ago OpenAI dropped GPT-5. So far in the AI race the consistent trend has been: every new major model beats out the previous major models across a wide range of tests. Not the case for GPT-5 - it falls short of Claude Opus in software development, and short of Grok 4 Heavy on Humanity’s Last Exam.
The general trend is that the performance of a model scales like the logarithm of its training data size, or by proxy the amount of computing FLOPs required to train it. This means that if you double the amount of data going into a model, you get much less than a doubling of that model’s performance across a wide variety of benchmarks.
Here’s a relatively low-effort plot I had some LLMs make:
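For reference, a minimal sketch of the kind of plot meant here; the coefficients a and b are illustrative assumptions, not fitted values:

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative log-scaling: performance ~ a + b * log10(training FLOPs).
# The constants below are made up for the sketch, not fitted to real benchmarks.
a, b = -10.0, 4.0
flops = np.logspace(20, 26, 200)          # training compute, FLOPs
performance = a + b * np.log10(flops)     # benchmark score, arbitrary units

plt.semilogx(flops, performance)
plt.xlabel("Training compute (FLOPs)")
plt.ylabel("Benchmark score (arbitrary units)")
plt.title("Log-like scaling: each 10x in compute buys a constant bump")
plt.show()
```

In this toy setup, doubling the compute adds only b·log10(2) ≈ 1.2 points, which is the “much less than a doubling” point above.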
This hasn’t been such a big issue so far in the AI race, because AI models kept getting better by fairly substantial amounts. For a couple of years now it has largely felt like each new ‘major model’ release out-competed all the others by some margin, with the lead trading hands each time. The race has been neck-and-neck for quite some time, which is interesting to note, since these companies are in flat-out competition and hoovering up colossal amounts of capital.
With GPT-5, that era is over. The company that started the transformer / LLM arms race launches its latest model, and it doesn’t beat the competitors’ models released some months before. We are now in the ‘choppy waters’ where different models are better at different things, and there’s no clear winner. In other words, we are already hitting declining marginal returns on model performance, but the models themselves are still extremely useful and can refactor a lot of the economy assuming only marginal improvements in performance from here.
There are a few next steps that seem obvious in the exploration of the AI tech tree, and all will be pursued, but not all by the same people.
Feature Expansion and Penetration
This seems to be the approach being pursued by OpenAI currently; likely they’ve realized for some time that the scaling laws won’t hold up indefinitely and have gradually pivoted from an AI research company into something like a more conventional software company looking to iterate on features. They struck a crazy product-market fit with ChatGPT, and it seems like that’s still the primary product: a chat interface with an LLM. The two broad categories are:
Vertically integrate: LLMs percolate into all our common productivity software like email, Google Docs, and Microsoft Office. It’s surprising Microsoft hasn’t done this already, considering their massive market share in Office tools and large stake in OpenAI. There might be technical barriers to implementing this, like sensitive company documents being sent to an outside server, that require things like federated learning, or the ability to keep things private but still learn from them across a large number of enterprise customers.
Horizontally proliferate: There are a lot of fundamentally new products LLMs enable as the ultimate general store-and-retrieve database, and there’s already a lot being produced. The simplest is something like an AI assistant, but this isn’t surprising to anyone.
Both of these are basically the “synergy” and “maturity” phases of the infrastructure development lifecycle, as a new technology proliferates into every area of the economy it can wiggle its way into. That’s a good thing, but it’s not AGI.
Strip-Mining Architecture Space
The scaling laws and performance of LLMs are enough to incentivize an essentially unlimited stream of future investment capital into every conceivable architecture for novel AIs, things like symbolic reasoning, in the hope of something that can beat the performance limits of transformer-based architectures. Are there any hard performance limits in general for LLMs, or is it just log-scale returns forever? In a previous post I make the argument that it is impossible to become ‘smarter’ than your environment, and for LLMs the environment is human knowledge. You can have better recall than any other person, or greater breadth of knowledge, or write faster, and all of these at once, but the high-water mark in any endeavor will be set by humans.
So, for an AI to become smarter than humans it has to directly observe its environment and translate experiences into new narratives to populate its world model. This isn’t a radical premise; it’s the basic principle of empiricism: knowledge comes from direct observations of the world, and reading what other people have written can either teach you things they observed, or things latent in their observations (like what Kepler did with Tycho Brahe’s observational data to construct his laws of planetary motion), but not things they haven’t observed.
New Computing Substrates and Methods
Current projections for data center build-out and energy demand seem quite gargantuan. We will need to build many gigawatt-class data centers all over the world to satisfy the energy requirements of deploying AI at scale. Is this inevitable and correct? I don’t think so, and the reason is simple: silicon CMOS transistors are what we have used so far for computing, but silicon is not the ideal or best possible substrate for AI. There are a few angles here.
The Landauer Limit: There is a thermodynamic lower bound on the energy it takes to erase a single bit of information, about kT ln 2 ≈ 3×10⁻²¹ J at room temperature, or roughly 3×10²⁰ bit erasures per joule when operating at the thermodynamic limit. If a computation discards information, storing only the output and not the inputs or intermediate steps, it is irreversible, and so there is an energy floor to performing it. Not all computation is necessarily irreversible, however. You can approach this in two ways: through an algorithmic method that preserves intermediate results such that the computation can be run backwards, or by using a physical computing substrate that doesn’t lose energy with each computation, like an adiabatic CMOS circuit that borrows energy and returns it for a computation, or optical gates that operate without losses. In both cases information is preserved and so the Landauer limit can be avoided.
Current computing methods are far above the Landauer limit per bitwise operation, by many orders of magnitude, so there is a lot of room for improvement.
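A quick back-of-the-envelope sketch of the gap (the ~1 fJ per switching event used below is an assumed ballpark for modern CMOS, not a measured spec):

```python
import math

k_B = 1.380649e-23       # Boltzmann constant, J/K
T = 300.0                # room temperature, K

# Landauer limit: minimum energy to erase one bit of information.
landauer_j_per_bit = k_B * T * math.log(2)     # ~2.9e-21 J
bits_per_joule = 1.0 / landauer_j_per_bit      # ~3.5e20 bit erasures per joule

# Assumed ballpark for a modern CMOS switching event (illustrative only).
cmos_j_per_op = 1e-15

print(f"Landauer limit: {landauer_j_per_bit:.2e} J per bit erasure")
print(f"Bit erasures per joule at the limit: {bits_per_joule:.2e}")
print(f"Gap to ~1 fJ/op CMOS: {cmos_j_per_op / landauer_j_per_bit:.0e}x")
```

At roughly 1 fJ per switching event, that is around five to six orders of magnitude above the floor at the device level, and more once system-level overheads are included.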
Probabilistic Computing: What it means to ‘learn’ something is best described by Karl Friston’s active inference, which is basically just minimizing the amount of surprise from future events by modeling the world more accurately. There is a natural connection between ‘surprise’ or information and energy, given by Shannon’s definition of entropy, which has the same form as the Gibbs entropy of statistical mechanics. Living systems like to minimize Gibbs free energy; in the context of active inference, this is also called free energy minimization. I’ll write more in the future about the statistical mechanics of machine learning, but for now let’s keep it brief.
First off, the two formulas follow the same pattern. We subtract an entropy term from an energy term and minimize the resulting quantity.
Here’s Gibbs free energy:
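In its standard form, with H the enthalpy (the energy term), T the temperature, and S the entropy:

$$G = H - TS$$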
Here’s Karl Friston’s free energy:
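In its standard variational form, where q(s) is the approximate posterior over hidden states, p(o, s) is the generative model over observations o and hidden states s, the first term is an expected energy, and H[q(s)] is the entropy of q:

$$F = \mathbb{E}_{q(s)}\left[-\ln p(o, s)\right] - H\left[q(s)\right]$$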
This is pretty much the ‘bare metal’ of learning new things and using Bayesian inference to predict future events, which is a pretty good definition of what it means to be intelligent. Are there AI training methods that follow this? Yes, in fact, Energy-Based Models (EBMs) are mathematically equivalent to free energy minimization and so in some sense ‘optimal’ for learning phenomena as accurately as possible; however, they are computationally extremely expensive.
EBMs are expensive primarily because of the random sampling process used to construct their ‘guts’ and training update loop, which involves approximating a partition function. Modeling random distributions with silicon gate logic is more expensive than it needs to be when you consider that physical systems are inherently governed by probabilistic processes, for example the position or momentum of a particle trapped in a potential well. So, you can ‘generate’ random numbers from a probability distribution by simply making measurements of a physical system that obeys the desired probability distribution. The net-net is that you can implement EBMs while reducing the fundamental energy cost of the computations in generative AI applications by many orders of magnitude, perhaps up to a factor of a million. This is the premise of Extropic, which presented at my first Deep Tech Week in SF in 2024.
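As a minimal sketch of why the sampling is the bottleneck, here is a toy one-dimensional energy-based model sampled by random-walk Metropolis; the energy function, parameters, and step size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def energy(x, theta):
    # Toy energy function: a quartic well whose shape is set by theta.
    return theta[0] * x**4 + theta[1] * x**2

def metropolis_samples(theta, n_steps=10_000, step=0.5):
    """Draw samples from p(x) proportional to exp(-energy(x)) by random-walk
    Metropolis. This loop is the expensive part: every gradient estimate for
    training an EBM needs fresh (approximate) samples from the model."""
    x = 0.0
    samples = []
    for _ in range(n_steps):
        proposal = x + step * rng.normal()
        # Accept with probability min(1, exp(E_old - E_new)).
        if rng.random() < np.exp(energy(x, theta) - energy(proposal, theta)):
            x = proposal
        samples.append(x)
    return np.array(samples)

samples = metropolis_samples(theta=(0.1, -1.0))
print("mean:", samples.mean(), "var:", samples.var())
```

In a physical implementation of the kind described above, the hardware itself would relax into this distribution, replacing the whole inner loop with a measurement.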
Where does this all leave us?
LLMs are here to stay, there’s no doubt about it. Will they lead to an intelligence explosion? No. Will they get us to what ‘feels’ like AGI? Probably not, no. Are they the doom of mankind? Also no.
Instead they will percolate into a great many areas of society and economic functions, doing for professional and intellectual services what mass production did for consumer goods. A great deflation in the cost of health care, education, counseling, therapy, and so on - these tools are excellent at recalling information from a vast corpus of knowledge, and at generating images, video content, speech, and AI avatars. There is a massive proliferation of tools that greatly enhance the creative leverage of the individual, leading to a kind of content explosion.
There is also the productive leverage given to the intellectual class. Lawyers, scientists, researchers, engineers, all benefit from better research assistants, coding assistants, design tools, and so on.
This all seems obvious, so let’s make three non-obvious predictions:
The projected need for energy production and data center build-out will not hold up into the medium-to-far future as much as people think it will, as model efficiency, specialized hardware, and new computing substrates undercut the current linear extrapolations.
The race for novel model architectures is now on: things that can outperform LLMs either by learning more efficiently and therefore being cheaper to train, by being cheaper to run at inference, or by being able to learn directly from the environment, as humans do, instead of learning from text.
True ‘AGI’ will require an artificial intelligence that learns via observations of the world directly, such that it is not bounded by the complexity or knowledge implicit in a corpus of written human knowledge. This is ‘empirical’ AI.

Agreed. After reading into Grok 4 and GPT-5, they have milked pretty much everything possible out of the current transformer architecture, and scaled up compute to the max. Something new must emerge.
This connects beautifully to the ecological view that intelligence isn't something we "have" but something we *do* in dynamic coupling with our environment. Your point about LLMs being bounded by human knowledge corpus highlights how current AI is essentially doing sophisticated pattern matching on linguistic traces rather than engaging in the embodied sense-making that characterizes biological intelligence.
The constraint isn't in our heads (or silicon) but in the environment's capacity to support intelligent behavior. Even human intelligence emerges through ongoing organism-environment transactions—we don't possess abstract reasoning, we become capable of it through participation in culturally mediated activities.
This suggests AGI isn't about building systems with sufficient internal intelligence, but creating systems capable of the right kinds of environmental coupling. Current approaches hit walls because they can't participate in the dynamic dance of environmental attunement that makes intelligence possible in the first place.
Your "empirical AI" that learns from direct observation points toward this—intelligence as an active process of structural coupling with the world, not computational power applied to static datasets.