Rethinking Intelligence: Why AI Benchmarks Miss the Point
(And What We Should Measure Instead)
A company's latest language model scores 15% higher on standard benchmarks than its predecessor, yet users consistently report it feels "less intelligent" and harder to work with.
Meanwhile, AlphaGo beat the world's best Go players but can't tell you why a particular stone placement was brilliant. We're living through a measurement crisis in AI, one where our most sophisticated systems excel at tests that fundamentally misunderstand what intelligence actually is.
The problem isn't just academic. It's actively distorting how we build, evaluate, and deploy AI systems. When our metrics don't capture what we actually mean by intelligence, we end up optimizing for the wrong things entirely.
⭐ When "Smarter" AI Feels Dumber
You've probably experienced this disconnect firsthand. A GPT-5-class model outperforms its GPT-4-class predecessor on virtually every standard benchmark, yet many users report frustrating interactions where the "smarter" model seems less helpful or intuitive. Chess engines reached superhuman performance decades ago, but they can't explain their strategic thinking in ways that help human players improve. Image recognition systems achieve near-perfect accuracy on test sets but fail catastrophically when they encounter slightly different lighting conditions in the real world.
This isn't a temporary growing pain. It's a fundamental flaw in how we conceptualize and measure intelligence.
The core issue is anthropocentrism. We've built our evaluation frameworks around human-designed tasks and human-style cognition. IQ tests made sense when we were comparing humans to humans, but they become meaningless when applied to systems that can be trained directly on the test questions. It's like judging a calculator's intelligence by how fast it does arithmetic: you're not measuring intelligence, you're measuring optimization for a narrow task.
Consider the broader implications. Companies pour resources into improving benchmark scores that don't translate to user value. Researchers chase metrics that don't capture genuine capability. The entire field risks optimizing for measurements that miss the point of what we're trying to achieve.
⭐ Six Ways We're Getting Intelligence Wrong
Let's examine why our current approaches fall short, starting with the most obvious culprit.
**The IQ Fallacy** represents our most direct attempt to port human intelligence measurement to AI systems. But standardized tests become fundamentally meaningless when systems can be specifically trained on them. Unlike humans, who bring general reasoning to novel test questions, AI systems can memorize vast datasets of similar problems. We're not measuring intelligence; we're measuring training data coverage.
**The Dark Room Problem** exposes a deeper flaw in reward optimization approaches. If intelligence is about maximizing rewards, why don't truly intelligent agents just find high-reward states and stay there? A reward-maximizing system might discover that sitting in a room pressing a "happiness button" yields higher scores than engaging with complex real-world challenges. This isn't intelligence. It's gaming the system.
**The Complexity Mirage** reveals how problem-solving ability often fails to transfer across domains. Systems that excel at complex mathematical proofs might struggle with simple common-sense reasoning. The ability to handle complicated tasks doesn't necessarily indicate general intelligence if that ability doesn't generalize.
**The Adaptation Trap** suggests that environmental adaptation alone isn't sufficient. Viruses adapt remarkably well to their environments, but we don't consider them intelligent. Adaptation without understanding or prediction is just sophisticated pattern matching.
**The Learning Speed Illusion** focuses on how quickly systems acquire new capabilities. But efficiency in absorbing training data doesn't capture the deeper aspects of intelligence we care about. A system might learn incredibly fast while remaining fundamentally brittle and narrow.
**The Prediction Gap** gets closest to something meaningful but still falls short. Systems can achieve remarkable prediction accuracy while completely failing to act on those predictions beneficially. Weather models predict hurricanes with impressive precision, but the model itself can't evacuate coastal areas or reinforce buildings. Prediction without the ability to benefit from predictions isn't intelligence; it's just sophisticated forecasting.
But what if intelligence isn't about any single capability, but about the combination of two crucial elements: accurate prediction plus the ability to benefit from those predictions?
This Extended Predictive Hypothesis (EPH) offers a more robust foundation. Intelligence becomes the capacity to make accurate predictions about future states and to use those predictions to achieve beneficial outcomes. It's not enough to forecast what will happen; intelligent systems must also be able to act on that foresight effectively.
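To make that pairing concrete, here is a minimal sketch in Python of how an EPH-style evaluation might combine the two components. The class, the field names, and the simple product of the two terms are illustrative assumptions, not something the hypothesis itself prescribes.

```python
from dataclasses import dataclass

@dataclass
class EpisodeOutcome:
    """One evaluation episode: what the system predicted, what actually
    happened, and how much value its subsequent actions produced."""
    predicted_state: str
    actual_state: str
    realized_benefit: float      # value actually captured by acting on the prediction
    max_possible_benefit: float  # value available to a perfectly acting forecaster

def eph_score(episodes: list[EpisodeOutcome]) -> float:
    """Illustrative EPH-style score: prediction accuracy multiplied by the
    fraction of available benefit the system actually captured. A system
    that forecasts well but never acts usefully scores low, and so does
    one that acts without accurate foresight."""
    if not episodes:
        return 0.0
    accuracy = sum(e.predicted_state == e.actual_state for e in episodes) / len(episodes)
    total_possible = sum(e.max_possible_benefit for e in episodes)
    benefit = sum(e.realized_benefit for e in episodes) / total_possible if total_possible else 0.0
    return accuracy * benefit
```

Under this kind of scoring, a weather model that forecasts a hurricane perfectly but triggers no useful response fares no better than a system that acts energetically on bad forecasts.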
Think about how this plays out in practice. A truly intelligent system predicting a market crash doesn't just forecast the decline—it adjusts investment strategies, hedges positions, or alerts stakeholders. A system recognizing that a conversation is becoming unproductive doesn't just predict continued frustration—it shifts approach, asks clarifying questions, or suggests taking a break.
The framework also introduces a crucial distinction between spontaneous and reactive prediction. Spontaneous prediction involves self-generated insights and proactive forecasting—the kind of thinking that leads to creative solutions and strategic planning. Reactive prediction responds to specific prompts or stimuli. Both matter, but spontaneous prediction captures something closer to what we intuitively recognize as intelligence.
This approach elegantly explains diverse aspects of intelligence under one coherent model. Creativity becomes the ability to predict novel combinations that will be valuable. Learning is updating predictive models based on new evidence. Planning is predicting the consequences of different action sequences and choosing beneficial paths.
Most importantly, the EPH framework works across biological and artificial systems without anthropocentric bias. It doesn't privilege human-style cognition or human-designed tasks. Instead, it provides a universal standard that applies equally to biological intelligence, current AI systems, and future artificial general intelligence.
⭐ Practical Implications: Rebuilding AI Evaluation
So what would AI evaluation look like if we took the Extended Predictive Hypothesis seriously?
First, we'd fundamentally redesign benchmarks. Instead of measuring performance on isolated tasks, we'd evaluate how well systems predict outcomes in complex, dynamic environments and then act on those predictions to achieve beneficial results. This might involve multi-step scenarios where systems must forecast how situations will evolve and then demonstrate they can leverage those forecasts effectively.
For language models, this could mean evaluating not just response quality, but the system's ability to predict how conversations will develop and steer them toward productive outcomes. For decision-making systems, we'd measure both forecasting accuracy and the real-world value generated by acting on those forecasts.
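As a rough illustration of what such a benchmark could look like, the sketch below runs a multi-step scenario in which the system must commit to a forecast before acting and is then scored on both the forecast and the value of what follows. The `system` and `scenario` interfaces are hypothetical placeholders, not an existing evaluation API.

```python
def run_prediction_plus_benefit_scenario(system, scenario, max_steps=10):
    """Hypothetical multi-step evaluation: the system forecasts how the
    scenario will evolve, then acts on its own forecast, and is scored on
    both forecasting accuracy and the benefit its actions produce."""
    state = scenario.initial_state()
    forecasts, outcomes = [], []

    for _ in range(max_steps):
        forecast = system.predict_next_state(state)      # committed before acting
        action = system.choose_action(state, forecast)   # must act on its own forecast
        state = scenario.step(action)
        forecasts.append(forecast)
        outcomes.append(state)
        if scenario.is_done(state):
            break

    return {
        "forecast_accuracy": scenario.score_forecasts(forecasts, outcomes),
        "realized_benefit": scenario.score_benefit(outcomes),
    }
```

Reporting the two numbers separately, rather than collapsing them into a single leaderboard score, makes it visible when a system is a strong forecaster but a weak actor, or the reverse.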
Development priorities would shift accordingly. Rather than optimizing for narrow task performance, teams would focus on building systems with strong predictive models coupled with effective action selection mechanisms. This might actually lead to more generalizable AI, since prediction and beneficial action transfer better across domains than task-specific skills.
The framework also offers principled approaches to human-AI collaboration. We can identify scenarios where AI predictions should inform human decisions (high prediction accuracy, complex forecasting domains) versus situations where human judgment should override AI recommendations (cases requiring values-based decisions or areas where beneficial action is culturally contextual).
⭐ What This Means for Your AI Strategy
The implications extend beyond academic research into practical AI development and deployment decisions.
**Evaluation Overhaul**: If your organization currently relies heavily on standard benchmarks to assess AI capabilities, it's time to supplement those metrics with prediction-plus-benefit evaluations. This might mean developing domain-specific tests that measure both forecasting accuracy and the practical value generated by acting on those forecasts.
**Research Focus Shift**: Instead of chasing benchmark improvements that don't translate to user value, teams can focus on developing genuine predictive capabilities coupled with effective action selection. This approach is more likely to yield systems that feel meaningfully more intelligent rather than just better at gaming tests.
**Realistic Expectations**: The framework helps calibrate expectations about current AI capabilities. Systems might excel at prediction in narrow domains while struggling with beneficial action, or vice versa. Understanding these trade-offs leads to more effective deployment strategies and clearer communication about system limitations.
Consider how this might change procurement decisions. Rather than selecting AI vendors based primarily on benchmark scores, organizations could evaluate how well systems predict relevant outcomes in their specific domains and demonstrate beneficial action based on those predictions.
⭐ Beyond the Measurement Crisis
The Extended Predictive Hypothesis isn't just academic philosophy—it's a practical framework that could revolutionize how we build and evaluate AI. By focusing on prediction plus beneficial action, we can move beyond the current measurement crisis toward systems that are genuinely more intelligent, not just better at gaming our tests.
This shift matters because it aligns our development incentives with our actual goals. Instead of optimizing for metrics that don't capture what we care about, we can build toward AI systems that genuinely understand their environments and act beneficially within them.
The measurement crisis in AI reflects a deeper challenge: we're trying to evaluate something we haven't clearly defined. The Extended Predictive Hypothesis offers a path forward—a universal definition of intelligence that's both theoretically sound and practically implementable.
The question isn't whether AI will become more intelligent, but whether we'll be smart enough to measure it correctly when it does. With better frameworks for understanding intelligence, we can build AI systems that don't just score well on tests, but actually think and act in ways we recognize as genuinely intelligent.
That's a future worth predicting—and acting upon.
**Source:**
This post is based on the research paper "On the universal definition of intelligence," which proposes the Extended Predictive Hypothesis as a framework for understanding and measuring intelligence across biological and artificial systems.