Sutton & Rafiee: AI Needs a Body and Experience, Not Just Data

Banafsheh Rafiee from the University of Alberta and Richard S. Sutton — Turing Award laureate and founding figure of reinforcement learning — published “Toward Enactive Artificial Intelligence” on arXiv on May 22, 2026. They argue that the AI mainstream — from symbolic systems through supervised learning to large language models — remains trapped in representationalism: intelligence as passive processing of internal world-maps. As an alternative, they propose enactive cognition, a framework from cognitive science where intelligence emerges from active, embodied engagement between an agent and its environment.

Key takeaways

Paper arXiv:2605.24238v1 published May 22, 2026 by Rafiee and Sutton (University of Alberta, AMII).
Four pillars of enactive AI: experience, action-perception inseparability, autonomy, embodiment.
Reinforcement learning (RL) is structurally the closest of all AI approaches to the enactive model, but does not achieve equivalence — reward functions remain external to the agent.
Large language models (LLMs) and supervised models remain “disembodied” — they learn from human-generated data, not from their own actions in the world.
The paper provides no new algorithms — it is a theoretical roadmap identifying gaps and directions for future research.

Representationalism and its limits

The dominant approach to perception in AI is representationalism: a system receives sensory data, encodes it as internal representations of the world, and then generates actions based on those representations. This model performs well in constrained environments, but according to Rafiee and Sutton it is fundamentally flawed for dealing with an open, dynamic world.

The core problem is that no finite internal map can faithfully capture reality. As roboticist Rodney Brooks put it: “the world is its own best model” — the most accurate and up-to-date information always exists outside the agent, not in its internal surrogates. This argument, long present in behavior-based robotics, acquires new urgency as language models scale.

Enactivism, an approach within cognitive science formulated by Varela, Thompson and Rosch in 1991, proposes a reversal: cognition does not precede or represent the world; it emerges from the agent's active engagement with its environment. Perception is an activity, not a reception of signals.

Four pillars of enactive AI

Experience: the agent as participant

In the enactive framework, experience is not equivalent to data. Data are traces of interaction — records of something someone else lived through. Genuine experience is a continuous, mutual exchange with the environment in which the agent itself shapes what it perceives through its own actions. Supervised learning supplies models exclusively with others' experiences. RL goes further — the agent collects its own data through interaction with the environment. Yet even RL does not reach the full enactive sense of experience: the skillful, normative, and genuinely embodied dimensions remain largely absent.
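
The contrast between data and self-generated experience can be made concrete with a toy sketch. Everything here is a hypothetical illustration, not something taken from the paper: the `Corridor` world, the `collect_experience` helper, and the two fixed policies are assumed names. The point is only that in the interactive setting the agent's actions determine which observations it ever receives.

```python
class Corridor:
    """Hypothetical 1-D world: positions 0..size-1, reward only at the right end."""

    def __init__(self, size=5):
        self.size = size
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action is -1 (left) or +1 (right); walls clamp the position.
        self.pos = max(0, min(self.size - 1, self.pos + action))
        reward = 1.0 if self.pos == self.size - 1 else 0.0
        return self.pos, reward


def collect_experience(env, policy, steps):
    """Run the agent-environment loop and return the stream the agent generated."""
    stream = []
    obs = env.reset()
    for _ in range(steps):
        action = policy(obs)
        next_obs, reward = env.step(action)
        stream.append((obs, action, reward, next_obs))
        obs = next_obs
    return stream


# Supervised setting: a fixed set of (input, label) pairs recorded by someone else.
fixed_dataset = [(0, "left end"), (4, "right end")]

# Interactive setting: what the agent observes depends on what it does.
def always_right(obs):
    return +1

def always_left(obs):
    return -1

seen_right = {t[3] for t in collect_experience(Corridor(), always_right, 6)}
seen_left = {t[3] for t in collect_experience(Corridor(), always_left, 6)}
print(seen_right)  # positions reached by always moving right
print(seen_left)   # positions reached by always moving left
```

Two agents placed in the same world end up with different experience streams because they act differently, whereas `fixed_dataset` is the same no matter who consumes it. As the paragraph above notes, this loop is still only the RL notion of experience, not the full enactive one.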

Action-perception inseparability

Enactivism rejects the perception → processing → action sequence as a false simplification. Perception itself is a form of action. A person does not passively receive a visual scene — they move their eyes, head, and body, actively modulating what reaches the visual system. This capacity is described as mastery of sensorimotor contingencies: the agent knows how its movements change incoming sensory data.
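
A sensorimotor contingency can be sketched, very loosely, as a learned mapping from the agent's own movements to the resulting change in its sensor reading. The example below is a simplified assumption for illustration only: `GazeWorld`, `learn_contingencies`, and `choose_action` are hypothetical names, and a real system would face noisy, high-dimensional signals.

```python
from collections import defaultdict


class GazeWorld:
    """Hypothetical world: a fixed light source and a sensor the agent can shift."""

    def __init__(self, source=7, gaze=3):
        self.source = source
        self.gaze = gaze

    def sense(self):
        # The reading is the offset of the source from the current gaze position.
        return self.source - self.gaze

    def move(self, action):
        # action shifts the gaze by -1 or +1.
        self.gaze += action


def learn_contingencies(world, actions):
    """Estimate, from the agent's own movements, how each action changes the reading."""
    changes = defaultdict(list)
    for action in actions:
        before = world.sense()
        world.move(action)
        after = world.sense()
        changes[action].append(after - before)
    return {a: sum(d) / len(d) for a, d in changes.items()}


def choose_action(reading, contingencies):
    """Pick the movement predicted to bring the source closest to the center of gaze."""
    return min(contingencies, key=lambda a: abs(reading + contingencies[a]))


contingencies = learn_contingencies(GazeWorld(), [+1, +1, -1, +1, -1, -1])
print(contingencies)                    # estimated change in reading per action
print(choose_action(4, contingencies))  # movement chosen when the source is 4 units off
```

The agent never builds a map of where the light source is; it only learns how its movements change what it senses, and uses that knowledge to act.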

The implications for AI are significant. Video generation models can predict subsequent frames with high accuracy, but — as the paper's analysis shows — they can only continue statistical regularities. When the situation demands intervention (a fault, an unknown object), the system has nothing to fall back on. An enactive agent does not merely anticipate the next state — it can actively change it.

Autonomy and normativity

Autonomy in the enactive sense derives from autopoiesis: the agent is a self-maintaining system that actively sustains its own organization. This gives rise to normativity — evaluations of success and failure grounded in the agent itself, not imposed from outside. In supervised learning, success is defined by a human-provided label. In RL, the success criterion is the reward function — still external to the agent. The authors note that RL approaches normativity through temporally extended evaluation of behavior, but full enactive autonomy — where criteria emerge from the agent's own organization — remains unrealized.
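
For readers less familiar with RL, the "temporally extended evaluation" mentioned above is conventionally formalized as the discounted return and its expected value under a policy. The notation below is the standard textbook form and is given here as an assumption; the paper itself may use different symbols.

```latex
G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}, \qquad 0 \le \gamma < 1,
\qquad
v_\pi(s) = \mathbb{E}_\pi\left[\, G_t \mid S_t = s \,\right]
```

Here $R_{t+k+1}$ is the reward received $k+1$ steps after time $t$, $\gamma$ is the discount factor, and $v_\pi(s)$ is the value of state $s$ under policy $\pi$. Behavior is judged over a whole future trajectory rather than a single step, but every $R$ term is still supplied by a designer-specified reward function, which is the externality the authors point to.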

Embodiment

The body is not an execution platform — it is a necessary condition for perception to make sense. Joint geometry, sensor placement, range of motion — all of these shape the sensorimotor contingencies available to the agent. Gibson's concept of affordances captures this: a chair is “sittable” not as an objective property of the furniture, but as a relation to the capacities of a specific body. Yet many Embodied AI systems in robotics still treat the body as an external engineering constraint rather than a constitutive principle shaping how the agent experiences and categorizes the world.
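
The relational character of an affordance can be illustrated with a deliberately crude sketch: the same surface is or is not "sittable" depending on the body that encounters it. The `Body` and `Surface` types and the ratio thresholds are illustrative assumptions, not measured values and not part of the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Body:
    leg_length_cm: float


@dataclass(frozen=True)
class Surface:
    height_cm: float


def is_sittable(surface: Surface, body: Body,
                min_ratio: float = 0.3, max_ratio: float = 0.7) -> bool:
    """Affordance as a relation: the answer depends on the surface AND the body.

    min_ratio and max_ratio are illustrative assumptions, not empirical constants.
    """
    ratio = surface.height_cm / body.leg_length_cm
    return min_ratio <= ratio <= max_ratio


chair = Surface(height_cm=45.0)
adult = Body(leg_length_cm=90.0)
toddler = Body(leg_length_cm=40.0)

print(is_sittable(chair, adult))    # True: ratio 0.5
print(is_sittable(chair, toddler))  # False: ratio 1.125
```

There is no `chair.sittable` attribute to look up: the property exists only in the pairing of a surface with a particular body.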

RL as natural ally — and its three gaps

Rafiee and Sutton credit reinforcement learning as the approach most structurally aligned with the enactive model: the agent actively explores the environment, collects its own data, and evaluation involves temporally extended analysis of the consequences of actions, not just the current state. This is a clear difference from supervised learning, where a model never modifies the data it learns from.

But structural resonance is not theoretical equivalence. The authors identify three unresolved gaps:

First, the reward function comes from outside. An RL agent does not have self-referential success criteria — its normativity is imposed by the designer. Intrinsic motivation and goal discovery methods move toward a solution, but full enactive autonomy remains an aspiration.
Second, perception and action are still often treated as separate stages. Even in deep RL, the standard pipeline reads an observation, passes it through a network, and selects an action, a sequence that retains the trace of representationalism. Frameworks such as active inference and predictive coding model the perception-action coupling loop more closely.
Third, embodiment in robotic RL is most often treated as an external boundary condition: training happens in a simulator, and the learned behavior must then be transferred to the real world (the sim-to-real gap). The body is not treated as a constitutive principle of the learning process itself.
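
To make the first gap more tangible, here is a minimal sketch of one well-known family of intrinsic motivation methods, a count-based novelty bonus. It is offered as an assumed example of "internally generated" reward, not as the paper's proposal; `NoveltyBonus`, `total_reward`, and the value of `beta` are hypothetical.

```python
import math
from collections import Counter


class NoveltyBonus:
    """Count-based intrinsic reward: rarely visited states earn a larger bonus."""

    def __init__(self, beta=1.0):
        self.beta = beta          # bonus scale (illustrative value)
        self.visits = Counter()   # visit counts kept inside the agent

    def __call__(self, state):
        self.visits[state] += 1
        return self.beta / math.sqrt(self.visits[state])


def total_reward(extrinsic, state, bonus):
    # The extrinsic term is designer-defined; the bonus depends on the agent's own history.
    return extrinsic + bonus(state)


bonus = NoveltyBonus(beta=1.0)
first = total_reward(0.0, "room_a", bonus)   # first visit
second = total_reward(0.0, "room_a", bonus)  # repeat visit, smaller bonus
fresh = total_reward(0.0, "room_b", bonus)   # new state, full bonus again
print(first, second, fresh)
```

The bonus is computed from the agent's own interaction history, which moves part of the evaluation inside the agent. The formula itself is still chosen by the designer, though, which is consistent with the point above that such methods move toward enactive autonomy without reaching it.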

Why it matters

The Rafiee and Sutton paper is a rare example of work that asks a structural question: not “how do we improve a benchmark?” but “are we building the right kind of system at all?” The argument is provocative: language model scale is growing exponentially, yet these models do not interact actively with an environment, cannot evaluate their own actions, and have no body. They become increasingly proficient at predicting tokens, but that proficiency comes with no mechanism for checking predictions against the real-world consequences of action.

For the RL community, the paper is a call to expand the theoretical foundations. Reward shaping and RLHF are operational techniques, but they do not answer where the agent's normativity is supposed to come from in the first place. The cognitive science tradition of enactivism supplies a ready conceptual vocabulary — and a proposal for how to operationalize it in AI systems.

Practical implications are visible in mobile robotics, vehicle autonomy, and long-horizon learning in open environments — wherever static datasets and external reward become the bottleneck for adaptive capability. The paper is a theoretical manifesto, but it points to concrete directions: benchmarks measuring skillful engagement rather than pattern classification, RL architectures with internally generated reward, and physical models incorporating agent body morphology.

What's next?

The authors identify open questions for future work: how to measure the degree of action-perception inseparability in a concrete system, and how to define self-maintenance for a software agent (no battery, no hardware — what are the equivalents?). These are tasks for multiple research groups.
NeurIPS 2026 and ICML 2026 will be the first major venues where the RL and robotics communities can respond to this theoretical proposal.
Growing interest in long-horizon continual learning and the Big World Hypothesis in RL may accelerate adoption of enactive concepts — both are cited in the paper as most compatible with the enactive model.