The Physical World is the Hidden Compute
Fairly speculative and written mostly as a way of organizing intuitions which have been bouncing around my head for a while, although I think the core point is probably right in the boring sense that every scientist already knows it implicitly.
I have recently been thinking a lot about the future of AI and scientific discovery through the lens of substrate, by which I mean the literal physical medium in which cognition, experiment, measurement, simulation, and discovery take place. This sounds somewhat grandiose, and perhaps it is, but I think it points at a real confusion in how we talk about AI scientists, because a lot of the discussion implicitly treats scientific discovery as if it were primarily an exercise in manipulating propositions inside a model’s context window, while the actual history of science looks much more like a sequence of increasingly clever ways to couple human cognition to the physical world.
The obvious way to say this is that science is empirical, although that formulation feels too flat to carry the point, because the deeper issue is that the world itself is doing an enormous amount of computation for us all the time. A cell folds proteins, a forest integrates climate and soil over decades, a reaction vessel explores a tiny region of chemical dynamics, a telescope converts ancient photons into a measurement, and a wet-lab assay runs a little physically instantiated program whose output we then compress into a number, an image, a curve, or a failed experiment which still contains information if we are careful enough to read it.
In this sense, a human scientist is not simply a brain reasoning about the world from the outside but a sensorimotor system embedded inside it, equipped with instruments, social institutions, inherited notation, tacit craft knowledge, graduate students, lab technicians, field sites, grant committees, and a long chain of other humans who have already paid much of the cost of coupling thought to reality. The strange thing about modern AI is that it has become extremely strong at manipulating the compressed residue of all this coupling, especially text, code, equations, figures, and data tables, while remaining comparatively weak at building the loops which generated that residue in the first place.
This is why the phrase “AI scientist” can be both exciting and slightly misleading, because there is a version of the AI scientist which writes papers, searches literature, proposes hypotheses, runs existing code, produces plots, and iterates inside a computational sandbox, and there is another version which actually earns new information from the world. The first version is already becoming real in domains where evaluation is cheap, formal, or at least automatable, while the second version requires a much deeper integration of memory, instruments, robotics, simulators, active learning, institutional trust, and a mechanism for deciding which contact with the world is worth paying for.
This matters because there is an enormous difference between a system that can compute a Turing-computable function in principle and a system that can discover something important in the finite physical world with finite time, finite money, finite energy, finite sensors, finite actuators, and finite access to clean experimental feedback. The Turing framing is seductive because it gives us a very general language for what kinds of functions can be computed, and it is natural to ask whether future AI systems with persistent memory and good orchestration will eventually move from something closer to finite-state behavior into the regime of general computation. However, scientific discovery usually lives inside a harsher regime where the question is not merely whether the desired mapping is computable in principle, but whether the system can acquire enough bits from the relevant part of the world before the budget runs out.
This is where memory architectures like Titans are interesting, even if one should be careful about over-updating from any single paper or architecture, because they point toward the fact that today’s context-window-based systems have a strangely shallow relationship to time. A person doing science accumulates a research taste over years, remembers the vague smell of failed approaches, carries around half-formed intuitions from old conversations, and gradually develops a compressed internal model of what the field is likely to reward or punish. A language model in a chat session can appear to have this kind of continuity for a few minutes, although without durable memory, self-maintained research state, and an ability to update from consequences, it is closer to a brilliant visitor who keeps waking up in a new hotel room with a dossier slid under the door.
Of course, this can change, and probably will change quickly, because long-context attention, retrieval, neural memory, agent workspaces, tool logs, code repositories, and external knowledge stores are all attempts to give models a better temporal substrate. The question I keep returning to is whether this kind of memory merely makes AI systems better research assistants, or whether it eventually crosses a threshold where the system can maintain a scientific life of its own, in the sense of having long-running projects, accumulated taste, self-correction from experimental feedback, and a stable enough identity over time to notice that a line of work has quietly become promising.
There is a related theoretical temptation to ask whether Transformers are Turing complete, and the literature around this is surprisingly clarifying because the strongest theoretical results often depend on idealized assumptions about precision, scaling families, or context management. This matters in practice, because a real deployed AI system is always a fixed physical object embedded in a particular inference loop, with a context manager, memory policy, tool interface, sampling procedure, latency constraint, and budget. The abstract computational power of the architecture is important, although the effective computational power of the whole system is often hiding in the boring engineering details around the model rather than inside the next-token predictor considered in isolation.
The same point applies even more strongly once we leave text and code and ask about the physical world. If an AI system needs to understand a turbulent atmosphere, an evolving tumor, a microbial consortium, a forest ecosystem, or a catalytic reaction, it can either simulate some approximation of the system, learn a surrogate from existing data, or interact with the real process through instruments and experiments. Each route has a different price, and the price is paid in resolution, time, energy, sample complexity, compute, money, and the opportunity cost of exploring one region of the world while leaving another region untouched.
This is what I call implicit computation: a physical system evolves under its own dynamics whether or not we understand it, and an experiment is often a way of asking the universe to perform a computation on our behalf. A chemist running a reaction is not simulating all the electronic structure and solvent effects in their head; they are arranging a physical situation whose evolution contains the answer in a form they can partially observe. When a biologist grows cells under a perturbation, the cells perform the computation, and the scientist builds a measurement channel around that computation so that a tiny digestible piece of it enters the human knowledge system.
An AI system which lacks high-bandwidth access to this implicit computation has to pay for substitutes, and those substitutes can be brutally expensive. To simulate a physical system at useful fidelity, the model must choose a discretization, track a state over space and time, obey stability constraints, update an enormous number of interacting degrees of freedom, and then somehow compress the result into a usable hypothesis. Even when the simulation is cheaper than reality, it inherits the assumptions of the simulator, and those assumptions can quietly become the boundary of the discovery process.
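To make the price of a substitute concrete, here is a rough sketch of how the work grows with resolution. It assumes an explicit scheme on a uniform grid with a CFL-type stability limit, so the time step must shrink in proportion to the grid spacing. The function name and every parameter are illustrative and not taken from any particular simulator.

```python
def explicit_sim_cost(domain_length, duration, dx, wave_speed, dims=3, courant=0.5):
    """Rough count of cell updates for an explicit grid simulation.

    Assumptions (illustrative only): uniform grid, one update per cell per
    step, and a stability limit dt <= courant * dx / wave_speed.
    """
    cells = (domain_length / dx) ** dims   # number of grid cells
    dt = courant * dx / wave_speed         # largest stable time step
    steps = duration / dt                  # number of time steps needed
    return cells * steps


coarse = explicit_sim_cost(1.0, 1.0, dx=0.01, wave_speed=1.0)
fine = explicit_sim_cost(1.0, 1.0, dx=0.005, wave_speed=1.0)
ratio = fine / coarse  # cost multiplier for halving the grid spacing
print(f"halving dx multiplies the work by about {ratio:.0f}x")
```

Under these assumptions, halving the grid spacing in three dimensions multiplies the work by about 16: eight times as many cells and twice as many steps. The physical system being simulated pays no such tax for its own resolution.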
This is why I think the next era of AI for science will split into two rather different tracks. The first track will be computationally native science, where the world of interest is already formalized enough that ideas can be generated, tested, and selected inside an automated loop. Machine learning research, algorithm discovery, theorem proving, coding, some parts of mathematics, some parts of chip design, and some parts of computational infrastructure sit closer to this regime, which is why systems like The AI Scientist and AlphaEvolve feel like early signals of a real phase transition.
The second track will be physically grounded science, where the bottleneck is not only having a clever hypothesis but also earning the right data from a stubborn world. Biology, materials, climate, agriculture, medicine, robotics, and much of chemistry live closer to this regime, because the real object of study contains causal structure which has only been thinly sampled by human datasets. In these domains, the future AI scientist probably looks more like an orchestration layer wrapped around simulators, robots, instruments, databases, active learning loops, and human experts who understand all the tacit ways an experiment can lie.
This is also why self-driving laboratories seem much more important than they are usually given credit for in mainstream AI discourse. A self-driving lab is not just automation applied to science; it is a proposed sensory and motor system for an AI scientist. It gives the model a way to choose actions, observe consequences, update beliefs, and continue the cycle across many rounds, which means it starts to close the loop between explicit cognition and implicit physical computation. The primitive versions will be narrow, expensive, brittle, and full of hidden human labor, but the direction of travel seems extremely clear once one accepts that discovery requires contact with reality.
The same idea also changes how one should interpret recent AI co-scientist systems. A multi-agent system which generates, criticizes, ranks, and refines hypotheses is genuinely useful, especially in literature-heavy domains where humans are drowning in papers and cannot keep all relevant mechanisms in working memory. Yet hypothesis generation by itself is the airy part of science, because the harder part is often deciding which hypothesis deserves a scarce experiment, what measurement would actually distinguish it from its rivals, and how to update when the result comes back ambiguous, ugly, or suspiciously clean.
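One standard way to frame "which hypothesis deserves a scarce experiment" is expected information gain: how much, on average, an experiment is expected to shrink uncertainty over the rival hypotheses, per unit cost. The sketch below is a toy under stated assumptions: two hypotheses, two hypothetical assays with made-up likelihoods and costs, and a discrete outcome space. None of these names or numbers come from a real system.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def expected_information_gain(prior, likelihood):
    """Expected drop in entropy over hypotheses after seeing the outcome.

    prior[h] = P(hypothesis h); likelihood[h][o] = P(outcome o | hypothesis h).
    """
    n_h = len(prior)
    gain = entropy(prior)
    for o in range(len(likelihood[0])):
        p_o = sum(prior[h] * likelihood[h][o] for h in range(n_h))
        if p_o == 0:
            continue  # outcome cannot occur, contributes nothing
        posterior = [prior[h] * likelihood[h][o] / p_o for h in range(n_h)]
        gain -= p_o * entropy(posterior)
    return gain


# Hypothetical toy setup: two rival hypotheses, two candidate assays.
prior = [0.5, 0.5]
experiments = {
    "assay_a": [[0.9, 0.1], [0.1, 0.9]],  # outcomes differ sharply by hypothesis
    "assay_b": [[0.6, 0.4], [0.5, 0.5]],  # outcomes barely depend on hypothesis
}
costs = {"assay_a": 5.0, "assay_b": 1.0}  # made-up relative costs

# Score each experiment by expected bits gained per unit cost.
scores = {
    name: expected_information_gain(prior, lik) / costs[name]
    for name, lik in experiments.items()
}
best = max(scores, key=scores.get)
print(best, scores)
```

In this toy the expensive assay still wins, because the cheap one hardly distinguishes the hypotheses at all. Real experiments add the complications the paragraph above points to: likelihoods are themselves uncertain, and results can come back ambiguous or suspiciously clean.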
One reason computational domains have moved so quickly is that they have excellent verifiers. Code either runs or fails in relatively legible ways, a theorem proof can be checked, a benchmark can be computed, and an algorithmic improvement can be measured against a test suite. This does not make these domains easy, because the search spaces are still vast and deceptive, but it does mean that the loop from proposal to feedback can be made fast enough for evolutionary search, reinforcement learning, and agentic tree search to bite. The moment the verifier becomes slow, expensive, noisy, socially mediated, or conceptually underspecified, the whole dream of cheap autonomous discovery becomes much more subtle.
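A minimal sketch of such a propose-and-verify loop, assuming a deliberately trivial setting: candidates are integer pairs (a, b) for a linear rule, the "benchmark" is total error on a fixed set of cases, and the search is plain hill climbing with random mutations. All names here are hypothetical, and real systems use far richer proposers and verifiers.

```python
import random


def benchmark(candidate, cases):
    """Cheap deterministic verifier: negative total error on the cases."""
    a, b = candidate
    return -sum(abs(a * x + b - y) for x, y in cases)


def propose(candidate, rng):
    """Mutate each coefficient by -1, 0, or +1."""
    a, b = candidate
    return (a + rng.choice([-1, 0, 1]), b + rng.choice([-1, 0, 1]))


def search(cases, rounds, seed=0):
    """Keep a proposal only when the verifier says it is strictly better."""
    rng = random.Random(seed)
    best = (0, 0)
    best_score = benchmark(best, cases)
    for _ in range(rounds):
        cand = propose(best, rng)
        score = benchmark(cand, cases)  # feedback costs microseconds here
        if score > best_score:
            best, best_score = cand, score
    return best, best_score


# Hidden rule y = 3x + 2, visible to the search only through the benchmark.
cases = [(x, 3 * x + 2) for x in range(-5, 6)]
best, score = search(cases, rounds=2000)
print(best, score)
```

The loop works here only because every call to `benchmark` is instant, exact, and free. Replace it with a week-long assay that returns a noisy number and the same 2000 rounds become decades of lab time, which is the asymmetry the paragraph above describes.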
This suggests that the real unit of progress in AI science may be the feedback loop rather than the model. A frontier model with poor experimental access is a powerful imagination machine trapped behind glass, while a somewhat weaker model connected to a fast, reliable, high-information feedback loop can become scientifically dangerous in the good sense. The history of science is full of this pattern, where new instruments and new measurement channels suddenly make old questions tractable, because the bottleneck was never pure intelligence in the abstract, but the ability to create a situation where reality answers in a compressed and repeatable form.
There is a version of the future where AI systems become extremely good at science by building increasingly elaborate substitutes for human embedding. They will maintain persistent research memories, construct world models across modalities, design experiments using active learning, run simulations to prune the search space, send only the most informative candidates to robotic labs, interpret the results with mechanistic priors, and then iterate until the system has produced knowledge which no human could have generated unaided. In that world, the AI scientist is not a brain in a box, but a distributed cybernetic organism whose body consists of lab automation, sensor networks, cloud compute, databases, robotic platforms, and scientific institutions reshaped around machine-speed iteration.
There is another version of the future where we overestimate text-native reasoning and flood the scientific world with plausible papers, shallow hypotheses, synthetic benchmarks, and agent-written manuscripts whose main achievement is to increase the entropy of the literature. This is already a real danger, because the marginal cost of producing something that looks like research is collapsing faster than the marginal cost of producing something that changes what we know. The scientific community already has a hard enough time distinguishing signal from noise, and AI-generated research slop could make the epistemic commons much worse unless evaluation, replication, provenance, and experimental grounding improve alongside generation.
I am increasingly convinced that the optimistic and pessimistic stories are the same story viewed at different points in the loop. If AI is coupled to strong verifiers, good measurements, honest uncertainty, and expensive contact with the real world when that contact is needed, then it can accelerate discovery in a way that feels almost unfair by historical standards. If AI is coupled mainly to publication incentives, weak peer review, benchmark chasing, and the aesthetic of scientific language, then it will accelerate the appearance of discovery while making actual discovery harder to locate.
The substrate question also reframes robotics and embodiment. It is tempting to look at Boston Dynamics robots, Unitree humanoids, autonomous vehicles, and warehouse robots and say that machines are already learning to navigate the physical world, which is true in the same rough sense that LLMs already navigate the textual world. Yet physical intelligence is constrained by channel capacity, latency, actuator limits, sensor noise, adversarial edge cases, and the simple fact that the world does not pause while the system thinks. A robot cannot see everything, cannot infer every hidden state, cannot sample every possible action, and cannot escape the cost of making mistakes in a world where some mistakes break hardware or harm people.
This does not imply a mystical human advantage, since humans are also bandwidth-limited, metabolically constrained, and mostly terrible at high-dimensional inference. The human advantage, where it exists, comes from being trained by continuous contact with physics from birth, having bodies whose priors were shaped by evolution, and living inside cultures that have already amortized enormous amounts of world-learning across generations. AI systems may eventually surpass this by scaling experience through simulation, imitation, teleoperation, real-world fleets, and robotic data factories, although the route to doing so runs through the substrate rather than around it.
Scientific discovery has the same structure. A human scientist does not know the world by containing a perfect model of it; they know it by living inside a civilization that has built countless partial interfaces to it. The AI scientist will need its own interfaces, and the quality of those interfaces will determine what kinds of knowledge it can create. The question is therefore not simply whether the model is smart enough, but whether the entire coupled system has enough memory, bandwidth, actuation, measurement, verification, and institutional scaffolding to turn intelligence into contact with reality.
This is where the latent manifold picture of human knowledge becomes useful again. Existing knowledge is not uniformly distributed across idea space, because it is clustered around the places where humans had instruments, incentives, mathematical formalisms, funding, social attention, and tractable experimental loops. The empty regions between these clusters are not merely unexplored thoughts, since many of them correspond to regions where the world has been too expensive, too slow, too dangerous, too small, too large, or too complex for us to interrogate. An AI trained on the textual residue of science will naturally become powerful inside the dense regions of this manifold, while its ability to jump into sparse regions depends on whether it can buy new information from the world.
This also explains why AI for science will probably feel uneven and strange. We may see rapid progress in algorithm discovery, protein engineering, materials screening, and some parts of biomedical hypothesis generation, while other domains remain stubborn because the necessary data cannot be cheaply generated or the relevant causal variables are poorly observed. We may discover that some apparently difficult scientific problems were bottlenecked by search, representation, and literature integration, while others were bottlenecked by the sheer cost of asking nature the right question at the right resolution.
The practical conclusion is that anyone trying to build AI scientists should care less about the model in isolation and more about building the full discovery stack. The stack needs persistent memory, because science is cumulative across time. It needs uncertainty-aware world models, because overconfident models will spend experiments badly. It needs active learning, because experiments are expensive. It needs simulators, because reality is slow. It needs reality, because simulators lie. It needs human experts, because tacit knowledge remains a massive compression of previous failures. It needs institutions, because discovery becomes knowledge only when other agents can inspect, contest, and reuse it.
My current guess is that the most important AI science systems will be hybrids for a long time. They will contain language models, search, retrieval, code agents, Bayesian optimization, symbolic tools, simulation engines, robotic labs, human review, and domain-specific data infrastructure. Over time the AI parts will eat more of the loop, first by automating literature synthesis and experimental planning, then by running computational experiments, then by controlling robotic platforms, then by proposing whole research programs whose logic humans can track only at a higher level of abstraction.
At some point, the phrase “AI scientist” will start meaning a persistent discovery process distributed across many substrates. Some of the cognition will happen in neural weights, some in external memory, some in code, some in simulations, some in robotic action, some in the evolving state of a laboratory, and some in the human institutions that decide what counts as evidence. Once seen this way, the question of whether AI can do science becomes less mysterious and more operational, because it asks whether we can build systems that close the loop between hypothesis and physical consequence faster, cheaper, and more honestly than the current human system.
The really interesting possibility is that AI may eventually discover knowledge by exploiting substrate combinations that humans rarely use. It may run millions of cheap simulations to shape a prior, execute thousands of robotic experiments to correct it, mine decades of literature to avoid old traps, evolve symbolic hypotheses to preserve extrapolation, and use human experts mainly as sparse high-value evaluators. This would not replace human science in any simple sense; it would be a new kind of scientific instrument, perhaps closer to the telescope or microscope than to the lone genius, except that the instrument would also choose where to point itself.
I do not think the central question is whether machines can “understand” science in the way humans do, because that question tends to dissolve into arguments about words. The better question is whether machine systems can create reliable new constraints on our beliefs about the world. If they can propose a molecule that works, identify a mechanism that survives experiment, design an algorithm that improves a real system, or discover a lawlike compression that predicts outside the training distribution, then some important kind of scientific understanding has entered the world through them.
The caveat is that all of this depends on the substrate. A model trapped in text can rearrange the fossil record of past human contact with reality, and this is already enormously useful because the fossil record is huge. A model connected to memory can sustain a research trajectory. A model connected to tools can act on formal worlds. A model connected to laboratories can ask reality new questions. A model connected to institutions can turn answers into shared knowledge.
My strongest intuition here is that the future of AI and scientific discovery will be decided by how well we engineer these couplings. Intelligence alone is too abstract a quantity to carry the story, because the world is not a theorem waiting in a context window. The world is the substrate which computes itself, and science is the art of building channels through which that computation becomes knowledge.