Perspectives on the future of AI

13 minute read

Published:

How big are the models going to get, and how much longer will the scaling hypothesis hold? It’s unclear, but given current performance trends, which haven’t shown signs of plateauing (GPT-4o, Claude 3.5 Sonnet, Gemini-1.5-Pro, Llama-3.1-405B, Grok-2), and the power budget of announced data centres (the 5GW OpenAI/Microsoft Stargate campus), there is likely at least an order of magnitude (OOM) left to climb in model size. This Epoch AI research covers these scenarios in depth and estimates that training runs on the order of ~2e29 FLOPs may be possible by 2030, which would be 4 OOMs larger than GPT-4 (2e25 FLOPs). Such training runs will primarily be power-constrained, followed by chips, data, and latency.
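
For concreteness, the OOM gap quoted above is just a log-ratio of the two training-compute figures:

```python
import math

gpt4_flops = 2e25          # GPT-4 training compute cited above
feasible_by_2030 = 2e29    # Epoch AI's ~2030 feasibility estimate cited above

print(math.log10(feasible_by_2030 / gpt4_flops))   # -> 4.0 OOMs
```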

Where is this additional energy going to come from? Since these compute centres need baseload power, will new co-located natural gas or hydroelectric plants be built? Can solar and wind plus storage deliver part of this? Will new sources like enhanced geothermal be part of the equation? These questions remain open, but in the absence of Hail Marys like drastically new forms of energy, the additional power will have to come from a mix of these sources. It is unlikely that nuclear power plants can be commissioned and built in the relevant time frame. An alternative route is geographically distributed training, which would also allow training runs with an even larger power budget. Some frontier models like Gemini may already be trained this way. Recent research by Nous Research, such as DisTrO, shows that it is possible to design optimizers that reduce inter-GPU communication by 4-5 OOMs, enabling training of large models over slow internet links. So the future may involve fully geographically distributed training as a workaround for power constraints.
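
To make the communication-reduction idea concrete, here is a minimal sketch of a generic local-update scheme. This is not DisTrO itself, and the `workers`, `models`, and `optimizers` objects are placeholder stand-ins: each node takes many local steps between synchronizations, so inter-node traffic shrinks roughly in proportion to the number of local steps.

```python
import torch

def low_comm_training_round(workers, models, optimizers, local_steps=64):
    """One communication round of a generic local-update scheme: each worker
    takes `local_steps` independent steps on its own data shard, then the
    parameters are averaged once. Versus synchronizing every step, inter-node
    traffic drops roughly by a factor of `local_steps`."""
    for worker, model, opt in zip(workers, models, optimizers):
        for _ in range(local_steps):
            inputs, targets = worker.next_batch()     # placeholder data interface
            loss = torch.nn.functional.cross_entropy(model(inputs), targets)
            opt.zero_grad()
            loss.backward()
            opt.step()

    # A single parameter-averaging round stands in for the all-reduce.
    with torch.no_grad():
        for params in zip(*(m.parameters() for m in models)):
            mean = torch.stack([p.data for p in params]).mean(dim=0)
            for p in params:
                p.data.copy_(mean)
```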

The power of synthetic data has become obvious in the past year, with the Llama-3.1 technical report making multiple references to it and showing how it can contribute in a big way to post-training, including supervised fine-tuning and task-specific capabilities such as coding, long-context reasoning, tool use, and multilinguality. The Llama-3.1 technical report is a goldmine of useful information that will take months to fully digest and will undoubtedly go down as one of the best open source contributions ever made; I highly recommend checking out at least parts of it. It seems that synthetic data generated from weaker but cheaper models may actually be compute-optimal compared to data from stronger and more expensive models. Grounded synthetic data generation may also have a role to play in generating chain-of-thought trajectories, which can be used for bootstrapping in-context learning (ICL).
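
As a rough sketch of what weak-but-cheap synthetic data generation can look like (the `cheap_model` and `verifier` callables below are hypothetical placeholders, not any particular API):

```python
def generate_synthetic_sft_data(cheap_model, verifier, seed_prompts, n_samples=4):
    """Sample several candidate answers per prompt from an inexpensive model,
    keep only the ones a verifier (unit tests, a reward model, an exact-match
    checker, ...) accepts, and emit (prompt, response) pairs for supervised
    fine-tuning. Both callables are placeholder stand-ins."""
    dataset = []
    for prompt in seed_prompts:
        candidates = [cheap_model(prompt, temperature=1.0) for _ in range(n_samples)]
        kept = [c for c in candidates if verifier(prompt, c)]
        dataset.extend({"prompt": prompt, "response": c} for c in kept)
    return dataset
```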

ICL has been a revelation in the past couple of years, unlocking capabilities at inference time. Even though the word “agent” is abused by the AI influencer class, ICL does enable agentic behaviour by providing power-ups in reasoning (few-shot examples), tool use, and task-specific retrieval (retrieval augmented generation (RAG), Graph RAG). Cohere seems to be going all out on this, using it to build AI for enterprise applications. It doesn’t stop there. With prompt engineering becoming a real discipline (magic prompts do exist, such as the SuperPrompt), eliciting enhanced behaviour from models will increasingly become a skill. For a detailed survey of prompting techniques, see the Prompt Report. Prompting is not just about finding magic words but also about baking reasoning techniques into the context, such as Chain-of-Thought, Self-Consistency, Self-Refine, Self-Verification, Self-Reflection, Tree of Thoughts, Graph of Thoughts, Quiet-STaR, Reflexion, and many others. Some methods even try to automatically create agentic system designs, coming up with novel building blocks and integrating them, such as ADAS, which shows that a meta agent can be used to program new agents in code. Interfacing everything in code seems like a promising way forward for agentic systems.
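
To illustrate the basic mechanics of ICL, here is a minimal, purely illustrative sketch of assembling a few-shot chain-of-thought prompt; the formatting conventions are my own, not any particular framework’s.

```python
def build_icl_prompt(task, examples, question):
    """Assemble a few-shot chain-of-thought prompt: worked examples go into the
    context so the model can pattern-match the reasoning format at inference
    time, with no weight updates."""
    parts = [f"Task: {task}", ""]
    for ex in examples:  # each example: {"q": ..., "reasoning": ..., "a": ...}
        parts += [f"Q: {ex['q']}",
                  f"Reasoning: {ex['reasoning']}",
                  f"A: {ex['a']}", ""]
    parts += [f"Q: {question}", "Reasoning:"]  # the model continues from here
    return "\n".join(parts)
```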

Google’s Gemini seems to be boasting longer and longer context lengths, and their technical report claims near-perfect retrieval up to at least 10M tokens. Such long-context reasoning could be powerful when handling multiple long documents, entire codebases, and multimodal inputs, and could potentially be achieved through modified attention mechanisms such as Infini-attention and Ring Attention. Many-Shot ICL (MICL) has been shown to provide substantial performance gains over few-shot ICL, and long context may be the enabler. Further, long contexts can enable elaborate deliberate planning during reasoning, such as PlanSearch. Chain-of-Thought (COT), in particular, has been shown to unlock the ability to perform serial computation, which would otherwise be outside the region of expressiveness of transformers. What happens when you combine multiple such agents with different base models and leverage their collective strengths for problem solving? You get Mixture-of-Agents (MoA), as demonstrated by Together AI, trading off latency for performance gains.
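
A minimal sketch of the MoA idea, with the proposer and aggregator model calls left as placeholder callables for whatever inference API is in use:

```python
def mixture_of_agents(proposers, aggregator, question):
    """One MoA-style layer: several proposer models answer independently, then
    an aggregator model is shown all drafts and asked to synthesize a final
    answer. Trades latency (many calls) for answer quality."""
    drafts = [propose(question) for propose in proposers]
    context = "\n\n".join(f"Draft {i + 1}:\n{d}" for i, d in enumerate(drafts))
    prompt = (f"Question: {question}\n\n"
              f"Candidate answers from different models:\n{context}\n\n"
              "Synthesize the best possible final answer.")
    return aggregator(prompt)
```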

The great thing about agentic systems and ICL is that they unlock an orthogonal dimension of capability enhancement compared to pre-training and post-training. Another orthogonal dimension is unlocked by inference-time compute. This Epoch AI research estimates that it may be possible to increase inference-time compute by 1-2 OOMs to save ~1 OOM in training compute. Performance improvement with repeated sampling and optimal test-time compute allocation has recently been demonstrated. This is a direction the big labs will increasingly pursue, not just because it could offer a way to do deliberate search during inference and enhance reasoning skills, but also because it offloads compute to the user at inference time and makes the user pay for it instead. This is already vindicated by the release of OpenAI o1-preview and o1-mini, and by the substantial improvements seen on multiple independent reasoning benchmarks. This superior performance comes at the cost of latency and, to some extent, brittleness. It seems much easier to converse with, interrupt, and steer a model like Claude 3.5 Sonnet than it is to steer o1, which gets too entangled in its own COT and can’t maneuver effectively. Whether this is a good approach, both strategically and technically, for general-purpose reasoning problems remains to be seen in the months to come.
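
As a simple illustration of trading inference compute for accuracy, here is a self-consistency-style sketch of repeated sampling with majority voting; the `model` and `extract_answer` callables are placeholders.

```python
from collections import Counter

def best_of_n_majority(model, extract_answer, prompt, n=32):
    """Spend more inference-time compute by drawing n independent samples and
    majority-voting their final answers. Accuracy on reasoning tasks tends to
    rise with n, at the cost of n times the inference compute."""
    answers = []
    for _ in range(n):
        completion = model(prompt, temperature=0.8)   # diverse sampling
        answers.append(extract_answer(completion))    # e.g. parse the final number
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n                          # answer plus its vote share
```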

Transformers by themselves have been shown to reduce multi-step compositional reasoning to linearized subgraph matching (Faith and Fate), performing fuzzy pattern matching. Therefore, without deliberate planning, COT, and search, reasoning may be out of reach. A surprising finding is that reinforcement learning (RL) and search may be effective not just in domains where outputs can be formally verified, such as math and coding, but also for general-purpose reasoning. To some extent, the benefits of adding search to LLMs were out in the open (The Bitter-er Lesson). The initial findings from o1 in this direction are indeed promising but not yet fully convincing. The precedent for superhuman RL comes, of course, from Go, with DeepMind’s AlphaGo using Monte Carlo Tree Search (MCTS). This was later generalized to multiple games in AlphaZero and then extended to learn without any knowledge of the underlying dynamics in MuZero. Whether MCTS is powerful enough to span the general-purpose reasoning space in a granular manner remains to be seen, but it would require a great policy model (which the frontier LLMs already are) and a reward model that can perform reliable process supervision (Process Reward Models (PRMs)). Process supervision has been shown to provide gains in reasoning, as in Let's Verify Step by Step, Iterative Preference Learning, ReST-MCTS*, Critical Planning, Reasoning via Planning, Q*, Automated Process Supervision, and AlphaMath Almost Zero, and most of these methods use the MCTS framework to learn step-level rewards and perform search. To what extent techniques like procedural cloning can substitute for process supervision is unclear.
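
To make the policy-plus-PRM picture concrete, here is a skeletal, heavily simplified sketch of MCTS over reasoning steps. `policy_propose` and `process_reward` are hypothetical callables standing in for a policy LLM and a process reward model; this is not a faithful reproduction of any of the papers above.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state            # partial chain of reasoning steps
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0              # accumulated process-reward signal

def uct(node, c=1.4):
    # Upper confidence bound: exploit high-value nodes, explore rarely visited ones.
    if node.visits == 0:
        return float("inf")
    return (node.value / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

def mcts_reasoning(problem, policy_propose, process_reward, iters=100, width=4):
    """Skeletal MCTS over reasoning steps: the policy proposes candidate next
    steps, the PRM scores partial trajectories, and UCT balances deepening
    promising branches against exploring new ones."""
    root = Node(state=[problem])
    for _ in range(iters):
        node = root
        # Selection: descend to a leaf by UCT.
        while node.children:
            node = max(node.children, key=uct)
        # Expansion: ask the policy for a few candidate next steps.
        for step in policy_propose(node.state)[:width]:
            node.children.append(Node(node.state + [step], parent=node))
        # Evaluation: score one new child with the PRM (no rollout here).
        child = random.choice(node.children)
        reward = process_reward(child.state)
        # Backpropagation: push the step-level reward back up to the root.
        while child is not None:
            child.visits += 1
            child.value += reward
            child = child.parent
    best = max(root.children, key=lambda n: n.visits)
    return best.state
```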

As shown by many of these methods, math and code seem to be the low-hanging fruit in reasoning, given our ability to get explicit rewards for problems in these domains. Progress here has been demonstrated by multiple players recently, including DeepSeek with its DeepSeek-Prover-V1.5 and DeepSeek-Coder-V2. The whale has some incredibly talented people, and its open source contributions in these domains are nothing short of game-changing. There has been other evidence that progress in math is possible, with DeepMind’s AlphaProof and AlphaGeometry achieving silver-medal-level performance at the International Mathematical Olympiad (IMO), alongside impressive showings in competitions such as the AI Mathematical Olympiad (AIMO) by projects like Numina, which fine-tuned DeepSeekMath-Base 7B. This could have been foreseen for a while, because a formal language like Lean can act as a verifier, and projects like LeanCopilot already put LLMs in the loop for theorem proving.
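
To make the “Lean as a verifier” point concrete, here is a toy Lean 4 snippet: the kernel either accepts the proof or rejects it, which is exactly the kind of unambiguous signal a prover-in-the-loop can search and train against.

```lean
-- A toy illustration of Lean acting as a verifier: the kernel either accepts
-- this proof term or rejects it, with no ambiguity or partial credit.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```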

Architectural and algorithmic improvements could hold the key to unlocking the next generation of AI systems. Will transformers rule till the end, or do newer architectures like Mamba and xLSTM have a chance? At this moment it doesn’t look like it, but things may change. There are, however, models that activate only a subset of parameters for each token (Mixture-of-Experts (MoE)), and such models may exhibit distinct behaviour compared to models that don’t employ MoE. Which is better remains uncertain, but MoE does allow you to decouple model size from computational cost, potentially enabling the use of a massive number of experts (Mixture of A Million Experts) and unlocking the next OOM of transformer scaling. Routing can also be done at the token level, with FLOPs dynamically allocated across model depth (Mixture-of-Depths), allowing certain tokens to use more compute than others. This will be crucial going forward for inference-time compute scaling, as not all tokens need long compute paths. Better optimizers such as Scaling Shampoo and AdEMAMix, better sampling methods such as Min P, and better gradient manipulation during optimization such as Grokfast may provide meaningful gains for the next generation of models.
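
A minimal sketch of token-level top-k MoE routing follows; the hyperparameters are illustrative assumptions, and real systems add capacity limits and load-balancing losses on top of this.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Token-level top-k MoE routing sketch: a linear router scores experts per
    token, only the top-k experts run for each token, and their outputs are
    combined with renormalized router weights. Active compute per token is
    roughly k/num_experts of an equally parameterized dense layer."""
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                       # x: (num_tokens, d_model)
        scores = self.router(x)                 # (num_tokens, num_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):              # each token's k expert slots
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e        # tokens whose slot points at expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```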

Interpretability has seen significant progress in the past few years, driven mainly by work at Anthropic. Basically everything Anthropic publishes on interpretability is worth checking out, particularly Transformer Circuits, Induction Heads, Superposition, and Monosemanticity. The evidence that interpretable features occur as a superposition of multiple neurons, and that sparse autoencoders (SAEs) can recover learned features that are more monosemantic in nature, has been revealing. SAEs are an unsupervised method for learning a sparse decomposition of a neural network’s latent representations. With the scaling of SAEs to bigger models like Claude 3 Sonnet and the release of open source tools like Gemma Scope and SAELens, this direction looks very promising for steering LLMs (steering vectors, feature steering, clamping). It’s possible that the distinctly superior performance of Claude 3.5 Sonnet comes from Anthropic doing some steering, although this is not certain. There are other interpretability studies that probe network behaviour with synthetic data, which are worth looking at as well.
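
A minimal SAE sketch in the spirit of this line of work (the sizes and L1 coefficient below are illustrative assumptions, not values from any of the papers):

```python
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: activations from some model site are encoded into a
    much wider ReLU feature space and decoded back; an L1 penalty on the
    feature activations pushes toward sparse, more monosemantic features."""
    def __init__(self, d_model=768, d_features=16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts):                    # acts: (batch, d_model)
        features = F.relu(self.encoder(acts))   # sparse feature activations
        recon = self.decoder(features)
        return recon, features

def sae_loss(recon, acts, features, l1_coeff=5e-4):
    # Reconstruction error plus an L1 sparsity penalty on the features.
    return F.mse_loss(recon, acts) + l1_coeff * features.abs().mean()
```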

Smaller models are getting better. 8B models are reaching performance levels previously achievable only by models an OOM bigger, mainly through the power of techniques like pruning and distillation. The Minitron family of models by Nvidia and the accompanying paper are a great source detailing these techniques. We will be able to cram more and more performance into smaller models, and it remains to be seen how small we can go while achieving reliable performance on the edge. Quantizing models from FP16 to FP8 and FP4 also provides considerable speedups that enable inference in compute- and memory-constrained applications. Moreover, binary (BitNet) and ternary quantization (BitNet b1.58, MatMul-free) may be viable as well, potentially leading to matrix-multiplication-free accelerators.

Any conversation about AI is incomplete without talking about hardware. “During a gold rush, sell shovels,” they say, and boy would that have been perfect advice when AI was taking off, because Nvidia’s growth over the past couple of decades has been astronomical. Will Nvidia stay in the lead with its CUDA ecosystem and continuously improving GPUs (Blackwell), or will hardware providers like Groq and Cerebras take over the inference space with their custom hardware? They or someone else may end up with inference dominance. The full dynamics between Google’s TPU ecosystem, Meta, Apple, OpenAI, and Microsoft are hard to predict, but it would be surprising if Nvidia doesn’t maintain its lead in the hardware space.
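
Circling back to the quantization point above, here is a rough sketch of the ternary (BitNet b1.58-style) absmean idea; it is illustrative only, not the exact recipe from the paper.

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Ternary weight quantization sketch: scale weights by their mean absolute
    value, then round-and-clip to {-1, 0, +1}. Matrix multiplies against such
    weights reduce to additions and subtractions, which is what makes
    matmul-free accelerators plausible."""
    scale = w.abs().mean().clamp(min=eps)        # per-tensor absmean scale
    w_ternary = (w / scale).round().clamp(-1, 1)
    return w_ternary, scale                      # dequantize as w_ternary * scale
```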

I haven’t touched upon so many relevant things here, such as developments in image, video, voice, time-series, multi-modality, 3D, robotics, biology, climate, and many others. The applications in all these domains are equally exciting, and the pace of progress shows no signs of slowing. Robotics, for instance, is poised to take over the world in the next decade; every home may have a personal robot that is good at a variety of tasks. Biology promises to model molecules and their interactions in a fine-grained way to tackle various diseases, allowing us to tame biological complexity. Smaller models may be pervasive once they are plugged into inference-time reasoning chains. These could all be positive things, but they may not be. Technology is always a double-edged sword. Let’s hope we get this right.