AgentBench’s dbbench-std task evaluates an agent’s ability to answer SQL questions in a multi-hop tool-use setting. The controller exposes interaction endpoints, so every task instance can be completed with a small, repeatable tool repertoire.
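To make that loop concrete, here is a minimal sketch of how an agent-side driver might talk to such a controller. The endpoint paths, payload fields, and the two-tool repertoire (run a SQL statement, then commit a final answer) are illustrative assumptions, not the actual AgentBench API.

```python
import requests

CONTROLLER = "http://localhost:5000"  # hypothetical controller address

def run_sql(session_id: str, statement: str) -> str:
    """Execute one SQL statement against the task's database (assumed endpoint)."""
    resp = requests.post(f"{CONTROLLER}/interact",
                         json={"session": session_id, "action": "sql", "body": statement})
    resp.raise_for_status()
    return resp.json()["observation"]

def submit_answer(session_id: str, answer: str) -> bool:
    """Commit the final answer and let the controller grade it (assumed endpoint)."""
    resp = requests.post(f"{CONTROLLER}/submit",
                         json={"session": session_id, "answer": answer})
    resp.raise_for_status()
    return resp.json()["correct"]

# A single task instance then reduces to: inspect the schema, query, answer.
sid = "demo-task-0"
print(run_sql(sid, "SHOW TABLES;"))
print(run_sql(sid, "SELECT COUNT(*) FROM orders WHERE status = 'shipped';"))
print(submit_answer(sid, "42"))
```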
I’d been following the INTELLECT-2 paper and other PrimeIntellect work, but what really piqued my curiosity was PrimeIntellect-ai/prime-rl. The promise was bold: fully asynchronous, file-based RL that scales across decentralized devices. I wanted to understand exactly how it worked (scheduler quirks, memory tricks, the rollout loop), so I asked o3 to be my copilot. What followed was a week-long conversation in which we spelunked through every Python file until a coherent picture emerged. (While I was at it, I started a fork and sprinkled in a few small QoL commits of my own → kevinbdsouza/prime-rl.)
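As a generic illustration of what a file-based hand-off between an inference worker and a trainer can look like, here is a small sketch of my own. It is not prime-rl’s actual code; the directory layout, file format, and polling scheme are assumptions chosen only to show the pattern of writing rollout shards to disk and letting the trainer pick them up asynchronously.

```python
import json
import time
from pathlib import Path

ROLLOUT_DIR = Path("rollouts")        # illustrative path, not prime-rl's layout
ROLLOUT_DIR.mkdir(exist_ok=True)

def write_rollout(step: int, samples: list[dict]) -> None:
    """Inference worker: dump one batch of rollouts as a JSON shard."""
    tmp = ROLLOUT_DIR / f"step_{step:06d}.json.tmp"
    tmp.write_text(json.dumps(samples))
    tmp.rename(ROLLOUT_DIR / f"step_{step:06d}.json")   # rename = atomic publish

def wait_for_rollout(step: int, poll_s: float = 1.0) -> list[dict]:
    """Trainer: block until the shard for `step` appears, then load it."""
    path = ROLLOUT_DIR / f"step_{step:06d}.json"
    while not path.exists():
        time.sleep(poll_s)
    return json.loads(path.read_text())
```

The appeal of this style is that the two sides only share a filesystem, so they can run on different machines, crash and restart independently, and drift a few steps apart without any RPC plumbing.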
For the better part of a decade, Adam has been the default optimizer for training deep learning models. But the ground is shifting. As we scale to massive models, a new family of geometry-aware optimizers, most notably Muon [1, 2], has emerged as a promising contender. The results from the modded-nanogpt [3] speedrun showed that by respecting the unique geometry of neural network layers, we can achieve faster and more efficient training. This is backed by simultaneous and follow-up works such as Scion [4], Modular Duality [5], Gluon [6], steepest descent under particular norms and manifolds [7, 8], and the spectral condition for feature learning [9].
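To give a flavour of what “geometry-aware” means here, below is a minimal sketch of the core Muon idea: keep a momentum buffer per 2D weight matrix and orthogonalize it with a few Newton-Schulz iterations before applying the update. The quintic coefficients follow the public Muon reference implementation; the bare update loop, learning rate, and beta are simplified stand-ins, and the real optimizer also rescales the update by matrix shape and falls back to AdamW for non-2D parameters.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map G to the nearest (semi-)orthogonal matrix.

    Quintic Newton-Schulz iteration; coefficients taken from the public
    Muon reference implementation.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)            # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                          # work with the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(param, grad, momentum, lr=0.02, beta=0.95):
    """One simplified Muon update for a single 2D weight matrix (a sketch)."""
    momentum.mul_(beta).add_(grad)                   # standard momentum buffer
    update = newton_schulz_orthogonalize(momentum)   # orthogonalized momentum
    param.add_(update, alpha=-lr)
    return param, momentum
```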
I’ve been exploring how far reinforcement-learning paradigms can push large language models when the reward is verifiable reasoning correctness. That led me to (i) extending Reasoning Gym with a procedurally-generated, multi-hop puzzle set that forces deduction ↔ induction ↔ abduction ↔ transduction hand-offs, (ii) wiring it into the TRL training loop, and (iii) seeing what the first accuracy curves look like. Below is the why, the how, and the initial results.
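The glue between the puzzle generator and RL training is a verifiable reward function. A minimal sketch is below; the answer-tag format, the tiny inline dataset, and the scoring rule are my own stand-ins, and the GRPOTrainer usage reflects recent TRL versions, so treat the exact signature as an assumption rather than a spec.

```python
import re
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

def extract_answer(text: str) -> str:
    """Pull the final answer out of a completion; the tag format is an assumption."""
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else text.strip().splitlines()[-1]

def correctness_reward(prompts, completions, answer, **kwargs):
    """1.0 if the extracted answer matches the puzzle's ground truth, else 0.0."""
    return [float(extract_answer(c) == a) for c, a in zip(completions, answer)]

# Stand-in for the procedurally generated puzzles: "prompt" and "answer" columns.
train_dataset = Dataset.from_list([
    {"prompt": "Solve: 2 + 3 * 4 = ?  Put the result inside <answer></answer> tags.",
     "answer": "14"},
])

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",               # illustrative base model
    reward_funcs=correctness_reward,
    args=GRPOConfig(output_dir="rg-grpo", num_generations=8),
    train_dataset=train_dataset,
)
trainer.train()
```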
Climate change has emerged as one of the most pressing challenges of the 21st century, posing unprecedented risks to economies, ecosystems, and human well-being. India, with its diverse geography and significant dependence on climate-sensitive sectors like agriculture, faces heightened vulnerability. Rising temperatures, extreme heat events, changing precipitation patterns, droughts, floods, and coastal hazards are increasingly evident, threatening rural livelihoods and urban infrastructure alike. Although India has been proactive in formulating climate policies—such as the National Action Plan on Climate Change (NAPCC) and State Action Plans on Climate Change (SAPCCs)—and has undertaken mitigation initiatives, the intensifying impacts demand a sharper focus on adaptation. This article reviews India’s key climate risks, summarizes existing adaptation strategies, and discusses the urgent need for scaling up investments in resilience-building measures. It concludes by proposing a strategic path forward to mainstream and finance climate adaptation across sectors.
Protein structure and sequence modeling has seen a fresh wave of resurgence in the last couple of years, owing to interesting developments in machine learning (ML) and deep learning (DL) techniques. These techniques come in a variety of flavours, including equivariant neural network modules that respect the structural properties of 3D macromolecules, deeper networks that benefit from the increased availability of experimental structures, powerful node-to-node relationship learners like transformers, and masked language modeling on the protein sequence space to learn evolutionary information. While structure prediction methods like AlphaFold (AF) [1] and RosettaFold (RF) [2] have become ubiquitous in computational structural biology, challenges remain on multiple fronts where ML will play an important role.
The rapid advancement of AI technologies will transform industries and labor markets at an unprecedented pace. Despite these anticipated changes, the relationship between AI and labor remains surprisingly understudied. Recent works, notably by Korinek & Suh (2024), Acemoglu (2025), and Epoch AI's GATE model (2025), illustrate the complexity of AI’s economic impacts, but also highlight significant gaps in understanding AI’s real-world implications for labor.
I’ve recently been thinking a lot about what the intrinsic space of all human knowledge looks like: what kind of topology and structure the neural latent manifold has, how sparse it is, and how to think about all the space in between pockets of density. For instance, it is not clear to me what the dimensionality of the original space is, or whether using tokens as the basic entities of this space even makes sense. Maybe tokens are too granular to be useful for this kind of thought experiment and we need to think at a higher level, say sentences and concepts. The reason such a thought experiment appeals to me is that I think it lies at the heart of a question I’m interested in: whether AI can discover truly new knowledge.
Open-source Large Language Models (LLMs) have made advanced conversational AI accessible to a broader audience [1]. Despite their impressive capabilities, these models often grapple with a challenge: factual hallucinations. Factual hallucinations occur when an AI model generates content that is unfaithful to the source material or cannot be verified against reliable data [2]. This issue is particularly concerning in critical and information-dense fields such as health, law, finance, and education, where misinformation can have catastrophic consequences [3][4]. This essay explores the integration of inference-time decoding strategies with model steering as an approach to enhance the factual accuracy of LLMs. By combining these two methods, we can potentially build adaptive systems capable of detecting and mitigating factual hallucinations.
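As a concrete example of the “steering” half, here is a minimal sketch of activation steering via a PyTorch forward hook: a fixed direction is added to one layer’s residual stream during generation. The model choice, the layer index, the steering strength, and the zero placeholder vector are all illustrative assumptions; in practice the direction would be derived by contrasting activations on factual versus hallucinated completions, and an inference-time decoding strategy (e.g., contrastive or retrieval-grounded decoding) would be layered on top.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# Hypothetical steering direction (e.g., a "truthfulness" direction found by
# contrasting activations on factual vs. hallucinated text). Zeros as placeholder.
steering_vector = torch.zeros(model.config.hidden_size)

def steer(module, inputs, output):
    # Add the steering direction to this layer's residual stream.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 4.0 * steering_vector   # strength is a tunable assumption
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

layer = model.transformer.h[6]                # which layer to steer is an assumption
handle = layer.register_forward_hook(steer)

prompt = "The capital of Australia is"
ids = tok(prompt, return_tensors="pt")
out = model.generate(**ids, max_new_tokens=10, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```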
How big are the models going to get, and how much longer will the scaling hypothesis hold? It’s unclear, but given current performance trends, which haven’t shown signs of plateauing (GPT-4o, Claude 3.5 Sonnet, Gemini-1.5-Pro, Llama-3.1-405B, Grok-2), and the power budget of announced data centres (the 5 GW OpenAI/Microsoft Stargate campus), there is likely at least an order of magnitude (OOM) left to climb in model size. This Epoch AI research covers these scenarios in depth and estimates that training runs on the order of ~2e29 FLOPs will be possible by 2030, which would be 4 OOMs larger than GPT-4 (2e25 FLOPs). These training runs will primarily be power constrained, followed by chips, data, and latency.
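For concreteness, the ratio behind the “4 OOMs” figure:

$$\frac{2 \times 10^{29}\ \text{FLOP}}{2 \times 10^{25}\ \text{FLOP}} = 10^{4} \;\Longrightarrow\; 4 \text{ orders of magnitude}.$$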
This was written when I was younger, and both the content and the form of my opinions on this topic have changed since then. Leaving this here for the sake of continuity.
Studying the effects of technical change on critical mineral demand and supply in the context of the low-carbon energy transition is an important and open area of research. Despite the crucial role these minerals play in low-carbon technologies, long-term demand projections remain uncertain due to intricate interactions between the drivers of technical change. In this writeup, I lay out what a framework for studying the effects of technical change on critical mineral demand would look like, how it could be developed, and what its potential use cases are.