AgentBench’s dbbench-std task evaluates an agent’s ability to answer SQL questions in a multi-hop tool-use setting. The controller exposes interaction endpoints, so every task instance can be completed with a small, repeatable tool repertoire.
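To make that loop concrete, here is a minimal sketch of how an agent-side driver might talk to such a controller. The endpoint paths, payload fields, and the two-tool repertoire (run a SQL statement, then commit a final answer) are illustrative assumptions, not the actual AgentBench API.

```python
import requests

CONTROLLER = "http://localhost:5000"  # hypothetical controller address

def run_sql(session_id: str, statement: str) -> str:
    """Execute one SQL statement against the task's database (assumed endpoint)."""
    resp = requests.post(f"{CONTROLLER}/interact",
                         json={"session": session_id, "action": "sql", "body": statement})
    resp.raise_for_status()
    return resp.json()["observation"]

def submit_answer(session_id: str, answer: str) -> bool:
    """Commit the final answer and let the controller grade it (assumed endpoint)."""
    resp = requests.post(f"{CONTROLLER}/submit",
                         json={"session": session_id, "answer": answer})
    resp.raise_for_status()
    return resp.json()["correct"]

# A single task instance then reduces to: inspect the schema, query, answer.
sid = "demo-task-0"
print(run_sql(sid, "SHOW TABLES;"))
print(run_sql(sid, "SELECT COUNT(*) FROM orders WHERE status = 'shipped';"))
print(submit_answer(sid, "42"))
```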
I’d been following the INTELLECT-2 paper and other PrimeIntellect work, but what really piqued my curiosity was PrimeIntellect-ai/prime-rl. The promise was bold: fully asynchronous, file-based RL that scales across decentralized devices. I wanted to understand exactly how it worked (scheduler quirks, memory tricks, the rollout loop), so I asked o3 to be my copilot. What followed was a week-long conversation in which we spelunked through every Python file until a coherent picture emerged. (While I was at it, I started a fork and sprinkled in a few small QoL commits of my own → kevinbdsouza/prime-rl.)
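As a generic illustration of what a file-based hand-off between an inference worker and a trainer can look like, here is a small sketch of my own. It is not prime-rl’s actual code; the directory layout, file format, and polling scheme are assumptions chosen only to show the pattern of writing rollout shards to disk and letting the trainer pick them up asynchronously.

```python
import json
import time
from pathlib import Path

ROLLOUT_DIR = Path("rollouts")        # illustrative path, not prime-rl's layout
ROLLOUT_DIR.mkdir(exist_ok=True)

def write_rollout(step: int, samples: list[dict]) -> None:
    """Inference worker: dump one batch of rollouts as a JSON shard."""
    tmp = ROLLOUT_DIR / f"step_{step:06d}.json.tmp"
    tmp.write_text(json.dumps(samples))
    tmp.rename(ROLLOUT_DIR / f"step_{step:06d}.json")   # rename = atomic publish

def wait_for_rollout(step: int, poll_s: float = 1.0) -> list[dict]:
    """Trainer: block until the shard for `step` appears, then load it."""
    path = ROLLOUT_DIR / f"step_{step:06d}.json"
    while not path.exists():
        time.sleep(poll_s)
    return json.loads(path.read_text())
```

The appeal of this style is that the two sides only share a filesystem, so they can run on different machines, crash and restart independently, and drift a few steps apart without any RPC plumbing.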
For the better part of a decade, Adam has been the default optimizer for training deep learning models. But the ground is shifting. As we scale to massive models, a new family of geometry-aware optimizers, most notably Muon [1, 2], has emerged as a promising contender. The results from the modded-nanogpt [3] speedrun showed that by respecting the unique geometry of neural network layers, we can achieve faster and more efficient training. This is backed by simultaneous and follow-up works such as Scion [4], Modular Duality [5], Gluon [6], steepest descent under particular norms and manifolds [7, 8], and the spectral condition for feature learning [9].
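To give a flavour of what “geometry-aware” means here, below is a minimal sketch of the core Muon idea: keep a momentum buffer per 2D weight matrix and orthogonalize it with a few Newton-Schulz iterations before applying the update. The quintic coefficients follow the public Muon reference implementation; the bare update loop, learning rate, and beta are simplified stand-ins, and the real optimizer also rescales the update by matrix shape and falls back to AdamW for non-2D parameters.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map G to the nearest (semi-)orthogonal matrix.

    Quintic Newton-Schulz iteration; coefficients taken from the public
    Muon reference implementation.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)            # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                          # work with the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(param, grad, momentum, lr=0.02, beta=0.95):
    """One simplified Muon update for a single 2D weight matrix (a sketch)."""
    momentum.mul_(beta).add_(grad)                   # standard momentum buffer
    update = newton_schulz_orthogonalize(momentum)   # orthogonalized momentum
    param.add_(update, alpha=-lr)
    return param, momentum
```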
I’ve been exploring how far reinforcement-learning paradigms can push large language models when the reward is verifiable reasoning correctness. That led me to (i) extending Reasoning Gym with a procedurally-generated, multi-hop puzzle set that forces deduction ↔ induction ↔ abduction ↔ transduction hand-offs, (ii) wiring it into the TRL training loop, and (iii) seeing what the first accuracy curves look like. Below is the why, the how, and the initial results.
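The glue between the puzzle generator and RL training is a verifiable reward function. A minimal sketch is below; the answer-tag format, the tiny inline dataset, and the scoring rule are my own stand-ins, and the GRPOTrainer usage reflects recent TRL versions, so treat the exact signature as an assumption rather than a spec.

```python
import re
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

def extract_answer(text: str) -> str:
    """Pull the final answer out of a completion; the tag format is an assumption."""
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else text.strip().splitlines()[-1]

def correctness_reward(prompts, completions, answer, **kwargs):
    """1.0 if the extracted answer matches the puzzle's ground truth, else 0.0."""
    return [float(extract_answer(c) == a) for c, a in zip(completions, answer)]

# Stand-in for the procedurally generated puzzles: "prompt" and "answer" columns.
train_dataset = Dataset.from_list([
    {"prompt": "Solve: 2 + 3 * 4 = ?  Put the result inside <answer></answer> tags.",
     "answer": "14"},
])

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",               # illustrative base model
    reward_funcs=correctness_reward,
    args=GRPOConfig(output_dir="rg-grpo", num_generations=8),
    train_dataset=train_dataset,
)
trainer.train()
```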
Climate change has emerged as one of the most pressing challenges of the 21st century, posing unprecedented risks to economies, ecosystems, and human well-being. India, with its diverse geography and significant dependence on climate-sensitive sectors like agriculture, faces heightened vulnerability. Rising temperatures, extreme heat events, changing precipitation patterns, droughts, floods, and coastal hazards are increasingly evident, threatening rural livelihoods and urban infrastructure alike. Although India has been proactive in formulating climate policies—such as the National Action Plan on Climate Change (NAPCC) and State Action Plans on Climate Change (SAPCCs)—and has undertaken mitigation initiatives, the intensifying impacts demand a sharper focus on adaptation. This article reviews India’s key climate risks, summarizes existing adaptation strategies, and discusses the urgent need for scaling up investments in resilience-building measures. It concludes by proposing a strategic path forward to mainstream and finance climate adaptation across sectors.
Protein structure and sequence modeling has seen a fresh wave of resurgence in the last couple of years, owing to interesting developments in machine learning (ML) and deep learning (DL) techniques. These techniques come in a variety of flavours, including equivariant neural network modules that respect the structural properties of 3D macromolecules, deeper networks that benefit from the increased availability of experimental structures, powerful node-to-node relationship learners like transformers, and masked language modeling on the protein sequence space to learn evolutionary information. While structure prediction methods like AlphaFold (AF) [1] and RosettaFold (RF) [2] have become ubiquitous in computational structural biology, challenges remain on multiple fronts where ML will play an important role.
The rapid advancement of AI technologies will transform industries and labor markets at an unprecedented pace. Despite these anticipated changes, the relationship between AI and labor remains surprisingly understudied. Recent works, notably by Korinek & Suh (2024), Acemoglu (2025), and Epoch AI's GATE model (2025), illustrate the complexity of AI’s economic impacts, but also highlight significant gaps in understanding AI’s real-world implications for labor.
I’ve recently been thinking a lot about what the intrinsic space of all human knowledge looks like: what kind of topology and structure the neural latent manifold has, how sparse it is, and how to think about all the space in between pockets of density. For instance, it is not clear to me what the dimensionality of the original space is, or whether using tokens as the basic entities of this space even makes sense. Maybe tokens are too granular to be useful for this kind of thought experiment and we need to think at a higher level, say sentences and concepts. The reason such a thought experiment appeals to me is that I think it lies at the heart of a question I’m interested in: whether AI can discover truly new knowledge.
Open-source Large Language Models (LLMs) have made advanced conversational AI accessible to a broader audience [1]. Despite their impressive capabilities, these models often grapple with a challenge: factual hallucinations. Factual hallucinations occur when an AI model generates content that is unfaithful to the source material or cannot be verified against reliable data [2]. This issue is particularly concerning in critical and information-dense fields such as health, law, finance, and education, where misinformation can have catastrophic consequences [3][4]. This essay explores the integration of inference-time decoding strategies with model steering as an approach to enhance the factual accuracy of LLMs. By combining these two methods, we can potentially build adaptive systems capable of detecting and mitigating factual hallucinations.
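As a concrete example of the “steering” half, here is a minimal sketch of activation steering via a PyTorch forward hook: a fixed direction is added to one layer’s residual stream during generation. The model choice, the layer index, the steering strength, and the zero placeholder vector are all illustrative assumptions; in practice the direction would be derived by contrasting activations on factual versus hallucinated completions, and an inference-time decoding strategy (e.g., contrastive or retrieval-grounded decoding) would be layered on top.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# Hypothetical steering direction (e.g., a "truthfulness" direction found by
# contrasting activations on factual vs. hallucinated text). Zeros as placeholder.
steering_vector = torch.zeros(model.config.hidden_size)

def steer(module, inputs, output):
    # Add the steering direction to this layer's residual stream.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 4.0 * steering_vector   # strength is a tunable assumption
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

layer = model.transformer.h[6]                # which layer to steer is an assumption
handle = layer.register_forward_hook(steer)

prompt = "The capital of Australia is"
ids = tok(prompt, return_tensors="pt")
out = model.generate(**ids, max_new_tokens=10, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```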
How big are the models going to get, and how much longer will the scaling hypothesis hold? It’s unclear, but given current performance trends, which haven’t shown signs of plateauing (GPT-4o, Claude 3.5 Sonnet, Gemini-1.5-Pro, Llama-3.1-405B, Grok-2), and the power budget of announced data centres (the 5 GW OpenAI/Microsoft Stargate campus), there is likely at least an order of magnitude (OOM) left to climb in model size. This Epoch AI research covers these scenarios in depth and estimates that training runs on the order of ~2e29 FLOPs will be possible by 2030, which would be 4 OOMs larger than GPT-4 (2e25 FLOPs). These training runs will primarily be power constrained, followed by chips, data, and latency.
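For concreteness, the ratio behind the “4 OOMs” figure:

$$\frac{2 \times 10^{29}\ \text{FLOP}}{2 \times 10^{25}\ \text{FLOP}} = 10^{4} \;\Longrightarrow\; 4 \text{ orders of magnitude}.$$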
This was written when I was younger, and both the content and the form of my opinions on this topic have changed since then. Leaving this here for the sake of continuity.
Studying the effects of technical change on critical mineral demand and supply in the context of the low-carbon energy transition is an important and open area of research. Despite the crucial role these minerals play in low-carbon technologies, long-term demand projections remain uncertain due to intricate interactions between the drivers of technical change. In this writeup, I lay out what a framework for studying the effects of technical change on critical mineral demand would look like, how it could be developed, and what its potential use cases are.