On a quest to understand intelligence and ensure that advanced AGI is safe and beneficial.

Satvik Golechha

Hi! I’m a Research Scientist at the AI Security Institute (AISI), a directorate of the UK Department for Science, Innovation, and Technology (DSIT). My research focuses on frontier alignment, security, interpretability, and reinforcement learning.

Previously, as an independent researcher, I worked on RL for efficient multi-turn exploration at the Center for Human-Compatible AI (CHAI) at UC Berkeley. I was a scholar at the ML Alignment & Theory Scholars (MATS) program, working with Adrià Garriga-Alonso on frontier deception and with Nandi Schoots on feature geometry and modularity, and I completed Neel Nanda’s MATS training program on mechanistic interpretability.

Before deciding to focus full-time on AI safety, I worked at Microsoft Research on language models. Prior to that, I was an Associate Research Scientist at Wadhwani AI working on AI for Social Good and Healthcare.

Writing fiction and poetry along the way!

Drop me an email at zsatvik@gmail.com to discuss research and collaboration!

Research

I study intelligence (via its emergence and expression in neural networks) to ensure that advanced AGI is safe, beneficial, and useful. This involves working on alignment, security, interpretability, and reinforcement learning for frontier AI systems and agents. Here is some of my recent work:

Auditing Games for Sandbagging

Jordan Taylor, Sid Black, Dillon Bowen, Thomas Read, Satvik Golechha, Alex Z-M., Oliver M., Connor K., Kola A., Jacob M., Sam Marks, Chris Cundy, Joseph Bloom

2025, UK AISI (in collaboration with FAR AI)

Among Us: A Sandbox for Measuring and Detecting Agentic Deception

Satvik Golechha, Adrià Garriga-Alonso

NeurIPS 2025 (Spotlight) (MATS)

A is for Absorption: Studying Feature Splitting and Absorption in SAEs

David Chanin, James W.S., Tomáš D., Hardik B., Satvik Golechha, Joseph Bloom

NeurIPS 2025 (Oral) (MATS)

ABBEL: Acting through Belief Bottlenecks Expressed in Language

Aly Lidayan, Jakob Bjorner, Satvik Golechha, Kartik Goyal, Alane Suhr

NeurIPS 2025 (Spotlight, LAW workshop) (CHAI, UC Berkeley)

Auditing Language Models for Hidden Objectives

Samuel Marks, Johannes Treutlein, . . ., Satvik Golechha, . . ., Evan Hubinger

2025, Anthropic (external collaboration)

Who’s the Evil Twin? Differential Auditing for Undesired Behavior

Ishwar B., Hasith V., Greta K., Ronan A., Satvik Golechha

Mentored at SPAR 2025. Under review.

Intricacies of Feature Geometry in Large Language Models

Satvik Golechha, Lucius Bushnaq, Euan Ong, Neeraj Kayal, Nandi Schoots

ICLR 2025 (poster) (best blog award)

Studying Cross-cluster Modularity in Neural Networks

Satvik Golechha, Maheep C., Joan V., Alessandro Abate, Nandi Schoots

NeurIPS 2024: Workshop on Science of Deep Learning

Some Lessons from the OpenAI-FrontierMath Debacle

Satvik Golechha

Some investigative journalism that became pretty popular :)

Progress Measures for Grokking on Real-world Tasks

Satvik Golechha

ICML 2024: Workshop on High-Dim. Learning Dynamics (independent)

Challenges in Mechanistically Interpreting Harmful Representations

Satvik Golechha, James Dao

ICML 2024: Workshop on Mechanistic Interpretability (independent)

NICE: To Optimize In-Context Examples or Not?

Pragya Srivastava*, Satvik Golechha*, Amit Deshpande, Amit Sharma

ACL 2024 (main, poster) (work done at Microsoft Research)

BYoEB: An LLM-Powered Expert-in-the-Loop Chat System

Pragnya R.*, Bhuvan S.*, Satvik Golechha*, Mohit Jain, and others

UbiComp 2025 (work done at Microsoft Research)

Predicting Treatment Adherence of Tuberculosis Patients at Scale

Mihir Kulkarni*, Satvik Golechha*, Rishi R.*, Jithin S.*, Alpan Raval

NeurIPS 2022 (work done at Wadhwani AI)

Poetry

Writing metaphorical poetry allows a channel into emotions that could not have been expressed another way. Check out my poetry page!

Almost done with my first poetry book, Anuswaad!

Fiction

A beautiful thing happens when fiction is written. A good story reflects back to us aspects of ourselves that we’re not aware of. Really, it is the story that’s writing us.

Algebra to Zombies

A 29-week curriculum that covers foundational math required to do AI research. This accompanies a study group I used to run at Microsoft Research in India.

Research Blog

Some notes on AI research. For my research, please see my research statement and Scholar profile.

PS: For a more general (and hopefully fun) introduction to the less-taught parts of AI, check out Alice!

Other Stuff

Intelligence: I write about intelligence and a number of interesting ideas in my fiction and research. I plan to bundle it into a blog series someday.

School: I’m writing a book (or a series of posts) on my version of an ideal school — I believe good schooling is highly impactful, undervalued, and achievable.

Like Winds & Dystop.ai: Slowly working on finishing these novels but aah so little time!

Infinite Jest: Reading this epic book; will take more than a couple of months.

Exploring London: I’ve moved to London for the first time, HMU!

All life is bound together by mutual support and interdependence.
Acharya UmaswaTi