Research Statement

Hi! I’m a Research Scientist at the UK AI Security Institute (AISI), where I work on frontier alignment research with a focus on training and interpreting model organisms of misalignment, such as reward hacking, evaluation awareness, and sandbagging. This note is my research statement: it covers how I think about research, my journey, and what I have learned so far.

For my past papers, please take a look at my Scholar profile and my CV. If you’re interested in my current projects, please send me an email and I’ll be happy to share drafts!

The story begins in 2021, when I finished my undergraduate degree in CS at BITS Pilani. I was passionate about maximizing AI’s positive impact on the world, and thus I joined Wadhwani AI, a non-profit focused on solving challenging societal problems through AI, as an Associate ML Research Scientist. I formulated AI problems in healthcare and trained large-scale, robust, and deployable models to solve them.

I led research on tuberculosis, a disease affecting 10M+ patients worldwide. To help the Government of India manage effective interventions, we built cohort-wise predictive models of patient adherence to their treatment regimen. This work won the best-paper award at the ML4H symposium at NeurIPS 2022 and has been published in PMLR, with leading media coverage and successful deployment in 10+ states impacting 45K+ patients monthly. As our model has gone on to power one of the world’s largest ML-based public health solutions, it has made me appreciate safety and interpretability in high-stakes situations even more.

I am proud of all the impactful work we did at Wadhwani AI and am grateful to have been advised by three amazing mentors: Alpan Raval, Jithin Sreedharan, and Mihir Kulkarni. Around 2023, I realized that the AI research landscape was shifting to large and universally intelligent models, and thus I decided to join Microsoft Research as a Pre-Doctoral Research Fellow to pursue fundamental research around studying, aligning, and deploying language models.

Two of the projects I worked on at MSR were:

Build Your Own Expert Bot: An expert-in-the-loop chat system with citations, automatic KB updates, and multi-modal and multi-lingual capabilities that has been deployed at the Sankara Eye Hospital for 4k+ patients. Here’s the paper and a cool demo! (work done with Mohit Jain)
NICE: To Optimize In-Context Examples or Not?: As part of the LLM alignment team, I introduced a measure of the robustness of in-context examples (ICE) in the presence of high-quality instructions. This metric (called nice!) can be used to decide whether to optimize ICE or the prompt for any new task. We presented it at ACL 2024 (main), and it is being used by various Copilot teams at Microsoft. (paper) (work done with Amit Sharma and Amit Deshpande)

Ever-increasing AI capabilities made me want to pursue research ideas in AI safety and interpretability full-time. I wrote several research proposals and was fortunate to receive funding for my independent research. I also continued to be part of MSR as a visiting scholar for a year, where I ran a math group and worked on unsupervised representation theory with Neeraj Kayal.

And thus started a great chapter in my professional journey – independent research! It brought a lot of uncertainty and stress at times! From being dejected (not getting Ph.D. offers), going back home, and thinking of leaving research, to moving around Bengaluru on my bicycle to find cafes to work out of, I eventually found what I can now say was the single most important contributor to my career in AI safety: MATS Research.

In 2023 (winter), I did the MATS training program with Neel Nanda, where I worked on mechanistically interpreting harmful representations in LLMs (instead of interpreting “Yes/No” logits, which was common at the time). We found linear subspaces for dishonesty and found that the interp tooling fell short in answering most of the important questions around studying model representations. We presented this work at the ICML 2024 Workshop on Mechanistic Interpretability.

In 2024 (summer), I joined MATS as a research scholar with Nandi Schoots, and we worked on modularity and feature geometry.

The modularity project did not go very well – we were trying to train models to be more modular and interpretable, which proved to be very difficult! It even shook my confidence in my skills. Very recently, Leo Gao at OpenAI showed with their weight-sparse transformers that this is indeed difficult but possible, which taught me an important lesson about how to pick projects. On the other hand, the feature geometry work went quite well, and we ended up debunking an award-winning paper in interpretability both theoretically and empirically.

I loved MATS so much that I did it again! In summer 2025, I joined the interpretability stream of Adrià Garriga-Alonso. We worked on something cool – we got LLMs to play Among Us with each other to measure and interpret deception!

This project was very important for me because after working on this I gained much more confidence in my research taste and skills and stopped being uncertain all the time about whether my ideas were good (thank you Adrià!). It ended up being a great project and won a Spotlight at NeurIPS 2025. I believe that the area of social deception games as a means to measure and study deceptive capability is great and would love follow-up work on it.

This was the first time I was in the US, and the first time I got a chance to discuss, learn from, and work with the best researchers in the field. Along with my primary research, I also got a chance to contribute to two really awesome projects:

Auditing Language Models for Hidden Objectives: The goal of this Anthropic project was to simulate an auditing exercise, with a red team training models with hidden goals and a blue team auditing them within specific constraints. As an external collaborator, I was part of the blue team.
A is for Absorption: Studying Feature Splitting and Absorption in SAEs: I’m a big fan of this work because it highlights some major flaws in sparse autoencoders (SAEs), in particular how parent features get “absorbed” by child features. I contributed a theoretical proof of why this happens due to the sparsity-based loss in SAE training.

I also ended up doing some more debunking work:

Some Lessons from the OpenAI-FrontierMath Debacle: I wrote about the new version of the FrontierMath paper, which said that the project was funded by OpenAI and that they had access to all the problems and solutions before they released SoTA results with o3. This piece of investigative journalism became quite popular and was covered by some media outlets too!
Progress Measures for Grokking for Real-world Tasks: As a means to study generalization in neural networks, I worked on grokking, a phenomenon where networks generalize long after overfitting on the training set. I showed that L2 norms did not explain grokking and introduced three progress measures that did (activation sparsity, absolute weight entropy, and local circuit complexity). This was my first solo paper, and I presented it at an ICML 2024 workshop.

Like all beautiful things, independent research came to an end because I wanted to do more impactful safety research – and so I joined the UK AI Security Institute (AISI) as a Research Scientist. So far, it has gone great, and I’ve worked on two projects:

Auditing Games for Sandbagging: This was a big in-house auditing exercise that took a six-month team effort to complete. Sandbagging turned out to be quite difficult to blue-team, and we found in-distribution training on a single sample to be the most effective.
(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL: We reproduced an Anthropic study using open-source models, RL environments, algorithms, and tooling, and found an unexpected result related to CoT faithfulness and KL penalties during RL.

I continue to try to do impactful safety research and remind myself that a research career is a marathon and not a sprint. Things are changing very rapidly, though, and I’m thinking a lot about which skills will end up being useful as AI does more and more of what I used to do – I’ve lost to it on math and programming, but I still consider making beautiful plots my comparative advantage. At least until Claude realizes plotly is the best. :)

Thank you for reading. I’m happy to connect and talk about research – send me an email!

All life is bound together by mutual support and interdependence.
Acharya Umaswati