I am a PhD student at the College of Information and Computer Sciences, UMass Amherst, where I am advised by David Jensen. My research spans the areas of causal inference, probabilistic machine learning, and reinforcement learning. I aim to create tools for analyzing and evaluating the behavior of complex AI systems, with a focus on problems in blame and responsibility attribution, explainability, and alignment with human norms. I am also interested in causal inference methods for evaluation and mechanistic interpretability in large language models.
Unlike most applications of causal inference, which involve objective experimentation and interaction with the external world, these problems are traditionally grounded in subjective human judgments. Such judgments involve norms that can be highly counterintuitive and pose a significant challenge to purely statistical approaches to causal inference. By developing formal approaches for modeling norms and inference algorithms that align with them, I hope to support open, scientific evaluation and auditing of AI systems, and the development of AI systems that better align with human norms.
For a complete list of my publications, see my Google Scholar.
Research
Automated Discovery of Functional Actual Causes in Complex Environments
Caleb Chuck*,
Sankaran Vaidyanathan*,
Stephen Giguere,
Amy Zhang,
David Jensen,
Scott Niekum
In preparation | arXiv
Classical definitions of actual causation often declare a large number of events and entities in an environment to be causes, even when many of them rarely influence the outcome. This is an issue of normality: the distinction between normal and rare events as potential causes. By exploiting context-specific independencies in the environment, we can prune out events that do not affect the outcome in the observed context and identify a smaller, more focused set of actual causes. We extend the formal definition of actual causation to account for these independencies and show how to automatically infer actual causes under this definition.
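To give a flavor of the pruning idea, here is a minimal toy sketch (not the definition or algorithm from the paper): a hypothetical structural equation `outcome(a, b, c)` and a helper `influences_in_context` drop any candidate whose value cannot change the outcome once the other observed values are held fixed.

```python
# Toy illustration (not the paper's algorithm): pruning candidate causes using
# context-specific independence in a simple structural model.

def outcome(a, b, c):
    # Hypothetical structural equation: the alarm fires if A is active,
    # or if both B and C are active.
    return a or (b and c)

observed = {"a": 1, "b": 0, "c": 1}   # observed context; the outcome is 1

def influences_in_context(var, observed, domain=(0, 1)):
    """Check whether intervening on `var` alone can change the outcome,
    holding the other observed values fixed (a context-specific check)."""
    base = outcome(**observed)
    for value in domain:
        trial = dict(observed, **{var: value})
        if outcome(**trial) != base:
            return True
    return False

candidates = [v for v in observed if influences_in_context(v, observed)]
print(candidates)  # ['a']
```

In this observed context, intervening on B or C alone never changes the outcome, so both are pruned and only A survives as a candidate actual cause.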
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Aman Singh Thakur*,
Kartik Choudhary*,
Venkat Srinik Ramayapally*,
Sankaran Vaidyanathan,
Dieuwke Hupkes
In preparation | arXiv
Large language models are often evaluated using the LLM-as-a-judge paradigm, but many open questions remain about the evaluation paradigm itself. We evaluate a range of LLMs acting as judges against human annotations in a controlled setup based on the TriviaQA benchmark. We find that: (1) only GPT-4 Turbo and Llama3-70B stand out among the judge models we evaluate, yet their alignment with human annotations still falls short of inter-annotator agreement; (2) scores assigned by judges with over 80% human alignment can be roughly 20 points apart, and Cohen's kappa is a superior metric; (3) the judges most aligned in scores are not necessarily the most discriminative: judge models with low human alignment, such as JudgeLM-7B and Contains (a lexical-match baseline), sometimes outperform larger and more aligned models at ranking models, because their biases are more systematic; (4) judge LLMs are often lenient and can be easily tricked by controlled responses such as "Yes," "Sure," and "I don't know"; (5) large models are not easy to steer, while smaller models get confused when too much detail is added. Overall, we urge caution in trusting LLMs as judges without also evaluating the judges themselves.
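As a concrete illustration of the agreement-versus-kappa point, the toy example below uses made-up numbers (not our TriviaQA data): a lenient judge can reach high percent agreement on a skewed label distribution, while Cohen's kappa, which corrects for chance agreement, stays much lower.

```python
# Toy illustration of why percent agreement can overstate judge quality:
# a lenient judge that marks almost everything as correct still gets high raw
# agreement on a skewed benchmark, while Cohen's kappa penalizes it.
from sklearn.metrics import cohen_kappa_score

human   = [1] * 85 + [0] * 15           # human labels: 85% of answers judged correct
lenient = [1] * 85 + [1] * 10 + [0] * 5 # judge agrees on correct answers, misses most errors

agreement = sum(h == j for h, j in zip(human, lenient)) / len(human)
kappa = cohen_kappa_score(human, lenient)
print(f"percent agreement: {agreement:.2f}")  # 0.90 -- looks strong
print(f"Cohen's kappa:     {kappa:.2f}")      # ~0.46 -- much lower once chance agreement is removed
```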