I am a PhD student at the College of Information and Computer Sciences, UMass Amherst, where I am advised by David Jensen. My research spans the areas of causal inference, probabilistic machine learning, and reinforcement learning. I aim to create tools for analyzing and evaluating the behavior of complex AI systems, with a focus on problems in blame and responsibility attribution, explainability, and alignment with human norms. I am also interested in causal inference methods for evaluation and mechanistic interpretability in large language models.
Unlike most applications of causal inference, which involve objective experimentation and interaction with the external world, these problems are traditionally grounded in subjective human judgments. Such judgments involve norms that can be highly counterintuitive and pose a significant challenge to purely statistical approaches to causal inference. By developing formal approaches for modeling norms and inference algorithms that align with them, I hope to support open, scientific evaluation and auditing of AI systems, and the development of AI systems that better align with human norms.
For a complete list of my publications, see my Google Scholar.
Research
Automated Discovery of Functional Actual Causes in Complex Environments
Caleb Chuck*,
Sankaran Vaidyanathan*,
Stephen Giguere,
Amy Zhang,
David Jensen,
Scott Niekum
In preparation | arXiv
Classical definitions of actual causation often declare a large number of events and entities in an environment to be causes, even when many of them rarely influence the outcome. This is an issue of normality: the distinction between normal and rare events as potential causes. By exploiting context-specific independencies in the environment, we can prune out events that do not affect the outcome in the observed context and identify a smaller, more focused set of actual causes. We extend the formal definition of actual causation to account for these independencies and show how to automatically infer actual causes under this definition.
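To give a flavor of the pruning idea, here is a minimal toy sketch (not the definition or algorithm from the paper): a hypothetical structural equation `outcome(a, b, c)` and a helper `influences_in_context` drop any candidate whose value cannot change the outcome once the other observed values are held fixed.

```python
# Toy illustration (not the paper's algorithm): pruning candidate causes using
# context-specific independence in a simple structural model.

def outcome(a, b, c):
    # Hypothetical structural equation: the alarm fires if A is active,
    # or if both B and C are active.
    return a or (b and c)

observed = {"a": 1, "b": 0, "c": 1}   # observed context; the outcome is 1

def influences_in_context(var, observed, domain=(0, 1)):
    """Check whether intervening on `var` alone can change the outcome,
    holding the other observed values fixed (a context-specific check)."""
    base = outcome(**observed)
    for value in domain:
        trial = dict(observed, **{var: value})
        if outcome(**trial) != base:
            return True
    return False

candidates = [v for v in observed if influences_in_context(v, observed)]
print(candidates)  # ['a']
```

In this observed context, intervening on B or C alone never changes the outcome, so both are pruned and only A survives as a candidate actual cause.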
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Aman Singh Thakur*,
Kartik Choudhary*,
Venkat Srinik Ramayapally*,
Sankaran Vaidyanathan,
Dieuwke Hupkes
In preparation | arXiv
Large language models are often evaluated using the LLM-as-a-judge paradigm, but many open questions remain about the evaluation paradigm itself. We evaluate a range of LLMs acting as judges against human annotations in a controlled setup based on the TriviaQA benchmark. We find that: (1) only GPT-4 Turbo and Llama3-70B stand out among the judge models we evaluate, yet their alignment with human annotations still falls short of inter-annotator agreement; (2) scores assigned by judges with over 80% human alignment can be roughly 20 points apart, and Cohen's kappa is a superior metric; (3) the judges most aligned in scores are not necessarily the most discriminative: judge models with low human alignment, such as JudgeLM-7B and Contains (a lexical-match baseline), sometimes outperform larger and more aligned models at ranking models, because their biases are more systematic; (4) judge LLMs are often lenient and can be easily tricked by controlled responses such as "Yes," "Sure," and "I don't know"; (5) large models are not easy to steer, while smaller models get confused when too much detail is added. Overall, we urge caution in trusting LLMs as judges without also evaluating the judges themselves.
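As a concrete illustration of the agreement-versus-kappa point, the toy example below uses made-up numbers (not our TriviaQA data): a lenient judge can reach high percent agreement on a skewed label distribution, while Cohen's kappa, which corrects for chance agreement, stays much lower.

```python
# Toy illustration of why percent agreement can overstate judge quality:
# a lenient judge that marks almost everything as correct still gets high raw
# agreement on a skewed benchmark, while Cohen's kappa penalizes it.
from sklearn.metrics import cohen_kappa_score

human   = [1] * 85 + [0] * 15           # human labels: 85% of answers judged correct
lenient = [1] * 85 + [1] * 10 + [0] * 5 # judge agrees on correct answers, misses most errors

agreement = sum(h == j for h, j in zip(human, lenient)) / len(human)
kappa = cohen_kappa_score(human, lenient)
print(f"percent agreement: {agreement:.2f}")  # 0.90 -- looks strong
print(f"Cohen's kappa:     {kappa:.2f}")      # ~0.46 -- much lower once chance agreement is removed
```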