Offering a promising solution to the scalability challenges associated with human evaluation, the LLM-as-a-judge paradigm is rapidly gaining traction as an approach to evaluating large language models (LLMs). However, there are still many open questions about the strengths and weaknesses of this paradigm, and what potential biases it may hold. In this paper, we present a comprehensive study of the performance of various LLMs acting as judges, focusing on a clean scenario in which inter-human agreement is high. Investigating thirteen judge models of different model sizes and families, judging answers of nine different 'exam-taker models', both base and instruction-tuned, we find that only the best (and largest) models achieve reasonable alignment with humans. However, they are still quite far behind inter-human agreement, and their assigned scores may differ by up to 5 points from human-assigned scores. When it comes to ranking the nine exam-taker models, however, smaller judge models and even the lexical metric Contains may provide a reasonable signal. Through error analysis and other studies, we identify vulnerabilities in judge models, such as their sensitivity to prompt complexity and length, and a tendency toward leniency. The fact that even the best judges differ from humans in this comparatively simple setup suggests that caution may be wise when using judges in more complex setups. Lastly, our research rediscovers the importance of using alignment metrics beyond simple percent agreement, showing that judges with high percent agreement can still assign vastly different scores.
@inproceedings{thakur2024judging,
  title={Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges},
  author={Thakur, Aman Singh and Choudhary, Kartik and Ramayapally, Venkat Srinik and Vaidyanathan, Sankaran and Hupkes, Dieuwke},
  booktitle={Proceedings of the Fourth Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)},
  year={2025},
  publisher={Association for Computational Linguistics},
}
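The last finding, that percent agreement alone can mask large score differences, is easy to illustrate with a small synthetic sketch (invented data, not the paper's benchmark): a lenient judge can agree with humans on most individual answers while still assigning a noticeably higher overall score, which a chance-corrected metric such as Cohen's kappa and the score gap both make visible.

# Minimal synthetic illustration (not the paper's data): a lenient judge with
# ~80% per-item agreement can still assign a much higher overall score.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
human = rng.integers(0, 2, size=200)              # human correct/incorrect labels
judge = human.copy()
lenient = (human == 0) & (rng.random(200) < 0.4)  # judge marks some wrong answers correct
judge[lenient] = 1

print("percent agreement:", round((human == judge).mean(), 2))
print("Cohen's kappa:    ", round(cohen_kappa_score(human, judge), 2))
print("human score:", round(100 * human.mean(), 1),
      "| judge score:", round(100 * judge.mean(), 1))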
arXiv
Quantitative LLM Judges
Aishwarya Sahoo, Jeevana Kruthi Karnuthala, Tushar Parmanand Budhwani, and 9 more authors
LLM-as-a-judge is a framework in which a large language model (LLM) automatically evaluates the output of another LLM. We propose quantitative LLM judges, which align evaluation scores of existing LLM judges to human scores in a given domain using regression models. The models are trained to improve the score of the original judge by using the judge’s textual evaluation and score. We present four quantitative judges for different types of absolute and relative feedback, which showcases the generality and versatility of our framework. Our framework is more computationally efficient than supervised fine-tuning and can be more statistically efficient when human feedback is limited, which is expected in most applications of our work. We validate these claims empirically on four datasets using two base judges. Our experiments show that quantitative judges can effectively improve the predictive power of existing judges through post-hoc modeling.
@article{sahoo2025quantitativellmjudges,
  title={Quantitative LLM Judges},
  author={Sahoo, Aishwarya and Karnuthala, Jeevana Kruthi and Budhwani, Tushar Parmanand and Agarwal, Pranchal and Vaidyanathan, Sankaran and Siu, Alexa and Dernoncourt, Franck and Healey, Jennifer and Lipka, Nedim and Rossi, Ryan and Bhattacharya, Uttaran and Kveton, Branislav},
  journal={arXiv preprint arXiv:2506.02945},
  year={2025},
}
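To make the post-hoc idea concrete, here is a minimal sketch under assumptions of mine rather than the paper's exact recipe: the frozen judge's textual evaluation is embedded, concatenated with its raw score, and a ridge regressor is fit on a small set of human ratings. The embed function and the synthetic data are placeholders, not the authors' models or datasets.

# Sketch of aligning an existing LLM judge's scores to human scores with a
# post-hoc regression model (assumed setup; not the paper's exact features).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def embed(texts):
    # Placeholder text embedding; in practice any sentence encoder applied to
    # the judge's written evaluation could be used here.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 64))

# judge_evals: textual evaluations, judge_scores: raw judge scores,
# human_scores: gold human ratings collected for a small calibration set.
judge_evals = [f"evaluation text {i}" for i in range(500)]
judge_scores = np.random.default_rng(1).uniform(1, 10, size=500)
human_scores = 0.7 * judge_scores + np.random.default_rng(2).normal(0, 1, size=500)

X = np.hstack([embed(judge_evals), judge_scores[:, None]])
X_tr, X_te, y_tr, y_te = train_test_split(X, human_scores, random_state=0)

quant_judge = Ridge(alpha=1.0).fit(X_tr, y_tr)    # cheap compared to fine-tuning
aligned = quant_judge.predict(X_te)               # calibrated scores on held-out items
print("correlation with human scores:", np.corrcoef(aligned, y_te)[0, 1].round(3))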
2024
Neural Networks
Data-driven learning of chaotic dynamical systems using Discrete-Temporal Sobolev Networks
Connor Kennedy, Trace Crowdis, Haoran Hu, and 2 more authors
We introduce the Discrete-Temporal Sobolev Network (DTSN), a neural network loss function that assists dynamical system forecasting by minimizing variational differences between the network output and the training data via a temporal Sobolev norm. This approach is entirely data-driven, architecture agnostic, and does not require derivative information from the estimated system. The DTSN is particularly well suited to chaotic dynamical systems as it minimizes noise in the network output which is crucial for such sensitive systems. For our test cases we consider discrete approximations of the Lorenz-63 system and the Chua circuit. For the network architectures we use the Long Short-Term Memory (LSTM) and the Transformer. The performance of the DTSN is compared with the standard MSE loss for both architectures, as well as with the Physics Informed Neural Network (PINN) loss for the LSTM. The DTSN loss is shown to substantially improve accuracy for both architectures, while requiring less information than the PINN and without noticeably increasing computational time, thereby demonstrating its potential to improve neural network forecasting of dynamical systems.
@article{kennedy2024data,
  title={Data-driven learning of chaotic dynamical systems using Discrete-Temporal Sobolev Networks},
  author={Kennedy, Connor and Crowdis, Trace and Hu, Haoran and Vaidyanathan, Sankaran and Zhang, Hong-Kun},
  journal={Neural Networks},
  pages={106152},
  year={2024},
  publisher={Pergamon},
}
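The central idea, matching discrete temporal differences of the prediction to those of the target in addition to the values themselves, fits in a few lines of PyTorch. The sketch below uses only first-order forward differences and an assumed weighting term lam; the paper's exact Sobolev order and weighting may differ.

# Sketch of a discrete-temporal Sobolev-style loss: match both the values and
# their finite temporal differences (first order here; weighting is assumed).
import torch

def dts_loss(pred, target, lam=1.0):
    # pred, target: (batch, time, features)
    mse = torch.mean((pred - target) ** 2)
    dpred = pred[:, 1:, :] - pred[:, :-1, :]        # discrete time derivative
    dtarget = target[:, 1:, :] - target[:, :-1, :]
    sobolev = torch.mean((dpred - dtarget) ** 2)
    return mse + lam * sobolev

# Example: the loss is differentiable and architecture agnostic, so it can
# replace plain MSE for an LSTM or Transformer forecaster.
pred = torch.randn(8, 100, 3, requires_grad=True)
target = torch.randn(8, 100, 3)
loss = dts_loss(pred, target)
loss.backward()
print(float(loss))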
arXiv
Automated Discovery of Functional Actual Causes in Complex Environments
Caleb Chuck*, Sankaran Vaidyanathan*, Stephen Giguere, and 3 more authors
Reinforcement learning (RL) algorithms often struggle to learn policies that generalize to novel situations due to issues such as causal confusion, overfitting to irrelevant factors, and failure to isolate control of state factors. These issues stem from a common source: a failure to accurately identify and exploit state-specific causal relationships in the environment. While some prior works in RL aim to identify these relationships explicitly, they rely on informal domain-specific heuristics such as spatial and temporal proximity. Actual causality offers a principled and general framework for determining the causes of particular events. However, existing definitions of actual cause often attribute causality to a large number of events, even if many of them rarely influence the outcome. Prior work on actual causality proposes normality as a solution to this problem, but its existing implementations are challenging to scale to complex and continuous-valued RL environments. This paper introduces functional actual cause (FAC), a framework that uses context-specific independencies in the environment to restrict the set of actual causes. We additionally introduce Joint Optimization for Actual Cause Inference (JACI), an algorithm that learns from observational data to infer functional actual causes. We demonstrate empirically that FAC agrees with known results on a suite of examples from the actual causality literature, and JACI identifies actual causes with significantly higher accuracy than existing heuristic methods in a set of complex, continuous-valued environments.
@article{chuck2024automated,
  title={Automated Discovery of Functional Actual Causes in Complex Environments},
  author={Chuck, Caleb and Vaidyanathan, Sankaran and Giguere, Stephen and Zhang, Amy and Jensen, David and Niekum, Scott},
  journal={arXiv preprint arXiv:2404.10883},
  year={2024},
}
arXiv
Adaptive Circuit Behavior and Generalization in Mechanistic Interpretability
Jatin Nainani*, Sankaran Vaidyanathan*, AJ Yeung, and 2 more authors
Mechanistic interpretability aims to understand the inner workings of large neural networks by identifying circuits, or minimal subgraphs within the model that implement algorithms responsible for performing specific tasks. These circuits are typically discovered and analyzed using a narrowly defined prompt format. However, given the abilities of large language models (LLMs) to generalize across various prompt formats for the same task, it remains unclear how well these circuits generalize. For instance, it is unclear whether the model's generalization results from reusing the same circuit components, the components behaving differently, or the use of entirely different components. In this paper, we investigate the generality of the indirect object identification (IOI) circuit in GPT-2 small, which is well-studied and believed to implement a simple, interpretable algorithm. We evaluate its performance on prompt variants that challenge the assumptions of this algorithm. Our findings reveal that the circuit generalizes surprisingly well, reusing all of its components and mechanisms while only adding additional input edges. Notably, the circuit generalizes even to prompt variants where the original algorithm should fail; we discover a mechanism that explains this, which we term S2 Hacking. Our findings indicate that circuits within LLMs may be more flexible and general than previously recognized, underscoring the importance of studying circuit generalization to better understand the broader capabilities of these models.
@article{nainani2024adaptive,
  title={Adaptive Circuit Behavior and Generalization in Mechanistic Interpretability},
  author={Nainani, Jatin and Vaidyanathan, Sankaran and Yeung, AJ and Gupta, Kartik and Jensen, David},
  journal={arXiv preprint arXiv:2411.16105},
  year={2024},
}
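For readers unfamiliar with the setup, the sketch below shows the standard IOI logit-difference metric (logit of the indirect object minus logit of the subject at the final position) evaluated on GPT-2 small with the TransformerLens library, for a baseline prompt and an illustrative variant of my own. It reproduces only the behavioral metric, not the circuit-level analysis in the paper.

# Logit-difference check for IOI-style prompts in GPT-2 small, using
# TransformerLens. The prompt variant here is an illustrative example only.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 small

def ioi_logit_diff(prompt, io_name, s_name):
    logits = model(model.to_tokens(prompt))        # shape (1, seq, d_vocab)
    io = model.to_single_token(" " + io_name)
    s = model.to_single_token(" " + s_name)
    # Positive values mean the model prefers the indirect object.
    return (logits[0, -1, io] - logits[0, -1, s]).item()

baseline = "When John and Mary went to the store, John gave a drink to"
variant = "Then, John and Mary had a long argument, and afterwards John said to"
print("baseline logit diff:", ioi_logit_diff(baseline, "Mary", "John"))
print("variant logit diff: ", ioi_logit_diff(variant, "Mary", "John"))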
JACS
Assessing Intraoperative Cognitive Workload by Leveraging Deep Learning Networks
Jake Awtry, Sankaran Vaidyanathan, Heather M Conboy, and 6 more authors
Journal of the American College of Surgeons, Oct 2024
Surgeons’ cognitive workload (CWL) fluctuates in response to intraoperative events, and cognitive overload may negatively impact operative performance and outcomes. We sought to use a deep neural network model to predict surgeons’ CWL during coronary artery bypass grafting (CABG). The root mean square of successive differences (RMSSD), a heart rate variability metric that reflects CWL, was collected via 3-lead electrocardiogram monitors and Kubios software for surgeons during non-emergent CABG procedures (n = 26). RMSSD was predicted at 5-minute intervals throughout the operation via a long short-term memory (LSTM) neural network integrating time, surgical phase, and the RMSSD of surgeons at previous timepoints. Predictions were compared with a random model, linear ridge regression, and a simple autoregressive model in which RMSSD for the surgeon at time interval t equals RMSSD at interval t-1. The LSTM, linear ridge regression, and autoregressive models all performed similarly in predicting dynamic changes in surgeon RMSSD while outperforming the random model. Correlation coefficients for measured and predicted RMSSD values for all 3 models across all cases were 0.47, 0.48, and 0.49, respectively, compared with 0.03 for the random model, and were indistinguishable from one another. Shapley additive explanations (SHAP) analysis revealed that a surgeon’s RMSSD at t-1 was the dominant predictor of RMSSD at time t across the range of RMSSD values. The deep LSTM model converged toward, and did not outperform, an autoregressive model, suggesting sustained trends in intraoperative surgeon CWL that would otherwise be difficult to effectively model with machine learning.
@article{awtry2024leveraging,
  title={Assessing Intraoperative Cognitive Workload by Leveraging Deep Learning Networks},
  author={Awtry, Jake and Vaidyanathan, Sankaran and Conboy, Heather M and Kennedy-Metz, Lauren and Clarke, Lori A and Avrunin, George and Dias, Roger and Jensen, David and Zenati, Marco},
  journal={Journal of the American College of Surgeons},
  volume={239},
  number={5},
  pages={S71-S79},
  year={2024},
  month=oct,
  publisher={Ovid Technologies (Wolters Kluwer Health)},
  issn={1879-1190},
  url={http://dx.doi.org/10.1097/XCS.0000000000001159},
  doi={10.1097/xcs.0000000000001159},
}
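The autoregressive baseline that the LSTM converges toward is simple enough to state in a few lines; the sketch below uses synthetic RMSSD values rather than study data, just to make the comparison concrete.

# Sketch of the simple autoregressive baseline: predict the surgeon's RMSSD
# at interval t as the observed RMSSD at interval t-1. Synthetic data only.
import numpy as np

rng = np.random.default_rng(0)
# Simulated RMSSD trace with slow drift plus noise, sampled every 5 minutes.
t = np.arange(60)
rmssd = 40 + 10 * np.sin(t / 10) + rng.normal(0, 3, size=t.size)

pred_ar = rmssd[:-1]                   # prediction for interval t is the value at t-1
obs = rmssd[1:]

r = np.corrcoef(pred_ar, obs)[0, 1]    # Pearson correlation of predicted vs. measured
print(f"autoregressive baseline correlation: {r:.2f}")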
2020
Complex Networks
A new measure of modularity in hypergraphs: Theoretical insights and implications for effective clustering
Tarun Kumar*, Sankaran Vaidyanathan*, Harini Ananthapadmanabhan, and 2 more authors
In Complex Networks and Their Applications VIII: Volume 1, Proceedings of the Eighth International Conference on Complex Networks and Their Applications (COMPLEX NETWORKS 2019), Oct 2020
Many real-world systems consist of entities that exhibit complex group interactions rather than simple pairwise relationships; such multi-way relations are more suitably modeled using hypergraphs. In this work, we generalize the framework of modularity maximization, commonly used for community detection on graphs, for the hypergraph clustering problem. We introduce a hypergraph null model that can be shown to correspond exactly to the configuration model for undirected graphs. We then derive an adjacency matrix reduction that preserves the hypergraph node degree sequence, for use with this null model. The resultant modularity function can be maximized using the Louvain method, a popular fast algorithm known to work well in practice for graphs. We additionally propose an iterative refinement over this clustering that exploits higher-order information within the hypergraph, seeking to encourage balanced hyperedge cuts. We demonstrate the efficacy of our methods on several real-world datasets.
@inproceedings{kumar2020new,
  title={A new measure of modularity in hypergraphs: Theoretical insights and implications for effective clustering},
  author={Kumar, Tarun and Vaidyanathan, Sankaran and Ananthapadmanabhan, Harini and Parthasarathy, Srinivasan and Ravindran, Balaraman},
  booktitle={Complex Networks and Their Applications VIII: Volume 1 Proceedings of the Eighth International Conference on Complex Networks and Their Applications COMPLEX NETWORKS 2019 8},
  pages={286--297},
  year={2020},
  organization={Springer International Publishing},
}
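A compact sketch of the pipeline described above, under one simple assumed reduction: each hyperedge of weight w and size k contributes w/(k-1) to every pair of its member nodes, so each node's weighted degree in the reduced graph equals its weighted hyperedge degree; Louvain is then run on the reduced graph. This illustrates the pipeline, not necessarily the paper's exact reduction or null model.

# Sketch: reduce a hypergraph to a weighted graph that preserves each node's
# weighted hyperedge degree, then cluster with the Louvain method.
# The w/(k-1) pairwise weighting is one simple choice, assumed here.
from itertools import combinations
import networkx as nx
from networkx.algorithms.community import louvain_communities

hyperedges = [
    ({"a", "b", "c"}, 1.0),
    ({"c", "d"}, 1.0),
    ({"d", "e", "f", "g"}, 2.0),
    ({"f", "g"}, 1.0),
]

G = nx.Graph()
for nodes, w in hyperedges:
    share = w / (len(nodes) - 1)
    for u, v in combinations(sorted(nodes), 2):
        # Distribute the hyperedge weight so each member keeps degree w.
        if G.has_edge(u, v):
            G[u][v]["weight"] += share
        else:
            G.add_edge(u, v, weight=share)

clusters = louvain_communities(G, weight="weight", seed=0)
print(clusters)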
Appl. NetSci
Hypergraph clustering by iteratively reweighted modularity maximization
Tarun Kumar, Sankaran Vaidyanathan, Harini Ananthapadmanabhan, and 2 more authors
Learning on graphs is a subject of great interest due to the abundance of relational data from real-world systems. Many of these systems involve higher-order (super-dyadic) interactions rather than mere pairwise (dyadic) relationships; examples include co-authorship, co-citation, and metabolic reaction networks. Such super-dyadic relations are more adequately modeled using hypergraphs rather than graphs. Learning on hypergraphs has thus been garnering increased attention, with potential applications in network analysis, VLSI design, and computer vision, among others. In particular, hypergraph clustering has gained attention because of its many applications, such as component placement in VLSI, group discovery in bibliographic systems, and image segmentation in computer vision. For the problem of clustering on graphs, modularity maximization is known to work well in the pairwise setting. Our primary contribution in this article is to provide a generalization of the modularity maximization framework for clustering on hypergraphs. In doing so, we introduce a null model for graphs generated by hypergraph reduction and prove its equivalence to the configuration model for undirected graphs. The proposed graph reduction technique preserves the node degree sequence of the original hypergraph. The modularity function can then be defined on the reduced graph and maximized using any standard modularity maximization method, such as the Louvain method. We additionally propose an iterative technique that refines the obtained clusters. We demonstrate both the efficacy and efficiency of our methods on several real-world datasets.
@article{kumar2020hypergraph,
  title={Hypergraph clustering by iteratively reweighted modularity maximization},
  author={Kumar, Tarun and Vaidyanathan, Sankaran and Ananthapadmanabhan, Harini and Parthasarathy, Srinivasan and Ravindran, Balaraman},
  journal={Applied Network Science},
  volume={5},
  number={1},
  pages={52},
  year={2020},
  publisher={Springer International Publishing Cham},
}
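The iterative refinement can be sketched as a loop that alternates clustering with reweighting the hyperedges. The update rule below (nudging a hyperedge's weight toward the fraction of its members that share the majority cluster) is an illustrative stand-in of mine, not the reweighting scheme derived in the paper.

# Sketch of the iterative refinement loop: alternate (i) reducing the
# reweighted hypergraph to a graph and (ii) Louvain clustering.
from itertools import combinations
import networkx as nx
from networkx.algorithms.community import louvain_communities

hyperedges = [({"a", "b", "c"}, 1.0), ({"c", "d"}, 1.0),
              ({"d", "e", "f"}, 1.0), ({"e", "f", "g"}, 1.0)]

def reduce_and_cluster(hedges):
    # Degree-preserving clique reduction followed by Louvain, as in the
    # sketch for the conference version above.
    G = nx.Graph()
    for nodes, w in hedges:
        share = w / (len(nodes) - 1)
        for u, v in combinations(sorted(nodes), 2):
            if G.has_edge(u, v):
                G[u][v]["weight"] += share
            else:
                G.add_edge(u, v, weight=share)
    communities = louvain_communities(G, weight="weight", seed=0)
    return {node: i for i, c in enumerate(communities) for node in c}

weights = [w for _, w in hyperedges]
for _ in range(5):  # a few refinement rounds
    labels = reduce_and_cluster([(nodes, w) for (nodes, _), w in zip(hyperedges, weights)])
    for i, (nodes, _) in enumerate(hyperedges):
        members = [labels[n] for n in nodes]
        intra = max(members.count(c) for c in set(members)) / len(members)
        weights[i] = 0.5 * weights[i] + 0.5 * intra   # assumed update rule
print(labels)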