Illuminating Blind Spots of Language Models with Targeted Agent-in-the-Loop Synthetic Data
Philip Lippmann, Matthijs T. J. Spaan, Jie Yang
Med-CAM: Improving Medical Question Answering with Confidence-Aware Methods
Karina H. Halevy, Kshitish Ghate, Jimin Mun, Mona T. Diab, Maarten Sap
TextBandit: Evaluating Probabilistic Reasoning in LLMs Through Language-Only Decision Tasks
Arjun Damerla, Jimin Lim, Nikil Selladurai, Nam Le, Yanxi Jiang
Medal Matters: Probing LLMs’ Failure Cases Through Olympic Rankings
Juhwan Choi, Seunguk Yu, JungMin Yun, YoungBin Kim
Extending AutoCompressors via Surprisal-Based Dynamic Segmentation
Richard Xu, Raine Ma, Dawson Park, David Guo, Srivishnu Ramamurthi, Charles Duong, Kevin Zhu, Vasu Sharma, Sean O’Brien
From Indirect Object Identification to Syllogisms: Exploring Binary Mechanisms in Transformer Circuits
Karim Saraipour, Shichang Zhang
How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence
Hongzhe Du, Weikai Li, Min Cai, Karim Saraipour, Zimin Zhang, Yizhou Sun, Himabindu Lakkaraju, Shichang Zhang
On the Retention of Edited Knowledge in Fine-Tuned Language Models
Fufang Wen, Shichang Zhang
Sarc7: Evaluating Sarcasm Detection and Generation with Seven Types and Emotion-Informed Techniques
Lang Xiong, Raina Gao, Alyssa Jeong, Yicheng Fu, Kevin Zhu, Sean O’Brien, Vasu Sharma
Constructive Disobedience and Trust in Human-Agent Interaction: A Multi-Scale Study
Gordon Briggs, Christina Wasylyshyn
CONFI-Lingual: A Confidence Evaluation Approach for Machine Translation
Daniel Chechelnitsky, Gayathri Ganesh Lakshmy, Kaitlyn Zhou, Chrysoula Zerva, Maarten Sap
Let’s Roleplay: Examining LLM Alignment in Collaborative Dialogues
Abhijnan Nath, Carine Graff, Nikhil Krishnaswamy