Summer Interns, 2025!

interns in AGH
Altar, Terry, Guy

Here in the deep snows of winter (okay, one snow, not very deep, but 5 inches is not bad these days), we offer a throwback to summertime, to share the experiences of this year’s summer interns. Mentored by PhD students Yu Feng and Xingyu Fu and postdoc Tomer Wolfson, with input from former postdoc Vivek Gupta, the six students had come to Philadelphia from Arizona, California, China, and Israel to spend some or all of their summer with us, working on their various projects. It was great to have them join our ranks and participate in our research. I’ve asked them to give their thoughts on the experience and what they learned; here are five responses.

interns and mentors in AGH
Jen, Xuan, Prasham, Rohit, Tomer, Terry, Yu

[Some responses were compiled or polished using AI.]

Rohit Khoja
Master’s student at Arizona State University

This summer, I worked on improving retrieval and question answering over both unstructured and tabular data. My main focus was on enhancing retrieval precision and LLM reasoning accuracy on the OTT-QA and STaRK datasets, particularly for questions that require multi-step reasoning. As part of this, I built a GraphRAG system from scratch. Instead of relying on conventional knowledge graphs at the entity level, we constructed a chunk-level graph, where each node represented a chunk of 7–8 sentences and edges captured relationships between chunks. This approach led to improvements in retrieval quality and reasoning performance.
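
As a rough illustration of the chunk-level construction, here is a minimal sketch (not the actual system; the chunk size of 7, the toy bag-of-words embedding, and the similarity threshold are all assumptions for illustration):

```python
# Minimal sketch of a chunk-level retrieval graph (illustrative only).
# Nodes are multi-sentence chunks; edges link chunks whose embeddings
# are similar, so retrieval can hop between related chunks.
import itertools
import math

def embed(text, dim=64):
    # Stand-in for a real sentence-embedding model: a tiny
    # bag-of-words vector with deterministic hashed buckets.
    vec = {}
    for word in text.lower().split():
        key = sum(ord(c) for c in word) % dim
        vec[key] = vec.get(key, 0.0) + 1.0
    return vec

def cosine(u, v):
    dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def build_chunk_graph(sentences, chunk_size=7, threshold=0.2):
    # Group consecutive sentences into chunks (nodes), then connect
    # any two chunks whose embeddings are similar enough (edges).
    chunks = [" ".join(sentences[i:i + chunk_size])
              for i in range(0, len(sentences), chunk_size)]
    embs = [embed(c) for c in chunks]
    edges = [(i, j)
             for i, j in itertools.combinations(range(len(chunks)), 2)
             if cosine(embs[i], embs[j]) >= threshold]
    return chunks, edges
```

A retriever can then expand from the best-matching chunk along graph edges, which is what makes multi-step questions easier to cover than with flat chunk retrieval.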

Beyond the technical work, I really enjoyed my time in Philadelphia: the beautiful summer weather, exploring different places around the city, and especially the collaborative atmosphere in the lab. Prof. Dan, Tomer, and Jennifer were supportive and always ready to help, and the convenience of the Penn Transit service around campus was great.

During this internship, I learned a lot about the retrieval field and recent advancements in it, gaining a deeper understanding of why retrieval is such a critical component of modern AI systems. I also learned how to handle large datasets efficiently on GPUs and optimize data processing pipelines for large-scale experiments.

Prasham Titiya
Master’s student at Arizona State University

Prasham at MIT in Cambridge, MA

This summer, I had the wonderful opportunity to be a research intern with the Cognitive Computation Group at the University of Pennsylvania. I worked on developing information retrieval systems for structured, semi-structured, and unstructured data, focusing on how building a knowledge graph can improve retrieval performance and support more efficient multi-hop reasoning. It was a rewarding project, exploring how semantic and lexical relationships can be represented and leveraged to make retrieval more accurate and context-aware.

I learned a lot throughout the project, both technically and personally. Meetings and discussions with Prof. Vivek Gupta, Dr. Tomer Wolfson, and especially Prof. Dan Roth were incredibly insightful and formative. Prof. Roth’s feedback and perspective on approaching this problem helped me think more critically and systematically about my work. I also had the chance to meet PhD students and researchers from other labs, which broadened my horizons and gave me a better understanding of the variety of work happening in this field. I am especially grateful to Jennifer Sheffield for being so proactive and helpful throughout the course of my internship.

Outside of research, I really enjoyed my time in Philadelphia. The city was lively and full of great places to explore. The UPenn campus was very beautiful, filled with greenery, and had a historic charm that made it a wonderful place to spend the summer. Overall, it was an amazing experience and definitely one of the highlights of my year. I had a great time learning, working with incredible people, and exploring a new city.

Tomer, Alon, Terry, Xuan, Yu, Jen at the Penn Park Orchard

Terry Tong
Undergrad at the University of California, Davis

Hi! I’m Terry, a rising senior at UC Davis. Over the summer, I worked at CCG with my mentors Yu Feng and Prof. Dan Roth. When deciding on a project, I wanted to challenge myself with research that is more theoretical in nature, settling on a project on the theory behind neuro-symbolic integration in reasoning.

Terry and the Statue of Liberty

I learned a lot with Dan and Yu. Dan is really good at pruning the research-idea search space given his decades of experience in the field, which saved us from many dead ends. We had a discussion on the characteristics that make LMs amenable to tool learning, and I vividly remember Dan bringing up ‘teachability of models’ and how people used to research this but stopped for good reason. We stopped in our tracks there and pivoted right away. He’d also bring up relevant papers from long ago (like before I was born) that guided the field into what it is now, e.g., one of his seminal ‘Learning to Reason’ papers. This always helped me gain perspective on what’s important. While Dan is a busy person, whenever we did meet, I found it helpful to answer the ‘big picture’ questions he asked. I felt challenged to step back from whatever low-level details I was implementing and think critically about what we should prioritize, which has improved my research decision-making skills overall.

While I met with Dan periodically, I got a lot of help from Yu. My previous mentors had been more hands-off, so when Yu would challenge some of my ideas, I found I actually preferred this type of back and forth. She would tease out details another researcher might ask about and flesh out low-level ideas, which really complemented Dan’s style of high-level advising. I used to think research was all technical math and derivations, but I found that scientific communication is really important, especially when time is limited and you have to pitch an idea or get feedback on an experiment design. Making sure the other party knows exactly what you’re talking about helps decision making and, ultimately, the efficiency of the project.

Personally, this was the first time I got to do research full time, outside of classes. I’ve always struggled with context switching between research and classes, so it was rewarding to have a big chunk of time to just let ideas flow. I think I nurtured a habit of trying to understand things more deeply, spending time digging into neat ideas and deriving equations from scratch. It was really cool to reuse things like learning theory, or the theory of computation, that I’d glossed over in my undergrad classes thinking I’d never use them again. Most importantly, this gave me more time to develop my research skills. I was able to reflect on what went well during research, do some ‘film study’ (see Jacob Steinhardt’s blog on this), and become a better researcher.

I’m grateful to both Yu and Dan for this opportunity, and to all the other CCG members who made my time more enjoyable. The outings with Jen to the Penn orchard or the museum, the coffee runs with Tomer, and lunches with Alon all helped keep me a happy researcher.

Altar Horowitz
Undergrad at Tel Aviv University

Hey! My name is Altar, and I’m a second-year Bioinformatics student at Tel Aviv University. This summer, I had the privilege of working on a project in Professor Dan Roth’s lab, alongside my incredible research partner, Guy Kouchly.

Our project had two main parts. The first involved building an online tool that allows AI researchers to compare distances between embeddings of different sentences, based on their chosen embedding type. The second part focused on exploring whether prompt enrichment improves AI retrieval performance from a database.
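
In spirit, the comparison such a tool performs looks something like the sketch below (purely illustrative: the “models” here are deterministic fake embedders standing in for real embedding types, and the function names are hypothetical):

```python
# Sketch of comparing sentence distances under different embedding
# "models" (illustrative; real models would replace the fake embedders).
import numpy as np

def cosine_distance(u, v):
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def compare(sentences, embedders):
    # For each named embedder, compute the pairwise distance matrix.
    results = {}
    for name, embed in embedders.items():
        vecs = [embed(s) for s in sentences]
        n = len(vecs)
        dist = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                dist[i, j] = cosine_distance(vecs[i], vecs[j])
        results[name] = dist
    return results

def make_fake_embedder(seed, dim=32):
    # A stand-in "embedding model": a deterministic random projection
    # of a bag-of-words vector (an assumption, for illustration).
    rng = np.random.default_rng(seed)
    proj = rng.normal(size=(64, dim))
    def embed(sentence):
        bow = np.zeros(64)
        for w in sentence.lower().split():
            bow[sum(ord(c) for c in w) % 64] += 1.0
        return bow @ proj
    return embed
```

Laying the per-model distance matrices side by side is what lets a researcher see how the choice of embedding type changes which sentences count as “close.”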

For me, this was a very special experience – it was the first time I built an entire tool from scratch, which, as anyone who’s done this knows, is a truly unique and educational process. Moreover, being part of such a high-level laboratory and creating a tool that can be used by some of the best scientists in the world was incredibly empowering.

Another highlight of the summer was the honor of working with the amazing Dr. Tomer Wolfson, who dedicated so much of his time to advising and helping us. Overall, this was one of the most meaningful experiences I’ve had, and it definitely strengthened my motivation to keep working hard and pursue a path in the academic world!

Guy Kouchly
Undergrad at Ben Gurion University of the Negev

During the summer, I worked together with my research partner, Altar, on developing a demo for comparing text embeddings and visualizing the distances between them under different models. Later on, we joined another project under Tomer’s guidance, focusing on improving retrieval methods for large language models using prompt engineering. We experimented with a subset of questions from the OTT-QA dataset and evaluated GPT’s ability to retrieve the corresponding “gold” documents. Our approach involved generating a fictional document (using GPT) for each gold document and using it as a prompt. While this method didn’t yet improve results, Tomer believes there’s still potential – especially with more challenging datasets.  
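
Conceptually, this resembles hypothetical-document retrieval: a generated document stands in for the query, and success is measured by whether the gold document ranks near the top. A hedged sketch of that evaluation loop, with trivial stand-ins for the GPT generator and the embedder:

```python
# Sketch of the gold-document retrieval check: generate a fictional
# document for each question, use it as the query, and test whether
# the gold document lands in the top-k. The embedder and generator
# here are toy stand-ins for the GPT-based ones.
import numpy as np

def embed(text, dim=64):
    # Toy deterministic bag-of-words embedding (illustration only).
    vec = np.zeros(dim)
    for w in text.lower().split():
        vec[sum(ord(c) for c in w) % dim] += 1.0
    n = np.linalg.norm(vec)
    return vec / n if n else vec

def recall_at_k(questions, gold_ids, corpus, generate_doc, k=5):
    doc_vecs = {doc_id: embed(text) for doc_id, text in corpus.items()}
    hits = 0
    for q, gold in zip(questions, gold_ids):
        query_vec = embed(generate_doc(q))  # fictional document as query
        ranked = sorted(doc_vecs,
                        key=lambda d: -float(doc_vecs[d] @ query_vec))
        hits += gold in ranked[:k]
    return hits / len(questions)
```

Swapping `generate_doc` between the raw question and a GPT-written fictional document is exactly the kind of A/B comparison the experiment calls for.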

I really enjoyed working on these projects this summer. NLP is new to me, and I’m grateful for the chance to gain hands-on experience so early in my studies. Just as importantly, the lab atmosphere was wonderful – I always felt comfortable asking for help, and everyone was incredibly kind, patient, and welcoming.

In terms of what I’ve learned—almost everything was new! On the theoretical side, I got to explore concepts like embeddings, dimensionality reduction (PCA), and retrieval-based reasoning. On the practical side, I learned about building demos, using APIs, and the general workflow of conducting research.

Many thanks to Dan, Tomer, Terry, Rohit, Prasham, and you, Jen, for all the support and for making this such a meaningful experience.

CCG Papers at EMNLP and *SEM 2025

The Conference on Empirical Methods in Natural Language Processing (EMNLP), celebrating its 30th anniversary, is underway in Suzhou, China, and the co-located 14th Joint Conference on Lexical and Computational Semantics (*SEM) kicks off tomorrow. We’re excited to share the works that will be presented by the group and our collaborating authors. You can find links to our EMNLP and *SEM 2025 papers below!

[Some summaries generated by AI.]

*SEM 2025 (11/08-11/09, co-located with EMNLP)

Cross-lingual Extractive Question Answering with Unanswerable Questions

This paper extends cross-lingual extractive QA, where models need to find answers in passages written in a language different from the question, to cases where no answer exists within the given context. Proposing two novel datasets and performing extensive experiments, we analyze the strengths and weaknesses of different language models and training strategies for this task, as well as the effect of language identity on performance.

Yuval Gorodissky, Elior Sulem, and Dan Roth, Cross-lingual Extractive Question Answering with Unanswerable Questions. *SEM (2025)

EMNLP 2025 (11/5-11/9)

Conflicts in Texts: Data, Implication, Challenges

This survey examines how conflicting information arises in NLP—from inconsistencies in natural texts, to annotation disagreements, to model-level hallucinations and knowledge clashes—and unifies them under a common framework. We analyze their implications, highlight mitigation strategies, and chart future directions for building conflict-aware NLP systems that can better reconcile contradictions.

Siyi Liu and Dan Roth, Conflicts in Texts: Data, Implication, Challenges. EMNLP Findings (2025)

AutoCT: Automating Interpretable Clinical Trial Prediction with LLM Agents

The paper introduces AutoCT, a framework that integrates large language models with classical machine learning to predict clinical trial outcomes. By autonomously proposing, constructing, and refining features through Monte Carlo Tree Search, AutoCT achieves competitive performance with state-of-the-art methods while maintaining interpretability and reducing reliance on human input.

Fengze Liu, Haoyu Wang, Joonhyuk Cho, Dan Roth, and Andrew Lo, AutoCT: Automating Interpretable Clinical Trial Prediction with LLM Agents. EMNLP (2025)

LogiCoL: Logically-Informed Contrastive Learning for Set-based Dense Retrieval

LogiCoL is a new training method for dense retrieval models that helps them better handle queries containing logical connectives (like “and,” “or,” “not”) by incorporating logical constraints directly into the learning process through contrastive learning. The authors demonstrate that this approach improves both retrieval accuracy and logical consistency when retrieving sets of Wikipedia entities that must satisfy complex logical relationships specified in queries.

Yanzhen Shen, Sihao Chen, Xueqiang Xu, Yunyi Zhang, Chaitanya Malaviya, and Dan Roth, LogiCoL: Logically-Informed Contrastive Learning for Set-based Dense Retrieval. EMNLP (2025)

Evaluating NL2SQL via SQL2NL

We propose a novel schema-aligned paraphrasing framework that leverages SQL-to-NL (SQL2NL) to automatically generate semantically equivalent, lexically diverse queries while maintaining alignment with the original schema and intent. This enables the first targeted evaluation of NL2SQL robustness to linguistic variation in isolation, distinct from prior work that primarily investigates ambiguity or schema perturbations.

Mohammadtaher Safarzadeh, Afshin Oroojlooyjadid, and Dan Roth, Evaluating NL2SQL via SQL2NL. EMNLP Findings (2025)

Weaver: Interweaving SQL and LLM for Table Reasoning

Weaver is a novel framework for question answering over tables when questions are complex and require reasoning, columns are unstructured, or logical operations must be combined with semantic understanding. Weaver dynamically combines SQL for retrieval and aggregation with the semantic reasoning capabilities of LLMs.

Rohit Khoja, Devanshu Gupta, Yanjie Fu, Dan Roth, and Vivek Gupta, Weaver: Interweaving SQL and LLM for Table Reasoning. EMNLP (2025)

Rethinking LLM Uncertainty: A Multi-Agent Approach to Estimating Black-Box Model Uncertainty

We show in this paper that LLMs are inconsistent on factual questions and can easily be biased by the additional context in the input, and we propose a multi-agent approach to improve answer consistency and uncertainty quantification.

Yu Feng, Phu Mon Htut, Zheng Qi, Wei Xiao, Manuel Mager, Nikolaos Pappas, Kishaloy Halder, Yang Li, Yassine Benajiba, and Dan Roth, Rethinking LLM Uncertainty: A Multi-Agent Approach to Estimating Black-Box Model Uncertainty. EMNLP Findings (2025)

CCG Papers at ICML 2025

The Forty-Second International Conference on Machine Learning (ICML) starts off today in Vancouver, Canada. We’re excited to share the works that will be presented by the group and our collaborating authors. You can find links to our ICML 2025 papers below!

ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding 

This work presents ReFocus, a framework that equips multimodal LLMs with the ability to generate “visual thoughts” by performing visual editing on the input image through code, shifting and refining their visual focuses. With experiments on a wide range of structured image understanding tasks involving tables and charts, we present an in-depth analysis of the effects of different visual edits and the ways ReFocus can edit the input image until an answer is reached.

Xingyu Fu, Minqian Liu, Zhengyuan Yang, John Corring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Florencio, and Cha Zhang, ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding ICML (2025).

GIVE: Structured Reasoning of Large Language Models with Knowledge Graph Inspired Veracity Extrapolation

This paper introduces GIVE (Graph Inspired Veracity Extrapolation), a framework that enhances Large Language Models’ reasoning on knowledge-intensive tasks by combining their internal knowledge with minimal, structured external knowledge graph cues. The method enables smaller LLMs to achieve or exceed the performance of larger models on complex scientific reasoning tasks, while also reducing hallucination.

Jiashu He, Mingyu Ma, Jinxuan Fan, Dan Roth, Wei Wang, and Alejandro Ribeiro, GIVE: Structured Reasoning of Large Language Models with Knowledge Graph Inspired Veracity Extrapolation ICML (2025).

Includes some text generated by AI.

CCG Papers at ICLR 2025

The 2025 International Conference on Learning Representations (ICLR) is happening this week in Singapore. We’re excited to share the work that will be presented and published by the group and our collaborating authors. You can find links to our ICLR 2025 papers below!

BIRD: A Trustworthy Bayesian Inference Framework for Large Language Models

Language models often struggle with reliable and consistent decisions under uncertainty, partially because they can’t reliably estimate the probability of each choice. We propose BIRD, a framework that significantly enhances LLM decision making under uncertainty. BIRD leverages LLMs for world modeling and constructs a Bayesian network using LLM-generated variables, enabling interpretable and trustworthy probability estimates.

Yu Feng, Ben Zhou, Weidong Lin, and Dan Roth, BIRD: A Trustworthy Bayesian Inference Framework for Large Language Models ICLR (2025)

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

We introduce MuirBench, a comprehensive benchmark that focuses on robust multi-image understanding capabilities of multimodal LLMs. MuirBench consists of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of multi-image relations (e.g., multiview, temporal relations). For reliable assessment, each standard instance in MuirBench is paired with an unanswerable variant that has minimal semantic differences.

Fei Wang, Xingyu Fu, James Y. Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, Tianyi Lorena Yan, Wenjie Jacky Mo, Hsiang-Hui Liu, Pan Lu, Chunyuan Li, Chaowei Xiao, Kai-Wei Chang, Dan Roth, Sheng Zhang, Hoifung Poon, and Muhao Chen, MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding ICLR (2025)

Beyond correlation: The Impact of Human Uncertainty in Measuring the Effectiveness of Automatic Evaluation and LLM-as-a-Judge

The effectiveness of automatic evaluation of generative models is typically measured by comparing labels generated via automation with labels by humans using correlation metrics. In this paper, we show how *relying on a single aggregate correlation score* can obscure fundamental differences between human labels and those from automatic evaluation, including LLM-as-a-Judge. Based on these findings, we propose stratifying data by human label uncertainty to provide a more robust analysis and introduce a new metric to better measure the effectiveness of automatic evaluations.

Aparna Elangovan, Lei Xu, Jongwoo Ko, Mahsa Elyasi, Ling Liu, Sravan Bodapati, and Dan Roth, Beyond correlation: The Impact of Human Uncertainty in Measuring the Effectiveness of Automatic Evaluation and LLM-as-a-Judge ICLR (2025)

CCG Papers at NAACL 2025

Red NAACL logo with text "NAACL 2025"

The 2025 Conference of the Nations of the Americas* Chapter of the Association for Computational Linguistics (NAACL) will be coming up in May, in Albuquerque, New Mexico. We’re excited to share the work that will be presented and published by the group and our collaborating authors. You can find links to our NAACL 2025 papers below!

Updated with presentation times and dates (note: the listing is not in chronological order; the first paper is on April 30, and so is the last).
___
* Here’s a blog post from the NAACL organizing committee about their name change.

On Reference (In-)Determinacy in Natural Language Inference

This paper introduces RefNLI, a diagnostic benchmark for identifying reference ambiguity in Natural Language Inference examples. We provide insight into how the reference determinacy assumption (the assumption that premise and hypothesis refer to the same context) impacts the downstream utility of NLI models, and discover that the existence of reference ambiguity in NLI examples can in part explain the inherent human disagreements in NLI.

Sihao Chen, Chaitanya Malaviya, Alex Fabrikant, Hagai Taitelbaum, Tal Schuster, Senaka Buthpitiya, and Dan Roth, On Reference (In-)Determinacy in Natural Language Inference. NAACL Findings (2025)

Wednesday, April 30, Session C: Oral/Poster 2, Hall 3, 14:00-15:30.

Towards Long Context Hallucination Detection

This paper studies hallucination detection where the context length is long (>=512 tokens). We construct a dataset to evaluate the task and propose a method to approach it.

Siyi Liu, Kishaloy Halder, Zheng Qi, Wei Xiao, Nikolaos Pappas, Phu Mon Htut, Neha Anna John, Yassine Benajiba, and Dan Roth, Towards Long Context Hallucination Detection. NAACL Findings (2025)


Friday, May 2, Session J: Oral/Poster 7, Hall 3, 9:00-10:30.

Open Domain Question Answering with Conflicting Contexts

We study open domain question answering when there is conflicting evidence presented on the web. We demonstrate that by finetuning LLMs to explain their answers, we can introduce richer information into their training that guides them through the process of reasoning with conflicting contexts.

Siyi Liu, Qiang Ning, Kishaloy Halder, Zheng Qi, Wei Xiao, Phu Mon Htut, Yi Zhang, Neha Anna John, Bonan Min, Yassine Benajiba, and Dan Roth, Open Domain Question Answering with Conflicting Contexts. NAACL Findings (2025)

Friday, May 2, Session J: Oral/Poster 7, Hall 3, 9:00-10:30.

H-STAR: LLM-driven Hybrid SQL-Text Adaptive Reasoning on Tables

Existing approaches to tabular reasoning employ either textual reasoning, which excels in semantic interpretation but struggles with mathematical operations, or symbolic reasoning, which handles computations well but lacks semantic understanding. H-STAR, a novel method introduced in this paper, integrates text comprehension with SQL-like logic to effectively answer queries from structured tables.

Nikhil Abhyankar, Vivek Gupta, Dan Roth, and Chandan K. Reddy, H-STAR: LLM-driven Hybrid SQL-Text Adaptive Reasoning on Tables. NAACL (2025)

Thursday, May 1, Session H: Oral/Poster 5, Hall 3, 14:00-15:30.

TRANSIENTTABLES: Evaluating LLMs’ Reasoning on Temporally Evolving Semi-structured Tables

The ability to reason over time allows us to identify future steps and to understand the effects of decisions on our lives. However, large language models are typically trained on static datasets, limiting their ability to perform effective temporal reasoning. To assess the temporal reasoning capabilities of LLMs, this paper presents the TRANSIENTTABLES dataset, with questions derived from over 14,000 tables, spanning multiple time periods.

Abhilash Shankarampeta, Harsh Mahajan, Tushar Kataria, Dan Roth, and Vivek Gupta, TRANSIENTTABLES: Evaluating LLMs’ Reasoning on Temporally Evolving Semi-structured Tables. NAACL (2025)

Friday, May 2, Session K: Oral/Poster 8, 11:00-12:30 (presentation 11:45).

Enhancing Temporal Understanding in LLMs for Semi-structured Tables

We introduce the C.L.E.A.R. prompting framework and auxiliary cross-format training to enhance LLM performance in temporal tabular reasoning. Our findings demonstrate that our method improves evidence-based reasoning across various models. Additionally, our experimental results reveal that indirect supervision with auxiliary unstructured data (TRAM) substantially boosts model performance.

Irwin Deng, Kushagra Dixit, Vivek Gupta, and Dan Roth, Enhancing Temporal Understanding in LLMs for Semi-structured Tables. NAACL Findings (2025)

Thursday, May 1, Session H: Oral/Poster 5, Hall 3, 14:00-15:30.

MAPWise: Vision-Language Models for Advanced Map Queries

This paper introduces a new benchmark for evaluating vision-language models (VLMs) on choropleth map question answering, featuring diverse maps and question types across multiple geographic regions. Evaluation of several VLMs reveals significant performance gaps, highlighting the need for further research in this area and providing a resource for future model development.

Srija Mukhopadhyay, Abhishek Rajgaria, Prerana Khatiwada, Manish Shrivastava, Dan Roth, and Vivek Gupta, MAPWise: Evaluating Vision-Language Models for Advanced Map Queries. NAACL (2025)

Thursday, May 1, Session F: Oral/Poster 4, Ballroom B, 10:30-12:00 (presentation 10:45).

NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models

NTSEBENCH is a Vision-Language Model benchmark dataset with 2,728 questions and 4,642 images from India’s NTSE exam, to evaluate VLMs on cognitive multimodal reasoning.

Pranshu Pandya, Vatsal Gupta, Agney Talwar, Tushar Kataria, Dan Roth, and Vivek Gupta, NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models. NAACL Findings (2025)

Friday, May 2, Session K: Oral/Poster 8, Hall 3, 11:00-12:30.

Aligning to Constraints for Data-Efficient Language Model Customization

This paper proposes ACT (Aligning to ConsTraints), a unified and efficient Language Model customization framework using automatic constraint verifiers to provide supervision signals for adapting models to downstream tasks.

Fei Wang, Chao Shang, Sarthak Jain, Shuai Wang, Qiang Ning, Bonan Min, Vittorio Castelli, Yassine Benajiba, and Dan Roth, Aligning to Constraints for Data-Efficient Language Model Customization. NAACL Findings (2025)

Tuesday, May 6, Gather Session 2, online, 15:00-16:30.

Leveraging LLM For Synchronizing Information Across Multilingual Tables

We explored the application of large language models (LLMs) for multilingual information synchronization, focusing on improving the accuracy and coherence of updates to Wikipedia tables in low-resource languages.

Siddharth Khincha, Tushar Kataria, Ankita Anand, Dan Roth, and Vivek Gupta, Leveraging LLM For Synchronizing Information Across Multilingual Tables. NAACL (2025)

Wednesday, April 30, Session D: Oral/Poster 3, Ballroom B, 16:00-17:30 (Presentation 17:00).

Interview with Muhao Chen (CCG 2019-2020)

Muhao Chen joined the Cognitive Computation Group in 2019 as a postdoctoral researcher, and he still collaborates with us from time to time.

Muhao Chen at a sign marking the Arctic Circle, positioned as though he's holding up the earth.
Muhao Chen at the Arctic Circle, holding up the earth.

He is currently an Assistant Professor in the Department of Computer Science at UC Davis. He directs the Language Understanding and Knowledge Acquisition (LUKA) Lab. Muhao’s research focuses on robust and minimally supervised data-driven machine learning for Natural Language Processing. Most recently, his group’s research has been focusing on accountability and security problems of large language models and multi-modal language models. Previously, Muhao was a Postdoctoral Fellow at UPenn, from 2019 to 2020. He received his Ph.D. degree from the Department of Computer Science at UCLA in 2019. Before joining UCLA as a Ph.D. student, he graduated with a Bachelor’s degree from Fudan University in 2014.

Hi, Muhao! Glad to catch up with you. Please tell me where you’re living and what you’re up to these days.

I’m living in Sacramento — the capital city of California. As usual, aside from working, I still enjoy traveling (especially driving to different places for road trips). Luckily, from Sac or the Bay Area it is easy to reach most places in the country through direct flights or driving. 

That’s excellent! I’m glad you’ve been getting to travel.
What are the most rewarding things about your current work?

What I find most rewarding is having had a well-established NLP group since I started working as a faculty member. All the students have done excellent jobs building strong academic records of their own. Over a year ago, the first batch of PhD students graduated, and they have become very successful researchers in industry. A few more are looking for (or will soon be looking for) faculty positions (wish them the best of luck!). I hope that one day the group can be as successful as CCG and have a lot of successful alumni.

The most surprising? And yes, best of luck to them!

Most of my group members have pets (mostly cats, as we list in one of the sections here: https://luka-group.github.io/people.html). I recently got my own pets, two Roborovski hamsters.

They’re very sweet. I like that the nonhumans get special recognition in your lab. (-:

How connected is your work now with what you did in our group?

NLP has been moving way too fast nowadays. But quite a few things we’ve done recently, especially those related to LLM reasoning and indirect supervision, are closely related to what I did at CCG. In fact, I still collaborate with Dan and other (past or current) CCG members like Ben, Wenpeng, Haoyu, Hongming, and Qiang on these topics. We have been giving tutorials every year since 2020 about this research.

What new development(s) in the field of NLP are you excited about right now?

Our group has been focusing on machine learning robustness since it was founded. In particular, we have recently been very interested in safety issues of LLMs. We build systems that automatically identify safety issues in LLMs, safeguard LLMs from malicious use, and protect LLMs from threats and vulnerabilities when they interact with complex environments. This area is particularly important nowadays, considering that LLMs are becoming the backbones of more and more intelligent systems and are starting to handle thousands of tasks in the real world.

I’m glad to hear you’re focusing on safety issues as LLMs grow.
Thoughts about the state of AI? 

This is an exciting time, when AI researchers are building larger and larger learning-based systems that not only solve daily-life problems but also help with frontier scientific discovery in many other fields, like biology, medicine, chemistry, and food science. On the one hand, it is a good time for us to work with other fields of study on the many scenarios where AI can contribute. On the other hand, it is an important time for academia to collaborate more closely with industry, as the AI systems we seek to build now require significantly more computing and data resources.

How are things outside of work?

I just finished my checklist for traveling to all the national parks in the US. Last summer I drove the Dalton Highway to reach the Gates of the Arctic.

Congratulations! That’s fantastic. So, how many national parks have you visited? And which have made a particular impression on you?

I’ve been to 56 national parks (only counting real “National Parks,” not national monuments or national historical parks, etc., though I’ve been to many of those as well). There are 7 national parks I still haven’t been to (3 in Alaska, 1 island in California, 1 island in Florida, 1 in American Samoa, and 1 in the Virgin Islands), because all of these need to be reached by air taxi or cruise ship; I’ve now finished all the ones that can be reached by driving. I really love the national park system in the US because almost every park is different from the others, with many unique scenes to see and roads to drive on.

Favorite park: if I had to pick one, then definitely Yellowstone, which stands out from all the rest. But I’ve been asked to pick my top 5 in the past, and I eventually picked a top 6: Yellowstone (WY), Death Valley (CA), Arches (UT), Carlsbad Caverns (NM), Redwood (CA), and Badlands (SD).

Excellent! Do you have a memory to share from your time with the group?

It was campus lock-down time in 2020, but a few of us had hotpot every Friday at my apartment. In fact, a few of us still spent time together in the 3401 Walnut building during the lock-down. A lot of fun happened during that time. There were times when we stayed late in the building before a paper deadline. There were also times when we brought game consoles to play in the room where Dan used to host his group meetings. Most of them have graduated now (except for Haoyu).

Any advice for the current students and postdocs in the group?

One important thing I learned from Dan is to develop a good research taste. Doing meaningful research is not about publishing more and more papers. In fact, only the first paper, the best paper, and probably also the last paper about a topic are the memorable ones.

Thanks for this insight. And thank you so much for this interview!

For more information on Muhao Chen’s research at UC Davis, please visit his website.

Just looked this up: the 414-mile Dalton Highway in Alaska, including the 100+ mile stretch Muhao would have driven to reach the Arctic Circle!

Summer Interns, 2024!

This summer, for the first time in a few years, we had a substantial group of summer interns working with us, mentored by PhD student Sihao Chen and postdoc Vivek Gupta. Some were visiting from universities as far away as Utah and California, but the group also included a Penn undergrad and a local high schooler. We were honored to work with them and incorporate them into our research work. As the summer term winds up, I’ve asked for their thoughts on the experience and what they learned and enjoyed here!

Harshavardhan Kalalbandi
Master’s student at the University of California, Riverside

My internship at Prof. Dan Roth’s Cognitive Computation Group has been an incredible journey. During the first few weeks, I spent time going through many research papers to formulate a good problem. I developed a keen interest in temporal QA on tables, where the tables can be represented in various ways. We started by establishing baselines using existing reasoning methods, then proposed a dynamic reasoning approach that gives the model flexibility in formulating its own approach. Initially, we tried solving the problem through simple prompting, but when that proved insufficient, we explored alternative approaches.

Harsha on a trip to Washington, DC

Our efforts led us to develop a multi-agent approach and a fine-tuning method for our system. A significant challenge has been keeping pace with the rapid advancements in state-of-the-art NLP research. A highlight of this experience was meeting Prof. Dan Roth and his incredible team, which was both inspiring and fun. Dr. Vivek Gupta’s mentorship and expertise were very helpful in seeing this work through. I also explored the incredible campus of Penn, went around Philadelphia, and traveled to New York City and DC during the weekends. It was a great fun experience where I learned a lot, met incredible people, and enjoyed myself.

Kushagra Dixit
Master’s student at the University of Utah

This summer, I had the amazing opportunity to intern with the Cognitive Computation Group at the University of Pennsylvania. During my time here, I worked on two fascinating domains: improving the temporal understanding of large language models on semi-structured data and exploring the mechanisms behind in-context learning in LLMs. These projects allowed me to delve deep into some of the most exciting trends in NLP research, providing invaluable experience that I will carry forward in my career.

Kushagra on a trip to Washington, DC

One of the most enjoyable aspects of this internship has been the vibrant and collaborative environment. I’ve had the pleasure of interacting with PhD students in the lab, sharing ideas with fellow interns, and receiving invaluable guidance from Dr. Vivek Gupta. Every discussion with Professor Dan Roth has been particularly special. His insights and expertise have consistently challenged and inspired me, leaving a profound impact on my approach to research. I will leave this internship not only with a wealth of new knowledge but also with cherished memories of the fun and engaging moments shared with the team.

In terms of skills, I’ve significantly sharpened my programming abilities, but more importantly, I’ve learned how to drive a research project from inception to completion. Engaging with so many experienced researchers has provided me with numerous opportunities to understand their thought processes, which has been a critical learning experience. As I move forward, I am excited to continue my engagement with CogComp, building on the strong foundation this internship has provided, and contributing further to NLP research.

Mike Zhou
Undergrad at the University of Pennsylvania

Mike at Mission Peak, San Francisco

Hi! My name is Mike and I’m currently a rising senior at Penn. This summer, I’ve had the pleasure to be a part of the Cognitive Computation Group, where I’ve been focusing on language model reasoning and understanding. My current projects involve exploring large language models’ abilities to reason, and whether their reasoning comes from a form of deduction or simply from semantic associations. My day-to-day routine in the lab involves keeping up with literature, designing experiments that I’ll run, and talking with other lab mates to give and gain insights on each other’s projects. Overall, I’d say that I’ve learned quite a bit about and gained a good number of insights on language models this summer, whether it was from my work, peers, or Professor Roth.

Yanzhen Shen
Undergrad at the University of Illinois Urbana-Champaign

Hi! I’m Yanzhen Shen, an undergraduate student at the University of Illinois at Urbana-Champaign. This summer, I had the privilege of conducting research at the University of Pennsylvania under Professor Dan Roth and my PhD mentor, Sihao Chen. During this experience, I worked on cutting-edge research in Information Retrieval and gained a deeper understanding of the qualities of a good researcher.

Yanzhen in Chicago

Our project focused on improving retrieval systems, particularly for complex queries. While current systems handle simple queries effectively, they struggle with queries containing logical operators, such as “Orchids of Indonesia and Malaysia but not Thailand.” To address this, we are using text embedding models to better interpret logical operators like AND, OR, and NOT in complex queries. Our goal was to push the boundaries of how query and document dense embeddings can represent complex information.
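As a toy illustration of the general idea (this is a hypothetical sketch, not the project's actual method or data), one simple way to score documents against a query with AND/NOT sub-queries is to combine per-sub-query embedding similarities, taking the weakest AND match and penalizing the strongest NOT match:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy 3-d "embeddings" standing in for a real text-embedding model.
EMB = {
    "orchids indonesia": [0.9, 0.1, 0.0],
    "orchids malaysia":  [0.8, 0.2, 0.0],
    "orchids thailand":  [0.1, 0.9, 0.0],
    "doc: indo-malay orchid survey": [0.85, 0.15, 0.0],
    "doc: thai orchid field guide":  [0.15, 0.85, 0.0],
}

def logical_score(doc, and_terms, not_terms):
    """Score = weakest AND-term similarity minus strongest NOT-term similarity."""
    d = EMB[doc]
    and_score = min(cosine(EMB[t], d) for t in and_terms)
    not_score = max((cosine(EMB[t], d) for t in not_terms), default=0.0)
    return and_score - not_score

# "Orchids of Indonesia and Malaysia but not Thailand"
query_and = ["orchids indonesia", "orchids malaysia"]
query_not = ["orchids thailand"]
for doc in ("doc: indo-malay orchid survey", "doc: thai orchid field guide"):
    print(doc, round(logical_score(doc, query_and, query_not), 3))
```

With these made-up vectors, the Indonesia/Malaysia survey document outranks the Thailand guide; the project itself trains dense embeddings to capture such operators directly, rather than composing scores by hand.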

Technically, I became proficient in dual encoder training frameworks and data processing for Information Retrieval tasks. 

More importantly, discussions with Professor Roth helped me view our ideas from broader perspectives. For example, he continually encouraged me to think about how our model differs from existing dense retrieval models and other retrieval systems. In addition, working closely with Sihao also gave me insights into how a senior PhD student approaches and resolves challenges in complex research problems. We also engaged in paper discussions, where he taught me to think about critical questions when reading papers, such as “What makes a good research question?” and “Is this work influential?”

Atharv Kulkarni
Undergrad at the University of Utah

Atharv in conversation with Benjamin Franklin

This summer, I had the incredible opportunity to intern at the University of Pennsylvania’s Cognitive Computation Group under the guidance of Professor Dan Roth. The main focus of my internship was to enhance the capabilities of language models for complex reasoning tasks using neurosymbolic approaches. I created a question-answer dataset based on the Wikipedia information of Olympic athletes and used it to evaluate state-of-the-art language models. This involved integrating Wikipedia data into a database and developing SQL queries to create a large synthetic dataset. I discovered that models often struggle with temporal reasoning tasks, which led to productive discussions with Professor Roth about using neurosymbolic techniques to improve reasoning performance. By refining models with fine-tuning and prompt engineering, I made significant progress in enhancing language models’ ability to transform natural language questions into SQL queries.
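To give a flavor of how a synthetic question-to-SQL dataset can be built from a database (a minimal sketch with a made-up schema and rows, not the actual athlete database or templates from the project), one can pair a question template with the SQL query it should map to and fill the slots over many entities:

```python
import sqlite3

# Hypothetical miniature athlete table; schema and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE athletes (name TEXT, sport TEXT, gold INTEGER, year INTEGER)")
conn.executemany(
    "INSERT INTO athletes VALUES (?, ?, ?, ?)",
    [("A. Runner", "Athletics", 2, 2016),
     ("B. Swimmer", "Swimming", 4, 2016),
     ("C. Swimmer", "Swimming", 1, 2021)],
)

# One (question template, SQL template) pair; executing the SQL yields the
# gold answer, producing a synthetic (question, SQL, answer) training triple.
template = "How many gold medals did athletes win in {sport} in {year}?"
sql = "SELECT SUM(gold) FROM athletes WHERE sport = ? AND year = ?"

question = template.format(sport="Swimming", year=2016)
(answer,) = conn.execute(sql, ("Swimming", 2016)).fetchone()
print(question, "->", answer)  # -> 4
```

Repeating this over many templates and slot fillings yields a large evaluation set where every natural-language question has a verifiable SQL-derived answer.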

Beyond the technical work, Professor Roth’s mentorship was invaluable. His insightful feedback and guidance helped me develop skills in fine-tuning models and utilizing various APIs, significantly advancing my understanding of the field. His expertise provided me with a deeper appreciation for the nuances of research and inspired me to think critically about complex problems. Mentoring Aadit Bontha, a high school student interning with our CogComp group, was a rewarding experience, offering me a fresh perspective on language models. Exploring Philadelphia’s iconic sites, such as City Hall and the Benjamin Franklin Museum, along with my peers, added to the memorable experiences of this summer. Overall, I gained a deeper understanding of research and thoroughly enjoyed collaborating with my peers. I am grateful to Professor Dan Roth and Dr. Vivek Gupta for this invaluable opportunity.

Aadit Bontha
HS Student at La Salle College High School

During my internship at UPenn, I had the opportunity to work on a project centered around large language models (LLMs) and their ability to extract and organize data from various sources. My main task was to explore how these models could generate structured answers, like tables, from diverse data inputs—a challenge that required both technical skill and a deep understanding of AI.

Mentor Vivek Gupta with Aadit, holding matching program certificates

I started by working with the TANQ dataset, which includes a vast collection of question-answer pairs. My role involved extracting relevant information and using LLMs to generate structured outputs. This required extensive coding in Python, utilizing tools like BeautifulSoup for web scraping and the JSON module for data parsing.
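As a minimal sketch of the data-parsing side of such a pipeline (the record layout below is hypothetical, not TANQ's real schema), nested JSON question-answer records can be flattened into rows that a model prompt or a table can consume:

```python
import json

# Hypothetical TANQ-style records; field names are made up for illustration.
raw = """
[{"question": "Which rivers flow through Cairo and Khartoum?",
  "answers": [{"text": "Nile", "source": "wiki/Nile"}]},
 {"question": "Which peaks exceed 8000 m?",
  "answers": [{"text": "Everest", "source": "wiki/Everest"},
              {"text": "K2", "source": "wiki/K2"}]}]
"""

records = json.loads(raw)

# Flatten nested answers into (question, answer) rows for downstream use.
rows = [(r["question"], a["text"]) for r in records for a in r["answers"]]
for q, a in rows:
    print(q, "->", a)
```

In the real pipeline, the raw inputs would come from scraped pages (via BeautifulSoup) rather than an inline string, but the extract-then-flatten step looks much the same.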

One of the most interesting aspects of the internship was experimenting with different AI models, such as GPT-3.5 and Gemini 1.5. Through this, I gained insights into how these models perform in various scenarios—GPT-3.5 excels at handling complex queries, while Gemini 1.5 is more effective with simpler, structured tasks.

I also focused on refining the prompts we used to optimize model performance. This involved understanding how to craft questions that would elicit the most accurate and relevant responses from the models. Overall, the internship was a highly educational experience that enhanced my skills in AI and data processing, providing me with valuable insights into the practical applications of these technologies.

CCG papers at ACL 2024

The 2024 Annual Meeting of the Association for Computational Linguistics (ACL) is underway in Bangkok! We’re excited to share the work that’s being presented and published from CCG and our collaborating authors. You can find links to our ACL papers below!

ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models

In this position paper, we argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking that draws upon insights from disciplines such as user experience research and human behavioral psychology to ensure that the experimental design and results are reliable. To design an effective human evaluation system in the age of generative NLP, we propose the ConSiDERS-The-Human evaluation framework, consisting of 6 pillars — Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.

Aparna Elangovan, Ling Liu, Lei Xu, Sravan Bodapati, and Dan Roth, ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models (ACL 2024)

Winner of the Outstanding Paper Award at the ACL2024 Workshop on Knowledgeable LMs
Is Table Retrieval a Solved Problem? Exploring Join-Aware Multi-Table Retrieval

Retrieving relevant tables containing the necessary information to accurately answer a given question over tables is critical to open-domain question-answering (QA) systems. However, many questions require retrieving multiple tables and joining them through a join plan that cannot be discerned from the user query itself. In this paper, we introduce a method that uncovers useful join relations for any query and database during table retrieval. We use a novel re-ranking method formulated as a mixed-integer program that considers not only table-query relevance but also table-table relevance that requires inferring join relationships.

Peter Baile Chen, Yi Zhang, and Dan Roth, Is Table Retrieval a Solved Problem? Exploring Join-Aware Multi-Table Retrieval (ACL 2024)

FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts

This paper introduces FlowVQA to overcome the shortcomings of existing visual question answering benchmarks in visual grounding and spatial reasoning. FlowVQA features 2,272 flowchart images and 22,413 question-answer pairs to evaluate tasks like information localization, decision-making, and logical reasoning. The evaluation of various multimodal models highlights FlowVQA’s potential to advance multimodal modelling and improve visual and logical reasoning skills.

Shubhankar Singh, Purvi Chaurasia, Yerram Varun, Pranshu Pandya, Vatsal Gupta, Vivek Gupta, and Dan Roth, FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts (ACL-Findings 2024)

Evaluating LLMs’ Mathematical Reasoning in Financial Document Question Answering

In this paper, we assess LLM robustness in complex mathematical reasoning with financial tabular datasets, revealing that LLMs struggle with increasing table and question complexity, especially with multiple arithmetic steps and hierarchical tables. The new EEDP technique enhances LLM accuracy and robustness by improving domain knowledge, extracting relevant information, decomposing complex questions, and performing separate calculations.

Pragya Srivastava, Manuj Malik, Vivek Gupta, Tanuja Ganu, and Dan Roth, Evaluating LLMs’ Mathematical Reasoning in Financial Document Question Answering (ACL-Findings 2024)

Interview with Celine Lee (CCG 2019-2020)

Celine Lee
Celine Lee, PhD student


Celine Lee joined the Cognitive Computation Group as an undergraduate/master’s student researcher in 2019 and graduated from Penn in 2020. She is now a PhD candidate at Cornell Tech. Celine explores questions in structured language, particularly problems in programming language semantics and reasoning.

Hi, Celine! Please tell me where you’re living and what you’re doing these days.

I live in New York City, working on my PhD at Cornell Tech, the campus on Roosevelt Island.

What are the most rewarding things about your current work?

The most fun thing about research is how big the search space of problems is. I get to spend every day thinking about where the interesting open problems are, then talking and working with some of the brightest minds in the field to devise experiments to address them. Most days don’t look the same, because I can’t predict exactly what the path to the solution looks like.

The most surprising?

Something that still surprises me every day is how small this community really is. I’ll meet a friend of a friend or join some colleagues for lunch, and suddenly I’m putting all these new faces to the names on papers that I have been reading for years. And everyone is so excited to talk about what we’re all obsessed with: our shared research interests!

That’s great! I remember your passion for research from your work with us at Penn. How did you originally get involved with the group?  I remember you participated in the Google Explore Research program in early 2020.

This is correct! I started working with Dan after taking his machine learning course, then got involved with the Google Explore Program soon after.

How connected is your work now with what you did in our group?

At CCG, I was working on semantic role labeling systems. Now I’m continuing my work on structured language tasks, but the grammar is that of computer programming languages. The tension between the differing levels of ambiguity of natural language and high-level programming languages, down the compute stack to compiler IRs and all the way to bits, leads to interesting questions about the correctness, scalability, and adaptability of automatic programming systems.

What are your thoughts about the state of AI?

Many brilliant people are asking and answering many questions that make computers more adept than I ever imagined possible. I was skeptical at first, then a bit scared, then ultimately excited, because now I have more powerful tooling to think bigger and tackle some crazier ideas.

That does sound exciting!
I know you also parlay your varied interests into creative work alongside your academic work. Will you talk a bit about your writing?

Over the pandemic, I found myself with an unprecedented abundance of time to explore topics only barely related to my work. This coincided with my increasing involvement with NLP research, through which I (1) learned a structured methodology for asking and answering questions, and (2) became extra interested in language. So I wrote and put out my first few blog posts, which turned out to be surprisingly super fun.

snippet from “Donut Wheel”

Fast forward through the past few years, and this hobby has spiraled out into various formats: academically-leaning blog posts, short and silly illustrated zines, personal musings and essays… A side benefit of writing as a personal creative endeavor: writing papers for work is much less intimidating now.


We have a number of undergrad and master’s students joining us as interns this summer.  Any advice or thoughts about working with the group?

I have two primary pieces of advice. One is that Professor Roth’s expertise and experience make him one of the best people you could work with and learn from, and the same is true of all the other people around you in the lab. It would be wise to talk to everyone and learn their specialties so that you can make the most of the resources around you.

The other piece of advice is to practice storytelling as much as possible: who are you as a researcher? Why is your work a compelling piece of science amid today’s massive volume of NLP work? I think you should be able to convince someone who isn’t personally invested in you, but is interested in machine learning, to root for your success.

That’s excellent advice.  Thank you so much for this interview!
To learn more about Celine’s research and creative pursuits, please visit her website.

snippet from “Please Be Seated”

CCG Papers at NAACL 2024

The 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) is coming up, from June 16-21 in Mexico City. We’re excited to share the work that’s being presented and published from CCG and our collaborating authors. You can find links to our NAACL papers below!

Sub-Sentence Encoder: Contrastive Learning of Propositional Semantic Representations

Figure 1: Given a proposition in a sentence (represented by a highlighted subset of tokens), the sub-sentence encoder produces a contextual embedding for the meaning of the proposition.

Text embeddings typically produce one embedding for the entire text sequence, but what if the text is long and says many things? Check out the Sub-Sentence Encoder, a contextual text encoder model that learns to embed individual pieces of meaning in text.

Sihao Chen, Hongming Zhang, Tong Chen, Ben Zhou, Wenhao Yu, Dian Yu, Baolin Peng, Hongwei Wang, Dan Roth, and Dong Yu, Sub-Sentence Encoder: Contrastive Learning of Propositional Semantic Representations (NAACL 2024).

SocREval: Large Language Models with the Socratic Method for Reference-Free Reasoning Evaluation

The paper develops a reference-free evaluation of the reasoning abilities of LLMs, surpassing GPT-4 at evaluating reasoning.

Hangfeng He, Hongming Zhang, and Dan Roth, SocREval: Large Language Models with the Socratic Method for Reference-free Reasoning Evaluation (NAACL Findings 2024).

What if you said that differently?: How Explanation Formats Affect Human Feedback Efficacy and User Perception

In this work, we study the effect of intermediate explanation formats on the effectiveness of human feedback for correcting QA model responses. Further, we investigate the properties of explanations which allow users to understand and trust responses.

Chaitanya Malaviya, Subin Lee, Dan Roth, and Mark Yatskar, What if you said that differently? How Explanation Formats Affect Human Feedback Efficacy and User Perception (NAACL 2024).

ExpertQA: Expert-Curated Questions and Attributed Answers

This work conducts expert evaluation of responses to domain-specific questions according to various axes of attribution and factuality. Based on our evaluation, we present ExpertQA, a high-quality long-form QA dataset with 2177 questions spanning 32 fields, along with verified answers and attributions for claims in the answers.

Chaitanya Malaviya, Subin Lee, Sihao Chen, Elizabeth Sieber, Mark Yatskar, and Dan Roth, ExpertQA: Expert-Curated Questions and Attributed Answers (NAACL 2024).

ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks

Figure 1: An example of how original evidence is edited by ReEval. The question is “When did Athens emerge as the wealthiest Greek city state?” The desirable answers, respectively, for answer swapping (Category 1) and context enriching (Category 2) are “the early 4th century BCE” and “the late 6th century BCE”. ChatGPT answers are next to the emoji.

Despite remarkable advancements in mitigating hallucinations in large language models (LLMs) by retrieval augmentation, it remains challenging to measure the reliability of LLMs using static question-answering (QA) data. Inspired by adversarial machine learning, we investigate the feasibility of automatically perturbing existing static benchmarks for dynamic evaluation. Specifically, this paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence and generate new test cases for evaluating the LLMs’ reliability in using new evidence for answering. 

Xiaodong Yu, Hao Cheng, Xiaodong Liu, Dan Roth, Jianfeng Gao, ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks (NAACL Findings 2024).