Please note: This PhD seminar will take place in DC 3301.
Nandan Thakur, PhD candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Jimmy Lin
Traditional IR evaluation (e.g., TREC, following the Cranfield paradigm) constructs test collections from fixed corpora and pooled relevance judgments, a practice that captures few of the challenges faced by RAG applications. This talk begins with the limitations of prevalent IR benchmarks, which suffer from stale data, incomplete relevance labels, or overly simplistic queries. In particular, we motivate why retrieval evaluation must evolve and why modern systems demand a shift in the metrics used for IR evaluation. We then describe FreshStack, a holistic benchmark that addresses these gaps by constructing test collections from recent StackOverflow Q&As and GitHub documents, reflecting real-world programming questions, and we discuss the diversity-focused metrics it uses for evaluation. The goal is to give practitioners insight into the limits of traditional IR evaluation and to guide them toward more realistic, robust evaluation of IR systems in modern RAG applications.
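To make "diversity-focused metric" concrete, below is a minimal sketch of one such measure, alpha-nDCG, which rewards documents covering not-yet-seen relevant nuggets and discounts redundant results. The function name, the doc_nuggets mapping, and the default alpha and k values are illustrative assumptions for this sketch, not code or settings taken from FreshStack.

import math
from typing import Dict, List, Set

def alpha_ndcg(ranking: List[str],
               doc_nuggets: Dict[str, Set[str]],
               alpha: float = 0.5,
               k: int = 10) -> float:
    # A document's gain decays by (1 - alpha) for every earlier occurrence
    # of each nugget it covers, so redundant results contribute less.
    def dcg(docs: List[str]) -> float:
        seen: Dict[str, int] = {}
        score = 0.0
        for rank, doc in enumerate(docs[:k], start=1):
            gain = sum((1 - alpha) ** seen.get(n, 0)
                       for n in doc_nuggets.get(doc, set()))
            for n in doc_nuggets.get(doc, set()):
                seen[n] = seen.get(n, 0) + 1
            score += gain / math.log2(rank + 1)
        return score

    # The ideal ranking is approximated greedily: at each position, pick
    # the document with the largest novelty-discounted marginal gain.
    candidates = [d for d, nugs in doc_nuggets.items() if nugs]
    ideal: List[str] = []
    covered: Dict[str, int] = {}
    while candidates and len(ideal) < k:
        best = max(candidates,
                   key=lambda d: sum((1 - alpha) ** covered.get(n, 0)
                                     for n in doc_nuggets[d]))
        ideal.append(best)
        for n in doc_nuggets[best]:
            covered[n] = covered.get(n, 0) + 1
        candidates.remove(best)

    ideal_score = dcg(ideal)
    return dcg(ranking) / ideal_score if ideal_score > 0 else 0.0

Under this kind of measure, a ranking whose top documents all repeat the same nugget scores lower than one that spreads coverage across distinct nuggets, which is the behaviour diversity-focused evaluation is meant to reward.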