PhD Seminar • Data Systems • Query Expansion in the Era of Large Language Models

Wednesday, June 24, 2026 12:00 pm - 1:00 pm EDT (GMT -04:00)

Please note: This PhD seminar will take place in DC 3301.

Amin Bigdeli, PhD candidate
David R. Cheriton School of Computer Science

Supervisors: Professors Charles Clarke, Ebrahim Bagheri

Query expansion has long served as a foundational technique in information retrieval, bridging the vocabulary gap between user queries and relevant documents. Traditional approaches, including term-based feedback methods and statistical expansion techniques, rely on surface-level signals, limiting their ability to capture the intent behind a query. The emergence of large language models has fundamentally changed this landscape, offering the capacity to generate rich, semantically meaningful expansions. Yet most LLM-based methods treat the generator as a black box, producing plausible text without articulating what transformation is being applied or verifying whether the expansion actually improves retrieval on the target corpus. Furthermore, progress in this area is constrained by the absence of a unified framework that enables systematic development, fair comparison, and reproducible experimentation.

This work addresses these gaps through three contributions. First, we introduce ReFormer, a pattern-guided framework that induces a compact library of reusable reformulation patterns from pairs of queries and empirically stronger reformulations, and selects an appropriate pattern for each new query based on its retrieval context, making the reformulation policy explicit and transferable. Second, we present ADORE, an iterative framework that turns retrieval outcomes into structured feedback for the next query expansion round. At each iteration, a relevance assessor evaluates retrieved documents against the original query and partitions them into graded tiers, guiding the expansion to reinforce effective signals, incorporate missing aspects, and suppress sources of drift. Third, we present QueryGym, an open-source toolkit for reproducible LLM-based query expansion that provides a unified environment for implementing, executing, and comparing reformulation methods. Across standard retrieval benchmarks spanning passage retrieval, zero-shot retrieval, and reasoning-intensive retrieval, these contributions demonstrate consistent improvements over classical feedback methods and recent LLM-based approaches, while providing the infrastructure needed for reproducible experimentation in this rapidly growing area.
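
To make the iterative feedback idea more concrete, the following is a minimal, illustrative sketch of a retrieve, assess, and expand loop. It is not the ADORE implementation. The term-overlap retriever, the coverage-based assessor, the tier names, and every function name (`retrieve`, `assess`, `expand_iteratively`) are assumptions made for illustration, and a real system would presumably use a proper retriever and an LLM-based assessor and generator. The sketch only reinforces terms from the top tier and suppresses terms seen in the bottom tier. It does not model the incorporation of missing aspects.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (toy stand-in for real text analysis)."""
    return text.lower().split()


def retrieve(query_terms, corpus, k=3):
    """Rank documents by term overlap with the query (hypothetical toy retriever)."""
    scored = []
    for doc_id, text in corpus.items():
        doc_terms = Counter(tokenize(text))
        score = sum(doc_terms[t] for t in set(query_terms))
        scored.append((score, doc_id))
    # Highest score first; ties broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored[:k] if score > 0]


def assess(original_terms, doc_text):
    """Toy graded assessor: judges a document against the ORIGINAL query only."""
    covered = len(set(original_terms) & set(tokenize(doc_text)))
    if covered >= 2:
        return "high"
    if covered == 1:
        return "partial"
    return "irrelevant"


def expand_iteratively(query, corpus, rounds=2, terms_per_round=2):
    """Run several rounds of retrieval, tiered assessment, and expansion."""
    original = tokenize(query)
    expanded = list(original)
    for _ in range(rounds):
        retrieved = retrieve(expanded, corpus)

        # Partition the retrieved documents into graded tiers.
        tiers = {"high": [], "partial": [], "irrelevant": []}
        for doc_id in retrieved:
            tiers[assess(original, corpus[doc_id])].append(doc_id)

        # Reinforce: candidate terms come from the top tier.
        good = Counter(t for d in tiers["high"] for t in tokenize(corpus[d]))
        # Suppress drift: terms that appear in documents judged irrelevant.
        drift = {t for d in tiers["irrelevant"] for t in tokenize(corpus[d])}

        # Drop previously added terms that now look like drift; keep the original query.
        expanded = [t for t in expanded if t in original or t not in drift]
        candidates = [t for t, _ in good.most_common()
                      if t not in expanded and t not in drift]
        expanded.extend(candidates[:terms_per_round])
    return expanded


if __name__ == "__main__":
    toy_corpus = {
        "d1": "jaguar car engine speed",
        "d2": "jaguar cat habitat rainforest",
        "d3": "car engine repair manual",
        "d4": "banana bread recipe",
    }
    print(expand_iteratively("jaguar car", toy_corpus))
```

The one design choice worth noting is that the assessor always judges documents against the original query rather than the expanded one, mirroring the description above. In this sketch that anchoring is what keeps terms introduced in earlier rounds from validating themselves.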