Please note: This master’s thesis presentation will take place in DC 3301, the DSG Lab.
Sepideh Abedini, Master’s candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Tamer Özsu
Natural language–to–SQL (text-to-SQL) systems aim to enable users to interact with relational databases using natural language rather than SQL. Recent advances in large language models have significantly improved the performance of these systems, making them increasingly practical for real-world applications. With the rapid pace of progress and the growing adoption of text-to-SQL systems, proper benchmarking has become essential. However, existing benchmarks typically rely on a single correctness metric, lack alignment with real-world query usage patterns, and do not evaluate the scalability of generated queries, which limits their ability to provide a realistic and practical evaluation.
This thesis introduces SQLyzr, a comprehensive text-to-SQL benchmark and evaluation framework designed to address these limitations. SQLyzr incorporates a fine-grained taxonomy of SQL queries and reports evaluation results at the level of query categories, enabling detailed insights into system performance across different query types. In addition, SQLyzr extends traditional evaluation by introducing complementary metrics that assess not only correctness but also efficiency and structural complexity of the generated SQL queries. To better reflect real-world usage, SQLyzr aligns the distribution of query categories with empirical SQL workloads and supports dataset scaling to enable evaluation on larger databases.
Building on these ideas, we also introduce a configurable text-to-SQL benchmarking framework that allows users to customize and extend benchmark components such as the workload, dataset, and evaluation metrics. The framework further provides novel features, including detailed error analysis for identifying generated queries that are incorrect only due to minor issues, and workload augmentation for synthesizing additional NL–SQL pairs that target the weaknesses of a specific text-to-SQL system.
We use SQLyzr to evaluate two state-of-the-art text-to-SQL systems that achieve similar overall correctness scores. Our results demonstrate that SQLyzr enables clearer comparisons between systems and reveals deeper insights into their relative strengths and weaknesses.