Master’s Thesis Presentation • Data Systems • What Do You Mean? Using Large Language Models for Semantic Evaluation of NL2SQL Queries

Wednesday, April 16, 2025 12:00 pm - 1:00 pm EDT (GMT -04:00)

Please note: This master’s thesis presentation will take place in DC 3301.

Harrum Noor, Master’s candidate
David R. Cheriton School of Computer Science

Supervisor: Professor Tamer Özsu

While significant research focuses on improving natural language to SQL (NL2SQL) translation, the evaluation of generated queries remains an understudied problem. The current metrics, Exact Match (EM) and Execution Accuracy (EA), fail to address critical nuances. EM prioritizes syntactic fidelity over semantic equivalence, penalizing valid query variations, while EA’s reliance on execution results introduces false positives when incorrect SQL coincidentally returns the correct output due to dataset characteristics. These limitations make existing frameworks inadequate for real-world deployment, where robustness and intent preservation are essential.
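The two failure modes described above can be illustrated with a minimal sketch. It assumes a toy in-memory SQLite table and a simplified string-level stand-in for EM (benchmark implementations typically compare parsed query components instead); the table, the queries, and the function names `exact_match` and `execution_match` are hypothetical and are not taken from the thesis.

```python
import sqlite3

def exact_match(pred: str, gold: str) -> bool:
    # Simplified EM stand-in: compare case- and whitespace-normalized strings.
    def norm(q: str) -> str:
        return " ".join(q.lower().split())
    return norm(pred) == norm(gold)

def execution_match(conn: sqlite3.Connection, pred: str, gold: str) -> bool:
    # EA: compare the result rows of both queries on one database instance.
    return sorted(conn.execute(pred).fetchall()) == sorted(conn.execute(gold).fetchall())

# Hypothetical toy database (an assumption for illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER);
INSERT INTO employees VALUES ('Ana', 'Sales', 50), ('Bo', 'Sales', 70), ('Cy', 'IT', 90);
""")

# Question: "Which employees earn more than 60?"
gold = "SELECT name FROM employees WHERE salary > 60"
# A valid rewrite of the gold query: same meaning, different surface form.
variant = "SELECT e.name FROM employees AS e WHERE NOT e.salary <= 60"
# A wrong threshold that happens to return the same rows on this data.
wrong = "SELECT name FROM employees WHERE salary >= 70"

for label, query in [("variant", variant), ("wrong", wrong)]:
    print(label, "EM:", exact_match(query, gold), "EA:", execution_match(conn, query, gold))
```

On this data the valid variant fails EM but passes EA, while the wrong query also passes EA because no salary falls between 61 and 69. Adding such a row would expose the difference, which is the sense in which EA depends on dataset characteristics.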

The advent of large language models (LLMs) offers new opportunities to improve these evaluation methodologies. We propose a hybrid framework that integrates execution-based validation with LLM-driven semantic analysis to address these gaps. Our pipeline first validates candidate SQL queries through EA, then applies Qwen 2.5 Coder, a 1.5B-parameter model optimized for code generation, to perform semantic equivalence checks. This two-fold process eliminates false positives by detecting logical inconsistencies (e.g., incorrect JOIN conditions or aliases) that EA overlooks. Crucially, Qwen 2.5 operates without database connectivity or schema linking, enabling schema-agnostic evaluation through cross-attention between natural language questions and SQL.
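The control flow of such a two-stage check might look like the sketch below. It is an illustration under stated assumptions, not the thesis implementation: the function `evaluate`, the `Judge` signature, and the verdict strings are hypothetical, and `toy_judge` is a hard-coded placeholder where a call to an LLM such as Qwen 2.5 Coder would go. The judge receives only text (question and SQL), never a database connection or schema.

```python
import sqlite3
from typing import Callable

# Hypothetical judge signature: (question, gold_sql, candidate_sql) -> equivalent?
Judge = Callable[[str, str, str], bool]

def evaluate(conn: sqlite3.Connection, question: str, gold_sql: str,
             candidate_sql: str, judge: Judge) -> str:
    # Stage 1: Execution Accuracy on the available database instance.
    try:
        pred_rows = sorted(conn.execute(candidate_sql).fetchall())
    except sqlite3.Error:
        return "incorrect (execution error)"
    if pred_rows != sorted(conn.execute(gold_sql).fetchall()):
        return "incorrect (EA mismatch)"
    # Stage 2: semantic check, run only on queries that passed EA.
    if not judge(question, gold_sql, candidate_sql):
        return "incorrect (EA false positive)"
    return "correct"

def toy_judge(question: str, gold_sql: str, candidate_sql: str) -> bool:
    # Placeholder for an LLM call; it only flags one known-bad predicate
    # so that this demo can run without a model.
    return ">= 70" not in candidate_sql

# Hypothetical toy database (an assumption for illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER);
INSERT INTO employees VALUES ('Ana', 'Sales', 50), ('Bo', 'Sales', 70), ('Cy', 'IT', 90);
""")

question = "Which employees earn more than 60?"
gold = "SELECT name FROM employees WHERE salary > 60"
candidates = [
    "SELECT name FROM employees WHERE NOT salary <= 60",  # valid variant
    "SELECT name FROM employees WHERE salary >= 70",      # passes EA by coincidence
    "SELECT name FROM employees",                         # fails EA
]
for sql in candidates:
    print(evaluate(conn, question, gold, sql, toy_judge), "<-", sql)
```

Running the semantic check only after EA passes keeps the number of model calls low, since queries with mismatched results are already rejected in the first stage.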

Experiments on a combined dataset from Spider and other sources show that pairing EA with Qwen 2.5’s generative reasoning achieves 94% accuracy. The framework identifies 100% of EA’s false positives. Ablation studies reveal that encoder models like Microsoft’s CodeBERT perform best on simple SELECT-WHERE queries (94% F1), whereas Qwen 2.5 maintains 90% consistency across all complexity levels. This work establishes a new paradigm for NL2SQL evaluation.