PhD Seminar • Data Systems • Semantic Table Discovery in Model Lakes: A Benchmark

Wednesday, July 23, 2025 12:00 pm - 1:00 pm EDT (GMT -04:00)

Please note: This PhD seminar will take place in DC 3301.

Zhengyuan Dong, PhD candidate
David R. Cheriton School of Computer Science

Supervisor: Professor Renée J. Miller

Model Lakes are emerging large-scale repositories of machine learning artifacts. Although they greatly facilitate model sharing, discovery still relies on keyword or full-text search over textual metadata, which overlooks the rich, structured information embedded in model reports, especially performance and configuration tables.

In this work, we advance model discovery by leveraging table-discovery techniques within Model Lakes. We first formalize a novel ground-truth methodology for model relatedness, based on three complementary signals: explicit references in model cards, citation links among associated papers, and shared training datasets. We then build and publicly release a benchmark over 100K Hugging Face models, extracting every table from model cards, GitHub READMEs, arXiv preprints, and BibTeX entries. Compared to standard data-lake tables, our tables are smaller but exhibit far denser inter-table relationships, reflecting the tight coupling of model evolution. To retrieve related models, we adapt a canonical data-lake task, unionable table search, and compare it against dense and sparse IR baselines.
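
To give a rough intuition for unionable table search, the sketch below ranks candidate tables by how well their columns line up with the columns of a query table. It is a deliberately simple stand-in and not the method evaluated in the seminar: the token-overlap (Jaccard) scoring, the dict-of-columns table representation, and the names `jaccard`, `unionability`, and `rank_tables` are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def unionability(query, candidate):
    """Average, over query columns, of the best-matching candidate column.

    Tables are dicts mapping a header to a list of cell values (an
    assumed, simplified representation). Each column is reduced to the
    set of its lowercased header and cell values.
    """
    if not query:
        return 0.0
    total = 0.0
    for q_header, q_values in query.items():
        q_tokens = {q_header.lower(), *(str(v).lower() for v in q_values)}
        best = 0.0
        for c_header, c_values in candidate.items():
            c_tokens = {c_header.lower(), *(str(v).lower() for v in c_values)}
            best = max(best, jaccard(q_tokens, c_tokens))
        total += best
    return total / len(query)


def rank_tables(query, lake):
    """Return (table_name, score) pairs, most unionable first."""
    scored = [(name, unionability(query, table)) for name, table in lake.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data, invented purely for illustration.
query = {"model": ["bert-base"], "accuracy": ["0.91"]}
lake = {
    "results_a": {"model": ["bert-base", "roberta"], "accuracy": ["0.91", "0.93"]},
    "cities": {"city": ["Toronto"], "population": ["2.9M"]},
}
ranked = rank_tables(query, lake)
```

Here `results_a` ranks above `cities` because its columns share headers and values with the query table. Practical systems typically replace the raw token overlap with learned column representations.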

Our union-based semantic search achieves 54.8% P@1 overall (54.7% on paper-citation ground truth, 30.8% on model-card inheritance, 30.2% on shared-dataset signals), while simple metadata retrieval peaks at 36.8% P@1. Denser citation-graph edges boost precision to 74.8%, and a header-value concatenation augmentation raises overall P@1 to 60.3%. To our knowledge, this is the first empirical study applying Data Lake management principles to Model Discovery using large-scale real-world machine learning artifacts. By demonstrating that structured table information uncovers deep model relationships, we lay the groundwork for more accurate retrieval, systematic comparison, and seamless integration of models within Model Lakes.
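
For readers unfamiliar with the metric, P@1 (precision at rank 1) is the fraction of queries whose top-ranked result is a true match. The sketch below shows one way to compute it. The function name `precision_at_1`, the input shapes, and the choice to skip queries that have no ground-truth neighbours are assumptions for illustration, not details taken from the benchmark.

```python
def precision_at_1(rankings, ground_truth):
    """Fraction of evaluated queries whose top result is related.

    rankings:     dict mapping a query id to its ranked list of result ids
    ground_truth: dict mapping a query id to the set of related ids
    Queries with no related ids are skipped (an assumed convention).
    """
    hits = 0
    evaluated = 0
    for query, ranked in rankings.items():
        related = ground_truth.get(query, set())
        if not related:
            continue  # nothing to find for this query, so do not count it
        evaluated += 1
        if ranked and ranked[0] in related:
            hits += 1
    return hits / evaluated if evaluated else 0.0


# Toy data, invented purely for illustration.
rankings = {"m1": ["m2", "m3"], "m2": ["m4"], "m3": []}
ground_truth = {"m1": {"m2"}, "m2": {"m1"}, "m3": {"m1"}}
score = precision_at_1(rankings, ground_truth)
```

In this toy case only `m1` has a related model in first position, so one of the three evaluated queries counts as a hit.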