PhD Seminar • Data Systems • Effective First-stage Formula Retrieval via Approximately Matching Common Substructures

Wednesday, June 8, 2022 12:00 pm - 1:00 pm EDT (GMT -04:00)

Please note: This PhD seminar will be given online.

Wei Zhong, PhD candidate
David R. Cheriton School of Computer Science

Supervisor: Professor Jimmy Lin

We introduce a top-down designed similarity measurement and its implementation for the retrieval of math formulas. This measurement tries to find the maximum number of paths matched between multiple common subtrees of query and candidate math expressions in an Operator Tree representation. An approximate algorithm is proposed to calculate the new similarity metrics on an inverted index. In addition to structure similarity, a greedy scoring schema for symbol-wise similarity is also proposed. Compared with a similar alignment algorithm that optimally computes a single best-matched common subtree at the reranking stage, our efficiency makes it practical to be applied to the first stage of retrieval with multiple common substructure awareness and better search recall.

Evaluation results on the NTCIR Wikipedia Formula Browsing Task dataset show that our system can obtain the highest BPref scores in all existing single-model math information retrieval systems, and it is only inferior to a deep bi-encoder retriever. In the future, we plan to extend this model to a larger scale dataset and apply dynamic pruning to the current algorithm to further speed up the efficiency.