Robert L. McDevitt, K.S.G., K.C.H.S. and Catherine H. McDevitt L.C.H.S. Chair in Computer Science and Information Processing
Georgetown University
Abstract: Many consider "searching" a solved problem, and for digital text processing, this belief is well founded. The problem is that many "real world" search applications involve "complex documents," and such applications are far from solved. Complex documents, or less formally, "real world documents," comprise a mixture of images, text, signatures, tables, and the like, and are often available only in scanned hardcopy formats. Some of these documents are corrupted. Some, particularly those of a historical nature, contain multiple languages. Accurate search systems for such document collections are currently unavailable.
We describe our efforts at building a complex document information-processing prototype. This prototype integrates mature "point solution" technologies, such as document readability enhancement, OCR, signature matching, handwritten word spotting, and search and mining approaches, to yield a system capable of searching "real world documents." The described prototype demonstrates the adage that "the whole is greater than the sum of its parts." Our previous complex document benchmark development efforts are likewise presented.
Having described "real world" search issues, we focus on spelling correction in adverse environments. Two environments are discussed: foreign name search and medical term search. In support of the Yizkor Books project of the Archives Section of the United States Holocaust Memorial Museum, we developed novel foreign name search approaches that compare favorably with the state of the art. By segmenting names, fusing the individual results, and filtering via a threshold, our approach yields a statistically significant improvement over the traditional Soundex and n-gram based search techniques used to search such texts. Thus, previously unsuccessful searches are now supported. Using a similar approach, within the medical domain, automated term corrections are made to reduce transcription errors.
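The segment-fuse-threshold idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' actual system: the scoring weights, the equal-weight fusion of a Soundex match with a bigram Dice similarity, and the 0.6 threshold are all illustrative assumptions.

```python
def soundex(name: str) -> str:
    """Classic Soundex: first letter plus three digits."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    out, prev = name[0].upper(), codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "hw":  # h and w do not separate adjacent codes
            prev = code
    return (out + "000")[:4]

def ngram_sim(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over character n-grams (bigrams by default)."""
    ga = {a.lower()[i:i + n] for i in range(len(a) - n + 1)}
    gb = {b.lower()[i:i + n] for i in range(len(b) - n + 1)}
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

def fused_score(query: str, candidate: str) -> float:
    """Segment both names on whitespace, score each query segment against
    its best-matching candidate segment, and average the segment scores.
    The 50/50 Soundex/bigram weighting is an illustrative choice."""
    scores = []
    for q in query.split():
        best = 0.0
        for c in candidate.split():
            s = 0.5 * (soundex(q) == soundex(c)) + 0.5 * ngram_sim(q, c)
            best = max(best, s)
        scores.append(best)
    return sum(scores) / len(scores)

def search(query: str, names: list, threshold: float = 0.6) -> list:
    """Return (name, score) pairs whose fused score clears the threshold."""
    hits = [(n, fused_score(query, n)) for n in names]
    return sorted([h for h in hits if h[1] >= threshold], key=lambda h: -h[1])
```

For example, the misspelled query "Frider" still retrieves "Frieder" (same Soundex code, high bigram overlap) while an unrelated name such as "Smith" falls below the threshold.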
Finally, we focus on analyzing social media, an additional, non-traditional search environment. By searching and mining such data, unknown or unexpected trends are detected. We explore and demonstrate the validity of the approach in the healthcare space.
Biography: Ophir Frieder holds the Robert L. McDevitt, K.S.G., K.C.H.S. and Catherine H. McDevitt L.C.H.S. Chair in Computer Science and Information Processing and previously served as the Chair of the Department of Computer Science at Georgetown University. He is also Professor of Biostatistics, Bioinformatics and Biomathematics in the Georgetown University Medical Center. In addition to his academic positions, he is the Chief Scientific Officer for UMBRA Health Corp (UHC). He is a Fellow of the AAAS, ACM, IEEE, and NAI.