Edward Chan, Charles L.A. Clarke, Gordon Cormack, Khuzaima Daudjee, Lukasz Golab (Management Sciences), Maura Grossman, Ihab F. Ilyas, Jimmy Lin, M. Tamer Özsu, Kenneth Salem, Semih Salihoglu, Mark Smucker (Management Sciences), David Toman, Frank Wm. Tompa, Olga Vechtomova (Management Sciences), Grant Weddell
The Data Systems Group builds innovative, high-impact platforms, systems, and applications for processing, managing, analyzing, and searching the vast collections of data that are integral to modern information societies – colloquially known as “big data” technologies. Our capabilities span the full spectrum from unstructured text collections to relational data, and everything in between, including semi-structured sources such as time series, log data, graphs, and other data types. We work at multiple layers in the software stack, ranging from storage management and execution platforms to user-facing applications and studies of user behaviour. Our research tackles all phases of the information lifecycle, from ingest and cleaning to inference and decision support, and covers the following areas.
Data systems infrastructure. Our research addresses issues that are fundamental to building an effective and efficient infrastructure for management and analysis of very large and heterogeneous data collections. The specific research topics include:
- Cloud data management. Our work in this area has focused on techniques for ensuring database consistency and providing transactional guarantees in geographically distributed, multi-data-centre settings. We are also developing tools to simplify the development of applications that use scalable NoSQL database systems.
- Parallel and distributed data management and analytics platforms. We do both theoretical and systems research on this topic, focusing on the theoretical bounds on parallelization as well as replication, distributed transactions, and data integration.
- Data stream processing and warehousing. Our goals are to investigate and improve data management systems geared towards processing high volume real-time data feeds, including data stream management systems, stream data warehouses, and event processing systems.
- Efficient and multi-stage retrieval architectures. Specific research includes alternative organizations of inverted indexes based on quantized impact scores and indexes based on treap data structures, which perform faster ranked unions and intersections while consuming less space.
- Novel data management system architectures. Our goal is to design database systems that exploit the capabilities of modern processors, storage, and networks.
Data systems applications. Our research addresses the data management and analysis needs of specific applications. The particular application areas change over time; the topics currently under investigation include the following:
- Graph data management and analysis. Our work in this area follows two tracks: the development of efficient and scalable RDF data stores (both their management and reasoning over them), and the development of generic parallel graph processing and analysis solutions.
- Mathematics retrieval. Our work focuses on efficiently combining formula search with text search.
- Opinion mining and sentiment analysis. We are developing methods to identify sentiment expressed in natural language text and mine opinions from a variety of sources, such as user reviews and social media.
- Real-time filtering and search on social media streams. Our research focuses on tracking evolving news events, extracting queries from new articles for tracking purposes, and creating user models for evaluating filter output.
- Spatial, temporal, and multimedia databases. Our research addresses the development of database technology to fulfill the requirements of large classes of applications that deal with spatial objects that evolve over time, move, and may consist of multiple media types.
- High recall information retrieval. The principal purpose of high recall information retrieval is to find as close as practicable to 100% of records or documents in a collection that are relevant to an information need.
Supporting user interactions. A fundamental challenge in building effective data systems is ensuring their usability, which requires novel interaction paradigms that go beyond querying and a better understanding of user behaviour to inform interface design. Our research in this area includes:
- Complex question answering and exploratory search. This research supports users with complex information needs, focusing on the use of knowledge graphs to assist users and on user behaviour in exploratory search tasks.
- Contextual suggestion and point-of-interest recommendation. Our research includes content-oriented methods of mapping preferences as users move, as well as the development of crowd-sourcing methods for evaluation.
- Understanding assessor behaviour. Knowing which documents in a large collection are relevant to an information need is a required part of test collection construction as well as legal e-discovery. We are studying the behaviour of secondary assessors and developing methods of reducing the number of errors that these assessors make.
Supporting the information lifecycle. Addressing all phases of the information lifecycle, from ingest and cleaning to inference and decision support, requires attention to each of these phases. Our work in support of the lifecycle includes the following:
- Data cleaning. Data cleaning is the process of detecting and correcting anomalies that exist in real data. We take a pragmatic approach to this problem that is aimed at developing adoptable solutions in real applications. We pursue three main directions: building generic data cleaning platforms, non-destructive data cleaning, and trusted and knowledge-based data cleaning.
- Prediction of search effectiveness. Offline measures of search systems are traditionally designed with little to no model of the users of the systems. We are developing new effectiveness measures that model user behaviour with the search system.
- Privacy preserving information retrieval. We address the problem of evaluating retrieval systems on sensitive data without disclosing that data to the evaluators. We also investigate views of the data that embody effective de-identification of records.
In addition to the generic computing infrastructure provided by the Cheriton School of Computer Science, the Data Systems Group has established specific research laboratories to support this research. In particular, three separate cluster computing systems are being established to support these research projects. In aggregate, these clusters will provide over 100 multi-core computing nodes.
The research conducted within the group is funded by multiple agencies and industries. Currently active funding sources are the Natural Sciences and Engineering Research Council (NSERC) of Canada, the Canada Foundation for Innovation, the Ontario Research Fund, Mitacs, Google, Microsoft, Thomson Reuters, and IBM.