Research

My research is on data management. Although I have done work in basic database technologies such as query processing, transaction processing, and database integration, the main focus of my research follows two threads: (1) application of database technology to non-traditional data types, and (2) distributed & parallel data management.

Data management(X)

Investigating how database technology can be applied to data types that are more complex than business data processing (for which relational systems are the perfect fit) has always been one of my interests. Frank Tompa characterizes this type of work as Data Management(X) where X is the data type of interest -- we also have a graduate course with exactly this focus (CS 741). At different times in my career, X has been equal to one or more of the following: {"object" data (in the sense of object databases), multimedia data, temporal data, spatial data, XML, stream data}. Currently, my focus is on graph data and RDF data.

Graph data management

Graphs have always been important data types for database researchers. With the recent growth of social networks, Wikipedia, Linked Data, RDF, and other networks, the interest in managing very large graphs have again gained momentum. The ongoing work in this area (with PhD student Khaled Ammar) focuses on parallel analysis of dynamic (i.e., time-varying) graphs. For more information see below under Distributed & Parallal Data Management.

RDF data management

Resource Description Framework (RDF) has been proposed for modeling Web objects as part of developing the "semantic web". It has also gained attention as a way to accomplish web data integration. For example, the Linking Open Data (LOD) cloud is a distributed RDF knowledge base created over hundreds of autonomous datasets. Currently, the LOD cloud contains more than 25 billion triples, and its size is doubling every year. As the volume of RDF data has increased, interesting data management issues have arisen. We study the processing of SPARQL queries over RDF data using a graph-theoretic approach: we represent both the RDF data and the SPARQL queries a graphs and conver the query evaluation problem toone of subgraph matching. My work in this area follows two parallel threads.

The first thread, joint with Prof. Lei Chen (Hong Kong University of Science and Technology) and Prof. Lei Zou (Peking University) has developed a system, called gStore [1, 2], that incorporates a global index (called VS*-tree) over the RDF data following the mapping of both node labels and edge labels to bit signatures. The query processing methodology uses a filter-and-evaluate strategy. Given a query graph, VS*-tree is searched for each node of the query graph to generate a list of candidate matching RDF graph nodes. These lists are then joined using the query graph edge labels as join predicates to produce the set of candidate matches for the entire query. The signatures of these matches are then expanded for further and more precise checking. We investigate pruning techniques that ensure efficient generation of candidate matches.A second problem that we investigate is the processing of aggregation queries over large RDF data sets. We propose a processing approach that partitions aggregate queries into smaller parts (called star queries), processes these efficiently, and joins the results of star queries to obtain more general results. We develop indexes to assist in executing star queries and to facilitate joining their results.

The second research thread (with PhD student Güneş Aluç) focuses on the adaptive management of RDF data. Although some existing techniques address adaptivity to changes in the RDF data, they do not consider adaptivity to the changes in the workload. Consequently, the underlying database layouts and the indexes created over the databases are fixed a priori, and they cannot easily handle changing workloads. In fact, our experiments demonstrate that existing systems are unable to achieve
consistently good performance across different types of SPARQL workloads, and that there can be orders of magnitude difference between the peak and base performance of a system [3].

What we are envisioning is an adaptive RDF management system (RMS) design that can continuously and seamlessly adapt its physical configuration to achieve steady and more optimal performance across diverse and dynamically changing workloads [4]. While [4] identifies a much broader range of challenges, Gunes' thesis begins with the observation that in a graph-based RMS, query performance depends on two major components of physical design, namely,

  • how data are clustered in the storage system and in main memory, and
  • how data are indexed.

Consequently, the first problem is related to adaptively aligning the way database records are clustered in the storage system and/or in main memory with the most recent query patterns in the workload. This incurs three challenges. First, even in the worst case when the clustering is random,
correct answers should be produced, which cannot be achieved with just an arbitrary decomposition of a query [5]. Second, it is not trivial which cost function would accurately represent the performance trade-offs of a given clustering with respect to a series of query patterns. Third, computing a clustering that sufficiently minimizes this cost function in linear-time---which is crucial for performance---requires extra investigation. We address these challenges in [5].

The second problem is to design indexes that can be maintained adaptively, both when the clustering is updated and when the query patterns in the workloads change [4]. In tis work, we consider indexing at two levels: the structural-level index organizes the clusters based on their structural similarity,
while the instance-level index stores the distinguishing features of each cluster [6]. The major challenge in maintaining the structural-level index is to devise an efficient algorithm that can incrementally group similar clusters, even when the clusters are constantly being updated. The second challenge is that the instance-level index is constructed over multi-dimensional features; hence, novel sorting and indexing algorithms need to be developed to support adaptive multi-dimensional indexing.

  1. Lei Zou, Jinghui Mo, Dongyan Zhao, Lei Chen, and M. Tamer Özsu. "gStore:Answering SPARQL queries via subgraph matching," Proc. VLDB Endowment, 4(8): 482-493, 2011.
  2. Lei Zou, M. Tamer Özsu, Lei Chen, Xuchuan Sheng, Ruizhe Huang, and Dongyan Zhao. "gStore: A Graph-based SPARQL Query Engine," VLDB Journal, 23(4): 565-590, 2014.
  3. Güneş Aluç, Olaf Hartig, M. Tamer Özsu, and Khuzaima Daudjee. Diversified stress testing of RDF data management systems. In Proc. 13th Int. Semantic Web Conference, Part I, pages 197–212, 2014.
  4. Güneş Aluç, M. Tamer Özsu, and Khuzaima Daudjee. Workload matters: Why RDF databases need a new design. Proc. VLDB Endowment, 7(10):837–840, 2014.
  5. Güneş Aluç, M. Tamer Özsu, Khuzaima Daudjee, and Olaf Hartig. Executing queries over schemaless RDF databases. In Proc. 31st Int. Conf. on Data Engineering, 2015. Forthcoming.
  6. Güneş Aluç, M. Tamer Özsu, Khuzaima Daudjee, and Olaf Hartig. chameleon-db: a workload-aware robust RDF data management system. Technical Report CS-2013-10, University of Waterloo, 2013.

Distributed & parallel data management

For each of the data types mentioned above, I have eventually investigated their management in a distributed and parallel setting. The main current project in this space is parallel analysis of time-varying graphs.

Parallel Analysis of Time-Varying Graphs (Khaled Ammar)

(This is a summary; a more extensive and complete project page is under development.)

Massive data analysis has become a major challenge for scientific discovery, as various scientific communities embrace computational techniques and, consequently, generate more data. Examples include computational biology, chemical data analysis, drug discovery, health informatics analysis, social networking, web link analysis, and communication networks. The amount of data has increased significantly; for example, Twitter reported that in 2012 more than 400 million tweets were sent per day. Walmart handles more than one million transactions per day. The increasing data volumes have increased the importance of the management, efficient querying, and analysis of these data. These data are usually interrelated, and the relationships among data are important and typically implemented using a graph data structure. Consequently, graphs have re-emerged as important data structures within this context. For example, tweets connect people, financial transactions connect banks, payers and payees together, multiple medical sensors may suggest relationships between activities in different parts of the body, and the Internet connects most smart devices in the world. There is a serious need to analyze very large graphs that implement these relationships. Typically, graph algorithms do not scale as they generally assume that the entire data set can reside in the memory and are developed as centralized algorithms that execute on one machine. Most existing graph processing systems assume that graphs are static.

The main objective of our research is to develop algorithms, techniques, and systems for efficient parallel graph mining over very large dynamic (i.e., time-varying) graphs. Consequently, a major part of this research will focus on developing data mining and analysis techniques that can incrementally handle updates in the graph. A second important part is to consider parallelization of these techniques to ensure that they can scale.

As a first step of this study, we are conducting the first universal evaluation study for distributed graph systems supported by the Waterloo Graph Benchmark (WGB [1]). The result of this study will include tremendous amount of comparative and quantitative information about the most recent parallel/distributed graph systems.

  1. Khaled Ammar, M. Tamer Özsu. “WGB: Towards a Universal Graph Benchmark,” In Proc. 4th Workshop on Big Data Benchmarking, 2013, pages 58-72.

Some previous research projects

Better organization coming in the near future...
 
 
[University of Waterloo]
University of Waterloo
[Department of Computer Science]
Computer Science
[M. Tamer Özsu's home page]
M. T. Özsu

Copyright © M. Tamer Özsu. All rights reserved.
Last update: January, 2012