XML Data Management

Overview

There are four ongoing projects: the XDB XML DBMS, XCache, XBench (an XML DBMS benchmark specification), and distributed XML query processing.

XDB (Ning Zhang)

The XDB project (the name is not very innovative, just an abbreviation for XML DBMS) addresses the development of a native XML database. The focus of the project is the management of large collections of documents in a hierarchical data model. Our goal is to store and query terabytes of XML documents, no matter how complex their structures are (in terms of depth, width, and recursion). At this stage, the project focuses on the following issues:
  1. Storage structures for XML documents: the main requirements are:
    • scalability: the storage structures must be able to store terabytes of XML data. We store the two types of data – the XML tree structure (tags and their nesting relationships) and the PCDATA of each XML element – separately. The rationale for this separation is that the two types of data serve different types of constraints in a path query: tree-pattern-based structural constraints operate on the tree structure, while value-based constraints operate on the PCDATA. We store the tree structure by linearizing it into a string. Each XML element tag is mapped to a small unit of data (as small as one byte) in the string, and the units are stored in document order. If the XML document is very large, the mapped string can be divided into sub-strings, each of which is stored in a disk page. The PCDATA of XML elements are stored contiguously in document order, also in disk pages if necessary. This mechanism allows XML documents of virtually unlimited size to be stored in the database (see the linearization sketch after this list).
    • supporting efficient query evaluation: the storage structures should be optimized for XML path queries. Since the PCDATA are usually subject to value-based queries, we can build value-based indexes such as B+-trees on them. For evaluating structural constraints, an efficient single-pass navigational operator can be developed over the linearized tree storage, because elements are stored in document order. The navigational operator sequentially scans the string and maintains internal state for pattern matching (a toy version appears after this list). Lightweight statistics can also be collected for each disk page and stored in the page header; these statistics are used to skip unnecessary page I/Os during the sequential scan.
    • easy to update: when the underlying XML document is updated, the basic storage structure must be easy to update accordingly. Updating the XML tree (deleting or inserting an element) is significantly easier with the string format than with previous mappings (e.g., interval encoding). The advantage of our storage structure is that updating the tree structure affects the string only locally, i.e., one page or a few pages, whereas interval encoding usually requires a global update.
  More details of these techniques can be found in our ICDE 2004 paper.
  2. Efficient physical operators for answering path queries (e.g., XPath) over this storage. Path queries are ubiquitous and differ significantly from traditional relational database queries. The literature offers two main approaches to evaluating path queries: join-based and navigational. Each has its advantages and disadvantages. The join-based approach exploits tag-name indexes and is usually I/O-optimal when the queries involve only ancestor-descendant relationships ("global axes"). The navigational approach does not depend on existing indexes and is usually more efficient when there are only "local axes" – parent-child or sibling relationships. In the ICDE 2004 paper, we define a special type of path expression – Next-of-Kin (NoK) expressions – that contains only "local axes". A physical NoK operator answers a NoK path expression using the navigational approach. A general path expression containing "global axes" can be decomposed into interconnected NoK expressions, with a NoK operator applied to each of them; the intermediate results are then "joined" on the "global axes" using the join-based approach. In this way, we exploit the best of both worlds (a decomposition sketch follows this list). The decomposition of path expressions and the composition of intermediate results are described in detail in our ICDE 2005 poster paper.
  3. Cardinality estimation and cost modeling. Cardinality estimation is crucial to cost-based optimization. Cardinality estimation for path expressions is usually based on a graph-based synopsis structure that summarizes the XML tree and maintains sufficient statistics. To be useful in practice, a synopsis structure should have the following properties: estimates based on it should be reasonably accurate; the estimation algorithm should work for a majority of query types and data sets; the synopsis should adapt to different memory budgets; it should be easy to create and update; and the estimation time should be a very small fraction of the actual query time. To meet all these criteria, we developed a synopsis structure called XSeed (see our ICDE 2006 paper for details) that is accurate and fast for cardinality estimation, using a novel estimation algorithm. In many cases, XSeed improves the accuracy of state-of-the-art synopses by an order of magnitude, even with a smaller memory budget. XSeed is also easy to construct and update, which makes it suitable for implementation in a commercial system. A generic sketch of synopsis-based estimation also follows this list.
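To make the linearized-tree storage concrete, here is a minimal sketch in Python. It is an illustration only: the tuple tokens, in-memory tag dictionary, and Python lists stand in for XDB's actual byte-level encoding, paging, and PCDATA store.

```python
# Linearize an XML tree into a token string in document order, keeping
# PCDATA in a separate store (illustrative encoding, not XDB's format).
import xml.etree.ElementTree as ET

def linearize(elem, tag_dict, tokens, pcdata):
    """Append open/close tokens for `elem` and its subtree in document order."""
    code = tag_dict.setdefault(elem.tag, len(tag_dict))  # map tag to a small code
    tokens.append(('<', code))                           # open-tag token
    if elem.text and elem.text.strip():
        pcdata.append(elem.text.strip())                 # PCDATA stored separately
    for child in elem:
        linearize(child, tag_dict, tokens, pcdata)
    tokens.append(('>', code))                           # close-tag token

doc = ET.fromstring("<a><b>x</b><c><b>y</b></c></a>")
tag_dict, tokens, pcdata = {}, [], []
linearize(doc, tag_dict, tokens, pcdata)
print(tokens)  # [('<', 0), ('<', 1), ('>', 1), ('<', 2), ('<', 1), ('>', 1), ('>', 2), ('>', 0)]
print(pcdata)  # ['x', 'y']
```

For a very large document, the token string would be cut into page-sized sub-strings; the in-memory lists here sidestep paging entirely.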
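A toy version of the single-pass navigational operator can then scan the token string once, maintaining a stack of open tags and reporting elements whose root-to-node path matches a rooted parent-child pattern. The real operator matches general tree patterns and uses per-page statistics to skip I/O; neither is modeled here.

```python
# Single-pass scan over the token string from the previous sketch
# (tag_dict/tokens reproduced here so the snippet runs on its own).
tag_dict = {'a': 0, 'b': 1, 'c': 2}
tokens = [('<', 0), ('<', 1), ('>', 1), ('<', 2), ('<', 1), ('>', 1), ('>', 2), ('>', 0)]

def match_path(tokens, tag_dict, path):
    """Yield token positions whose root-to-node tag path equals `path`."""
    target = [tag_dict[t] for t in path.split('/')]
    stack = []                       # codes of currently open elements
    for pos, (kind, code) in enumerate(tokens):
        if kind == '<':
            stack.append(code)
            if stack == target:      # exact rooted parent-child match
                yield pos
        else:
            stack.pop()

print(list(match_path(tokens, tag_dict, "a/c/b")))  # [4]: the <b> under <c>, not the top-level <b>
```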
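The decomposition of a general path expression can be hinted at in a few lines: splitting at the descendant axis ("//") leaves sub-expressions containing only local axes, each answerable by a NoK operator, after which the pieces are re-joined on the global axes. This is a drastic simplification of the algorithm in the ICDE 2005 poster paper.

```python
# Split an XPath-like expression into NoK pieces at '//' boundaries
# (a caricature of the actual decomposition, for intuition only).
def decompose(path_expr):
    """Return the sub-expressions between '//' axes, in document order."""
    return [piece for piece in path_expr.split('//') if piece]

print(decompose("/book//section/title"))  # ['/book', 'section/title']
```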
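For intuition about synopsis-based cardinality estimation, here is a generic sketch that is emphatically not XSeed: it counts nodes per tag and children per parent-tag/child-tag edge, then chains average fan-outs along a simple path. XSeed's synopsis is far more compact and its estimation algorithm far more accurate.

```python
# Generic graph-synopsis cardinality estimation for simple paths
# (illustrative only; XSeed is a different, more refined structure).
from collections import Counter
import xml.etree.ElementTree as ET

def build_synopsis(root):
    """Count nodes per tag and children per (parent-tag, child-tag) edge."""
    tag_count, edge_count = Counter(), Counter()
    def walk(e):
        tag_count[e.tag] += 1
        for child in e:
            edge_count[(e.tag, child.tag)] += 1
            walk(child)
    walk(root)
    return tag_count, edge_count

def estimate(path, tag_count, edge_count):
    """Estimate the result size of a /-separated path via average fan-outs."""
    steps = path.strip('/').split('/')
    card = float(tag_count[steps[0]])
    for parent, child in zip(steps, steps[1:]):
        if tag_count[parent] == 0:
            return 0.0
        card *= edge_count[(parent, child)] / tag_count[parent]
    return card

doc = ET.fromstring("<a><b><c/><c/></b><b><c/></b></a>")
tags, edges = build_synopsis(doc)
print(estimate("/a/b/c", tags, edges))  # 3.0 (exact here; an estimate in general)
```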

XCache (Emre Cem Sözgen)

Storing the results of frequently issued queries in a semantic cache can expedite query processing significantly. So far, such systems have generally been developed for relational databases, especially relational data warehouses. In the XCache project, our goal is to apply the same ideas to XML databases. The XCache project focuses on:
  1. if a workload is provided, using that workload to select XML views to materialize in the semantic cache.
  2. using the semantic cache to answer XML queries issued to the system.
  3. updating the cache when a query is issued to the system. This happens at runtime and is especially useful when a workload is not provided to the system.
We perform cost-based analysis to determine the views to be stored in the cache: we store the views with the highest benefit, as long as they do not introduce redundancy into the cache (a simplified sketch follows). We are currently working on the theory and will soon verify our ideas experimentally.
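As a simplified sketch of this selection step, assume every candidate view carries an estimated benefit (saved query cost weighted by frequency) and a size; a greedy pass then keeps the best benefit-per-size views that fit the cache budget and are not subsumed by an already cached view. The names, numbers, and prefix-based containment test are invented for illustration; real XPath/XQuery view containment is much subtler.

```python
# Greedy, cost-based view selection under a cache budget (hypothetical
# benefit/size numbers; the containment test is a crude stand-in).
from dataclasses import dataclass

@dataclass
class View:
    path: str        # the XPath the materialized view answers
    benefit: float   # estimated saved cost, weighted by query frequency
    size: int        # materialized size, e.g., in pages

def subsumes(cached: str, candidate: str) -> bool:
    """Crude redundancy check: path-prefix containment (illustrative only)."""
    return candidate.startswith(cached)

def select_views(candidates, budget):
    cache, used = [], 0
    for v in sorted(candidates, key=lambda v: v.benefit / v.size, reverse=True):
        if used + v.size <= budget and not any(subsumes(c.path, v.path) for c in cache):
            cache.append(v)
            used += v.size
    return cache

views = [View("/site/people", 90.0, 40),
         View("/site/people/person", 50.0, 25),  # redundant once the parent view is cached
         View("/site/regions", 60.0, 35)]
print([v.path for v in select_views(views, budget=80)])  # ['/site/people', '/site/regions']
```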

XBench (Benjamin Bin Yao)

XBench is a family of benchmarks that capture different XML application characteristics. These applications are categorized as data-centric or text-centric, and the corresponding databases can consist of a single document or multiple documents. In data-centric (DC) applications, the database stores data that are captured in XML, even though the original data may not be in XML; examples include e-commerce catalogue data or transactional data captured as XML. Text-centric (TC) applications manage actual text documents and use a database of native XML documents; examples include book collections in a digital library or news article archives. The single-document (SD) case covers databases, such as an e-commerce catalogue, that consist of a single document with complex structure (deeply nested elements), while the multiple-document (MD) case covers databases that contain a set of XML documents, such as an archive of news documents or transactional data. The result is a requirement for a database generator that can handle four cases: DC/SD, DC/MD, TC/SD, and TC/MD. The XBench database generator can generate databases in any of these classes, ranging from 10MB to 10GB in size. The workload specification covers the functionality of XQuery as captured in the XQuery Use Cases; each query is slightly varied to fit the specifics of the application domain. XBench has its own home page with more detailed information.

Distributed XML Query Processing (Patrick Kling)

While centralized query processing over collections of XML data stored at a single site is a well-understood problem, centralized query evaluation techniques are inherently limited in their scalability when presented with large collections (or a single, large document) and heavy query workloads. In the context of relational query processing, similar scalability challenges have been overcome by partitioning data collections, distributing them across the sites of a distributed system, and then evaluating queries in a distributed fashion, usually in a way that ensures locality between (sub-)queries and their relevant data. This thesis presents a suite of query evaluation techniques for XML data that follow a similar approach to address the scalability problems encountered by XML query evaluation.

Due to the significant differences in data and query models between relational and XML query processing, it is not possible to directly apply distributed query evaluation techniques designed for relational data to the XML scenario. Instead, new distributed query evaluation techniques need to be developed. Thus, in this thesis, an end-to-end solution to the scalability problems encountered by XML query processing is proposed.

Based on a data partitioning model that supports both horizontal and vertical fragmentation steps (or any combination of the two), XML collections are fragmented and distributed across the sites of a distributed system. Then, a suite of distributed query evaluation strategies is proposed. These query evaluation techniques ensure locality between each fragment of the collection and the parts of the query corresponding to the data in this fragment. Special attention is paid to scalability and query performance, which are achieved by ensuring a high degree of parallelism during distributed query evaluation and by avoiding access to irrelevant portions of the data. A simplified illustration of horizontal and vertical fragmentation appears below.
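The following toy example illustrates the two fragmentation steps under invented predicates and tags (the thesis defines these steps formally and its model is far richer): horizontal fragmentation partitions a collection of documents by their content, while vertical fragmentation detaches subtrees and leaves proxy nodes that link the fragments back together.

```python
# Horizontal and vertical fragmentation of a toy XML collection
# (simplified illustration; not the thesis's formal model).
import xml.etree.ElementTree as ET

def horizontal(docs, predicate):
    """Partition documents into two fragments by a boolean predicate."""
    return [d for d in docs if predicate(d)], [d for d in docs if not predicate(d)]

def vertical(doc, tag):
    """Detach every `tag` subtree, leaving a proxy node linking to it."""
    detached = []
    for parent in list(doc.iter()):
        for child in list(parent):
            if child.tag == tag:
                parent.remove(child)
                ET.SubElement(parent, "proxy", ref=str(len(detached)))
                detached.append(child)
    return doc, detached

docs = [ET.fromstring('<order region="EU"><items><item/></items></order>'),
        ET.fromstring('<order region="NA"><items><item/></items></order>')]
eu, rest = horizontal(docs, lambda d: d.get("region") == "EU")  # horizontal step
stub, fragments = vertical(eu[0], "items")                      # vertical step
print(ET.tostring(stub).decode())  # <order region="EU"><proxy ref="0" /></order>
```

Each fragment can then be placed at a different site, with queries split along the same boundaries.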

For maximum flexibility, the suite of distributed query evaluation techniques proposed in this thesis provides a number of alternative approaches for evaluating a given query over a given distributed collection. Thus, to achieve the best performance, it is necessary to predict and compare the expected performance of each of these alternatives. In this work, this is accomplished through a query optimization technique based on a distribution-aware cost model. The same cost model is also used to fine-tune the way a collection is fragmented to the demands of the query workload evaluated over this collection.

To evaluate the performance impact of the distributed query evaluation techniques proposed in this thesis, the techniques were implemented within a real-life XML database system. Based on this implementation, a thorough experimental evaluation was performed. The results of this evaluation confirm that the distributed query evaluation techniques introduced here lead to significant improvements in query performance and scalability.

Previous Related Work

This work is partly based on our previous research on SGML/HyTime-compliant document databases at the University of Alberta. For more information, see our overview paper (in PostScript format).

Researchers

Publications


Copyright © M. Tamer Özsu. All rights reserved.
Last update: September, 2006