XBench - A Family of Benchmarks for XML DBMSs
Data Gathering Methodology
The XBench family of benchmarks accommodates the requirements of
four classes of applications: text-centric/single document
(TC/SD), text-centric/multiple documents (TC/MD),
data-centric/single document (DC/SD), and data-centric/multiple
documents (DC/MD). Extensive studies were conducted to characterize the databases in each class
in terms of the parameters identified below. The methodology
was as follows:
- Analyze real XML documents and extract statistical data;
A sufficient number of text-centric XML documents are available for analysis. For the data-centric classes, however,
real XML data is hard to obtain. Most data in the
data-centric classes is currently stored in relational form and may be
translated into XML for communication. Therefore, the schema of
the TPC-W benchmark is used and mapped to XML.
- Generalize the characterizations of XML documents in each
category;
- Create synthetic data to simulate the XML documents in each
category.
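The first step, extracting statistical data from real documents, could be sketched roughly as follows. This is only an illustration in Python using the standard library, not the tooling actually used for XBench. The function name `child_count_histograms` and the sample document are hypothetical. For each parent/child pair of element types, the sketch records how many parent instances have a given number of children of that type.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def child_count_histograms(xml_text):
    """Map (parent_type, child_type) to a histogram {child_count: parent_instances}.

    Hypothetical helper for illustration; parents with zero children of a
    given type are not counted here.
    """
    root = ET.fromstring(xml_text)
    hist = defaultdict(Counter)
    for parent in root.iter():
        # Count the direct children of this parent instance, grouped by type.
        counts = Counter(child.tag for child in parent)
        for child_type, n in counts.items():
            hist[(parent.tag, child_type)][n] += 1
    return hist


# Made-up sample document, not taken from any XBench data set.
sample = "<catalog><item><author/><author/></item><item><author/></item></catalog>"
stats = child_count_histograms(sample)
```

Normalizing each histogram by its total would give an empirical probability distribution of the kind used in the characterization below.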
Document Characterization
The following parameters are used to characterize and generate XML
documents in each class.
- Element types. The collection of all element types that
appear in the XML documents.
- Tree structure of element types. The relationships among the
element types in the collection, indicating the parent/child
relationship between each pair of element types, if such a
relationship exists.
- Distribution of children to elements. For each
element type, the probability distribution of instance occurrences
of each of its child element (direct sub-element) types.
- Distribution of element values to types. The probability
distribution of the values of each element type.
- Attribute names. The collection of all attribute names in
the XML documents.
- Distribution of attribute values to names. The probability
distribution of the values of each attribute.
- Distribution of attributes to elements. The probability
distribution of the attributes for each element.
For each distribution parameter, the minimum and maximum values of
that distribution are defined in order to generate finite
documents.
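The role of the minimum and maximum values can be illustrated with a small sketch. It is not ToXgene and not the XBench generator. The `CHILDREN` table, the choice of an exponential distribution, and the clamping strategy are all assumptions made for this example. Because every child count is bounded and the type tree has no cycles, the generated document is always finite.

```python
import random
import xml.etree.ElementTree as ET

# Hypothetical characterization: parent type -> {child type: (mean, min, max)}.
CHILDREN = {
    "catalog": {"item": (3.0, 1, 5)},
    "item": {"author": (1.5, 1, 3), "title": (1.0, 1, 1)},
}


def bounded_count(rng, mean, lo, hi):
    """Draw from an exponential distribution with the given mean, clamped to [lo, hi]."""
    return max(lo, min(hi, round(rng.expovariate(1.0 / mean))))


def generate(tag, rng):
    """Recursively build an element of type `tag` with sampled numbers of children."""
    elem = ET.Element(tag)
    for child_tag, (mean, lo, hi) in CHILDREN.get(tag, {}).items():
        for _ in range(bounded_count(rng, mean, lo, hi)):
            elem.append(generate(child_tag, rng))
    return elem


doc = generate("catalog", random.Random(42))  # fixed seed for repeatability
xml_text = ET.tostring(doc, encoding="unicode")
```

Clamping is the simplest way to respect the bounds, but it piles probability mass onto the end points. Resampling until the value falls inside the range is an alternative if that distortion matters.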
Database Generator
For actual data generation, the ToXgene data generator is
used. ToXgene is a template-based tool that facilitates the generation of synthetic XML documents. Based on the generalized distributions and abstract structure of real XML documents in each application domain, ToXgene templates are created to simulate real XML documents. The database generator can be downloaded from the [Downloads] section.
Database Size
To support scalability, four database sizes are defined for each of the database classes: small (10MB), normal (100MB), large (1GB), and huge (10GB). The default database size is
"normal". Please note that ToXgene currently takes some time
to generate "large" databases and cannot generate
"huge" databases. These limitations affect XBench data generation as well. We
will inform registered users when these issues are resolved.