  • Data Gathering Methodology
    • The XBench family of benchmarks accommodates the requirements of four classes of applications: text-centric/single document (TC/SD), text-centric/multiple documents (TC/MD), data-centric/single document (DC/SD) and data-centric/multiple documents (DC/MD). Extensive studies were conducted to characterize the databases in each category in terms of the parameters identified below. The methodology was as follows:
      1. Analyze real XML documents and extract statistical data;
        A sufficient number of real text-centric XML documents are available for analysis. For the data-centric classes, however, real XML data is hard to obtain: most data in these classes currently resides in relational databases and may be translated into XML only for communication. Therefore, the schema of the TPC-W benchmark is used and mapped to XML.
      2. Generalize the characterizations of XML documents in each category;
      3. Create synthetic data to simulate the XML documents in each category.
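
Step 1 above can be illustrated with a minimal sketch. It is not the tooling used for XBench; it only shows, under assumptions, what extracting statistical data from one XML document could look like. The function name `collect_statistics`, the returned keys, and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Hypothetical sample document, used only for illustration.
SAMPLE = """<catalog>
  <item id="1"><title>A</title><author>X</author><author>Y</author></item>
  <item id="2"><title>B</title><author>Z</author></item>
</catalog>"""


def collect_statistics(xml_text):
    """Collect simple structural statistics from one XML document."""
    root = ET.fromstring(xml_text)
    element_types = set()
    attribute_names = set()
    parent_child = set()                 # (parent tag, child tag) pairs
    # (parent tag, child tag) -> histogram of occurrences per parent instance.
    # Parent instances with zero such children are not recorded here.
    child_counts = defaultdict(Counter)

    for elem in root.iter():             # visits the root and all descendants
        element_types.add(elem.tag)
        attribute_names.update(elem.attrib)
        per_instance = Counter(child.tag for child in elem)
        for child_tag, n in per_instance.items():
            parent_child.add((elem.tag, child_tag))
            child_counts[(elem.tag, child_tag)][n] += 1

    return {
        "element_types": element_types,
        "attribute_names": attribute_names,
        "parent_child": parent_child,
        "child_counts": dict(child_counts),
    }


stats = collect_statistics(SAMPLE)
print(sorted(stats["element_types"]))
print(stats["child_counts"][("item", "author")])
```

Aggregating such histograms over many documents would give an empirical estimate of the distributions that step 2 generalizes.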

  • Document Characterization
    • The following parameters are used to characterize and generate XML documents in each class.
      Element types.
      The collection of all element types that appear in XML documents.
      Tree structure of element types.
      The relationships among all element types in the collection, indicating the parent/child relationship of each pair of element types, if such a relationship exists.
      Distribution of children to elements.
      For each element type, the probability distribution of instance occurrences of all its child (direct sub-element) types.
      Distribution of element values to types.
      The probability distribution of values of each element type.
      Attribute names.
      The collection of all attribute names in an XML document.
      Distribution of attribute values to names.
      The probability distribution of values of each attribute.
      Distribution of attributes to elements.
      The probability distribution of the attributes to each element.
      For each distribution parameter, the minimum and maximum values of that distribution are defined in order to generate finite documents.
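
The role of the minimum and maximum bounds can be sketched as follows. This is an illustrative generator, not ToXgene and not the XBench templates: the element names, the `CHILD_SPEC` and `ATTRIBUTE_SPEC` tables, and the choice of a uniform distribution between the bounds are all assumptions made for the example.

```python
import random
import xml.etree.ElementTree as ET

# Hypothetical tree structure: for each element type, its child types with
# (min, max) occurrence bounds. Types without an entry are leaves.
CHILD_SPEC = {
    "catalog": {"item": (1, 3)},
    "item": {"title": (1, 1), "author": (1, 2)},
}
# Hypothetical attribute names per element type.
ATTRIBUTE_SPEC = {"item": ["id"]}


def generate(tag, rng):
    """Recursively build one element, drawing child counts within the bounds."""
    elem = ET.Element(tag)
    for name in ATTRIBUTE_SPEC.get(tag, []):
        elem.set(name, str(rng.randint(1, 1000)))
    children = CHILD_SPEC.get(tag, {})
    if not children:
        # Leaf element: draw a placeholder value.
        elem.text = f"value-{rng.randint(0, 9)}"
    for child_tag, (lo, hi) in children.items():
        # Uniform draw between the bounds (an assumption); the finite maximum
        # and the acyclic spec keep the generated document finite.
        for _ in range(rng.randint(lo, hi)):
            elem.append(generate(child_tag, rng))
    return elem


root = generate("catalog", random.Random(42))
print(ET.tostring(root, encoding="unicode"))
```

A real generator would replace the uniform draw with the distribution measured for each element type; the bounds play the same role either way.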

  • Database Generator
    • For actual data generation, the ToXgene data generator is used. ToXgene is a template-based tool that facilitates the generation of synthetic XML documents. Based on the generalized distributions and abstract structure of real XML documents in each application domain, ToXgene templates are created to simulate real XML documents. The database generator can be downloaded from the [Downloads] section.

  • Database Size
    • To address the scalability of the benchmarks, four database sizes are defined for each of the database classes: small (10MB), normal (100MB), large (1GB), and huge (10GB). The default database size is "normal". Please note that ToXgene currently takes some time to generate "large" databases and cannot generate "huge" databases. These limitations affect XBench data generation as well. We will inform registered users when these issues are resolved.
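
The size classes above could be written down as a small lookup, for example in a script that drives data generation. The names `SIZE_CLASSES` and `target_bytes` are hypothetical, and decimal units (1MB = 10^6 bytes) are an assumption, since the text does not say whether decimal or binary units are meant.

```python
# Target sizes in bytes, assuming decimal units (1MB = 10**6 bytes).
SIZE_CLASSES = {
    "small": 10 * 10**6,    # 10MB
    "normal": 100 * 10**6,  # 100MB (default)
    "large": 10**9,         # 1GB
    "huge": 10 * 10**9,     # 10GB
}
DEFAULT_SIZE_CLASS = "normal"


def target_bytes(size_class=DEFAULT_SIZE_CLASS):
    """Return the target database size in bytes for a named size class."""
    try:
        return SIZE_CLASSES[size_class]
    except KeyError:
        raise ValueError(f"unknown size class: {size_class!r}") from None


print(target_bytes())         # default class
print(target_bytes("large"))
```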

