Streaming XML
Problem statement for publish-subscribe systems
- Incoming stream of XML messages (published messages)
- Routers identify subset of messages that meet users' criteria
(subscriptions for packet routing; triggers, recommenders, etc.)
- Selection criterion stated as arbitrary XPath query
e.g., P1 = //publication[@yr>2003]//author[text()='Smith']
Approaches from related applications
- Multiple query optimization => sharing evaluation of common subqueries
- Continuous query evaluation => group queries based on common predicates
(most selective first)
Approaches based on finite state recognizers
- e.g., XPush, proposed by Gupta and Suciu
- Index of atomic predicates (such as "= 'Smith'" or "> 2003")
to match against values found in attributes or elements
- Automaton construction (example):
- create finite automata corresponding to each pattern individually
- create recognizer
- initial state
- one state for each interesting range of atomic values
- states for matching the end of an element (pop)
- states for merging new element matches with previous sets of matches (add)
- repeating (c) and (d) until no new states created
- create mapping from recognizer states to set of patterns matched
- Simplest approach is to consider bottom-up matching only
- Improving efficiency of construction
- Construct new states only as needed while matching against a document
- Use top-down pruning to limit values and elements of interest
- Use top-down pruning and eliminate bottom-up checking for the chain from the root to the first branching node
- Use order pruning to limit merging of new element matches with previous sets of matches
Prefiltering XML documents
- Remove all components not needed for further processing
- W3C standard for specifying fragments
- Alternative: fragments of interest specified by XPath
(possibly extracted from more general XQuery expressions)
- Result: documents much smaller
- more fits into memory at a time
- less I/O
- bigger effective window for streaming applications
- ICDE 2008 paper (Koch, et al.) shows efficient processor
References and related reading