CS 640 Principles of Database Management and Use
Winter 2012
http://cs.uwaterloo.ca/~fwtompa/cs640
Frank Tompa, DC 1313, x34675, fwtompa@uwaterloo.ca
"An overview of relational databases and how they are used; exposure to relational database technology. Fundamentals of transactions. Database views. Introductions to two or three alternative data models and systems, such as those used for structured text, spatial data, multimedia data, and information retrieval. Introduction to several current topics in database research, such as warehousing, data mining, managing data streams, data cleaning, data integration, and distributed databases."
Students are expected to understand the fundamentals of programming languages, data structures, operating systems, and algorithms, each at least at the level of an introductory course.
The course is structured around a collection of published articles, each of which includes an extensive bibliography to related work.
There are also many database textbooks that are used to teach database fundamentals to undergraduate students. Students might wish to refer to one of the following:
The database area is well-covered by Michael Ley's online DBLP bibliography:
Finally, the Encyclopedia of Database Systems includes short, readable articles on many topics:
There will be no tests or exams.
Please review the materials concerning academic integrity and academic honesty. You must complete and sign the Academic Integrity Acknowledgement Form, and hand it in before the first assignment is due.
Tuesdays and Thursdays 2:30-4:00; DC 3313
For topics 2 through 12 below, background material and some of the principal ideas will be introduced first, and (except for the last topic) the related papers will be discussed the following week. Students must read the assigned articles in advance of the class in which the content will be discussed. You will likely find Keshav's essay entitled How to Read a Paper instructive.
Data as a fundamental asset, what it means to manage persistent data, differences between file management and database management in terms of functionality and interface.
Fundamentals of relational databases, relational calculus, relational algebra, integrity issues, normalization.
Donald D. Chamberlin: Relational Data-Base Management Systems. ACM Comput. Surv. 8(1): 43-66 (1976).
"The essential concepts of the relational data
model are defined, and normalization, relational languages based on the
model, as well as advantages and implementations of relational systems
are discussed."
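As a small illustration of the relational operators named above, selection and projection can be expressed directly in SQL; here is a minimal sketch using Python's built-in sqlite3 module (the table and data are invented for illustration):

```python
import sqlite3

# In-memory relation; the schema and tuples are invented for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE student (sid INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
con.executemany("INSERT INTO student VALUES (?, ?, ?)",
                [(1, "Ada", "CS"), (2, "Grace", "CS"), (3, "Alan", "Math")])

# Selection (WHERE) composed with projection (SELECT name):
cs_names = [r[0] for r in
            con.execute("SELECT name FROM student WHERE dept = 'CS' ORDER BY name")]
print(cs_names)  # ['Ada', 'Grace']
```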
Database design methodology, Logical data modeling by Entity-Relationship modeling, mapping ER models to relational models, relationship to UML models.
Il-Yeol Song and Kristin Froehlich: Entity-Relationship Modeling: A Practical How-to Guide. IEEE Potentials 13(5): 29-45 (1994-1995).
"The Entity-Relationship (ER) model and its accompanying ER diagrams
are widely used for database design and systems analysis.... We will
present step-by-step guidelines, a set of decision rules proven to be
useful in building ER diagrams, and a case study problem with a
preferred answer as well as a set of incorrect diagrams for the
problem."
Hossein Saiedian: An evaluation of extended entity-relationship model. Information & Software Technology 39(7): 449-462 (1997).
"The Entity-Relationship (ER) model allows a database designer to
develop a high-level conceptual schema without having to consider
low-level issues such as efficiency, the underlying database management
system model, or physical data structures. The ER model has become very
popular for database design and is used quite extensively. In order to
strengthen its expressive power, many database researchers have
introduced or proposed certain extensions to this model. Some of these
extensions are important, while others add little expressive power but
provide auxiliary features. Since the ER model is used so widely, it is
important to know what extensions have been proposed for this model and
what these extensions offer to the users. The objective of this article
is thus to survey major extensions to the ER model and to evaluate
their merits. We point out that lying behind the syntactical
differences of the various extensions is the enriched semantics about
relationships among entities. We also point out the close relationship
between ER modeling and object-oriented data modeling."
Stephen Johnson, Carol Friedman, James J. Cimino, Tony Clark, George Hripcsak, and Paul D. Clayton: Conceptual data model for a central patient database. Proc. 15th Annual Symposium on Computer Applications in Medical Care (SCAMC), 1991: 381-385.
"This paper presents methods used to develop a conceptual model for a
patient database forming the centerpiece of a clinical information
system under development. Various modeling techniques are discussed
using a simplified fragment of the model. A method for mapping the model
onto a relational design optimized for single patient retrievals is
described. The results section discusses a number of issues pertaining
to the flexibility and usability of this architecture."
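The standard ER-to-relational mapping discussed in these readings can be shown in miniature: a one-to-many relationship between two entity types becomes a foreign key on the "many" side. The schema below is a hypothetical fragment (entity and attribute names invented), expressed via sqlite3:

```python
import sqlite3

# Hypothetical ER fragment: Physician --(1:N treats)--> Patient.
# The 1:N relationship maps to a foreign key on the many side.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE physician (pid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE patient (
    mrn INTEGER PRIMARY KEY,
    name TEXT,
    physician_id INTEGER REFERENCES physician(pid)  -- the relationship
);
""")
con.execute("INSERT INTO physician VALUES (1, 'Dr. Osler')")
con.executemany("INSERT INTO patient VALUES (?, ?, ?)",
                [(100, 'Smith', 1), (101, 'Jones', 1)])

# Reassembling the relationship is a join:
rows = con.execute("""SELECT ph.name, pt.name FROM physician ph
                      JOIN patient pt ON pt.physician_id = ph.pid
                      ORDER BY pt.mrn""").fetchall()
print(rows)  # [('Dr. Osler', 'Smith'), ('Dr. Osler', 'Jones')]
```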
SQL DDL, SQL DML, QBE, Embedded SQL.
Donald D. Chamberlin, Morton M. Astrahan, Kapali P. Eswaran, Patricia P. Griffiths, Raymond A. Lorie, James W. Mehl, Phyllis Reisner, and Bradford W. Wade: SEQUEL 2: A Unified Approach to Data Definition, Manipulation, and Control. IBM Journal of Research and Development 20(6): 560-575 (1976).
"SEQUEL 2 is a relational data language that provides a consistent,
English keyword-oriented set of facilities for query, data definition,
data manipulation, and data control. SEQUEL 2 may be used either as a
stand-alone interface for nonspecialists in data processing or as a
data sublanguage embedded in a host programming language for use by
application programmers and data base administrators. This paper
describes SEQUEL 2 and the means by which it is coupled to a host
language."
Moshé M. Zloof: Query-by-Example: A Data Base Language. IBM Systems Journal 16(4): 324-343 (1977).
"Discussed is a high-level data base management language that provides
the user with a convenient and unified interface to query, update,
define, and control a data base. When the user performs an operation
against the data base, he fills in an example of a solution to that
operation in skeleton tables that can be associated with actual tables
in the data base. The system is currently being used experimentally for
various applications."
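The "data sublanguage embedded in a host language" idea from the SEQUEL 2 paper survives essentially unchanged in modern APIs: SQL text with host-variable placeholders is executed from an application program. A minimal sketch (table and data invented), using Python's sqlite3 as the host interface:

```python
import sqlite3

# SQL as an embedded sublanguage: the query text lives in the host
# program, with `?` marking host variables bound at execution time.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (name TEXT, dept TEXT, sal INTEGER)")
con.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                [("Smith", "Toy", 50), ("Jones", "Shoe", 60), ("Ng", "Toy", 70)])

def salaries_in(dept):
    # Binding, not string splicing, keeps data and query text separate.
    return [s for (s,) in
            con.execute("SELECT sal FROM emp WHERE dept = ? ORDER BY sal", (dept,))]

print(salaries_in("Toy"))  # [50, 70]
```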
Role of indexes in efficient database application development; query optimization
Surajit Chaudhuri and Vivek R. Narasayya: An Efficient Cost-Driven Index Selection Tool for Microsoft SQL Server. VLDB 1997: 146-155.
"In this paper we describe novel techniques that make it possible to
build an industrial-strength tool for automating the choice of indexes
in the physical design of a SQL database. The tool takes as input a
workload of SQL queries, and suggests a set of suitable indexes. We
ensure that the indexes chosen are effective in reducing the cost of
the workload by keeping the index selection tool and the query
optimizer "in step". The number of index sets that must be evaluated to
find the optimal configuration is very large. We reduce the complexity
of this problem using three techniques. First, we remove a large number
of spurious indexes from consideration by taking into account both
query syntax and cost information. Second, we introduce optimizations
that make it possible to cheaply evaluate the "goodness" of an index
set. Third, we describe an iterative approach to handle the complexity
arising from multi-column indexes. The tool has been implemented on
Microsoft SQL Server 7.0. We performed extensive experiments over a
range of workloads, including TPC-D. The results indicate that the tool
is efficient and its choices are close to optimal."
Surajit Chaudhuri: An Overview of Query Optimization in Relational Systems. PODS 1998: 34-43.
"There has been extensive work in query optimization since the early
’70s. It is hard to capture the breadth and depth of this large body of
work in a short article. Therefore, I have decided to focus primarily
on the optimization of SQL queries in relational database systems and
present my biased and incomplete view of this field. The goal of this
article is not to be comprehensive, but rather to explain the
foundations and present samplings of significant work in this area. I
would like to apologize to the many contributors in this area whose
work I have failed to explicitly acknowledge due to oversight or lack
of space. I take the liberty of trading technical precision for ease of
presentation."
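The effect that index selection has on the access path chosen by the optimizer can be observed even in a tiny setting. The sketch below (table, data, and index name invented) uses SQLite's EXPLAIN QUERY PLAN; note that the exact plan text varies across SQLite versions:

```python
import sqlite3

# Before the index exists the optimizer can only scan the table;
# afterwards it searches via the index.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (name TEXT, dept TEXT)")
con.executemany("INSERT INTO emp VALUES (?, ?)",
                [("e%d" % i, "d%d" % (i % 10)) for i in range(1000)])

query = "SELECT name FROM emp WHERE dept = 'd3'"
before = " ".join(r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + query))
con.execute("CREATE INDEX idx_dept ON emp(dept)")
after = " ".join(r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + query))
print(before)  # a full scan of emp
print(after)   # a search using idx_dept
```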
Transaction models, concurrency control algorithms, database recovery.
Jim Gray, Raymond A. Lorie, Gianfranco R. Putzolu, and Irving L. Traiger: Granularity of Locks and Degrees of Consistency in a Shared Data Base. IFIP Working Conference on Modelling in Data Base Management Systems 1976: 365-394.
"The problem of choosing the appropriate granularity (size) of lockable
objects is introduced and the tradeoff between concurrency and overhead
is discussed. A locking protocol which allows simultaneous locking at
various granularities by different transactions is presented. It is
based on the introduction of additional lock modes besides the
conventional share mode and exclusive mode. A proof is given of the
equivalence of this protocol to a conventional one...."
Theo Haerder and Andreas Reuter: Principles of Transaction-Oriented Database Recovery. ACM Comput. Surv. 15(4): 287-317 (1983).
"In this paper, a terminological framework is provided for describing
different transaction-oriented recovery schemes for database systems in
a conceptual rather than an implementation-dependent way. By
introducing the terms materialized database, propagation strategy, and
checkpoint, we obtain a means for classifying arbitrary implementations
from a unified viewpoint. This is complemented by a classification
scheme for logging techniques, which are precisely defined by using the
other terms. It is shown that these criteria are related to all
relevant questions such as speed and scope of recovery and amount of
redundant information required. The primary purpose of this paper,
however, is to establish an adequate and precise terminology for a
topic in which the confusion of concepts and implementational aspects
still imposes a lot of problems."
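The atomicity that these recovery papers take as their starting point is easy to demonstrate: either all updates of a transaction survive, or none do. A minimal sketch (account table and amounts invented), using sqlite3's commit/rollback:

```python
import sqlite3

# A failed funds transfer is rolled back, leaving the database in its
# pre-transaction state: atomicity in miniature.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER)")
con.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 0)])
con.commit()

try:
    con.execute("UPDATE account SET balance = balance - 150 WHERE id = 'A'")
    (bal,) = con.execute("SELECT balance FROM account WHERE id = 'A'").fetchone()
    if bal < 0:
        raise ValueError("insufficient funds")
    con.execute("UPDATE account SET balance = balance + 150 WHERE id = 'B'")
    con.commit()
except ValueError:
    con.rollback()  # undo the partial debit of A

balances = dict(con.execute("SELECT id, balance FROM account"))
print(balances)  # {'A': 100, 'B': 0}
```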
Views and view management, materialized views.
Arthur M. Keller: The Role of Semantics in Translating View Updates. IEEE Computer 19(1): 63-73 (1986).
"... it is desirable to provide the users with interfaces that give
them only information that is relevant to them. In shared relational
databases, which are the subject of this article, this is done by
defining views for each class of users. Views represent simplified
models of the database, and users can express queries and updates
against them. How to handle queries expressed against views is well
understood: The user's query is composed with the view definition so as
to obtain a query that can be executed on the underlying database.
Similarly, updates expressed against a view have to be translated into
updates that can be executed on the underlying database...."
David DeHaan, Per-Åke Larson, and Jingren Zhou: Stacked Indexed Views in Microsoft SQL Server. SIGMOD Conference 2005: 179-190.
"Appropriately selected materialized views (also called indexed views)
can speed up query execution by orders of magnitude. Most database
systems limit support for materialized views to select-project-join
expressions, possibly with a group-by, over base tables because this
class of views can be efficiently maintained incrementally and thus
kept
up to date with the underlying source tables. However, limiting views
to reference only base tables restricts the class of queries that can
be supported by materialized views. View stacking (also called views on
views) relaxes one restriction by allowing a materialized view to
reference both base tables and other materialized views. This extends
materialized view support to additional types of queries. This paper
describes a prototype implementation of stacked views within Microsoft
SQL Server and explains which classes of queries can be supported. To
support view matching for stacked views, a signature mechanism was
added to the optimizer. This mechanism turned out to be beneficial also
for regular views by significantly speeding up view matching."
Jonathan Goldstein, Per-Åke Larson: Optimizing Queries Using Materialized Views: A practical, scalable solution. SIGMOD Conference 2001: 331-342.
"Materialized views can provide massive improvements in query
processing time, especially for aggregation queries over large tables.
To realize this potential, the query optimizer must know
how and when to exploit materialized views. This paper presents a
fast and scalable algorithm for determining whether part or all of a
query can be computed from materialized views and describes
how it can be incorporated in transformation-based optimizers.
The current version handles views composed of selections, joins
and a final group-by. Optimization remains fully cost based, that
is, a single best rewrite is not selected by heuristic rules but
multiple rewrites are generated and the optimizer chooses the best
alternative in the normal way. Experimental results based on an
implementation in Microsoft SQL Server show outstanding performance
and scalability. Optimization time increases slowly with
the number of views but remains low even up to a thousand."
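The query-through-view composition that Keller describes as well understood can be seen concretely: a query over a view is answered by composing it with the view definition over the base table. A minimal sketch (table, view, and data invented):

```python
import sqlite3

# The view exposes only ICU visits, hiding the cost column; a query
# against the view is composed with the view definition by the system.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE visit (patient TEXT, ward TEXT, cost INTEGER)")
con.executemany("INSERT INTO visit VALUES (?, ?, ?)",
                [("p1", "ICU", 900), ("p2", "ICU", 700), ("p3", "ER", 200)])

con.execute("CREATE VIEW icu_visits AS "
            "SELECT patient, ward FROM visit WHERE ward = 'ICU'")

patients = [p for (p, _) in
            con.execute("SELECT * FROM icu_visits ORDER BY patient")]
print(patients)  # ['p1', 'p2']
```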
Database interoperability, record matching.
Walter V. Sujansky: Heterogeneous Database Integration in Biomedicine. Journal of Biomedical Informatics 34(4): 285-298 (2001).
"The
rapid expansion of biomedical knowledge, reduction in computing costs,
and spread of internet access have created an ocean of electronic data.
The decentralized nature of our scientific community and healthcare
system, however, has resulted in a patchwork of diverse, or
heterogeneous, database implementations, making access to and
aggregation of data across databases very difficult. The database
heterogeneity problem applies equally to clinical data describing
individual patients and biological data characterizing our genome.
Specifically, databases
are highly heterogeneous with respect to the data models they employ,
the data schemas they specify, the query languages they support, and
the terminologies they recognize. Heterogeneous database systems
attempt to unify disparate databases by providing uniform conceptual
schemas that resolve representational heterogeneities, and by providing
querying capabilities that aggregate and integrate distributed data.
Research in this area has applied a variety of database and
knowledge-based techniques, including semantic data modeling, ontology
definition, query translation, query optimization, and terminology
mapping. Existing systems have addressed heterogeneous database
integration in the realms of molecular biology, hospital information
systems, and application portability."
Glenn B. Bell and Anil Sethi: Matching records in a national medical patient index. Commun. ACM 44(9): 83-88 (2001).
"...There are many potential benefits of having a computerized medical
history, including the elimination of duplicate tests and procedures,
better access to patient histories for emergency use, elimination of
reentry of historical data, and greater accuracy. To make all patients’
medical records in the U.S. accessible to care providers, electronic
medical records need to be linked together using a massively
distributed Master Patient Index (MPI)...."
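To give a flavour of the record-matching problem, here is a toy linkage score, not the MPI algorithm of Bell and Sethi: fields are normalized and exact agreements counted across a few attributes (field names and records invented):

```python
# Toy record-linkage heuristic: normalize away case and punctuation,
# then score by the fraction of agreeing fields.
def normalize(s):
    return "".join(ch for ch in s.lower() if ch.isalnum())

def match_score(rec1, rec2, fields=("last", "first", "dob")):
    agree = sum(normalize(rec1[f]) == normalize(rec2[f]) for f in fields)
    return agree / len(fields)

a = {"last": "O'Brien", "first": "Mary", "dob": "1970-01-02"}
b = {"last": "OBrien",  "first": "Mary", "dob": "1970-01-02"}
c = {"last": "Smith",   "first": "John", "dob": "1980-05-06"}
print(match_score(a, b))  # 1.0 (punctuation normalized away)
print(match_score(a, c))  # 0.0
```

Real matchers weight fields, tolerate typographic variation, and trade off false matches against missed matches, which is where the hard problems discussed in the paper begin.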
Database security and authorization.
Elisa Bertino, Sushil Jajodia, and Pierangela Samarati: Database Security: Research and Practice. Inf. Syst. 20(7): 537-556 (1995).
"As an increasing number of organizations become dependent on access to
their data over the Internet, the need for adequate security measures
is becoming more and more critical. The most popular security measure
these days is a firewall. However, a firewall is not immune to
penetration, and it does not provide any protection of internal
resources from insiders and successful intruders. One of the
requirements for the protection of internal resources is access control
to ensure that all accesses are authorized according to some specified
policy. In this paper, we survey the state of the art in access control
for database systems, discuss the main research issues, and outline
possible directions for future research."
Rebecca Mercuri: The HIPAA-potamus in health care data security. Commun. ACM 47(7): 25-28 (2004).
"Regulations intended to improve health care data access have created
new security risks along with headaches for patients and
practitioners."
Gustavo H. M. B. Motta and Sergio Shiguemi Furuie: A contextual role-based access control authorization model for electronic patient record. IEEE Transactions on Information Technology in Biomedicine 7(3): 202-207 (2003).
"The design of proper models for authorization and access control for
electronic patient record (EPR) is essential to a wide scale use of EPR
in large health organizations. In this paper, we propose a contextual
role-based access control authorization model aiming to increase the
patient privacy and the confidentiality of patient data, whereas being
flexible enough to consider specific cases. This model regulates user's
access to EPR based on organizational roles. It supports a role-tree
hierarchy with authorization inheritance; positive and negative
authorizations; static and dynamic separation of duties based on weak
and strong role conflicts. Contextual authorizations use environmental
information available at access time, like user/patient relationship,
in order to decide whether a user is allowed to access an EPR resource.
This enables the specification of a more flexible and precise
authorization policy, where permission is granted or denied according
to the right and the need of the user to carry out a particular job
function."
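The core of contextual role-based access control can be sketched in a few lines: a role grants an action, and a contextual predicate (here, the user/patient care relationship) must also hold. All names, roles, and rules below are invented for illustration:

```python
# Minimal contextual RBAC sketch: permission requires both a role grant
# and a contextual condition evaluated at access time.
ROLE_PERMISSIONS = {
    "physician": {"read_epr", "write_epr"},
    "clerk": {"read_demographics"},
}
CARE_TEAM = {"patient42": {"dr_house"}}  # who currently treats whom

def authorized(user, role, action, patient):
    if action not in ROLE_PERMISSIONS.get(role, set()):
        return False
    # Contextual condition: EPR access requires a treatment relationship.
    if action in ("read_epr", "write_epr"):
        return user in CARE_TEAM.get(patient, set())
    return True

print(authorized("dr_house", "physician", "read_epr", "patient42"))  # True
print(authorized("dr_who", "physician", "read_epr", "patient42"))    # False
```

The model in the paper adds role hierarchies, negative authorizations, and separation of duties on top of this basic role-plus-context check.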
OLAP, data cubes.
Surajit Chaudhuri and Umeshwar Dayal: An Overview of Data Warehousing and OLAP Technology. SIGMOD Record 26(1): 65-74 (1997).
"Data warehousing and on-line analytical processing (OLAP) are
essential elements of decision support, which has increasingly become a
focus of the database industry. Many commercial products and services
are now available, and all of the principal database management system
vendors now have offerings in these areas. Decision support places some
rather different requirements on database technology compared to
traditional on-line transaction processing applications. This paper
provides an overview of data warehousing and OLAP technologies, with an
emphasis on their new requirements. We describe back end tools for
extracting, cleaning and loading data into a data warehouse;
multidimensional data models typical of OLAP; front end client tools
for querying and data analysis; server extensions for efficient query
processing; and tools for metadata management and for managing the
warehouse. In addition to surveying the state of the art, this paper
also identifies some promising research issues, some of which are
related to problems that the database research community has worked on
for years, but others are only just beginning to be addressed. This
overview is based on a tutorial that the authors presented at the VLDB
Conference, 1996."
Craig S. Ledbetter and Matthew W. Morgan: Toward Best Practice: Leveraging the Electronic Patient Record as a Clinical Data Warehouse. J. Healthcare Info. Mgmt 15(2): 119-131 (2001).
"Automating clinical and administrative processes via an electronic
patient record (EPR) gives clinicians the point-of-care tools they need
to deliver better patient care. However, to improve clinical practice
as a whole and then evaluate it, healthcare must go beyond basic
automation and convert EPR data into aggregated, multidimensional
information. Unfortunately, few EPR systems have the established,
powerful analytical clinical data warehouses (CDWs) required for this
conversion. This article describes how an organization can support best
practice by leveraging a CDW that is fully integrated into its EPR and
clinical decision support (CDS) system. The article (1) discusses the
requirements for comprehensive CDS, including on-line analytical
processing (OLAP) of data at both transactional and aggregate levels,
(2) suggests that the transactional data acquired by an OLTP EPR system
must be remodeled to support retrospective, population-based, aggregate
analysis of those data, and (3) concludes that this aggregate analysis
is best provided by a separate CDW system."
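An OLAP aggregate over a fact table is just a GROUP BY along some subset of the dimensions; dropping a dimension rolls the data up. A tiny sketch over an invented fact table (ward and year as dimensions, cost as the measure):

```python
import sqlite3

# One face of the data cube: aggregate cost by ward; the grand total
# is the full roll-up (both dimensions dropped).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact (ward TEXT, year INTEGER, cost INTEGER)")
con.executemany("INSERT INTO fact VALUES (?, ?, ?)",
                [("ICU", 2011, 5), ("ICU", 2012, 7),
                 ("ER", 2011, 3), ("ER", 2012, 4)])

by_ward = dict(con.execute("SELECT ward, SUM(cost) FROM fact GROUP BY ward"))
total = con.execute("SELECT SUM(cost) FROM fact").fetchone()[0]
print(by_ward, total)  # {'ER': 7, 'ICU': 12} 19
```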
Streaming data, window queries.
Lukasz Golab and M. Tamer Özsu: Issues in data stream management. SIGMOD Record 32(2): 5-14 (2003).
"Traditional databases store sets of relatively static records with no
pre-defined
notion of time, unless timestamp attributes are explicitly added. While
this model adequately represents commercial catalogues or repositories
of personal information, many current and emerging applications require
support for online analysis of rapidly changing data streams.
Limitations of traditional DBMSs in supporting streaming applications
have been recognized, prompting research to augment existing
technologies and build new systems to manage streaming data. The
purpose of this paper is to review recent work in data stream
management systems, with an emphasis on application requirements, data
models, continuous query languages, and query evaluation."
Alfonso F. Cardenas, Raymond K. Pon, Robert B. Cameron: Management of Streaming Body Sensor Data for Medical Information Systems. METMBS 2003: 186-191.
"Data retrieved from body sensors such as ECG machines and new-generation multi-sensor systems such as respiratory monitors are varied and abundant. Managing and integrating this streaming data with existing types of medical information such as patient records, laboratory results, procedures, medical documents, and medical images in a logical and intuitive way is a challenge. Not only is the management of such data difficult but so is the visualization and presentation of the data to physicians, specialists, and patients. Using a new generation of lifeshirts embedded with real time monitors for respiratory and heart functions as a testbed, we propose and have initiated development of a data management system for dealing with such streaming body sensor data, under the framework of an extensible architecture that will support easy visualization of such data."
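A continuous query over a stream is typically evaluated against a bounded window rather than the unbounded stream itself. A minimal sketch of a count-based sliding window (window size and sample readings invented), of the kind one might run over body-sensor data:

```python
from collections import deque

# Count-based sliding window (size 3) maintaining a running average
# over a stream of readings.
def windowed_averages(stream, size=3):
    window, out = deque(maxlen=size), []
    for reading in stream:
        window.append(reading)  # oldest reading falls out automatically
        out.append(sum(window) / len(window))
    return out

heart_rates = [60, 62, 64, 70, 80]
print(windowed_averages(heart_rates))  # starts 60.0, 61.0, 62.0, ...
```

Time-based windows ("the last 5 minutes") work the same way but evict by timestamp rather than by count.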
Market basket analysis, association rules.
Evangelos Simoudis: Reality Check for Data Mining. IEEE Expert 11(5): 26-33 (1996).
"Data mining—the process of extracting valid, previously unknown,
comprehensible, and actionable information from large databases and
using it to make crucial business decisions—currently performs this
task for a growing range of businesses. After presenting an overview of
current data-mining techniques, this article explores two particularly
noteworthy applications of those techniques: market basket analysis and
customer segmentation."
Heikki Mannila: Methods and Problems in Data Mining. ICDT 1997: 41-55.
"Knowledge discovery in databases and data mining aim at
semiautomatic tools for the analysis of large data sets. We consider some
methods used in data mining, concentrating on levelwise search for all
frequently occurring patterns. We show how this technique can be used
in various applications. We also discuss possibilities for compiling data
mining queries into algorithms, and look at the use of sampling in data
mining. We conclude by listing several open research problems in data
mining and knowledge discovery."
J. G. Richards, V. J. Rayward-Smith, P. H. Sönksen, S. Carey, and C. Weng: Data mining for indicators of early mortality in a database of clinical records. Artificial Intelligence in Medicine 22: 215-231 (2001).
"This paper describes the analysis of a database of diabetic patients’
clinical records and death certificates. The objective of the study was
to find rules that describe associations between observations made of
patients at their first visit to the hospital and early mortality.
Pre-processing was carried out and a knowledge discovery in databases
(KDD) package, developed by the Lanner Group and the University of East
Anglia, was used for rule induction using simulated annealing. The most
significant discovered rules describe an association that was not
generally known or accepted by the medical community; however, recent
independent studies confirm their validity."
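The two measures that underlie association-rule mining, support and confidence, can be computed directly on toy data (the levelwise/Apriori search over candidate itemsets is not shown; baskets are invented):

```python
# Support of an itemset: fraction of baskets containing it.
# Confidence of a rule LHS -> RHS: support(LHS u RHS) / support(LHS).
baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]

def support(itemset):
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

print(support({"bread", "milk"}))       # 0.5
print(confidence({"bread"}, {"milk"}))  # 2/3: bread implies milk 2 times in 3
```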