CS 640 Principles of Database Management and Use

Winter 2012

http://cs.uwaterloo.ca/~fwtompa/cs640

Frank Tompa
DC 1313, x34675
fwtompa@uwaterloo.ca

Upcoming seminars of interest

N/A

Calendar Description

"An overview of relational databases and how they are used; exposure to relational database technology. Fundamentals of transactions. Database views. Introductions to two or three alternative data models and systems, such as those used for structured text, spatial data, multimedia data, and information retrieval. Introduction to several current topics in database research, such as warehousing, data mining, managing data streams, data cleaning, data integration, and distributed databases."

Prerequisites

Students are expected to understand the fundamentals of programming languages, data structures, operating systems, and algorithms, each at least at the level of an introductory course.

References

The course is structured around a collection of published articles, each of which includes an extensive bibliography to related work.

There are also many database textbooks used to teach database fundamentals to undergraduate students; students might wish to refer to one of these for background.

The database area is well covered by Michael Ley's online DBLP bibliography.

Finally, the Encyclopedia of Database Systems includes short, readable articles on many topics, and several topics are also covered by short monographs in a related series.

Workload and Evaluation

There will be no tests or exams.

Please review the materials concerning academic integrity and academic honesty. You must complete and sign the Academic Integrity Acknowledgement Form, and hand it in before the first assignment is due.

Schedule

Tuesdays and Thursdays 2:30-4:00; DC 3313

Outline (tentative)

For topics 2 through 12 below, background material and some of the principal ideas will be introduced first, and (except for the last topic) the related papers will be discussed the following week. Students must read the assigned articles in advance of the class in which they will be discussed. You will likely find Keshav's essay entitled How to Read a Paper instructive.

1. Introduction (1.5 hours)

Data as a fundamental asset, what it means to manage persistent data, differences between file management and database management in terms of functionality and interface.

2. Relational database systems (6 hours)

Fundamentals of relational databases, relational calculus, relational algebra, integrity issues, normalization.

Donald D. Chamberlin: Relational Data-Base Management Systems. ACM Comput. Surv. 8(1): 43-66 (1976).
"The essential concepts of the relational data model are defined, and normalization, relational languages based on the model, as well as advantages and implementations of relational systems are discussed."

3. Database design (3 hours)

Database design methodology, logical data modeling with Entity-Relationship (ER) diagrams, mapping ER models to relational schemas, and the relationship to UML models.
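
As a rough illustration of the mapping step (a toy example of my own, not taken from the readings): an ER design with entity sets Patient and Physician and a many-to-many "treats" relationship maps to one table per entity set plus a relationship table whose key combines the keys of the participating entities.

    # Toy ER-to-relational mapping (illustrative names): each entity set becomes
    # a table keyed by its identifier; the M:N relationship set becomes a table
    # whose primary key combines the two foreign keys.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE Patient   (pid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Physician (mid INTEGER PRIMARY KEY, name TEXT, specialty TEXT);
    CREATE TABLE Treats (                       -- relationship set with an attribute
        pid   INTEGER REFERENCES Patient(pid),
        mid   INTEGER REFERENCES Physician(mid),
        since TEXT,
        PRIMARY KEY (pid, mid)
    );
    """)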

Discussion questions

Il-Yeol Song and Kristin Froehlich: Entity-Relationship Modeling: A Practical How-to Guide, IEEE Potentials 13(5): 29-45 (1994-1995).
"The Entity-Relationship (ER) model and its accompanying ER diagrams are widely used for database design and systems analysis.... We will present step-by-step guidelines, a set of decision rules proven to be useful in building ER diagrams, and a case study problem with a preferred answer as well as a set of incorrect diagrams for the problem."

Hossein Saiedian: An evaluation of extended entity-relationship model. Information & Software Technology 39(7): 449-462 (1997).
"The Entity-Relationship (ER) model allows a database designer to develop a high-level conceptual schema without having to consider low-level issues such as efficiency, the underlying database management system model, or physical data structures. The ER model has become very popular for database design and is used quite extensively. In order to strengthen its expressive power, many database researchers have introduced or proposed certain extensions to this model. Some of these extensions are important, while others add little expressive power but provide auxiliary features. Since the ER model is used so widely, it is important to know what extensions have been proposed for this model and what these extensions offer to the users. The objective of this article is thus to survey major extensions to the ER model and to evaluate their merits. We point out that lying behind the syntactical differences of the various extensions is the enriched semantics about relationships among entities. We also point out the close relationship between ER modeling and object-oriented data modeling."

Stephen Johnson, Carol Friedman, James J. Cimino, Tony Clark, George Hripcsak, and Paul D. Clayton: Conceptual data model for a central patient database. Proc. 15th Annual Symposium on Computer Applications in Medical Care (SCAMC). 1991, 381-385.
"This paper presents methods used to develop a conceptual model for a patient database forming the centerpiece of a clinical information system under development. Various modeling techniques are discussed using a simplifed fragment of the model. A method for mapping the model onto a relational design optimized for single patient retrievals is described. The results section discusses a number of issues pertaining to the flexibility and usability of this architecture."

4. SQL and interfaces (3 hours)

SQL DDL, SQL DML, QBE, Embedded SQL.

Discussion questions

Donald D. Chamberlin, Morton M. Astrahan, Kapali P. Eswaran, Patricia P. Griffiths, Raymond A. Lorie, James W. Mehl, Phyllis Reisner, and Bradford W. Wade: SEQUEL 2: A Unified Approach to Data Definition, Manipulation, and Control. IBM Journal of Research and Development 20(6): 560-575 (1976).
"SEQUEL 2 is a relational data language that provides a consistent, English keyword-oriented set of facilities for query, data definition, data manipulation, and data control. SEQUEL 2 may be used either as a stand-alone interface for nonspecialists in data processing or as a data sublanguage embedded in a host programming language for use by application programmers and data base administrators. This paper describes SEQUEL 2 and the means by which it is coupled to a host language."

Moshé M. Zloof: Query-by-Example: A Data Base Language. IBM Systems Journal 16(4): 324-343 (1977).
"Discussed is a high-level data base management language that provides the user with a convenient and unified interface to query, update, define, and control a data base. When the user performs an operation against the data base, he fills in an example of a solution to that operation in skeleton tables that can be associated with actual tables in the data base. The system is currently being used experimentally for various applications."

5. Query Processing (3 hours)

Role of indexes in efficient database application development; query optimization.

Surajit Chaudhuri and Vivek R. Narasayya: An Efficient Cost-Driven Index Selection Tool for Microsoft SQL Server. VLDB 1997: 146-155.
"In this paper we describe novel techniques that make it possible to build an industrial-strenth tool for automating the choice of indexes in the physical design of a SQL database. The tool takes as input a workload of SQL queries, and suggests a set of suitable indexes. We ensure that the indexes chosen are The Entity-Relationship (ER) model allows a database designer to develop a high-level conceptual schema without having to consider low-level issues such as efficiency, the underlying database management system model, or physical data structures. The ER model has become very popular for database design and is used quite extensively. In order to strengthen its expressive power, many database researchers have introduced or proposed certain extensions to this model. Some of these extensions are important, while others add little expressive power but provide auxiliary features. Since the ER model is used so widely, it is important to know what extensions have been proposed for this model and what these extensions offer to the users. The objective of this article is thus to survey major extensions to the ER model and to evaluate their merits. We point out that lying behind the syntactical differences of the various extensions is the enriched semantics about relationships among entities. We also point out the close relationship between ER modeling and object-oriented data modeling.effective in reducing the cost of the workload by keeping the index selection tool and the query optimizer "in step". The number of index sets that must be evaluated to find the optimal configuration is very large. We reduce the complexity of this problem using three techniques. First, we remove a large number of spurious indexes from consideration by taking into account both query syntax and cost information. Second, we introduce optimizations that make it possible to cheaply evaluate the "goodness" of an index set. Third, we describe an iterative approach to handle the complexity arising from multi-column indexes. The tool has been implemented on Microsoft SQL Server 7.0. We performed extensive experiments over a range of workloads, including TPC-D. The results indicate that the tool is efficient and its choices are close to optimal."

Surajit Chaudhuri: An Overview of Query Optimization in Relational Systems. PODS 1998: 34-43.
"There has been extensive work in query optimization since the early ’70s. It is hard to capture the breadth and depth of this large body of work in a short article. Therefore, I have decided to focus primarily on the optimization of SQL queries in relational database systems and present my biased and incomplete view of this field. The goal of this article is not to be comprehensive, but rather to explain the foundations and present samplings of significant work in this area. I would like to apologize to the many contributors in this area whose work I have failed to explicitly acknowledge due to oversight or lack of space. I take the liberty of trading technical precision for ease of presentation."

6. Transaction Management (3 hours)

Transaction models, concurrency control algorithms, database recovery.

Jim Gray, Raymond A. Lorie, Gianfranco R. Putzolu, and Irving L. Traiger: Granularity of Locks and Degrees of Consistency in a Shared Data Base. IFIP Working Conference on Modelling in Data Base Management Systems 1976: 365-394.
"The problem of choosing the appropriate granularity (size) of lockable objects is introduced and the tradeoff between concurrency and overhead is discussed. A locking protocol which allows simultaneous locking at various granularities by different transactions is presented. It is based on the introduction of additional lock modes besides the conventional share mode and exclusive mode. A proof is given of the equivalence of this protocol to a conventional one...."

Theo Haerder and Andreas Reuter: Principles of Transaction-Oriented Database Recovery. ACM Comput. Surv. 15(4): 287-317 (1983).
"In this paper, a terminological framework is provided for describing different transactionoriented recovery schemes for database systems in a conceptual rather than an implementation-dependent way. By introducing the terms materialized database, propagation strategy, and checkpoint, we obtain a means for classifying arbitrary implementations from a unified viewpoint. This is complemented by a classification scheme for logging techniques, which are precisely defined by using the other terms. It is shown that these criteria are related to all relevant questions such as speed and scope of recovery and amount of redundant information required. The primary purpose of this paper, however, is to establish an adequate and precise terminology for a topic in which the confusion of concepts and implementational aspects still imposes a lot of problems."

7. Database Views (3 hours)

Views and view management, materialized views.

Arthur M. Keller: The Role of Semantics in Translating View Updates. IEEE Computer 19(1): 63-73 (1986).
"... it is desirable to provide the users with interfaces that give them only information that is relevant to them. In shared relational databases, which are the subject of this article, this is done by defining views for each class of users. Views represent simplified models of the database, and users can express queries and updates against them. How to handle queries expressed against views is well understood: The user's query is composed with the view definition so as to obtain a query that can be executed on the underlying database. Similarly, updates expressed against a view have to be translated into updates that can be executed on the underlying database...."

David DeHaan, Per-Åke Larson, and Jingren Zhou: Stacked Indexed Views in Microsoft SQL Server. SIGMOD Conference 2005: 179-190.
"Appropriately selected materialized views (also called indexed views) can speed up query execution by orders of magnitude. Most database systems limit support for materialized views to select-project-join expressions, possibly with a group-by, over base tables because this class of views can be efficiently maintained incrementally and thus kept up to date with the underlying source tables. However, limiting views to reference only base tables restricts the class of queries that can be supported by materialized views. View stacking (also called views on views) relaxes one restriction by allowing a materialized view to reference both base tables and other materialized views. This extends materialized view support to additional types of queries. This paper describes a prototype implementation of stacked views within Microsoft SQL Server and explains which classes of queries can be supported. To support view matching for stacked views, a signature mechanism was added to the optimizer. This mechanism turned out to be beneficial also for regular views by significantly speeding up view matching."

Jonathan Goldstein, Per-Åke Larson: Optimizing Queries Using Materialized Views: A practical, scalable solution. SIGMOD Conference 2001: 331-342.
"Materialized views can provide massive improvements in query processing time, especially for aggregation queries over large tables. To realize this potential, the query optimizer must know how and when to exploit materialized views. This paper presents a fast and scalable algorithm for determining whether part or all of a query can be computed from materialized views and describes how it can be incorporated in transformation-based optimizers. The current version handles views composed of selections, joins and a final group-by. Optimization remains fully cost based, that is, a single best rewrite is not selected by heuristic rules but multiple rewrites are generated and the optimizer chooses the best alternative in the normal way. Experimental results based on an implementation in Microsoft SQL Server show outstanding performance and scalability. Optimization time increases slowly with the number of views but remains low even up to a thousand."

8. Distributed Databases (3 hours)

Database interoperability, record matching.

Walter V. Sujansky: Heterogeneous Database Integration in Biomedicine. Journal of Biomedical Informatics 34(4): 285-298 (2001).
"The rapid expansion of biomedical knowledge, reduction in computing costs, and spread of internet access have created an ocean of electronic data. The decentralized nature of our scientific community and healthcare system, however, has resulted in a patchwork of diverse, or heterogeneous, database implementations, making access to and aggregation of data across databases very difficult. The database heterogeneity problem applies equally to clinical data describing individual patients and biological data characterizing our genome. Specifically, databases
are highly heterogeneous with respect to the data models they employ, the data schemas they specify, the query languages they support, and the terminologies they recognize. Heterogeneous database systems attempt to unify disparate databases by providing uniform conceptual schemas that resolve representational heterogeneities, and by providing querying capabilities that aggregate and integrate distributed data. Research in this area has applied a variety of database and knowledge-based techniques, including semantic data modeling, ontology definition, query translation, query optimization, and terminology mapping. Existing systems have addressed heterogeneous database integration in the realms of molecular biology, hospital information systems, and application portability."

Glenn B. Bell, Anil Sethi: Matching records in a national medical patient index. Commun. ACM 44(9): 83-88 (2001).
"...There are many potential benefits of having a computerized medical history, including the elimination of duplicate tests and procedures, better access to patient histories for emergency use, elimination of reentry of historical data, and greater accuracy. To make all patients’ medical records in the U.S. accessible to care providers, electronic medical records need to be linked together using a massively distributed Master Patient Index (MPI)...."

9. Security and Privacy (3 hours)

Database security and authorization.

Elisa Bertino, Sushil Jajodia, and Pierangela Samarati: Database Security: Research and Practice. Inf. Syst. 20(7): 537-556 (1995).
"As an increasing number of organizations become dependent on access to their data over the Internet, the need for adequate security measures is becoming more and more critical. The most popular security measure these days is a firewall. However, a firewail is not immune to penetration, and it does not provide any protection of internal resources from insiders and successful intruders. One of the requirements for the protection of internal resources is access control to ensure that all accesses are authorized according to some specified policy. In this paper, we survey the state of the art in access control for database systems, discuss the main research issues, and outline possible directions for future research."

Rebecca Mercuri: The HIPAA-potamus in health care data security. Commun. ACM 47(7): 25-28 (2004).
"Regulations intended to improve health care data access have created new security risks along with headaches for patients and practitioners."

Gustavo H. M. B. Motta, Sergio Shiguemi Furuie: A contextual role-based access control authorization model for electronic patient record. IEEE Transactions on Information Technology in Biomedicine 7(3): 202-207 (2003).
"The design of proper models for authorization and access control for electronic patient record (EPR) is essential to a wide scale use of EPR in large health organizations. In this paper, we propose a contextual role-based access control authorization model aiming to increase the patient privacy and the confidentiality of patient data, whereas being flexible enough to consider specific cases. This model regulates user's access to EPR based on organizational roles. It supports a role-tree hierarchy with authorization inheritance; positive and negative authorizations; static and dynamic separation of duties based on weak and strong role conflicts. Contextual authorizations use environmental information available at access time, like user/patient relationship, in order to decide whether a user is allowed to access an EPR resource. This enables the specification of a more flexible and precise authorization policy, where permission is granted or denied according to the right and the need of the user to carry out a particular job function."

10. Data Warehousing (3 hours)

OLAP, data cubes.

Surajit Chaudhuri and Umeshwar Dayal: An Overview of Data Warehousing and OLAP Technology. SIGMOD Record 26(1): 65-74 (1997).
"Data warehousing and on-line analytical processing (OLAP) are essential elements of decision support, which has increasingly become a focus of the database industry. Many commercial products and services are now available, and all of the principal database management system vendors now have offerings in these areas. Decision support places some rather different requirements on database technology compared to traditional on-line transaction processing applications. This paper provides an overview of data warehousing and OLAP technologies, with an emphasis on their new requirements. We describe back end tools for extracting, cleaning and loading data into a data warehouse; multidimensional data models typical of OLAP; front end client tools for querying and data analysis; server extensions for efficient query processing; and tools for metadata management and for managing the warehouse. In addition to surveying the state of the art, this paper also identifies some promising research issues, some of which are related to problems that the database research community has worked on for years, but others are only just beginning to be addressed. This overview is based on a tutorial that the authors presented at the VLDB Conference, 1996."

Craig S. Ledbetter and Matthew W. Morgan: Toward Best Practice: Leveraging the Electronic Patient Record as a Clinical Data Warehouse. J. Healthcare Info. Mgmt 15(2): 119-131 (2001).
"Automating clinical and administrative processes via an electronic patient record (EPR) gives clinicians the point-of-care tools they need to deliver better patient care. However, to improve clinical practice as a whole and then evaluate it, healthcare must go beyond basic automation and convert EPR data into aggregated, multidimensional information. Unfortunately, few EPR systems have the established, powerful analytical clinical data warehproxy.lib.uwaterloo.caouses (CDWs) required for this conversion. This article describes how an organization can support best practice by leveraging a CDW that is fully integrated into its EPR and clinical decision support (CDS) system. The article (1) discusses the requirements for comprehensive CDS, including on-line analytical processing (OLAP) of data at both transactional and aggregate levels, (2) suggests that the transactional data acquired by an OLTP EPR system must be remodeled to support retrospective, population-based, aggregate analysis of those data, and (3) concludes that this aggregate analysis is best provided by a separate CDW system."

11. Data Streams (3 hours)

Streaming data, window queries.

Lukasz Golab and M. Tamer Özsu: Issues in data stream management. SIGMOD Record 32(2): 5-14 (2003).
"Traditional databases store sets of relatively static records with no pre-defined notion of time, unless timestamp attributes are explicitly added. While this model adequately represents commercial catalogues or repositories of personal information, many current and emerging applications require support for online analysis of rapidly changing data streams. Limitations of traditional DBMSs in supporting streaming applications have been recognized, prompting research to augment existing technologies and build new systems to manage streaming data. The purpose of this paper is to review recent work in data stream management systems, with an emphasis on application requirements, data models, continuous query languages, and query evaluation."

Alfonso F. Cardenas, Raymond K. Pon, Robert B. Cameron: Management of Streaming Body Sensor Data for Medical Information Systems. METMBS 2003: 186-191.

"Data retrieved from body sensors such as ECG machines and new-generation multi-sensor systems such as respiratory monitors are varied and abundant. Managing and integrating this streaming data with existing types of medical information such as patient records, laboratory results, procedures, medical documents, and medical images in a logical and intuitive way is a challenge. Not only is the management of such data difficult but so is the visualization and presentation of the data to physicians, specialists, and patients. Using a new generation of lifeshirts embedded with real time monitors for respiratory and heart functions as a testbed, we propose and have initiated development of a data management system for dealing with such streaming body sensor data, under the framework of an extensible architecture that will support easy visualization of such data."

12. Data Mining (1.5 hours)

Market basket analysis, association rules.

Evangelos Simoudis: Reality Check for Data Mining. IEEE Expert 11(5): 26-33 (1996).
"Data mining—the process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions—currently performs this task for a growing range of businesses. After presenting an overview of current data-mining techniques, this article explores two particularly noteworthy applications of those techniques: market basket analysis and customer segmentation."

Heikki Mannila: Methods and Problems in Data Mining. ICDT 1997: 41-55.
"Knowledge discovery in databases and data mining aim at semiautomatic tools for the analysis of large data sets. We consider some methods used in data mining, concentrating on levelwise search for all frequently occurring patterns. We show how this technique can be used in various applications. We also discuss possibilities for compiling data mining queries into algorithms, and look at the use of sampling in data mining. We conclude by listing several open research problems in data mining and knowledge discovery."

J. G. Richards, V. J. Rayward-Smith, P. H. Sönksen, S. Carey, and C. Weng: Data mining for indicators of early mortality in a database of clinical records. Artificial Intelligence in Medicine 22: 215-231 (2001).
"This paper describes the analysis of a database of diabetic patients’ clinical records and death certificates. The objective of the study was to find rules that describe associations between observations made of patients at their first visit to the hospital and early mortality. Pre-processing was carried out and a knowledge discovery in databases (KDD) package, developed by the Lanner Group and the University of East Anglia, was used for rule induction using simulated annealing. The most significant discovered rules describe an association that was not generally known or accepted by the medical community, however, recent independent studies confirm their validity."