DBRank 2010

08:30am - 10:00am Keynote

Speaker: Alon Halevy (Google)

Title: Table Search

Abstract: The Web contains hundreds of millions of high-quality structured data sets, in HTML tables, HTML lists and databases stored behind forms. Presenting these data sets in response to user queries is important for the quality of Web search. In addition, there are a growing number of contexts in which users would like to search specifically within a collection of structured data. Such search is hard because the typical signals used for text search do not apply as effectively to structured data. I will describe several scenarios in which table search is needed and contrast the treatments we’ve developed and continue to pursue at Google. In particular, I’ll touch on our efforts to surface content from the Deep Web, our work on collecting and answering queries over a collection of 150 million HTML tables, and our work on supporting the creation and sharing of tables with Google Fusion Tables, a new service for collaborating on structured data on the Web

Speaker's Bio: Dr. Alon Halevy received his Bachelors degree in Computer Science and Mathematics from the Hebrew University in Jerusalem in 1988, and his Ph.D in Computer Science from Stanford University in 1993. From 1993 to 1997, Dr. Halevy was a principal member of technical staff at AT&T Bell Laboratories, and then at AT&T Laboratories. He joined the faculty of the Computer Science and Engineering Department at the University of Washington in 1998. In 2004, Dr. Halevy founded Transformic Inc., a company that creates search engines for the deep web, content residing in databases behind web forms. Currently, he is at Google Inc. in Mountain View, California. Dr. Halevy was a Sloan Fellow (1999-2000), and received the Presidential Early Career Award for Scientists and Engineers (PECASE) in 2000. He serves on the editorial boards of the VLDB Journal, the Journal of Artificial Intelligence Research (currently, a member of the advisory committee), and ACM Transactions on Internet Technology. He served as the program chair for the ACM SIGMOD 2003 Conference, and has given several keynotes at top conferences.

10:00 - 10:30 Coffee break
10:30am - 12:00pm Session 1

Subspace Similarity Search Using the Ideas of Ranking and Top-k Retrieval
Thomas Bernecker , Tobias Emrich , Franz Graf , Hans-Peter Kriegel , Peer Kröger, Matthias Renz , Erich Schubert , Arthur Zimek (Ludwig-Maximilians-Universität München)

Abstract
: There are abundant scenarios for applications of similarity search in databases where the similarity of objects is defined for a subset of attributes, i.e., in a subspace, only. While much research has been done in efficient support of single column similarity queries or of similarity queries in the full space, scarcely any support of similarity search in subspaces has been provided so far. The three existing approaches are variations of the sequential scan. Here, we propose the first index-based solution to subspace similarity search in arbitrary subspaces which is based on the concepts of nearest neighbor ranking and top-k retrieval.


Efficient k-Nearest Neighbor Queries with the Signature Quadratic Form Distance
Christian Beecks, Merih Uysal, Thomas Seidl (RWTH Aachen University)

Abstract: A frequently encountered query type in multimedia databases is the k-nearest neighbor query which finds the k-nearest neighbors of a given query. To speed up such queries and to meet the user requirements in low response time, approximation techniques play an important role. In this paper, we present an efficient approximation technique applicable to distance measures defined over flexible feature representations, i.e. feature signatures. We apply our approximation technique to the recently proposed Signature Quadratic Form Distance applicable to feature signatures. We performed our experiments on numerous image databases, gathering k-nearest neighbor query rankings in significantly low computation time with an average speed-up factor of 13.


Top-k pipe join
Davide Martinenghi, Marco Tagliasacchi (Politecnico di Milano)

Abstract: In the context of service composition and orchestration, service invocation is typically scheduled according to execution plans, whose topology establishes whether different services are to be invoked in parallel or in a sequence. In the latter case, we may have a configuration, called pipe join, in which the output of a service is used as input for another service. When the services involved in a pipe join output results sorted by score, the problem arises of efficiently determining the join tuples (aka combinations) with the highest combined scores as computed by some aggregation function. In this paper, we study the top-k pipe join problem, i.e., the problem of joining the results of two services in a pipe join with the goal of finding the top-k combinations while minimizing the access cost.

12:00pm - 1:30pm Lunch break (on your own)
1:30pm - 2:30pm Session 2

On Novelty in Publish/Subscribe Delivery
Dimitris Souravlias, Marina Drosou, Kostas Stefanidis, Evaggelia Pitoura (University of Ioannina)

Abstract: In publish/subscribe systems, users express their interests in specific items of information and get notified when relevant data items are produced. Such systems allow users to stay informed without the need of going through huge amounts of data. However, as the volume of data being created increases, some form of ranking of matched events is needed to avoid overwhelming the users. In this work-in-progress paper, we explore novelty as a ranking criterion. An event is considered novel, if it matches a subscription that has rarely been matched in the past.


Ranking for Data Repairs
Mohamed Yakout, Ahmed Elmagarmid, Jennifer Neville (Purdue University)

Abstract: Improving data quality is a time-consuming, labor-intensive and often domain specific operation. A recent principled approach for repairing dirty database is to use data quality rules in the form of database constraints to identify dirty tuples and then use the rules to derive data repairs. Most of existing data repair approaches focus on providing fully automated solutions, which could be risky to depend upon especially for critical data. To guarantee the optimal quality repairs applied to the database, users should be involved to confirm each repair. This highlights the need for an interactive approach that combines the best of both; automatically generating repairs, while efficiently employing user's efforts to verify the repairs. In such approach, the user will guide an online repairing process to incrementally generate repairs. A key challenge in this approach is the response time within the user's interactive sessions, because the process of generating the repairs is time consuming due to the large search space of possible repairs. To this end, we present in this paper a mechanism to continuously generate repairs only to the current top k important violated data quality rules. Moreover, the repairs are grouped and ranked such that the most beneficial in terms of improving data quality comes first to consult the user for verification and feedback. Our experiments on real-world dataset demonstrate the effectiveness of our ranking mechanism to provide a fast response time for the user while improving the data quality as quickly as possible.