08:30am - 10:00am |
Keynote |
Speaker: Alon Halevy (Google)
Title: Table Search
Abstract: The Web contains hundreds of millions of high-quality
structured data sets, in HTML tables, HTML lists and databases stored behind
forms. Presenting these data sets in response to user queries is important
for the quality of Web search. In addition, there are a growing number of
contexts in which users would like to search specifically within a collection
of structured data. Such search is hard because the typical signals used
for text search do not apply as effectively to structured data. I will describe
several scenarios in which table search is needed and contrast the treatments
we’ve developed and continue to pursue at Google. In particular, I’ll touch
on our efforts to surface content from the Deep Web, our work on collecting
and answering queries over a collection of 150 million HTML tables, and
our work on supporting the creation and sharing of tables with Google Fusion
Tables, a new service for collaborating on structured data on the Web
Speaker's Bio: Dr. Alon Halevy received his Bachelors
degree in Computer Science and Mathematics from the Hebrew University in
Jerusalem in 1988, and his Ph.D in Computer Science from Stanford University
in 1993. From 1993 to 1997, Dr. Halevy was a principal member of technical
staff at AT&T Bell Laboratories, and then at AT&T Laboratories. He joined
the faculty of the Computer Science and Engineering Department at the University
of Washington in 1998. In 2004, Dr. Halevy founded Transformic Inc., a company
that creates search engines for the deep web, content residing in
databases behind web forms. Currently, he is at Google Inc. in Mountain
View, California. Dr. Halevy was a Sloan Fellow (1999-2000), and received
the Presidential Early Career Award for Scientists and Engineers (PECASE)
in 2000. He serves on the editorial boards of the VLDB Journal, the Journal
of Artificial Intelligence Research (currently, a member of the advisory
committee), and ACM Transactions on Internet Technology. He served as the
program chair for the ACM SIGMOD 2003 Conference, and has given several
keynotes at top conferences.
|
10:30am - 12:00pm |
Session 1 |
Subspace Similarity Search Using the Ideas of Ranking and Top-k
Retrieval
Thomas Bernecker , Tobias Emrich , Franz Graf , Hans-Peter Kriegel
, Peer Kröger, Matthias Renz , Erich Schubert , Arthur Zimek (Ludwig-Maximilians-Universität
München)
Abstract: There are abundant scenarios for applications of similarity
search in databases where the similarity of objects is defined for a subset
of attributes, i.e., in a subspace, only. While much research has been done
in efficient support of single column similarity queries or of similarity
queries in the full space, scarcely any support of similarity search in
subspaces has been provided so far. The three existing approaches are variations
of the sequential scan. Here, we propose the first index-based solution
to subspace similarity search in arbitrary subspaces which is based on the
concepts of nearest neighbor ranking and top-k retrieval.
Efficient k-Nearest Neighbor Queries with the Signature Quadratic
Form Distance
Christian Beecks, Merih Uysal, Thomas Seidl (RWTH Aachen University)
Abstract: A frequently encountered query type in multimedia
databases is the k-nearest neighbor query which finds the k-nearest neighbors
of a given query. To speed up such queries and to meet the user requirements
in low response time, approximation techniques play an important role. In
this paper, we present an efficient approximation technique applicable to
distance measures defined over flexible feature representations, i.e. feature
signatures. We apply our approximation technique to the recently proposed
Signature Quadratic Form Distance applicable to feature signatures. We performed
our experiments on numerous image databases, gathering k-nearest neighbor
query rankings in significantly low computation time with an average speed-up
factor of 13.
Top-k pipe join
Davide Martinenghi, Marco Tagliasacchi (Politecnico di Milano)
Abstract: In the context of service composition and orchestration,
service invocation is typically scheduled according to execution plans,
whose topology establishes whether different services are to be invoked
in parallel or in a sequence. In the latter case, we may have a configuration,
called pipe join, in which the output of a service is used as input for
another service. When the services involved in a pipe join output results
sorted by score, the problem arises of efficiently determining the join
tuples (aka combinations) with the highest combined scores as computed by
some aggregation function. In this paper, we study the top-k pipe join problem,
i.e., the problem of joining the results of two services in a pipe join
with the goal of finding the top-k combinations while minimizing the access
cost.
|
1:30pm - 2:30pm |
Session 2 |
On Novelty in Publish/Subscribe Delivery
Dimitris Souravlias, Marina Drosou, Kostas Stefanidis, Evaggelia Pitoura
(University of Ioannina)
Abstract: In publish/subscribe systems, users express their
interests in specific items of information and get notified when relevant
data items are produced. Such systems allow users to stay informed without
the need of going through huge amounts of data. However, as the volume of
data being created increases, some form of ranking of matched events is
needed to avoid overwhelming the users. In this work-in-progress paper,
we explore novelty as a ranking criterion. An event is considered novel,
if it matches a subscription that has rarely been matched in the past.
Ranking for Data Repairs
Mohamed Yakout, Ahmed Elmagarmid, Jennifer Neville (Purdue University)
Abstract: Improving data quality is a time-consuming, labor-intensive
and often domain specific operation. A recent principled approach for repairing
dirty database is to use data quality rules in the form of database constraints
to identify dirty tuples and then use the rules to derive data repairs.
Most of existing data repair approaches focus on providing fully automated
solutions, which could be risky to depend upon especially for critical data.
To guarantee the optimal quality repairs applied to the database, users
should be involved to confirm each repair. This highlights the need for
an interactive approach that combines the best of both; automatically generating
repairs, while efficiently employing user's efforts to verify the repairs.
In such approach, the user will guide an online repairing process to incrementally
generate repairs. A key challenge in this approach is the response time
within the user's interactive sessions, because the process of generating
the repairs is time consuming due to the large search space of possible
repairs. To this end, we present in this paper a mechanism to continuously
generate repairs only to the current top k important violated data quality
rules. Moreover, the repairs are grouped and ranked such that the most beneficial
in terms of improving data quality comes first to consult the user for verification
and feedback. Our experiments on real-world dataset demonstrate the effectiveness
of our ranking mechanism to provide a fast response time for the user while
improving the data quality as quickly as possible.
|