Real-world databases often contain errors because they integrate multiple sources with conflicting or imprecise information. Traditionally, a set of integrity constraints is specified on the database, and the sets of tuples violating those constraints are identified and updated to remove errors. Our system addresses data cleaning in a data warehouse and curation setting, where the database to be cleaned is generated through ETL and hence errors originate at a different place from where the constraints are defined. Treating lineage as a first-class citizen, our system tracks it, prunes the large space of lineage to precisely identify errors at the sources, and summarizes them succinctly to help prevent future errors in the transformation pipeline. (See publications)
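To make the lineage idea concrete, here is a minimal Python sketch, not the system's actual implementation. The tables, the `etl_join` transform, the `violates` constraint, and the pruning heuristic (drop source tuples that also feed clean warehouse rows) are all hypothetical, chosen only to illustrate tracing a constraint violation back to its source.

```python
from collections import Counter

# Hypothetical source tables: tuple id -> record (illustrative data only).
customers = {"c1": {"cust": 1}, "c2": {"cust": 2}}
orders = {
    "o1": {"cust": 1, "amount": 30},
    "o2": {"cust": 1, "amount": -5},   # erroneous source tuple
    "o3": {"cust": 2, "amount": 12},
}

def etl_join(customers, orders):
    """Join orders to customers, recording each output row's source tuple ids."""
    warehouse = []
    for oid, order in orders.items():
        for cid, customer in customers.items():
            if order["cust"] == customer["cust"]:
                warehouse.append({"row": dict(order), "lineage": {oid, cid}})
    return warehouse

def violates(row):
    """Assumed integrity constraint on the warehouse: amounts must be positive."""
    return row["amount"] <= 0

def suspect_sources(warehouse):
    """Count source tuples in the lineage of violating rows, then prune any
    source tuple that also contributes to a clean row (an assumed heuristic)."""
    bad = Counter()
    clean = set()
    for t in warehouse:
        if violates(t["row"]):
            bad.update(t["lineage"])
        else:
            clean |= t["lineage"]
    return {sid: n for sid, n in bad.items() if sid not in clean}

warehouse = etl_join(customers, orders)
suspects = suspect_sources(warehouse)
print(suspects)
```

In this toy data, customer `c1` appears in the lineage of both a clean and a violating row, so the pruning step leaves only order `o2` as a suspect.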
Social Network Mining
Typically, in systems such as IMDb, Yelp, Amazon, and Netflix, users interact with items and generate ratings and reviews. Identifying prominent interaction patterns between communities of users (e.g., students aged 18-25) and communities of items (e.g., Woody Allen’s comedy movies) is important for marketing analytics, and it is a challenging problem. Our system aims to discover patterns between communities of different entity types in social interaction systems. Patterns are ranked by a scoring function or selected to satisfy a diversity constraint over the universe of entities.
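As a rough illustration of ranking patterns between user and item communities, the Python sketch below groups ratings by (user community, item community) pairs and ranks the pairs. The data, the `rank_patterns` helper, the mean-rating score, and the `min_support` threshold are assumptions made for this example, not the scoring function the system uses.

```python
from collections import defaultdict

# Hypothetical community labels and ratings (illustrative data only).
users = {"u1": "age 18-25", "u2": "age 18-25", "u3": "age 40-60"}
items = {"i1": "comedy", "i2": "comedy", "i3": "drama"}
ratings = [
    ("u1", "i1", 5), ("u2", "i1", 4), ("u2", "i2", 5),
    ("u3", "i1", 2), ("u3", "i2", 1), ("u3", "i3", 4), ("u1", "i3", 3),
]

def rank_patterns(ratings, users, items, min_support=2):
    """Group ratings by (user community, item community) and rank the pairs
    by mean rating, keeping only pairs with at least min_support ratings."""
    groups = defaultdict(list)
    for user, item, rating in ratings:
        groups[(users[user], items[item])].append(rating)
    scored = [
        (sum(rs) / len(rs), len(rs), pair)
        for pair, rs in groups.items()
        if len(rs) >= min_support
    ]
    # Highest mean rating first; ties broken by support, then by pair name.
    return sorted(scored, reverse=True)

patterns = rank_patterns(ratings, users, items)
for score, support, pair in patterns:
    print(pair, round(score, 2), support)
```

A diversity constraint could be layered on top of this ranking, for example by limiting how many selected patterns may share a user community, but that step is left out of the sketch.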
Text generated by speech-to-text conversion systems from enterprise customer service interactions is noisy but rich in business intelligence. I worked on a system at IBM Research that mines such a text database. I developed algorithms to discover outlier conversations pertaining to a company’s business, machine learning techniques to learn generative models of text that can help build automated voice response systems, and privacy techniques to secure sensitive information in the data. (See publications)
CS 348 - Database Systems
CS 448 - Database Systems Implementation
CS 349 - User Interfaces