Anup Chalamalla

PhD Candidate, Database Research Group,

David R. Cheriton School of Computer Science


Research

Data Cleaning

 

Real-world databases often contain errors because they integrate data from multiple sources with conflicting or imprecise information. Traditionally, a set of integrity constraints is specified on the database, and the sets of tuples violating the constraints are identified and updated to remove errors. Our system addresses data cleaning in a data warehouse and curation setting, where the database to be cleaned is generated through ETL and hence errors originate at a different place from where the constraints are defined. Treating lineage as a first-class citizen, our system tracks it, prunes the large space of lineage to precisely identify errors at the sources, and summarizes them succinctly to help prevent future errors in the transformation pipeline. (See publications)

 

Social Network Mining

 

In systems such as IMDb, Yelp, Amazon, and Netflix, users interact with items and generate ratings and reviews. Identifying prominent interaction patterns between communities of users (e.g., students aged 18-25) and communities of items (e.g., Woody Allen’s comedy movies) is important for marketing analytics, and it is a challenging problem. Our system aims to discover patterns between communities of different entity types in social interaction systems. Patterns are ranked by a scoring function, or chosen to satisfy a diversity constraint on the universe of entities.

 

Text Mining

 

Text generated by speech-to-text conversion systems from enterprise customer service interactions is both noisy and rich in business intelligence for enterprises. I worked on a system at IBM Research that mines such a text database. I developed algorithms to discover outlier conversations pertaining to a company’s business, machine learning techniques to learn generative models of text that can help build automated voice response systems, and privacy techniques to secure sensitive information in the data. (See publications)

Publications [DBLP]

Patents

Teaching Assistantship

CS 348 - Database Systems

CS 448 - Database Systems Implementation

CS 349 - User Interfaces