The Use of Unlabeled and Weakly Labeled Data in Classification Tasks
While the modern theory of classification learning focuses predominantly on learning from labeled data, in practice such data is often expensive and hard to come by. In such cases, classification algorithms try to utilize training data that is either unlabeled (semi-supervised learning, or SSL) or labeled unreliably.
I have been working on analyzing SSL paradigms and the assumptions needed to guarantee their success. Negative results were shown in (COLT 2008 with Lu and Pal), and a scenario under which the addition of unlabeled samples is provably beneficial is described in (ICML 2011 with Urner and Shalev-Shwartz).
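To make the SSL setting concrete, below is a minimal self-training sketch in Python (assuming NumPy and scikit-learn): a classifier trained on a small labeled pool repeatedly pseudo-labels the unlabeled points it is most confident about. The dataset, pool sizes, and `confidence_threshold` are illustrative choices of mine; this is a generic SSL baseline, not one of the constructions analyzed in the papers cited here.

```python
# A minimal self-training sketch: an illustrative SSL baseline, not the
# algorithm from any of the cited papers. Data and thresholds are made up.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

# Hide most labels: keep 5 labeled points per class, treat the rest as unlabeled.
labeled = np.concatenate([
    rng.choice(np.where(y == c)[0], size=5, replace=False) for c in (0, 1)
])
mask = np.zeros(len(X), dtype=bool)
mask[labeled] = True
X_lab, y_lab = X[mask], y[mask]
X_unl = X[~mask]

confidence_threshold = 0.95  # pseudo-label only high-confidence points
for _ in range(10):
    clf = LogisticRegression().fit(X_lab, y_lab)
    if len(X_unl) == 0:
        break
    proba = clf.predict_proba(X_unl)
    pick = proba.max(axis=1) >= confidence_threshold
    if not pick.any():
        break  # no confident pseudo-labels left; stop
    # Move confidently pseudo-labeled points into the labeled pool.
    X_lab = np.vstack([X_lab, X_unl[pick]])
    y_lab = np.concatenate([y_lab, proba[pick].argmax(axis=1)])
    X_unl = X_unl[~pick]

print("final labeled-pool size:", len(X_lab))
```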
Another solution to the scarcity of correctly labeled training data is the use of training examples labeled by less-than-perfect supervisors. These may include novice annotators or labels obtained via Internet crowdsourcing (such as Amazon's Mechanical Turk). In (AISTATS 2012 with Urner and Shamir) we initiated a theoretical study of such setups.
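As a toy illustration of the weak-teacher setting (not the procedure analyzed in the AISTATS 2012 paper), the sketch below simulates several annotators who each flip the true binary label with some probability and aggregates their answers by majority vote; the flip rate and teacher count are made-up parameters.

```python
# A toy simulation of labels from imperfect annotators, aggregated by
# majority vote. Illustrative only; flip_rate and n_teachers are made up.
import numpy as np

rng = np.random.RandomState(1)
n_examples, n_teachers = 1000, 5
true_labels = rng.randint(0, 2, size=n_examples)

flip_rate = 0.3  # each weak teacher flips a label with probability 0.3
noisy = np.array([
    np.where(rng.rand(n_examples) < flip_rate, 1 - true_labels, true_labels)
    for _ in range(n_teachers)
])

majority = (noisy.mean(axis=0) > 0.5).astype(int)
print("single-teacher accuracy:", (noisy[0] == true_labels).mean())
print("majority-vote accuracy:", (majority == true_labels).mean())
```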
Publications
- On the Hardness of Domain Adaptation (And the Utility of Unlabeled Target Samples)
  Shai Ben-David and Ruth Urner, ALT 2012
- Domain Adaptation--Can Quantity Compensate for Quality?
  Shai Ben-David, Shai Shalev-Shwartz, and Ruth Urner, ISAIM 2012
- Learning from Weak Teachers
  Shai Ben-David, Ruth Urner, and Ohad Shamir, AISTATS 2012
- Access to Unlabeled Data Can Speed Up Prediction Time
  Ruth Urner, Shai Shalev-Shwartz, and Shai Ben-David, ICML 2011
- Does Unlabeled Data Provably Help? Worst-case Analysis of the Sample Complexity of Semi-Supervised Learning
  Shai Ben-David, Tyler Lu, and David Pal, COLT 2008