Shai Ben-David


School of Computer Science
University of Waterloo
200 University Avenue West
Waterloo, Ontario, Canada, N2L 3G1

Email: "MyFirstName"
Phone: +1 519 888 4567 ext. 37523
Fax: +1 519 885-1208

The Use of Unlabeled and Weakly Labeled Data in Classification Tasks

While the modern theory of classification learning focuses predominantly on learning from labeled data, in practice such data is often expensive and hard to come by. In such cases, classification algorithms try to utilize training data that is either unlabeled (semi-supervised learning, or SSL) or labeled unreliably.

I have been working on analyzing SSL paradigms and the assumptions needed to guarantee their success. Negative results were shown in (COLT 2009, with Lu and Pal), and a scenario under which the addition of unlabeled samples is provably beneficial is described in (ICML 2011, with Urner and Shalev-Shwartz).

Another solution to the scarcity of correctly labeled training data is the use of training examples labeled by less-than-perfect supervisors. These may range from novice annotators to labels obtained via Internet crowdsourcing (such as Amazon's Mechanical Turk). In (AISTATS 2012) we initiated theoretical research into such setups.