PhD Seminar • Data Systems (Information Retrieval) — Sampling Strategies and Active Learning for Volume EstimationExport this event to calendar

Tuesday, March 12, 2019 11:00 AM EDT

Haotian Zhang, PhD candidate
David R. Cheriton School of Computer Science

This paper tackles the challenge of accurately and efficiently estimating the number of relevant documents in a collection for a particular topic. One real-world application is estimating the volume of social media posts (e.g., tweets) pertaining to a topic, which is fundamental to tracking the popularity of politicians and brands, the potential sales of a product, etc. Our insight is to leverage active learning techniques to find all the “easy” documents, and then to use sampling techniques to infer the number of relevant documents in the residual collection. 

We propose a simple yet effective technique for determining this “switchover” point, which intuitively can be understood as the “knee” in an effort vs. recall gain curve, as well as alternative sampling strategies beyond the knee. We show on several TREC datasets and a collection of tweets that our best technique yields more accurate estimates (with the same effort) than several alternatives.

Location 
DC - William G. Davis Computer Research Centre
2102
200 University Avenue West

Waterloo, ON N2L 3G1
Canada

S M T W T F S
30
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
1
2
3
  1. 2024 (100)
    1. April (23)
    2. March (27)
    3. February (25)
    4. January (25)
  2. 2023 (296)
    1. December (20)
    2. November (28)
    3. October (15)
    4. September (25)
    5. August (30)
    6. July (30)
    7. June (22)
    8. May (23)
    9. April (32)
    10. March (31)
    11. February (18)
    12. January (22)
  3. 2022 (245)
  4. 2021 (210)
  5. 2020 (217)
  6. 2019 (255)
  7. 2018 (217)
  8. 2017 (36)
  9. 2016 (21)
  10. 2015 (36)
  11. 2014 (33)
  12. 2013 (23)
  13. 2012 (4)
  14. 2011 (1)
  15. 2010 (1)
  16. 2009 (1)
  17. 2008 (1)