Empirical Software Engineering is a well-established area of Software Engineering research. However, most of this research has focused on examining a small set of projects. More recently, there has been increased interest in mining data from ultra-large-scale software repositories. For example, there have been efforts to extract all the software projects in SourceForge, GitHub, etc. The main reasons for studying such large repositories are (a) to understand the effects of the ecosystem, (b) to discover the underlying similarities between the different projects in the repository, and (c) to arrive at more generalizable conclusions.

This seminar course explores leading research in Empirical Software Engineering using Ultra Large Repositories, discusses challenges associated with mining such repositories, highlights success stories, and outlines future research directions. The course includes studies on a variety of repositories, such as projects in GitHub or SourceForge, apps in the Android Market, and discussions on Stack Overflow. Students will acquire the knowledge needed to perform research or conduct practice in the field. On completing the course, students should be able to integrate Empirical Software Engineering into their own research or practice.


Classes are held on Wed (01:00-03:50PM) in DC 2568.
Classes start Sept 13, 2017.

Contact Instructor: Mei Nagappan. Please prefix the subject line of your email with the code CS846 for a timely reply.
Each class, students will present and discuss around three papers. A detailed schedule is available here. Each class will cover papers along one of the following themes:

  • Source code similarity
  • Mobile Apps
  • API Mining
  • Bugs
  • Testing
  • Programming languages
  • CI
  • Collecting Large Datasets (The process used to collect the dataset)
  • Parallel processing techniques like MapReduce/Pig Queries

Students are expected to have some background in software development and software engineering. Knowledge of data mining techniques is beneficial but not required.

Students will be evaluated using the following breakdown:
1. Weekly critique (24% - 3% for each week):
Each week, each student should critique all the papers for that week and submit, via Learn (folder called Week#), a one-page critique of each paper before the start of class. Each critique should offer a brief half-page summary of the paper plus three things that you want to discuss about the paper in class. You do not need to submit a critique for the paper you are presenting, but you do need to submit critiques for the other papers that week. Additional advice for critiquing papers is here.

The one-page document should have your name at the top. The document name should follow this template: Week#_Paper#_YourName.
2. Classroom participation (15%):
Students are expected to read all papers covered in a week, come to class prepared to discuss their thoughts, and take part in the classroom discussions. As a discussant, you should take an adversarial position by pointing out weak and controversial positions in the paper. You should present a short rebuttal of the paper. You should come prepared with problems and counterexamples for the presented work (note that this is what you write in the document that you submit each week for each paper; see above).
I will hand out a token every time a student speaks up in class. At the end of each class, each student's tokens will be tallied. At the end of the term, you will get a score based on the number of tokens you have, normalized against the maximum number of tokens earned by any student.
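The normalization described above can be sketched as follows. This is only an illustration, assuming the score scales linearly with the token count and that the top token earner receives the full 15% participation weight; the function name `participation_scores` and the student names are hypothetical and not part of the course materials.

```python
def participation_scores(tokens, weight=15.0):
    """Scale each student's token count against the class maximum.

    tokens: mapping of student name -> total tokens earned over the term.
    weight: points available for participation (assumed to be the 15%
            listed in the grading breakdown).
    Linear scaling is an assumption; the syllabus only says the score is
    normalized against the maximum number of tokens.
    """
    top = max(tokens.values(), default=0)
    if top == 0:
        # Nobody earned a token, so there is nothing to normalize against.
        return {name: 0.0 for name in tokens}
    return {name: weight * count / top for name, count in tokens.items()}


# Hypothetical end-of-term tallies.
scores = participation_scores({"student_a": 10, "student_b": 5})
print(scores)
```

Under these assumptions, a student with half as many tokens as the most active student would receive half of the participation weight.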

3. Paper presentation (15%):
Each paper will be assigned to one student who will act as a presenter. The presentation will last strictly 15 minutes and the discussion will last 20-30 minutes. Each student should upload the slides to Learn (folder called Presentation) by noon on Wed before class.
Role of presenter: As a presenter, you should not simply repeat the paper's content (remember, you only have 15 minutes); instead, you should point out the main findings of the work. You should highlight any novel contributions, any surprises, and other possible applications of the proposed techniques. You should check the authors' other work related to the presented paper. Finally, you should place the work relative to other papers covered in the course (especially the papers covered in that particular week). Your presentation should have:

  • one slide that lists the main contributions of the paper.
  • one slide (at least) explaining the data mining/analysis technique used in the paper.
  • one slide that places the paper relative to any recent work done by the authors of the paper.
  • one slide that places the paper relative to other papers presented that week.
  • as the final slide, a listing of at least three technical points that you liked and three areas that should be improved.

4. Project (50%):
One original project (10 pages IEEE format) done alone. The project will be a replication of one of the papers covered in the course or something similar.
You need to submit a project proposal (2 pages IEEE format) by week 3. The proposal should provide a brief description of the paper you want to replicate with a motivation for why you want to replicate this paper. It should also have a detailed discussion of the data that will be used in the project: where will you get the data originally used in the paper, and what is the new data that you want to replicate the experiment on.
Based on my feedback, you will give a 5-minute presentation to the class in Week 5. The proposal paper and presentation are together worth 10%.
There are two major milestones: (1) about 1.5 months before the end of term, you submit a report of at least 2 pages in IEEE style on the replication of the paper with the original data, for 10% of the grade, and (2) by the end of the term, you present the replication of the paper on new data (10% for the class presentation and 20% for the final paper).
Your project will be graded according to the depth of your work, the correctness of your analysis, and the presentation quality of your written report and of your class presentation in the last week of class. The GitHub, SourceForge, World of Code, Stack Overflow, Android, and Apache datasets are possible sources of data to use for your project. Advice on writing a project report is here.

UWaterloo policy on academic integrity will be strictly enforced.