COURSE OVERVIEW
Empirical Software Engineering is a well-established research area within Software Engineering. However, most such research has focused on examining a small set of projects. More recently, there has been increased interest in mining data from ultra-large-scale software repositories. For example, there have been efforts to extract all the software projects in SourceForge, GitHub, etc. The main reasons for studying such large repositories are (a) to understand the effects of the ecosystem, (b) to discover the underlying similarities between the different projects in the repository, and (c) to arrive at more generalizable conclusions.

COURSE OBJECTIVES
This seminar course explores leading research in Empirical Software Engineering using Ultra Large Repositories, discusses challenges associated with mining such repositories, highlights success stories, and outlines future research directions. The course includes studies on a variety of repositories, such as projects in GitHub or SourceForge, apps in the Android Market, discussions on Stack Overflow, etc. Students will acquire the knowledge needed to perform research or conduct practice in the field. On completing the course, students should be able to integrate Empirical Software Engineering into their own research or practice.

COURSE SCHEDULE

Classes are held on Mon (01:00-03:50PM) in DC 2568.
Classes start Jan 09, 2017.

Contact Instructor: Mei Nagappan. Please prefix the subject line of your email with the code CS846 for a timely reply.
Each class, students will present and discuss around three papers. A detailed schedule is available here. Each class will cover papers along one of the following themes:

  • Source code similarity
  • Mobile Apps
  • API Mining
  • Bugs
  • Testing
  • Programming languages
  • Continuous integration (CI)
  • Collecting Large Datasets (The process used to collect the dataset)
  • Parallel processing techniques like MapReduce/Pig Queries

COURSE REQUIREMENTS
Students are expected to have some background in software development and software engineering. Knowledge of data mining techniques is beneficial but not required.

Students will be evaluated using the following breakdown:

1. Classroom participation (10%):
Students are expected to read all papers covered in a week, come to class prepared to discuss their thoughts, and take part in the classroom discussions. I will call upon students at random.

2. Paper presentation and discussion (20% - 10% for each):
Each paper will be assigned to one student who will act as both presenter and discussant. The presentation is strictly limited to 15 mins and the discussion will last 20-30 mins. Each student should email the slides to me by noon on Monday before class.

  • Role of presenter: As a presenter, you should not simply repeat the paper's content (remember you only have 15 mins). Instead, you should point out the main findings of the work. You should highlight any novel contributions, any surprises, and other possible applications of the proposed techniques. You should check the authors' other work related to the presented paper. Finally, you should place the work relative to other papers covered in the course (especially the papers covered in that particular week).
  • Role of discussant: As a discussant, you should take an adversarial position by pointing out weak and controversial positions in the paper. You should present a short rebuttal of the paper. You should come prepared with problems and counterexamples for the presented work.
Your presentation should have:
  • one slide that lists the main contributions of the paper.
  • one slide (at least) explaining the data mining/analysis technique used in the paper.
  • one slide that places the paper relative to any recent work done by the authors of the paper.
  • one slide that places the paper relative to other papers presented that week.
  • as the final slide, a listing of at least three technical points that you liked and three areas that should be improved.
3. Weekly critique (21% - 3% for each week):
Each week, each student should pick one of the papers for that week and submit, via email, a one-page critique of the paper before the start of class. The critique should offer a brief summary of the paper, points in favor, points against, and comments for improvement. You do not need to submit a critique if you are presenting that week. Additional advice for critiquing papers is here.

The document should have your name at the top. The file name should follow this template: Week#_Paper#_YourName.

4. Project (50%):
One original project (10 pages IEEE format) done alone or in a group of 2 or 3 students. The project will explore one or more of the themes covered in the course.
You need to submit a project proposal (2 pages IEEE format) around 1.5 months before the end of term. The proposal should provide a brief motivation for the project, a detailed discussion of the data that will be used in the project, a timeline of milestones, and the expected outcome. Make sure you have cited at least 3 papers in your proposal. Additional advice for project proposals is here.

Your project will be graded according to its originality and interestingness, the depth of your work, the correctness of your analysis, and the presentation quality of your written report and class presentation (a 20 mins presentation done in the last week of classes). The GitHub, SourceForge, World of Code, Stack Overflow, Android and Apache datasets are possible sources of data to use for your project. Advice on writing a project report is here.