
Potential Datasets and Problems

The datasets below provide background for course projects and cover a range of application domains and data types. Because of their size and complexity, a dataset may be used only in part. Students are required to define their own research problems to solve. Students are also encouraged to combine these datasets with other data sources or applications to foster novel research questions and visualization techniques. To brainstorm project ideas related to AI explainability, refer to the proceedings of the VISxAI workshops and the course reading lists.

Document layout analysis (images)

The PRImA layout analysis dataset consists of realistic documents with a wide variety of layouts, especially magazines and technical/scientific publications. There are 305 ground-truth document images in the final release, and more are being added.

Programming pattern exploration (code)

Jupyter Notebook is one of the most popular platforms for data scientists to conduct exploratory programming. The UCSD digital collection provides 1.25 million publicly available Jupyter notebooks collected from GitHub.

Business reviews analysis (text)

The Yelp reviews dataset contains about 8 million reviews and 200K businesses, covering 10 metropolitan areas.

Academic collaboration analysis (networks)

Vispubdata contains information on IEEE Visualization (IEEE VIS) publications from 1990 to 2018. For each paper, the dataset includes metadata such as title, authors, and DOI, as well as a list of its citations to earlier VIS papers.
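
As a starting point for network analysis, per-paper author lists can be turned into a weighted co-authorship edge list. The following is a minimal sketch only: the records, the field names `title` and `authors`, and the `;` separator are hypothetical placeholders, not the actual Vispubdata schema, so they would need to be adapted to the real column names.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample records; the field names and the ";" separator
# are assumptions for illustration, not the real dataset schema.
papers = [
    {"title": "Paper A", "authors": "Alice;Bob;Carol"},
    {"title": "Paper B", "authors": "Alice;Bob"},
    {"title": "Paper C", "authors": "Dave"},
]


def coauthor_edges(records):
    """Count how many papers each unordered pair of authors shares."""
    edges = Counter()
    for record in records:
        # Deduplicate and sort names so each pair has one canonical order.
        names = sorted(
            {name.strip() for name in record["authors"].split(";") if name.strip()}
        )
        for pair in combinations(names, 2):
            edges[pair] += 1
    return edges


edges = coauthor_edges(papers)
for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```

Single-author papers contribute no edges in this sketch. The resulting pair counts could then be loaded into a graph library or a visualization tool of the student's choice.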

New York taxi trip records (sequences)

The TLC trip record data cover yellow and green taxi trips in NYC from 2009 to 2020, with attributes such as pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, and itemized fares.

Example Final Project Papers from Previous Classes


Copyright © 2023 Jian Zhao.