I do both systems and theoretical research in data management and processing. My systems work focuses on developing systems for managing, querying, and analyzing graph-structured data. My main ongoing systems projects are: Graphflow, a new graph database we are building from scratch (here is a talk I gave at Pinterest that overviews the main features of the system); GRainDB, an extension of the DuckDB RDBMS to support hybrid graph-relational querying; Graphsurge, a new graph data analytics system we are developing on top of the Timely and Differential Dataflow systems; and KTabulator, a system for interactively collecting data and creating tables from large open-source knowledge graphs. My theoretical work studies the theoretical aspects of query processing algorithms.
GraphflowDB is a graph database management system (GDBMS) we are building from scratch. The system is implemented in Java and supports the openCypher language. Our research focuses on rethinking each core database component for contemporary GDBMSs, including query optimization and processing techniques such as join optimization, indexing, and cardinality estimation.
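For illustration, here is a small query of the kind an openCypher-based GDBMS evaluates; the `Person` label, `name` property, and `FOLLOWS` edges form a hypothetical schema, not one shipped with GraphflowDB:

```cypher
// Find the names of people whom someone named "Alice" follows.
// The pattern (a)-[:FOLLOWS]->(b) is a two-way join between
// the node table and the edge (adjacency) index.
MATCH (a:Person)-[:FOLLOWS]->(b:Person)
WHERE a.name = 'Alice'
RETURN b.name;
```

Evaluating such multi-hop patterns efficiently is exactly where the join optimization, indexing, and cardinality estimation work mentioned above comes in.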
Graphsurge is a new system built on top of Timely Dataflow and Differential Dataflow for performing analytical computations on multiple snapshots or views of large-scale static property graphs. Graphsurge allows users to create view collections, i.e., sets of related views of a graph created by applying filter predicates on node and edge properties, and to run analytical computations on all the views of a collection efficiently. The system is designed to support contingency, perturbation, or temporal analysis applications that require running computations on thousands of different but similar graph snapshots at a time.
GPS, a Graph Processing System, is an open-source system for scalable, fault-tolerant, and easy-to-program execution of algorithms on extremely large graphs. GPS is similar to Google's proprietary Pregel system and to Apache Giraph. GPS is a distributed system designed to run on a cluster of machines, such as Amazon EC2.
An introduction to database management systems. The course covers topics in three areas: (1) fundamental features of relational database management systems: the relational data model and its query languages, integrity constraints, indexes and views, and transactions; (2) database design methodology; and (3) core topics about the internals and architectures of DBMSs, such as physical record design, query planning and optimization, indexes, and transaction protocols.
The study of efficient algorithms and effective algorithm design techniques. Topics include divide-and-conquer algorithms, recurrences, greedy algorithms, dynamic programming, graph search and backtracking, and the theory of NP-completeness and its implications.
The seminar is an updated version of a previous seminar that focused primarily on graph data management in Fall 2018. The current offering covers fewer database topics and instead surveys a wider range of topics in graph analytics. The goal of this offering is to showcase the specific interests of the very wide range of scientific communities that work on graphs, including communities within computer science, such as databases, semantic web, HPC and computer architecture, data mining, machine learning, and HCI, as well as communities outside of computer science, such as physics and neuroscience. The seminar's reading list is particularly tailored for students interested in pursuing graduate studies in graph processing and analytics, as it tries to cover some of the seminal work on graph processing from multiple communities.
The seminar covered the historical waves that made graph data models popular, such as the web and the semantic web. The seminar also covered recent topics popular in the database research community, e.g., modern graph databases, graph data processing systems based on Hadoop- and Spark-like software, knowledge graphs, and example machine learning applications on graphs.
The seminar surveyed the models that underlie modern large-scale data processing systems, e.g., MapReduce, Spark, Pregel, Flink, Storm, Timely, and others. The goal was to identify the fundamental advantages and limitations of the different models and to demonstrate the systems and applications that are built on each model.