I do both systems and theoretical research in data management and processing. My systems work focuses on developing systems for managing, querying, or doing analytics on graph-structured data. My main ongoing systems project is Kùzu, a new embeddable graph database management system (GDBMS) designed for high scalability and very fast querying. See this blog post that describes the vision of the system and these talks (1 and 2). Kùzu is based on our earlier system GraphflowDB (see this talk I gave at Pinterest for an overview of GraphflowDB).
Here are a few links related to Kùzu:
Kùzu is a new embeddable graph database management system (GDBMS) that is designed for high scalability and very fast querying. Kùzu is being actively developed and is acquiring more and more features. Kùzu's core architecture is informed by insights from our previous GraphflowDB project, but Kùzu is disk-based and, more importantly, aims to be a fully functional, user-facing system. Our research on Kùzu focuses on techniques for making GDBMSs competent at knowledge graph management and in graph data science pipelines. Specifically, we are studying how to integrate GDBMSs better into the Python graph data science ecosystem and how to make them performant on highly heterogeneous and string/URI-heavy knowledge graphs.
GraphflowDB is a graph database management system (GDBMS) we are building from scratch. The system is implemented in Java and supports the openCypher language. Our research focuses on rethinking each core database component for contemporary GDBMSs, including core query optimization and processing techniques such as join optimization, indexes, and cardinality estimation.
Graphsurge is a new system built on top of Timely Dataflow and Differential Dataflow for performing analytical computations on multiple snapshots or views of large-scale static property graphs. Graphsurge allows users to create view collections, which are sets of related views of a graph created by applying filter predicates on node and edge properties, and to run analytical computations on all the views of a collection efficiently. The system is designed to support contingency, perturbation, or temporal analysis applications that require running computations on thousands of similar graph snapshots at a time.
GPS, A Graph Processing System, is an open-source system for scalable, fault-tolerant, and easy-to-program execution of algorithms on extremely large graphs. GPS is similar to Google's proprietary Pregel system and to Apache Giraph. GPS is a distributed system designed to run on a cluster of machines, such as Amazon's EC2.
This seminar covers seminal work on knowledge graph representation, querying, and management, as well as past and, primarily, modern applications that are powered by knowledge graphs. Topics include knowledge models, ontologies, query languages, graph data management systems, public knowledge graphs, knowledge graph embeddings, and popular successful applications, past and present.
An introduction to database management systems. The course covers topics in three areas: (1) fundamental features of relational database management systems: the relational data model and its query languages, integrity constraints, indexes and views, and transactions; (2) database design methodology; and (3) core topics about the internals and architectures of DBMSs, such as physical record design, query planning and optimization, indexes, and transaction protocols.
The study of efficient algorithms and effective algorithm design techniques. Topics include divide-and-conquer algorithms, recurrences, greedy algorithms, dynamic programming, graph search and backtracking, and the theory of NP-completeness and its implications.
The seminar is an updated version of a previous seminar that focused primarily on graph data management in Fall 2018. The current offering covers fewer database topics and instead surveys a wider range of topics in graph analytics. The goal of this offering is to showcase the specific interests of a very wide range of scientific communities that work on graphs, including communities within computer science, such as databases, semantic web, HPC and computer architecture, data mining, machine learning, and HCI, as well as communities outside of computer science, such as physics and neuroscience. The seminar's reading list is particularly tailored for students who are interested in pursuing graduate studies in graph processing and analytics, as it tries to cover some of the seminal work on graph processing from multiple communities.
The seminar covered the historical waves that made graph data models popular, such as the web and the semantic web. The seminar also covered recent topics popular in the database research community, e.g., modern graph databases, graph data processing software built on Hadoop and Spark-like systems, knowledge graphs, and example machine learning applications on graphs.
The seminar surveyed the models that underlie modern large-scale data processing systems, e.g., MapReduce, Spark, Pregel, Flink, Storm, Timely, and others. The goal was to identify the fundamental advantages and limitations of different models and to demonstrate the systems and applications that are built on each model.