2015 technical reports

CS-2015-01
Title

Hermes: Dynamic Partitioning for Distributed Social Network Graph Databases

Authors

Daniel Nicoara, Shahin Kamali, Khuzaima Daudjee and Lei Chen

Abstract

Social networks are large graphs that require multiple graph database servers to store and manage them. Each database server hosts a graph partition with the objectives of balancing server loads, reducing remote traversals (edge-cuts), and adapting the partitioning to changes in the structure of the graph in the face of changing workloads. To achieve these objectives, a dynamic repartitioning algorithm is required to modify an existing partitioning to maintain good quality partitions while not imposing a significant overhead to the system. In this paper, we introduce a lightweight repartitioner, which dynamically modifies partitioning using a small amount of resources. In contrast to the existing repartitioning algorithms, our lightweight repartitioner is efficient, making it suitable for use in a real system. We integrated our lightweight repartitioner into Hermes, which we designed as an extension of the open source Neo4j graph database system, to support workloads over partitioned graph data distributed over multiple servers.

Using real-world social network data, we show that Hermes leverages the lightweight repartitioner to maintain high quality partitions and provides a two to three times performance improvement over the de-facto standard random hash-based partitioning.

Date February
Report Hermes: Dynamic Partitioning for Distributed Social Network Graph Databases (PDF)
CS-2015-02
Title

The SPOT* System for Flexible Personal
Heating and Cooling

Authors

Alimohammad Rabbani and S. Keshav

Abstract Building on our prior work on the SPOT and SPOT+ Smart
Personalized Office Thermal control systems, this paper
presents SPOT*, a cost-effective, flexible system for person-
alized heating and cooling. Like its predecessors, SPOT*
senses occupancy and worker comfort to reactively control
office temperature. Specifically, it uses the Predicted Mean
Vote equation to determine worker comfort and actuates a
fan or a heater to adjust the comfort level so that it lies
between -0.5 and +0.5 in the ASHRAE comfort scale. Un-
like the two prior systems, however, which use many sensors
and a per-user compute engine, SPOT* greatly reduces costs
by using the fewest possible sensors and a lightweight com-
pute engine that can optionally even be located in the cloud.
Moreover, SPOT* provides both cooling and heating using
a speed-controlled desktop fan, rather than only controlling
heating using a radiant heater. Finally, SPOT* is less intrusive in that it does not use a camera. We find that the per-user cost for SPOT* is about $185 compared to $1000 for SPOT. Moreover, in a preliminary deployment, SPOT* is able to improve user comfort by 78%.
Date February 6, 2015
Report The SPOT* System for Flexible Personal
Heating and Cooling (PDF)
CS-2015-03
Title

Evaluating Call Graph Construction for JVM-hosted Language Implementations

Authors

Xiaoni Lai, Zhaoyi Luo, Karim Ali, Ondřej Lhoták, Julian Dolby, and Frank Tip

Abstract

An increasing number of programming languages compile to the Java Virtual Machine (JVM), and program analysis frameworks such as WALA and SOOT support a broad range of program analysis algorithms by analyzing bytecode. While this approach works well when applied to bytecode produced from Java code, its efficacy when applied to other bytecode has not been studied until now.

We present qualitative and quantitative analysis of the soundness and precision of call graphs constructed from JVM bytecodes produced for Python, Ruby, Clojure, Groovy, Scala, and OCaml applications. We show that, for Python, Ruby, Clojure, and Groovy, the call graphs are unsound due to use of reflection, invokedynamic instructions, and run-time code generation, and imprecise due to how function calls are translated.

For Scala and OCaml, all unsoundness comes from rare, complex uses of reflection and proxies, and the translation of first-class features in Scala incurs a significant loss of precision.

Date February 12, 2015
Report Evaluating Call Graph Construction for JVM-hosted Language Implementations (PDF)
CS-2015-04
Title Giraph Unchained: Barrierless Asynchronous Parallel Execution in Pregel-like Graph Processing Systems
Authors

Minyang Han and Khuzaima Daudjee

Abstract

The bulk synchronous parallel (BSP) model used by synchronous graph processing systems allows algorithms to be easily implemented and reasoned about. However, BSP can suffer from poor performance due to stale messages and frequent global synchronization barriers. Asynchronous computation models have been proposed to alleviate these overheads but existing asynchronous systems that implement such models have limited scalability or retain frequent global barriers, and do not always support graph mutations or algorithms with multiple computation phases. We propose barrierless asynchronous parallel (BAP), a new computation model that reduces both message staleness and global synchronization. This enables BAP to overcome the limitations of existing asynchronous models while retaining support for graph mutations and algorithms with multiple computation phases. We present GiraphUC, which implements our BAP model in the open source distributed graph processing system Giraph, and evaluate our system at scale with large real-world graphs on 64 EC2 machines. We show that GiraphUC provides across-the-board performance improvements of up to $5\times$ faster over synchronous systems and up to an order of magnitude faster than asynchronous systems. Our results demonstrate that the BAP model provides efficient and transparent asynchronous execution of algorithms that are programmed synchronously.

Date April 6, 2015
Report Giraph Unchained: Barrierless Asynchronous Parallel Execution in Pregel-like Graph Processing Systems (PDF)
CS-2015-05
Title

The Development of Normative Autonomous Agents: an Approach

Authors

Marx Viana1, Paulo Alencar2, Donald Cowan2, Carlos J. P. de Lucena1

Abstract

Open multi-agent systems (MASs) act as societies in which autonomous and heterogeneous agents can work towards similar or different goals. In order to cope with the heterogeneity, autonomy and diversity of interests among the different agents in the society, open MASs establish a set of behavioural norms that is used as a mechanism to ensure a state of co-operation among agents.  Such norms regulate the behaviour of the agents by defining obligations, permissions and prohibitions. Fulfillment of a norm may be encouraged through a reward while violation of a norm may be discouraged through punishment. Although norms are promising mechanisms to regulate an agent’s behaviour, we should note that each agent is an autonomous entity that is free to fulfill or violate each associated norm. Thus, agents can use different strategies when deciding to achieve their goals including whether to comply with their associated norms. Agents might choose to achieve their goals while ignoring their norms, thus overlooking the rewards or punishments they may receive. In contrast agents may choose to comply with all the norms although some of their goals may not be achieved. In this context, this paper proposes a framework for simulation of normative agents providing a basis for understanding the impacts of norms on agents.

Date April 27, 2015
Report The Development of Normative Autonomous Agents: an Approach (PDF)
CS-2015-06
Title

Adapting to Climate Change - An Open Data Platform for Cumulative Environmental Analysis and Management

Authors

Donald Cowan, Paulo Alencar, Fred McGarry and Mark R. Palmer

Abstract

The frequency of extreme weather events has accelerated, an apparent outcome of progressive climate change. Excess water is a significant consequence of these events and is now the leading cause of insurance claims for infrastructure and property damage. Governments recognize that plans for growth must reflect communities' needs, strengths and opportunities while balancing the cumulative effects of economic growth with environmental concerns. Legislation must incorporate the cumulative effects of economic growth with adaptation to weather events to protect the environment and citizens, while ensuring that products of growth such as buildings and infrastructure are resilient. For such a process to be effective it will be necessary for the private sector to develop and operate cumulative effect decision support software (CEDSS) tools and to work closely with all levels of government including watershed management authorities (WMAs) that supply environmental data. Such co-operation and sharing will require a new Open Data information-sharing platform managed by the private sector. This paper outlines that platform, its operation and possible governance model.

Date May 21, 2015
Report Adapting to Climate Change - An Open Data Platform for Cumulative Environmental Analysis and Management (PDF)
CS-2015-07
Title

Precise Data Flow Analysis in the Presence of Correlated Method Calls

Authors

Marianna Rapoport, Ondřej Lhoták, and Frank Tip

Abstract

When two methods are invoked on the same object, the dispatch behaviours of these method calls will be correlated. If two correlated method calls are polymorphic (i.e., they dispatch to different method definitions depending on the type of the receiver object), a program’s interprocedural control flow graph will contain infeasible paths. Existing algorithms for data-flow analysis are unable to ignore such infeasible paths, giving rise to loss of precision.

We show how infeasible paths due to correlated calls can be eliminated for Interprocedural Finite Distributive Subset (IFDS) problems, a large class of data-flow analysis problems with broad applications.

Our approach is to transform an IFDS problem into an Interprocedural Distributive Environment (IDE) problem, in which edge functions filter out data flow along infeasible paths. A solution to this IDE problem can be mapped back to the solution space of the original IFDS problem. We formalize the approach, prove it correct, and report on an implementation in the WALA analysis framework.

Date June 9, 2015
Report Precise Data Flow Analysis in the Presence of Correlated Method Calls (PDF)
CS-2015-08
Title

CrossStitch: An Efficient Transaction Processing Framework for Geo-Distributed Systems

Authors

Sharon Choy, Bernard Wong, Xu Cui, and Xiaoyi Liu

Abstract

Current transaction systems for geo-distributed datastores either have high transaction processing latencies or are unable to support general transactions with dependent operations. In this paper, we introduce CrossStitch, an efficient transaction processing framework that reduces latency by restructuring each transaction into a chain of state transitions, where each state consists of a key operation and computation.

Transaction states are processed sequentially, and the transaction code and data is sent directly to the next hop in the chain. CrossStitch transactions can be organized such that all states in a location are processed before transitioning to a state in a different location. This allows CrossStitch to significantly reduce the number of inter-location crossings compared to transaction systems that retrieve remote data to a single location for processing.

To provide transactional properties while preserving the chain communication pattern, CrossStitch introduces a pipelined commit protocol that executes in parallel with the transaction and does not require any centralized co-ordination. Our evaluation results show that CrossStitch can reduce the latency of geo-distributed transactions when compared to a traditional 2PC-based distributed transaction system. We demonstrate that CrossStitch can reduce the number of round trips by more than half for TPC-C-like transactions.

Date July 28, 2015
Report CrossStitch: An Efficient Transaction Processing Framework for Geo-Distributed Systems (PDF)
CS-2015-09
Title Nessie: A Decoupled, Client-Driven, Key-Value Store using RDMA
Authors

Tyler Szepesi, Benjamin Cassell, Bernard Wong, Tim Brecht, Xiaoyi Liu

Abstract
The increasing use of key-value storage systems in
performance-critical data centre applications has moti-
vated new storage system designs that use Remote Di-
rect Memory Access (RDMA) to reduce communication
overhead. However, existing approaches that achieve
low latency and high throughput do so by dedicating en-
tire cores to RDMA message handling, including polling
local memory for incoming RDMA messages.
In this paper we describe and demonstrate why
polling-based RDMA is not suitable for many data cen-
tre applications with significant data processing and
data storage requirements. We then propose, design,
implement and evaluate an alternative communication
approach that strictly uses one-sided RDMA opera-
tions, which eliminates polling on the server-side and
is therefore a serverless (client-driven) design. This
approach is used to build Nessie, a distributed client-
driven RDMA-enabled key-value store that uses na-
tive RDMA compare-and-swap operations to co-ordinate
conflicting operations between clients. Nessie further
reduces communication requirements by decoupling the
key-value location from the key lookup mechanism,
which enables local write operations, thus avoiding com-
paratively expensive RDMA operations. Nessie’s one-
sided design ensures that only the clients in the system
consume any CPU, which in turn allows it to main-
tain high performance despite the need for computa-
tion by the application utilizing the key-value store or
contention from other applications.
Date July 22, 2015
Report Nessie: A Decoupled, Client-Driven, Key-Value Store using RDMA (PDF)
CS-2015-10
Title Mayflower: Improving Distributed Filesystem Performance
Through SDN/Filesystem Co-Design
Authors

Sajjad Rizvi, Xi Li, Bernard Wong, Xiao Cao, and Benjamin Cassell

Abstract

The increasing prevalence of oversubscribed networks and fast solid-state storage devices in the datacenter has made the network the new performance bottleneck for many distributed filesystems. As a result, distributed filesystems need to be network-aware in order to make more effective use of available network resources.

In this paper, we introduce Mayflower, a new distributed filesystem that is not only network-aware, it is co-designed from the ground up to work together with a network control plane. In addition to the standard distributed filesystem components, Mayflower has a flow monitor and manager running inside a software-defined networking controller. This tight coupling with the network controller enables Mayflower to make intelligent replica selection and flow scheduling decisions based on both filesystem and network information. It also enables Mayflower to perform global optimizations that are unavailable to conventional distributed filesystems and network control planes. Our evaluation results from both simulations and a prototype implementation show that Mayflower reduces average read completion time by more than 25% compared to current state-of-the-art distributed filesystems with an independent network flow scheduler, and more than 80% compared to HDFS with ECMP.

Date July 23, 2015
Report Mayflower: Improving Distributed Filesystem Performance Through SDN/Filesystem Co-Design (PDF)
CS-2015-11
Title Lanternfish: Better Random Networks Through Optics
Authors

Tyler Szepesi, Bernard Wond, Fiodar Kazhami, Sajjad Rizvi, and Tim Brecht

Abstract Current random datacenter network designs, such as Jellyfish, directly wire top-of-rack switches together randomly, which is difficult even when using best-practice cable management techniques. Moreover, these are static designs that cannot adapt to changing workloads. In this paper, we introduce Lanternfish, a new approach to building random datacenter networks using an optical ring that significantly reduces wiring complexity and provides the opportunity for reconfigurability. Unlike previous optical ring designs, Lanternfish does not require wavelength planning because it is specifically designed to provide random connectivity between switches. This design choice further reduces the difficulty of deploying a Lanternfish network, making random datacenter networks more practical. Our experimental results using both simulations and network emulation show that Lanternfish can effectively provide the same network properties as Jellyfish. Additionally, we demonstrate that by replacing passive optical components with active optical components at each switch, we can dynamically reconfigure the network topology to better suit the workload while remaining cost competitive. Lanternfish is able to construct workload specific topologies that provide as much as half the average pathlength than a Jellyfish deployment with twice as many switch-to-switch connections.  
Date July 23,  2015
Report Lanternfish: Better Random Networks Through Optics (PDF)
CS-2015-13
Title

A Modular Notation for Monitoring Network Systems

Authors

Prashant Raghav and Richard Tefler

Abstract

Design of next generation network systems with predictable behaviour in all situations poses a significant challenge. Monitoring of events happening at different points in a distributed environment can detect the occurrence of events that indicates significant error conditions. We use Modular Timing Diagrams (MTD) as a specification language to describe these error conditions. MTDs are a component-oriented and compositional notation. We take advantage of these features of MTDs and point out that, in many cases, a global MTD specification describing behaviours  of several system components can be efficiently decomposed into a set of sub-specifications. Each of the sub-specifications describes a local monitor that is specific to the component on which the monitor is intended to run. We illustrate the compositional nature of MTDs in describing several network monitoring conditions related to network security.

Date July 16, 2015
Report A Modular Notation for Monitoring Network Systems (PDF)
CS-2015-14
Title

Representing Behavioural Models with Rich Control Structures in SMT-LIB

Authors

Nancy A. Day and Amirhossein Vakili

Abstract

We motivate and present a proposal for how to represent extended nite state machine behavioural models with rich hierarchical states and compositional control structures (e.g., the Statecharts family) in SMT-LIB. Our goal with such a representation is to facilitate automated deductive reasoning on such models, which can exploit the structure found in the control structures. We present a novel method that combines deep and shallow encoding techniques to describe models that have both rich control structures and rich datatypes. Our representation permits varying semantics to be chosen for the control structures recognizing the rich variety of semantics that exist for the family of extended nite state machine languages. We hope that discussion of these representation issues will facilitate model sharing for investigation of analysis techniques.

Date September 1, 2015
Report

Representing Behavioural Models with Rich Control Structures in SMT-LIB (PDF)

CS-2015-15
Title A Qualitative Exploratory Study of How OSS Developers Define Code Review Quality
Authors

Oleksii Kononenko and Olga Baysal

Abstract

In a large, long-lived project,  an effective code review process is key to ensuring the long-term  quality  of the  code base.  In this  work, we study  code review practices  of a large, open source project,  and we investigate how the developers themselves  perceive code review quality.   We present a qualitative study  that summarizes  the  results  from a survey of 88 Mozilla core developers.  The results  provide developer insights into how they  define review quality,  what  factors  contribute to  how they  evaluate  submitted code,  and  what  challenges  they  face when  performing  review tasks.   We  found  that the  review quality  is primarily  associated  with  the  thoroughness of the  feedback,  the reviewer’s familiarity  with the code, and the perceived quality  of the code itself.  Also, we found that while different factors  are perceived to contribute to the review quality, reviewers often find it difficult to keep their technical skills up-to-date, manage personal priorities,  and mitigate context  switching.

Date September 11, 2015
Report A Qualitative Exploratory Study of How OSS Developers Define Code Review Quality (PDF)
CS-2015-16
Title

Feedback-Directed Instrumentation for Deployed JavaScript Applications

Authors

Magnus Madsen, Frank Tip, Esben Andreasen, Koushik Sen, Anders Moller

Abstract

Many bugs in JavaScript applications manifest themselves as objects that have incorrect property values when a failure occurs. For this type of error, stack traces and log files are often insufficient for diagnosing problems. In such cases, it is helpful for developers to know the control flow path from the creation of an object to a crashing statement. Such crash paths are useful for understanding where the object originated and whether any properties of the object were corrupted since its creation.

We present a feedback-directed instrumentation technique for computing crash paths that allows the instrumentation overhead to be distributed over a crowd of users and to re- duce it for users who do not encounter the crash. We imple- mented our technique in a tool, Crowdie, and evaluated it on 10 real-world issues for which error messages and stack traces are insufficient to isolate the problem. Our results show that feedback-directed instrumentation requires 5% to 25% of the program to be instrumented, that the same crash must be observed three to ten times to discover the crash path, and that feedback-directed instrumentation typically slows down execution by a factor 2x–9x compared to 8x–90x for an approach where applications are fully instrumented.

Date September 8, 2015
Report

Feedback-Directed Instrumentation for Deployed JavaScript Applications (PDF)

CS-2015-17
Title

NoSE:  Schema Design for NoSQL Applications

Author

Michael J. Mior, Kenneth Salem, Ashraf Aboulnaga

Abstract
Database design is critical for high performance in relational databases and many tools exist to aid application designers in selecting an appropriate schema. While the problem of schema optimization is also highly relevant for NoSQL databases, existing tools for relational databases are inadequate for this setting. Application designers wishing to use a NoSQL database instead rely on rules of thumb to select an appropriate schema. We present a system for recommending database schemas for NoSQL applications. Our cost-based approach uses a novel binary integer programming formulation to guide the mapping from the application's conceptual data model to a database schema.
We implemented a prototype of this approach for the Cassandra extensible record store. Our prototype, the NoSQL Schema Evaluator (NoSE) is able to capture rules of thumb used by expert designers without explicitly encoding the rules. Automating the design process allows NoSE to produce efficient schemas and to examine more alternatives than would be possible with a manual rule-based approach.
Date October 19, 2015
Report

NoSE:  Schema Design for NoSQL Applications (PDF)

CS-2015-18
Title

Comparison of Affect-Control Theoretic Agents and Humans in the Prisoner's Dilemma

Author

Joshua D. A. Jung, Jesse Hoey, Jonathan H. Morgan, Tobias Schröder and IngoWolf

Abstract
Symbolic interactionist principles of sociology are based on the idea that human action is guided by culturally shared symbolic representations of identities, behaviours, situations and emotions.  Shared linguistic, paralinguistic, or kinesic elements allow humans to co-ordinate action by enacting identities in social situations. Structures of identity-based interactions can lead to the enactment of social orders that solve social dilemmas (e.g., by promoting co-operation).  Our goal is to build an artificial agent that mimics the identity-based interactions of humans, and to compare networks of such agents to human networks.   In this paper, we take a first step in this direction, and describe a study in which humans played a repeated prisoner's dilemma game against other humans, or against one of three artificial agents (bots). One of the bots has an explicit representation of identity (for self and other), and attempts to optimise with respect to this representation.  We compare the human play against bots to human play against humans, and show how the identity-based bot exhibits the most human-like behaviour.
Date November 2, 2015
Report

Comparison of Affect-Control Theoretic Agents and Humans in the Prisoner's Dilemma (PDF)

CS-2015-19
Title

Big Open Data for Environmental Information
Systems

Author

Donald Cowan, Paulo Alencar, Fred McGarry, R. MarkPalmer, Trevor Boston, Rigel Rozanski

Abstract
 

The frequency of extreme weather events has accelerated, an apparent outcome of progressive climate change. Excess water is a significant consequence of these events and is now the leading cause of insurance claims for infrastructure and property damage. Governments recognize that plans for growth must reflect communities needs, strengths and opportunities while balancing the cumulative effects of economic growth with environmental concerns. For such a process to be effective it will be necessary to develop and operate cumulative effect decision support software (CEDSS) tools, and to work closely with all levels of government including watershed management authorities (WMAs) that supply environmental data. Such co-operation and sharing will require a new open and big data information-sharing platform, which is described in this paper as an open and big data platform for environmental analysis and management.

Date December 4, 2015
Report

Big Open Data for Environmental Information Systems (PDF)