Fine-grained Resource Management and Problem Detection in Dynamic Content Servers
_Principal Investigator:_
Cristiana Amza, Assistant Professor
Electrical and Computer Engineering, University of Toronto
Background: As networked computer systems grow in complexity, automatic problem detection, analysis, and correction become essential system management tools. Many commercial tools exist for coordinated monitoring and control of large-scale systems. However, for currently deployed multi-tier networked systems, the complexity of the information these tools display still exceeds the ability of humans to diagnose and respond to problems rapidly and correctly.
The traditional approach to automated problem detection is to develop a priori models of system structure and behavior, which may be represented quantitatively or as a set of event-condition-action rules. While these approaches provide a basis for system modeling, the resulting models have several limitations. They are either costly to build or incomplete, and hence inaccurate. Building them requires extensive knowledge about the system. Finally, they may become obsolete as systems change or encounter unprecedented situations.
Objectives: In this project, we will design and implement techniques for system self-optimization and self-healing at a fine granularity of resources and application contexts. The approach is based on dynamically learned statistical system models. Our self-optimization techniques will address both performance and power concerns. Our self-healing techniques will address detecting, diagnosing, and repairing system faults at the fine granularity of system components and low-level application contexts.
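As a rough illustration of what a dynamically learned statistical model can look like, the sketch below keeps a running mean and variance for a single metric (Welford's online algorithm) and flags samples that deviate strongly from what has been observed so far. It is a minimal example under stated assumptions, not the project's actual method: the class name `OnlineMetricModel`, the helper `detect`, the z-score threshold of 3, and the warm-up length are all hypothetical choices made for illustration.

```python
import math


class OnlineMetricModel:
    """Running mean/variance of one metric, learned online (Welford's algorithm).

    Hypothetical illustration only; threshold and warmup are assumed values.
    """

    def __init__(self, threshold=3.0, warmup=10):
        self.n = 0            # number of samples absorbed into the model
        self.mean = 0.0       # running mean
        self.m2 = 0.0         # running sum of squared deviations from the mean
        self.threshold = threshold
        self.warmup = warmup

    def is_anomalous(self, x):
        """Return True if x lies more than `threshold` std devs from the mean."""
        if self.n < self.warmup:
            return False      # not enough data to judge yet
        std = math.sqrt(self.m2 / (self.n - 1))
        if std == 0.0:
            return x != self.mean
        return abs(x - self.mean) / std > self.threshold

    def update(self, x):
        """Fold one new sample into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)


def detect(samples, **kwargs):
    """Return indices of flagged samples; flagged samples are not learned."""
    model = OnlineMetricModel(**kwargs)
    flagged = []
    for i, x in enumerate(samples):
        if model.is_anomalous(x):
            flagged.append(i)
        else:
            model.update(x)
    return flagged
```

For example, fed a stream of response times that hover around 100 ms with a single 500 ms spike, `detect` would flag only the spike. Excluding flagged samples from the update keeps an outlier from distorting the learned baseline. That is one possible design choice, and it trades robustness against slower adaptation to genuine workload shifts.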
Potential benefit to Ontario: As system and workload complexity increases, manual management and performance optimization of Internet servers is increasingly costly and time-consuming. Yet suboptimal management and tuning can cause severe resource bottlenecks or failures that may cost the cluster owner millions of dollars due to system unavailability to clients. In large cluster systems with many workloads, the probability of failures, cooling problems, or bottlenecks increases. These problems are especially critical in large corporations that require continuous availability to serve an international clientele. Recent statistics show that maintenance costs exceed 75% of the budget of large companies. Thus, the proposed techniques are essential for the long-term survival of large Ontario-based companies and for supporting the growth of smaller companies, as well as for reducing the costs of operation in a range of dynamic content services such as e-commerce, on-line bidding, and massively multi-player games.
_Other Projects_
Automated Management of Virtual Database Appliances
Semantically Configurable Modelling Notations and Tools
Model Management for Continuously Evolving Systems
Modeling, Evolution, and Automated Configuration of Software Services
Elaborating and Evaluating UML’s 3-Layer Semantics Architecture
Intelligent Autonomic Computing for Computational Biology
Performance Management of IT Infrastructure
Performance-Model-Assisted Creation and Management of Service Systems