Fine-grained Resource Management and Problem Detection in Dynamic Content Servers

_Principal Investigator:_
Cristiana Amza, Assistant Professor
Electrical and Computer Engineering, University of Toronto

Background: As networked computer systems grow in complexity, automatic problem detection, analysis, and correction become essential system management tools. Many commercial tools exist for coordinated monitoring and control of large-scale systems. However, for currently deployed multi-tier networked systems, the complexity of the information these tools display still exceeds the ability of humans to diagnose and respond to problems rapidly and correctly.
The traditional approach to automated problem detection is to develop a priori models of system structure and behavior, which may be represented quantitatively or as a set of event-condition-action rules. While such models provide a basis for system modeling, they have several limitations. They are either costly to build or incomplete, and hence inaccurate. Building them requires extensive knowledge of the system. Finally, they may become obsolete as systems change or encounter unprecedented situations.

Objectives: In this project, we will design and implement groundbreaking techniques for system self-optimization and self-healing at a fine granularity of resources and application contexts. The approach is based on dynamically learned statistical system models. Our self-optimization techniques will address both performance and power concerns. Our self-healing techniques will detect, diagnose, and repair system faults at the fine granularity of system components and low-level application contexts.
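
As a rough illustration of what a dynamically learned statistical model could look like, the sketch below learns the running mean and variance of a single metric online (Welford's algorithm) and flags samples that deviate strongly from what it has seen so far. This is only a minimal sketch under stated assumptions, not the project's actual technique: the class name `OnlineMetricModel`, the z-score threshold, the warm-up length, and the latency values are all hypothetical.

```python
import math


class OnlineMetricModel:
    """Running mean/variance of one metric, learned online (Welford's algorithm).

    Hypothetical illustration; names and defaults are assumptions.
    """

    def __init__(self, threshold=3.0, warmup=10):
        self.threshold = threshold  # assumed z-score cutoff for "anomalous"
        self.warmup = warmup        # samples to learn from before flagging
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0               # sum of squared deviations from the mean

    def is_anomalous(self, value):
        """Return True if value is far from the behavior learned so far."""
        if self.count < self.warmup:
            return False
        std = math.sqrt(self.m2 / (self.count - 1))
        if std == 0.0:
            return value != self.mean
        return abs(value - self.mean) / std > self.threshold

    def update(self, value):
        """Fold one sample into the running mean and variance."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def observe(self, value):
        """Check a sample against the model; learn from it only if normal."""
        anomalous = self.is_anomalous(value)
        if not anomalous:
            self.update(value)
        return anomalous


# Made-up per-request latencies (ms) for one application context.
latencies_ms = [20, 22, 19, 21, 20, 23, 18, 21, 20, 22, 21, 95, 20]
model = OnlineMetricModel()
flags = [model.observe(v) for v in latencies_ms]
print(flags)
```

In this sketch only the 95 ms sample is flagged. Keeping one such model per component or application context is one simple way to picture fine-grained detection, since no a priori model of the system is required.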

Potential benefit to Ontario: As system and workload complexity increases, manual management and performance optimization of Internet servers become increasingly costly and time-consuming. Yet suboptimal management and tuning can cause severe resource bottlenecks or failures that may cost the cluster owner millions of dollars due to system unavailability to clients. In large cluster systems with many workloads, the probability of failures, cooling problems, or bottlenecks increases. These problems are especially critical in large corporations that require continuous availability to serve an international clientele. Recent statistics show that maintenance costs exceed 75% of the budget of large companies. Thus, the proposed techniques are essential for the long-term survival of large Ontario-based companies and for supporting the growth of smaller companies, as well as for reducing the costs of operation in a range of dynamic content services such as e-commerce, on-line bidding, and massively multi-player games.

_Other Projects_

  • Automated Management of Virtual Database Appliances
  • Semantically Configurable Modelling Notations and Tools
  • Model Management for Continuously Evolving Systems
  • Modeling, Evolution, and Automated Configuration of Software Services
  • Elaborating and Evaluating UML's 3-Layer Semantics Architecture
  • Intelligent Autonomic Computing for Computational Biology
  • Performance Management of IT Infrastructure
  • Performance-Model-Assisted Creation and Management of Service Systems
Topic revision: r3 - 2007-05-28 - CristianaAmza
     