Libra: An Economy-Driven Cluster Scheduling and Service Level Agreement (SLA)-based Resource Allocation System

Problem Statement

Clustering involves connecting two or more computers so that they can take advantage of their combined computational power and resources. A cluster thus works as an integrated collection of resources that can present a single system image spanning all its nodes. Clustering is a popular strategy for processing applications because it transparently spreads the processing of different jobs throughout the cluster. Clusters are used for high-performance applications such as AI expert systems, flight simulations, and scientific calculations.

Computational economy refers to the inclusion of user-specified quality of service (QoS) parameters with jobs, so that resource management follows a user-centric rather than a system-centric approach. In essence, user constraints such as deadline and budget matter more to the scheduler in determining a job's priority than system policies such as ordering jobs by submission time. Currently, there is no holistic scheduling mechanism in cluster computing that enables differing QoS levels for different clients.

Objectives

The main purpose of our project is to:

Develop a QoS-based scheduler for resource management on a homogeneous cluster
Optimize the scheduler according to time or cost considerations of the user, for sequential and embarrassingly parallel jobs
Test the scheduler through simulations of various types of job queues and user criteria

Scope

The Libra Scheduler will only manage sequential and embarrassingly parallel jobs to be run on a homogeneous Linux cluster. Linux is an open source operating system with extensive documentation and user support; moreover, many open source CMS are well suited to a Linux-based cluster.

Although the scheduler provides QoS to users, there will be no mechanism for users to interact with each other and bargain over the use of resources according to their own considerations, as is provided in a grid-computing environment by projects like Nimrod/G. Once a job is submitted, the user may not modify the job details. However, if possible, we may allow interactive jobs that accept user commands during execution.

Project Description

The focus of our project is to implement a scheduler that aims to maximize user satisfaction. The job details submitted by the user will therefore include job prioritization criteria, namely the allocated budget and the required deadline. These let the scheduler maximize CPU utilization while remaining within the constraints imposed by the need to optimize user Quality of Service (QoS).
The scheduler will allocate jobs based on the job parameters, which are job specifications submitted by the user with the job, including:

Location of the executable and input data sets
Where standard output is to be placed
System type
Maximum length of run
Whether the job needs sequential or parallel resources

However, our scheduler will be QoS driven: it will aim to optimize resource utilization within user-imposed constraints, so user satisfaction is the primary concern, as opposed to maximizing CPU utilization. The two job parameters most relevant to scheduling decisions will therefore be:

Budget allocated by the user to the process
Deadline
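
The job parameters listed above can be pictured as a single record submitted with each job. The following is a minimal sketch, not the Libra implementation: the class name JobSpec, the field names, the units, the example paths, and the helper required_cpu_share are all hypothetical and chosen only for illustration. The record is frozen to reflect that job details cannot be modified after submission.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobSpec:
    """Hypothetical record of the details a user submits with a job."""
    executable: str        # location of the executable
    input_data: str        # location of the input data sets
    stdout_path: str       # where standard output is to be placed
    system_type: str       # system type, e.g. "linux"
    max_runtime_s: float   # maximum length of run, in seconds (assumed unit)
    parallel: bool         # True: embarrassingly parallel, False: sequential
    budget: float          # budget allocated by the user (assumed currency unit)
    deadline_s: float      # deadline, in seconds after submission (assumed unit)

    def __post_init__(self) -> None:
        # Reject specifications that no scheduler could act on.
        if self.max_runtime_s <= 0 or self.deadline_s <= 0:
            raise ValueError("runtime and deadline must be positive")
        if self.budget < 0:
            raise ValueError("budget must be non-negative")

    def required_cpu_share(self) -> float:
        """Fraction of one CPU needed to finish by the deadline.

        Assumes max_runtime_s is measured on a dedicated CPU; a value
        above 1.0 means a sequential job cannot meet its deadline.
        """
        return self.max_runtime_s / self.deadline_s


# Example submission with made-up paths and values.
job = JobSpec(
    executable="/home/user/bin/render",
    input_data="/home/user/data/scene01",
    stdout_path="/home/user/out/scene01.log",
    system_type="linux",
    max_runtime_s=600.0,
    parallel=False,
    budget=25.0,
    deadline_s=1200.0,
)
share = job.required_cpu_share()  # 600 / 1200
```

Keeping the budget and deadline in the same record as the other parameters lets a scheduler judge feasibility and priority from one object; how those two values are actually weighed is a policy decision that this sketch leaves open.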

As mentioned earlier, the types of jobs that will be supported are sequential and embarrassingly parallel jobs.

With support from the CMS, the Libra Scheduler should embody the following features:

Should be able to enforce resource allocations according to user-centric priorities
Should be dynamic rather than static, as the user-centric approach requires: users who need their jobs completed urgently and are willing to pay a high price for it should get them done through dynamic reallocation of resources, even if the job is submitted later than other jobs or the system is heavily loaded. Hence, the scheduler should be able to change the resource limits, priorities, privileges and execution order of submitted jobs.
Should be scalable, which means that its performance should not degrade with the addition of nodes and jobs to our cluster
Should be configurable, and allow for various scheduling policies that can be modified to incorporate QoS parameters
Should be separable from the CMS
Should provide administrative security
Should provide job accounting, to aid in scheduling policies
Should ideally provide a GUI for all components, such as for users to submit jobs and for administrators to oversee scheduling
Should ideally provide for checkpointing, load balancing, process migration and job runtime limits, which provide for better resource management, fault tolerance and reliability
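
Two of the features above, configurable scheduling policies and dynamic reordering of submitted jobs, can be illustrated with a small sketch. It assumes a policy is simply a function that maps a job to a sort key; the names Job, PolicyScheduler, first_come_first_served and highest_budget_first are hypothetical and are not taken from Libra or from any CMS.

```python
from typing import Callable, List, NamedTuple


class Job(NamedTuple):
    name: str
    submit_time: float  # seconds since the scheduler started (assumed unit)
    deadline: float     # seconds allowed after submission (assumed unit)
    budget: float       # amount the user is willing to pay (assumed unit)


# A policy maps a job to a sort key; jobs with smaller keys run first.
Policy = Callable[[Job], float]


def first_come_first_served(job: Job) -> float:
    """System-centric policy: order purely by submission time."""
    return job.submit_time


def highest_budget_first(job: Job) -> float:
    """User-centric policy: jobs whose users pay more run earlier."""
    return -job.budget


class PolicyScheduler:
    """Keeps pending jobs and orders them with a replaceable policy."""

    def __init__(self, policy: Policy) -> None:
        self.policy = policy
        self.pending: List[Job] = []

    def submit(self, job: Job) -> None:
        self.pending.append(job)

    def set_policy(self, policy: Policy) -> None:
        # Changing the policy changes the order of already-submitted jobs.
        self.policy = policy

    def execution_order(self) -> List[str]:
        return [job.name for job in sorted(self.pending, key=self.policy)]


scheduler = PolicyScheduler(first_come_first_served)
scheduler.submit(Job("early", submit_time=0, deadline=3600, budget=10))
scheduler.submit(Job("urgent", submit_time=60, deadline=300, budget=80))
order_fcfs = scheduler.execution_order()   # "early" runs before "urgent"
scheduler.set_policy(highest_budget_first)
order_qos = scheduler.execution_order()    # "urgent" now runs first
```

Because the policy is passed in rather than built into the scheduler, an administrator could swap ordering rules without touching the queueing code, which is one way to read the requirement that the scheduler be configurable and separable from the CMS.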

A market-based economic model for computational economy needs to be developed for our cluster, which would be responsible for the pricing and allocation of resources according to user constraints. The model that we are going to implement is the bid-based proportional resource-sharing model, possibly incorporating features of other models such as the commodity market model.
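
In a bid-based proportional resource-sharing model, each job receives a fraction of a resource equal to its bid divided by the sum of all bids competing for that resource. The sketch below shows only that arithmetic; the function name proportional_shares and the example bids are hypothetical, and how Libra derives bids from budgets and deadlines is not specified here.

```python
from typing import Dict


def proportional_shares(bids: Dict[str, float]) -> Dict[str, float]:
    """Split one resource among jobs in proportion to their bids.

    Each share is bid / (sum of all bids), so the shares sum to 1.
    """
    if any(bid < 0 for bid in bids.values()):
        raise ValueError("bids must be non-negative")
    total = sum(bids.values())
    if total == 0:
        raise ValueError("at least one bid must be positive")
    return {job: bid / total for job, bid in bids.items()}


# Three jobs competing for the CPU of a single node (made-up bids).
shares = proportional_shares({"job_a": 30.0, "job_b": 10.0, "job_c": 10.0})
# job_a bids 30 out of a total of 50, so it receives three fifths of the CPU.
```

A consequence of this rule is that a newly arriving bid dilutes every existing share, so shares must be recomputed whenever a job arrives or completes.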

The Team Members

Active Members

Rajkumar Buyya - Project owner & manager
Chee Shin Yeo (csyeo [AT] cs.mu.OZ.AU) - from 2002 onwards.

Alumni

Jahanzeb Sherwani (jahanzeb@lums.edu.pk)
Nosheen Ali (nosheen@lums.edu.pk)
Nausheen Lotia (02020111@lums.edu.pk)
Zahra Hayat (02020189@lums.edu.pk)

Publications

Jahanzeb Sherwani, Nosheen Ali, Nausheen Lotia, Zahra Hayat, and Rajkumar Buyya, Libra: An Economy driven Job Scheduling System for Clusters, Proceedings of the 6th International Conference on High Performance Computing in Asia-Pacific Region (HPC Asia 2002), December 16-19, 2002, Bangalore, India.
(Talk - PPT/PDF)
Jahanzeb Sherwani, Nosheen Ali, Nausheen Lotia, Zahra Hayat, and Rajkumar Buyya, Libra: A Computational Economy-based Job Scheduling System for Clusters, Software: Practice and Experience, Volume 34, Issue 6, Pages 573-590, May 2004.
Chee Shin Yeo and Rajkumar Buyya, Pricing for Utility-driven Resource Management and Allocation in Clusters, Proceedings of the 12th International Conference on Advanced Computing and Communication (ADCOM 2004), December 2004, Ahmedabad, India.
(Talk - PPT/PDF)
Chee Shin Yeo and Rajkumar Buyya, Pricing for Utility-driven Resource Management and Allocation in Clusters, International Journal of High Performance Computing Applications, Volume 21, Issue 4, Pages 405-418, November 2007. (extended version of ADCOM 2004 paper)
Chee Shin Yeo and Rajkumar Buyya, Service Level Agreement based Allocation of Cluster Resources: Handling Penalty to Enhance Utility, Proceedings of the 7th IEEE International Conference on Cluster Computing (Cluster 2005), September 2005, Boston, MA.
(Talk - PPT/PDF)
Chee Shin Yeo and Rajkumar Buyya, Managing Risk of Inaccurate Runtime Estimates for Deadline Constrained Job Admission Control in Clusters, Proceedings of the 35th International Conference on Parallel Processing (ICPP 2006), August 2006, Columbus, OH.
(Talk - PPT/PDF)
Chee Shin Yeo and Rajkumar Buyya, A taxonomy of market-based resource management systems for utility-driven cluster computing, Software: Practice and Experience, Volume 36, Issue 13, Pages 1381-1419, 10 November 2006.
Chee Shin Yeo and Rajkumar Buyya, Integrated Risk Analysis for a Commercial Computing Service, Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007), March 2007, Long Beach, CA.
(Talk - PPT/PDF)
Chee Shin Yeo and Rajkumar Buyya, Integrated Risk Analysis for a Commercial Computing Service in Utility Computing, Journal of Grid Computing, Volume 7, Issue 1, Pages 1-24, March 2009. (extended version of IPDPS 2007 paper)