TY  - CONF
AB  - This paper describes a data structure and a heuristic to plan and map arbitrary resources in complex combinations while applying time-dependent constraints. The approach is used in the planning-based workload manager OpenCCS at the Paderborn Center for Parallel Computing (PC²) to operate heterogeneous clusters with up to 10000 cores. We also show performance results derived from four years of operation.
AU  - Keller, Axel
ED  - Klusáček, D.
ED  - Cirne, W.
ED  - Desai, N.
ID  - 22
KW  - Scheduling
KW  - Planning
KW  - Mapping
KW  - Workload management
SN  - 978-3-319-77398-8
T2  - Proc. Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP)
TI  - A Data Structure for Planning Based Workload Management of Heterogeneous HPC Systems
VL  - 10773
ER  - 
TY  - JOUR
AB  - Virtualization technology makes data centers more dynamic and easier to administer. Today, cloud providers offer customers access to complex applications running on virtualized hardware. Nevertheless, big virtualized data centers become stochastic environments, and the simplification on the user side leads to many challenges for the provider, who has to find cost-efficient configurations and deal with dynamic environments to ensure service level objectives (SLOs). We introduce a software solution that reduces the degree of human intervention needed to manage clouds. It is designed as a multi-agent system (MAS) and placed on top of the Infrastructure as a Service (IaaS) layer. Worker agents allocate resources, configure applications, check the feasibility of requests, and generate cost estimates. They are equipped with application-specific knowledge that allows them to estimate the type and number of necessary resources. During runtime, a worker agent monitors the job and adapts its resources to ensure the specified quality of service, even in noisy clouds where the job instances are influenced by other jobs. The worker agents interact with a scheduler agent, which takes care of limited resources and performs cost-aware scheduling by assigning jobs to times with low costs. The whole architecture is self-optimizing and able to use public or private clouds. Building a private cloud requires finding a mapping of virtual machines (VMs) to hosts. We present a rule-based mapping algorithm for VMs. It offers an interface where policies can be defined and combined in a generic way. The algorithm performs the initial mapping at request time as well as a remapping during runtime. It deals with policy and infrastructure changes. An energy-aware scheduler and the availability of cheap resources provided by a spot market are analyzed. We evaluated our approach by building an SaaS stack that assigns resources in consideration of an energy function and ensures the SLOs of two different applications, a brokerage system and a high-performance computing software. Experiments were done on a real cloud system and by simulations.
AU  - Niehörster, Oliver
AU  - Simon, Jens
AU  - Brinkmann, André
AU  - Keller, Axel
AU  - Krüger, Jens
ID  - 1965
IS  - 3
JF  - Journal of Grid Computing
TI  - Cost-aware and SLO Fulfilling Software as a Service
VL  - 10
ER  - 
TY  - CONF
AB  - Infrastructure as a Service providers use virtualization to abstract their hardware and to create a dynamic data center. Virtualization enables the consolidation of virtual machines as well as their migration to other hosts during runtime. Each provider has its own strategy to efficiently operate a data center. We present a rule-based mapping algorithm for VMs, which is able to automatically adapt the mapping between VMs and physical hosts. It offers an interface where policies can be defined and combined in a generic way. The algorithm performs the initial mapping at request time as well as a remapping during runtime. It deals with policy and infrastructure changes. We extended the open source IaaS solution Eucalyptus and evaluated it with typical policies: maximizing the compute performance and VM locality to achieve high performance, and minimizing energy consumption. The evaluation was done on state-of-the-art servers in our own data center and by simulations using a workload of the Parallel Workload Archive. The results show that our algorithm performs well in dynamic data center environments.
AU  - Kleineweber, Christoph
AU  - Keller, Axel
AU  - Niehörster, Oliver
AU  - Brinkmann, André
ID  - 1968
T2  - Proc. Int. Conf. on Parallel, Distributed and Network-Based Computing (PDP)
TI  - Rule Based Mapping of Virtual Machines in Clouds
ER  - 
TY  - JOUR
AB  - System virtualization has become the enabling technology to manage the increasing number of different applications inside data centers. The abstraction from the underlying hardware and the provision of multiple virtual machines (VMs) on a single physical server have led to a consolidation and more efficient usage of physical servers. The abstraction from the hardware also eases the provision of applications on different data centers, as applied in several cloud computing environments. In this case, the application need not adapt to the environment of the cloud computing provider, but can travel around with its own VM image, including its own operating system and libraries. System virtualization and cloud computing could also be very attractive in the context of high-performance computing (HPC). Today, HPC centers have to cope with the management of both the infrastructure and the applications. Virtualization technology would enable these centers to focus on the infrastructure, while the users, collaborating inside their virtual organizations (VOs), would be able to provide the software. Nevertheless, there seems to be a contradiction between HPC and cloud computing, as there are very few successful approaches to virtualize HPC centers. This work discusses the underlying reasons, including management and performance, and presents solutions to overcome the contradiction, including a set of new libraries. The viability of the presented approach is shown by evaluating a selected parallel scientific application in a virtualized HPC environment.
AU  - Birkenheuer, Georg
AU  - Brinkmann, André
AU  - Kaiser, Jürgen
AU  - Keller, Axel
AU  - Keller, Matthias
AU  - Kleineweber, Christoph
AU  - Konersmann, Christoph
AU  - Niehörster, Oliver
AU  - Schäfer, Thorsten
AU  - Simon, Jens
AU  - Wilhelm, Maximilian
ID  - 1971
JF  - Software: Practice and Experience
TI  - Virtualized HPC: a contradiction in terms?
ER  - 
TY  - CONF
AB  - We present a multi-agent system on top of the IaaS layer consisting of a scheduler agent and multiple worker agents. Each job is controlled by an autonomous worker agent, which is equipped with application-specific knowledge (e.g., performance functions) allowing it to estimate the type and number of necessary resources. During runtime, the worker agent monitors the job and adapts its resources to ensure the specified quality of service, even in noisy clouds where the job instances are influenced by other jobs. All worker agents interact with the scheduler agent, which takes care of limited resources and performs cost-aware scheduling by assigning jobs to times with low energy costs. The whole architecture is self-optimizing and able to use public or private clouds.
AU  - Niehörster, Oliver
AU  - Keller, Axel
AU  - Brinkmann, André
ID  - 1972
T2  - Proc. Int. Meeting of the IEEE Int. Symp. on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS)
TI  - An Energy-Aware SaaS Stack
ER  - 
TY  - CONF
AU  - Battré, Dominic
AU  - Hovestadt, Matthias
AU  - Kao, Odej
AU  - Keller, Axel
AU  - Voss, Kerstin
ID  - 1974
T2  - Proc. Int. Conf. on Risks and Security of Internet and Systems
TI  - Quality Assurance of Grid Service Provisioning by Risk Aware Managing of Resource Failures
ER  - 
TY  - CONF
AB  - Service Level Agreements (SLAs) are of focal importance if commercial customers are to be attracted to the Grid. An SLA-aware resource management system has already been realized, which is able to fulfill the SLAs of jobs even in the case of resource failures. To this end, it is able to migrate checkpointed jobs over the Grid. Here, virtual execution environments significantly increase the number of potential migration targets. In this paper we outline the concept of such virtual execution environments and focus on the SLA negotiation aspects.
AU  - Battré, Dominic
AU  - Hovestadt, Matthias
AU  - Kao, Odej
AU  - Keller, Axel
AU  - Voss, Kerstin
ID  - 1975
T2  - Proc. Int. DMTF Academic Alliance Workshop on Systems and Virtualization Management: Standards and New Technologies
TI  - Virtual Execution Environments and the Negotiation of Service Level Agreements in Grid Systems
ER  - 
TY  - CONF
AB  - Commercial Grid users demand contractually fixed QoS levels. Service Level Agreements (SLAs) are powerful instruments for describing such contracts. SLA-aware resource management is the foundation for realizing SLA contracts within the Grid. OpenCCS is such an SLA-aware RMS, using transparent checkpointing to cope with resource outages. It generates a compatibility profile for each checkpoint dataset, so that the job can be resumed even on resources within the Grid. However, only a small number of Grid resources comply with such a profile. This paper describes the concept of virtual execution environments and how they increase the number of potential migration targets. The paper also describes how these virtual execution environments have been implemented within the OpenCCS resource management system.
AU  - Battré, Dominic
AU  - Hovestadt, Matthias
AU  - Kao, Odej
AU  - Keller, Axel
AU  - Voss, Kerstin
ID  - 1976
T2  - Proc. Int. Workshop on Scheduling and Resource Management for Parallel and Distributed Systems
TI  - Implementation of Virtual Execution Environments for improving SLA-compliant Job Migration in Grids
ER  - 
TY  - CONF
AU  - Battré, Dominic
AU  - Hovestadt, Matthias
AU  - Kao, Odej
AU  - Keller, Axel
AU  - Voss, Kerstin
ID  - 1978
T2  - Proc. Int. Conf. on Grid Computing and Applications (GCA)
TI  - Germany, Belgium, France, and Back Again: Job Migration using Globus
ER  - 
TY  - CONF
AB  - OpenCCS is an SLA-aware resource management system which uses transparent checkpointing of applications and migration of checkpoint datasets for ensuring SLA-compliance also in case of resource outages. Migration of checkpoints presumes a high grade of compatibility between source and target resource. Hence, even in large Grid systems only a small number of resources are eligible migration targets. This short paper describes the concept of virtual execution environments and how they increase the number of potential migration targets. It will also outline an implementation within OpenCCS.
AU  - Battré, Dominic
AU  - Hovestadt, Matthias
AU  - Kao, Odej
AU  - Keller, Axel
AU  - Voss, Kerstin
ID  - 1980
T2  - Proc. Int. Conf. on Services Computing (SCC)
TI  - Virtual Execution Environments for ensuring SLA-compliant Job Migration in Grids
ER  - 
TY  - CONF
AB  - Contractually fixed service quality levels are mandatory prerequisites for attracting the commercial user to Grid environments. Service Level Agreements (SLAs) are powerful instruments for describing obligations and expectations in such a business relationship. At the level of local resource management systems, checkpointing and restart is an important instrument for realizing fault tolerance and SLA awareness. This paper highlights the concepts of migrating such checkpoint datasets to achieve the goal of SLA compliant job execution.
AU  - Battré, Dominic
AU  - Hovestadt, Matthias
AU  - Kao, Odej
AU  - Keller, Axel
AU  - Voss, Kerstin
ID  - 1981
SN  - 978-0-7695-3177-9
T2  - Proc. Int. Conf. on Grid and Pervasive Computing (GPC)
TI  - Job Migration and Fault Tolerance in SLA-aware Resource Management Systems
ER  - 
TY  - CONF
AU  - Battré, Dominic
AU  - Hovestadt, Matthias
AU  - Kao, Odej
AU  - Keller, Axel
AU  - Voss, Kerstin
ED  - Gonzalez, T. F.
ID  - 1983
SN  - 978-0-88986-773-4
T2  - Proc. Int. Conf. on Parallel and Distributed Computing and Systems (PDCS)
TI  - Enhancing SLA Provisioning by Utilizing Profit-Oriented Fault Tolerance
ER  - 
TY  - GEN
AU  - Battré, Dominic
AU  - Hovestadt, Matthias
AU  - Kao, Odej
AU  - Keller, Axel
AU  - Voss, Kerstin
ID  - 1984
TI  - Increasing Fault-tolerance by Introducing Virtual Execution Environments
ER  - 
TY  - GEN
AU  - Hovestadt, Matthias
AU  - Keller, Axel
AU  - Voss, Kerstin
ID  - 1985
T2  - Paderborner Universitätszeitschrift (puz)
TI  - Paderborn, Belgien, Frankreich und zurück
VL  - SS 2
ER  - 
TY  - CONF
AB  - Service level agreements (SLAs) are powerful instruments for describing all obligations and expectations in a business relationship. They are of focal importance for deploying Grid technology in commercial applications. The EC-funded project HPC4U (Highly Predictable Clusters for Internet Grids) aimed at introducing SLA-awareness in local resource management systems, while the EC-funded project AssessGrid introduced the notion of risk, which is associated with every business contract. This paper highlights the concept of planning-based resource management and describes the SLA-aware scheduler developed and used in these projects.
AU  - Battré, Dominic
AU  - Hovestadt, Matthias
AU  - Kao, Odej
AU  - Keller, Axel
AU  - Voss, Kerstin
ID  - 1986
T2  - Proc. Workshop of the UK Planning and Scheduling Special Interest Group (PlanSIG)
TI  - Planning-based Scheduling for SLA-awareness and Grid Integration
ER  - 
TY  - CONF
AU  - Battré, Dominic
AU  - Hovestadt, Matthias
AU  - Kao, Odej
AU  - Keller, Axel
AU  - Voss, Kerstin
ID  - 1988
T2  - Proc. Cracow Grid Workshop, Academic Computer Center CYFRONET
TI  - Transparent Cross Border Migration of Parallel Multi Node Applications
ER  - 
TY  - CHAP
AU  - Heine, Felix
AU  - Hovestadt, Matthias
AU  - Kao, Odej
AU  - Keller, Axel
ED  - Joubert, Gerhard R.
ED  - Nagel, Wolfgang E.
ED  - Peters, Frans J.
ED  - Plata, Oscar
ED  - Tirado, Francisco
ED  - Zapata, Emilio L.
ID  - 1989
T2  - Parallel Computing: Current and Future Issues of High End Computing
TI  - Provision of Fault Tolerance with Grid-enabled and SLA-aware Resource Management Systems
ER  - 
TY  - CHAP
AB  - In this paper, we describe the architecture of the virtual resource manager VRM, a management system designed to reside on top of local resource management systems for cluster computers and other kinds of resources. The most important feature of the VRM is its capability to handle quality-of-service (QoS) guarantees and service-level agreements (SLAs). The particular emphasis of the paper is on the various opportunities to deal with local autonomy for resource management systems not supporting SLAs. As local administrators may not want to hand over complete control to the Grid management, it is necessary to define strategies that deal with this issue. Local autonomy should be retained as much as possible while providing reliability and QoS guarantees for Grid applications, e.g., specified as SLAs.
AU  - Burchard, Lars-Olof
AU  - Heine, Felix
AU  - Heiss, Hans-Ulrich
AU  - Hovestadt, Matthias
AU  - Kao, Odej
AU  - Keller, Axel
AU  - Linnert, Barry
AU  - Schneider, Jörg
ED  - Getov, Vladimir
ED  - Laforenza, Domenico
ED  - Reinefeld, Alexander
ID  - 1991
T2  - Future Generation Grids
TI  - The Virtual Resource Manager: Local Autonomy versus QoS Guarantees for Grid Applications
ER  - 
TY  - CHAP
AB  - Grid Computing promises an efficient sharing of world-wide distributed resources, ranging from hardware, software, and expert knowledge to special I/O devices. However, although the main Grid mechanisms are already developed or are currently addressed by tremendous research effort, the Grid environment still suffers from low acceptance in different user communities. Besides difficulties regarding intuitive and comfortable resource access, various problems exist related to reliability and Quality-of-Service while using the Grid. Users should be able to rely on their jobs having a certain priority at the remote Grid site and being finished by the agreed time regardless of any provider problems. Therefore, QoS issues have to be considered in the Grid middleware but also in the local resource management systems at the Grid sites. However, most of the currently used resource management systems are not suitable for SLAs, as they neither support resource reservation nor offer mechanisms for job checkpointing and migration. The latter are mandatory for Grid providers as a rescue anchor in case of system failures or system overload. This paper focuses on SLA-aware job migration and presents work being performed in the EU-supported project HPC4U.
AU  - Heine, Felix
AU  - Hovestadt, Matthias
AU  - Kao, Odej
AU  - Keller, Axel
ED  - Grandinetti, Lucio
ID  - 1990
T2  - Grid Computing: New Frontiers of High Performance Computing
TI  - SLA-aware Job Migration in Grid Environments
VL  - 14
ER  - 
TY  - CONF
AB  - Next generation grid applications demand from grid middleware a flexible negotiation mechanism supporting various kinds of quality-of-service (QoS) guarantees. In this context, a QoS guarantee covers simultaneous allocations of various kinds of resources, such as processor runtime, storage capacity, or network bandwidth, which are specified in the form of service level agreements (SLAs). Currently, a gap exists between the capabilities of grid middleware and the underlying resource management systems concerning their support for QoS and SLA negotiation. In this paper we present an approach which closes this gap. Introducing the architecture of the virtual resource manager, we highlight its main QoS management features like run-time responsibility, co-allocation, and fault tolerance.
AU  - Burchard, Lars-Olof
AU  - Heine, Felix
AU  - Hovestadt, Matthias
AU  - Kao, Odej
AU  - Keller, Axel
AU  - Linnert, Barry
ID  - 1992
T2  - Proc. IEEE Int. Parallel & Distributed Processing Symposium (IPDPS)
TI  - A Quality-of-Service Architecture for Future Grid Computing Applications
ER  -