TY - JOUR AU - Riebler, Heinrich AU - Vaz, Gavin Francis AU - Kenter, Tobias AU - Plessl, Christian ID - 7689 IS - 2 JF - ACM Trans. Archit. Code Optim. (TACO) KW - htrop TI - Transparent Acceleration for Heterogeneous Platforms with Compilation to OpenCL VL - 16 ER - TY - THES AU - Vaz, Gavin Francis ID - 14849 TI - Using Just-in-Time Code Generation to Transparently Accelerate Applications in Heterogeneous Systems ER - TY - CONF AU - Riebler, Heinrich AU - Vaz, Gavin Francis AU - Kenter, Tobias AU - Plessl, Christian ID - 1204 KW - htrop SN - 9781450349826 T2 - Proc. ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP) TI - Automated Code Acceleration Targeting Heterogeneous OpenCL Devices ER - TY - CONF AU - Riebler, Heinrich AU - Vaz, Gavin Francis AU - Plessl, Christian AU - Trainiti, Ettore M. G. AU - Durelli, Gianluca C. AU - Bolchini, Cristiana ID - 31 T2 - Proc. HiPEAC Workshop on Reonfigurable Computing (WRC) TI - Using Just-in-Time Code Generation for Transparent Resource Management in Heterogeneous Systems ER - TY - CONF AB - Hardware accelerators are becoming popular in academia and industry. To move one step further from the state-of-the-art multicore plus accelerator approaches, we present in this paper our innovative SAVEHSA architecture. It comprises of a heterogeneous hardware platform with three different high-end accelerators attached over PCIe (GPGPU, FPGA and Intel MIC). Such systems can process parallel workloads very efficiently whilst being more energy efficient than regular CPU systems. To leverage the heterogeneity, the workload has to be distributed among the computing units in a way that each unit is well-suited for the assigned task and executable code must be available. To tackle this problem we present two software components; the first can perform resource allocation at runtime while respecting system and application goals (in terms of throughput, energy, latency, etc.) and the second is able to analyze an application and generate executable code for an accelerator at runtime. We demonstrate the first proof-of-concept implementation of our framework on the heterogeneous platform, discuss different runtime policies and measure the introduced overheads. AU - Riebler, Heinrich AU - Vaz, Gavin Francis AU - Plessl, Christian AU - Trainiti, Ettore M. G. AU - Durelli, Gianluca C. AU - Del Sozzo, Emanuele AU - Santambrogio, Marco D. AU - Bolchini, Christina ID - 138 T2 - Proceedings of International Forum on Research and Technologies for Society and Industry (RTSI) TI - Using Just-in-Time Code Generation for Transparent Resource Management in Heterogeneous Systems ER - TY - JOUR AB - A broad spectrum of applications can be accelerated by offloading computation intensive parts to reconfigurable hardware. However, to achieve speedups, the number of loop it- erations (trip count) needs to be sufficiently large to amortize offloading overheads. Trip counts are frequently not known at compile time, but only at runtime just before entering a loop. Therefore, we propose to generate code for both the CPU and the coprocessor, and defer the offloading decision to the application runtime. We demonstrate how a toolflow, based on the LLVM compiler framework, can automatically embed dynamic offloading de- cisions into the application code. We perform in-depth static and dynamic analysis of pop- ular benchmarks, which confirm the general potential of such an approach. We also pro- pose to optimize the offloading process by decoupling the runtime decision from the loop execution (decision slack). The feasibility of our approach is demonstrated by a toolflow that automatically identifies suitable data-parallel loops and generates code for the FPGA coprocessor of a Convey HC-1. We evaluate the integrated toolflow with representative loops executed for different input data sizes. AU - Vaz, Gavin Francis AU - Riebler, Heinrich AU - Kenter, Tobias AU - Plessl, Christian ID - 165 JF - Computers and Electrical Engineering SN - 0045-7906 TI - Potential and Methods for Embedding Dynamic Offloading Decisions into Application Code VL - 55 ER - TY - CONF AU - Kenter, Tobias AU - Vaz, Gavin Francis AU - Riebler, Heinrich AU - Plessl, Christian ID - 171 T2 - Workshop on Reconfigurable Computing (WRC) TI - Opportunities for deferring application partitioning and accelerator synthesis to runtime (extended abstract) ER - TY - CONF AB - In this paper, we study how binary applications can be transparently accelerated with novel heterogeneous computing resources without requiring any manual porting or developer-provided hints. Our work is based on Binary Acceleration At Runtime (BAAR), our previously introduced binary acceleration mechanism that uses the LLVM Compiler Infrastructure. BAAR is designed as a client-server architecture. The client runs the program to be accelerated in an environment, which allows program analysis and profiling and identifies and extracts suitable program parts to be offloaded. The server compiles and optimizes these offloaded program parts for the accelerator and offers access to these functions to the client with a remote procedure call (RPC) interface. Our previous work proved the feasibility of our approach, but also showed that communication time and overheads limit the granularity of functions that can be meaningfully offloaded. In this work, we motivate the importance of a lightweight, high-performance communication between server and client and present a communication mechanism based on the Message Passing Interface (MPI). We evaluate our approach by using an Intel Xeon Phi 5110P as the acceleration target and show that the communication overhead can be reduced from 40% to 10%, thus enabling even small hotspots to benefit from offloading to an accelerator. AU - Damschen, Marvin AU - Riebler, Heinrich AU - Vaz, Gavin Francis AU - Plessl, Christian ID - 238 T2 - Proceedings of the 2015 Conference on Design, Automation and Test in Europe (DATE) TI - Transparent offloading of computational hotspots from binary code to Xeon Phi ER - TY - CONF AB - In order to leverage the use of reconfigurable architectures in general-purpose computing, quick and automated methods to find suitable accelerator designs are required. We tackle this challenge in both regards. In order to avoid long synthesis times, we target a vector copro- cessor, implemented on the FPGAs of a Convey HC-1. Previous studies showed that existing tools were not able to accelerate a real-world application with low effort. We present a toolflow to automatically identify suitable loops for vectorization, generate a corresponding hardware/software bipartition, and generate coprocessor code. Where applicable, we leverage outer-loop vectorization. We evaluate our tools with a set of characteristic loops, systematically analyzing different dependency and data layout properties. AU - Kenter, Tobias AU - Vaz, Gavin Francis AU - Plessl, Christian ID - 388 T2 - Proceedings of the International Symposium on Reconfigurable Computing: Architectures, Tools, and Applications (ARC) TI - Partitioning and Vectorizing Binary Applications for a Reconfigurable Vector Computer VL - 8405 ER - TY - CONF AU - C. Durelli, Gianluca AU - Pogliani, Marcello AU - Miele, Antonio AU - Plessl, Christian AU - Riebler, Heinrich AU - Vaz, Gavin Francis AU - D. Santambrogio, Marco AU - Bolchini, Cristiana ID - 1778 T2 - Proc. Int. Symp. on Parallel and Distributed Processing with Applications (ISPA) TI - Runtime Resource Management in Heterogeneous System Architectures: The SAVE Approach ER - TY - CONF AB - Reconfigurable architectures provide an opportunityto accelerate a wide range of applications, frequentlyby exploiting data-parallelism, where the same operations arehomogeneously executed on a (large) set of data. However, whenthe sequential code is executed on a host CPU and only dataparallelloops are executed on an FPGA coprocessor, a sufficientlylarge number of loop iterations (trip counts) is required, such thatthe control- and data-transfer overheads to the coprocessor canbe amortized. However, the trip count of large data-parallel loopsis frequently not known at compile time, but only at runtime justbefore entering a loop. Therefore, we propose to generate codeboth for the CPU and the coprocessor, and to defer the decisionwhere to execute the appropriate code to the runtime of theapplication when the trip count of the loop can be determinedjust at runtime. We demonstrate how an LLVM compiler basedtoolflow can automatically insert appropriate decision blocks intothe application code. Analyzing popular benchmark suites, weshow that this kind of runtime decisions is often applicable. Thepractical feasibility of our approach is demonstrated by a toolflowthat automatically identifies loops suitable for vectorization andgenerates code for the FPGA coprocessor of a Convey HC-1. Thetoolflow adds decisions based on a comparison of the runtimecomputedtrip counts to thresholds for specific loops and alsoincludes support to move just the required data to the coprocessor.We evaluate the integrated toolflow with characteristic loopsexecuted on different input data sizes. AU - Vaz, Gavin Francis AU - Riebler, Heinrich AU - Kenter, Tobias AU - Plessl, Christian ID - 439 T2 - Proceedings of the International Conference on ReConFigurable Computing and FPGAs (ReConFig) TI - Deferring Accelerator Offloading Decisions to Application Runtime ER - TY - CONF AU - Rammig, Franz-Josef AU - Stahl, Katharina AU - Vaz, Gavin Francis ID - 25292 T2 - Proc. 4th IEEE Workshop on Self-Organizing Real-Time Systems (SORT) 2013 TI - A Framework for Enhancing Dependability in Self-x Systems by Artificial Immune Systems ER - TY - CONF AU - Rammig, Franz AU - Stahl, Katharina AU - Vaz, Gavin Francis ID - 1785 T2 - IEEE Int. Symp. on Object/component/service-oriented Real-time distributed Computing (ISORC) TI - A framework for enhancing dependability in self-x systems by Artificial Immune Systems ER -