TY - JOUR AB - The rise of exascale supercomputers has fueled competition among GPU vendors, driving lattice QCD developers to write code that supports multiple APIs. Moreover, new developments in algorithms and physics research require frequent updates to existing software. These challenges have to be balanced against constantly changing personnel. At the same time, there is a wide range of applications for HISQ fermions in QCD studies. This situation encourages the development of software featuring a HISQ action that is flexible, high-performing, open source, easy to use, and easy to adapt. In this technical paper, we explain the design strategy, provide implementation details, list available algorithms and modules, and show key performance indicators for SIMULATeQCD, a simple multi-GPU lattice code for large-scale QCD calculations, mainly developed and used by the HotQCD collaboration. The code is publicly available on GitHub. AU - Mazur, Lukas AU - Bollweg, Dennis AU - Clarke, David A. AU - Altenkort, Luis AU - Kaczmarek, Olaf AU - Larsen, Rasmus AU - Shu, Hai-Tao AU - Goswami, Jishnu AU - Scior, Philipp AU - Sandmeyer, Hauke AU - Neumann, Marius AU - Dick, Henrik AU - Ali, Sajid AU - Kim, Jangho AU - Schmidt, Christian AU - Petreczky, Peter AU - Mukherjee, Swagato ID - 46120 JF - Computer Physics Communications TI - SIMULATeQCD: A simple multi-GPU lattice code for QCD calculations ER - TY - JOUR AU - Altenkort, Luis AU - Eller, Alexander M. AU - Francis, Anthony AU - Kaczmarek, Olaf AU - Mazur, Lukas AU - Moore, Guy D. AU - Shu, Hai-Tao ID - 46119 IS - 1 JF - Physical Review D SN - 2470-0010 TI - Viscosity of pure-glue QCD from the lattice VL - 108 ER - TY - JOUR AB - While FPGA accelerator boards and their respective high-level design tools are maturing, there is still a lack of multi-FPGA applications, libraries, and not least, benchmarks and reference implementations towards sustained HPC usage of these devices. 
As in the early days of GPUs in HPC, for workloads that can reasonably be decoupled into loosely coupled working sets, multi-accelerator support can be achieved by using standard communication interfaces like MPI on the host side. However, for performance and productivity, some applications can profit from a tighter coupling of the accelerators. FPGAs offer unique opportunities here when extending the dataflow characteristics to their communication interfaces. In this work, we extend the HPCC FPGA benchmark suite by multi-FPGA support and three missing benchmarks that particularly characterize or stress inter-device communication: b_eff, PTRANS, and LINPACK. With all benchmarks implemented for current boards with Intel and Xilinx FPGAs, we established a baseline for multi-FPGA performance. Additionally, for the communication-centric benchmarks, we explored the potential of direct FPGA-to-FPGA communication with a circuit-switched inter-FPGA network that is currently only available for one of the boards. The evaluation with parallel execution on up to 26 FPGA boards makes use of one of the largest academic FPGA installations. 
AU - Meyer, Marius AU - Kenter, Tobias AU - Plessl, Christian ID - 38041 JF - ACM Transactions on Reconfigurable Technology and Systems KW - General Computer Science SN - 1936-7406 TI - Multi-FPGA Designs and Scaling of HPC Challenge Benchmarks via MPI and Circuit-Switched Inter-FPGA Networks ER - TY - CHAP AU - Hansmeier, Tim AU - Kenter, Tobias AU - Meyer, Marius AU - Riebler, Heinrich AU - Platzner, Marco AU - Plessl, Christian ED - Haake, Claus-Jochen ED - Meyer auf der Heide, Friedhelm ED - Platzner, Marco ED - Wachsmuth, Henning ED - Wehrheim, Heike ID - 45893 T2 - On-The-Fly Computing -- Individualized IT-services in dynamic markets TI - Compute Centers I: Heterogeneous Execution Environments VL - 412 ER - TY - CONF AU - Opdenhövel, Jan-Oliver AU - Plessl, Christian AU - Kenter, Tobias ID - 46190 T2 - Proceedings of the 13th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies TI - Mutation Tree Reconstruction of Tumor Cells on FPGAs Using a Bit-Level Matrix Representation ER - TY - CONF AU - Faj, Jennifer AU - Kenter, Tobias AU - Faghih-Naini, Sara AU - Plessl, Christian AU - Aizinger, Vadym ID - 46188 T2 - Proceedings of the Platform for Advanced Scientific Computing Conference TI - Scalable Multi-FPGA Design of a Discontinuous Galerkin Shallow-Water Model on Unstructured Meshes ER - TY - CONF AU - Prouveur, Charles AU - Haefele, Matthieu AU - Kenter, Tobias AU - Voss, Nils ID - 46189 T2 - Proceedings of the Platform for Advanced Scientific Computing Conference TI - FPGA Acceleration for HPC Supercapacitor Simulations ER - TY - CONF AB - The computation of electron repulsion integrals (ERIs) over Gaussian-type orbitals (GTOs) is a challenging problem in quantum-mechanics-based atomistic simulations. In practical simulations, several trillions of ERIs may have to be computed for every time step. In this work, we investigate FPGAs as accelerators for the ERI computation. 
We use template parameters, here within the Intel oneAPI tool flow, to create customized designs for 256 different ERI quartet classes, based on their orbitals. To maximize data reuse, all intermediates are buffered in FPGA on-chip memory with customized layout. The pre-calculation of intermediates also helps to overcome data dependencies caused by multi-dimensional recurrence relations. The involved loop structures are partially or even fully unrolled for high throughput of FPGA kernels. Furthermore, a lossy compression algorithm utilizing arbitrary bitwidth integers is integrated in the FPGA kernels. To the best of our knowledge, this is the first work on ERI computation on FPGAs that supports more than just the single most basic quartet class. Also, the integration of ERI computation and compression is a novelty that is not even covered by CPU or GPU libraries so far. Our evaluation shows that using 16-bit integers for the ERI compression, the fastest FPGA kernels exceed the performance of 10 GERIS ($10 \times 10^9$ ERIs per second) on one Intel Stratix 10 GX 2800 FPGA, with maximum absolute errors around $10^{-7}$ - $10^{-5}$ Hartree. The measured throughput can be accurately explained by a performance model. The FPGA kernels deployed on 2 FPGAs outperform similar computations using the widely used libint reference on a two-socket server with 40 Xeon Gold 6148 CPU cores of the same process technology by factors up to 6.0x and on a new two-socket server with 128 EPYC 7713 CPU cores by up to 1.9x. 
AU - Wu, Xin AU - Kenter, Tobias AU - Schade, Robert AU - Kühne, Thomas AU - Plessl, Christian ID - 43228 T2 - 2023 IEEE 31st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) TI - Computing and Compressing Electron Repulsion Integrals on FPGAs ER - TY - JOUR AB - The non-orthogonal local submatrix method applied to electronic structure–based molecular dynamics simulations is shown to exceed 1.1 EFLOP/s in FP16/FP32-mixed floating-point arithmetic when using 4400 NVIDIA A100 GPUs of the Perlmutter system. This is enabled by a modification of the original method that pushes the sustained fraction of the peak performance to about 80%. Example calculations are performed for SARS-CoV-2 spike proteins with up to 83 million atoms. AU - Schade, Robert AU - Kenter, Tobias AU - Elgabarty, Hossam AU - Lass, Michael AU - Kühne, Thomas AU - Plessl, Christian ID - 45361 JF - The International Journal of High Performance Computing Applications KW - Hardware and Architecture KW - Theoretical Computer Science KW - Software SN - 1094-3420 TI - Breaking the exascale barrier for the electronic structure problem in ab-initio molecular dynamics ER - TY - GEN AB - Viscous hydrodynamics serves as a successful mesoscopic description of the Quark-Gluon Plasma produced in relativistic heavy-ion collisions. To investigate how such an effective description emerges from the underlying microscopic dynamics, we calculate the hydrodynamic and non-hydrodynamic modes of linear response in the sound channel from a first-principle calculation in kinetic theory. We do this with a new approach wherein we discretize the collision kernel to directly calculate eigenvalues and eigenmodes of the evolution operator. This allows us to study the Green's functions at any point in the complex frequency space. 
Our study focuses on scalar theory with quartic interaction and we find that the analytic structure of Green's functions in the complex plane is far more complicated than just poles or cuts which is a first step towards an equivalent study in QCD kinetic theory. AU - Ochsenfeld, Stephan AU - Schlichting, Sören ID - 50172 T2 - arXiv:2308.04491 TI - Hydrodynamic and Non-hydrodynamic Excitations in Kinetic Theory -- A Numerical Analysis in Scalar Field Theory ER - TY - GEN AB - Memory Gym presents a suite of 2D partially observable environments, namely Mortar Mayhem, Mystery Path, and Searing Spotlights, designed to benchmark memory capabilities in decision-making agents. These environments, originally with finite tasks, are expanded into innovative, endless formats, mirroring the escalating challenges of cumulative memory games such as ``I packed my bag''. This progression in task design shifts the focus from merely assessing sample efficiency to also probing the levels of memory effectiveness in dynamic, prolonged scenarios. To address the gap in available memory-based Deep Reinforcement Learning baselines, we introduce an implementation that integrates Transformer-XL (TrXL) with Proximal Policy Optimization. This approach utilizes TrXL as a form of episodic memory, employing a sliding window technique. Our comparative study between the Gated Recurrent Unit (GRU) and TrXL reveals varied performances across different settings. TrXL, on the finite environments, demonstrates superior sample efficiency in Mystery Path and outperforms in Mortar Mayhem. However, GRU is more efficient on Searing Spotlights. Most notably, in all endless tasks, GRU makes a remarkable resurgence, consistently outperforming TrXL by significant margins. 
Website and Source Code: https://github.com/MarcoMeter/endless-memory-gym/ AU - Pleines, Marco AU - Pallasch, Matthias AU - Zimmer, Frank AU - Preuss, Mike ID - 50221 T2 - arXiv:2309.17207 TI - Memory Gym: Towards Endless Tasks to Benchmark Memory Capabilities of Agents ER - TY - CHAP AU - Alt, Christoph AU - Kenter, Tobias AU - Faghih-Naini, Sara AU - Faj, Jennifer AU - Opdenhövel, Jan-Oliver AU - Plessl, Christian AU - Aizinger, Vadym AU - Hönig, Jan AU - Köstler, Harald ID - 46191 SN - 0302-9743 T2 - Lecture Notes in Computer Science TI - Shallow Water DG Simulations on FPGAs: Design and Comparison of a Novel Code Generation Pipeline ER - TY - GEN AB - This preprint makes the claim of having computed the $9^{th}$ Dedekind Number. This was done by building an efficient FPGA Accelerator for the core operation of the process, and parallelizing it on the Noctua 2 Supercluster at Paderborn University. The resulting value is 286386577668298411128469151667598498812366. This value can be verified in two steps. We have made the data file containing the 490M results available, each of which can be verified separately on CPU, and the whole file sums to our proposed value. AU - Van Hirtum, Lennart AU - De Causmaecker, Patrick AU - Goemaere, Jens AU - Kenter, Tobias AU - Riebler, Heinrich AU - Lass, Michael AU - Plessl, Christian ID - 43439 T2 - arXiv:2304.03039 TI - A computation of D(9) using FPGA Supercomputing ER - TY - GEN AB - We investigate the early time development of the anisotropic transverse flow and spatial eccentricities of a fireball with various particle-based transport approaches using a fixed initial condition. In numerical simulations ranging from the quasi-collisionless case to the hydrodynamic regime, we find that the onset of $v_n$ and of related measures of anisotropic flow can be described with a simple power-law ansatz, with an exponent that depends on the amount of rescatterings in the system. 
In the few-rescatterings regime we perform semi-analytical calculations, based on a systematic expansion in powers of time and the cross section, which can reproduce the numerical findings. AU - Borghini, Nicolas AU - Borrell, Marc AU - Roch, Hendrik ID - 32177 T2 - arXiv:2201.13294 TI - Early time behavior of spatial and momentum anisotropies in kinetic theory across different Knudsen numbers ER - TY - GEN AB - We test the ability of the "escape mechanism" to create the anisotropic flow observed in high-energy nuclear collisions. We compare the flow harmonics $v_n$ in the few-rescatterings regime from two types of transport simulations, with $2\to 2$ and $2\to 0$ collision kernels respectively, and from analytical calculations neglecting the gain term of the Boltzmann equation. We find that the even flow harmonics are similar in the three approaches, while the odd harmonics differ significantly. AU - Bachmann, Benedikt AU - Borghini, Nicolas AU - Feld, Nina AU - Roch, Hendrik ID - 32178 T2 - arXiv:2203.13306 TI - Even anisotropic-flow harmonics are from Venus, odd ones are from Mars ER - TY - JOUR AU - Hou, W AU - Yao, Y AU - Li, Y AU - Peng, B AU - Shi, K AU - Zhou, Z AU - Pan, J AU - Liu, M AU - Hu, J ID - 32183 IS - 1 JF - Frontiers of materials science SN - 2095-025x TI - Linearly shifting ferromagnetic resonance response of La0.7Sr0.3MnO3 thin film for body temperature sensors VL - 16 ER - TY - JOUR AU - Wojciechowski, M ID - 32234 JF - Data Brief SN - 2352-3409 TI - Dataset for random uniform distributions of 2D circles and 3D spheres. VL - 43 ER - TY - THES AU - Lass, Michael ID - 32414 TI - Bringing Massive Parallelism and Hardware Acceleration to Linear Scaling Density Functional Theory Through Targeted Approximations ER - TY - GEN AB - The Julia programming language has evolved into a modern alternative to fill existing gaps in scientific computing and data science applications. 
Julia leverages a unified and coordinated single-language and ecosystem paradigm and has a proven track record of achieving high performance without sacrificing user productivity. These aspects make Julia a viable alternative to high-performance computing's (HPC's) existing and increasingly costly many-body workflow composition strategy in which traditional HPC languages (e.g., Fortran, C, C++) are used for simulations, and higher-level languages (e.g., Python, R, MATLAB) are used for data analysis and interactive computing. Julia's rapid growth in language capabilities, package ecosystem, and community make it a promising universal language for HPC. This paper presents the views of a multidisciplinary group of researchers from academia, government, and industry that advocate for an HPC software development paradigm that emphasizes developer productivity, workflow portability, and low barriers for entry. We believe that the Julia programming language, its ecosystem, and its community provide modern and powerful capabilities that enable this group's objectives. Crucially, we believe that Julia can provide a feasible and less costly approach to programming scientific applications and workflows that target HPC facilities. In this work, we examine the current practice and role of Julia as a common, end-to-end programming model to address major challenges in scientific reproducibility, data-driven AI/machine learning, co-design and workflows, scalability and performance portability in heterogeneous computing, network communication, data management, and community education. As a result, the diversification of current investments to fulfill the needs of the upcoming decade is crucial as more supercomputing centers prepare for the exascale era. 
AU - Churavy, Valentin AU - Godoy, William F AU - Bauer, Carsten AU - Ranocha, Hendrik AU - Schlottke-Lakemper, Michael AU - Räss, Ludovic AU - Blaschke, Johannes AU - Giordano, Mosè AU - Schnetter, Erik AU - Omlin, Samuel AU - Vetter, Jeffrey S AU - Edelman, Alan ID - 36879 TI - Bridging HPC Communities through the Julia Programming Language ER - TY - JOUR AB - Tailored nanoscale quantum light sources, matching the specific needs of use cases, are crucial building blocks for photonic quantum technologies. Several different approaches to realize solid-state quantum emitters with high performance have been pursued and different concepts for energy tuning have been established. However, the properties of the emitted photons are always defined by the individual quantum emitter and can therefore not be controlled with full flexibility. Here we introduce an all-optical nonlinear method to tailor and control the single photon emission. We demonstrate a laser-controlled down-conversion process from an excited state of a semiconductor quantum three-level system. Based on this concept, we realize energy tuning and polarization control of the single photon emission with a control-laser field. Our results mark an important step towards tailored single photon emission from a photonic quantum system based on quantum optical principles. AU - Jonas, B. AU - Heinze, Dirk Florian AU - Schöll, E. AU - Kallert, P. AU - Langer, T. AU - Krehs, S. AU - Widhalm, A. AU - Jöns, Klaus AU - Reuter, Dirk AU - Schumacher, Stefan AU - Zrenner, Artur ID - 40523 IS - 1 JF - Nature Communications KW - General Physics and Astronomy KW - General Biochemistry KW - Genetics and Molecular Biology KW - General Chemistry KW - Multidisciplinary SN - 2041-1723 TI - Nonlinear down-conversion in a single quantum dot VL - 13 ER - TY - JOUR AU - Altenkort, Luis AU - Eller, Alexander M. AU - Kaczmarek, O. AU - Mazur, Lukas AU - Moore, Guy D. 
AU - Shu, Hai-Tao ID - 46121 IS - 9 JF - Physical Review D SN - 2470-0010 TI - Lattice QCD noise reduction for bosonic correlators through blocking VL - 105 ER - TY - GEN AB - Electronic structure calculations have been instrumental in providing many important insights into a range of physical and chemical properties of various molecular and solid-state systems. Their importance to various fields, including materials science, chemical sciences, computational chemistry and device physics, is underscored by the large fraction of available public supercomputing resources devoted to these calculations. As we enter the exascale era, exciting new opportunities to increase simulation numbers, sizes, and accuracies present themselves. In order to realize these promises, the community of electronic structure software developers will however first have to tackle a number of challenges pertaining to the efficient use of new architectures that will rely heavily on massive parallelism and hardware accelerators. This roadmap provides a broad overview of the state-of-the-art in electronic structure calculations and of the various new directions being pursued by the community. It covers 14 electronic structure codes, presenting their current status, their development priorities over the next five years, and their plans towards tackling the challenges and leveraging the opportunities presented by the advent of exascale computing. AU - Gavini, Vikram AU - Baroni, Stefano AU - Blum, Volker AU - Bowler, David R. AU - Buccheri, Alexander AU - Chelikowsky, James R. AU - Das, Sambit AU - Dawson, William AU - Delugas, Pietro AU - Dogan, Mehmet AU - Draxl, Claudia AU - Galli, Giulia AU - Genovese, Luigi AU - Giannozzi, Paolo AU - Giantomassi, Matteo AU - Gonze, Xavier AU - Govoni, Marco AU - Gulans, Andris AU - Gygi, François AU - Herbert, John M. 
AU - Kokott, Sebastian AU - Kühne, Thomas AU - Liou, Kai-Hsin AU - Miyazaki, Tsuyoshi AU - Motamarri, Phani AU - Nakata, Ayako AU - Pask, John E. AU - Plessl, Christian AU - Ratcliff, Laura E. AU - Richard, Ryan M. AU - Rossi, Mariana AU - Schade, Robert AU - Scheffler, Matthias AU - Schütt, Ole AU - Suryanarayana, Phanish AU - Torrent, Marc AU - Truflandier, Lionel AU - Windus, Theresa L. AU - Xu, Qimen AU - Yu, Victor W. -Z. AU - Perez, Danny ID - 33493 T2 - arXiv:2209.12747 TI - Roadmap on Electronic Structure Codes in the Exascale Era ER - TY - CONF AU - Karp, Martin AU - Podobas, Artur AU - Kenter, Tobias AU - Jansson, Niclas AU - Plessl, Christian AU - Schlatter, Philipp AU - Markidis, Stefano ID - 46193 T2 - International Conference on High Performance Computing in Asia-Pacific Region TI - A High-Fidelity Flow Solver for Unstructured Meshes on Field-Programmable Gate Arrays: Design, Evaluation, and Future Challenges ER - TY - GEN AB - The CP2K program package, which can be considered as the Swiss army knife of atomistic simulations, is presented with a special emphasis on ab-initio molecular dynamics using the second-generation Car-Parrinello method. After outlining current and near-term development efforts with regard to massively parallel low-scaling post-Hartree-Fock and eigenvalue solvers, novel approaches on how we plan to take full advantage of future low-precision hardware architectures are introduced. Our focus here is on combining our submatrix method with the approximate computing paradigm to address the imminent exascale era. AU - Kühne, Thomas AU - Plessl, Christian AU - Schade, Robert AU - Schütt, Ole ID - 32404 T2 - arXiv:2205.14741 TI - CP2K on the road to exascale ER - TY - JOUR AB - A parallel hybrid quantum-classical algorithm for the solution of the quantum-chemical ground-state energy problem on gate-based quantum computers is presented. 
This approach is based on the reduced density-matrix functional theory (RDMFT) formulation of the electronic structure problem. For that purpose, the density-matrix functional of the full system is decomposed into an indirectly coupled sum of density-matrix functionals for all its subsystems using the adaptive cluster approximation to RDMFT. The approximations involved in the decomposition and the adaptive cluster approximation itself can be systematically converged to the exact result. The solution for the density-matrix functionals of the effective subsystems involves a constrained minimization over many-particle states that are approximated by parametrized trial states on the quantum computer similarly to the variational quantum eigensolver. The independence of the density-matrix functionals of the effective subsystems introduces a new level of parallelization and allows for the computational treatment of much larger molecules on a quantum computer with a given qubit count. In addition, for the proposed algorithm, techniques are presented to reduce the qubit count, the number of quantum programs, as well as its depth. The evaluation of a density-matrix functional as the essential part of our approach is demonstrated for Hubbard-like systems on IBM quantum computers based on superconducting transmon qubits. AU - Schade, Robert AU - Bauer, Carsten AU - Tamoev, Konstantin AU - Mazur, Lukas AU - Plessl, Christian AU - Kühne, Thomas ID - 33226 JF - Phys. Rev. Research TI - Parallel quantum chemistry on noisy intermediate-scale quantum computers VL - 4 ER - TY - GEN AB - Electronic structure calculations have been instrumental in providing many important insights into a range of physical and chemical properties of various molecular and solid-state systems. 
Their importance to various fields, including materials science, chemical sciences, computational chemistry and device physics, is underscored by the large fraction of available public supercomputing resources devoted to these calculations. As we enter the exascale era, exciting new opportunities to increase simulation numbers, sizes, and accuracies present themselves. In order to realize these promises, the community of electronic structure software developers will however first have to tackle a number of challenges pertaining to the efficient use of new architectures that will rely heavily on massive parallelism and hardware accelerators. This roadmap provides a broad overview of the state-of-the-art in electronic structure calculations and of the various new directions being pursued by the community. It covers 14 electronic structure codes, presenting their current status, their development priorities over the next five years, and their plans towards tackling the challenges and leveraging the opportunities presented by the advent of exascale computing. AU - Gavini, Vikram AU - Baroni, Stefano AU - Blum, Volker AU - Bowler, David R. AU - Buccheri, Alexander AU - Chelikowsky, James R. AU - Das, Sambit AU - Dawson, William AU - Delugas, Pietro AU - Dogan, Mehmet AU - Draxl, Claudia AU - Galli, Giulia AU - Genovese, Luigi AU - Giannozzi, Paolo AU - Giantomassi, Matteo AU - Gonze, Xavier AU - Govoni, Marco AU - Gulans, Andris AU - Gygi, François AU - Herbert, John M. AU - Kokott, Sebastian AU - Kühne, Thomas AU - Liou, Kai-Hsin AU - Miyazaki, Tsuyoshi AU - Motamarri, Phani AU - Nakata, Ayako AU - Pask, John E. AU - Plessl, Christian AU - Ratcliff, Laura E. AU - Richard, Ryan M. AU - Rossi, Mariana AU - Schade, Robert AU - Scheffler, Matthias AU - Schütt, Ole AU - Suryanarayana, Phanish AU - Torrent, Marc AU - Truflandier, Lionel AU - Windus, Theresa L. AU - Xu, Qimen AU - Yu, Victor W. -Z. 
AU - Perez, Danny ID - 46275 T2 - arXiv:2209.12747 TI - Roadmap on Electronic Structure Codes in the Exascale Era ER - TY - JOUR AU - Schade, Robert AU - Kenter, Tobias AU - Elgabarty, Hossam AU - Lass, Michael AU - Schütt, Ole AU - Lazzaro, Alfio AU - Pabst, Hans AU - Mohr, Stephan AU - Hutter, Jürg AU - Kühne, Thomas AU - Plessl, Christian ID - 33684 JF - Parallel Computing KW - Artificial Intelligence KW - Computer Graphics and Computer-Aided Design KW - Computer Networks and Communications KW - Hardware and Architecture KW - Theoretical Computer Science KW - Software SN - 0167-8191 TI - Towards electronic structure-based ab-initio molecular dynamics simulations with hundreds of millions of atoms VL - 111 ER - TY - JOUR AU - Meyer, Marius AU - Kenter, Tobias AU - Plessl, Christian ID - 27364 JF - Journal of Parallel and Distributed Computing SN - 0743-7315 TI - In-depth FPGA Accelerator Performance Evaluation with Single Node Benchmarks from the HPC Challenge Benchmark Suite for Intel and Xilinx FPGAs using OpenCL ER - TY - JOUR AB - Recent advances in numerical methods significantly pushed forward the understanding of electrons coupled to quantized lattice vibrations. At this stage, it becomes increasingly important to also account for the effects of physically inevitable environments. In particular, we study the transport properties of the Hubbard-Holstein Hamiltonian that models a large class of materials characterized by strong electron-phonon coupling, in contact with a dissipative environment. Even in the one-dimensional and isolated case, simulating the quantum dynamics of such a system with high accuracy is very challenging due to the infinite dimensionality of the phononic Hilbert spaces. For this reason, the effects of dissipation on the conductance properties of such systems have not been investigated systematically so far. 
We combine the non-Markovian hierarchy of pure states method and the Markovian quantum jumps method with the newly introduced projected purified density-matrix renormalization group, creating powerful tensor-network methods for dissipative quantum many-body systems. Investigating their numerical properties, we find a significant speedup up to a factor $\sim 30$ compared to conventional tensor-network techniques. We apply these methods to study dissipative quenches, aiming for an in-depth understanding of the formation, stability, and quasi-particle properties of bipolarons. Surprisingly, our results show that in the metallic phase dissipation localizes the bipolarons, which is reminiscent of an indirect quantum Zeno effect. However, the bipolaronic binding energy remains mainly unaffected, even in the presence of strong dissipation, exhibiting remarkable bipolaron stability. These findings shed light on the problem of designing real materials exhibiting phonon-mediated high-$T_\mathrm{C}$ superconductivity. AU - Moroder, Mattia AU - Grundner, Martin AU - Damanet, François AU - Schollwöck, Ulrich AU - Mardazad, Sam AU - Flannigan, Stuart AU - Köhler, Thomas AU - Paeckel, Sebastian ID - 50146 JF - Physical Review B 107, 214310 (2023) TI - Stable bipolarons in open quantum systems ER - TY - JOUR AB - We develop a general decomposition of an ensemble of initial density profiles in terms of an average state and a basis of modes that represent the event-by-event fluctuations of the initial state. The basis is determined such that the probability distributions of the amplitudes of different modes are uncorrelated. Based on this decomposition, we quantify the different types and probabilities of event-by-event fluctuations in Glauber and Saturation models and investigate how the various modes affect different characteristics of the initial state. 
We perform simulations of the dynamical evolution with KoMPoST and MUSIC to investigate the impact of the modes on final-state observables and their correlations. AU - Borghini, Nicolas AU - Borrell, Marc AU - Feld, Nina AU - Roch, Hendrik AU - Schlichting, Sören AU - Werthmann, Clemens ID - 50148 JF - Phys. Rev. C 107 (2023) 034905 TI - Statistical analysis of initial state and final state response in heavy-ion collisions ER - TY - JOUR AB - RNA editing processes are strikingly different in animals and plants. Up to thousands of specific cytidines are converted into uridines in plant chloroplasts and mitochondria whereas up to millions of adenosines are converted into inosines in animal nucleo-cytosolic RNAs. It is unknown whether these two different RNA editing machineries are mutually incompatible. RNA-binding pentatricopeptide repeat (PPR) proteins are the key factors of plant organelle cytidine-to-uridine RNA editing. The complete absence of PPR mediated editing of cytosolic RNAs might be due to a yet unknown barrier that prevents its activity in the cytosol. Here, we transferred two plant mitochondrial PPR-type editing factors into human cell lines to explore whether they could operate in the nucleo-cytosolic environment. PPR56 and PPR65 not only faithfully edited their native, co-transcribed targets but also different sets of off-targets in the human background transcriptome. More than 900 of such off-targets with editing efficiencies up to 91%, largely explained by known PPR-RNA binding properties, were identified for PPR56. Engineering two crucial amino acid positions in its PPR array led to predictable shifts in target recognition. We conclude that plant PPR editing factors can operate in the entirely different genetic environment of the human nucleo-cytosol and can be intentionally re-engineered towards new targets. 
AU - Lesch, Elena AU - Schilling, Maximilian T AU - Brenner, Sarah AU - Yang, Yingying AU - Gruss, Oliver J AU - Knoop, Volker AU - Schallenberg-Rüdinger, Mareike ID - 50149 IS - 17 JF - Nucleic Acids Research KW - Genetics SN - 0305-1048 TI - Plant mitochondrial RNA editing factors can perform targeted C-to-U editing of nuclear transcripts in human cells VL - 50 ER - TY - JOUR AB - N-body methods are one of the essential algorithmic building blocks of high-performance and parallel computing. Previous research has shown promising performance for implementing n-body simulations with pairwise force calculations on FPGAs. However, to avoid challenges with accumulation and memory access patterns, the presented designs calculate each pair of forces twice, along with both force sums of the involved particles. Also, they require large problem instances with hundreds of thousands of particles to reach their respective peak performance, limiting the applicability for strong scaling scenarios. This work addresses both issues by presenting a novel FPGA design that uses each calculated force twice and overlaps data transfers and computations in a way that allows it to reach peak performance even for small problem instances, outperforming previous single precision results even in double precision, and scaling linearly over multiple interconnected FPGAs. For a comparison across architectures, we provide an equally optimized CPU reference, which for large problems actually achieves higher peak performance per device. However, given the strong scaling advantages of the FPGA design, in parallel setups with a few thousand particles per device, the FPGA platform achieves the highest performance and power efficiency. 
AU - Menzel, Johannes AU - Plessl, Christian AU - Kenter, Tobias ID - 28099 IS - 1 JF - ACM Transactions on Reconfigurable Technology and Systems SN - 1936-7406 TI - The Strong Scaling Advantage of FPGAs in HPC for N-body Simulations VL - 15 ER - TY - CONF AU - Meyer, Marius ID - 27365 T2 - Proceedings of the 11th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies TI - Towards Performance Characterization of FPGAs in Context of HPC using OpenCL Benchmarks ER - TY - CONF AU - Nickchen, Tobias AU - Heindorf, Stefan AU - Engels, Gregor ID - 20886 T2 - Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision TI - Generating Physically Sound Training Data for Image Recognition of Additively Manufactured Parts ER - TY - JOUR AB - The defining feature of active particles is that they constantly propel themselves by locally converting chemical energy into directed motion. This active self-propulsion prevents them from equilibrating with their thermal environment (e.g. an aqueous solution), thus keeping them permanently out of equilibrium. Nevertheless, the spatial dynamics of active particles might share certain equilibrium features, in particular in the steady state. We here focus on the time-reversal symmetry of individual spatial trajectories as a distinct equilibrium characteristic. We investigate to what extent the steady-state trajectories of a trapped active particle obey or break this time-reversal symmetry. Within the framework of active Ornstein–Uhlenbeck particles we find that the steady-state trajectories in a harmonic potential fulfill path-wise time-reversal symmetry exactly, while this symmetry is typically broken in anharmonic potentials. 
AU - Dabelow, Lennart AU - Bo, Stefano AU - Eichhorn, Ralf ID - 32243 IS - 3 JF - Journal of Statistical Mechanics: Theory and Experiment KW - Statistics KW - Probability and Uncertainty KW - Statistics and Probability KW - Statistical and Nonlinear Physics SN - 1742-5468 TI - How irreversible are steady-state trajectories of a trapped active particle? VL - 2021 ER - TY - GEN AB - We push the boundaries of electronic structure-based \textit{ab-initio} molecular dynamics (AIMD) beyond 100 million atoms. This scale is otherwise barely reachable with classical force-field methods or novel neural network and machine learning potentials. We achieve this breakthrough by combining innovations in linear-scaling AIMD, efficient and approximate sparse linear algebra, low and mixed-precision floating-point computation on GPUs, and a compensation scheme for the errors introduced by numerical approximations. The core of our work is the non-orthogonalized local submatrix method (NOLSM), which scales very favorably to massively parallel computing systems and translates large sparse matrix operations into highly parallel, dense matrix operations that are ideally suited to hardware accelerators. We demonstrate that the NOLSM method, which is at the center point of each AIMD step, is able to achieve a sustained performance of 324 PFLOP/s in mixed FP16/FP32 precision corresponding to an efficiency of 67.7% when running on 1536 NVIDIA A100 GPUs. AU - Schade, Robert AU - Kenter, Tobias AU - Elgabarty, Hossam AU - Lass, Michael AU - Schütt, Ole AU - Lazzaro, Alfio AU - Pabst, Hans AU - Mohr, Stephan AU - Hutter, Jürg AU - Kühne, Thomas D. 
AU - Plessl, Christian ID - 32244 T2 - arXiv:2104.08245 TI - Towards Electronic Structure-Based Ab-Initio Molecular Dynamics Simulations with Hundreds of Millions of Atoms ER - TY - GEN AB - Optical travelling wave antennas offer unique opportunities to control and selectively guide light into a specific direction which renders them as excellent candidates for optical communication and sensing. These applications require state of the art engineering to reach optimized functionalities such as high directivity and radiation efficiency, low side lobe level, broadband and tunable capabilities, and compact design. In this work we report on the numerical optimization of the directivity of optical travelling wave antennas made from low-loss dielectric materials using full-wave numerical simulations in conjunction with a particle swarm optimization algorithm. The antennas are composed of a reflector and a director deposited on a glass substrate and an emitter placed in the feed gap between them serves as an internal source of excitation. In particular, we analysed antennas with rectangular- and horn-shaped directors made of either Hafnium dioxide or Silicon. The optimized antennas produce highly directional emission due to the presence of two dominant guided TE modes in the director in addition to leaky modes. These guided modes dominate the far-field emission pattern and govern the direction of the main lobe emission which predominately originates from the end facet of the director. Our work also provides a comprehensive analysis of the modes, radiation patterns, parametric influences, and bandwidths of the antennas that highlights their robust nature. 
AU - Farheen, Henna AU - Leuteritz, Till AU - Linden, Stefan AU - Myroshnychenko, Viktor AU - Förstner, Jens ID - 32245 T2 - arXiv:2106.02468 TI - Optimization of optical waveguide antennas for directive emission of light ER - TY - GEN AB - The interaction between quantum light and matter is being intensively studied for systems that are enclosed in high-$Q$ cavities which strongly enhance the light-matter coupling. However, for many applications, cavities with lower $Q$-factors are preferred due to the increased spectral width of the cavity mode. Here, we investigate the interaction between quantum light and matter represented by a $\Lambda$-type three-level system in lossy cavities, assuming that cavity losses are the dominant loss mechanism. We demonstrate that cavity losses lead to non-trivial steady states of the electronic occupations that can be controlled by the loss rate and the initial statistics of the quantum fields. The mechanism of formation of such steady states can be understood on the basis of the equations of motion. Analytical expressions for steady states and their numerical simulations are presented and discussed. AU - Rose, H. AU - Tikhonova, O. V. AU - Meier, T. AU - Sharapova, P. ID - 32236 T2 - arXiv:2109.00842 TI - Steady states of $Λ$-type three-level systems excited by quantum light in lossy cavities ER - TY - JOUR AU - Kaczmarek, Olaf AU - Mazur, Lukas AU - Sharma, Sayantan ID - 46122 IS - 9 JF - Physical Review D SN - 2470-0010 TI - Eigenvalue spectra of QCD and the fate of UA(1) breaking towards the chiral limit VL - 104 ER - TY - JOUR AU - Altenkort, Luis AU - Eller, Alexander M. AU - Kaczmarek, O. AU - Mazur, Lukas AU - Moore, Guy D. AU - Shu, H.-T. ID - 46124 IS - 1 JF - Physical Review D SN - 2470-0010 TI - Heavy quark momentum diffusion from the lattice using gradient flow VL - 103 ER - TY - JOUR AU - Altenkort, Luis AU - Eller, Alexander M. AU - Kaczmarek, O. AU - Mazur, Lukas AU - Moore, Guy D. AU - Shu, H.-T. 
ID - 46123 IS - 11 JF - Physical Review D SN - 2470-0010 TI - Sphaleron rate from Euclidean lattice correlators: An exploration VL - 103 ER - TY - CONF AU - Kenter, Tobias AU - Shambhu, Adesh AU - Faghih-Naini, Sara AU - Aizinger, Vadym ID - 46194 T2 - Proceedings of the Platform for Advanced Scientific Computing Conference TI - Algorithm-hardware co-design of a discontinuous Galerkin shallow-water model for a dataflow architecture on FPGA ER - TY - CONF AU - Karp, Martin AU - Podobas, Artur AU - Jansson, Niclas AU - Kenter, Tobias AU - Plessl, Christian AU - Schlatter, Philipp AU - Markidis, Stefano ID - 46195 T2 - 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) TI - High-Performance Spectral Element Methods on Field-Programmable Gate Arrays : Implementation, Evaluation, and Future Projection ER - TY - CHAP AB - Solving partial differential equations on unstructured grids is a cornerstone of engineering and scientific computing. Nowadays, heterogeneous parallel platforms with CPUs, GPUs, and FPGAs enable energy-efficient and computationally demanding simulations. We developed the HighPerMeshes C++-embedded Domain-Specific Language (DSL) for bridging the abstraction gap between the mathematical and algorithmic formulation of mesh-based algorithms for PDE problems on the one hand and an increasing number of heterogeneous platforms with their different parallel programming and runtime models on the other hand. Thus, the HighPerMeshes DSL aims at higher productivity in the code development process for multiple target platforms. We introduce the concepts as well as the basic structure of the HighPerMeshes DSL, and demonstrate its usage with three examples, a Poisson and monodomain problem, respectively, solved by the continuous finite element method, and the discontinuous Galerkin method for Maxwell’s equation. 
The mapping of the abstract algorithmic description onto parallel hardware, including distributed memory compute clusters, is presented. Finally, the achievable performance and scalability are demonstrated for a typical example problem on a multi-core CPU cluster. AU - Alhaddad, Samer AU - Förstner, Jens AU - Groth, Stefan AU - Grünewald, Daniel AU - Grynko, Yevgen AU - Hannig, Frank AU - Kenter, Tobias AU - Pfreundt, Franz-Josef AU - Plessl, Christian AU - Schotte, Merlind AU - Steinke, Thomas AU - Teich, Jürgen AU - Weiser, Martin AU - Wende, Florian ID - 21587 KW - tet_topic_hpc SN - 0302-9743 T2 - Euro-Par 2020: Parallel Processing Workshops TI - HighPerMeshes – A Domain-Specific Language for Numerical Algorithms on Unstructured Grids ER - TY - CHAP AU - Ramaswami, Arjun AU - Kenter, Tobias AU - Kühne, Thomas AU - Plessl, Christian ID - 29936 SN - 0302-9743 T2 - Applied Reconfigurable Computing. Architectures, Tools, and Applications TI - Evaluating the Design Space for Offloading 3D FFT Calculations to an FPGA for High-Performance Computing ER - TY - JOUR AU - Alhaddad, Samer AU - Förstner, Jens AU - Groth, Stefan AU - Grünewald, Daniel AU - Grynko, Yevgen AU - Hannig, Frank AU - Kenter, Tobias AU - Pfreundt, Franz‐Josef AU - Plessl, Christian AU - Schotte, Merlind AU - Steinke, Thomas AU - Teich, Jürgen AU - Weiser, Martin AU - Wende, Florian ID - 24788 JF - Concurrency and Computation: Practice and Experience KW - tet_topic_hpc SN - 1532-0626 TI - The HighPerMeshes framework for numerical algorithms on unstructured grids ER - TY - JOUR AB -
The effect of traces of ethanol in supercritical carbon dioxide on the mixture's thermodynamic properties is studied by molecular simulations and Taylor dispersion measurements.
AU - Chatwell, René Spencer AU - Guevara-Carrion, Gabriela AU - Gaponenko, Yuri AU - Shevtsova, Valentina AU - Vrabec, Jadran ID - 32240 IS - 4 JF - Physical Chemistry Chemical Physics KW - Physical and Theoretical Chemistry KW - General Physics and Astronomy SN - 1463-9076 TI - Diffusion of the carbon dioxide–ethanol mixture in the extended critical region VL - 23 ER - TY - CONF AU - Karp, Martin AU - Podobas, Artur AU - Jansson, Niclas AU - Kenter, Tobias AU - Plessl, Christian AU - Schlatter, Philipp AU - Markidis, Stefano ID - 29937 T2 - 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) TI - High-Performance Spectral Element Methods on Field-Programmable Gate Arrays : Implementation, Evaluation, and Future Projection ER - TY - CHAP AU - Nickchen, Tobias AU - Engels, Gregor AU - Lohn, Johannes ID - 18789 SN - 9783030543334 T2 - Industrializing Additive Manufacturing TI - Opportunities of 3D Machine Learning for Manufacturability Analysis and Component Recognition in the Additive Manufacturing Process Chain ER - TY - JOUR AB -
State-of-the-art methods in materials science such as artificial intelligence and data-driven techniques advance the investigation of photovoltaic materials.
AU - Mirhosseini, Hossein AU - Kormath Madam Raghupathy, Ramya AU - Sahoo, Sudhir K. AU - Wiebeler, Hendrik AU - Chugh, Manjusha AU - Kühne, Thomas D. ID - 32246 IS - 46 JF - Physical Chemistry Chemical Physics KW - Physical and Theoretical Chemistry KW - General Physics and Astronomy SN - 1463-9076 TI - In silico investigation of Cu(In,Ga)Se2-based solar cells VL - 22 ER - TY - GEN AB - We consider a resource-aware variant of the classical multi-armed bandit problem: In each round, the learner selects an arm and determines a resource limit. It then observes a corresponding (random) reward, provided the (random) amount of consumed resources remains below the limit. Otherwise, the observation is censored, i.e., no reward is obtained. For this problem setting, we introduce a measure of regret, which incorporates the actual amount of allocated resources of each learning round as well as the optimality of realizable rewards. Thus, to minimize regret, the learner needs to set a resource limit and choose an arm in such a way that the chance to realize a high reward within the predefined resource limit is high, while the resource limit itself should be kept as low as possible. We derive the theoretical lower bound on the cumulative regret and propose a learning algorithm having a regret upper bound that matches the lower bound. In a simulation study, we show that our learning algorithm outperforms straightforward extensions of standard multi-armed bandit algorithms. AU - Bengs, Viktor AU - Hüllermeier, Eyke ID - 32242 T2 - arXiv:2011.00813 TI - Multi-Armed Bandits with Censored Consumption of Resources ER - TY - JOUR AB - CP2K is an open source electronic structure and molecular dynamics software package to perform atomistic simulations of solid-state, liquid, molecular, and biological systems. It is especially aimed at massively parallel and linear-scaling electronic structure methods and state-of-the-art ab initio molecular dynamics simulations. 
Excellent performance for electronic structure calculations is achieved using novel algorithms implemented for modern high-performance computing systems. This review revisits the main capabilities of CP2K to perform efficient and accurate electronic structure simulations. The emphasis is put on density functional theory and multiple post–Hartree–Fock methods using the Gaussian and plane wave approach and its augmented all-electron extension. AU - Kühne, Thomas AU - Iannuzzi, Marcella AU - Ben, Mauro Del AU - Rybkin, Vladimir V. AU - Seewald, Patrick AU - Stein, Frederick AU - Laino, Teodoro AU - Khaliullin, Rustam Z. AU - Schütt, Ole AU - Schiffmann, Florian AU - Golze, Dorothea AU - Wilhelm, Jan AU - Chulkov, Sergey AU - Bani-Hashemian, Mohammad Hossein AU - Weber, Valéry AU - Borstnik, Urban AU - Taillefumier, Mathieu AU - Jakobovits, Alice Shoshana AU - Lazzaro, Alfio AU - Pabst, Hans AU - Müller, Tiziano AU - Schade, Robert AU - Guidon, Manuel AU - Andermatt, Samuel AU - Holmberg, Nico AU - Schenter, Gregory K. AU - Hehn, Anna AU - Bussy, Augustin AU - Belleflamme, Fabian AU - Tabacchi, Gloria AU - Glöß, Andreas AU - Lass, Michael AU - Bethune, Iain AU - Mundy, Christopher J. AU - Plessl, Christian AU - Watkins, Matt AU - VandeVondele, Joost AU - Krack, Matthias AU - Hutter, Jürg ID - 16277 IS - 19 JF - The Journal of Chemical Physics TI - CP2K: An electronic structure and molecular dynamics software package - Quickstep: Efficient and accurate electronic structure calculations VL - 152 ER - TY - CONF AB - Electronic structure calculations based on density-functional theory (DFT) represent a significant part of today's HPC workloads and pose high demands on high-performance computing resources. To perform these quantum-mechanical DFT calculations on complex large-scale systems, so-called linear scaling methods instead of conventional cubic scaling methods are required. 
In this work, we take up the idea of the submatrix method and apply it to the DFT computations in the software package CP2K. For that purpose, we transform the underlying numeric operations on distributed, large, sparse matrices into computations on local, much smaller and nearly dense matrices. This allows us to exploit the full floating-point performance of modern CPUs and to make use of dedicated accelerator hardware, where performance has been limited by memory bandwidth before. We demonstrate both functionality and performance of our implementation and show how it can be accelerated with GPUs and FPGAs. AU - Lass, Michael AU - Schade, Robert AU - Kühne, Thomas AU - Plessl, Christian ID - 16898 T2 - Proc. International Conference for High Performance Computing, Networking, Storage and Analysis (SC) TI - A Submatrix-Based Method for Approximate Matrix Function Evaluation in the Quantum Chemistry Code CP2K ER - TY - CONF AB - FPGAs have found increasing adoption in data center applications since a new generation of high-level tools have become available which noticeably reduce development time for FPGA accelerators and still provide high-quality results. There is, however, no high-level benchmark suite available, which specifically enables a comparison of FPGA architectures, programming tools, and libraries for HPC applications. To fill this gap, we have developed an OpenCL-based open-source implementation of the HPCC benchmark suite for Xilinx and Intel FPGAs. This benchmark can serve to analyze the current capabilities of FPGA devices, cards, and development tool flows, track progress over time, and point out specific difficulties for FPGA acceleration in the HPC domain. Additionally, the benchmark documents proven performance optimization patterns. We will continue optimizing and porting the benchmark for new generations of FPGAs and design tools and encourage active participation to create a valuable tool for the community. 
AU - Meyer, Marius AU - Kenter, Tobias AU - Plessl, Christian ID - 21632 KW - FPGA KW - OpenCL KW - High Level Synthesis KW - HPC benchmarking SN - 9781665415927 T2 - 2020 IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC) TI - Evaluating FPGA Accelerator Performance with a Parameterized OpenCL Adaptation of Selected Benchmarks of the HPCChallenge Benchmark Suite ER - TY - JOUR AB - In scientific computing, the acceleration of atomistic computer simulations by means of custom hardware is finding ever-growing application. A major limitation, however, is that the high efficiency in terms of performance and low power consumption entails the massive usage of low precision computing units. Here, based on the approximate computing paradigm, we present an algorithmic method to compensate for numerical inaccuracies due to low accuracy arithmetic operations rigorously, yet still obtaining exact expectation values using a properly modified Langevin-type equation. AU - Rengaraj, Varadarajan AU - Lass, Michael AU - Plessl, Christian AU - Kühne, Thomas ID - 12878 IS - 2 JF - Computation TI - Accurate Sampling with Noisy Forces from Approximate Computing VL - 8 ER - TY - JOUR AU - Riebler, Heinrich AU - Vaz, Gavin Francis AU - Kenter, Tobias AU - Plessl, Christian ID - 7689 IS - 2 JF - ACM Trans. Archit. Code Optim. 
(TACO) KW - htrop TI - Transparent Acceleration for Heterogeneous Platforms with Compilation to OpenCL VL - 16 ER - TY - CONF AB - Stratix 10 FPGA cards have a good potential for the acceleration of HPC workloads since the Stratix 10 product line introduces devices with a large number of DSP and memory blocks. The high level synthesis of OpenCL codes can play a fundamental role for FPGAs in HPC, because it allows to implement different designs with lower development effort compared to hand optimized HDL. However, Stratix 10 cards are still hard to fully exploit using the Intel FPGA SDK for OpenCL. The implementation of designs with thousands of concurrent arithmetic operations often suffers from place and route problems that limit the maximum frequency or entirely prevent a successful synthesis. In order to overcome these issues for the implementation of the matrix multiplication, we formulate Cannon's matrix multiplication algorithm with regard to its efficient synthesis within the FPGA logic. We obtain a two-level block algorithm, where the lower level sub-matrices are multiplied using our Cannon's algorithm implementation. Following this design approach with multiple compute units, we are able to get maximum frequencies close to and above 300 MHz with high utilization of DSP and memory blocks. This allows for performance results above 1 TeraFLOPS. AU - Gorlani, Paolo AU - Kenter, Tobias AU - Plessl, Christian ID - 15478 T2 - Proceedings of the International Conference on Field-Programmable Technology (FPT) TI - OpenCL Implementation of Cannon's Matrix Multiplication Algorithm on Intel Stratix 10 FPGAs ER - TY - THES AU - Riebler, Heinrich ID - 34167 TI - Efficient parallel branch-and-bound search on FPGAs using work stealing and instance-specific designs ER - TY - JOUR AB - We address the general mathematical problem of computing the inverse p-th root of a given matrix in an efficient way. 
A new method to construct iteration functions that allow calculating arbitrary p-th roots and their inverses of symmetric positive definite matrices is presented. We show that the order of convergence is at least quadratic and that adaptively adjusting a parameter q always leads to an even faster convergence. In this way, a better performance than with previously known iteration schemes is achieved. The efficiency of the iterative functions is demonstrated for various matrices with different densities, condition numbers and spectral radii. AU - Richters, Dorothee AU - Lass, Michael AU - Walther, Andrea AU - Plessl, Christian AU - Kühne, Thomas ID - 21 IS - 2 JF - Communications in Computational Physics TI - A General Algorithm to Calculate the Inverse Principal p-th Root of Symmetric Positive Definite Matrices VL - 25 ER - TY - JOUR AU - Platzner, Marco AU - Plessl, Christian ID - 12871 JF - Informatik Spektrum SN - 0170-6012 TI - FPGAs im Rechenzentrum ER - TY - JOUR AB - Approximate computing has shown to provide new ways to improve performance and power consumption of error-resilient applications. While many of these applications can be found in image processing, data classification or machine learning, we demonstrate its suitability to a problem from scientific computing. Utilizing the self-correcting behavior of iterative algorithms, we show that approximate computing can be applied to the calculation of inverse matrix p-th roots which are required in many applications in scientific computing. Results show great opportunities to reduce the computational effort and bandwidth required for the execution of the discussed algorithm, especially when targeting special accelerator hardware. 
AU - Lass, Michael AU - Kühne, Thomas AU - Plessl, Christian ID - 20 IS - 2 JF - Embedded Systems Letters SN - 1943-0663 TI - Using Approximate Computing for the Calculation of Inverse Matrix p-th Roots VL - 10 ER - TY - CONF AB - This paper describes a data structure and a heuristic to plan and map arbitrary resources in complex combinations while applying time dependent constraints. The approach is used in the planning based workload manager OpenCCS at the Paderborn Center for Parallel Computing (PC\(^2\)) to operate heterogeneous clusters with up to 10000 cores. We also show performance results derived from four years of operation. AU - Keller, Axel ED - Klusáček, D. ED - Cirne, W. ED - Desai, N. ID - 22 KW - Scheduling Planning Mapping Workload management SN - 978-3-319-77398-8 T2 - Proc. Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP) TI - A Data Structure for Planning Based Workload Management of Heterogeneous HPC Systems VL - 10773 ER - TY - JOUR AU - Mertens, Jan Cedric AU - Boschmann, Alexander AU - Schmidt, M. AU - Plessl, Christian ID - 6516 IS - 4 JF - Sports Engineering SN - 1369-7072 TI - Sprint diagnostic with GPS and inertial sensor fusion VL - 21 ER - TY - JOUR AU - Luk, Samuel M. H. AU - Lewandowski, P. AU - Kwong, N. H. AU - Baudin, E. AU - Lafont, O. AU - Tignon, J. AU - Leung, P. T. AU - Chan, Ch. K. P. AU - Babilon, M. AU - Schumacher, Stefan AU - Binder, R. ID - 13348 IS - 1 JF - Journal of the Optical Society of America B SN - 0740-3224 TI - Theory of optically controlled anisotropic polariton transport in semiconductor double microcavities VL - 35 ER - TY - CONF AB - The exploration of FPGAs as accelerators for scientific simulations has so far mostly been focused on small kernels of methods working on regular data structures, for example in the form of stencil computations for finite difference methods. 
In computational sciences, often more advanced methods are employed that promise better stability, convergence, locality and scaling. Unstructured meshes are shown to be more effective and more accurate, compared to regular grids, in representing computation domains of various shapes. Using unstructured meshes, the discontinuous Galerkin method preserves the ability to perform explicit local update operations for simulations in the time domain. In this work, we investigate FPGAs as target platform for an implementation of the nodal discontinuous Galerkin method to find time-domain solutions of Maxwell's equations in an unstructured mesh. When maximizing data reuse and fitting constant coefficients into suitably partitioned on-chip memory, high computational intensity allows us to implement and feed wide data paths with hundreds of floating point operators. By decoupling off-chip memory accesses from the computations, high memory bandwidth can be sustained, even for the irregular access pattern required by parts of the application. Using the Intel/Altera OpenCL SDK for FPGAs, we present different implementation variants for different polynomial orders of the method. In different phases of the algorithm, either computational or bandwidth limits of the Arria 10 platform are almost reached, thus outperforming a highly multithreaded CPU implementation by around 2x. AU - Kenter, Tobias AU - Mahale, Gopinath AU - Alhaddad, Samer AU - Grynko, Yevgen AU - Schmitt, Christian AU - Afzal, Ayesha AU - Hannig, Frank AU - Förstner, Jens AU - Plessl, Christian ID - 1588 KW - tet_topic_hpc T2 - Proc. Int. Symp. 
on Field-Programmable Custom Computing Machines (FCCM) TI - OpenCL-based FPGA Design to Accelerate the Nodal Discontinuous Galerkin Method for Unstructured Meshes ER - TY - CONF AB - We present the submatrix method, a highly parallelizable method for the approximate calculation of inverse p-th roots of large sparse symmetric matrices which are required in different scientific applications. Following the idea of Approximate Computing, we allow imprecision in the final result in order to utilize the sparsity of the input matrix and to allow massively parallel execution. For an n x n matrix, the proposed algorithm allows to distribute the calculations over n nodes with only little communication overhead. The result matrix exhibits the same sparsity pattern as the input matrix, allowing for efficient reuse of allocated data structures. We evaluate the algorithm with respect to the error that it introduces into calculated results, as well as its performance and scalability. We demonstrate that the error is relatively limited for well-conditioned matrices and that results are still valuable for error-resilient applications like preconditioning even for ill-conditioned matrices. We discuss the execution time and scaling of the algorithm on a theoretical level and present a distributed implementation of the algorithm using MPI and OpenMP. We demonstrate the scalability of this implementation by running it on a high-performance compute cluster comprised of 1024 CPU cores, showing a speedup of 665x compared to single-threaded execution. AU - Lass, Michael AU - Mohr, Stephan AU - Wiebeler, Hendrik AU - Kühne, Thomas AU - Plessl, Christian ID - 1590 KW - approximate computing KW - linear algebra KW - matrix inversion KW - matrix p-th roots KW - numeric algorithm KW - parallel computing SN - 978-1-4503-5891-0/18/07 T2 - Proc. 
Platform for Advanced Scientific Computing (PASC) Conference TI - A Massively Parallel Algorithm for the Approximate Calculation of Inverse p-th Roots of Large Sparse Matrices ER - TY - CONF AU - Riebler, Heinrich AU - Vaz, Gavin Francis AU - Kenter, Tobias AU - Plessl, Christian ID - 1204 KW - htrop SN - 9781450349826 T2 - Proc. ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP) TI - Automated Code Acceleration Targeting Heterogeneous OpenCL Devices ER - TY - JOUR AB - Branch and bound (B&B) algorithms structure the search space as a tree and eliminate infeasible solutions early by pruning subtrees that cannot lead to a valid or optimal solution. Custom hardware designs significantly accelerate the execution of these algorithms. In this article, we demonstrate a high-performance B&B implementation on FPGAs. First, we identify general elements of B&B algorithms and describe their implementation as a finite state machine. Then, we introduce workers that autonomously cooperate using work stealing to allow parallel execution and full utilization of the target FPGA. Finally, we explore advantages of instance-specific designs that target a specific problem instance to improve performance. We evaluate our concepts by applying them to a branch and bound problem, the reconstruction of corrupted AES keys obtained from cold-boot attacks. The evaluation shows that our work stealing approach is scalable with the available resources and provides speedups proportional to the number of workers. Instance-specific designs allow us to achieve an overall speedup of 47 × compared to the fastest implementation of AES key reconstruction so far. Finally, we demonstrate how instance-specific designs can be generated just-in-time such that the provided speedups outweigh the additional time required for design synthesis. 
AU - Riebler, Heinrich AU - Lass, Michael AU - Mittendorf, Robert AU - Löcke, Thomas AU - Plessl, Christian ID - 18 IS - 3 JF - ACM Transactions on Reconfigurable Technology and Systems (TRETS) KW - coldboot SN - 1936-7406 TI - Efficient Branch and Bound on FPGAs Using Work Stealing and Instance-Specific Designs VL - 10 ER - TY - CONF AB - Compared to classical HDL designs, generating FPGA with high-level synthesis from an OpenCL specification promises easier exploration of different design alternatives and, through ready-to-use infrastructure and common abstractions for host and memory interfaces, easier portability between different FPGA families. In this work, we evaluate the extent of this promise. To this end, we present a parameterized FDTD implementation for photonic microcavity simulations. Our design can trade-off different forms of parallelism and works for two independent OpenCL-based FPGA design flows. Hence, we can target FPGAs from different vendors and different FPGA families. We describe how we used pre-processor macros to achieve this flexibility and to work around different shortcomings of the current tools. Choosing the right design configurations, we are able to present two extremely competitive solutions for very different FPGA targets, reaching up to 172 GFLOPS sustained performance. With the portability and flexibility demonstrated, code developers not only avoid vendor lock-in, but can even make best use of real trade-offs between different architectures. AU - Kenter, Tobias AU - Förstner, Jens AU - Plessl, Christian ID - 1592 KW - tet_topic_hpc T2 - Proc. Int. Conf. 
on Field Programmable Logic and Applications (FPL) TI - Flexible FPGA design for FDTD using OpenCL ER - TY - JOUR AU - Schumacher, Jörn AU - Plessl, Christian AU - Vandelli, Wainer ID - 1589 JF - Journal of Physics: Conference Series TI - High-Throughput and Low-Latency Network Communication with NetIO VL - 898 ER - TY - THES AB - Lightweight materials play an ever growing role in today's world. Saving on the mass of a machine will usually translate into a lower energy consumption. However, lightweight applications are prone to develop performance problems due to vibration induced by the operation of the machine. The Fraunhofer Institute for Manufacturing Technology and Advanced Materials in Dresden conducts research into the damping properties of composite materials. They are experimenting with hollow, particle filled spheres embedded in the lightweight material. Such a system is the technical motivation of this thesis. Ultimately, a numerical experiment to derive the coefficient of restitution is required. The simulation developed in this thesis is based on a discrete element method to track the individual particle and sphere trajectories. Based on a potential based approach for the particle interactions deployed in molecular dynamics, the behavior of the particles can be controlled effectively. The simulated volume is using reflecting boundaries and encloses the hollow sphere. In this work, a highly flexible memory structure was used with a linked cell approach to cope with the highly flexible mass of particles. This allows for a linear complexity of the method in regard to the particle number by reducing the computational overhead of the interaction computation. Multiple numerical experiments show the great effect the particles have on the damping behavior of the system. 
AU - Steinle, Tobias ID - 33 TI - Modeling and simulation of metallic, particle-damped spheres for lightweight materials ER - TY - CONF AU - Dellnitz, Michael AU - Eckstein, Julian AU - Flaßkamp, Kathrin AU - Friedel, Patrick AU - Horenkamp, Christian AU - Köhler, Ulrich AU - Ober-Blöbaum, Sina AU - Peitz, Sebastian AU - Tiemeyer, Sebastian ID - 34 SN - 2212-0173 T2 - Progress in Industrial Mathematics at ECMI TI - Multiobjective Optimal Control Methods for the Development of an Intelligent Cruise Control VL - 22 ER - TY - CONF AB - Version Control Systems (VCS) are a valuable tool for software development and document management. Both client/server and distributed (Peer-to-Peer) models exist, with the latter (e.g., Git and Mercurial) becoming increasingly popular. Their distributed nature introduces complications, especially concerning security: it is hard to control the dissemination of contents stored in distributed VCS as they rely on replication of complete repositories to any involved user. We overcome this issue by designing and implementing a concept for cryptography-enforced access control which is transparent to the user. Use of field-tested schemes (end-to-end encryption, digital signatures) allows for strong security, while adoption of convergent encryption and content-defined chunking retains storage efficiency. The concept is seamlessly integrated into Mercurial---respecting its distributed storage concept---to ensure practical usability and compatibility to existing deployments. AU - Lass, Michael AU - Leibenger, Dominik AU - Sorge, Christoph ID - 19 KW - access control KW - distributed version control systems KW - mercurial KW - peer-to-peer KW - convergent encryption KW - confidentiality KW - authenticity SN - 978-1-5090-2054-6 T2 - Proc. 
41st Conference on Local Computer Networks (LCN) TI - Confidentiality and Authenticity for Distributed Version Control Systems - A Mercurial Extension ER - TY - THES AU - Kenter, Tobias ID - 161 TI - Reconfigurable Accelerators in the World of General-Purpose Computing ER - TY - CHAP AB - In this chapter, we present an introduction to the ReconOS operating system for reconfigurable computing. ReconOS offers a unified multi-threaded programming model and operating system services for threads executing in software and threads mapped to reconfigurable hardware. By supporting standard POSIX operating system functions for both software and hardware threads, ReconOS particularly caters to developers with a software background, because developers can use well-known mechanisms such as semaphores, mutexes, condition variables, and message queues for developing hybrid applications with threads running on the CPU and FPGA concurrently. Through the semantic integration of hardware accelerators into a standard operating system environment, ReconOS allows for rapid design space exploration, supports a structured application development process and improves the portability of applications between different reconfigurable computing systems. AU - Agne, Andreas AU - Platzner, Marco AU - Plessl, Christian AU - Happe, Markus AU - Lübbers, Enno ED - Koch, Dirk ED - Hannig, Frank ED - Ziener, Daniel ID - 29 SN - 978-3-319-26406-6 T2 - FPGAs for Software Programmers TI - ReconOS ER - TY - CONF AU - Riebler, Heinrich AU - Vaz, Gavin Francis AU - Plessl, Christian AU - Trainiti, Ettore M. G. AU - Durelli, Gianluca C. AU - Bolchini, Cristiana ID - 31 T2 - Proc. HiPEAC Workshop on Reconfigurable Computing (WRC) TI - Using Just-in-Time Code Generation for Transparent Resource Management in Heterogeneous Systems ER - TY - CONF AU - Kenter, Tobias AU - Plessl, Christian ID - 24 T2 - Proc. 
Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC) TI - Microdisk Cavity FDTD Simulation on FPGA using OpenCL ER - TY - CONF AU - Lass, Michael AU - Kühne, Thomas AU - Plessl, Christian ID - 25 T2 - Workshop on Approximate Computing (AC) TI - Using Approximate Computing in Scientific Codes ER - TY - CONF AB - Hardware accelerators are becoming popular in academia and industry. To move one step further from the state-of-the-art multicore plus accelerator approaches, we present in this paper our innovative SAVEHSA architecture. It comprises a heterogeneous hardware platform with three different high-end accelerators attached over PCIe (GPGPU, FPGA and Intel MIC). Such systems can process parallel workloads very efficiently whilst being more energy efficient than regular CPU systems. To leverage the heterogeneity, the workload has to be distributed among the computing units in a way that each unit is well-suited for the assigned task and executable code must be available. To tackle this problem we present two software components; the first can perform resource allocation at runtime while respecting system and application goals (in terms of throughput, energy, latency, etc.) and the second is able to analyze an application and generate executable code for an accelerator at runtime. We demonstrate the first proof-of-concept implementation of our framework on the heterogeneous platform, discuss different runtime policies and measure the introduced overheads. AU - Riebler, Heinrich AU - Vaz, Gavin Francis AU - Plessl, Christian AU - Trainiti, Ettore M. G. AU - Durelli, Gianluca C. AU - Del Sozzo, Emanuele AU - Santambrogio, Marco D. 
AU - Bolchini, Cristiana ID - 138 T2 - Proceedings of International Forum on Research and Technologies for Society and Industry (RTSI) TI - Using Just-in-Time Code Generation for Transparent Resource Management in Heterogeneous Systems ER - TY - CHAP AB - Many modern compute nodes are heterogeneous multi-cores that integrate several CPU cores with fixed function or reconfigurable hardware cores. Such systems need to adapt task scheduling and mapping to optimise for performance and energy under varying workloads and, increasingly important, for thermal and fault management and are thus relevant targets for self-aware computing. In this chapter, we take up the generic reference architecture for designing self-aware and self-expressive computing systems and refine it for heterogeneous multi-cores. We present ReconOS, an architecture, programming model and execution environment for heterogeneous multi-cores, and show how the components of the reference architecture can be implemented on top of ReconOS. In particular, the unique feature of dynamic partial reconfiguration supports self-expression through starting and terminating reconfigurable hardware cores. We detail a case study that runs two applications on an architecture with one CPU and 12 reconfigurable hardware cores and present self-expression strategies for adapting under performance, temperature and even conflicting constraints. The case study demonstrates that the reference architecture as a model for self-aware computing is highly useful as it allows us to structure and simplify the design process, which will be essential for designing complex future compute nodes. Furthermore, ReconOS is used as a base technology for flexible protocol stacks in Chapter 10, an approach for self-aware computing at the networking level. 
AU - Agne, Andreas AU - Happe, Markus AU - Lösch, Achim AU - Plessl, Christian AU - Platzner, Marco ID - 156 T2 - Self-aware Computing Systems TI - Self-aware Compute Nodes ER - TY - JOUR AB - A broad spectrum of applications can be accelerated by offloading computation intensive parts to reconfigurable hardware. However, to achieve speedups, the number of loop iterations (trip count) needs to be sufficiently large to amortize offloading overheads. Trip counts are frequently not known at compile time, but only at runtime just before entering a loop. Therefore, we propose to generate code for both the CPU and the coprocessor, and defer the offloading decision to the application runtime. We demonstrate how a toolflow, based on the LLVM compiler framework, can automatically embed dynamic offloading decisions into the application code. We perform in-depth static and dynamic analysis of popular benchmarks, which confirm the general potential of such an approach. We also propose to optimize the offloading process by decoupling the runtime decision from the loop execution (decision slack). The feasibility of our approach is demonstrated by a toolflow that automatically identifies suitable data-parallel loops and generates code for the FPGA coprocessor of a Convey HC-1. We evaluate the integrated toolflow with representative loops executed for different input data sizes. AU - Vaz, Gavin Francis AU - Riebler, Heinrich AU - Kenter, Tobias AU - Plessl, Christian ID - 165 JF - Computers and Electrical Engineering SN - 0045-7906 TI - Potential and Methods for Embedding Dynamic Offloading Decisions into Application Code VL - 55 ER - TY - CONF AB - The use of heterogeneous computing resources, such as Graphic Processing Units or other specialized coprocessors, has become widespread in recent years because of their performance and energy efficiency advantages. Approaches for managing and scheduling tasks to heterogeneous resources are still subject to research. 
Although queuing systems have recently been extended to support accelerator resources, a general solution that manages heterogeneous resources at the operating system level to exploit a global view of the system state is still missing. In this paper we present a user space scheduler that enables task scheduling and migration on heterogeneous processing resources in Linux. Using run queues for available resources we perform scheduling decisions based on the system state and on task characterization from earlier measurements. With a programming pattern that supports the integration of checkpoints into applications, we preempt tasks and migrate them between three very different compute resources. Considering static and dynamic workload scenarios, we show that this approach can gain up to 17% performance, on average 7%, by effectively avoiding idle resources. We demonstrate that a work-conserving strategy without migration is no suitable alternative. AU - Lösch, Achim AU - Beisel, Tobias AU - Kenter, Tobias AU - Plessl, Christian AU - Platzner, Marco ID - 168 T2 - Proceedings of the 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE) TI - Performance-centric scheduling with task migration for a heterogeneous compute node in the data center ER - TY - CONF AU - Kenter, Tobias AU - Vaz, Gavin Francis AU - Riebler, Heinrich AU - Plessl, Christian ID - 171 T2 - Workshop on Reconfigurable Computing (WRC) TI - Opportunities for deferring application partitioning and accelerator synthesis to runtime (extended abstract) ER - TY - JOUR AB - Große zylindrische Stahlprüflinge werden mittels der Methode der finiten Differenzen im Zeitbereich (engl. finite differences in time domain, FDTD) simulativ untersucht. Dabei werden Pitch-Catch-Messanordnungen verwendet. Es werden zwei Bildgebungsansätze vorgestellt: ersterer basiert auf dem Imaging Principle nach Claerbout, letzterer basiert auf gradientenbasierter Optimierung eines Zielfunktionals. 
AU - Hegler, Sebastian AU - Statz, Christoph AU - Mütze, Marco AU - Mooshofer, Hubert AU - Goldammer, Matthias AU - Fendt, Karl AU - Schwarzer, Stefan AU - Feldhoff, Kim AU - Flehmig, Martin AU - Markwardt, Ulf AU - E. Nagel, Wolfgang AU - Schütte, Maria AU - Walther, Andrea AU - Meinel, Michael AU - Basermann, Achim AU - Plettemeier, Dirk ID - 1769 IS - 9 JF - tm - Technisches Messen TI - Simulative Ultraschall-Untersuchung von Pitch-Catch-Messanordnungen für große zylindrische Stahl-Prüflinge und gradientenbasierte Bildgebung VL - 82 ER - TY - JOUR AU - Torresen, Jim AU - Plessl, Christian AU - Yao, Xin ID - 1772 IS - 7 JF - IEEE Computer KW - self-awareness KW - self-expression TI - Self-Aware and Self-Expressive Systems – Guest Editor's Introduction VL - 48 ER - TY - JOUR AB - In this article an efficient numerical method to solve multiobjective optimization problems for fluid flow governed by the Navier Stokes equations is presented. In order to decrease the computational effort, a reduced order model is introduced using Proper Orthogonal Decomposition and a corresponding Galerkin Projection. A global, derivative free multiobjective optimization algorithm is applied to compute the Pareto set (i.e. the set of optimal compromises) for the concurrent objectives minimization of flow field fluctuations and control cost. The method is illustrated for a 2D flow around a cylinder at Re = 100. AU - Peitz, Sebastian AU - Dellnitz, Michael ID - 1774 IS - 1 JF - PAMM SN - 1617-7061 TI - Multiobjective Optimization of the Flow Around a Cylinder Using Model Order Reduction VL - 15 ER - TY - THES AB - The use of heterogeneous computing resources, such as graphics processing units or other specialized co-processors, has become widespread in recent years because of their performance and energy efficiency advantages. 
Operating system approaches that are limited to optimizing CPU usage are no longer sufficient for the efficient utilization of systems that comprise diverse resource types. Enabling task preemption on these architectures and migration of tasks between different resource types at run-time is not only key to improving the performance and energy consumption but also to enabling automatic scheduling methods for heterogeneous compute nodes. This thesis proposes novel techniques for run-time management of heterogeneous resources and enabling tasks to migrate between diverse hardware. It provides fundamental work towards future operating systems by discussing implications, limitations, and chances of the heterogeneity and introducing solutions for energy- and performance-efficient run-time systems. Scheduling methods to utilize heterogeneous systems by the use of a centralized scheduler are presented that show benefits over existing approaches in varying case studies. AU - Beisel, Tobias ID - 10624 SN - 978-3-8325-4155-2 TI - Management and Scheduling of Accelerators for Heterogeneous High-Performance Computing ER - TY - JOUR AB - FPGAs are known to permit huge gains in performance and efficiency for suitable applications but still require reduced design efforts and shorter development cycles for wider adoption. In this work, we compare the resulting performance of two design concepts that in different ways promise such increased productivity. As common starting point, we employ a kernel-centric design approach, where computational hotspots in an application are identified and individually accelerated on FPGA. By means of a complex stereo matching application, we evaluate two fundamentally different design philosophies and approaches for implementing the required kernels on FPGAs. 
In the first implementation approach, we designed individually specialized data flow kernels in a spatial programming language for a Maxeler FPGA platform; in the alternative design approach, we target a vector coprocessor with large vector lengths, which is implemented as a form of programmable overlay on the application FPGAs of a Convey HC-1. We assess both approaches in terms of overall system performance, raw kernel performance, and performance relative to invested resources. After compensating for the effects of the underlying hardware platforms, the specialized dataflow kernels on the Maxeler platform are around 3x faster than kernels executing on the Convey vector coprocessor. In our concrete scenario, due to trade-offs between reconfiguration overheads and exposed parallelism, the advantage of specialized dataflow kernels is reduced to around 2.5x. AU - Kenter, Tobias AU - Schmitz, Henning AU - Plessl, Christian ID - 296 JF - International Journal of Reconfigurable Computing (IJRC) TI - Exploring Tradeoffs between Specialized Kernels and a Reusable Overlay in a Stereo-Matching Case Study VL - 2015 ER - TY - CONF AB - This paper introduces Binary Acceleration At Runtime (BAAR), an easy-to-use on-the-fly binary acceleration mechanism which aims to tackle the problem of enabling existent software to automatically utilize accelerators at runtime. BAAR is based on the LLVM Compiler Infrastructure and has a client-server architecture. The client runs the program to be accelerated in an environment which allows program analysis and profiling. Program parts which are identified as suitable for the available accelerator are exported and sent to the server. The server optimizes these program parts for the accelerator and provides RPC execution for the client. The client transforms its program to utilize accelerated execution on the server for offloaded program parts. 
We evaluate our work with a proof-of-concept implementation of BAAR that uses an Intel Xeon Phi 5110P as the acceleration target and performs automatic offloading, parallelization and vectorization of suitable program parts. The practicality of BAAR for real-world examples is shown based on a study of stencil codes. Our results show a speedup of up to 4 without any developer-provided hints and 5.77 with hints over the same code compiled with the Intel Compiler at optimization level O2 and running on an Intel Xeon E5-2670 machine. Based on our insights gained during implementation and evaluation we outline future directions of research, e.g., offloading more fine-granular program parts than functions, a more sophisticated communication mechanism or introducing on-stack replacement. AU - Damschen, Marvin AU - Plessl, Christian ID - 303 T2 - Proceedings of the 5th International Workshop on Adaptive Self-tuning Computing Systems (ADAPT) TI - Easy-to-Use On-The-Fly Binary Program Acceleration on Many-Cores ER - TY - CONF AU - Schumacher, Jörn AU - T. Anderson, J. AU - Borga, A. AU - Boterenbrood, H. AU - Chen, H. AU - Chen, K. AU - Drake, G. AU - Francis, D. AU - Gorini, B. AU - Lanni, F. AU - Lehmann-Miotto, Giovanna AU - Levinson, L. AU - Narevicius, J. AU - Plessl, Christian AU - Roich, A. AU - Ryu, S. AU - P. Schreuder, F. AU - Vandelli, Wainer AU - Vermeulen, J. AU - Zhang, J. ID - 1773 T2 - Proc. Int. Conf. on Distributed Event-Based Systems (DEBS) TI - Improving Packet Processing Performance in the ATLAS FELIX Project – Analysis and Optimization of a Memory-Bounded Algorithm ER - TY - JOUR AU - Plessl, Christian AU - Platzner, Marco AU - Schreier, Peter J. 
ID - 1768 IS - 5 JF - Informatik Spektrum KW - approximate computing KW - survey TI - Aktuelles Schlagwort: Approximate Computing ER - TY - CONF AB - In this paper, we study how binary applications can be transparently accelerated with novel heterogeneous computing resources without requiring any manual porting or developer-provided hints. Our work is based on Binary Acceleration At Runtime (BAAR), our previously introduced binary acceleration mechanism that uses the LLVM Compiler Infrastructure. BAAR is designed as a client-server architecture. The client runs the program to be accelerated in an environment, which allows program analysis and profiling and identifies and extracts suitable program parts to be offloaded. The server compiles and optimizes these offloaded program parts for the accelerator and offers access to these functions to the client with a remote procedure call (RPC) interface. Our previous work proved the feasibility of our approach, but also showed that communication time and overheads limit the granularity of functions that can be meaningfully offloaded. In this work, we motivate the importance of a lightweight, high-performance communication between server and client and present a communication mechanism based on the Message Passing Interface (MPI). We evaluate our approach by using an Intel Xeon Phi 5110P as the acceleration target and show that the communication overhead can be reduced from 40% to 10%, thus enabling even small hotspots to benefit from offloading to an accelerator. 
AU - Damschen, Marvin AU - Riebler, Heinrich AU - Vaz, Gavin Francis AU - Plessl, Christian ID - 238 T2 - Proceedings of the 2015 Conference on Design, Automation and Test in Europe (DATE) TI - Transparent offloading of computational hotspots from binary code to Xeon Phi ER - TY - JOUR AB - The ATLAS experiment at CERN is planning full deployment of a new unified optical link technology for connecting detector front end electronics on the timescale of the LHC Run 4 (2025). It is estimated that roughly 8000 GBT (GigaBit Transceiver) links, with transfer rates up to 10.24 Gbps, will replace existing links used for readout, detector control and distribution of timing and trigger information. A new class of devices will be needed to interface many GBT links to the rest of the trigger, data-acquisition and detector control systems. In this paper FELIX (Front End LInk eXchange) is presented, a PC-based device to route data from and to multiple GBT links via a high-performance general purpose network capable of a total throughput up to O(20 Tbps). FELIX implies architectural changes to the ATLAS data acquisition system, such as the use of industry standard COTS components early in the DAQ chain. Additionally the design and implementation of a FELIX demonstration platform is presented and hardware and software aspects will be discussed. 
AU - Anderson, J AU - Borga, A AU - Boterenbrood, H AU - Chen, H AU - Chen, K AU - Drake, G AU - Francis, D AU - Gorini, B AU - Lanni, F AU - Lehmann Miotto, G AU - Levinson, L AU - Narevicius, J AU - Plessl, Christian AU - Roich, A AU - Ryu, S AU - Schreuder, F AU - Schumacher, Jörn AU - Vandelli, Wainer AU - Vermeulen, J AU - Zhang, J ID - 1775 JF - Journal of Physics: Conference Series TI - FELIX: a High-Throughput Network Approach for Interfacing to Front End Electronics for ATLAS Upgrades VL - 664 ER - TY - CONF AB - In light of an increasing awareness of environmental challenges, extensive research is underway to develop new light-weight materials. A problem arising with these materials is their increased response to vibration. This can be solved using a new composite material that contains embedded hollow spheres that are partially filled with particles. Progress on the adaptation of molecular dynamics towards a particle-based numerical simulation of this material is reported. This includes the treatment of specific boundary conditions and the adaption of the force computation. First results are presented that showcase the damping properties of such particle-filled spheres in a bouncing experiment. AU - Steinle, Tobias AU - Vrabec, Jadran AU - Walther, Andrea ED - Bock, Hans Georg ED - Hoang, Xuan Phu ED - Rannacher, Rolf ED - Schlöder, Johannes P. ID - 1781 SN - 978-3-319-09063-4 T2 - Proc. Modeling, Simulation and Optimization of Complex Processes (HPSC) TI - Numerical Simulation of the Damping Behavior of Particle-Filled Hollow Spheres ER - TY - CONF AU - Graf, Tobias AU - Schaefers, Lars AU - Platzner, Marco ID - 1782 IS - 8427 T2 - Proc. Conf. on Computers and Games (CG) TI - On Semeai Detection in Monte-Carlo Go ER - TY - CHAP AB - Im Bereich der Computersysteme ist die Festlegung der Grenze zwischen Hardware und Software eine zentrale Problemstellung. 
Diese Grenze hat in den letzten Jahrzehnten nicht nur die Entwicklung von Computersystemen bestimmt, sondern auch die Strukturierung der Ausbildung in den Computerwissenschaften beeinflusst und sogar zur Entstehung von neuen Forschungsrichtungen geführt. In diesem Beitrag beschäftigen wir uns mit Verschiebungen an der Grenze zwischen Hardware und Software und diskutieren insgesamt drei qualitativ unterschiedliche Formen solcher Verschiebungen. Wir beginnen mit der Entwicklung von Computersystemen im letzten Jahrhundert und der Entstehung dieser Grenze, die Hardware und Software erst als eigenständige Produkte differenziert. Dann widmen wir uns der Frage, welche Funktionen in einem Computersystem besser in Hardware und welche besser in Software realisiert werden sollten, eine Fragestellung die zu Beginn der 90er-Jahre zur Bildung einer eigenen Forschungsrichtung, dem sogenannten Hardware/Software Co-design, geführt hat. Im Hardware/Software Co-design findet eine Verschiebung von Funktionen an der Grenze zwischen Hardware und Software während der Entwicklung eines Produktes statt, um Produkteigenschaften zu optimieren. Im fertig entwickelten und eingesetzten Produkt hingegen können wir dann eine feste Grenze zwischen Hardware und Software beobachten. Im dritten Teil dieses Beitrags stellen wir mit selbst-adaptiven Systemen eine hochaktuelle Forschungsrichtung vor. In unserem Kontext bedeutet Selbstadaption, dass ein System Verschiebungen von Funktionen an der Grenze zwischen Hardware und Software autonom während der Betriebszeit vornimmt. Solche Systeme beruhen auf rekonfigurierbarer Hardware, einer relativ neuen Technologie mit der die Hardware eines Computers während der Laufzeit verändert werden kann. Diese Technologie führt zu einer durchlässigen Grenze zwischen Hardware und Software bzw. löst sie die herkömmliche Vorstellung einer festen Hardware und einer flexiblen Software damit auf. 
AU - Platzner, Marco AU - Plessl, Christian ED - Künsemöller, Jörn ED - Eke, Norbert Otto ED - Foit, Lioba ED - Kaerlein, Timo ID - 335 SN - 978-3-7705-5730-1 T2 - Logiken strukturbildender Prozesse: Automatismen TI - Verschiebungen an der Grenze zwischen Hardware und Software ER - TY - CONF AB - In order to leverage the use of reconfigurable architectures in general-purpose computing, quick and automated methods to find suitable accelerator designs are required. We tackle this challenge in both regards. In order to avoid long synthesis times, we target a vector coprocessor, implemented on the FPGAs of a Convey HC-1. Previous studies showed that existing tools were not able to accelerate a real-world application with low effort. We present a toolflow to automatically identify suitable loops for vectorization, generate a corresponding hardware/software bipartition, and generate coprocessor code. Where applicable, we leverage outer-loop vectorization. We evaluate our tools with a set of characteristic loops, systematically analyzing different dependency and data layout properties. AU - Kenter, Tobias AU - Vaz, Gavin Francis AU - Plessl, Christian ID - 388 T2 - Proceedings of the International Symposium on Reconfigurable Computing: Architectures, Tools, and Applications (ARC) TI - Partitioning and Vectorizing Binary Applications for a Reconfigurable Vector Computer VL - 8405 ER - TY - JOUR AB - Due to the continuously shrinking device structures and increasing densities of FPGAs, thermal aspects have become the new focus for many research projects over the last years. Most researchers rely on temperature simulations to evaluate their novel thermal management techniques. However, these temperature simulations require a high computational effort if a detailed thermal model is used and their accuracies are often unclear. In contrast to simulations, the use of synthetic heat sources allows for experimental evaluation of temperature management methods. 
In this paper we investigate the creation of significant rises in temperature on modern FPGAs to enable future evaluation of thermal management techniques based on experiments. To that end, we have developed seven different heat-generating cores that use different subsets of FPGA resources. Our experimental results show that, according to external temperature probes connected to the FPGA’s heat sink, we can increase the temperature by an average of 81 °C. This corresponds to an average increase of 156.3 °C as measured by the built-in thermal diodes of our Virtex-5 FPGAs in less than 30 min by only utilizing about 21 percent of the slices. AU - Agne, Andreas AU - Hangmann, Hendrik AU - Happe, Markus AU - Platzner, Marco AU - Plessl, Christian ID - 363 IS - 8, Part B JF - Microprocessors and Microsystems TI - Seven Recipes for Setting Your FPGA on Fire – A Cookbook on Heat Generators VL - 38 ER - TY - CONF AB - In this paper, we study how AES key schedules can be reconstructed from decayed memory. This operation is a crucial and time consuming operation when trying to break encryption systems with cold-boot attacks. In software, the reconstruction of the AES master key can be performed using a recursive, branch-and-bound tree-search algorithm that exploits redundancies in the key schedule for constraining the search space. In this work, we investigate how this branch-and-bound algorithm can be accelerated with FPGAs. We translated the recursive search procedure to a state machine with an explicit stack for each recursion level and create optimized datapaths to accelerate in particular the processing of the most frequently accessed tree levels. We support two different decay models, of which especially the more realistic non-idealized asymmetric decay model causes very high runtimes in software. 
Our implementation on a Maxeler dataflow computing system outperforms a software implementation for this model by up to 27x, which makes cold-boot attacks against AES practical even for high error rates. AU - Riebler, Heinrich AU - Kenter, Tobias AU - Plessl, Christian AU - Sorge, Christoph ID - 377 KW - coldboot T2 - Proceedings of Field-Programmable Custom Computing Machines (FCCM) TI - Reconstructing AES Key Schedules from Decayed Memory with FPGAs ER - TY - JOUR AB - Self-aware computing is a paradigm for structuring and simplifying the design and operation of computing systems that face unprecedented levels of system dynamics and thus require novel forms of adaptivity. The generality of the paradigm makes it applicable to many types of computing systems and, previously, researchers started to introduce concepts of self-awareness to multicore architectures. In our work we build on a recent reference architectural framework as a model for self-aware computing and instantiate it for an FPGA-based heterogeneous multicore running the ReconOS reconfigurable architecture and operating system. After presenting the model for self-aware computing and ReconOS, we demonstrate with a case study how a multicore application built on the principle of self-awareness, autonomously adapts to changes in the workload and system state. Our work shows that the reference architectural framework as a model for self-aware computing can be practically applied and allows us to structure and simplify the design process, which is essential for designing complex future computing systems. AU - Agne, Andreas AU - Happe, Markus AU - Lösch, Achim AU - Plessl, Christian AU - Platzner, Marco ID - 365 IS - 2 JF - ACM Transactions on Reconfigurable Technology and Systems (TRETS) TI - Self-awareness as a Model for Designing and Operating Heterogeneous Multicores VL - 7 ER -