TY - JOUR AB - The rise of exascale supercomputers has fueled competition among GPU vendors, driving lattice QCD developers to write code that supports multiple APIs. Moreover, new developments in algorithms and physics research require frequent updates to existing software. These challenges have to be balanced against constantly changing personnel. At the same time, there is a wide range of applications for HISQ fermions in QCD studies. This situation encourages the development of software featuring a HISQ action that is flexible, high-performing, open source, easy to use, and easy to adapt. In this technical paper, we explain the design strategy, provide implementation details, list available algorithms and modules, and show key performance indicators for SIMULATeQCD, a simple multi-GPU lattice code for large-scale QCD calculations, mainly developed and used by the HotQCD collaboration. The code is publicly available on GitHub. AU - Mazur, Lukas AU - Bollweg, Dennis AU - Clarke, David A. AU - Altenkort, Luis AU - Kaczmarek, Olaf AU - Larsen, Rasmus AU - Shu, Hai-Tao AU - Goswami, Jishnu AU - Scior, Philipp AU - Sandmeyer, Hauke AU - Neumann, Marius AU - Dick, Henrik AU - Ali, Sajid AU - Kim, Jangho AU - Schmidt, Christian AU - Petreczky, Peter AU - Mukherjee, Swagato ID - 46120 JF - Computer Physics Communications TI - SIMULATeQCD: A simple multi-GPU lattice code for QCD calculations ER - TY - JOUR AU - Altenkort, Luis AU - Eller, Alexander M. AU - Francis, Anthony AU - Kaczmarek, Olaf AU - Mazur, Lukas AU - Moore, Guy D. AU - Shu, Hai-Tao ID - 46119 IS - 1 JF - Physical Review D SN - 2470-0010 TI - Viscosity of pure-glue QCD from the lattice VL - 108 ER - TY - JOUR AB - While FPGA accelerator boards and their respective high-level design tools are maturing, there is still a lack of multi-FPGA applications, libraries, and not least, benchmarks and reference implementations towards sustained HPC usage of these devices. 
As in the early days of GPUs in HPC, for workloads that can reasonably be decoupled into loosely coupled working sets, multi-accelerator support can be achieved by using standard communication interfaces like MPI on the host side. However, for performance and productivity, some applications can profit from a tighter coupling of the accelerators. FPGAs offer unique opportunities here when extending the dataflow characteristics to their communication interfaces. In this work, we extend the HPCC FPGA benchmark suite by multi-FPGA support and three missing benchmarks that particularly characterize or stress inter-device communication: b_eff, PTRANS, and LINPACK. With all benchmarks implemented for current boards with Intel and Xilinx FPGAs, we established a baseline for multi-FPGA performance. Additionally, for the communication-centric benchmarks, we explored the potential of direct FPGA-to-FPGA communication with a circuit-switched inter-FPGA network that is currently only available for one of the boards. The evaluation with parallel execution on up to 26 FPGA boards makes use of one of the largest academic FPGA installations. 
AU - Meyer, Marius AU - Kenter, Tobias AU - Plessl, Christian ID - 38041 JF - ACM Transactions on Reconfigurable Technology and Systems KW - General Computer Science SN - 1936-7406 TI - Multi-FPGA Designs and Scaling of HPC Challenge Benchmarks via MPI and Circuit-Switched Inter-FPGA Networks ER - TY - CHAP AU - Hansmeier, Tim AU - Kenter, Tobias AU - Meyer, Marius AU - Riebler, Heinrich AU - Platzner, Marco AU - Plessl, Christian ED - Haake, Claus-Jochen ED - Meyer auf der Heide, Friedhelm ED - Platzner, Marco ED - Wachsmuth, Henning ED - Wehrheim, Heike ID - 45893 T2 - On-The-Fly Computing -- Individualized IT-services in dynamic markets TI - Compute Centers I: Heterogeneous Execution Environments VL - 412 ER - TY - CONF AU - Opdenhövel, Jan-Oliver AU - Plessl, Christian AU - Kenter, Tobias ID - 46190 T2 - Proceedings of the 13th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies TI - Mutation Tree Reconstruction of Tumor Cells on FPGAs Using a Bit-Level Matrix Representation ER - TY - CONF AU - Faj, Jennifer AU - Kenter, Tobias AU - Faghih-Naini, Sara AU - Plessl, Christian AU - Aizinger, Vadym ID - 46188 T2 - Proceedings of the Platform for Advanced Scientific Computing Conference TI - Scalable Multi-FPGA Design of a Discontinuous Galerkin Shallow-Water Model on Unstructured Meshes ER - TY - CONF AU - Prouveur, Charles AU - Haefele, Matthieu AU - Kenter, Tobias AU - Voss, Nils ID - 46189 T2 - Proceedings of the Platform for Advanced Scientific Computing Conference TI - FPGA Acceleration for HPC Supercapacitor Simulations ER - TY - CONF AB - The computation of electron repulsion integrals (ERIs) over Gaussian-type orbitals (GTOs) is a challenging problem in quantum-mechanics-based atomistic simulations. In practical simulations, several trillions of ERIs may have to be computed for every time step. In this work, we investigate FPGAs as accelerators for the ERI computation. 
We use template parameters, here within the Intel oneAPI tool flow, to create customized designs for 256 different ERI quartet classes, based on their orbitals. To maximize data reuse, all intermediates are buffered in FPGA on-chip memory with customized layout. The pre-calculation of intermediates also helps to overcome data dependencies caused by multi-dimensional recurrence relations. The involved loop structures are partially or even fully unrolled for high throughput of FPGA kernels. Furthermore, a lossy compression algorithm utilizing arbitrary bitwidth integers is integrated in the FPGA kernels. To the best of our knowledge, this is the first work on ERI computation on FPGAs that supports more than just the single most basic quartet class. Also, the integration of ERI computation and compression is a novelty that is not even covered by CPU or GPU libraries so far. Our evaluation shows that using 16-bit integers for the ERI compression, the fastest FPGA kernels exceed the performance of 10 GERIS ($10 \times 10^9$ ERIs per second) on one Intel Stratix 10 GX 2800 FPGA, with maximum absolute errors around $10^{-7}$ - $10^{-5}$ Hartree. The measured throughput can be accurately explained by a performance model. The FPGA kernels deployed on 2 FPGAs outperform similar computations using the widely used libint reference on a two-socket server with 40 Xeon Gold 6148 CPU cores of the same process technology by factors up to 6.0x and on a new two-socket server with 128 EPYC 7713 CPU cores by up to 1.9x. 
AU - Wu, Xin AU - Kenter, Tobias AU - Schade, Robert AU - Kühne, Thomas AU - Plessl, Christian ID - 43228 T2 - 2023 IEEE 31st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) TI - Computing and Compressing Electron Repulsion Integrals on FPGAs ER - TY - JOUR AB - The non-orthogonal local submatrix method applied to electronic structure–based molecular dynamics simulations is shown to exceed 1.1 EFLOP/s in FP16/FP32-mixed floating-point arithmetic when using 4400 NVIDIA A100 GPUs of the Perlmutter system. This is enabled by a modification of the original method that pushes the sustained fraction of the peak performance to about 80%. Example calculations are performed for SARS-CoV-2 spike proteins with up to 83 million atoms. AU - Schade, Robert AU - Kenter, Tobias AU - Elgabarty, Hossam AU - Lass, Michael AU - Kühne, Thomas AU - Plessl, Christian ID - 45361 JF - The International Journal of High Performance Computing Applications KW - Hardware and Architecture KW - Theoretical Computer Science KW - Software SN - 1094-3420 TI - Breaking the exascale barrier for the electronic structure problem in ab-initio molecular dynamics ER - TY - GEN AB - Viscous hydrodynamics serves as a successful mesoscopic description of the Quark-Gluon Plasma produced in relativistic heavy-ion collisions. In order to investigate how such an effective description emerges from the underlying microscopic dynamics, we calculate the hydrodynamic and non-hydrodynamic modes of linear response in the sound channel from a first-principle calculation in kinetic theory. We do this with a new approach wherein we discretize the collision kernel to directly calculate eigenvalues and eigenmodes of the evolution operator. This allows us to study the Green's functions at any point in the complex frequency space. 
Our study focuses on scalar theory with quartic interaction and we find that the analytic structure of Green's functions in the complex plane is far more complicated than just poles or cuts which is a first step towards an equivalent study in QCD kinetic theory. AU - Ochsenfeld, Stephan AU - Schlichting, Sören ID - 50172 T2 - arXiv:2308.04491 TI - Hydrodynamic and Non-hydrodynamic Excitations in Kinetic Theory -- A Numerical Analysis in Scalar Field Theory ER - TY - GEN AB - Memory Gym presents a suite of 2D partially observable environments, namely Mortar Mayhem, Mystery Path, and Searing Spotlights, designed to benchmark memory capabilities in decision-making agents. These environments, originally with finite tasks, are expanded into innovative, endless formats, mirroring the escalating challenges of cumulative memory games such as ``I packed my bag''. This progression in task design shifts the focus from merely assessing sample efficiency to also probing the levels of memory effectiveness in dynamic, prolonged scenarios. To address the gap in available memory-based Deep Reinforcement Learning baselines, we introduce an implementation that integrates Transformer-XL (TrXL) with Proximal Policy Optimization. This approach utilizes TrXL as a form of episodic memory, employing a sliding window technique. Our comparative study between the Gated Recurrent Unit (GRU) and TrXL reveals varied performances across different settings. TrXL, on the finite environments, demonstrates superior sample efficiency in Mystery Path and outperforms in Mortar Mayhem. However, GRU is more efficient on Searing Spotlights. Most notably, in all endless tasks, GRU makes a remarkable resurgence, consistently outperforming TrXL by significant margins. 
Website and Source Code: https://github.com/MarcoMeter/endless-memory-gym/ AU - Pleines, Marco AU - Pallasch, Matthias AU - Zimmer, Frank AU - Preuss, Mike ID - 50221 T2 - arXiv:2309.17207 TI - Memory Gym: Towards Endless Tasks to Benchmark Memory Capabilities of Agents ER - TY - CHAP AU - Alt, Christoph AU - Kenter, Tobias AU - Faghih-Naini, Sara AU - Faj, Jennifer AU - Opdenhövel, Jan-Oliver AU - Plessl, Christian AU - Aizinger, Vadym AU - Hönig, Jan AU - Köstler, Harald ID - 46191 SN - 0302-9743 T2 - Lecture Notes in Computer Science TI - Shallow Water DG Simulations on FPGAs: Design and Comparison of a Novel Code Generation Pipeline ER - TY - GEN AB - This preprint makes the claim of having computed the $9^{th}$ Dedekind Number. This was done by building an efficient FPGA Accelerator for the core operation of the process, and parallelizing it on the Noctua 2 Supercluster at Paderborn University. The resulting value is 286386577668298411128469151667598498812366. This value can be verified in two steps. We have made the data file containing the 490M results available, each of which can be verified separately on CPU, and the whole file sums to our proposed value. AU - Van Hirtum, Lennart AU - De Causmaecker, Patrick AU - Goemaere, Jens AU - Kenter, Tobias AU - Riebler, Heinrich AU - Lass, Michael AU - Plessl, Christian ID - 43439 T2 - arXiv:2304.03039 TI - A computation of D(9) using FPGA Supercomputing ER - TY - GEN AB - We investigate the early time development of the anisotropic transverse flow and spatial eccentricities of a fireball with various particle-based transport approaches using a fixed initial condition. In numerical simulations ranging from the quasi-collisionless case to the hydrodynamic regime, we find that the onset of $v_n$ and of related measures of anisotropic flow can be described with a simple power-law ansatz, with an exponent that depends on the amount of rescatterings in the system. 
In the few-rescatterings regime we perform semi-analytical calculations, based on a systematic expansion in powers of time and the cross section, which can reproduce the numerical findings. AU - Borghini, Nicolas AU - Borrell, Marc AU - Roch, Hendrik ID - 32177 T2 - arXiv:2201.13294 TI - Early time behavior of spatial and momentum anisotropies in kinetic theory across different Knudsen numbers ER - TY - GEN AB - We test the ability of the "escape mechanism" to create the anisotropic flow observed in high-energy nuclear collisions. We compare the flow harmonics $v_n$ in the few-rescatterings regime from two types of transport simulations, with $2\to 2$ and $2\to 0$ collision kernels respectively, and from analytical calculations neglecting the gain term of the Boltzmann equation. We find that the even flow harmonics are similar in the three approaches, while the odd harmonics differ significantly. AU - Bachmann, Benedikt AU - Borghini, Nicolas AU - Feld, Nina AU - Roch, Hendrik ID - 32178 T2 - arXiv:2203.13306 TI - Even anisotropic-flow harmonics are from Venus, odd ones are from Mars ER - TY - JOUR AU - Hou, W AU - Yao, Y AU - Li, Y AU - Peng, B AU - Shi, K AU - Zhou, Z AU - Pan, J AU - Liu, M AU - Hu, J ID - 32183 IS - 1 JF - Frontiers of materials science SN - 2095-025x TI - Linearly shifting ferromagnetic resonance response of La0.7Sr0.3MnO3 thin film for body temperature sensors VL - 16 ER - TY - JOUR AU - Wojciechowski, M ID - 32234 JF - Data Brief SN - 2352-3409 TI - Dataset for random uniform distributions of 2D circles and 3D spheres. VL - 43 ER - TY - THES AU - Lass, Michael ID - 32414 TI - Bringing Massive Parallelism and Hardware Acceleration to Linear Scaling Density Functional Theory Through Targeted Approximations ER - TY - GEN AB - The Julia programming language has evolved into a modern alternative to fill existing gaps in scientific computing and data science applications. 
Julia leverages a unified and coordinated single-language and ecosystem paradigm and has a proven track record of achieving high performance without sacrificing user productivity. These aspects make Julia a viable alternative to high-performance computing's (HPC's) existing and increasingly costly many-body workflow composition strategy in which traditional HPC languages (e.g., Fortran, C, C++) are used for simulations, and higher-level languages (e.g., Python, R, MATLAB) are used for data analysis and interactive computing. Julia's rapid growth in language capabilities, package ecosystem, and community make it a promising universal language for HPC. This paper presents the views of a multidisciplinary group of researchers from academia, government, and industry that advocate for an HPC software development paradigm that emphasizes developer productivity, workflow portability, and low barriers for entry. We believe that the Julia programming language, its ecosystem, and its community provide modern and powerful capabilities that enable this group's objectives. Crucially, we believe that Julia can provide a feasible and less costly approach to programming scientific applications and workflows that target HPC facilities. In this work, we examine the current practice and role of Julia as a common, end-to-end programming model to address major challenges in scientific reproducibility, data-driven AI/machine learning, co-design and workflows, scalability and performance portability in heterogeneous computing, network communication, data management, and community education. As a result, the diversification of current investments to fulfill the needs of the upcoming decade is crucial as more supercomputing centers prepare for the exascale era. 
AU - Churavy, Valentin AU - Godoy, William F AU - Bauer, Carsten AU - Ranocha, Hendrik AU - Schlottke-Lakemper, Michael AU - Räss, Ludovic AU - Blaschke, Johannes AU - Giordano, Mosè AU - Schnetter, Erik AU - Omlin, Samuel AU - Vetter, Jeffrey S AU - Edelman, Alan ID - 36879 TI - Bridging HPC Communities through the Julia Programming Language ER - TY - JOUR AB - Tailored nanoscale quantum light sources, matching the specific needs of use cases, are crucial building blocks for photonic quantum technologies. Several different approaches to realize solid-state quantum emitters with high performance have been pursued and different concepts for energy tuning have been established. However, the properties of the emitted photons are always defined by the individual quantum emitter and can therefore not be controlled with full flexibility. Here we introduce an all-optical nonlinear method to tailor and control the single photon emission. We demonstrate a laser-controlled down-conversion process from an excited state of a semiconductor quantum three-level system. Based on this concept, we realize energy tuning and polarization control of the single photon emission with a control-laser field. Our results mark an important step towards tailored single photon emission from a photonic quantum system based on quantum optical principles. AU - Jonas, B. AU - Heinze, Dirk Florian AU - Schöll, E. AU - Kallert, P. AU - Langer, T. AU - Krehs, S. AU - Widhalm, A. AU - Jöns, Klaus AU - Reuter, Dirk AU - Schumacher, Stefan AU - Zrenner, Artur ID - 40523 IS - 1 JF - Nature Communications KW - General Physics and Astronomy KW - General Biochemistry KW - Genetics and Molecular Biology KW - General Chemistry KW - Multidisciplinary SN - 2041-1723 TI - Nonlinear down-conversion in a single quantum dot VL - 13 ER -