TY  - GEN
AB  - Most FPGA boards in the HPC domain are well-suited for parallel scaling because of the direct integration of versatile and high-throughput network ports. However, the utilization of their network capabilities is often challenging and error-prone because the whole network stack and communication patterns have to be implemented and managed on the FPGAs. Also, this approach conceptually involves a trade-off between the performance potential of improved communication and the impact of resource consumption for communication infrastructure, since the utilized resources on the FPGAs could otherwise be used for computations. In this work, we investigate this trade-off, firstly, by using synthetic benchmarks to evaluate the different configuration options of the communication framework ACCL and their impact on communication latency and throughput. Finally, we use our findings to implement a shallow water simulation whose scalability heavily depends on low-latency communication. With a suitable configuration of ACCL, good scaling behavior can be shown to all 48 FPGAs installed in the system. Overall, the results show that the availability of inter-FPGA communication frameworks as well as the configurability of framework and network stack are crucial to achieve the best application performance with low-latency communication.
AU  - Meyer, Marius
AU  - Kenter, Tobias
AU  - Petrica, Lucian
AU  - O'Brien, Kenneth
AU  - Blott, Michaela
AU  - Plessl, Christian
ID  - 53364
T2  - arXiv:2403.18374
TI  - Optimizing Communication for Latency Sensitive HPC Applications on up to 48 FPGAs Using ACCL
ER  - 
TY  - JOUR
AB  - We present a novel approach to characterize and quantify microheterogeneity and microphase separation in computer simulations of complex liquid mixtures. Our post-processing method is based on local density fluctuations of the different constituents in sampling spheres of varying size. It can be easily applied to both molecular dynamics (MD) and Monte Carlo (MC) simulations, including periodic boundary conditions. Multidimensional correlation of the density distributions yields a clear picture of the domain formation due to the subtle balance of different interactions. We apply our approach to the example of force field molecular dynamics simulations of imidazolium-based ionic liquids with different side chain lengths at different temperatures, namely 1-ethyl-3-methylimidazolium chloride, 1-hexyl-3-methylimidazolium chloride, and 1-decyl-3-methylimidazolium chloride, which are known to form distinct liquid domains. We put the results into the context of existing microheterogeneity analyses and demonstrate the advantages and sensitivity of our novel method. Furthermore, we show how to estimate the configuration entropy from our analysis, and we investigate voids in the system. The analysis has been implemented into our program package TRAVIS and is thus available as free software.
AU  - Lass, Michael
AU  - Kenter, Tobias
AU  - Plessl, Christian
AU  - Brehm, Martin
ID  - 53474
IS  - 4
JF  - Entropy
SN  - 1099-4300
TI  - Characterizing Microheterogeneity in Liquid Mixtures via Local Density Fluctuations
VL  - 26
ER  - 
TY  - CONF
AU  - Olgu, Kaan
AU  - Kenter, Tobias
AU  - Nunez-Yanez, Jose
AU  - McIntosh-Smith, Simon
ID  - 53503
T2  - Proceedings of the 12th International Workshop on OpenCL and SYCL
TI  - Optimisation and Evaluation of Breadth First Search with oneAPI/SYCL on Intel FPGAs: from Describing Algorithms to Describing Architectures
ER  - 
TY  - GEN
AB  - At large scales, quantum systems may become advantageous over their classical
counterparts at performing certain tasks. Developing tools to analyse these
systems at the relevant scales, in a manner consistent with quantum mechanics,
is therefore critical to benchmarking performance and characterising their
operation. While classical computational approaches cannot perform
like-for-like computations of quantum systems beyond a certain scale, classical
high-performance computing (HPC) may nevertheless be useful for precisely these
characterisation and certification tasks. By developing open-source customised
algorithms using high-performance computing, we perform quantum tomography on a
megascale quantum photonic detector covering a Hilbert space of $10^6$. This
requires finding $10^8$ elements of the matrix corresponding to the positive
operator valued measure (POVM), the quantum description of the detector, and is
achieved in minutes of computation time. Moreover, by exploiting the structure
of the problem, we achieve highly efficient parallel scaling, paving the way
for quantum objects up to a system size of $10^{12}$ elements to be
reconstructed using this method. In general, this shows that a consistent
quantum mechanical description of quantum phenomena is applicable at everyday
scales. More concretely, this enables the reconstruction of large-scale quantum
sources, processes and detectors used in computation and sampling tasks, which
may be necessary to prove their nonclassical character or quantum computational
advantage.
AU  - Schapeler, Timon
AU  - Schade, Robert
AU  - Lass, Michael
AU  - Plessl, Christian
AU  - Bartley, Tim
ID  - 53202
T2  - arXiv:2404.02844
TI  - Scalable quantum detector tomography by high-performance computing
ER  - 
TY  - JOUR
AB  - The rise of exascale supercomputers has fueled competition among GPU vendors, driving lattice QCD developers to write code that supports multiple APIs. Moreover, new developments in algorithms and physics research require frequent updates to existing software. These challenges have to be balanced against constantly changing personnel. At the same time, there is a wide range of applications for HISQ fermions in QCD studies. This situation encourages the development of software featuring a HISQ action that is flexible, high-performing, open source, easy to use, and easy to adapt. In this technical paper, we explain the design strategy, provide implementation details, list available algorithms and modules, and show key performance indicators for SIMULATeQCD, a simple multi-GPU lattice code for large-scale QCD calculations, mainly developed and used by the HotQCD collaboration. The code is publicly available on GitHub.
AU  - Mazur, Lukas
AU  - Bollweg, Dennis
AU  - Clarke, David A.
AU  - Altenkort, Luis
AU  - Kaczmarek, Olaf
AU  - Larsen, Rasmus
AU  - Shu, Hai-Tao
AU  - Goswami, Jishnu
AU  - Scior, Philipp
AU  - Sandmeyer, Hauke
AU  - Neumann, Marius
AU  - Dick, Henrik
AU  - Ali, Sajid
AU  - Kim, Jangho
AU  - Schmidt, Christian
AU  - Petreczky, Peter
AU  - Mukherjee, Swagato
ID  - 46120
JF  - Computer Physics Communications
TI  - SIMULATeQCD: A simple multi-GPU lattice code for QCD calculations
ER  - 
TY  - JOUR
AU  - Altenkort, Luis
AU  - Eller, Alexander M.
AU  - Francis, Anthony
AU  - Kaczmarek, Olaf
AU  - Mazur, Lukas
AU  - Moore, Guy D.
AU  - Shu, Hai-Tao
ID  - 46119
IS  - 1
JF  - Physical Review D
SN  - 2470-0010
TI  - Viscosity of pure-glue QCD from the lattice
VL  - 108
ER  - 
TY  - JOUR
AB  - While FPGA accelerator boards and their respective high-level design tools are maturing, there is still a lack of multi-FPGA applications, libraries, and not least, benchmarks and reference implementations towards sustained HPC usage of these devices. As in the early days of GPUs in HPC, for workloads that can reasonably be decoupled into loosely coupled working sets, multi-accelerator support can be achieved by using standard communication interfaces like MPI on the host side. However, for performance and productivity, some applications can profit from a tighter coupling of the accelerators. FPGAs offer unique opportunities here when extending the dataflow characteristics to their communication interfaces.
In this work, we extend the HPCC FPGA benchmark suite by multi-FPGA support and three missing benchmarks that particularly characterize or stress inter-device communication: b_eff, PTRANS, and LINPACK. With all benchmarks implemented for current boards with Intel and Xilinx FPGAs, we established a baseline for multi-FPGA performance. Additionally, for the communication-centric benchmarks, we explored the potential of direct FPGA-to-FPGA communication with a circuit-switched inter-FPGA network that is currently only available for one of the boards. The evaluation with parallel execution on up to 26 FPGA boards makes use of one of the largest academic FPGA installations.
AU  - Meyer, Marius
AU  - Kenter, Tobias
AU  - Plessl, Christian
ID  - 38041
JF  - ACM Transactions on Reconfigurable Technology and Systems
KW  - General Computer Science
SN  - 1936-7406
TI  - Multi-FPGA Designs and Scaling of HPC Challenge Benchmarks via MPI and Circuit-Switched Inter-FPGA Networks
ER  - 
TY  - CHAP
AU  - Hansmeier, Tim
AU  - Kenter, Tobias
AU  - Meyer, Marius
AU  - Riebler, Heinrich
AU  - Platzner, Marco
AU  - Plessl, Christian
ED  - Haake, Claus-Jochen
ED  - Meyer auf der Heide, Friedhelm
ED  - Platzner, Marco
ED  - Wachsmuth, Henning
ED  - Wehrheim, Heike
ID  - 45893
T2  - On-The-Fly Computing -- Individualized IT-services in dynamic markets
TI  - Compute Centers I: Heterogeneous Execution Environments
VL  - 412
ER  - 
TY  - CONF
AU  - Opdenhövel, Jan-Oliver
AU  - Plessl, Christian
AU  - Kenter, Tobias
ID  - 46190
T2  - Proceedings of the 13th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies
TI  - Mutation Tree Reconstruction of Tumor Cells on FPGAs Using a Bit-Level Matrix Representation
ER  - 
TY  - CONF
AB  - The computation of electron repulsion integrals (ERIs) over Gaussian-type orbitals (GTOs) is a challenging problem in quantum-mechanics-based atomistic simulations. In practical simulations, several trillions of ERIs may have to be
computed for every time step.
In this work, we investigate FPGAs as accelerators for the ERI computation. We use template parameters, here within the Intel oneAPI tool flow, to create customized designs for 256 different ERI quartet classes, based on their orbitals. To maximize data reuse, all intermediates are buffered in FPGA on-chip memory with customized layout. The pre-calculation of intermediates also helps to overcome data dependencies caused by multi-dimensional recurrence
relations. The involved loop structures are partially or even fully unrolled for high throughput of FPGA kernels. Furthermore, a lossy compression algorithm utilizing arbitrary bitwidth integers is integrated in the FPGA kernels. To the
best of our knowledge, this is the first work on ERI computation on FPGAs that supports more than just the single most basic quartet class. Also, the integration of ERI computation and compression is a novelty that is not even covered by CPU or GPU libraries so far.
Our evaluation shows that using 16-bit integers for the ERI compression, the fastest FPGA kernels exceed the performance of 10 GERIS ($10 \times 10^9$ ERIs per second) on one Intel Stratix 10 GX 2800 FPGA, with maximum absolute errors around $10^{-7}$ - $10^{-5}$ Hartree. The measured throughput can be accurately explained by a performance model. The FPGA kernels deployed on 2 FPGAs outperform similar computations using the widely used libint reference on a two-socket server with 40 Xeon Gold 6148 CPU cores of the same process technology by factors up to 6.0x and on a new two-socket server with 128 EPYC 7713 CPU cores by up to 1.9x.
AU  - Wu, Xin
AU  - Kenter, Tobias
AU  - Schade, Robert
AU  - Kühne, Thomas
AU  - Plessl, Christian
ID  - 43228
T2  - 2023 IEEE 31st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
TI  - Computing and Compressing Electron Repulsion Integrals on FPGAs
ER  - 
TY  - JOUR
AB  - The non-orthogonal local submatrix method applied to electronic structure–based molecular dynamics simulations is shown to exceed 1.1 EFLOP/s in FP16/FP32-mixed floating-point arithmetic when using 4400 NVIDIA A100 GPUs of the Perlmutter system. This is enabled by a modification of the original method that pushes the sustained fraction of the peak performance to about 80%. Example calculations are performed for SARS-CoV-2 spike proteins with up to 83 million atoms.
AU  - Schade, Robert
AU  - Kenter, Tobias
AU  - Elgabarty, Hossam
AU  - Lass, Michael
AU  - Kühne, Thomas
AU  - Plessl, Christian
ID  - 45361
JF  - The International Journal of High Performance Computing Applications
KW  - Hardware and Architecture
KW  - Theoretical Computer Science
KW  - Software
SN  - 1094-3420
TI  - Breaking the exascale barrier for the electronic structure problem in ab-initio molecular dynamics
ER  - 
TY  - GEN
AB  - Viscous hydrodynamics serves as a successful mesoscopic description of the
Quark-Gluon Plasma produced in relativistic heavy-ion collisions. In order to
investigate how such an effective description emerges from the underlying
microscopic dynamics, we calculate the hydrodynamic and non-hydrodynamic modes
of linear response in the sound channel from a first-principle calculation in
kinetic theory. We do this with a new approach wherein we discretize the
collision kernel to directly calculate eigenvalues and eigenmodes of the
evolution operator. This allows us to study the Green's functions at any point
in the complex frequency space. Our study focuses on scalar theory with quartic
interaction and we find that the analytic structure of Green's functions in the
complex plane is far more complicated than just poles or cuts, which is a first
step towards an equivalent study in QCD kinetic theory.
AU  - Ochsenfeld, Stephan
AU  - Schlichting, Sören
ID  - 50172
T2  - arXiv:2308.04491
TI  - Hydrodynamic and Non-hydrodynamic Excitations in Kinetic Theory -- A Numerical Analysis in Scalar Field Theory
ER  - 
TY  - GEN
AB  - Memory Gym presents a suite of 2D partially observable environments, namely
Mortar Mayhem, Mystery Path, and Searing Spotlights, designed to benchmark
memory capabilities in decision-making agents. These environments, originally
with finite tasks, are expanded into innovative, endless formats, mirroring the
escalating challenges of cumulative memory games such as "I packed my bag".
This progression in task design shifts the focus from merely assessing sample
efficiency to also probing the levels of memory effectiveness in dynamic,
prolonged scenarios. To address the gap in available memory-based Deep
Reinforcement Learning baselines, we introduce an implementation that
integrates Transformer-XL (TrXL) with Proximal Policy Optimization. This
approach utilizes TrXL as a form of episodic memory, employing a sliding window
technique. Our comparative study between the Gated Recurrent Unit (GRU) and
TrXL reveals varied performances across different settings. TrXL, on the finite
environments, demonstrates superior sample efficiency in Mystery Path and
outperforms GRU in Mortar Mayhem. However, GRU is more efficient on Searing
Spotlights. Most notably, in all endless tasks, GRU makes a remarkable
resurgence, consistently outperforming TrXL by significant margins. Website and
Source Code: https://github.com/MarcoMeter/endless-memory-gym/
AU  - Pleines, Marco
AU  - Pallasch, Matthias
AU  - Zimmer, Frank
AU  - Preuss, Mike
ID  - 50221
T2  - arXiv:2309.17207
TI  - Memory Gym: Towards Endless Tasks to Benchmark Memory Capabilities of Agents
ER  - 
TY  - CHAP
AU  - Alt, Christoph
AU  - Kenter, Tobias
AU  - Faghih-Naini, Sara
AU  - Faj, Jennifer
AU  - Opdenhövel, Jan-Oliver
AU  - Plessl, Christian
AU  - Aizinger, Vadym
AU  - Hönig, Jan
AU  - Köstler, Harald
ID  - 46191
SN  - 0302-9743
T2  - Lecture Notes in Computer Science
TI  - Shallow Water DG Simulations on FPGAs: Design and Comparison of a Novel Code Generation Pipeline
ER  - 
TY  - GEN
AB  - This preprint makes the claim of having computed the $9^{th}$ Dedekind
Number. This was done by building an efficient FPGA Accelerator for the core
operation of the process, and parallelizing it on the Noctua 2 Supercluster at
Paderborn University. The resulting value is
286386577668298411128469151667598498812366. This value can be verified in two
steps. We have made the data file containing the 490M results available, each
of which can be verified separately on CPU, and the whole file sums to our
proposed value.
AU  - Van Hirtum, Lennart
AU  - De Causmaecker, Patrick
AU  - Goemaere, Jens
AU  - Kenter, Tobias
AU  - Riebler, Heinrich
AU  - Lass, Michael
AU  - Plessl, Christian
ID  - 43439
T2  - arXiv:2304.03039
TI  - A computation of D(9) using FPGA Supercomputing
ER  - 
TY  - CONF
AU  - Faj, Jennifer
AU  - Kenter, Tobias
AU  - Faghih-Naini, Sara
AU  - Plessl, Christian
AU  - Aizinger, Vadym
ID  - 46188
T2  - Proceedings of the Platform for Advanced Scientific Computing Conference (PASC)
TI  - Scalable Multi-FPGA Design of a Discontinuous Galerkin Shallow-Water Model on Unstructured Meshes
ER  - 
TY  - CONF
AU  - Prouveur, Charles
AU  - Haefele, Matthieu
AU  - Kenter, Tobias
AU  - Voss, Nils
ID  - 46189
T2  - Proceedings of the Platform for Advanced Scientific Computing Conference (PASC)
TI  - FPGA Acceleration for HPC Supercapacitor Simulations
ER  - 
TY  - GEN
AB  - We investigate the early time development of the anisotropic transverse flow
and spatial eccentricities of a fireball with various particle-based transport
approaches using a fixed initial condition. In numerical simulations ranging
from the quasi-collisionless case to the hydrodynamic regime, we find that the
onset of $v_n$ and of related measures of anisotropic flow can be described
with a simple power-law ansatz, with an exponent that depends on the amount of
rescatterings in the system. In the few-rescatterings regime we perform
semi-analytical calculations, based on a systematic expansion in powers of time
and the cross section, which can reproduce the numerical findings.
AU  - Borghini, Nicolas
AU  - Borrell, Marc
AU  - Roch, Hendrik
ID  - 32177
T2  - arXiv:2201.13294
TI  - Early time behavior of spatial and momentum anisotropies in kinetic theory across different Knudsen numbers
ER  - 
TY  - GEN
AB  - We test the ability of the "escape mechanism" to create the anisotropic flow
observed in high-energy nuclear collisions. We compare the flow harmonics $v_n$
in the few-rescatterings regime from two types of transport simulations, with
$2\to 2$ and $2\to 0$ collision kernels respectively, and from analytical
calculations neglecting the gain term of the Boltzmann equation. We find that
the even flow harmonics are similar in the three approaches, while the odd
harmonics differ significantly.
AU  - Bachmann, Benedikt
AU  - Borghini, Nicolas
AU  - Feld, Nina
AU  - Roch, Hendrik
ID  - 32178
T2  - arXiv:2203.13306
TI  - Even anisotropic-flow harmonics are from Venus, odd ones are from Mars
ER  - 
TY  - JOUR
AU  - Hou, W
AU  - Yao, Y
AU  - Li, Y
AU  - Peng, B
AU  - Shi, K
AU  - Zhou, Z
AU  - Pan, J
AU  - Liu, M
AU  - Hu, J
ID  - 32183
IS  - 1
JF  - Frontiers of Materials Science
SN  - 2095-025X
TI  - Linearly shifting ferromagnetic resonance response of La0.7Sr0.3MnO3 thin film for body temperature sensors
VL  - 16
ER  -