TY  - CHAP
AU  - Alt, Christoph
AU  - Kenter, Tobias
AU  - Faghih-Naini, Sara
AU  - Faj, Jennifer
AU  - Opdenhövel, Jan-Oliver
AU  - Plessl, Christian
AU  - Aizinger, Vadym
AU  - Hönig, Jan
AU  - Köstler, Harald
ID  - 46191
SN  - 0302-9743
T2  - Lecture Notes in Computer Science
TI  - Shallow Water DG Simulations on FPGAs: Design and Comparison of a Novel Code Generation Pipeline
ER  - 
TY  - GEN
AB  - This preprint makes the claim of having computed the $9^{th}$ Dedekind Number. This was done by building an efficient FPGA Accelerator for the core operation of the process, and parallelizing it on the Noctua 2 Supercluster at Paderborn University. The resulting value is 286386577668298411128469151667598498812366. This value can be verified in two steps. We have made the data file containing the 490M results available, each of which can be verified separately on CPU, and the whole file sums to our proposed value.
AU  - Van Hirtum, Lennart
AU  - De Causmaecker, Patrick
AU  - Goemaere, Jens
AU  - Kenter, Tobias
AU  - Riebler, Heinrich
AU  - Lass, Michael
AU  - Plessl, Christian
ID  - 43439
T2  - arXiv:2304.03039
TI  - A computation of D(9) using FPGA Supercomputing
ER  - 
TY  - CONF
AU  - Faj, Jennifer
AU  - Kenter, Tobias
AU  - Faghih-Naini, Sara
AU  - Plessl, Christian
AU  - Aizinger, Vadym
ID  - 46188
T2  - Proceedings of the Platform for Advanced Scientific Computing Conference (PASC)
TI  - Scalable Multi-FPGA Design of a Discontinuous Galerkin Shallow-Water Model on Unstructured Meshes
ER  - 
TY  - CONF
AU  - Prouveur, Charles
AU  - Haefele, Matthieu
AU  - Kenter, Tobias
AU  - Voss, Nils
ID  - 46189
T2  - Proceedings of the Platform for Advanced Scientific Computing Conference (PASC)
TI  - FPGA Acceleration for HPC Supercapacitor Simulations
ER  - 
TY  - CHAP
AU  - Hansmeier, Tim
AU  - Kenter, Tobias
AU  - Meyer, Marius
AU  - Riebler, Heinrich
AU  - Platzner, Marco
AU  - Plessl, Christian
ED  - Haake, Claus-Jochen
ED  - Meyer auf der Heide, Friedhelm
ED  - Platzner, Marco
ED  - Wachsmuth, Henning
ED  - Wehrheim, Heike
ID  - 45893
T2  - On-The-Fly Computing -- Individualized IT-services in dynamic markets
TI  - Compute Centers I: Heterogeneous Execution Environments
VL  - 412
ER  - 
TY  - CONF
AU  - Opdenhövel, Jan-Oliver
AU  - Plessl, Christian
AU  - Kenter, Tobias
ID  - 46190
T2  - Proceedings of the 13th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies (HEART)
TI  - Mutation Tree Reconstruction of Tumor Cells on FPGAs Using a Bit-Level Matrix Representation
ER  - 
TY  - JOUR
AB  - While FPGA accelerator boards and their respective high-level design tools are maturing, there is still a lack of multi-FPGA applications, libraries, and not least, benchmarks and reference implementations towards sustained HPC usage of these devices. As in the early days of GPUs in HPC, for workloads that can reasonably be decoupled into loosely coupled working sets, multi-accelerator support can be achieved by using standard communication interfaces like MPI on the host side. However, for performance and productivity, some applications can profit from a tighter coupling of the accelerators. FPGAs offer unique opportunities here when extending the dataflow characteristics to their communication interfaces.
In this work, we extend the HPCC FPGA benchmark suite with multi-FPGA support and three missing benchmarks that particularly characterize or stress inter-device communication: b_eff, PTRANS, and LINPACK. With all benchmarks implemented for current boards with Intel and Xilinx FPGAs, we established a baseline for multi-FPGA performance. Additionally, for the communication-centric benchmarks, we explored the potential of direct FPGA-to-FPGA communication with a circuit-switched inter-FPGA network that is currently only available for one of the boards. The evaluation with parallel execution on up to 26 FPGA boards makes use of one of the largest academic FPGA installations.
AU  - Meyer, Marius
AU  - Kenter, Tobias
AU  - Plessl, Christian
ID  - 38041
JF  - ACM Transactions on Reconfigurable Technology and Systems
KW  - General Computer Science
SN  - 1936-7406
TI  - Multi-FPGA Designs and Scaling of HPC Challenge Benchmarks via MPI and Circuit-Switched Inter-FPGA Networks
ER  - 
TY  - CONF
AB  - The computation of electron repulsion integrals (ERIs) over Gaussian-type orbitals (GTOs) is a challenging problem in quantum-mechanics-based atomistic simulations. In practical simulations, several trillions of ERIs may have to be computed for every time step.
In this work, we investigate FPGAs as accelerators for the ERI computation. We use template parameters, here within the Intel oneAPI tool flow, to create customized designs for 256 different ERI quartet classes, based on their orbitals. To maximize data reuse, all intermediates are buffered in FPGA on-chip memory with customized layout. The pre-calculation of intermediates also helps to overcome data dependencies caused by multi-dimensional recurrence relations. The involved loop structures are partially or even fully unrolled for high throughput of FPGA kernels. Furthermore, a lossy compression algorithm utilizing arbitrary bitwidth integers is integrated in the FPGA kernels. To the best of our knowledge, this is the first work on ERI computation on FPGAs that supports more than just the single most basic quartet class. Also, the integration of ERI computation and compression is a novelty that is not even covered by CPU or GPU libraries so far.
Our evaluation shows that using 16-bit integers for the ERI compression, the fastest FPGA kernels exceed the performance of 10 GERIS ($10 \times 10^9$ ERIs per second) on one Intel Stratix 10 GX 2800 FPGA, with maximum absolute errors around $10^{-7}$ to $10^{-5}$ Hartree. The measured throughput can be accurately explained by a performance model. The FPGA kernels deployed on 2 FPGAs outperform similar computations using the widely used libint reference on a two-socket server with 40 Xeon Gold 6148 CPU cores of the same process technology by factors up to 6.0x and on a new two-socket server with 128 EPYC 7713 CPU cores by up to 1.9x.
AU  - Wu, Xin
AU  - Kenter, Tobias
AU  - Schade, Robert
AU  - Kühne, Thomas
AU  - Plessl, Christian
ID  - 43228
T2  - 2023 IEEE 31st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
TI  - Computing and Compressing Electron Repulsion Integrals on FPGAs
ER  - 
TY  - JOUR
AB  - The non-orthogonal local submatrix method applied to electronic structure–based molecular dynamics simulations is shown to exceed 1.1 EFLOP/s in FP16/FP32-mixed floating-point arithmetic when using 4400 NVIDIA A100 GPUs of the Perlmutter system. This is enabled by a modification of the original method that pushes the sustained fraction of the peak performance to about 80%. Example calculations are performed for SARS-CoV-2 spike proteins with up to 83 million atoms.
AU  - Schade, Robert
AU  - Kenter, Tobias
AU  - Elgabarty, Hossam
AU  - Lass, Michael
AU  - Kühne, Thomas
AU  - Plessl, Christian
ID  - 45361
JF  - The International Journal of High Performance Computing Applications
KW  - Hardware and Architecture
KW  - Theoretical Computer Science
KW  - Software
SN  - 1094-3420
TI  - Breaking the exascale barrier for the electronic structure problem in ab-initio molecular dynamics
ER  - 
TY  - THES
AU  - Lass, Michael
ID  - 32414
TI  - Bringing Massive Parallelism and Hardware Acceleration to Linear Scaling Density Functional Theory Through Targeted Approximations
ER  -