TY  - CONF
AB  - The computation of highly contracted electron repulsion integrals (ERIs) is essential to achieve quantum accuracy in atomistic simulations based on quantum mechanics. Its growing computational demands make energy efficiency a critical concern. Recent studies demonstrate FPGAs’ superior performance and energy efficiency for computing primitive ERIs, but the computation of highly contracted ERIs introduces significant algorithmic complexity and new design challenges for FPGA acceleration. In this work, we present SORCERI, the first streaming overlay acceleration for highly contracted ERI computations on FPGAs. SORCERI introduces a novel streaming Rys computing unit to calculate roots and weights of Rys polynomials on-chip, and a streaming contraction unit for the contraction of primitive ERIs. This shifts the design bottleneck from limited CPU-FPGA communication bandwidth to available FPGA computation resources. To address practical deployment challenges for a large number of quartet classes, we design three streaming overlays, together with an efficient memory transpose optimization, to cover the 21 most commonly used quartet classes in realistic atomistic simulations. To address the new computation constraints, we use flexible calculation stages with a free-running streaming architecture to achieve high DSP utilization and good timing closure. Experiments demonstrate that SORCERI achieves an average 5.96x, 1.99x, and 1.16x better performance per watt than libint on a 64-core AMD EPYC 7713 CPU, libintx on an Nvidia A40 GPU, and SERI, the prior best-performing FPGA design for primitive ERIs. Furthermore, SORCERI reaches a peak throughput of 44.11 GERIS ($10^9$ ERIs per second) that is 1.52x, 1.13x, and 1.93x greater than libint, libintx and SERI, respectively. SORCERI will be released soon at https://github.com/SFU-HiAccel/SORCERI.
AU  - Stachura, Philip
AU  - Wu, Xin
AU  - Plessl, Christian
AU  - Fang, Zhenman
ID  - 63890
KW  - electron repulsion integrals
KW  - quantum chemistry
KW  - atomistic simulation
KW  - overlay architecture
KW  - fpga acceleration
SN  - 9798400720796
T2  - Proceedings of the 2026 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA '26)
TI  - SORCERI: Streaming Overlay Acceleration for Highly Contracted Electron Repulsion Integral Computations in Quantum Chemistry
ER  - 
TY  - CONF
AB  - Various methods to measure the dynamic behavior of particles require the calculation of autocorrelation functions. For this purpose, fast multi-tau correlators have been developed in dedicated hardware, in software, and on FPGAs. However, for methods such as X-ray Photon Correlation Spectroscopy (XPCS), which requires calculating the autocorrelation function independently for hundreds of thousands to millions of pixels from high-resolution detectors, current approaches rely on offline processing after data acquisition. Moreover, the internal pipeline state of so many independent correlators is far too large to keep on-chip. In this work, we propose a design approach on FPGAs, where pipeline contexts are stored in off-chip HBM memory. Each compute unit iteratively loads the state for a single pixel, processes a short time series for this pixel, and afterwards writes back the context in a dataflow pipeline. We have implemented the required compute kernels with Vitis HLS and analyze the resulting designs on an Alveo U280 card. The design achieves the expected performance and for the first time provides sufficient throughput for current high-end detectors used in XPCS.
AU  - Tareen, Abdul Rehman
AU  - Plessl, Christian
AU  - Kenter, Tobias
ID  - 65101
T2  - 2025 International Conference on Field Programmable Technology (ICFPT)
TI  - Fast Multi-Tau Correlators on FPGA with Context Switching From and to High-Bandwidth Memory
ER  - 
TY  - CONF
AU  - Jungemann, Linus
AU  - Wintermann, Bjarne
AU  - Riebler, Heinrich
AU  - Plessl, Christian
ID  - 59804
T2  - Proceedings of the 15th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies
TI  - FINN-HPC: Closing the Gap for Energy-Efficient Neural Network Inference on FPGAs in HPC
ER  - 
TY  - JOUR
AB  - SYCL is an open standard for targeting heterogeneous hardware from C++. In this work, we evaluate a SYCL implementation for a discontinuous Galerkin discretization of the 2D shallow water equations targeting CPUs, GPUs, and also FPGAs. The discretization uses polynomial orders zero to two on unstructured triangular meshes. Separating memory accesses from the numerical code allows us to optimize data accesses for the target architecture. A performance analysis shows good portability across x86 and ARM CPUs, GPUs from different vendors, and even two variants of Intel Stratix 10 FPGAs. Measuring the energy to solution shows that GPUs yield an up to 10x higher energy efficiency in terms of degrees of freedom per joule compared to CPUs. With custom designed caches, FPGAs offer a meaningful complement to the other architectures with particularly good computational performance on smaller meshes. FPGAs with High Bandwidth Memory are less affected by bandwidth issues and have similar energy efficiency as latest generation CPUs.
AU  - Büttner, Markus
AU  - Alt, Christoph
AU  - Kenter, Tobias
AU  - Köstler, Harald
AU  - Plessl, Christian
AU  - Aizinger, Vadym
ID  - 62064
IS  - 6
JF  - The Journal of Supercomputing
SN  - 1573-0484
TI  - Analyzing performance portability for a SYCL implementation of the 2D shallow water equations
VL  - 81
ER  - 
TY  - CONF
AB  - In the context of high-performance computing (HPC) for distributed workloads, individual field-programmable gate arrays (FPGAs) need efficient ways to exchange data, which requires network infrastructure and software abstractions. Dedicated multi-FPGA clusters provide inter-FPGA networks for direct device-to-device communication. The oneAPI high-level synthesis toolchain offers I/O pipes to allow user kernels to interact with the networking ports of the FPGA board. In this work, we evaluate using oneAPI I/O pipes for direct FPGA-to-FPGA communication by scaling a SYCL implementation of a Jacobi solver on up to 25 FPGAs in the Noctua 2 cluster. We see good results in weak and strong scaling experiments.
AU  - Alt, Christoph
AU  - Plessl, Christian
AU  - Kenter, Tobias
ID  - 62066
KW  - Multi-FPGA
KW  - High-level Synthesis
KW  - oneAPI
KW  - FPGA
SN  - 9798400713606
T2  - Proceedings of the 13th International Workshop on OpenCL and SYCL
TI  - Evaluating oneAPI I/O Pipes in a Case Study of Scaling a SYCL Jacobi Solver to multiple FPGAs
ER  - 
TY  - CONF
AU  - Sundriyal, Shivam
AU  - Büttner, Markus
AU  - Alt, Christoph
AU  - Kenter, Tobias
AU  - Aizinger, Vadym
ID  - 62065
T2  - 2025 IEEE High Performance Extreme Computing Conference (HPEC)
TI  - Adaptive Spectral Block Floating Point for Discontinuous Galerkin Methods
ER  - 
TY  - CONF
AB  - Efficient graph processing is essential for a wide range of applications. Scalability and memory access patterns are still a challenge, especially with the Breadth-First Search algorithm. This work focuses on leveraging HPC systems with multiple GPUs available in a single node with peer-to-peer functionality of the Intel oneAPI implementation of SYCL. We propose three GPU-based load-balancing methods: work-group localisation for efficient data access, even workload distribution for higher GPU occupancy, and a hybrid strided-access approach for heuristic balancing. These methods ensure performance, portability, and productivity with a unified codebase. Our proposed methodologies outperform state-of-the-art single-GPU implementations based on CUDA on synthetic RMAT graphs. We analysed BFS performance across NVIDIA A100, Intel Max 1550, and AMD MI300X GPUs, achieving a peak performance of 153.27 GTEPS on an RMAT25-64 graph using 8 GPUs on the NVIDIA A100. Furthermore, our work demonstrates the capability to handle RMAT graphs up to scale 29, achieving superior performance on synthetic graphs and competitive results on real-world datasets.
AU  - Olgu, Kaan
AU  - Kenter, Tobias
AU  - Nunez-Yanez, Jose
AU  - McIntosh-Smith, Simon
AU  - Deakin, Tom
ID  - 65102
T2  - Proceedings of the SC '25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis
TI  - Towards Efficient Load Balancing BFS on GPUs: One Code for AMD, Intel & Nvidia
ER  - 
TY  - GEN
AB  - Otus is a high-performance computing cluster that was launched in 2025 and is operated by the Paderborn Center for Parallel Computing (PC2) at Paderborn University in Germany. The system is part of the National High Performance Computing (NHR) initiative. Otus complements the previous supercomputer Noctua 2, offering approximately twice the computing power while retaining the three node types that were characteristic of Noctua 2: 1) CPU compute nodes with different memory capacities, 2) high-end GPU nodes, and 3) HPC-grade FPGA nodes. On the Top500 list, which ranks the 500 most powerful supercomputers in the world, Otus is in position 164 with the CPU partition and in position 255 with the GPU partition (June 2025). On the Green500 list, ranking the 500 most energy-efficient supercomputers in the world, Otus is in position 5 with the GPU partition (June 2025). This article provides a comprehensive overview of the system in terms of its hardware, software, system integration, and its overall integration into the data center building to ensure energy-efficient operation. The article aims to provide unique insights for scientists using the system and for other centers operating HPC clusters. The article will be continuously updated to reflect the latest system setup and measurements.
AU  - Ehtesabi, Sadaf
AU  - Hossain, Manoar
AU  - Kenter, Tobias
AU  - Krawinkel, Andreas
AU  - Ostermann, Lukas
AU  - Plessl, Christian
AU  - Riebler, Heinrich
AU  - Rohde, Stefan
AU  - Schade, Robert
AU  - Schwarz, Michael
AU  - Simon, Jens
AU  - Winnwa, Nils
AU  - Wiens, Alex
AU  - Wu, Xin
ID  - 62981
KW  - Otus
KW  - Supercomputer
KW  - FPGA
KW  - PC2
KW  - Paderborn Center for Parallel Computing
KW  - Noctua 2
KW  - HPC
TI  - Otus Supercomputer
VL  - 1
ER  - 
TY  - JOUR
AB  - We present a novel approach to characterize and quantify microheterogeneity and microphase separation in computer simulations of complex liquid mixtures. Our post-processing method is based on local density fluctuations of the different constituents in sampling spheres of varying size. It can be easily applied to both molecular dynamics (MD) and Monte Carlo (MC) simulations, including periodic boundary conditions. Multidimensional correlation of the density distributions yields a clear picture of the domain formation due to the subtle balance of different interactions. We apply our approach to the example of force field molecular dynamics simulations of imidazolium-based ionic liquids with different side chain lengths at different temperatures, namely 1-ethyl-3-methylimidazolium chloride, 1-hexyl-3-methylimidazolium chloride, and 1-decyl-3-methylimidazolium chloride, which are known to form distinct liquid domains. We put the results into the context of existing microheterogeneity analyses and demonstrate the advantages and sensitivity of our novel method. Furthermore, we show how to estimate the configuration entropy from our analysis, and we investigate voids in the system. The analysis has been implemented into our program package TRAVIS and is thus available as free software.
AU  - Lass, Michael
AU  - Kenter, Tobias
AU  - Plessl, Christian
AU  - Brehm, Martin
ID  - 53474
IS  - 4
JF  - Entropy
SN  - 1099-4300
TI  - Characterizing Microheterogeneity in Liquid Mixtures via Local Density Fluctuations
VL  - 26
ER  - 
TY  - JOUR
AB  - Noctua 2 is a supercomputer operated at the Paderborn Center for Parallel Computing (PC2) at Paderborn University in Germany. Noctua 2 was inaugurated in 2022 and is an Atos BullSequana XH2000 system. It consists mainly of three node types: 1) CPU Compute nodes with AMD EPYC processors in different main memory configurations, 2) GPU nodes with NVIDIA A100 GPUs, and 3) FPGA nodes with Xilinx Alveo U280 and Intel Stratix 10 FPGA cards. While CPUs and GPUs are known off-the-shelf components in HPC systems, the operation of a large number of FPGA cards from different vendors and a dedicated FPGA-to-FPGA network are unique characteristics of Noctua 2. This paper describes in detail the overall setup of Noctua 2 and gives insights into the operation of the cluster from a hardware, software and facility perspective.
AU  - Bauer, Carsten
AU  - Kenter, Tobias
AU  - Lass, Michael
AU  - Mazur, Lukas
AU  - Meyer, Marius
AU  - Nitsche, Holger
AU  - Riebler, Heinrich
AU  - Schade, Robert
AU  - Schwarz, Michael
AU  - Winnwa, Nils
AU  - Wiens, Alex
AU  - Wu, Xin
AU  - Plessl, Christian
AU  - Simon, Jens
ID  - 53663
JF  - Journal of large-scale research facilities
KW  - Noctua 2
KW  - Supercomputer
KW  - FPGA
KW  - PC2
KW  - Paderborn Center for Parallel Computing
TI  - Noctua 2 Supercomputer
VL  - 9
ER  - 
TY  - CHAP
AB  - Most FPGA boards in the HPC domain are well-suited for parallel scaling because of the direct integration of versatile and high-throughput network ports. However, the utilization of their network capabilities is often challenging and error-prone because the whole network stack and communication patterns have to be implemented and managed on the FPGAs. Also, this approach conceptually involves a trade-off between the performance potential of improved communication and the impact of resource consumption for communication infrastructure, since the utilized resources on the FPGAs could otherwise be used for computations. In this work, we investigate this trade-off, firstly, by using synthetic benchmarks to evaluate the different configuration options of the communication framework ACCL and their impact on communication latency and throughput. Finally, we use our findings to implement a shallow water simulation whose scalability heavily depends on low-latency communication. With a suitable configuration of ACCL, good scaling behavior can be shown to all 48 FPGAs installed in the system. Overall, the results show that the availability of inter-FPGA communication frameworks as well as the configurability of framework and network stack are crucial to achieve the best application performance with low-latency communication.
AU  - Meyer, Marius
AU  - Kenter, Tobias
AU  - Petrica, Lucian
AU  - O’Brien, Kenneth
AU  - Blott, Michaela
AU  - Plessl, Christian
ID  - 56606
SN  - 0302-9743
T2  - Lecture Notes in Computer Science
TI  - Optimizing Communication for Latency Sensitive HPC Applications on up to 48 FPGAs Using ACCL
ER  - 
TY  - CONF
AU  - Opdenhövel, Jan-Oliver
AU  - Alt, Christoph
AU  - Plessl, Christian
AU  - Kenter, Tobias
ID  - 56605
T2  - 2024 34th International Conference on Field-Programmable Logic and Applications (FPL)
TI  - StencilStream: A SYCL-based Stencil Simulation Framework Targeting FPGAs
ER  - 
TY  - CONF
AU  - Tareen, Abdul Rehman
AU  - Meyer, Marius
AU  - Plessl, Christian
AU  - Kenter, Tobias
ID  - 56607
T2  - 2024 IEEE 32nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
TI  - HiHiSpMV: Sparse Matrix Vector Multiplication with Hierarchical Row Reductions on FPGAs with High Bandwidth Memory
VL  - 35
ER  - 
TY  - CONF
AB  - The computation of electron repulsion integrals (ERIs) is a key component for quantum chemical methods. The intensive computation and bandwidth demand for ERI evaluation presents a significant challenge for quantum-mechanics-based atomistic simulations with hybrid density functional theory: due to the tens of trillions of ERI computations in each time step, practical applications are usually limited to thousands of atoms. In this work, we propose SERI, a high-throughput streaming accelerator for ERI computation on HBM-based FPGAs. In contrast to prior buffer-based designs, SERI proposes a novel streaming architecture to address the on-chip buffer limitation and the floorplanning challenge, and leverages the high-bandwidth memory to overcome the bandwidth bottleneck in prior designs. Moreover, to meet the varying computation, bandwidth, and floorplanning requirements between the 55 canonical quartet classes in ERI calculation, we design an automation tool, together with an accurate performance model, to automatically customize the architecture and floorplanning strategy for each canonical quartet class to maximize their throughput. Our performance evaluation on the AMD/Xilinx Alveo U280 FPGA board shows that SERI achieves an average speedup of 9.80x over the previous best-performing FPGA design, a 3.21x speedup over a 64-core AMD EPYC 7713 CPU, and a 15.64x speedup over an Nvidia A40 GPU. It reaches a peak throughput of 23.8 GERIS ($10^9$ ERIs per second) on one Alveo U280 FPGA. SERI will be released soon at https://github.com/SFU-HiAccel/SERI.
AU  - Stachura, Philip
AU  - Li, Guanyu
AU  - Wu, Xin
AU  - Plessl, Christian
AU  - Fang, Zhenman
ID  - 56609
T2  - 2024 34th International Conference on Field-Programmable Logic and Applications (FPL)
TI  - SERI: High-Throughput Streaming Acceleration of Electron Repulsion Integral Computation in Quantum Chemistry using HBM-based FPGAs
ER  - 
TY  - CONF
AU  - Büttner, Markus
AU  - Alt, Christoph
AU  - Kenter, Tobias
AU  - Köstler, Harald
AU  - Plessl, Christian
AU  - Aizinger, Vadym
ID  - 54312
T2  - Proceedings of the Platform for Advanced Scientific Computing Conference (PASC)
TI  - Enabling Performance Portability for Shallow Water Equations on CPUs, GPUs, and FPGAs with SYCL
ER  - 
TY  - CHAP
AB  - Most FPGA boards in the HPC domain are well-suited for parallel scaling because of the direct integration of versatile and high-throughput network ports. However, the utilization of their network capabilities is often challenging and error-prone because the whole network stack and communication patterns have to be implemented and managed on the FPGAs. Also, this approach conceptually involves a trade-off between the performance potential of improved communication and the impact of resource consumption for communication infrastructure, since the utilized resources on the FPGAs could otherwise be used for computations. In this work, we investigate this trade-off, firstly, by using synthetic benchmarks to evaluate the different configuration options of the communication framework ACCL and their impact on communication latency and throughput. Finally, we use our findings to implement a shallow water simulation whose scalability heavily depends on low-latency communication. With a suitable configuration of ACCL, good scaling behavior can be shown to all 48 FPGAs installed in the system. Overall, the results show that the availability of inter-FPGA communication frameworks as well as the configurability of framework and network stack are crucial to achieve the best application performance with low-latency communication.
AU  - Meyer, Marius
AU  - Kenter, Tobias
AU  - Petrica, Lucian
AU  - O’Brien, Kenneth
AU  - Blott, Michaela
AU  - Plessl, Christian
ID  - 62067
SN  - 0302-9743
T2  - Lecture Notes in Computer Science
TI  - Optimizing Communication for Latency Sensitive HPC Applications on up to 48 FPGAs Using ACCL
ER  - 
TY  - JOUR
AB  - This manuscript makes the claim of having computed the 9th Dedekind number, D(9). This was done by accelerating the core operation of the process with an efficient FPGA design that outperforms an optimized 64-core CPU reference by 95x. The FPGA execution was parallelized on the Noctua 2 supercomputer at Paderborn University. The resulting value for D(9) is 286386577668298411128469151667598498812366. This value can be verified in two steps. We have made the data file containing the 490 M results available, each of which can be verified separately on CPU, and the whole file sums to our proposed value. The paper explains the mathematical approach in the first part, before putting the focus on a deep dive into the FPGA accelerator implementation followed by a performance analysis. The FPGA implementation was done in Register-Transfer Level using a dual-clock architecture and shows how we achieved an impressive FMax of 450 MHz on the targeted Stratix 10 GX 2800 FPGAs. The total compute time used was 47,000 FPGA hours.
AU  - Van Hirtum, Lennart
AU  - De Causmaecker, Patrick
AU  - Goemaere, Jens
AU  - Kenter, Tobias
AU  - Riebler, Heinrich
AU  - Lass, Michael
AU  - Plessl, Christian
ID  - 56604
IS  - 3
JF  - ACM Transactions on Reconfigurable Technology and Systems
SN  - 1936-7406
TI  - A Computation of the Ninth Dedekind Number Using FPGA Supercomputing
VL  - 17
ER  - 
TY  - CONF
AU  - Olgu, Kaan
AU  - Kenter, Tobias
AU  - Nunez-Yanez, Jose
AU  - McIntosh-Smith, Simon
ID  - 53503
T2  - Proceedings of the 12th International Workshop on OpenCL and SYCL
TI  - Optimisation and Evaluation of Breadth First Search with oneAPI/SYCL on Intel FPGAs: from Describing Algorithms to Describing Architectures
ER  - 
TY  - GEN
AB  - This preprint makes the claim of having computed the $9^{th}$ Dedekind Number. This was done by building an efficient FPGA Accelerator for the core operation of the process, and parallelizing it on the Noctua 2 Supercluster at Paderborn University. The resulting value is 286386577668298411128469151667598498812366. This value can be verified in two steps. We have made the data file containing the 490M results available, each of which can be verified separately on CPU, and the whole file sums to our proposed value.
AU  - Van Hirtum, Lennart
AU  - De Causmaecker, Patrick
AU  - Goemaere, Jens
AU  - Kenter, Tobias
AU  - Riebler, Heinrich
AU  - Lass, Michael
AU  - Plessl, Christian
ID  - 43439
T2  - arXiv:2304.03039
TI  - A computation of D(9) using FPGA Supercomputing
ER  - 
TY  - CONF
AU  - Faj, Jennifer
AU  - Kenter, Tobias
AU  - Faghih-Naini, Sara
AU  - Plessl, Christian
AU  - Aizinger, Vadym
ID  - 46188
T2  - Proceedings of the Platform for Advanced Scientific Computing Conference (PASC)
TI  - Scalable Multi-FPGA Design of a Discontinuous Galerkin Shallow-Water Model on Unstructured Meshes
ER  -