TY  - CONF
AU  - Pape, Gerrit
AU  - Wintermann, Bjarne
AU  - Jungemann, Linus
AU  - Lass, Michael
AU  - Meyer, Marius
AU  - Riebler, Heinrich
AU  - Plessl, Christian
ID  - 59816
T2  - Proceedings of the 15th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies
TI  - AuroraFlow, an Easy-to-Use, Low-Latency FPGA Communication Solution Demonstrated on Multi-FPGA Neural Network Inference
ER  - 
TY  - JOUR
AB  - In this work, we introduce PHOENIX, a highly optimized explicit open-source solver for two-dimensional nonlinear Schrödinger equations with extensions. The nonlinear Schrödinger equation and its extensions (Gross-Pitaevskii equation) are widely studied to model and analyze complex phenomena in fields such as optics, condensed matter physics, fluid dynamics, and plasma physics. It serves as a powerful tool for understanding nonlinear wave dynamics, soliton formation, and the interplay between nonlinearity, dispersion, and diffraction. By extending the nonlinear Schrödinger equation, various physical effects such as non-Hermiticity, spin-orbit interaction, and quantum optical aspects can be incorporated. PHOENIX is designed to accommodate a wide range of applications by a straightforward extendability without the need for user knowledge of computing architectures or performance optimization. The high performance and power efficiency of PHOENIX are demonstrated on a wide range of entry-class to high-end consumer and high-performance computing GPUs and CPUs. Compared to a more conventional MATLAB implementation, a speedup of up to three orders of magnitude and energy savings of up to 99.8% are achieved. The performance is compared to a performance model showing that PHOENIX performs close to the relevant performance bounds in many situations. The possibilities of PHOENIX are demonstrated with a range of practical examples from the realm of nonlinear (quantum) photonics in planar microresonators with active media including exciton-polariton condensates. Examples range from solutions on very large grids, the use of local optimization algorithms, to Monte Carlo ensemble evolutions with quantum noise enabling the tomography of the system's quantum state.
AU  - Wingenbach, Jan
AU  - Bauch, David
AU  - Ma, Xuekai
AU  - Schade, Robert
AU  - Plessl, Christian
AU  - Schumacher, Stefan
ID  - 60298
JF  - Computer Physics Communications
SN  - 0010-4655
TI  - PHOENIX – Paderborn highly optimized and energy efficient solver for two-dimensional nonlinear Schrödinger equations with integrated extensions
VL  - 315
ER  - 
TY  - JOUR
AB  - This manuscript makes the claim of having computed the 9th Dedekind number, D(9). This was done by accelerating the core operation of the process with an efficient FPGA design that outperforms an optimized 64-core CPU reference by 95×. The FPGA execution was parallelized on the Noctua 2 supercomputer at Paderborn University. The resulting value for D(9) is 286386577668298411128469151667598498812366. This value can be verified in two steps. We have made the data file containing the 490 M results available, each of which can be verified separately on CPU, and the whole file sums to our proposed value. The paper explains the mathematical approach in the first part, before putting the focus on a deep dive into the FPGA accelerator implementation followed by a performance analysis. The FPGA implementation was done in Register-Transfer Level using a dual-clock architecture and shows how we achieved an impressive FMax of 450 MHz on the targeted Stratix 10 GX 2800 FPGAs. The total compute time used was 47,000 FPGA hours.
AU  - Van Hirtum, Lennart
AU  - De Causmaecker, Patrick
AU  - Goemaere, Jens
AU  - Kenter, Tobias
AU  - Riebler, Heinrich
AU  - Lass, Michael
AU  - Plessl, Christian
ID  - 56604
IS  - 3
JF  - ACM Transactions on Reconfigurable Technology and Systems
SN  - 1936-7406
TI  - A Computation of the Ninth Dedekind Number Using FPGA Supercomputing
VL  - 17
ER  - 
TY  - JOUR
AU  - Pfnür, H.
AU  - Tegenkamp, C.
AU  - Sanna, S.
AU  - Jeckelmann, E.
AU  - Horn-von Hoegen, M.
AU  - Bovensiepen, U.
AU  - Esser, N.
AU  - Schmidt, Wolf Gero
AU  - Dähne, M.
AU  - Wippermann, S.
AU  - Bechstedt, F.
AU  - Bode, M.
AU  - Claessen, R.
AU  - Ernstorfer, R.
AU  - Hogan, C.
AU  - Ligges, M.
AU  - Pucci, A.
AU  - Schäfer, J.
AU  - Speiser, E.
AU  - Wolf, M.
AU  - Wollschläger, J.
ID  - 54869
IS  - 2
JF  - Surface Science Reports
SN  - 0167-5729
TI  - Atomic wires on substrates: Physics between one and two dimensions
VL  - 79
ER  - 
TY  - JOUR
AB  - We present a novel approach to characterize and quantify microheterogeneity and microphase separation in computer simulations of complex liquid mixtures. Our post-processing method is based on local density fluctuations of the different constituents in sampling spheres of varying size. It can be easily applied to both molecular dynamics (MD) and Monte Carlo (MC) simulations, including periodic boundary conditions. Multidimensional correlation of the density distributions yields a clear picture of the domain formation due to the subtle balance of different interactions. We apply our approach to the example of force field molecular dynamics simulations of imidazolium-based ionic liquids with different side chain lengths at different temperatures, namely 1-ethyl-3-methylimidazolium chloride, 1-hexyl-3-methylimidazolium chloride, and 1-decyl-3-methylimidazolium chloride, which are known to form distinct liquid domains. We put the results into the context of existing microheterogeneity analyses and demonstrate the advantages and sensitivity of our novel method. Furthermore, we show how to estimate the configuration entropy from our analysis, and we investigate voids in the system. The analysis has been implemented into our program package TRAVIS and is thus available as free software.
AU  - Lass, Michael
AU  - Kenter, Tobias
AU  - Plessl, Christian
AU  - Brehm, Martin
ID  - 53474
IS  - 4
JF  - Entropy
SN  - 1099-4300
TI  - Characterizing Microheterogeneity in Liquid Mixtures via Local Density Fluctuations
VL  - 26
ER  - 
TY  - JOUR
AU  - Krenz, Marvin
AU  - Gerstmann, Uwe
AU  - Schmidt, Wolf Gero
ID  - 54865
IS  - 7
JF  - Physical Review Letters
SN  - 0031-9007
TI  - Defect-Assisted Exciton Transfer across the Tetracene-Si(111):H Interface
VL  - 132
ER  - 
TY  - JOUR
AB  - Most properties of solid materials are defined by their internal electric field and charge density distributions which so far are difficult to measure with high spatial resolution. Especially for 2D materials, the atomic electric fields influence the optoelectronic properties. In this study, the atomic‐scale electric field and charge density distribution of WSe2 bi‐ and trilayers are revealed using an emerging microscopy technique, differential phase contrast (DPC) imaging in scanning transmission electron microscopy (STEM). For pristine material, a higher positive charge density located at the selenium atomic columns compared to the tungsten atomic columns is obtained and tentatively explained by a coherent scattering effect. Furthermore, the change in the electric field distribution induced by a missing selenium atomic column is investigated. A characteristic electric field distribution in the vicinity of the defect with locally reduced magnitudes compared to the pristine lattice is observed. This effect is accompanied by a considerable inward relaxation of the surrounding lattice, which according to first principles DFT calculation is fully compatible with a missing column of Se atoms. This shows that DPC imaging, as an electric field sensitive technique, provides additional and remarkable information to the otherwise only structural analysis obtained with conventional STEM imaging.
AU  - Groll, Maja
AU  - Bürger, Julius
AU  - Caltzidis, Ioannis
AU  - Jöns, Klaus D.
AU  - Schmidt, Wolf Gero
AU  - Gerstmann, Uwe
AU  - Lindner, Jörg K. N.
ID  - 54868
JF  - Small
SN  - 1613-6810
TI  - DFT‐Assisted Investigation of the Electric Field and Charge Density Distribution of Pristine and Defective 2D WSe2 by Differential Phase Contrast Imaging
ER  - 
TY  - CONF
AU  - Büttner, Markus
AU  - Alt, Christoph
AU  - Kenter, Tobias
AU  - Köstler, Harald
AU  - Plessl, Christian
AU  - Aizinger, Vadym
ID  - 54312
T2  - Proceedings of the Platform for Advanced Scientific Computing Conference (PASC)
TI  - Enabling Performance Portability for Shallow Water Equations on CPUs, GPUs, and FPGAs with SYCL
ER  - 
TY  - JOUR
AB  - Experiments with ultracold atoms in optical lattices usually involve a weak parabolic trapping potential which merely serves to confine the atoms, but otherwise remains negligible. In contrast, we suggest a different class of experiments in which the presence of a stronger trap is an essential part of the set-up. Because the trap-modified on-site energies exhibit a slowly varying level spacing, similar to that of an anharmonic oscillator, an additional time-periodic trap modulation with judiciously chosen parameters creates nonlinear resonances which enable efficient Floquet engineering. We employ a Mathieu approximation for constructing the near-resonant Floquet states in an accurate manner and demonstrate the emergence of effective ground states from the resonant trap eigenstates. Moreover, we show that the population of the Floquet states is strongly affected by the phase of a sudden turn-on of the trap modulation, which leads to significantly modified and rich dynamics. As a guideline for further studies, we argue that the deliberate population of only the resonance-induced effective ground states will allow one to realize Floquet condensates which follow classical periodic orbits, thus providing challenging future perspectives for the investigation of the quantum–classical correspondence.
AU  - Ali, Usman
AU  - Holthaus, Martin
AU  - Meier, Torsten
ID  - 57839
IS  - 12
JF  - New Journal of Physics
SN  - 1367-2630
TI  - Floquet dynamics of ultracold atoms in optical lattices with a parametrically modulated trapping potential
VL  - 26
ER  - 
TY  - CONF
AU  - Tareen, Abdul Rehman
AU  - Meyer, Marius
AU  - Plessl, Christian
AU  - Kenter, Tobias
ID  - 56607
T2  - 2024 IEEE 32nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
TI  - HiHiSpMV: Sparse Matrix Vector Multiplication with Hierarchical Row Reductions on FPGAs with High Bandwidth Memory
VL  - 35
ER  - 
TY  - JOUR
AB  - Noctua 2 is a supercomputer operated at the Paderborn Center for Parallel Computing (PC2) at Paderborn University in Germany. Noctua 2 was inaugurated in 2022 and is an Atos BullSequana XH2000 system. It consists mainly of three node types: 1) CPU Compute nodes with AMD EPYC processors in different main memory configurations, 2) GPU nodes with NVIDIA A100 GPUs, and 3) FPGA nodes with Xilinx Alveo U280 and Intel Stratix 10 FPGA cards. While CPUs and GPUs are known off-the-shelf components in HPC systems, the operation of a large number of FPGA cards from different vendors and a dedicated FPGA-to-FPGA network are unique characteristics of Noctua 2. This paper describes in detail the overall setup of Noctua 2 and gives insights into the operation of the cluster from a hardware, software and facility perspective.
AU  - Bauer, Carsten
AU  - Kenter, Tobias
AU  - Lass, Michael
AU  - Mazur, Lukas
AU  - Meyer, Marius
AU  - Nitsche, Holger
AU  - Riebler, Heinrich
AU  - Schade, Robert
AU  - Schwarz, Michael
AU  - Winnwa, Nils
AU  - Wiens, Alex
AU  - Wu, Xin
AU  - Plessl, Christian
AU  - Simon, Jens
ID  - 53663
JF  - Journal of large-scale research facilities
KW  - Noctua 2
KW  - Supercomputer
KW  - FPGA
KW  - PC2
KW  - Paderborn Center for Parallel Computing
TI  - Noctua 2 Supercomputer
VL  - 9
ER  - 
TY  - JOUR
AU  - Schäfer, F.
AU  - Trautmann, A.
AU  - Ngo, C.
AU  - Steiner, J. T.
AU  - Fuchs, C.
AU  - Volz, K.
AU  - Dobener, F.
AU  - Stein, M.
AU  - Meier, Torsten
AU  - Chatterjee, S.
ID  - 55267
IS  - 7
JF  - Physical Review B
SN  - 2469-9950
TI  - Optical Stark effect in type-II semiconductor heterostructures
VL  - 109
ER  - 
TY  - CONF
AU  - Olgu, Kaan
AU  - Kenter, Tobias
AU  - Nunez-Yanez, Jose
AU  - Mcintosh-Smith, Simon
ID  - 53503
T2  - Proceedings of the 12th International Workshop on OpenCL and SYCL
TI  - Optimisation and Evaluation of Breadth First Search with oneAPI/SYCL on Intel FPGAs: from Describing Algorithms to Describing Architectures
ER  - 
TY  - GEN
AB  - Most FPGA boards in the HPC domain are well-suited for parallel scaling because of the direct integration of versatile and high-throughput network ports. However, the utilization of their network capabilities is often challenging and error-prone because the whole network stack and communication patterns have to be implemented and managed on the FPGAs. Also, this approach conceptually involves a trade-off between the performance potential of improved communication and the impact of resource consumption for communication infrastructure, since the utilized resources on the FPGAs could otherwise be used for computations. In this work, we investigate this trade-off, firstly, by using synthetic benchmarks to evaluate the different configuration options of the communication framework ACCL and their impact on communication latency and throughput. Finally, we use our findings to implement a shallow water simulation whose scalability heavily depends on low-latency communication. With a suitable configuration of ACCL, good scaling behavior can be shown to all 48 FPGAs installed in the system. Overall, the results show that the availability of inter-FPGA communication frameworks as well as the configurability of framework and network stack are crucial to achieve the best application performance with low latency communication.
AU  - Meyer, Marius
AU  - Kenter, Tobias
AU  - Petrica, Lucian
AU  - O'Brien, Kenneth
AU  - Blott, Michaela
AU  - Plessl, Christian
ID  - 53364
T2  - arXiv:2403.18374
TI  - Optimizing Communication for Latency Sensitive HPC Applications on up to 48 FPGAs Using ACCL
ER  - 
TY  - CHAP
AB  - Most FPGA boards in the HPC domain are well-suited for parallel scaling because of the direct integration of versatile and high-throughput network ports. However, the utilization of their network capabilities is often challenging and error-prone because the whole network stack and communication patterns have to be implemented and managed on the FPGAs. Also, this approach conceptually involves a trade-off between the performance potential of improved communication and the impact of resource consumption for communication infrastructure, since the utilized resources on the FPGAs could otherwise be used for computations. In this work, we investigate this trade-off, firstly, by using synthetic benchmarks to evaluate the different configuration options of the communication framework ACCL and their impact on communication latency and throughput. Finally, we use our findings to implement a shallow water simulation whose scalability heavily depends on low-latency communication. With a suitable configuration of ACCL, good scaling behavior can be shown to all 48 FPGAs installed in the system. Overall, the results show that the availability of inter-FPGA communication frameworks as well as the configurability of framework and network stack are crucial to achieve the best application performance with low latency communication.
AU  - Meyer, Marius
AU  - Kenter, Tobias
AU  - Petrica, Lucian
AU  - O’Brien, Kenneth
AU  - Blott, Michaela
AU  - Plessl, Christian
ID  - 56606
SN  - 0302-9743
T2  - Lecture Notes in Computer Science
TI  - Optimizing Communication for Latency Sensitive HPC Applications on up to 48 FPGAs Using ACCL
ER  - 
TY  - JOUR
AB  - Theoretical spectroscopy based on double perturbation theory is typically challenged by systems with large orbital hyperfine splitting. Therefore, we here derive a rigorous, non-perturbative scheme starting from Dirac’s equation which allows to calculate the contribution of the orbital HFI for complex structures including heavy atoms with strong spin-orbit coupling (SOC). Using the PAW formalism, the method has been implemented in the software package Quantum ESPRESSO. We show that the ‘orbital part’ actually scales with SOC strength if orbital quenching is hindered by low local symmetry, i.e. in case of dimers or atoms at surfaces. This holds true in particular when the unpaired electron is localized in quasi-atomic p-like orbitals. Here, the orbital part is by far not negligible, but becomes dominant by surpassing the dipolar contribution by a factor of five.
AU  - Franzke, Katharina
AU  - Schmidt, Wolf Gero
AU  - Gerstmann, Uwe
ID  - 54856
IS  - 1
JF  - Journal of Physics: Conference Series
SN  - 1742-6588
TI  - Relativistic calculation of the orbital hyperfine splitting in complex microscopic structures
VL  - 2701
ER  - 
TY  - JOUR
AB  - At large scales, quantum systems may become advantageous over their classical counterparts at performing certain tasks. Developing tools to analyze these systems at the relevant scales, in a manner consistent with quantum mechanics, is therefore critical to benchmarking performance and characterizing their operation. While classical computational approaches cannot perform like-for-like computations of quantum systems beyond a certain scale, classical high-performance computing (HPC) may nevertheless be useful for precisely these characterization and certification tasks. By developing open-source customized algorithms using high-performance computing, we perform quantum tomography on a megascale quantum photonic detector covering a Hilbert space of 10^6. This requires finding 10^8 elements of the matrix corresponding to the positive operator valued measure (POVM), the quantum description of the detector, and is achieved in minutes of computation time. Moreover, by exploiting the structure of the problem, we achieve highly efficient parallel scaling, paving the way for quantum objects up to a system size of 10^12 elements to be reconstructed using this method. In general, this shows that a consistent quantum mechanical description of quantum phenomena is applicable at everyday scales. More concretely, this enables the reconstruction of large-scale quantum sources, processes and detectors used in computation and sampling tasks, which may be necessary to prove their nonclassical character or quantum computational advantage.
AU  - Schapeler, Timon
AU  - Schade, Robert
AU  - Lass, Michael
AU  - Plessl, Christian
AU  - Bartley, Tim
ID  - 53202
IS  - 1
JF  - Quantum Science and Technology
TI  - Scalable quantum detector tomography by high-performance computing
VL  - 10
ER  - 
TY  - CONF
AB  - The computation of electron repulsion integrals (ERIs) is a key component for quantum chemical methods. The intensive computation and bandwidth demand for ERI evaluation presents a significant challenge for quantum-mechanics-based atomistic simulations with hybrid density functional theory: due to the tens of trillions of ERI computations in each time step, practical applications are usually limited to thousands of atoms. In this work, we propose SERI, a high-throughput streaming accelerator for ERI computation on HBM-based FPGAs. In contrast to prior buffer-based designs, SERI proposes a novel streaming architecture to address the on-chip buffer limitation and the floorplanning challenge, and leverages the high-bandwidth memory to overcome the bandwidth bottleneck in prior designs. Moreover, to meet the varying computation, bandwidth, and floorplanning requirements between the 55 canonical quartet classes in ERI calculation, we design an automation tool, together with an accurate performance model, to automatically customize the architecture and floorplanning strategy for each canonical quartet class to maximize their throughput. Our performance evaluation on the AMD/Xilinx Alveo U280 FPGA board shows that SERI achieves an average speedup of 9.80x over the previous best-performing FPGA design, a 3.21x speedup over a 64-core AMD EPYC 7713 CPU, and a 15.64x speedup over an Nvidia A40 GPU. It reaches a peak throughput of 23.8 GERIS (10^9 ERIs per second) on one Alveo U280 FPGA. SERI will be released soon at https://github.com/SFU-HiAccel/SERI.
AU  - Stachura, Philip
AU  - Li, Guanyu
AU  - Wu, Xin
AU  - Plessl, Christian
AU  - Fang, Zhenman
ID  - 56609
T2  - 2024 34th International Conference on Field-Programmable Logic and Applications (FPL)
TI  - SERI: High-Throughput Streaming Acceleration of Electron Repulsion Integral Computation in Quantum Chemistry using HBM-based FPGAs
ER  - 
TY  - CONF
AU  - Opdenhövel, Jan-Oliver
AU  - Alt, Christoph
AU  - Plessl, Christian
AU  - Kenter, Tobias
ID  - 56605
T2  - 2024 34th International Conference on Field-Programmable Logic and Applications (FPL)
TI  - StencilStream: A SYCL-based Stencil Simulation Framework Targeting FPGAs
ER  - 
TY  - JOUR
AB  - Density-functional theory calculations on P-rich InP(001):H surfaces are presented. Depending on temperature, pressure and substrate doping, hydrogen desorption or adsorption will occur and influence the surface electronic properties. For p-doped samples, the charge transition levels of the P dangling bond defects resulting from H desorption will lead to Fermi level pinning in the lower half of the band gap. This explains recent experimental data. For n-doped substrates, H-deficient surfaces are the ground-state structure. This will lead to Fermi level pinning below the bulk conduction band minimum. Surface defects resulting from the adsorption of additional hydrogen can be expected as well, but affect the surface electronic properties less than H desorption.
AU  - Sciotto, Rachele
AU  - Ruiz Alvarado, Isaac Azahel
AU  - Schmidt, Wolf Gero
ID  - 54855
IS  - 1
JF  - Surfaces
SN  - 2571-9637
TI  - Substrate Doping and Defect Influence on P-Rich InP(001):H Surface Properties
VL  - 7
ER  -