The Proceedings of the PASC Conference (PASC20) are published in the Association for Computing Machinery's (ACM's) Digital Library. In recognition of the high quality of the PASC Conference papers track, the ACM continues to provide the proceedings as an Open Table of Contents (OpenTOC). This means that the definitive versions of PASC Conference papers are available to everyone at no charge to the author and without any pay-wall constraints for readers.

The OpenTOC for the PASC Conference is hosted on the ACM’s SIGHPC website. PASC papers can be accessed for free at: www.sighpc.org/for-our-community/acm-open-tocs.

Automatic Generation of Efficient Linear Algebra Programs

Henrik Barthels, Christos Psarras, and Paolo Bientinesi

The level of abstraction at which application experts reason about linear algebra computations and the level of abstraction used by developers of high-performance numerical linear algebra libraries do not match. The former is conveniently captured by high-level languages and libraries such as Matlab and Eigen, while the latter is expressed by the kernels included in the BLAS and LAPACK libraries. Unfortunately, the translation from a high-level computation to an efficient sequence of kernels is a far-from-trivial task that requires extensive knowledge of both linear algebra and high-performance computing. Internally, almost all high-level languages and libraries use efficient kernels; however, the translation algorithms are too simplistic and thus lead to a suboptimal use of said kernels, with significant performance losses. In order to both achieve the productivity that comes with high-level languages and make use of the efficiency of low-level kernels, we are developing Linnea, a code generator for linear algebra problems. As input, Linnea takes a high-level description of a linear algebra problem; as output, it produces an efficient sequence of calls to high-performance kernels. In 25 application problems, the code generated by Linnea always outperforms Matlab, Julia, Eigen and Armadillo, with speedups up to and exceeding 10×.
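The abstraction gap this abstract describes can be illustrated with a small, hypothetical NumPy sketch (not Linnea's actual output): the same least-squares expression evaluated literally, with an explicit matrix inverse, versus mapped to a single linear-solve kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 50
A = rng.standard_normal((n, m))
y = rng.standard_normal(n)

# Literal translation of the high-level formula x = inv(A'A) A'y:
# forms an explicit inverse plus several temporaries.
x_naive = np.linalg.inv(A.T @ A) @ (A.T @ y)

# Kernel-aware translation: one symmetric product and a single
# linear solve (a LAPACK solve underneath), cheaper and more stable.
x_kernel = np.linalg.solve(A.T @ A, A.T @ y)

assert np.allclose(x_naive, x_kernel)
```

A generator in the spirit of Linnea searches among many such kernel mappings (exploiting properties like symmetry or positive definiteness) rather than translating the expression verbatim.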

Henrik Barthels, Christos Psarras, and Paolo Bientinesi. 2020. Automatic Generation of Efficient Linear Algebra Programs. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC ’20). Association for Computing Machinery, New York, NY, USA, Article 1, 1–11. DOI:https://doi.org/10.1145/3394277.3401836


Extreme-Scale Task-Based Cholesky Factorization Toward Climate and Weather Prediction Applications

Qinglei Cao, Yu Pei, Kadir Akbudak, Aleksandr Mikhalev, George Bosilca, Hatem Ltaief, David Keyes, and Jack Dongarra

Climate and weather can be predicted statistically via geospatial Maximum Likelihood Estimates (MLE), as an alternative to running large ensembles of forward models. The MLE-based iterative optimization procedure requires solving large-scale linear systems via a Cholesky factorization of a symmetric positive-definite covariance matrix---a demanding dense factorization in terms of memory footprint and computation. We propose a novel solution to this problem: at the mathematical level, we reduce the computational requirement by exploiting the data-sparsity structure of the matrix off-diagonal tiles by means of low-rank approximations; and, at the programming-paradigm level, we integrate PaRSEC, a dynamic task-based runtime, to reach unparalleled levels of efficiency for solving extreme-scale linear algebra matrix operations. The resulting solution leverages fine-grained computations to facilitate asynchronous execution while providing a flexible data distribution to mitigate load imbalance. Performance results are reported using 3D synthetic datasets up to 42M geospatial locations on 130,000 cores, which represent a cornerstone toward fast and accurate predictions of environmental applications.

Qinglei Cao, Yu Pei, Kadir Akbudak, Aleksandr Mikhalev, George Bosilca, Hatem Ltaief, David Keyes, and Jack Dongarra. 2020. Extreme-Scale Task-Based Cholesky Factorization Toward Climate and Weather Prediction Applications. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC ’20). Association for Computing Machinery, New York, NY, USA, Article 2, 1–11. DOI:https://doi.org/10.1145/3394277.3401846


Benchmarking of state-of-the-art HPC Clusters with a Production CFD Code

Fabio Banchelli, Marta Garcia-Gasulla, Guillaume Houzeaux, and Filippo Mantovani

Computing technologies populating high-performance computing (HPC) clusters are getting more and more diverse, offering a wide range of architectural features. As a consequence, efficient programming of such platforms becomes a complex task. In this paper we provide a micro-benchmarking study of three HPC clusters based on different CPU architectures predominant in the Top500 ranking: x86, Armv8 and IBM Power9. On these platforms we study a production fluid-dynamics application, leveraging different compiler technologies and micro-architectural features. We finally provide a scalability study on state-of-the-art HPC clusters. The two most relevant conclusions of our study are: i) compiler development is critical for squeezing performance out of the most recent technologies; ii) micro-architectural features such as Single Instruction Multiple Data (SIMD) units and Simultaneous Multi-Threading (SMT) can impact the overall performance. However, a closer look shows that while SIMD improves the performance of compute-bound regions, SMT does not show a clear benefit for HPC workloads.

Fabio Banchelli, Marta Garcia-Gasulla, Guillaume Houzeaux, and Filippo Mantovani. 2020. Benchmarking of state-of-the-art HPC Clusters with a Production CFD Code. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC ’20). Association for Computing Machinery, New York, NY, USA, Article 3, 1–11. DOI:https://doi.org/10.1145/3394277.3401847


Evaluating the Influence of Hemorheological Parameters on Circulating Tumor Cell Trajectory and Simulation Time

Sayan Roychowdhury, John Gounley, and Amanda Randles

Extravasation of circulating tumor cells (CTCs) occurs primarily in the microvasculature, where flow and cell interactions significantly affect the blood rheology. Capturing cell trajectory at this scale requires the coupling of several interaction models, leading to increased computational cost that scales as more cells are added or the domain size is increased. In this work, we focus on micro-scale vessels and study the influence of certain hemorheological factors, including the presence of red blood cell aggregation, hematocrit level, microvessel size, and shear rate, on the trajectory of a circulating tumor cell. We determine which of the aforementioned factors significantly affect CTC motion and identify those which can potentially be disregarded, thus reducing simulation time. We measure the effect of these elements by studying the radial CTC movement and runtime at various combinations of these hemorheological parameters. To accurately capture blood flow dynamics and single cell movement, we perform high-fidelity hemodynamic simulations at a sub-micron resolution using our in-house fluid dynamics solver, HARVEY. We find that increasing hematocrit increases the likelihood of tumor cell margination, which is exacerbated by the presence of red blood cell aggregation. As microvessel diameter increases, there is no major CTC movement towards the wall; however, including aggregation causes the CTC to marginate more quickly as the vessel size increases. Finally, as the shear rate is increased, the presence of aggregation has a diminished effect on tumor cell margination.

Sayan Roychowdhury, John Gounley, and Amanda Randles. 2020. Evaluating the Influence of Hemorheological Parameters on Circulating Tumor Cell Trajectory and Simulation Time. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC ’20). Association for Computing Machinery, New York, NY, USA, Article 4, 1–10. DOI:https://doi.org/10.1145/3394277.3401848


Load Balancing in Large Scale Bayesian Inference

Daniel Wälchli, Sergio M. Martin, Athena Economides, Lucas Amoudruz, George Arampatzis, Xin Bian, and Petros Koumoutsakos

We present a novel strategy to improve load balancing for large-scale Bayesian inference problems. Load imbalance can be particularly destructive in generation-based uncertainty quantification (UQ) methods, since all compute nodes in a large-scale allocation have to synchronize after every generation and therefore remain in an idle state until the longest model evaluation finishes. Our strategy relies on the concurrent scheduling of independent Bayesian inference experiments while sharing a group of worker nodes, reducing the destructive effects of workload imbalance in population-based sampling methods.

To demonstrate the efficiency of our method, we infer parameters of a red blood cell (RBC) model. We perform a data-driven calibration of the RBC's membrane viscosity by applying hierarchical Bayesian inference methods. To this end, we employ a computational model to simulate the relaxation of an initially stretched RBC towards its equilibrium state. The results of this work advance upon the current state of the art towards realistic blood flow simulations by providing inferred parameters for the RBC membrane viscosity.

We show that our strategy achieves a notable reduction in imbalance and significantly improves effective node usage on 512 nodes of the CSCS Piz Daint supercomputer. Our results show that, by enabling multiple independent sampling experiments to run concurrently on a given allocation of supercomputer nodes, our method sustains a high computational efficiency in a large-scale supercomputing setting.

Daniel Wälchli, Sergio M. Martin, Athena Economides, Lucas Amoudruz, George Arampatzis, Xin Bian, and Petros Koumoutsakos. 2020. Load Balancing in Large Scale Bayesian Inference. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC ’20). Association for Computing Machinery, New York, NY, USA, Article 5, 1–12. DOI:https://doi.org/10.1145/3394277.3401849


Deploying Scientific AI Networks at Petaflop Scale on Secure Large Scale HPC Production Systems with Containers

David Brayford and Sofia Vallecorsa

There is an ever-increasing need for computational power to train complex artificial intelligence (AI) and machine learning (ML) models to tackle large scientific problems. High performance computing (HPC) resources are required to efficiently compute and scale complex models across tens of thousands of compute nodes. In this paper, we discuss the issues associated with the deployment of machine learning frameworks on large scale secure HPC systems and how we successfully deployed a standard machine learning framework on a secure large scale HPC production system, to train a complex three-dimensional convolutional GAN (3DGAN), with petaflop performance. 3DGAN is an example from the high energy physics domain, designed to simulate the energy pattern produced by showers of secondary particles inside a particle detector on various HPC systems.

David Brayford and Sofia Vallecorsa. 2020. Deploying Scientific AI Networks at Petaflop Scale on Secure Large Scale HPC Production Systems with Containers. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC ’20). Association for Computing Machinery, New York, NY, USA, Article 6, 1–8. DOI:https://doi.org/10.1145/3394277.3401850


Hardware Locality-Aware Partitioning and Dynamic Load-Balancing of Unstructured Meshes for Large-Scale Scientific Applications

Pavanakumar Mohanamuraly and Gabriel Staffelbach

We present TreePart, an open-source topology-aware hierarchical unstructured mesh partitioning and load-balancing tool. The framework provides powerful abstractions to automatically detect and build, at runtime, a hierarchical MPI topology resembling the hardware. Using this information, it intelligently chooses between shared- and distributed-memory parallel algorithms for partitioning and load-balancing. It provides a range of partitioning methods by interfacing with existing shared- and distributed-memory parallel partitioning libraries. It also provides powerful and scalable abstractions, such as one-sided distributed dictionaries and MPI-3 shared-memory-based halo communicators, for optimising HPC codes. The tool was successfully integrated into our in-house code, and we present results from a large-eddy simulation of a combustion problem.

Pavanakumar Mohanamuraly and Gabriel Staffelbach. 2020. Hardware Locality-Aware Partitioning and Dynamic Load-Balancing of Unstructured Meshes for Large-Scale Scientific Applications. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC ’20). Association for Computing Machinery, New York, NY, USA, Article 7, 1–10. DOI:https://doi.org/10.1145/3394277.3401851


Performance Evaluation of a Two-Dimensional Flood Model on Heterogeneous High-Performance Computing Architectures (Best Paper Award for PASC20)

Md Bulbul Sharif, Sheikh K. Ghafoor, Thomas M. Hines, Mario Morales-Hernández, Katherine J. Evans, Shih-Chieh Kao, Alfred J. Kalyanapu, Tigstu T. Dullo, and Sudershan Gangrade

This paper describes the implementation of a two-dimensional hydrodynamic flood model with two different numerical schemes on heterogeneous high-performance computing architectures. Both schemes solve the nonlinear hyperbolic shallow water equations using an explicit upwind first-order approach on finite differences and finite volumes, respectively, and were implemented using MPI and CUDA. Four different test cases were simulated on the Summit supercomputer at Oak Ridge National Laboratory. Both numerical schemes scaled up to 128 nodes (768 GPUs), with a maximum speedup of 98.2× over a single GPU. The lowest run time for the 10-day Hurricane Harvey event simulation at 5-meter resolution (272 million grid cells) was 50 minutes. GPUDirect communication proved to be more convenient than the standard communication strategy. Both strong and weak scaling are shown.

Md Bulbul Sharif, Sheikh K. Ghafoor, Thomas M. Hines, Mario Morales-Hernández, Katherine J. Evans, Shih-Chieh Kao, Alfred J. Kalyanapu, Tigstu T. Dullo, and Sudershan Gangrade. 2020. Performance Evaluation of a Two-Dimensional Flood Model on Heterogeneous High-Performance Computing Architectures. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC ’20). Association for Computing Machinery, New York, NY, USA, Article 8, 1–9. DOI:https://doi.org/10.1145/3394277.3401852


Urgent Supercomputing of Earthquakes: Use Case for Civil Protection

Josep de la Puente, Juan Esteban Rodriguez, Marisol Monterrubio-Velasco, Otilio Rojas, and Arnau Folch

Deadly earthquakes are events that are unpredictable, relatively rare and have a huge impact upon the lives of those who suffer their consequences. Furthermore, each earthquake has specific characteristics (location, magnitude, directivity) which, combined with local amplification and de-amplification effects, make their outcome very singular. Empirical relations are the main methodology used to make early assessments of an earthquake's impact. Nevertheless, the lack of sufficient data records for large events makes such approaches very imprecise. Physics-based simulators, on the other hand, are powerful tools that provide highly accurate shaking information. However, physical simulations require considerable computational resources, a detailed geological model, and accurate earthquake source information.

A better early assessment of the impact of earthquakes implies both technical and scientific challenges. We propose a novel HPC-based urgent seismic simulation workflow, hereafter referred to as Urgent Computing Integrated Services for EarthQuakes (UCIS4EQ), which can potentially deliver much more accurate short-time reports of the consequences of moderate to large earthquakes. UCIS4EQ is composed of four subsystems that are deployed as services and connected by means of a workflow manager. This paper describes those components and their functionality. The main objective of UCIS4EQ is to produce ground-shaking maps and other potentially useful information for civil protection agencies. The first demonstrator will be deployed in the framework of the Center of Excellence for Exascale in Solid Earth (ChEESE, cheese.coe.eu, last access: 12 Feb. 2020).

Josep de la Puente, Juan Esteban Rodriguez, Marisol Monterrubio-Velasco, Otilio Rojas, and Arnau Folch. 2020. Urgent Supercomputing of Earthquakes: Use Case for Civil Protection. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC ’20). Association for Computing Machinery, New York, NY, USA, Article 9, 1–8. DOI:https://doi.org/10.1145/3394277.3401853


k-Dispatch: A Workflow Management System for the Automated Execution of Biomedical Ultrasound Simulations on Remote Computing Resources

Marta Jaros, Bradley E. Treeby, Panayiotis Georgiou, and Jiri Jaros

Therapeutic ultrasound is increasingly being used for applications in oncology, drug delivery, and neurostimulation. In order to adapt the treatment procedures to patient needs, complex physical models have to be evaluated prior to the treatment. These models, however, require intensive computations that can only be satisfied by cloud and HPC facilities. Unfortunately, employing these facilities and executing the required computations is not straightforward even for experienced developers.

k-Dispatch is a novel workflow management system aimed at modelling biomedical ultrasound procedures using the open-source k-Wave acoustic toolbox. It allows ultrasound procedures to be uploaded with a single click and provides a notification when the result is ready for download. Inside k-Dispatch, there is a complex workflow management system which decodes the workflow graph, optimizes the workflow execution parameters, submits jobs to remote computing facilities, monitors their progress, and logs the consumed core hours. In this paper, the architecture and deployment of k-Dispatch are discussed, including the approach used for workflow optimization. A key innovation is the use of previous performance data to automatically select the utilised hardware and execution parameters. A review of related work is also given, including workflow management systems, batch schedulers, and cluster simulators.

Marta Jaros, Bradley E. Treeby, Panayiotis Georgiou, and Jiri Jaros. 2020. K-Dispatch: A Workflow Management System for the Automated Execution of Biomedical Ultrasound Simulations on Remote Computing Resources. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC ’20). Association for Computing Machinery, New York, NY, USA, Article 10, 1–10. DOI:https://doi.org/10.1145/3394277.3401854


A Smoothed Particle Hydrodynamics Mini-App for Exascale

Aurélien Cavelan, Rubén M. Cabezón, Michal Grabarczyk, and Florina M. Ciorba

Smoothed Particle Hydrodynamics (SPH) is a particle-based, meshfree, Lagrangian method used to simulate multidimensional fluids with arbitrary geometries, most commonly employed in astrophysics, cosmology, and computational fluid dynamics (CFD). It is expected that these computationally demanding numerical simulations will significantly benefit from the up-and-coming Exascale computing infrastructures, which will perform 10¹⁸ FLOP/s. In this work, we review the status of a novel SPH-EXA mini-app, which is the result of an interdisciplinary co-design project between the fields of astrophysics, fluid dynamics and computer science, whose goal is to enable SPH simulations to run on Exascale systems. The SPH-EXA mini-app merges the main characteristics of three state-of-the-art parent SPH codes (namely ChaNGa, SPH-flow, SPHYNX) with state-of-the-art (parallel) programming, optimization, and parallelization methods. The proposed SPH-EXA mini-app is a C++14 lightweight and flexible header-only code with no external software dependencies. Parallelism is expressed via multiple programming models, which can be chosen at compilation time with or without accelerator support, for a hybrid process+thread+accelerator configuration. Strong- and weak-scaling experiments on a production supercomputer show that the SPH-EXA mini-app can be efficiently executed with up to 267 million particles and up to 65 billion particles in total on 2,048 hybrid CPU-GPU nodes.

Aurélien Cavelan, Rubén M. Cabezón, Michal Grabarczyk, and Florina M. Ciorba. 2020. A Smoothed Particle Hydrodynamics Mini-App for Exascale. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC ’20). Association for Computing Machinery, New York, NY, USA, Article 11, 1–11. DOI:https://doi.org/10.1145/3394277.3401855


Aphros: High Performance Software for Multiphase Flows with Large Scale Bubble and Drop Clusters

Petr Karnakov, Fabian Wermelinger, Sergey Litvinov, and Petros Koumoutsakos

We present the high performance implementation of a new algorithm for simulating multiphase flows with bubbles and drops that do not coalesce. The algorithm is more efficient than the standard multi-marker volume-of-fluid method since the number of required fields does not depend on the number of bubbles. The capabilities of our methods are demonstrated on simulations of a foaming waterfall where we analyze the effects of coalescence prevention on the bubble size distribution and show how rising bubbles cluster up as foam on the water surface. Our open-source implementation enables high throughput simulations of multiphase flow, supports distributed as well as hybrid execution modes and scales efficiently on large compute systems.

Petr Karnakov, Fabian Wermelinger, Sergey Litvinov, and Petros Koumoutsakos. 2020. Aphros: High Performance Software for Multiphase Flows with Large Scale Bubble and Drop Clusters. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC ’20). Association for Computing Machinery, New York, NY, USA, Article 12, 1–10. DOI:https://doi.org/10.1145/3394277.3401856


Massive Scaling of MASSIF: Algorithm Development and Analysis for Simulation on GPUs

Anuva Kulkarni, Jelena Kovačević, and Franz Franchetti

Micromechanical Analysis of Stress-Strain Inhomogeneities with Fourier transforms (MASSIF) is a large-scale Fortran-based differential equation solver used to study local stresses and strains in materials. Due to its prohibitive memory requirements, it is extremely difficult to port the code to GPUs with small on-device memory. In this work, we present an algorithm design that uses domain decomposition with approximate convolution, which reduces memory footprint to make the MASSIF simulation feasible on distributed GPU systems. A first-order performance model of our method estimates that compression and multi-resolution sampling strategies can enable domain computation within GPU memory constraints for 3D grids larger than those simulated by the current state-of-the-art Fortran MPI implementation. The model analysis also provides an insight into design requirements for further scalability. Lastly, we discuss the extension of our method to irregular domain decomposition and challenges to be tackled in the future.

Anuva Kulkarni, Jelena Kovačević, and Franz Franchetti. 2020. Massive Scaling of MASSIF: Algorithm Development and Analysis for Simulation on GPUs. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC ’20). Association for Computing Machinery, New York, NY, USA, Article 13, 1–10. DOI:https://doi.org/10.1145/3394277.3401857


Eventify: Event-Based Task Parallelism for Strong Scaling

David Haensel, Laura Morgenstern, Andreas Beckmann, Ivo Kabadshow, and Holger Dachsel

Today's processors become fatter, not faster. However, the exploitation of these massively parallel compute resources remains a challenge for many traditional HPC applications regarding scalability, portability and programmability. To tackle this challenge, several parallel programming approaches such as loop parallelism and task parallelism are researched in the form of languages, libraries and frameworks. Task parallelism as provided by OpenMP, HPX, StarPU, Charm++ and Kokkos is the most promising approach to overcome the challenges of ever-increasing parallelism. The aforementioned parallel programming technologies enable scalability for a broad range of algorithms with coarse-grained tasks, e.g., in linear algebra and classical N-body simulation. However, they do not fully address the performance bottlenecks of algorithms with fine-grained tasks and the resultant large task graphs. Additionally, we experienced the description of large task graphs to be cumbersome with the common approach of providing in-, out- and inout-dependencies. We introduce event-based task parallelism to solve the performance and programmability issues for algorithms that exhibit fine-grained task parallelism and contain repetitive task patterns. With user-defined event lists, the approach provides a more convenient and compact way to describe large task graphs. Furthermore, we show how these event lists are processed by a task engine that reuses user-defined, algorithmic data structures. As a use case, we describe the implementation of a fast multipole method for molecular dynamics with event-based task parallelism. The performance analysis reveals that the event-based implementation is 52% faster than a classical loop-parallel implementation with OpenMP.

David Haensel, Laura Morgenstern, Andreas Beckmann, Ivo Kabadshow, and Holger Dachsel. 2020. Eventify: Event-Based Task Parallelism for Strong Scaling. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC ’20). Association for Computing Machinery, New York, NY, USA, Article 14, 1–10. DOI:https://doi.org/10.1145/3394277.3401858


A Scalable Framework for Numerical Simulation of Combustion in Internal Combustion Engines

Rahul Bale, Wei Hsiang Wang, Chung-Gang Li, Keiji Onishi, Kenji Uchida, Hidefumi Fujimoto, Ryoichi Kurose, and Makoto Tsubokura

Numerically modelling the multi-physics phenomenon of combustion is challenging, as it involves fluid flow, chemical reaction, phase change, energy release, etc. Combining numerical models for all these phenomena into a single solver while ensuring scalability and performance is a daunting task. Based on the building cube method (BCM), a hierarchical meshing technique, we present a numerical framework for modelling internal combustion engines. The framework efficiently combines a fully compressible flow solver, a chemical reaction and combustion model, a particle-in-cell based liquid spray model, and an immersed boundary method for geometry treatment. For the flow, the temperature fields, and the transport of reacting species, an all-speed Roe scheme is adopted for the discretization of the advective flux. The solver is coupled with the equilibrium chemical reaction library CANTERA to model combustion. The parcel model-based particle-source-in-cell (PSI-cell) method is adopted for modelling liquid fuel spray and its evaporation. Validation of the numerical framework is carried out using experimental data from a model internal combustion engine known as the Rapid Compression Machine (RCM). Evaluation of the framework with a strong scaling analysis shows good scalability.

Rahul Bale, Wei Hsiang Wang, Chung-Gang Li, Keiji Onishi, Kenji Uchida, Hidefumi Fujimoto, Ryoichi Kurose, and Makoto Tsubokura. 2020. A Scalable Framework for Numerical Simulation of Combustion in Internal Combustion Engines. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC ’20). Association for Computing Machinery, New York, NY, USA, Article 15, 1–10. DOI:https://doi.org/10.1145/3394277.3401859


Low-Order Finite Element Solver with Small Matrix-Matrix Multiplication Accelerated by AI-Specific Hardware for Crustal Deformation Computation

Takuma Yamaguchi, Kohei Fujita, Tsuyoshi Ichimura, Akira Naruse, Jack C. Wells, Christopher J. Zimmer, Tjerk P. Straatsma, Muneo Hori, Lalith Maddegedara, and Naonori Ueda

This study proposes a fast low-order finite element solver for crustal deformation computations by applying Tensor Core, AI-specific hardware on a Volta GPU. Tensor Core can compute large matrix-matrix multiplications rapidly in half precision. We redesign a state-of-the-art solver algorithm so that lower-precision data types can be used and memory access costs can be reduced even when we use small matrices. With the proposed solver, we solved 13 billion degrees-of-freedom two-layered problems that mimicked the Earth's crust and mantle using 36 compute nodes of Summit. In the matrix-vector kernel, we obtained a 4.1-fold speedup over a standard kernel in a single-precision format. Our proposed solver increased the FLOP count of the entire solver; however, we reduced the time-to-solution by 1.7-fold since the Tensor Core provided a high effective performance.

Takuma Yamaguchi, Kohei Fujita, Tsuyoshi Ichimura, Akira Naruse, Jack C. Wells, Christopher J. Zimmer, Tjerk P. Straatsma, Muneo Hori, Lalith Maddegedara, and Naonori Ueda. 2020. Low-Order Finite Element Solver with Small Matrix-Matrix Multiplication Accelerated by AI-Specific Hardware for Crustal Deformation Computation. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC ’20). Association for Computing Machinery, New York, NY, USA, Article 16, 1–11. DOI:https://doi.org/10.1145/3394277.3401860