Stefano Markidis
Erwin Laure (Eds.)
Editorial Board
David Hutchison
Lancaster University, Lancaster, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Friedemann Mattern
ETH Zurich, Zürich, Switzerland
John C. Mitchell
Stanford University, Stanford, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Dortmund, Germany
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbrücken, Germany
LNCS 8759
Editors
Stefano Markidis
KTH Royal Institute of Technology
Stockholm
Sweden
Erwin Laure
KTH Royal Institute of Technology
Stockholm
Sweden
ISSN 0302-9743
ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-3-319-15975-1
ISBN 978-3-319-15976-8 (eBook)
DOI 10.1007/978-3-319-15976-8
Library of Congress Control Number: 2015932683
LNCS Sublibrary: SL1 Theoretical Computer Science and General Issues
Springer Cham Heidelberg New York Dordrecht London
© Springer International Publishing Switzerland 2015
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now
known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are
believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors
give a warranty, express or implied, with respect to the material contained herein or for any errors or
omissions that may have been made.
Printed on acid-free paper
Springer International Publishing AG Switzerland is part of Springer Science+Business Media
(www.springer.com)
Preface
January 2015
Stefano Markidis
Organization
EASC 2014 was organized by the European Commission-funded projects CRESTA
(Grant Agreement No. 287703, cresta-project.eu) and EPiGRAM (Grant Agreement No.
610598, epigram-project.eu), and by the Swedish e-Science Research Center SeRC
(e-science.se).
Steering Group
Erwin Laure
Stefano Markidis
William D. Gropp
Mark Parsons
Lorna Smith
Bastian Koller
Program Committee
Erwin Laure
Stefano Markidis
William D. Gropp
Satoshi Matsuoka
Mark Parsons
Lorna Smith
Daniel Holmes
Bastian Koller
Pavan Balaji
Jed Brown
Robert Clay
Roberto Gioiosa
Katie Antypas
Leroy Drummond-Lewis
Alec Johnson
Sponsoring Institutions
Cray Inc., Seattle, WA, USA
Mellanox Technologies, Sunnyvale, CA, USA
Contents
Author Index
Towards Exascale
Scientific Applications
Abstract. GROMACS is a widely used package for biomolecular simulation, and over the last two decades it has evolved from small-scale
efficiency to advanced heterogeneous acceleration and multi-level parallelism targeting some of the largest supercomputers in the world. Here,
we describe some of the ways we have been able to realize this through
the use of parallelization on all levels, combined with a constant focus on
absolute performance. Release 4.6 of GROMACS uses SIMD acceleration
on a wide range of architectures, GPU offloading acceleration, and both
OpenMP and MPI parallelism within and between nodes, respectively.
The recent work on acceleration made it necessary to revisit the fundamental algorithms of molecular simulation, including the concept of
neighbor searching, and we discuss the present and future challenges we
see for exascale simulation, in particular a very fine-grained task parallelism. We also discuss the software management, code peer review and
continuous integration testing required for a project of this complexity.
Introduction
S. Páll et al.
computing has potential to take simulation to new heights, but the combination
of challenges that face software preparing for deployment at the exascale to
deliver these results are unique in the history of software. The days of simply
buying new hardware with a faster clock rate and getting shorter times to solution with old software are gone. The days of running applications on a single
core are gone. The days of heterogeneous processor design to suit floating-point
computation are back again. The days of performance being bounded by the
time taken for floating-point computations are ending fast. The need to design
with multi-core and multi-node parallelization in mind at all points is here to
stay, which also means Amdahl's law [3] is more relevant than ever.1
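For concreteness, Amdahl's law for a program with parallelizable fraction p run on n processors predicts a speedup of S(n) = 1 / ((1 - p) + p/n). The following sketch (our own illustration, not from the paper; the function name is our choice) evaluates it:

```python
def amdahl_speedup(p, n):
    """Amdahl's law: speedup on n processors of a program whose
    parallelizable fraction is p (0 <= p <= 1)."""
    return 1.0 / ((1.0 - p) + p / n)

# Even a 1% serial fraction caps the speedup well below 100x,
# no matter how many processors are available.
print(round(amdahl_speedup(0.99, 1024), 1))
```

Note how the serial term (1 - p) dominates as n grows, which is exactly why minimizing the sequential part matters more than adding cores.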
A particular challenge for biomolecular simulations is that the computational
problem size is fixed by the geometric size of the protein and the atomic-scale
resolution of the model physics. Most life science problems can be reduced to
this size (or smaller). It is possible to simulate much larger systems, but it is
typically not relevant. Second, the timescale of dynamics involving the entire
system increases much faster than the length scale, due to the requirement of
sampling the exponentially larger number of ensemble microstates. This means
that weak scaling is largely irrelevant for life science; to make use of increasing
amounts of computational resources to simulate these systems, we have to rely
either on strong-scaling software engineering techniques, or ensemble simulation
techniques.
The fundamental algorithm of molecular dynamics assigns positions and
velocities to every particle in the simulation system, and specifies the model
physics that governs the interactions between particles. The forces can then
be computed, which can be used to update the positions and velocities via
Newton's second law, using a given finite time step. This numerical integration scheme is iterated a large number of times, and it generates a series of
samples from the thermodynamic ensemble defined by the model physics. From
these samples, observations can be made that confirm or predict experiment.
Typical model physics have many components to describe the different kinds
of bonded and non-bonded interactions that exist. The non-bonded interactions
between particles model behaviour like van der Waals forces, or Coulomb's law.
The non-bonded interactions are the most expensive aspects of computing the
forces, and the subject of a very large amount of research, computation and
optimization.
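The iteration described above can be sketched in a few lines. This is an illustrative toy only (1D particles, a made-up spring-like pairwise force, names of our own choosing), not GROMACS code:

```python
def forces(x, k=1.0):
    """Toy pairwise force in 1D: every pair of particles attracts like a
    spring. Real model physics has bonded and non-bonded terms instead."""
    return [sum(-k * (x[i] - x[j]) for j in range(len(x)) if j != i)
            for i in range(len(x))]

def velocity_verlet(x, v, dt, n_steps, m=1.0):
    """Iterate Newton's second law with a finite time step dt
    (velocity-Verlet: half-kick, drift, recompute forces, half-kick)."""
    f = forces(x)
    for _ in range(n_steps):
        v = [vi + 0.5 * dt * fi / m for vi, fi in zip(v, f)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        f = forces(x)
        v = [vi + 0.5 * dt * fi / m for vi, fi in zip(v, f)]
    return x, v
```

In the real code, the `forces` step dominates the cost and is where all the non-bonded kernel optimization discussed below is spent.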
Historically, the GROMACS molecular dynamics simulation suite has aimed
at being a general-purpose tool for studying biomolecular systems, such as shown
in Fig. 1. The development of the simulation engine focused heavily on maximizing single-core floating-point performance of its innermost compute kernels for
non-bonded interactions. These kernels typically compute the electrostatic and
van der Waals forces acting on each simulation particle from its interactions with
all others inside a given spherical boundary. These kernels were first written in C,
1
Amdahl's law gives a model for the expected (and maximum) speedup of a program
when parallelized over multiple processors with respect to the serial version. It states
that the achievable speedup is limited by the sequential part of the program.
then FORTRAN, and later optimized in assembly language, mostly for commodity x86-family processors, because the data dependencies of the computations in
the kernels were too challenging for C or FORTRAN compilers (then or now).
The kernels were also specialized for interactions within and between water molecules, because of the prevalence of such interactions in biomolecular simulations.
From one point-of-view, this extensive use of interaction-specific kernels can be
seen as a software equivalent of application-specific integrated circuits.
Recognizing the need to build upon this good work by coupling multiple
processors, GROMACS 4.0 [14] introduced a minimal-communication neutral
territory domain-decomposition (DD) algorithm, [7,8] with fully dynamic load
balancing. This spatial decomposition of the simulation volume created
high-level data parallelism that was effective for near-linear scaling of the computation at around 400 atoms per core. The DD implementation required the
use of MPI for message-passing parallel constructs. However, the needs of many
simulation users can be met within a single node, [23] and in that context the
implementation overhead of MPI libraries was too high, not to mention it is
difficult to employ in distributed computing. In GROMACS 4.5, [20] we implemented a multi-threaded MPI library with the necessary subset of the MPI
API. The library has both POSIX and Windows threads back-ends (hence called
thread-MPI) and uses highly efficient hardware-supported atomic and lock-free
synchronization primitives. This allows the existing DD implementation to work
across multiple cores of a single node without depending on any external MPI
library.
However, the fundamental limitation remained of a one-to-one mapping of
MPI ranks to cores, and to domains. On the one hand, there is always a limit to
how small a spatial domain can be, which will limit the number of domains the
simulation box can be decomposed into, which in turn limits the number of cores
that a parallelization with such a mapping can utilize. On the other hand, the
one-to-one domains to cores mapping is cache-friendly as it creates independent
data sets so that cores sharing caches can act without conict, but the size of
the volume of data that must be communicated so that neighboring domains
act coherently grows rapidly with the number of domains. This approach is only
scalable for a fixed problem size if the latency of communication between all
cores is comparable and the communication book-keeping overhead grows only
linearly with the number of cores. Neither is true, because network latencies
are orders of magnitude higher than shared-cache latencies. This is clearly a
major problem for designing for the exascale, where many cores, many nodes
and non-uniform memory and communication latencies will be key attributes.
The other important aspect of the target simulations for designing for strong
scaling is treating the long-range components of the atomic interactions. Many
systems of interest are spatially heterogeneous on the nanometer scale (e.g. proteins embedded in membranes and solvated in water), and the simulation artefacts caused by failing to treat the long-range effects are well known. The de
facto standard for treating the long-range electrostatic interactions has become
the smooth particle-mesh Ewald (PME) method, [12] whose cost for N atoms
S. P
all et al.
scales as N log(N). A straightforward implementation where each rank of a parallel computation participates in an equivalent way leads to a 3D Fast Fourier
Transform (FFT) that communicates globally. This communication quickly limits the strong scaling. To mitigate this, GROMACS 4.0 introduced a multiple-program multiple-data (MPMD) implementation that dedicates some ranks to
the FFT part; now only those ranks do all-to-all FFT communication.
GROMACS 4.5 improved further by using a 2D pencil decomposition [11,16] in
reciprocal space, within the same MPMD implementation. This coarse-grained
task parallelism works well on machines with homogeneous hardware, but it is
harder to port to accelerators or combine with RDMA constructs.
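The MPMD idea can be illustrated by how ranks might be partitioned into the two programs (a sketch with names and a fixed 1:3 ratio chosen by us for illustration; GROMACS tunes the PP/PME split itself):

```python
def split_ranks(n_ranks, pme_fraction=0.25):
    """Dedicate a fraction of the MPI ranks to PME/FFT work; the rest
    compute short-range particle-particle (PP) forces. Only the PME
    group then participates in the all-to-all FFT communication."""
    n_pme = max(1, round(n_ranks * pme_fraction))
    pp = list(range(n_ranks - n_pme))
    pme = list(range(n_ranks - n_pme, n_ranks))
    return pp, pme

pp, pme = split_ranks(16)
print(len(pp), len(pme))   # 12 4
```

Shrinking the group that performs the global transpose is what delays the communication bottleneck to higher rank counts.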
The transformation of GROMACS needed to perform well on exascale-level
parallel hardware began after GROMACS 4.5. This requires radical algorithm
changes, and better use of parallelization constructs from the ground up, not as
an afterthought. More hands are required to steer the project, and yet the old
functionality written before their time must generally be preserved. Computer
architectures are evolving rapidly, and no single developer can know the details
of all of them. In the following sections we describe how we are addressing some
of these challenges, and our ongoing plans for addressing others.
2
2.1
Modern computer hardware is not only parallel, but exposes multiple levels of
parallelism depending on the type and speed of data access and communication
capabilities across different compute elements. For a modern superscalar CPU
such as Intel Haswell, even a single core is equipped with 8 different execution
ports, and it is not even possible to buy a single-core chip. Add hardware threads,
complex communication crossbars, memory hierarchies, and caches larger than
hard disks from the 1990s. This results in a complex hierarchical organization
of compute and communication/network elements from SIMD units and caches
to network topologies, each level in the hierarchy requiring a different type of
software parallelization for efficient use. HPC codes have traditionally focused on
only two levels of parallelism: intra-node and inter-node. Such codes typically rely
solely on MPI parallelization to target parallelism on multiple levels: intra-socket, intra-node, and inter-node. This approach had obvious advantages before
the multi-core and heterogeneous computing era, when improvements came from
CPU frequency scaling and evolution of interconnects. However, nowadays most
scientific problems require a complex parallel software architecture to be able to use
petaflop hardware efficiently, and going toward exascale this is becoming a necessity. This is particularly true for molecular dynamics, which requires reducing the
wall-time per iteration to improve simulation performance.
On the lowest level, processors typically contain SIMD (single instruction
multiple data) units which offer fine-grained data parallelism through silicon
dedicated to executing a limited set of instructions on multiple, currently typically
4-16, data elements simultaneously. Exploiting this low-level and fine-grained parallelism has become crucial for achieving high performance, especially with new
architectures like AVX and Intel MIC supporting wide SIMD. One level higher,
multi-core CPUs have become the standard and several architectures support multiple hardware threads per core. Hence, typical multi-socket SMP machines come
with dozens of cores capable of running 2-4 threads each (through simultaneous
multi-threading, SMT, support). Simply running multiple processes (MPI ranks)
on each core or hardware thread is typically less efficient than multi-threading.
Achieving strong scaling in molecular dynamics requires efficient use of the cache
hierarchy, which makes the picture even more complex. On the other hand, a chip
cannot be considered a homogeneous cluster either. Accelerator coprocessors like
GPUs or Intel MIC, often referred to as many-core, add another layer of complexity to the intra-node parallelism. These require fine-grained parallelism and
S. P
all et al.
carefully tuned data access patterns, as well as special programming models. Current accelerator architectures like GPUs also add another layer of interconnect in
form of PCIe bus (Peripheral Component Interconnect Express) as well as a separate main memory. This means that data movement across the PCIe link often
limits overall throughput. Integration of traditional latency-oriented CPU cores
with throughput-oriented cores like those in GPUs or MIC accelerators is ongoing, but the cost of data movement between the different units will at least for the
foreseeable future be a factor that needs to be optimized for.
Typical HPC hardware exhibits non-uniform memory access (NUMA) behavior on the node level: accessing data from different CPUs or cores of CPUs
has a non-uniform cost. We started multithreading trials quite early with the
idea of easily achieving load balancing, but the simultaneous introduction of
NUMA suddenly meant a processor resembled a cluster internally. Indiscriminately accessing memory across NUMA nodes will frequently lead to performance that is lower than for MPI. Moreover, the NUMA behavior extends to
other compute and communication components: the cost of communicating with
an accelerator or through a network interface typically depends on the intra-node
bus topology and requires special attention. On the top level, the interconnect
links together compute nodes into a network topology. A side-effect of the multi-core evolution is that, while the network capacity (latency and bandwidth) per
compute node has improved, the typical number of CPU cores they serve has
increased faster; the capacity available per core has decreased substantially.
In order to exploit the capabilities of each level of hardware parallelism,
a performance-oriented application needs to consider multiple levels of parallelism: SIMD parallelism for maximizing single-core/thread performance; multithreading to exploit advantages of multi-core and SMT (e.g. fast data sharing);
inter-node communication-based parallelism (e.g. message passing with MPI);
and heterogeneous parallelism by utilizing both CPUs and accelerators like GPUs.
Driven by this evolution of hardware, we have initiated a redesign of
the parallelization in GROMACS. In particular, recent efforts have focused on
improvements targeting all levels of parallelization: new algorithms for wide
SIMD and accelerator architectures, a portable and extensible SIMD parallelization framework, efficient multi-threading throughout the entire code, and
an asynchronous offload model for accelerators. The resulting multi-level parallelization scheme implemented in GROMACS 4.6 is illustrated in Fig. 2. In the
following sections, we will give an overview of these improvements, highlighting
the advances they provide in terms of making ecient use of current petascale
hardware, as well as in paving the road towards exascale computing.
2.2
SIMD Parallelism
All modern CPU and GPU architectures use SIMD-like instructions to achieve
high flop rates. Any computational code that aims for high performance will
have to make use of SIMD. For very regular work, such as matrix-vector multiplications, the compiler can generate good SIMD code, although manually tuned
vendor libraries typically do even better. But for irregular work, such as short-range particle-particle non-bonded interactions, the compiler usually fails since
it cannot control data structures. If you think your compiler is really good at
optimizing, it can be an eye-opening experience to look at the raw assembly
instructions actually generated. In GROMACS, this was reluctantly recognized
a decade ago and SSE and Altivec SIMD kernels were written manually in assembly. These kernels were, and still are, extremely efficient for interactions involving
water molecules, but other interactions do not parallelize well with SIMD using
the standard approach of unrolling a particle-based Verlet-list [25].
It is clear that a different approach is needed in order to use wide SIMD execution units like AVX or GPUs. We developed a novel approach, where particles
are grouped into spatial clusters containing a fixed number of particles [17]. First,
the particles are placed on a grid in the x and y dimensions, and then binned
in the z dimension. This efficiently groups particles that are close in space, and
permits the construction of a list of clusters, each containing exactly M particles.
A list is then constructed of all those cluster pairs containing particles that may
Fig. 2. Illustration of multi-level parallelism in GROMACS 4.6. This exploits several kinds of fine-grained data parallelism, a multiple-program multiple-data (MPMD)
decomposition separating the short-range particle-particle (PP) and long-range Particle Mesh Ewald (PME) force calculation algorithms, coarse-grained data parallelism
with domain-decomposition (DD) over MPI ranks (implemented either on single-node
workstations or compute clusters), and ensembles of related simulations scheduled e.g.
by a distributed computing controller.
be close enough to interact. This list of pairs of interacting clusters is reused over
multiple successive evaluations of the non-bonded forces. The list is constructed
with a buffer to prevent particle diffusion corrupting the implementation of the
model physics.
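A simplified sketch of the cluster and pair-list construction (our own simplification: sorting by coordinates stands in for the real x/y gridding and z binning, and padding of a final partial cluster is ignored):

```python
def build_clusters(pos, m=4):
    """Group particles into clusters of m by sorting on (x, y, z); the
    real code grids in x/y and bins in z, which this approximates."""
    order = sorted(range(len(pos)), key=lambda i: pos[i])
    return [order[k:k + m] for k in range(0, len(order), m)]

def dist2(a, b):
    return sum((ca - cb) ** 2 for ca, cb in zip(a, b))

def build_pair_list(pos, clusters, cutoff, buffer=0.1):
    """List cluster pairs that may interact within cutoff + buffer; the
    buffer keeps the list valid over several steps of particle diffusion."""
    r2 = (cutoff + buffer) ** 2
    return [(a, b)
            for a in range(len(clusters))
            for b in range(a, len(clusters))
            if any(dist2(pos[i], pos[j]) <= r2
                   for i in clusters[a] for j in clusters[b])]
```

With two well-separated groups of four particles and a cutoff of 1, only the two intra-group cluster pairs survive, so no time is spent on distant interactions.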
The kernels that implement the computation of the interactions between two
clusters i and j use SIMD load instructions to fill vector registers with copies
of the positions of all M particles in i. The loop over the N particles in j
is unrolled according to the SIMD width of the CPU. Inside this loop, SIMD
load instructions fill vector registers with positions of all N particles from the
j cluster. This permits the computation of N interactions between an i and all
j particles simultaneously, and the computation of M × N interactions in the
inner loop without needing to load particle data. With wide SIMD units it is
efficient to process more than one j cluster at a time.
M, N, and the number of j clusters to process can be adjusted to suit the
underlying characteristics of the hardware. Using M = 1 and N = 1 recovers the
original Verlet-list algorithm. On CPUs, GROMACS uses M = 4 and N = 2,
4 or 8, depending on the SIMD width. On NVIDIA GPUs, we use M = 8 and
N = 4 to calculate 32 interactions at once with 32 hardware threads executing in
lock-step. To further improve the ratio of arithmetic to memory operations when
using GPUs, we add another level of hierarchy by grouping 8 clusters together.
Thus we store 64 particles in shared memory and calculate interactions with
about half of these for every particle in the cluster-pair list.
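In plain scalar Python (standing in for the SIMD registers), the cluster-pair kernel's structure looks roughly like this; the inverse-square "interaction" is a placeholder of our own, not a real potential:

```python
def cluster_kernel(cl_i, cl_j, cutoff):
    """Compute all M x N interactions between clusters cl_i and cl_j
    (lists of 3D positions). In the real kernel the inner loop is one
    SIMD operation over N lanes, and out-of-range pairs contribute
    zero instead of being skipped by branching."""
    c2 = cutoff * cutoff
    f_i = [0.0] * len(cl_i)
    for a, pa in enumerate(cl_i):        # M particles held in registers
        for pb in cl_j:                  # N-wide SIMD lane in the real code
            r2 = sum((x - y) ** 2 for x, y in zip(pa, pb))
            f_i[a] += 1.0 / r2 if 0.0 < r2 <= c2 else 0.0
    return f_i

print(cluster_kernel([(0, 0, 0)], [(1, 0, 0), (3, 0, 0)], 2.0))   # [1.0]
```

The branch-free accumulation of zeros for out-of-range pairs is the software analogue of SIMD lanes that always execute: wasted arithmetic is cheaper than divergence.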
The kernel implementations reach about 50% of the peak flop rate on all
supported hardware, which is very high for MD. This comes at the cost of calculating about twice as many interactions as required; not all particle pairs in all
cluster pairs will be within the cut-off at each time step, so many interactions are
computed that are known to produce a zero result. The extra zero interactions
can actually be put to use as an effective additional pair-list buffer, on top of
the standard Verlet list buffer. As we have shown here, this scheme is
flexible, since N and M can be adapted to current and future hardware. Most
algorithms and optimization tricks that have been developed for particle-based
pair lists can be reused for the cluster-pair list, although many will not improve
the performance.
The current implementation of the cluster-based non-bonded algorithm already supports a wide range of SIMD instruction sets and accelerator architectures:
SSE2, SSE4.1, AVX (256-bit and 128-bit with FMA), AVX2, BG/Q QPX, Intel
MIC (LRBni), and NVIDIA CUDA. An implementation on a field-programmable
gate array (FPGA) architecture is in progress.
Multi-threaded Parallelism.
Before GROMACS 4.6, we relied mostly on MPI for both inter-node and intra-node parallelization over CPU cores. For MD this has worked well, since there is
little data to communicate and at medium to high parallelization all data fits in
L2 cache. Our initial plans were to only support OpenMP parallelization in the
separate Particle Mesh Ewald (PME) MPI ranks. The reason for using OpenMP
in PME was to reduce the number of MPI ranks involved in the costly collective
communication for the FFT grid transpose. This 3D-FFT is the only part of
the code that involves global data dependencies. Although this indeed greatly
reduced the MPI communication cost, it also introduced significant overhead.
GROMACS 4.6 was designed to use OpenMP in all compute-intensive parts
of the MD algorithm.2 Most of the algorithms are straightforward to parallelize
using OpenMP. These scale very well, as Fig. 3 shows. Cache-intensive parts of
the code like performing domain decomposition, or integrating the forces and
velocities show slightly worse scaling. Moreover, the scaling in these parts tends
to deteriorate as the number of threads in an MPI rank grows, especially
when teams of OpenMP threads cross NUMA boundaries. When simulating at
high ratios of cores/particles, each MD step can take as little as a microsecond.
Each of the many OpenMP barriers in the code paths parallelized with OpenMP
takes a few microseconds, which can be costly at that scale.
Accordingly, the hybrid MPI + OpenMP parallelization is often slower than
an MPI-only scheme as Fig. 3 illustrates. Since PP (particle-particle) ranks only
do low-volume local communication, the reduction in MPI communication from
using the hybrid scheme is apparent only at high parallelization. There, MPI-only
parallelization (e.g. as in GROMACS 4.5) puts a hard upper limit on the number
of cores that can be used, due to algorithmic limits on the spatial domain size, or
the need to communicate with more than one nearest neighbor. With the hybrid
scheme, more cores can operate on the same spatial domain assigned to an MPI
rank, and there is no longer a hard limit on the parallelization. Strong scaling
curves now extend much further, with a more gradual loss of parallel efficiency.
An example is given in Fig. 4, which shows a membrane protein system scaling
to twice as many cores with hybrid parallelization, reaching double the peak
performance of GROMACS 4.5. In some cases, OpenMP-only parallelization can
be much faster than MPI-only parallelization if the load for each stage of the
force computation can be balanced individually. A typical example is a solute in
solvent, where the solute has bonded interactions but the solvent does not. With
OpenMP, the bonded interactions can be distributed equally over all threads in
a straightforward manner.
Heterogeneous Parallelization.
Heterogeneous architectures combine multiple types of processing units, typically
latency- and throughput-oriented cores most often CPUs and accelerators like
GPUs, Intel MIC, or FPGAs. Many-core accelerator architectures have been
become increasingly popular in technical and scientic computing mainly due to
their impressive raw oating point performance. However, in order to eciently
utilize these architectures, a very high level of ne-grained parallelism is required.
2
At the time of that decision, sharing a GPU among multiple MPI ranks was inefficient, so the only efficient way to use multiple cores in a node was with OpenMP
within a rank. This constraint has since been relaxed.
Fig. 4. Improvements in strong scaling performance since GROMACS 4.5, using the
M × N kernels and OpenMP parallelization in GROMACS 4.6. The plot shows simulation performance in ns/day for different software versions and parallelization schemes.
Performance with one core per MPI rank is shown for GROMACS 4.5 (purple) and 4.6
(black). Performance with GROMACS 4.6 is shown using two (red) and four (green)
cores per MPI rank using OpenMP threading within each MPI rank. Simulations were
carried out on the Triolith cluster at NSC, using two 8-core Intel E5-2660 (2.2 GHz
Sandy Bridge) processors per node and FDR Infiniband network. The test system
is the GLIC membrane protein shown in Fig. 1 (144,000 atoms, PME electrostatics).
(Colour figure online).
some cases slowdown) over the fast performance on multi-core CPUs, thanks to
the highly tuned SIMD assembly kernels.
With this experience, we set out to provide native GPU support in GROMACS
4.6 with a few important design principles in mind. Building on the observation
that highly optimized CPU code is hard to beat, our goal was to ensure that all
compute resources available, both CPU and accelerators, are utilized to the greatest extent possible. We also wanted to ensure that our heterogeneous GPU acceleration supported most existing features of GROMACS in a single code base to
avoid having to reimplement major parts of the code for GPU-only execution.
This means that the most suitable parallelization is the offload model, which other
MD codes have also employed successfully [9,18]. As Fig. 5 illustrates, we aim to
execute the compute-intensive short-range non-bonded force calculation on GPU
accelerators, while the CPU computes bonded and long-range electrostatic
forces, because the latter are communication intensive.
arbitrary simulation box shapes and virtual interaction sites are all supported
(Fig. 6). Even though the overhead of managing an accelerator is non-negligible,
GROMACS 4.6 shows great strong scaling in GPU accelerated runs reaching
126 atoms/core (1260 atoms/GPU) on common simulation systems (Fig. 7).
Based on a similar parallelization design, the upcoming GROMACS version
will also support the Intel MIC accelerator architecture. Intel MIC supports
native execution of standard MPI codes using the so-called symmetric mode,
where the card is essentially treated as a general-purpose multi-core node. However, as MIC is a highly parallel architecture requiring fine-grained parallelism,
many parts of typical MPI codes will be inefficient on these processors. Hence,
efficient utilization of Xeon Phi devices in molecular dynamics, especially with
typical bio-molecular simulations and strong scaling in mind, is only possible
by treating them as accelerators. Similarly to GPUs, this means a parallelization
scheme based on offloading only those tasks that are suitable for wide SIMD and
highly thread-parallel execution to MIC.
2.3
Ensemble Simulations
The performance and scaling advances in GROMACS (and many other programs) have made it efficient to run simulations that were simply too large only
a few years ago. However, infrastructures such as the European PRACE provide
access only to problems that scale to thousands of cores. This used to be an
impossible barrier for biomolecular dynamics on anything but ridiculously large
systems when an implementation could only run well with hundreds of particles per core. Scaling has improved, but the number of computational units in
supercomputers is growing even faster. There are now multiple machines in the
world that reach roughly a million cores. Under ideal conditions, GROMACS
can scale to levels where each PP rank handles 40 atoms, but there are few if
any concrete biological problems that require 40 million atoms without corresponding increases in the number of samples generated. Even in the theoretical
case where we could improve scaling to the point where each core only contains
a single atom, the simulation system would still be almost an order of magnitude
larger than the example in Fig. 1.
To adapt to this reality, researchers are increasingly using large ensembles of
simulations, either to simply sample better, or to use new algorithms such as replica
exchange simulation, [24] Markov state models, [22] or milestoning [13] that
analyze and exchange data between multiple simulations to improve overall
sampling. In many cases, this achieves as much as two-fold superscaling, i.e.,
an ensemble of 100 simulations running on 10 nodes each might provide the
same sampling efficiency as a single simulation running on 2000 cores. To automate this, GROMACS has been co-developed with a new framework for Parallel
Adaptive Molecular Dynamics called Copernicus [19]. Given a set of input structures and sampling settings, this framework automatically starts a first batch
of sampling runs, makes sure all simulations complete (with extensive support
for checkpointing and restarting of failed runs), and automatically performs the
adaptive step data analysis to decide what new simulations to start in a second generation. The current ensemble sampling algorithms scale to hundreds
S. Páll et al.
Fig. 5. GROMACS heterogeneous parallelization using both CPU and GPU resources
during each simulation time-step. The compute-heavy non-bonded interactions are
offloaded to the GPU, while the CPU is responsible for domain-decomposition bookkeeping, bonded force calculation, and lattice summation algorithms. The diagram
shows tasks carried out during a GPU-accelerated normal MD step (black arrows) as
well as a step in which the additional pair-search and domain-decomposition
tasks are carried out (blue arrows). The latter, as shown above in blue, also includes
an additional transfer, and the subsequent pruning of the pair list as part of the non-bonded kernel (Colour figure online).
Source: http://dx.doi.org/10.6084/m9.figshare.971161. Reused under CC-BY; retrieved 22:15, March 23, 2014 (GMT).
Fig. 6. An important feature of the current heterogeneous GROMACS GPU implementation is that it works, and works efficiently, in combination with most other features of
the software. GPU simulations can employ domain decomposition, non-standard boxes,
pressure scaling, and virtual interaction sites to significantly improve the absolute simulation performance compared to the baseline. Simulation system: RNAse protein solvated in rectangular (24 K atoms) and rhombic dodecahedron (16.8 K atoms) boxes, PME
electrostatics, cut-off 0.9 nm. Hardware: 2x Intel Xeon E5650 (2.67 GHz Westmere), 2x
NVIDIA Tesla C2070 (Fermi) GPU accelerators.
Achieving strong scaling to a higher core count for a fixed-size problem requires
careful consideration of load balance. The advantage provided by spatial DD is
one of data locality and reuse, but if the distribution of computational work is not
homogeneous then more care is needed. A typical membrane protein simulation
is dominated by
– water, which is usually treated with a rigid 3-point model,
– a lipid membrane, whose alkyl tails are modeled by particles with zero partial
charge and bonds of constrained length, and
– a protein, which is modeled with a backbone of fixed-length bonds that require
a lengthy series of constraint calculations, as well as partial charge on all
particles.
These problems are well known, and are addressed in the GROMACS DD
scheme via automatic dynamic load balancing that distributes the spatial volumes unevenly according to the observed imbalance in compute load. This approach has limitations because it works at the level of DD domains that must map
to MPI ranks, so cores within the same node or socket have unnecessary copies
of the same data. We have not yet succeeded in developing a highly effective
intra-rank decomposition of work to multiple cores. We hope to address this via
intra-node or intra-socket task parallelism.
One advantage of the PME algorithm as implemented in GROMACS is that it
is possible to shift the computational workload between the real- and reciprocal-space parts of the algorithm at will. This makes it possible to write code that
can run optimally at different settings on different kinds of hardware. The performance of the compute, communication, and bookkeeping parts of the overall
algorithm varies greatly with the characteristics of the hardware that implements
it, and with the properties of the simulation system studied. For example,
shifting compute work from reciprocal to real space to make better use of an
idle GPU increases the volume that must be communicated during DD, while
lowering the required communication volume during the 3D FFTs. Evaluating
how best to manage these compromises can only happen at runtime.
The MPMD version of PME is intended to reduce the overall communication
cost on typical switched networks by minimizing the number of ranks participating in the 3D FFTs. This requires generating a mapping between PME and
non-PME ranks and scheduling data transfer to and from them. However, on
hardware with relatively efficient implementations of global communication, it
can be advantageous to prefer the SPMD implementation because it has more
regular communication patterns [2]. The same may be true on architectures with
accelerators, because the MPMD implementation makes no use of the accelerators on the PME ranks. The performance of both implementations is limited by
lack of overlap of communication and computation.
Attempts to use low-latency partitioned global address space (PGAS) methods that require single-program multiple-data (SPMD) approaches are particularly challenged, because the gain from any decrease in communication latency
must also overcome the overall increase in communication that accompanies the
MPMD-to-SPMD transition [21]. The advent of implementations of non-blocking
collective (NBC) MPI routines is promising if computation can be found to overlap with the background communication. The most straightforward approach
would be to revert to SPMD and hope that the increase in total communication cost is offset by the gain in available compute time; however, the available
performance is still bounded by the overall cost of the global communication.
Finding compute to overlap with the NBC on the MPMD PME ranks is likely
to deliver better results. Permitting PME ranks to execute kernels for bonded
and/or non-bonded interactions from their associated non-PME ranks is the
most straightforward way to achieve this overlap. This is particularly true at the
scaling limit, where the presence of bonded interactions is one of the primary
problems in balancing the compute load between the non-PME ranks.
The introduction of automatic ensemble computing introduces another
layer of decomposition, by which we essentially achieve MSMPMD parallelism:
Multiple-simulation (ensemble), multiple-program (direct/lattice space), and
multiple-data (domain decomposition).
2.5
doing only long-range work and GROMACS 4.6 doing only short-range work
on homogeneous systems of the same size were comparable, so we hope we can
deploy a working version in the future.
2.6
We plan to address some of the exascale-level strong-scaling problems mentioned above through the use of a more fine-grained task parallelism than what
is currently possible in GROMACS. Considerable technical challenges remain to
convert OpenMP-based data-parallel loop constructs into series of tasks that
are coarse enough to avoid spending lots of time scheduling work, and yet fine
enough to balance the overall load. Our initial plan is to experiment with the
cross-platform Thread Building Blocks (TBB) library [1], which can coexist with
OpenMP and deploy equivalent loop constructs in the early phases of development. Many alternatives exist; those that require the use of custom compilers, runtime environments, or language extensions are unattractive because that
increases the number of combinations of algorithm implementations that must
be maintained and tested, and compromises the high portability enjoyed by
GROMACS.
One particular problem that might be alleviated with fine-grained task parallelism is the cost of the communication required during the integration
phase. Polymers such as protein backbones are modeled with fixed-length bonds,
with at least two bonds per particle, which leads to coupled constraints that
domain decomposition spreads over multiple ranks. Iterating to satisfy those
constraints can be a costly part of the algorithm at high parallelism. Because
the spatial regions that contain bonded interactions are distributed over many
ranks, and the constraint computations cannot begin until after all the forces
for their atoms have been computed, the current implementation waits for all
forces on all ranks to be computed before starting the integration phase. The
performance of the post-integration constraint-satisfaction phase is bounded by the
latency of the multiple communication stages required. This means that ranks
that lack atoms with coupled bonded interactions, such as all those with only
water molecules, literally have nothing to do at this stage. In an ideal implementation, such ranks could contribute very early in each iteration to complete
all the tasks needed for the forces for the atoms involved in coupled bond constraints. Integration for those atoms could take place while forces for interactions
between unrelated atoms are being computed, so that there is computation to
do on all nodes while the communication for the constraint iteration takes place.
This kind of implementation would require considerably more flexibility in the
book-keeping and execution model, which is simply not present today.
3
3.1
The major part of the GROMACS code base has been around 1–1.5 million
lines of C code since version 4.0 (http://www.ohloh.net/p/gromacs). Ideally,
software engineering on such moderately large multi-purpose code bases would
take place within the context of effective abstractions [26]. For example, someone
developing a new integration algorithm should not need to pay any attention to
whether the parallelization is implemented by constructs from a threading library
(like POSIX threads), a compiler-provided threading layer (like OpenMP), an
external message-passing library (like MPI), or remote direct memory access
(like SHMEM). Equally, she/he should not need to know whether the kernels
that compute the forces they are using as inputs are running on any particular
kind of accelerator or CPU. Implementing such abstractions generally costs some
developer time, and some compute time. These are necessary evils if the software
is to be able to change as new hardware, new algorithms, or new implementations
emerge.
Considerable progress has been made in modularizing some aspects of the
code base to provide effective abstraction layers. For example, once the main
MD iteration loop has begun, the programmer does not need to know whether
the MPI layer is provided by an external library because the computation is
taking place on multiple nodes, or by the internal thread-based implementation
working to parallelize the computation on a single node. Portable abstract atomic
operations have been available as a side-effect of the thread-MPI development.
Integrators receive vectors of positions, velocities and forces without needing
to know the details of the kernels that computed the forces. The dozens of
non-bonded kernels can make portable SIMD function calls that compile to the
correct hardware operations automatically.
However, the size of the top-level function that implements the loop over time
steps has remained at about 1800 code and comment lines since 4.0. It remains
riddled with special-case conditions, comments, and function calls for different parallelization conditions, integration algorithms, optimization constructs,
housekeeping for communication and output, and ensemble algorithms. The
function that computes the forces is even worse, now that both the old and
new non-bonded kernel infrastructures are supported! The code complexity is
necessary for a general-purpose multi-architecture tool like GROMACS. However, needing to be aware of dozens of irrelevant possibilities is a heavy barrier
to participation in the project, because it is very difficult to understand all side
effects of a change.
To address this, we are in the process of a transition from C99 to C++98
for much of this high-level control code. While we remain alert to the possibility
that HPC compilers will not be as effective at compiling C++98 as they are
for C99, the impact on execution time of most of this code is negligible and the
impact on developer time is considerable.
Our expectation is that the use of virtual function dispatch will eliminate
much of the complexity of understanding conditional code (including switch
statements over enumerations that must be updated in widely scattered parts of
the code), despite a slightly slower implementation of the actual function call.
After all, GROMACS has long used a custom vtable-like implementation for run-time dispatch of the non-bonded interaction kernels. Objects managing resources
via RAII, exploiting compiler-generated destructor calls to do the right thing,
will lead to shorter development times and fewer problems because developers
have fewer things to manage. Templated container types will help alleviate the
burden of manual memory allocation and deallocation. Existing C++ testing
and mocking libraries will simplify the process of developing adequate testing
infrastructure, and existing task-parallelism support libraries such as Intel TBB
[1] will be beneficial.
It is true that some of these objectives could be met by re-writing in more
object-oriented C, but the prospect of off-loading some tedious tasks to the
compiler is attractive.
3.2
Version control is widely considered necessary for successful software development. GROMACS used CVS in its early days and now uses Git (git clone
git://git.gromacs.org/gromacs.git). The ability to trace when behavior changed
and find some metadata about why it might have changed is supremely valuable.
Coordinating the information about the desires of users and developers, known
problems, and progress with current work is an ongoing task that is difficult
with a development team scattered around the world and thousands of users
who rarely meet. GROMACS uses the Redmine issue-tracking system3 to discuss feature development, report and discuss bugs, and to monitor intended and
actual progress towards milestones. Commits in the git repository are expected to
reference Redmine issues where appropriate, which generates automatic HTML
cross-references to save people time finding information.
Peer review of scientific research is the accepted gold standard of quality
because of the need for specialist understanding to fully appreciate, value, criticize and improve the work. Software development on projects like GROMACS is
comparably complex, and our experience has been that peer review has worked
well there. Specifically, all proposed changes to GROMACS, even from the core
authors, must go through our Gerrit code-review website4, and receive positive
reviews from at least two other developers of suitable experience, before they
can be merged. User- and developer-level documentation must be part of the
same change. Requiring this review to happen before acceptance has eliminated
many problems before they could be felt. It also creates social pressure for people to be active in reviewing others' code, lest they have no karma with which
to get their own proposals reviewed. As features are implemented or bugs fixed,
3 http://redmine.gromacs.org.
4 http://gerrit.gromacs.org.
5 http://jenkins.gromacs.org.
with what is available. It is important that compilation should not fail when
configuration succeeded, because the end user is generally incapable of diagnosing what the problem was. A biochemist attempting to install GROMACS
on their laptop generally does not know that scrolling back through 100 lines of
output from recursive make calls is needed to find the original compilation error,
and even then they will generally need to ask someone else what the problem is
and how to resolve it. It is far more efficient for both users and developers to
detect during configuration that compilation will fail, and to provide suggested
solutions and guidance at that time. Accordingly, GROMACS uses the CMake
build system (http://www.cmake.org), primarily for its cross-platform support,
but makes extensive use of its high-level constructs, including sub-projects and
scoped variables.
3.3 Profiling
Future Directions
GROMACS has grown from an in-house simulation code into a large international software project, which now also has highly professional developer, testing
and profiling environments to match it. We believe the code is quite unique in
the extent to which it interacts with the underlying hardware, and while there
are many significant challenges remaining this provides a very strong base for
further extreme-scale computing development. However, scientific software is
rapidly becoming very dependent on deep technical computing expertise: many
amazingly smart algorithms are becoming irrelevant since they cannot be implemented efficiently on modern hardware, and the inherent complexity of this
hardware makes it very difficult even for highly skilled physicists and chemists
to predict what will work. It is similarly not realistic to expect every research
group to afford a resident computer expert, which will likely require both research
groups and computing centers to increasingly join efforts to create large open-source community codes where it is realistic to fund multiple full-time developers. In closing, the high-performance and extreme-scale computing landscape is
currently changing faster than it has ever done before. It is a formidable challenge for software to keep up with this pace, but the potential rewards of exascale
computing are equally large.
Acknowledgments. This work was supported by the European Research Council
(258980, BH), the Swedish e-Science Research Centre, and the EU FP7 CRESTA project
(287703). Computational resources were provided by the Swedish National Infrastructure for Computing (grants SNIC 025/12-32 & 2013-26/24) and the Leibniz Supercomputing Centre.
6 http://www.bsc.es/computer-sciences/extrae.
References
1. Intel Thread Building Blocks. https://www.threadingbuildingblocks.org
2. Abraham, M.J., Gready, J.E.: Optimization of parameters for molecular dynamics simulation using smooth particle-mesh Ewald in GROMACS 4.5. J. Comput. Chem. 32(9), 2031–2040 (2011)
3. Amdahl, G.M.: Validity of the single processor approach to achieving large scale computing capabilities. In: Proceedings of the Spring Joint Computer Conference, AFIPS 1967 (Spring), pp. 483–485. ACM, New York, NY, USA (1967). http://doi.acm.org/10.1145/1465482.1465560
4. Anderson, J.A., Lorenz, C.D., Travesset, A.: General purpose molecular dynamics simulations fully implemented on graphics processing units. J. Comput. Phys. 227, 5324–5329 (2008)
5. Andoh, Y., Yoshii, N., Fujimoto, K., Mizutani, K., Kojima, H., Yamada, A., Okazaki, S., Kawaguchi, K., Nagao, H., Iwahashi, K., Mizutani, F., Minami, K., Ichikawa, S.I., Komatsu, H., Ishizuki, S., Takeda, Y., Fukushima, M.: MODYLAS: a highly parallelized general-purpose molecular dynamics simulation program for large-scale systems with long-range forces calculated by Fast Multipole Method (FMM) and highly scalable fine-grained new parallel processing algorithms. J. Chem. Theory Comput. 9(7), 3201–3209 (2013). http://pubs.acs.org/doi/abs/10.1021/ct400203a
6. Arnold, A., Fahrenberger, F., Holm, C., Lenz, O., Bolten, M., Dachsel, H., Halver, R., Kabadshow, I., Gähler, F., Heber, F., Iseringhausen, J., Hofmann, M., Pippig, M., Potts, D., Sutmann, G.: Comparison of scalable fast methods for long-range interactions. Phys. Rev. E 88, 063308 (2013). http://link.aps.org/doi/10.1103/PhysRevE.88.063308
7. Bowers, K.J., Dror, R.O., Shaw, D.E.: Overview of neutral territory methods for the parallel evaluation of pairwise particle interactions. J. Phys. Conf. Ser. 16(1), 300 (2005). http://stacks.iop.org/1742-6596/16/i=1/a=041
8. Bowers, K.J., Dror, R.O., Shaw, D.E.: Zonal methods for the parallel execution of range-limited n-body simulations. J. Comput. Phys. 221(1), 303–329 (2007). http://dx.doi.org/10.1016/j.jcp.2006.06.014
9. Brown, W.M., Wang, P., Plimpton, S.J., Tharrington, A.N.: Implementing molecular dynamics on hybrid high performance computers – short range forces. Comput. Phys. Commun. 182, 898–911 (2011)
10. Eastman, P., Pande, V.S.: Efficient nonbonded interactions for molecular dynamics on a graphics processing unit. J. Comput. Chem. 31, 1268–1272 (2010)
11. Eleftheriou, M., Moreira, J.E., Fitch, B.G., Germain, R.S.: A volumetric FFT for BlueGene/L. In: Pinkston, T.M., Prasanna, V.K. (eds.) HiPC 2003. LNCS (LNAI), vol. 2913, pp. 194–203. Springer, Heidelberg (2003)
12. Essmann, U., Perera, L., Berkowitz, M.L., Darden, T., Lee, H., Pedersen, L.G.: A smooth particle mesh Ewald method. J. Chem. Phys. 103(19), 8577–8593 (1995)
13. Faradjian, A., Elber, R.: Computing time scales from reaction coordinates by milestoning. J. Chem. Phys. 120, 10880–10889 (2004)
14. Hess, B., Kutzner, C., van der Spoel, D., Lindahl, E.: GROMACS 4: algorithms for highly efficient, load-balanced, and scalable molecular simulation. J. Chem. Theory Comput. 4(3), 435–447 (2008)
15. Humphrey, W., Dalke, A., Schulten, K.: VMD: visual molecular dynamics. J. Mol. Graph. 14(1), 33–38 (1996)
16. Jagode, H.: Fourier transforms for the BlueGene/L communication network. Ph.D. thesis, The University of Edinburgh, Edinburgh, UK (2005)
17. Páll, S., Hess, B.: A flexible algorithm for calculating pair interactions on SIMD architectures. Comput. Phys. Commun. 184(12), 2641–2650 (2013). http://www.sciencedirect.com/science/article/pii/S0010465513001975
18. Phillips, J.C., Braun, R., Wang, W., Gumbart, J., Tajkhorshid, E., Villa, E., Chipot, C., Skeel, R.D., Kalé, L., Schulten, K.: Scalable molecular dynamics with NAMD. J. Comput. Chem. 26, 1781–1802 (2005)
19. Pronk, S., Larsson, P., Pouya, I., Bowman, G.R., Haque, I.S., Beauchamp, K., Hess, B., Pande, V.S., Kasson, P.M., Lindahl, E.: Copernicus: a new paradigm for parallel adaptive molecular dynamics. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2011, pp. 60:1–60:10. ACM, New York, NY, USA (2011). http://doi.acm.org/10.1145/2063384.2063465
20. Pronk, S., Páll, S., Schulz, R., Larsson, P., Bjelkmar, P., Apostolov, R., Shirts, M.R., Smith, J.C., Kasson, P.M., van der Spoel, D., Hess, B., Lindahl, E.: GROMACS 4.5: a high-throughput and highly parallel open source molecular simulation toolkit. Bioinformatics 29(7), 845–854 (2013). http://bioinformatics.oxfordjournals.org/content/29/7/845.abstract
21. Reyes, R., Turner, A., Hess, B.: Introducing SHMEM into the GROMACS molecular dynamics application: experience and results. In: Weiland, M., Jackson, A., Johnson, N. (eds.) Proceedings of the 7th International Conference on PGAS Programming Models. The University of Edinburgh, October 2013. http://www.pgas2013.org.uk/sites/default/files/pgas2013proceedings.pdf
22. Schütte, C., Winkelmann, S., Hartmann, C.: Optimal control of molecular dynamics using Markov state models. Math. Program. (Series B) 134, 259–282 (2012)
23. Shirts, M., Pande, V.S.: Screen savers of the world unite! Science 290(5498), 1903–1904 (2000). http://www.sciencemag.org/content/290/5498/1903.short
24. Sugita, Y., Okamoto, Y.: Replica-exchange molecular dynamics method for protein folding. Chem. Phys. Lett. 314, 141–151 (1999)
25. Verlet, L.: Computer experiments on classical fluids. I. Thermodynamical properties of Lennard-Jones molecules. Phys. Rev. 159, 98–103 (1967). http://link.aps.org/doi/10.1103/PhysRev.159.98
26. Wilson, G., Aruliah, D.A., Brown, C.T., Chue Hong, N.P., Davis, M., Guy, R.T., Haddock, S.H.D., Huff, K.D., Mitchell, I.M., Plumbley, M.D., Waugh, B., White, E.P., Wilson, P.: Best practices for scientific computing. PLoS Biol. 12(1), e1001745 (2014). http://dx.doi.org/10.1371/journal.pbio.1001745
27. Yokota, R., Barba, L.A.: A tuned and scalable fast multipole method as a preeminent algorithm for exascale systems. Int. J. High Perform. Comput. Appl. 26(4), 337–346 (2012). http://hpc.sagepub.com/content/26/4/337.abstract
Introduction
The lattice-Boltzmann (LB) method is widely applied to model fluid flow, and
relies on a stream-collision scheme applied between neighbouring points on a
lattice. These local interactions allow LB implementations to be efficiently parallelized, and indeed numerous high-performance LB codes exist today [10,14].
Today's parallel LB implementations are able to efficiently resolve large non-sparse bulk flow systems (e.g., cuboids of fluid cells) using Petaflop supercomputers [10,12]. Efficiently modelling sparse systems on large core counts is still an
unsolved problem, primarily because it is difficult to obtain a good load balance
in calculation volume, neighbour count and communication volume for sparse
geometries on large core counts [13]. Additionally, the presence of wall sites,
inlets and outlets creates a heterogeneity in the computational cost of different
© Springer International Publishing Switzerland 2015
S. Markidis and E. Laure (Eds.): EASC 2014, LNCS 8759, pp. 28–38, 2015.
DOI: 10.1007/978-3-319-15976-8_2
lattice sites. Here we test two techniques for their potential to improve the load
balance in simulations using sparse geometries, and their performance in general.
We perform this analysis building on existing advances. Indeed, several
LB codes already provide special decomposition techniques to model flow in
sparse geometries more efficiently. For example, Palabos [1], MUSUBI [14] and
WaLBerla [10] apply a block-wise decomposition strategy, while codes such as
HemeLB [13] and MUPHY [19] rely on third-party partitioning libraries such as
ParMETIS and PT-Scotch.
Here we implement and test a weighted decomposition technique to try to
improve the parallel simulation performance of the HemeLB simulation environment for sparse geometries [16], by adding weights corresponding to the computational cost of lattice sites which do not represent bulk fluid sites. In addition,
we examine the effect of also pre-ordering the lattice via a space-filling curve
when applying this method.
Several other groups have investigated the use of weighted decomposition
in other areas, for example in environmental fluid mechanics [5]. In addition,
Catalyurek et al. [9] investigate adaptive repartitioning with Zoltan using
weighted graphs. Specifically, Axner et al. [4] applied a weighting technique to a
lattice-Boltzmann solver for sparse geometries. Whereas we apply weights to vertices, they applied heavier weights to edges near in- and outlets, to ensure that
these regions would not be distributed across several processes.
HemeLB
D. Groen et al.
Fig. 1. Overview of the obtained calculation performance (in billions of lattice site
updates per second) as a function of the years in which the simulation runs were performed. The runs were performed on a variety of supercomputers, each of which is
briefly described above or below the respective data points. The number of cores used
is shown by the size of the circle, ranging from 2,048 cores (smallest circles) to 49,152
cores (largest circles). The fluid fraction is shown by the color of the circle. These
include very sparse simulation domains such as vascular networks (red circles), sparse
domains such as bifurcations (green circles), ranging to non-sparse domains such as cylinders (blue circles) (Color figure online).
multiple blocks [16]. After this initial decomposition, HemeLB then uses the
ParMETIS_V3_PartKway() function to optimize the decomposition, abandoning
the original block-level structure [13]. This function relies on a K-way partitioning technique, which first shrinks the geometry to a minimally decomposable
size, then performs the decomposition, and then refines the geometry back to
its original size. One of the ways we can assess the quality of the decomposition
is by examining the edge cut, which is equal to the number of lattice neighbour
links that cross process boundaries.
Weighting
Fig. 2. 2D example of a sparse domain with the different types of lattice sites. In/outlets
are given by the blue bars and vessel walls by the red curves. Bulk sites are shown by
yellow dots, wall sites by green dots, wall in/outlet sites by red dots, and in/outlet
sites by blue dots (Color figure online).
A second, and more straightforward, optimization we have applied is to take
the Cartesian x, y and z coordinates of all lattice sites, and then sort them
according to a Morton-ordered space-filling curve. We do this prior to partitioning the simulation domain, and in doing so we effectively eliminate any
bias introduced by the early-stage decomposition scheme described in [16]. We
do this by replacing the ParMETIS_V3_PartKway() call in the code with the
ParMETIS_V3_PartGeomKway() function. This optimization is functionally independent from the weighted decomposition technique, but can lead to a better
decomposition result from ParMETIS when applied.

Table 1. Weight values as obtained from fitting against the runtimes of six test simulations on two compute architectures (Intel SandyBridge and AMD Interlagos). The
site type is given, followed by the weight obtained from fitting the performance data
of the six runs, followed by the simplified integer value we adopted in ParMETIS. In
this work we use Bouzidi-Firdaouss-Lallemand (BFL) [7] wall conditions and the in- and
outlet conditions described in Nash et al. [18]. We observed rather erratic fits for the
weightings of in/outlet sites that are adjacent to walls, as these made up only a very
marginal fraction of the overall site counts in our benchmark runs (less than 1 % in
most cases).

Site type   | Weight values
Bulk        | 10.0, 10.0
Wall (BFL)  | 18.708, 20.226
In/outlet   | 40.037, 37.398
3.3
After having inserted these optimizations, we have also tried improving the partition by reducing the tolerance in ParMETIS. The amount of load imbalance
permitted within ParMETIS is indicated by the tolerance value, and a lower
value will increase the number of iterations ParMETIS will do to reach its
final state. Decreasing the tolerance from 1.001 to 1.00001 resulted for us in
an increase of the ParMETIS processing time while showing a negligible difference in the quality of partitioning. As a result, we have chosen not to investigate
this optimization in this work.
Setup
In our performance tests we used two different simulation domains. These include
a smaller bifurcation geometry and a larger aneurysm geometry (see Fig. 3 for
both). The bifurcation simulation domain consists of 650492 lattice sites, which
occupy about 10 % of the bounding box of the geometry. The aneurysm simulation domain consists of 5667778 lattice sites, which occupy about 1.5 % of the
bounding box of the geometry. We run our simulations using pressure in- and
outlets described in Nash et al. [18], the LBGK collision operator [6], the D3Q19
advection model and Bouzidi-Firdaouss-Lallemand wall conditions [7].
Fig. 3. Overview of the bifurcation geometry (left) and the aneurysm geometry (right)
used in our performance tests. The blue blob in the aneurysm geometry is a marker
indicating a region of specific interest to the user. The bifurcation geometry has a
sparsity of about 10 % (i.e., the lattice sites occupy about 10 % of the bounding box
of the geometry), and the aneurysm geometry a sparsity of about 1.5 % (Color figure
online).
For our benchmarks we use the HECToR Cray XT6 supercomputer at EPCC
in Edinburgh, and compile our code using the GCC compiler version 4.3.4. We
have run our simulations for 50000 time steps using 128–1024 cores for the
bifurcation simulation domain, and 512–12288 cores for the aneurysm simulation domain. We repeated the run for each core count five times and averaged
the results. We do this because the scheduler at HECToR does not necessarily allocate processes within a single job to adjacent nodes, and as a result the
performance differs between runs. We have also performed several runs using
the aneurysm simulation domain on the ARCHER Cray XC30 supercomputer
at EPCC. These runs were performed with an otherwise identical configuration.
ARCHER relies on an Intel Ivy Bridge architecture and has a peak performance
of about 1.6 PFLOPs in total.
Results
We present our measurements of the total simulation time and the maximum
LB calculation time for the bifurcation simulation domain in Fig. 4.
We find that both incorporating a space-filling curve and using weighted
decomposition results in a reduction of the simulation time. However, the use
of a space-filling curve does little to reduce the calculation load imbalance,
whereas enabling weighted decomposition results in a reduction of the calculation load imbalance by up to 85 %. We also examined the edge cut returned
by ParMETIS during the domain decomposition stage. For each core count, the
edge cut obtained in all the runs was within a margin of 4.5 %, with slightly
higher edge cuts for runs using a space-filling curve or weighted decomposition.
D. Groen et al.
Fig. 4. Total simulation time and maximum LB calculation time for the simulation using the bifurcation model, run on HECToR. We performed measurements for the non-optimized code, a code with only weighting enabled, a code with only the space-filling curve enabled, and a code with both enabled. We provide lines to guide the eye. In the image on the left we plotted a linear scaling line using a thick gray dotted line. In the image on the right we plotted the average LB calculation time of all our run types using thin gray dotted lines (Color figure online).
Fig. 5. Total simulation time and maximum LB calculation time for the simulation of the aneurysm model, run on HECToR. See Fig. 4 for an explanation of the lines and symbols. Here we only performed measurements for the non-optimized code, a code with only weighted decomposition enabled, and a code with both optimizations enabled.
We present our measurements of the total simulation time and the maximum LB calculation time for the aneurysm simulation domain in Fig. 5. Here we find that applying weighted decomposition results in an increase of runtime by 5 % in most of our runs. Using the space-filling curve in addition to the weighted decomposition results in a further increase in runtime, especially for runs performed on 4096 and 8192 cores. However, the use of weighted decomposition also results in a calculation load imbalance which is up to 65 % lower than that of the original simulation, while we again observe little difference between runs that use a space-filling curve and runs without. When we examine the edge
Fig. 6. Total MPI communication time for the simulation of the bifurcation model
(left, from the run presented in Fig. 4) and the aneurysm model (right, from the run
presented in Fig. 5).
cut obtained by ParMETIS in different runs, we find that using weighted decomposition results in a slightly lower edge cut (0.5 %) and using a space-filling curve results in an edge cut which is up to 5.3 % higher.
To provide more insight into the cause of the increase in simulation time, we present our measurements of the MPI communication overhead in these runs in Fig. 6. The runs which use our optimization strategies take less time to do MPI communication when applied to the bifurcation simulation domain, and more time when applied to the aneurysm domain. These differences largely match the differences we observed in the overall simulation time. Because the total time spent on MPI communication is generally larger than the calculation time for high core counts, and the differences between the runs are considerable, the communication performance is a major component of the overall simulation performance. However, the communication performance correlates only weakly with the edge cut values returned by ParMETIS, and therefore with the total communication volume. For example, the slightly lower edge cut for the aneurysm simulations with weighted decomposition contrasts with their slightly higher communication overhead. This means that the communication load imbalance is likely to be a major bottleneck in the performance of our larger runs, and should be investigated more closely.
5.1
We have repeated the simulations using the aneurysm simulation domain on the ARCHER supercomputer, both with and without weighted decomposition. We present the measured simulation and calculation times of these runs in Fig. 7, and the MPI communication time in Fig. 8. In these runs, we obtained approximately three times the performance per core compared to HECToR. When using weighted decomposition, the calculation load imbalance was reduced by up to 70 %, the simulation time by approximately 2–12 %, and the MPI communication time by approximately 5–20 %. In particular, the reduction in communication
Fig. 7. Total simulation time and maximum LB calculation time for the simulation of
the aneurysm model, as run on ARCHER. See Fig. 4 for an explanation of the lines
and symbols. Here we only performed measurements for the non-optimized code, and
a code with weighted decomposition enabled.
Fig. 8. Total MPI communication time for the run presented in Fig. 7.
simulations it appears that a low edge cut is only a minor factor in the overall communication performance for sparse problems, even though graph partitioning libraries are frequently optimized to accomplish such a minimal edge cut. This is in accordance with earlier conclusions in the literature [15]. We intend to investigate the communication load imbalance of our larger runs more thoroughly. As part of preparing HemeLB for the exascale within the CRESTA project, we are working with experts from the Deutsches Zentrum für Luft- und Raumfahrt (DLR) to enable domain decompositions using PT-Scotch and Zoltan. The use of these alternative graph partitioning libraries may result in further performance improvements, especially if these libraries optimize not only for calculation load balance and a low edge cut, but also take into account other communication characteristics. Furthermore, since we have observed differences in site weights between different computer architectures, we are looking into an auto-tuning function that automatically calculates the weights at runtime or compilation time.
Acknowledgements. We thank Timm Krueger for his valuable input. This work has received funding from the CRESTA and MAPPER projects within the EC-FP7 (ICT-2011.9.13) under Grant Agreements nos. 287703 and 261507, and from EPSRC Grants EP/I017909/1 (www.2020science.net) and EP/I034602/1. This work made use of the HECToR supercomputer at EPCC in Edinburgh, funded by the Office of Science and Technology through EPSRC's High End Computing Programme.
References
1. Palabos LBM Wiki (2011). http://wiki.palabos.org/
2. CRESTA case study: application soars above petascale after tools collaboration (2014). http://www.cresta-project.eu/images/cresta_casestudy1_2014.pdf
3. ParMETIS (2014). http://glaros.dtc.umn.edu/gkhome/metis/parmetis/overview
4. Axner, L., Bernsdorf, J., Zeiser, T., Lammers, P., Linxweiler, J., Hoekstra, A.G.: Performance evaluation of a parallel sparse lattice Boltzmann solver. J. Comput. Phys. 227(10), 4895–4911 (2008)
5. Barad, M.F., Colella, P., Schladow, S.G.: An adaptive cut-cell method for environmental fluid mechanics. Int. J. Numer. Methods Fluids 60(5), 473–514 (2009)
6. Bhatnagar, P.L., Gross, E.P., Krook, M.: A model for collision processes in gases. I. Small amplitude processes in charged and neutral one-component systems. Phys. Rev. 94, 511–525 (1954)
7. Bouzidi, M., Firdaouss, M., Lallemand, P.: Momentum transfer of a Boltzmann-lattice fluid with boundaries. Phys. Fluids 13(11), 3452–3459 (2001)
8. Carver, H.B., Groen, D., Hetherington, J., Nash, R.W., Bernabeu, M.O., Coveney, P.V.: Coalesced communication: a design pattern for complex parallel scientific software. Advances in Engineering Software (2015, in press)
9. Catalyurek, U.V., Boman, E.G., Devine, K.D., Bozdag, D., Heaphy, R., Riesen, L.A.: Hypergraph-based dynamic load balancing for adaptive scientific computations. In: IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007), pp. 1–11. IEEE (2007)
Abstract. Current Monte Carlo neutron transport applications use continuous energy cross section data to provide the statistical foundation for particle trajectories. This classical algorithm requires storage and random access of very large data structures. Recently, Forget et al. [1] reported on a fundamentally new approach, based on multipole expansions, that distills cross section data down to a more abstract mathematical format. Their formulation greatly reduces memory storage and improves data locality at the cost of increased floating point computation. In the present study, we abstract the multipole representation into a proxy application, which we then use to determine the hardware performance parameters of the algorithm relative to the classical continuous energy algorithm. This study is done to determine the viability of both algorithms on current and next-generation high performance computing platforms.
Keywords: Monte Carlo · Multi-core · Neutron transport · Reactor simulation · Multipole · Cross section
Introduction
Monte Carlo (MC) transport algorithms are considered the gold standard of accuracy for a broad range of applications: nuclear reactor physics, shielding, detection, medical dosimetry, and weapons design, to name just a few examples. In the design and analysis of nuclear reactor cores, the key application driver of the present analysis, MC methods for neutron transport offer significant potential advantages compared to deterministic methods, given their simplicity, avoidance of ad hoc approximations in energy treatment, and lack of need for complex computational meshing of reactor geometries.
On the other hand, it is well known that robust analysis of a full reactor core is still beyond the reach of MC methods. Tremendous advances have been made in recent years, but the computing requirements for full quasi-static depletion
© Springer International Publishing Switzerland 2015
S. Markidis and E. Laure (Eds.): EASC 2014, LNCS 8759, pp. 39–56, 2015.
DOI: 10.1007/978-3-319-15976-8_3
Computer-based simulation of nuclear reactors is a well-established field, with origins dating back to the early years of digital computing. Traditional reactor simulation techniques aim to solve deterministic equations (typically a variant of the diffusion equation) for a given material geometry and initial neutron distribution (source) within the reactor. This is done using mature and well-understood numerical methods. Deterministic codes are capable of running quickly and providing relatively accurate gross power distributions, but are still limited when accurate localized effects are required, such as at sharp material interfaces.
An alternative formulation, the Monte Carlo (MC) method, simulates the paths of individual neutrons as they travel through the reactor core. As many particle histories are simulated and tallied, a picture of the full distribution of neutrons within the domain emerges. Such codes are inherently simple, easy to understand, and potentially easy to restructure when porting to new systems. Furthermore, the methodologies utilized by MC simulation require very few assumptions, resulting
OpenMC
XSBench
The XSBench proxy application models the most computationally intensive part of a typical MC reactor core transport algorithm: the calculation of macroscopic neutron cross sections, a kernel which accounts for around 85 % of the total runtime of OpenMC [4]. XSBench retains the essential performance-related computational conditions and tasks of fully featured reactor core MC neutron transport codes, yet at a fraction of the programming complexity of the full application [6]. Particle tracking and other features of the full MC transport algorithm were not included in XSBench, as they take up only a small portion of runtime in robust reactor computations. This provides a much simpler and far more transparent platform for testing the algorithm on different architectures, making alterations to the code, and collecting hardware runtime performance data.
XSBench was developed by members of the Center for Exascale Simulation of
Advanced Reactors (CESAR) at Argonne National Laboratory. The application is
written in C, with multi-core parallelism support provided by OpenMP. XSBench
is an open source software project. All source code is publicly available online [12].
1.4 RSBench
unpacking of this data by way of a significant number of additional computations (FLOPs). The multipole algorithm has also been experimentally implemented in OpenMC, but is only capable of simulating several select nuclides with this method due to limited multipole cross section library support.
RSBench is in active development by members of the CESAR group at
Argonne National Laboratory. The application is written in C, with multi-core
parallelism support provided by OpenMP. RSBench is an open source software
project. All source code is publicly available online [13].
2 Algorithm
2.1 Reactor Model
When carrying out reactor core analysis, the geometry and material properties of a postulated nuclear reactor must be specified in order to define the variables and scope of the simulation model. For the purposes of XSBench and RSBench, we use a well-known community reactor benchmark known as the Hoogenboom-Martin model [14]. This model is a simplified analog to a more complete, real-world reactor problem, and provides a standardized basis for discussions on performance within the reactor simulation community. XSBench and RSBench recreate the computational conditions present when fully featured MC neutron transport codes (such as OpenMC) simulate the Hoogenboom-Martin reactor model, preserving a similar data structure, a similar level of randomness of data accesses, and a similar distribution of FLOPs and memory loads.
2.2
of a very large data structure that holds cross section data points for many discrete energy levels. In the case of the simplified Hoogenboom-Martin benchmark, roughly 5.6 GB of data is required. The multipole method greatly reduces these requirements, down to the order of approximately 100 MB or less for all data.
2.3
The classical continuous energy cross section representation, as used by real-world applications like OpenMC, is abstracted in the proxy application XSBench. This section describes the data structure used by this algorithm along with the access patterns of the algorithm.
Data Structure. A material in the Hoogenboom-Martin reactor model is composed of a mixture of nuclides. For instance, the reactor fuel material might consist of several hundred different nuclides, while the pressure vessel side wall material might only contain a dozen or so. In total, there are 12 different materials and 355 different nuclides present in the modeled reactor. The data usage requirements to store this model are significant, totaling 5.6 GB, as summarized in Table 1.
Table 1. XSBench data structure summary
Nuclides tracked               355
Total # of energy grid points  4,012,565
For each nuclide, an array of nuclide grid points is stored in main memory. Each nuclide grid point (as represented in Fig. 1) has an energy level, as well as five cross section values (corresponding to five different particle interaction types) for that energy level. The grid points are ordered from lowest to highest energy level. The number, distribution, and granularity of energy levels vary between nuclides. One nuclide may have hundreds of thousands of grid points clustered around lower energy levels, while another may only have a few hundred grid points distributed across the full energy spectrum. This obviates straightforward approaches to uniformly organizing and accessing the data. Collectively, this data structure (depicted in Fig. 2) is known as the nuclide energy grid.
In order to increase the speed of the calculation, the algorithm utilizes another data structure, called the unionized energy grid, as described by Leppänen [16] and Romano [2]. The unionized grid facilitates fast lookups of cross section data from
Fig. 1. A cross section data packet for a neutron within a given nuclide at a given energy
level, Ei .
the nuclide grids. This structure is an array of grid points, each consisting of an energy level and a set of pointers to the closest corresponding energy level on each of the different nuclide grids (Fig. 3).
Fig. 2. Simplied example of the nuclide energy grid. Note how each nuclide has a varying number and distribution of energy levels.
Access Patterns. In a full MC neutron transport application, the data structure is accessed each time a macroscopic cross section needs to be calculated. This happens any time a particle changes energy (via a collision) or crosses a material boundary within the reactor. These macroscopic cross section calculations occur with very high frequency in the MC transport algorithm, and the inputs to them are effectively random. For the sake of simplicity, XSBench ignores the particle tracking aspect of the MC neutron transport algorithm and instead isolates the macroscopic cross section lookup kernel. This provides a large reduction in program complexity while retaining similarly random input conditions for the macroscopic cross section lookups via the use of a random number generator.
In XSBench, each macroscopic cross section lookup consists of two randomly
sampled inputs: the neutron energy Ep , and the material mp . Given these two
Fig. 3. Simplied example of the unionized energy grid. Each grid element is the index
where the energy level Ei can be found in the nuclide energy grid for nuclide Ni .
inputs, a binary search that executes in log(n) time is done on the unionized energy
grid for the given energy. Once the correct entry is found on the unionized energy
grid, the material input is used to perform lookups from the nuclide grids present
in the material. Use of the unionized energy grid means that binary searches are
not required on each individual nuclide grid. For each nuclide present in the material, the two bounding nuclide grid points are found using the pointers from the
unionized energy grid and interpolated to give the exact microscopic cross section
at that point.
All calculated microscopic cross sections are then accumulated (weighted by
their atomic density in the given material), which results in the macroscopic cross
section for the material. Algorithm 1 is an abbreviated summary of this calculation.
Algorithm 1. Classical Macroscopic Cross Section Lookup
1: R(m_p, E_p)
2: locate E_p on unionized grid
3: for n ∈ m_p do
4:   σ_a ← nuclide grid point at (n, E_p)
5:   σ_b ← nuclide grid point at (n, E_p + 1)
6:   σ_n ← interpolate(σ_a, σ_b)
7:   Σ ← Σ + σ_n
8: end for
In theory, one could pre-compute all macroscopic cross sections on the unionized energy grid for each material. This would allow the algorithm to run much faster, requiring far fewer memory loads and far fewer floating point operations per macroscopic cross section lookup. However, this would assume a static distribution of nuclides within a material. In practice, MC transport nuclide-depletion
calculations are quasi-static; they will need to track the burn-up of fuels and account for heterogeneous temperature distributions within the reactor itself. This
means that concentrations are dynamic, rather than static, therefore necessitating the use of the more versatile data model deployed in OpenMC and XSBench.
Even if static concentrations were assumed, pre-computation of the full spectrum
of macroscopic cross sections would need to be done for all geometric regions
(of which there are many millions) in the reactor model, leading to even higher
memory requirements.
We have verified that XSBench faithfully mimics the data access patterns of the full MC application under a broad range of conditions [6]. The runtime of full-scale MC transport applications, such as OpenMC, is 85 % composed of macroscopic cross section lookups [4]. Within this process, XSBench is virtually indistinguishable from OpenMC, as the same type and size of data structure is used, with a similarly random access pattern and a similar number of floating point operations occurring between memory loads. Thus, performance analysis done with XSBench provides results applicable to the full MC neutron transport algorithm, while being far easier to implement, run, and interpret.
2.4
A multipole representation cross section algorithm is abstracted in the proxy application RSBench. This section summarizes the data structure used by this algorithm along with the access patterns and computations it performs. The multipole representation stores cross section data in the form of poles. Each pole can be characterized by several variables that define the parameters (residues) of the resonance, which can be used to compute the actual microscopic cross section contribution at any energy from the pole. Forget et al. also utilize a window methodology that limits the number of poles that need to be evaluated for a given cross section calculation [1]. The energy spectrum is broken up into a series of windows, each covering a specific energy range and storing a set of fitting factors. These fitting factors represent a background function that can be evaluated to represent the contributions from all poles outside the window. Use of the windowing method saves time by requiring that only poles within a single energy window be evaluated to determine the microscopic cross sections, rather than all poles in the entire energy spectrum. A more in-depth explanation of the mathematics behind the multipole representation is offered by Forget et al. [1].
Data Structure. The primary data structures employed by RSBench are two separate 2-D jagged arrays. The first 2-D array contains the resonance data for all poles and accompanying residues. The first dimension corresponds to each nuclide present in the reactor. The second dimension corresponds to the number of poles present in that nuclide (each nuclide has a different number of poles, varying from 100 to 6,000) [1]. For the purposes of this mini-app, a representative average number of poles per nuclide is set at 1,000 by default. Each element of this array is a pole data structure that holds several pieces of information, including the center
energy for the pole, resonance residues for several reaction types, and the l index, as depicted in Fig. 4.
The second 2-D array contains data for all windows. The first dimension corresponds to each nuclide present in the reactor. The second dimension corresponds to the number of windows used for that particular nuclide. Window sizing is an empirical process done when building a library of multipole cross section data, where each nuclide is likely to require a different window size to achieve a given accuracy. Thus, each nuclide in RSBench has a different number of windows (ranging from 4 to 25 poles per window [1]). For the purposes of this mini-app, a representative number of windows is set to 250 by default. Each element of this array is a window data structure that holds several pieces of information, including the function fitting factors for several reaction types and the start and end pole indices, as represented in Fig. 5.
Compared to the classical method, as used by XSBench, these 2-D arrays together are in total much smaller, as no unionized energy grid is necessary and far fewer data points are needed, as summarized in Table 2. Use of the multipole method in this case results in a memory footprint reduction of over two orders of magnitude.
Note that the average numbers of poles per nuclide and windows per nuclide used in RSBench are only approximations. Multipole data requirements are well understood for U-235 and U-238, but library files have yet to be computed for the other 353 nuclides in our simulation. Approximations were selected based on interpolation, under the assumption that multipole memory requirements will have ratios between nuclides similar to those of the classical continuous energy cross section storage method. E.g., the ratio of data storage needed between U-238 and Ni-58 for the continuous energy representation is assumed to remain the same for the multipole representation.
Table 2. RSBench data structure summary
Nuclides tracked            355
Avg. poles per nuclide      1,000
Avg. windows per nuclide    250
Total # of resonances       355,000
and end pole indices of the window, the pole data structures are retrieved and
used in several lengthy computations to determine their contributions to the various microscopic cross sections. Macroscopic cross sections are then accumulated.
This process is summarized in Algorithm 2.
Algorithm 2. Multipole Macroscopic Cross Section Lookup
1: R(m_p, E_p)
2: for n ∈ m_p do
3:   locate window W covering energy E_p
4:   calculate phase shift φ_l
5:   σ_T ← W_T background fit
6:   σ_A ← W_A background fit
7:   σ_F ← W_F background fit
8:   for pole P ∈ W do
9:     σ_T ← σ_T + contribution of residue P_RT using φ_l
10:    σ_A ← σ_A + contribution of residue P_RA
11:    σ_F ← σ_F + contribution of residue P_RF
12:  end for
13:  σ_E ← σ_T − σ_A
14:  Σ ← Σ + σ_n   (accumulate macro xs)
15: end for
The equations used to assemble microscopic cross sections out of multipole resonance data are described in detail by Forget et al. [1]. Simplified forms of the 0 K multipole equations used by RSBench are given in Eqs. 1, 2, and 3. Note that the effects of neutron spin are neglected under the assumption that all neutrons are spin zero, which in our experience does not impact performance. This simplification is made to reduce the programming complexity of the RSBench application, making it easier to instrument and port to new languages and systems, while still retaining a performance profile similar to the full multipole algorithm.
\sigma_x(E) = \frac{1}{E} \sum_{l_j} \sum_{j=1}^{N\,2(l+1)} \mathrm{Re}\!\left[\frac{i\,r_x^{(j)}}{\bar{p}^{(j)} - \sqrt{E}}\right] \qquad (1)

\sigma_t(E) = \sigma_p(E) + \frac{1}{E} \sum_{l_j} \sum_{j=1}^{N\,2(l+1)} \mathrm{Re}\!\left[\exp(-i 2\phi_l)\,\frac{i\,r_t^{(j)}}{\bar{p}^{(j)} - \sqrt{E}}\right] \qquad (2)

(3)

where r_x^{(j)} and r_t^{(j)} are the residues for reaction x and the total cross section around resonance j, g_j is the spin statistical factor, \bar{p}^{(j)} is the complex conjugate of the pole, and \phi_l is the phase shift. In this form, the cross sections can be computed by summations over the angular momentum of the channel (l), the channel spin (j), the number of resonances (N), and the number of poles associated with a given resonance type, 2(l + 1).
Application
To investigate the performance profiles of our two MC transport cross section algorithms on existing systems, we carried out a series of tests using RSBench and XSBench on a single-node, multi-core, shared-memory system. The system used was a single node consisting of two Intel Xeon E5-2650 octo-core CPUs, for a total of 16 physical cores. All tests, unless otherwise noted, were run at 2.8 GHz using Intel Turbo Boost.
We performed a scaling study to determine performance improvements as additional cores were added. We ran both proxy applications with only a single thread to determine a baseline performance against which efficiency can be measured. Then, further runs were done to test each number of threads between 1 and 32. Efficiency is defined as

Efficiency_n = R_n / (R_1 · n)    (4)

where n is the number of cores, R_n is the experimental calculation rate for n cores, and R_1 is the experimental calculation rate for one core.
The tests reveal that, even for these proxy applications of the MC transport algorithm, perfect scaling was not achievable. Figures 6 and 7 show that efficiency degraded gradually as more cores were used on the nodes. For the Xeon system, efficiency at 16 cores degraded to 69 % for XSBench and 83 % for RSBench.
One might reasonably conclude that 69 % or 83 % efficiency out to 16 cores is adequate speedup. However, next-generation node architectures are likely to require up to thousand-way on-node shared memory parallelism [7-10], and thus it is crucial to ascertain the cause of the observed degradation and its implications for greater levels of scalability. Considering nodes with 32, 64, 128, or 1024 shared memory cores and beyond, it cannot be taken for granted that performance will continue to improve. We thus seek to identify, to the greatest extent possible, which particular system resources are being exhausted, and how quickly, so that designers of future hardware systems as well as developers of future MC particle transport applications can avoid bottlenecks.
High performance computing (HPC) applications generally have several possible reasons for performance loss due to scaling:
1. FLOP bound: a CPU can only perform so many floating point operations per second.
2. Memory bandwidth bound: a finite amount of data can be sent between DRAM and the CPU per second.
3. Memory latency bound: an operation on the CPU that requires data from DRAM can take a long time to be served.
To investigate the performance and resource utilization profiles of both proxy applications, and to determine the cause of multi-core scaling degradation, we performed a series of experiments. Each experiment involves varying a system parameter, monitoring hardware usage using performance counters, and/or altering a portion of the XSBench and RSBench codes. The following section presents descriptions, results, and preliminary conclusions for each experiment. For simplicity, we concentrate our analysis on the Intel Xeon system described in Sect. 3. This allows us to obtain highly in-depth results, as we are able to run experiments dealing with architecture-specific features and hardware counters.
4.1 Resource Usage
To better understand scaling degradation in our kernels, we implemented performance counting features in the source code of XSBench and RSBench using the Performance Application Programming Interface (PAPI) [17]. This allowed us to select from a large variety of performance counters (both preset and native to our particular Xeon chips). We collected data for many counters, including:
ix86arch::LLC_MISSES - last level (L3) cache misses.
PAPI_TOT_CYC - total CPU cycles.
PAPI_FP_INS - floating point instructions.
These raw performance counters allowed us to calculate a number of composite metrics, including bandwidth usage, FLOP utilization, and cache miss rate. Each of these metrics is discussed in the following subsections.
Bandwidth. Consumption of available system bandwidth resources used by XSBench and RSBench is calculated using Eq. 5.

Bandwidth = (LLC misses × cache line size) / runtime    (5)
Using Eq. 5, we collected the bandwidth usage of our proxy applications as run on varying numbers of cores, as shown in Fig. 8. Note that the maximum theoretically available bandwidth for the Xeon node is 51.2 GB/s [18]. Figure 8 shows that less than half the available bandwidth is ever used by either of our proxy applications, even when running at 32 threads per node.
There is, however, the question of how much bandwidth is realistically usable on any given system. Even a perfectly constructed application that floods the memory system with easy, predictable loads is unlikely to be able to use the full system bandwidth. In order to determine what is actually usable on our Xeon system, we ran the STREAM benchmark, which measures real-world bandwidth sustainable from ordinary user programs [19]. Results from this benchmark are shown in Fig. 8 and compared to XSBench and RSBench. As can be seen, XSBench converges with STREAM, leading us to believe that the classical cross section algorithm is bottlenecked by system bandwidth. In contrast, we find that the bandwidth usage of RSBench is much more conservative, using only 1 GB/s, a factor of over 20 less than what XSBench uses.
The 16-core Xeon node used in our testing features hardware threading, supporting
up to 32 threads per node.
Performance parameter
2,075,457
1,017,772
23.7
0.91
3.6
10.8
27
Conclusions
systems, as processor cores per node and computational capacity are expected to greatly outpace increases in bandwidth to main memory. This is an important result, as the multipole method is not yet widely used in Monte Carlo transport codes, yet it exhibits an ideal performance profile for on-node scaling on the many-core exascale architectures of the near future.
Future Work
There are additional capabilities that do not yet commonly exist in full-scale MC
neutron transport algorithms, such as on-the-fly Doppler broadening to account
for the material temperature dependence of cross sections, that we plan to implement in XSBench and RSBench for experimentation with various hardware architectures and features. This addition is predicted to enhance the advantages of the
multipole algorithm, as Doppler broadening is an inherently easier task when cross
section data is already stored in the multipole format.
Acknowledgments. This work was supported by the Office of Advanced Scientific
Computing Research, Office of Science, U.S. Department of Energy, under Contract
DE-AC02-06CH11357. The submitted manuscript has been created by the University of
Chicago as Operator of Argonne National Laboratory (Argonne) under Contract DE-AC02-06CH11357 with the U.S. Department of Energy. The U.S. Government retains
for itself, and others acting on its behalf, a paid-up, nonexclusive, irrevocable worldwide
license in said article to reproduce, prepare derivative works, distribute copies to the
public, and perform publicly and display publicly, by or on behalf of the Government.
References
1. Forget, B., Xu, S., Smith, K.: Direct Doppler broadening in Monte Carlo simulations using the multipole representation. Ann. Nucl. Energy 64(C), 78–85 (2014)
2. Romano, P.K., Forget, B.: The OpenMC Monte Carlo particle transport code. Ann. Nucl. Energy 51, 274–281 (2013)
3. Romano, P.K., Forget, B., Brown, F.B.: Towards scalable parallelism in Monte Carlo particle transport codes using remote memory access, pp. 17–21 (2010)
4. Siegel, A.R., Smith, K., Romano, P.K., Forget, B., Felker, K.G.: Multi-core performance studies of a Monte Carlo neutron transport code. Int. J. High Perform. Comput. Appl. 28(1), 87–96 (2013)
5. Tramm, J., Siegel, A.R.: Memory bottlenecks and memory contention in multi-core Monte Carlo transport codes. In: Joint International Conference on Supercomputing in Nuclear Applications + Monte Carlo, Paris, October 2013. Argonne National Laboratory (2013)
6. Tramm, J.R., Siegel, A.R., Islam, T., Schulz, M.: XSBench - the development and verification of a performance abstraction for Monte Carlo reactor analysis. Presented at PHYSOR 2014 - The Role of Reactor Physics Toward a Sustainable Future, Kyoto
7. Dosanjh, S., Barrett, R., Doerfler, D., Hammond, S., Hemmert, K., Heroux, M., Lin, P., Pedretti, K., Rodrigues, A., Trucano, T., Luitjens, J.: Exascale design space exploration and co-design. Future Gener. Comput. Syst. 30, 46–58 (2013)
8. Attig, N., Gibbon, P., Lippert, T.: Trends in supercomputing: the European path to exascale. Comput. Phys. Commun. 182(9), 2041–2046 (2011)
9. Rajovic, N., Vilanova, L., Villavieja, C., Puzovic, N., Ramirez, A.: The low power architecture approach towards exascale computing. J. Comput. Sci. 4, 439–443 (2013)
10. Engelmann, C.: Scaling to a million cores and beyond: using light-weight simulation to understand the challenges ahead on the road to exascale. Future Gener. Comput. Syst. 30, 59–65 (2013)
11. Romano, P.: OpenMC Monte Carlo code, January 2014. https://github.com/mit-crpg/openmc
12. Tramm, J.: XSBench: the Monte Carlo macroscopic cross section lookup benchmark, January 2014. https://github.com/jtramm/XSBench
13. Tramm, J.: RSBench: a mini-app to represent the multipole resonance representation lookup cross section algorithm, January 2014. https://github.com/jtramm/RSBench
14. Hoogenboom, J.E., Martin, W.R., Petrovic, B.: Monte Carlo performance benchmark for detailed power density calculation in a full size reactor core: benchmark specifications (2010)
15. Romano, P.K., Siegel, A.R., Forget, B., Smith, K.: Data decomposition of Monte Carlo particle transport simulations via tally servers. J. Comput. Phys. 252(C), 20–36 (2013)
16. Leppänen, J.: Two practical methods for unionized energy grid construction in continuous-energy Monte Carlo neutron transport calculation. Ann. Nucl. Energy 36(7), 878–885 (2009)
17. ICL: PAPI - performance application programming interface, September 2013. http://icl.cs.utk.edu/papi/index.html
18. Intel: Xeon processor E5-2650 CPU specifications, September 2013. http://ark.intel.com/products/64590/
19. McCalpin, J.D.: Memory bandwidth and machine balance in current high performance computers. In: IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, December 1995, pp. 19–25 (1995)
Keywords: Nek5000 · Spectral-element method

Introduction
Nek5000 is an open-source code for simulating incompressible flows [1]. The code
is widely used in a broad range of applications. The various research projects
at the KTH Royal Institute of Technology Mechanics Department using Nek5000
include the study of turbulent pipe flow, the flow along airplane wings, a jet in
cross-flow, and Lagrangian particle motion in complex geometries [2].
The Nek5000 discretization scheme is based on the spectral-element method [3].
In this approach, the incompressible Navier-Stokes equations are discretized in
© Springer International Publishing Switzerland 2015
S. Markidis and E. Laure (Eds.): EASC 2014, LNCS 8759, pp. 57–68, 2015.
DOI: 10.1007/978-3-319-15976-8_4
J. Gong et al.
Theoretical Background
The incompressible Navier-Stokes equations read

    ∂u/∂t + (u·∇)u = −∇p + (1/Re)∇²u + f,    ∇·u = 0    in R³    (1)

Multiplying by test functions v and q and integrating over the domain Ω gives the weak form

    ∫_Ω v·(∂u/∂t) dΩ + ∫_Ω v·(u·∇)u dΩ = −∫_Ω v·∇p dΩ + (1/Re) ∫_Ω v·∇²u dΩ + ∫_Ω v·f dΩ,
    ∫_Ω q (∇·u) dΩ = 0    (2)

When the spectral element method (SEM) [7] is employed in the spatial discretization, the variable u (and v, w, p) and its first derivatives can be continuously represented as

    u(x, y, z) = Σ_{i=0}^{N} Σ_{j=0}^{N} Σ_{k=0}^{N} u_{ijk} φ_i(x) φ_j(y) φ_k(z)    (3)

with derivatives such as ∂u/∂x obtained by differentiating the basis functions φ and scaling by the Jacobian |J| of the mapping to the reference element.
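A sketch of evaluating the tensor-product expansion of Eq. (3) at a single point follows, assuming the 1-D basis values φ_i(x), φ_j(y), φ_k(z) have already been tabulated; the function and array names are illustrative, not Nek5000's:

```c
/* Sketch: evaluate u(x,y,z) = sum_{ijk} u_ijk * phi_i(x)*phi_j(y)*phi_k(z).
   u holds the (N+1)^3 coefficients in (i,j,k) order; phix[i] = phi_i(x),
   phiy[j] = phi_j(y), phiz[k] = phi_k(z) at the evaluation point. */
double sem_eval(int N, const double *u,
                const double *phix, const double *phiy, const double *phiz) {
    double s = 0.0;
    for (int k = 0; k <= N; ++k)
        for (int j = 0; j <= N; ++j)
            for (int i = 0; i <= N; ++i)
                s += u[(k*(N+1) + j)*(N+1) + i] * phix[i] * phiy[j] * phiz[k];
    return s;
}
```

The tensor-product structure is what lets Nek5000 apply operators dimension by dimension as small dense matrix-matrix products rather than one large matrix-vector product.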
The resulting discrete system at time step n can be written in block form as

    [ H 0 0 0 ] [ u1^n ]   [ f1^n ]
    [ 0 H 0 0 ] [ u2^n ] = [ f2^n ]    (4)
    [ 0 0 H 0 ] [ u3^n ]   [ f3^n ]
    [ 0 0 0 A ] [ p^n  ]   [ g^n  ]

where H = (1/Re) A + (1/Δt) B is the discrete Helmholtz operator and A is the symmetric positive-definite Laplace operator. Note that f^n accounts for all the terms
known prior to time t^n. The resulting linear system is solved with a Conjugate
Gradient (CG) linear solver accelerated with suitable preconditioners.
3
3.1
Profiling results of Nek5000 on a single node with the CrayPAT profiler:

    Time%  |       Time | Calls      | Group/Function
    -------+------------+------------+----------------
    100.0% | 191.107502 | 44148206.0 | Total
     99.3% | 189.766300 | 42866946.0 | USER
     33.5% |  63.981577 | 30584418.0 | mxf10_
     19.6% |  37.490228 |     2450.0 | axhelm_
      9.7% |  18.620119 |       40.0 | cggo_
      8.8% |  16.788978 |  5118400.0 | mxf12_
      8.3% |  15.881574 |       10.0 | hmh_gmres_
      4.4% |   8.406132 |      914.0 | h1mg_schwarz
      2.4% |   4.595712 |      914.0 | hsmg_do_fast_
      2.4% |   4.544659 |      457.0 | h1mg_solve_
      1.4% |   2.630254 |  2924800.0 | mxf6_
It is clear from the profiling results that the subroutines mxf6, mxf10, and
mxf12 for calculating the matrix-matrix product required approximately 43.7%
of the time. Other subroutines (axhelm, cggo, hmh_gmres, h1mg_schwarz,
hsmg_do_fast, and h1mg_solve), which together implement the preconditioned CG solver, required
approximately 46.8% of the total time. Matrix-matrix multiplication and
the linear algebra solvers therefore dominate the execution time of Nek5000.
3.2
Matrix-Matrix Product
Gather-Scatter Operator
In traditional finite element methods, the global matrix is typically assembled
from the distinct nodes associated with the global indices. However, in Nek5000,
the linear system is written as A_L u_L = b, where u_L is the vector of values for
each node associated with local, element-based indices.
As an example, a mesh of 4 elements in 2-D is shown in Fig. 1. In this mesh
there are 9 global nodes (0, ..., 8) and 16 local nodes (0, ..., 15). One global
node may correspond to several local nodes. The solution of the linear system
    u_L = Q^T u_G = Q^T Q u_L    (5)

where the Boolean matrix Q is the gather operator and its transpose Q^T is the
scatter operator. Notice that the matrix Q is not explicitly implemented in Nek5000;
instead a local-global map lgl(local_index, global_index) is used. The local-global map for the mesh in Fig. 2 is
    procs:          0  0  0  0  0  0 ||  1  1  1  1  1  1
    local_indices:  1  3  5  4  7  9 || 10 11 13 12 14 15
    global_indices: 1  3  3  4  4  7 ||  1  4  4  5  5  7
To reduce MPI global communication, those nodes that share a global index in
the same process are summed locally and then exchanged with other processes. For example, the parallel version of the gather-scatter operator (gs_op) for
the value on global node 4, shared with local nodes 4, 7, 11, and 13, is
    u_G^4 = (u_L^4 + u_L^7) / 2    (Proc 0),    u_G^4 = (u_L^11 + u_L^13) / 2    (Proc 1)

    u_G^4 = (u_G^4 (on Proc 0) + u_G^4 (on Proc 1)) / 2    (MPI gs_op)

    u_L^4 = u_L^7 = u_G^4    (Proc 0),    u_L^11 = u_L^13 = u_G^4    (Proc 1)
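The process-local part of this operation can be sketched as a gather (sum local values into the global vector via the lgl map) followed by a scatter (write the global value back to every local copy); the simple array layout here is an illustrative assumption, and the multiplicity weighting and MPI exchange of the real gs_op are omitted:

```c
/* Sketch of the local gather-scatter step.
   lgl[l] gives the global index of local node l (flattened lgl map).
   Gather: uG = Q uL (sum of local copies); scatter: uL_out = Q^T uG. */
void gs_local(int nL, int nG, const int *lgl,
              const double *uL, double *uG, double *uL_out) {
    for (int g = 0; g < nG; ++g) uG[g] = 0.0;
    for (int l = 0; l < nL; ++l) uG[lgl[l]] += uL[l];      /* gather  */
    for (int l = 0; l < nL; ++l) uL_out[l] = uG[lgl[l]];   /* scatter */
}
```

In Nek5000 the division by the node multiplicity (the averaging shown above) is handled separately via the mult array.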
The OpenACC code of the preconditioned conjugate gradient solver CG for calculating the pressure field is shown below.
!$ACC DATA PRESENT(r,w,z,d,p,h1,h2)
!$ACC& PRESENT(mask,mult,nel,ktype)
      do iter=1,niter
         call fdm_h1_acc(z,r,d,mask,mult,nel,ktype,w)
         call crs_solve_h1_acc(w,r)
         call add2_acc(z,w,n)
         call add2s1_acc(p,z,beta,n)
         call axhelm_acc(w,p,h1,h2,imsh,isd)
         call gs_op(w,nx1,ny1,nz1)
         call col2_acc(w,mask,n)
         rho = glsc3_acc(w,p,mult,n)
         alpha = rtz1/rho; alphm = -alpha
         call add2s2_acc(x,p,alpha,n)
         call add2s2_acc(r,w,alphm,n)
      enddo
!$ACC END DATA
The OpenACC version of the preconditioned Conjugate Gradient solver. In the
OpenACC implementation, the subroutine gs_op needs to be called once to
exchange the interface data between GPUs.
Performance Results
[Table: per-case run times and speed-up ratios (labels lost in extraction) — 5.6/6.3 (0.89); 8.7/8.7 (1.0); 14th: 11.9/11.2 (1.1); 15th: 15.5/11.8 (1.3).]
    Time%  |      Time | Calls     | Group/Function
    -------+-----------+-----------+----------------
    100.0% | 30.070742 | 6545592.0 | Total
     96.0% | 28.866030 | 5342163.0 | USER
4.1
Figure 2 shows the color-coded application run-time behavior over the whole
run-time of 151.93 s for the CPU process (Master thread) and the GPU stream
(CUDA[0:2]). The figure shows the three main phases of the Nek5000 code:
nek_init (initialization), nek_solve (main calculation), and nek_post (post-processing). These three phases can be easily distinguished by comparing the
different color-coded patterns in the timeline of the CPU process. All compiler-generated OpenACC kernels are grouped in CUDA Kernel. The overall run-time
of these kernels was 102.416 s (67.41% of the application run-time), while the
GPU was idle for the rest of the time (32.59% of the application run-time).
Fig. 2. Function-based, color-coded visualization of the CPU process and GPU stream
over the full application run-time, with additional statistics for the groups of functions
and the GPU kernels. The timeline of the duration metric visualizes the
dynamic behavior of the duration of each main iteration over time.
Figure 3 shows the zoomed-in, color-coded run-time behavior of the application during the main calculation phase over ten Nek5000 computational iterations. It shows that the idle time of the GPU was reduced, in comparison to the
whole run-time, to 25.35%. The ten iterations have different durations,
varying from 15.16 s (first iteration) to 12.32 s (tenth iteration). This can be seen
in the timeline of the duration metric over time.
The main GPU kernels are the matrix-matrix multiplication in the subroutine
axhelm, taking 36.43% of the run-time, and a kernel that consumes 28.06% of
the run-time. 24.68% of the GPU run-time was spent in kernels that map
between the coarse and the fine mesh. An important result of this
analysis is that it shows that only one CUDA stream is used by the compiler-generated OpenACC code. For this reason, all the GPU kernels are executed
sequentially and therefore there is no overlapping of the CUDA kernels during
the execution. As a result, the achieved performance of the GPU relies on the
degree of vectorization of each OpenACC kernel, i.e., how many massively parallel
threads can be created within each kernel. The main CPU process routine is a
CUDA synchronization call, cudaStreamSynchronize, which in fact uses 90%
of the CPU's application run-time. This function is part of the group CUDA API,
which includes all CUDA API calls monitored on the CPU process; it was
introduced by the compiler and used in the OpenACC regions.
Figure 4 presents the color-coded visualization of the first iteration of the
Nek5000 simulation. This iteration has a duration of 15.154 s, and the idle time
of the GPU for this phase is 3.441 s (22.7%). The CPU process spent most of its
time in the CUDA synchronization routine (light blue color in Fig. 4). The most
time-consuming kernels of the GPU are again the matrix-matrix computations,
with 4.631 s and 3.363 s respectively. In future work, to improve the performance
of Nek5000 with OpenACC by decreasing the GPU idle time, it will be important
to overlap kernels and/or host-device memory transfers.
Conclusions
The full Nek5000 code has been ported to multi-GPU systems using OpenACC
compiler directives. The work focused on porting the most time-consuming parts
of Nek5000 to GPU systems, namely the matrix-matrix multiplication and
the preconditioned CG linear solver. The gather-scatter method with MPI operations has been redesigned in order to decrease the amount of data transferred
between host and accelerator. A speed-up of 1.3 times was found on a single
node of a Cray XK6 when using OpenACC. On 512 nodes of the Titan supercomputer, the speed-up approaches 1.6 times. A performance analysis
of the Nek5000 code using the Score-P and Vampir performance monitoring tools
was carried out. This study showed that overlapping GPU kernels with host-accelerator memory transfers would largely increase the performance of the OpenACC version of the Nek5000 code. This will be part of future research.
Acknowledgment. This research has received funding from the Swedish e-Science
Research Centre (SeRC) and the European Community's Seventh Framework Programme (ICT-2011.9.13) under Grant Agreement no. 287703, cresta.eu. We are grateful
for the computing time that was made available to us on the Raven system at Cray and
on the Titan supercomputer at Oak Ridge National Laboratory within the INCITE
program. We would also like to thank Dr. George K. El Khoury for the benchmark
used in the paper.
References
1. Fischer, P.F., Lottes, J.W., Kerkemeier, S.G.: Nek5000 web page. http://nek5000.mcs.anl.gov
2. The Second Nek5000 Users and Development Meeting, Zurich, Switzerland. http://nek5000.mcs.anl.gov/index.php/Usermeeting2012
3. Patera, A.T.: A spectral element method for fluid dynamics: laminar flow in a channel expansion. J. Comput. Phys. 54(3), 468–488 (1984)
4. OpenACC standard, June 2013. http://www.openacc-standard.org
5. Ansaloni, R., Hart, A.: Cray's approach to heterogeneous computing. In: PARCO (2011)
6. Markidis, S., Gong, J., Schliephake, M., Laure, E., Hart, A., Henty, D., Heisey, K., Fischer, P.F.: OpenACC acceleration of Nek5000, spectral element code. Int. J. High Perform. Comput. Appl. (IJHPCA)
7. Deville, M.O., Fischer, P.F., Mund, E.H.: High-Order Methods for Incompressible Fluid Flow. Cambridge University Press, Cambridge (2002)
8. Knüpfer, A., Rössel, C., Mey, D., Biersdorff, S., Diethelm, K., Eschweiler, D., Geimer, M., Gerndt, M., Lorenz, D., Malony, A., Nagel, W.E., Oleynik, Y., Philippen, P., Saviankou, P., Schmidl, D., Shende, S., Tschüter, R., Wagner, M., Wesarg, B., Wolf, F.: Tools for High Performance Computing 2011, pp. 79–91. Springer, Heidelberg (2012). doi:10.1007/978-3-642-31476-6_7
9. Knüpfer, A., Brunst, H., Doleschal, J., Jurenz, M., Lieber, M., Mickler, H., Müller, M.S., Nagel, W.E.: The Vampir performance analysis tool set. In: Resch, M., Keller, R., Himmler, V., Krammer, B., Schulz, A. (eds.) Tools for High Performance Computing, pp. 139–155. Springer, Heidelberg (2008)
10. Schlatter, P., Khoury, G.K.E.: Turbulent flow in pipes. PDC Newsletter, no. 1 (2012)
Keywords: Nek5000 · OpenACC · GPU
Introduction
The use of Graphics Processing Units (GPUs) for general-purpose computation in High Performance Computing (HPC) has increased dramatically in recent years.
GPUs are nowadays broadly used to solve computational problems in a wide
range of areas such as engineering, computational chemistry, and physics [1,2].
They have gained vast popularity as a cost-effective platform for HPC and scientific applications in recent years due to their
parallel computation capabilities and computational power. Modern GPUs are
able to run thousands of hardware threads concurrently, which allows applications
to decompose their workloads into threads without introducing a significant
overhead [3]. The market leader for these devices at present is NVIDIA, with
their current-generation Tesla [4] GPUs installed in a large number of petascale
© Springer International Publishing Switzerland 2015
S. Markidis and E. Laure (Eds.): EASC 2014, LNCS 8759, pp. 69–81, 2015.
DOI: 10.1007/978-3-319-15976-8_5
L. Cebamanos et al.
2 Background
2.1 Nek5000
2.2 Nekbone
Nekbone is a standard program provided with the Nek5000 application, and it has
been configured to capture the basic structure and user interface of the extensive Nek5000 software. It requires F77 and C compilers and has been tested
with and is supported by the IBM, Intel, PGI, and GNU compilers, although other
compilers may be used. Nek5000 is a complex Navier-Stokes solver based on the
spectral element method, whereas Nekbone solves a Helmholtz equation in a box
using the same method. Nekbone exposes the main computational kernel to reveal
the essential elements of the algorithm-architecture coupling that is relevant to
Nek5000 [14]. Our work here therefore focuses only on the optimization and auto-tuning of Nekbone, since any improvement achieved
in the computational structure of Nekbone could also be applied to Nek5000.
2.3
A standard program that executes some low-level benchmarks without requiring
any input files is supplied with the Nek5000 distribution. It runs both computation (matrix-matrix operations) and communication (ping-pong, reduction,
etc.) kernels. Although it would have been possible to comment out the communication operations of the benchmark, it was finally decided to extract the minimal amount of code needed to run the calculation benchmark. For the purpose
of this investigation, new kernels were to be added, so a verification routine was introduced to ensure correctness. A reference solution is computed using
the simplest version of the kernels. All subsequent results are compared to this
reference by computing the RMS difference, and if this exceeds a certain tolerance (set to 10⁻¹²), then an error is reported. All calculations are done in double
precision, although this is actually achieved by promoting reals to doubles using
compiler flags.
The existing DGEMM benchmark originally had the following structure:
1. Declare three arrays A, B and C, each containing M matrices
2. Loop over a high number of repetitions for timing purposes
calls: two separate calls for each of the three cases. These cases correspond to operations across different spatial dimensions of the 3-D elements. In the
Nekbone kernel, the computational load is equally distributed between the three
kernel cases. In terms of the number of elements nel, the value of M in the benchmark is set as M = nel for cases 1 and 3, and M = nel × N for case 2. The core
operation in Nekbone is implemented by a routine called ax_e. Considering that
neither the repetition of its inner operations nor the order of the calls to the three
different cases is relevant in terms of performance, we could re-write it in
the following form:
loop e = 1 to nel
    Update e using kernel case 1
    loop k = 1 to N
        Update e using kernel case 2
    end loop
    Update e using kernel case 3
end loop
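The loop structure above can be sketched in C with simple counters standing in for the three matrix-multiply kernel cases (which are not reproduced here); the counts make visible why M = nel for cases 1 and 3 but M = nel × N for case 2:

```c
/* Sketch of the re-ordered ax_e loop structure. case_counts[0..2]
   accumulate how many times each kernel case would be invoked. */
void ax_e_structure(int nel, int N, int *case_counts) {
    for (int e = 0; e < nel; ++e) {
        case_counts[0] += 1;          /* kernel case 1: once per element  */
        for (int k = 0; k < N; ++k)
            case_counts[1] += 1;      /* kernel case 2: N times per element */
        case_counts[2] += 1;          /* kernel case 3: once per element  */
    }
}
```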
Given that nel and N are likely to be fixed for many runs of Nek5000 (as they
correspond to the basic discretization parameters of the simulation), the way for
a user to optimize Nekbone performance using the benchmark is as follows:
1. Set NFLOAT based on nel; the precise choice is not important with respect to
performance, e.g. assuming that nel is large, then it only has to be large enough
to ensure that data is read from memory and is not cache resident.
2. Run the benchmark and find the routine with the best harmonic mean performance across all three cases.
3. Compile that single version and use it in all calls.
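The harmonic mean in step 2 is the appropriate average because the total time is the sum of the times spent in the three cases, so a variant that is slow in any one case is penalized accordingly. A one-line sketch (the function name is illustrative):

```c
/* Harmonic mean of three per-case performance figures (e.g. Gflops).
   Equivalent to total work divided by total time when the three cases
   perform equal amounts of work. */
double harmonic_mean3(double g1, double g2, double g3) {
    return 3.0 / (1.0/g1 + 1.0/g2 + 1.0/g3);
}
```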
OpenACC Kernels
Therefore, our new kernel routines using OpenACC accelerated code could be re-written in the following form:
Call OpenACC kernel case 1
Call OpenACC kernel case 2
Call OpenACC kernel case 3
Update values
Call OpenACC update kernel case 1
Call OpenACC update kernel case 2
Call OpenACC update kernel case 3
The new OpenACC kernels now implement variations of OpenACC-optimized code.
Algorithm 2. Simple OpenACC matrix-matrix computation
!$acc parallel loop present(a,b,c) private(i,j,k)
do j = 1, n3
  do i = 1, n1
#ifdef SCALAR
    tmp = 0.0
    do k = 1, n2
      tmp = tmp + a( i, k ) * b( k, j )
    end do
    c( i, j ) = tmp
#else
    c( i, j ) = 0.0
    do k = 1, n2
      c( i, j ) = c( i, j ) + a( i, k ) * b( k, j )
    end do
#endif
  end do
end do
3.1
Auto-tuning Technology
implementation was developed within the project. This implementation can explore a tuning parameter space by repeatedly building and running an application.
The best run is chosen using a metric obtained from the program execution;
currently this is done by exhaustive search. To accomplish a tuning run, the source
is appropriately preprocessed or compiled and an optimization process organized.
The tuning session is controlled by a DSL, either from a global configuration file or
embedded in the application source.
The DSL is a component of an auto-tuning framework, and at the highest level
it is assumed that this framework can optimize an application over a set of tuning
parameters. Some parameters we term scenario characterization parameters;
these may, for example, map to input parameters relating to problem size.
This is illustrated in Fig. 1.
For each scenario, we aim to pick the best values for a set of tuning parameters
(see Fig. 1: t1, t2, and t3). The tuning parameters relate to build and runtime
optimization choices, which we can choose to give, for example, the best runtime.
At its simplest, the auto-tuner framework can optimize over the tuning parameters; at its most complex, it can build routines and applications, choosing the best
tuning parameters for a set of scenario characterization parameters.
The structure of a DSL tuning parameters configuration file is as follows:
begin parameters
begin typing
<type-entity>
end typing
begin constraints
<constraint-entity>
end constraints
begin collections
<collection-entity>
end collections
begin dependencies
depend: <depend-list>
end dependencies
end parameters
The typing section allows parameters to be typed as int, real, or label. The
set of allowed values of a parameter is defined in the constraints section. This
supports specific sets of values, ranges, parameter relationships, and legality constraints. Parameters may be grouped into collections, and the dependency section
allows us to say which parameters should be treated as dependent, where depend-list is either a list of parameters or a list of collections.
For the purpose of this investigation we used this framework to set build parameters which chose code variants or values in OpenACC clauses. The parameters
targeted for auto-tuning in each algorithm are the number of elements, nel; the
matrix size, N; scalar reduction; the number of OpenACC gangs and workers; and
the OpenACC vector length. An example DSL script can be seen in the Appendix.
Performance Results
The performance tests of the stand-alone benchmark and the OpenACC version of
Nekbone have been carried out on a Cray XK6/XK7 system consisting of eight
compute nodes, each comprising a 2.1 GHz AMD Interlagos 16-core processor,
16 GByte of memory, and one Kepler K20 NVIDIA Tesla GPU with 5 GByte of memory. Version 8.1 of the Cray Compilation Environment (CCE) supporting OpenACC was used, and the computational performance is measured in Gflops.
4.1
Benchmark
We first tested the performance of our stand-alone benchmark as a function of the number of elements. It is expected that the performance characteristics for the GPU
will vary significantly with nel. In Fig. 2 we present the results for cases 1 and 2
running a default version of the kernels, i.e., the simplest OpenACC accelerated
kernel, which lets the compiler take the optimization decisions.
The performance of the auto-tuned version for cases 1 and 2 is shown in
Fig. 3. The performance results for case 3 are not shown because they are very similar to case 2. The performance results obtained from auto-tuning show significant
improvements over the default option in all situations. Furthermore, there is very
little difference between the auto-tuned performance for cases 1, 2, and 3.
4.2
Nekbone
Our main goal is to investigate what effect the kernel auto-tuning has on overall Nekbone performance. Therefore, after obtaining the optimal parameter settings from auto-tuning our stand-alone benchmark, they are now introduced into
an OpenACC accelerated version of Nekbone. As useful reference values we have
[Fig. 2: Performance (Gflops) versus the number of elements nel (32–8192) for polynomial orders N = 8–20 with the default kernels, cases 1 and 2. Fig. 3: the same for the auto-tuned kernels.]
used the performance results previously obtained by Markidis et al. with a hand-tuned OpenACC accelerated version of Nekbone. The performance of this
hand-tuned version can be seen in Fig. 4 (left). The maximum value of nel is often
smaller than the value of 8192 used in the kernel benchmarks; the
reason is that Nekbone uses more memory, and the application runs into memory
limits on the GPU (generally at large N).
To illustrate the effect of parameter tuning, Fig. 4 (right) shows the performance results of an auto-tuned version of Nekbone. These results
demonstrate that auto-tuning technologies can achieve similar or even
improved performance compared with hand-tuned codes. In Fig. 5 we have represented
[Fig. 4: Performance (Gflops) versus nel for the hand-tuned (left) and auto-tuned (right) versions of Nekbone, N = 8–20. Fig. 5: ratio of the auto-tuned performance to the default OpenACC performance (left) and to the hand-tuned performance (right).]
the ratio of our auto-tuned performance results to the hand-tuned performance results achieved by Markidis et al. and to those using default
OpenACC settings. It can be seen in Fig. 5 (right) that on some occasions the
auto-tuned optimized version of Nekbone achieved up to 20% performance
improvement over the hand-tuned version. Note the difference in scale between
the graphs shown in Fig. 5.
Thanks to the new Nekbone structure developed for this purpose and the exhaustive exploration of different parameter values carried out by the auto-tuner,
we have accomplished a simpler, better structured, and faster implementation of
Nekbone. Furthermore, the exploration of different OpenACC optimization algorithms has revealed that loop-collapsing techniques gave the best performance improvement among all the techniques listed in Sect. 3. Although
scalar reduction showed little performance improvement, the best-performing
vector-length values were 128 and 256.
Conclusions
The focus of this work was on accelerating Nek5000 using OpenACC compiler
directives and auto-tuning technologies. Due to the complexity of Nek5000, our
experiments have been carried out on a simplified version of the Nek5000 code,
called Nekbone, and on an extracted computational benchmark also based on
Nek5000. A naive implementation using OpenACC showed little performance
benefit compared to an auto-tuned implementation, where performance improvements of
over 2x have been achieved. In addition, we have developed an OpenACC accelerated, auto-tuned version of Nekbone. In this paper we have demonstrated that
our auto-tuned version was able to reach, and on some occasions improve on, the
performance accomplished by an OpenACC hand-tuned version.
begin configuration
begin tune
mode: scenarios
scenario-params: ALG
scope: VECTOR_LENGTH N NEL NUM_GANGS VECTOR_LENGTH
target: max
metric-source: file
postrun-metric-file: Output/output.$run_id
metric-placement: lastregexp
metric-regexp: tune run metric +(\S+)
end tune
end configuration
begin parameters
begin typing
label NUM_GANGS
label NUM_WORKERS
int VECTOR_LENGTH
int N
int NEL
int ALG
end typing
begin constraints
range NUM_GANGS none default none
range NUM_WORKERS none 1 2 4 8 16 32 default none
range VECTOR_LENGTH 128 256 512 1024 default 128
range ALG 101 102 103 104 105 106 107 108 109 110 default 101
range N 8 10 12 14 16 18 20 default 8
References
1. Egri, G., Fodor, Z., Hoelbling, C., Katz, S., Nogradi, D., Szabo, K.: Lattice QCD as a video game. Comput. Phys. Commun. 177, 631–639 (2007)
2. Yasuda, K.: J. Comput. Chem. 29, 334 (2007)
3. Fung, W.W.L., Aamodt, T.M.: Energy efficient GPU transactional memory via space-time optimizations. ACM, MICRO-46, pp. 408–420 (2013)
4. NVIDIA Tesla architecture (2014). http://www.nvidia.com/object/tesla-supercomputing-solutions.html. Accessed 14 January 2014
5. The CUDA Toolkit (2014). https://developer.nvidia.com/cuda-downloads. Accessed 14 January 2014
6. Coleman, D.M., Feldman, D.R.: Porting existing radiation code for GPU acceleration. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 6(6), 16 (2013)
7. Delgado, J., Gazolla, J., Clua, E., Masoud Sadjadi, S.: A case study on porting scientific applications to GPU/CUDA. J. Comput. Interdisc. Sci. 2(1), 3–11 (2011)
8. OpenMP 4.0 (2014). http://openmp.org/wp/. Accessed 14 January 2014
9. OpenACC home page (2014). http://openacc.org/. Accessed 14 January 2014
10. Hoshino, T., Maruyama, N., Matsuoka, S., Takaki, R.: CUDA vs OpenACC: performance case studies with kernel benchmarks and a memory-bound CFD application. In: IEEE International Symposium on Cluster Computing and the Grid, pp. 136–143 (2013)
11. Gray, A., Hart, A., Richardson, A., Stratford, K.: Lattice Boltzmann for large-scale GPU systems. In: PARCO, pp. 167–174 (2011)
12. Chen, J.H., Choudhary, A., De Supinski, B., DeVries, M., Hawkes, E., Klasky, S., Liao, W., Ma, K., Mellor-Crummey, J., Podhorszki, N., et al.: Terascale direct numerical simulations of turbulent combustion using S3D. Comput. Sci. Discov. 2, 1 (2009)
13. Fischer, P., Heisey, K., Kruse, J., Mullen, J., Tufo, H., Lottes, J.: Nek5000 Primer (2014). http://www.csc.cs.colorado.edu/voran/nek/nekdoc/primer.pdf. Accessed 10 January 2014
14. Fischer, P., Heisey, K.: NEKBONE: Thermal Hydraulics mini-application. Nekbone Release 2.1 (2013). https://cesar.mcs.anl.gov/content/software/thermalhydraulics. Accessed 10 January 2014
15. Markidis, S., Gong, J., Schliephake, M., Laure, E., Hart, A., Henty, D., Heisey, P., Fischer, P.: OpenACC acceleration of Nek5000, spectral element code
16. Shin, J., Hall, M.W., Chame, J., Chen, C., Fischer, P.F., Hovland, P.D.: Speeding up Nek5000 with autotuning and specialization. In: Proceedings of the 24th ACM International Conference on Supercomputing, pp. 253–262 (2010)
17. Patera, A.T.: A spectral element method for fluid dynamics: laminar flow in a channel expansion. J. Comput. Phys. 54(3), 468–488 (1984)
18. Dongarra, J.J., Du Croz, J., Duff, I.S., Hammarling, S.: Algorithm 679: a set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw. 16, 18–28 (1990)
19. IBM Compilers (2014). http://www-03.ibm.com/software/products/en/subcategory/SW780. Accessed 15 January 2014
20. Intel Compilers (2014). http://software.intel.com/en-us/intel-compilers. Accessed 15 January 2014
21. The Portland Group (PGI). http://www.pgroup.com/. Accessed 15 January 2014
22. The GNU Compiler Collection. http://gcc.gnu.org. Accessed 15 January 2014
23. Richardson, H.: Domain specific language (DSL) for expressing parallel auto-tuning, CRESTA Project Deliverable D3.6.2 (2014). http://cresta-project.eu/table/deliverables/year-1-deliverables/. Accessed 16 January 2014
24. Anderson, J.: Modern Compressible Flow: With Historical Perspective. McGraw-Hill, New York (2003)
25. CRESTA Research Project (2014). http://cresta-project.eu/. Accessed 20 March 2014
Development Environment
for Exascale Applications
Introduction
T. Sterling et al.
highlights a challenge when programming using the conventional practice. Several alternatives to conventional practice have been developed to better address the issues highlighted by SLOW by utilizing lightweight concurrent threads managed with synchronization primitives such as dataflow and futures, altering the application flow structure from message-passing to message-driven.
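As an illustration of the message-driven style just described (an illustrative sketch only, not any particular runtime's API), lightweight tasks synchronized through futures can be expressed in a few lines:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch: instead of a process blocking in an explicit
# receive (message-passing), work is expressed as tasks whose results
# become available through futures (message-driven), and dependent work
# is chained on the future rather than on a blocking receive.

def grind(x):
    return x * x

with ThreadPoolExecutor(max_workers=4) as pool:
    # Each task may start as soon as it is submitted; the futures act as
    # the synchronization primitive tying results back together.
    futures = [pool.submit(grind, i) for i in range(8)]
    total = sum(f.result() for f in futures)

print(total)  # sum of squares 0..7 = 140
```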
However, the performance modeling necessary to understand and manage asynchrony effects at scale can be especially challenging for emerging programming models that rely on lightweight concurrent threads. Trace-driven approaches for such programming models tend to substantially alter the application execution path itself, while cycle-accurate simulations tend to be too expensive for co-design efforts. While discrete event simulators have been successfully used for the performance modeling of many-tasking execution models before [4], they require both an implementation of the execution model in the simulator and a skeleton application implementation. This skeleton code has to preserve the dataflow of the original application while appropriately modeling the computational costs of the full application in between communication requests.
A robust implementation of the execution model in the discrete event simulator and a close correspondence between the skeleton code and the full application are both crucial for achieving accurate performance predictions. A skeleton code that closely represents the computational costs and dataflow of the full application can be especially difficult to achieve because a significant code fork is necessary to develop it. Updates and improvements made to the full application code are not automatically reflected in the skeleton code, and inconsistencies between the two codes are easily introduced. Likewise, accuracy in implementing the execution model in the event simulator is also difficult to achieve: modeling the contention on resources, the variable overheads when using concurrent threads, the highly variable communication incidence rates, the network latency hiding, the thread schedulers and associated contention, and the oversubscription behavior all complicate the implementation of the execution model in the discrete event simulator.
This paper presents a performance modeling case study for many-tasking execution models which incorporates performance modeling directly into the runtime system implementation of the execution model, without requiring a skeleton code or application traces. A runtime system is the tool best equipped for performance modeling an application: it comes with the necessary introspection capability, it does not require a skeleton code separate from the application, and it is itself already a robust implementation of the execution model it represents. For this case study, the performance modeling capability of the HPX-5 runtime system is explored for a proxy application developed by one of the US Department of Energy co-design centers: the Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH) proxy application code [1]. LULESH has been ported to multiple programming models, both emerging and traditional, and its scaling behavior has been extensively explored, making it a good candidate for this case study. More importantly, the scientific kernel
Related Work
While there have been a large number of approaches to developing application-independent performance modeling techniques, most of these have centered around the Communicating Sequential Processes (CSP) execution model. Trace-driven approaches are a key component in many performance modeling and co-design frameworks, including DUMPI in SST/Macro [11], LogGOPSim [10], and the Performance Modeling and Characterization (PMaC) framework [6]. A key challenge in trace-driven approaches is the trace collection overhead. Carrington et al. [5] demonstrate how to reduce this overhead by extrapolating from traces collected at smaller core counts to larger core counts. While trace-based approaches generally do not require changes to the user application and work well with the coarse-grained computation style favored by CSP, trace collection overhead can significantly alter the execution
[Fig. 1: diagram contrasting simulation stacks. Traditional stacks pair an application or application skeleton with execution traces, a programming model (MPI, HPX, Charm++), an execution model (runtime system), and a simulator (SST, BigSim) over a machine model, mixing computation, emulation, and simulation. The proposed stack runs the application directly on the programming model (HPX), with the execution model and simulator combined in the HPX runtime system over an abstract or actual machine model.]
Performance Modeling
This case study targets performance modeling scenarios where an actual or prototype Exascale node is available. Unlike traditional simulation approaches, the proposed simulation methodology involves neither generating traces nor writing a skeleton code, but rather integrating the simulation capability with the runtime system. Figure 1 highlights the differences between traditional simulation approaches and what is proposed here. This alternative approach is motivated by the goal of improving user access to performance modeling, the rapid increase in the number of many-tasking execution models, and the ability of modern runtime systems to incorporate all the introspection mechanisms necessary to operate in a performance modeling simulation mode. Further motivation as to why the runtime system is well suited for this type of modeling is provided in Sect. 4.
For any application, the many-tasking runtime system has full and direct access to the task phase information. When the runtime is operating in a simulation modality on prototype Exascale nodes, a sample of nodes is selected for performing the application simulation. Other nodes that directly interact with these nodes are also simulated, but only for a small set of communication iterations, to provide accurate message incidence rates for the sample nodes. Using select iteration snapshots in the course of the application simulation, the runtime system uses these sample nodes to predict application performance at the scale indicated by the user. While this approach does not require traces, it has the disadvantage of not providing performance predictions for the entire duration of application execution; predictions are provided only for a specific subset of communication iterations.
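The node classification this sampling scheme relies on can be sketched as follows (an illustrative sketch only; the ring topology, the function name, and the single-sample choice are our assumptions, not details of HPX-5):

```python
# Sketch of the sampling classification: "sample" nodes run the
# application fully; their direct neighbors ("buffer" nodes) run only to
# supply message incidence rates; all remaining nodes are non-running
# "ghosts" whose traffic is handled by a network model.

def classify(nodes, neighbors, sample):
    sample = set(sample)
    buffer = {m for n in sample for m in neighbors[n]} - sample
    ghost = set(nodes) - sample - buffer
    return sample, buffer, ghost

# A 1-D ring of 8 nodes as a stand-in topology for illustration.
nodes = list(range(8))
neighbors = {n: [(n - 1) % 8, (n + 1) % 8] for n in nodes}
s, b, g = classify(nodes, neighbors, sample=[3])
print(sorted(s), sorted(b), sorted(g))  # [3] [2, 4] [0, 1, 5, 6, 7]
```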
The approach is illustrated in Fig. 2. Each square and circle represents a node in an Exascale simulation, while arrows indicate communication. When the runtime system is in simulation mode, a user-defined set of sample nodes, indicated by the red outlined boxes, is selected for running the application. Nodes which interact with this sample set, indicated by blue circles, are identified by the runtime system accessing the node interaction data. The application is also run on these nodes in order to provide correct incidence rates and phase information to the sample nodes, but their runtime information, such as specific execution times of various subroutines, is not used in the performance prediction. Network communication is performed between all nodes that are running the application, while a network model handles communication between circle nodes and non-running ghost nodes, indicated by blue squares. Green arrows indicate network traffic approximated by a network model, red arrows indicate real network traffic, and black arrows indicate traffic not modeled.
The accuracy of the predictions relies on how well the sample nodes represent the overall state of the application. For static dataflow applications which are well balanced, this is easily achieved with a very small sample size. For highly dynamic applications, it would not be unlikely to require terascale computing in order to predict Exascale performance.
This runtime-system-based approach can be improved and refined in several ways. The number of buffer nodes which provide incidence rate and node interaction information to the sample set can be increased to improve accuracy. Likewise, the introspection capability of the runtime system can be expanded to directly model these phases and incidence rates while in full computation mode and then re-used later in the simulation modality, still avoiding trace collection. In this case study, we present results from the simplest performance modeling approach, where sample nodes operate in full computation mode and all other nodes operate as ghost (non-computing) nodes. The following section gives details about the runtime system selected for this case study and how runtime system capabilities are well suited for taking on the role of performance modeling.
Fig. 2. A runtime-system-based performance modeling approach. Each square and circle represents a node in an Exascale simulation, while arrows indicate communication. When the runtime system is in simulation mode, a user-defined set of sample nodes, indicated by the red outlined boxes, is selected for running the application. Nodes which interact with this sample set, indicated by blue circles, are identified by the runtime system accessing the node interaction data. The application is also run on these nodes in order to provide correct incidence rates and phase information to the sample nodes, but their runtime information is not used in the performance prediction. Network communication is performed between all nodes that are running the application, while a network model handles communication between circle nodes and non-running ghost nodes, indicated by blue squares. Green arrows indicate network traffic modeled by a network model, red arrows indicate real network traffic, and black arrows indicate traffic not modeled.
Performance modeling of Exascale systems requires fundamentally new approaches due to the demands of both scale and complexity. Trace-based methodologies are infamous for generating prohibitively large volumes of data when run on many nodes of a large system, necessitating the use of the
Threads and LCOs, along with related data structures, can be embedded in ParalleX processes: entities that hierarchically organize parallel computation and provide logical encapsulation for its individual components. Unlike UNIX processes, they can span multiple localities (and therefore multiple address spaces). Processes, threads, and LCOs may migrate between nodes and are globally addressable, permitting the programmer to access them from anywhere in the system. This is controlled by the Active Global Address Space (AGAS), a distributed service that maintains lookup tables storing the physical locations of all first-class objects of the computation. ParalleX functionality manifests itself primarily in the runtime system layer, which, through its proximity to the application code, permits additional optimizations and acts as an intermediate layer for access to expensive (in terms of overhead and latency) OS kernel services. A ParalleX-compliant runtime system implements introspection, supporting direct access to integrated performance counters and enabling monitoring of application activity. This is particularly valuable for low-overhead collection of performance data.
HPX-5 is a high performance runtime system that implements the ParalleX model, providing the ability to run HPC applications at scale and to simulate the performance characteristics of code without fully running the application.
Written in C and assembly, the HPX-5 runtime system is focused primarily on algorithmic correctness, performance, and stability. To achieve this, HPX-5 is developed with an extensive suite of tests that execute well-known scientific codes with published results and uses these to ensure correctness and stability.
The runtime is highly modular and is comprised of several components, including:
- A user-space thread manager made up of M:N coroutines similar to Python green threads. HPX-5 threads are continuously rebalanced across logical CPU cores in a NUMA-aware way that ensures a high degree of continuous work.
- An asynchronous network layer built on RDMA verbs, capable of running on InfiniBand, Cray Gemini, and Ethernet networks as well as in a non-networked (SMP) environment.
- A parcel dispatch system that routes messages between objects and makes runtime optimizations through direct integration with the node's network interface controller (NIC).
- A variety of distributed lock-free control structures, including futures and logical gates, that provide programmers with an easy-to-use environment in which to define application dataflow.
- An active global address space (AGAS) that automatically distributes and balances data across all nodes in an HPC system.
- Support for multi-core embedded architectures (such as ARM).
- Instrumentation to perform simulations of application runs in a variety of environments, using spec files that describe several well-known machines.
In addition to normal operation, the HPX-5 runtime supports a simulation mode in which it models performance of a full (non-skeleton) computation
ParalleX LULESH
Fig. 3.
Results
Strong and weak scaling results for HPX-5 LULESH are presented in this section, along with the runtime system's performance predictions. All computations and simulations were performed on a 16-node Xeon E5-2670 2.60 GHz cluster with an InfiniBand interconnect. The oversubscription factor for all distributed cases was two; that is, the entire LULESH computational domain was partitioned into twice as many subdomains as available cores.
Our simulation approach is most similar to SMPI [7], where online simulation (or emulation) is performed on a subset of the nodes. The rest of the nodes in the simulation are either ignored or simulated, depending on the application requirements. In the case of LULESH, we computed the global values offline such that there were no message dependencies from the simulated nodes to the emulated nodes. For structured communication patterns, we use periodic boundary conditions to meet the receive dependences from the simulated nodes to the emulated nodes. Since the pending receives can generate load on the emulated nodes, we are presently working on recovering these dependences through offline traces. Communication is performed only between emulated nodes. For network simulation, we used the LogP cost model [8] to calculate communication time for the simulated nodes. Under the assumption that each parcel is sent
using a single message¹, per the LogGP [3] model, a send was computed to take 2o + (n − 1)G + L cycles, where L is the network latency, o is the overhead of transmission, and G is the gap per byte. The LogP parameters for the 16-node Xeon E5-2670 2.60 GHz cluster were measured empirically for the above experiments.
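The send-cost expression above can be sketched directly (the parameter values here are invented for illustration and are not the empirically measured cluster parameters):

```python
# Sketch of the LogGP send cost used in the text: a single message of
# n bytes is charged 2*o + (n - 1)*G + L cycles, where L is the network
# latency, o the per-message overhead, and G the gap per byte.

def loggp_send_cycles(n_bytes, L, o, G):
    return 2 * o + (n_bytes - 1) * G + L

# Illustrative parameter values only (not the measured cluster values).
cost = loggp_send_cycles(n_bytes=1024, L=2000, o=500, G=4)
print(cost)  # 2*500 + 1023*4 + 2000 = 7092
```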
In Fig. 5, the workload was increased from 1 to 512 domains as the number of nodes was increased from 1 to 16. The simulator introduces some overhead since it has to inspect every message and either emulate or simulate it. We found that the predicted value was within 25% of the actual running time. The strong-scaling results in Fig. 6 confirm the above observation. For the above runs, each simulated workload was run with half the number of actual nodes. Figure 7 shows the simulation accuracy of our online simulation approach. We see that the accuracy improves (that is, the difference between the emulated and simulated values decreases) as the number of emulated nodes is increased. This confirms the trade-off between simulation accuracy and the computation requirements for the simulation. As stated previously, simulating the performance of the application at Exascale levels might demand considerable computational resources. Hence, an approach where the accuracy can be bounded by sampling a subset of the available nodes might be favorable.
[Figure 5: weak-scaling plot; x-axis: number of nodes (1-16), y-axis: runtime; curves: Full Emulation, Computation, Simulation.]
Fig. 5. Weak scaling results for HPX-5 LULESH. Computation represents the actual running time for a fixed workload for 500 iterations. Full Emulation indicates the time to perform full emulation of the workload using our hybrid emulation and simulation approach. Simulation shows the running time predicted by the simulator.
¹ Almost all messages were under 32K for our HPX-5 port of the LULESH application.
[Figure 6: strong-scaling plot; x-axis: number of nodes (1-16), y-axis: runtime; curves: Full Emulation, Computation, Simulation.]
Fig. 6. Strong scaling performance of HPX-5 LULESH across 16 nodes. The description of the legend is the same as in the previous figure, Fig. 5.
[Figure 7: simulation accuracy; x-axis: number of nodes (1-16), y-axis: runtime (seconds); curves: Simulation, Computation.]
Fig. 7. Simulation accuracy as the number of emulated nodes is increased. A prediction is more accurate if the difference between the computation and simulation times is lower.
Conclusions
Efficiency and scalability requirements for high performance computing applications have cultivated the development of new programming models which employ fine- and medium-grain task parallelism, creating challenges for performance modeling at Exascale. In particular, trace-driven approaches cause significant problems for runtime systems using lightweight concurrent threads, while discrete event simulators require skeleton codes which are difficult to reliably extract from the full application codes. At the same time, runtime systems now regularly provide the introspection capability to reliably carry out performance modeling within the runtime system itself. An approach to incorporating performance modeling in the runtime system has been described here for use in cases where a prototype Exascale node is available for computation. Using a sampling approach in conjunction with a network model, a runtime system can be quickly transformed into a performance modeling tool, requiring neither traces nor discrete event simulation.
A case study has also been presented here, in which the LULESH proxy application was ported to the HPX-5 runtime system and run in both the computation and simulation modalities provided by the runtime. The HPX-5 LULESH port illustrates all of the features of a many-tasking implementation, including oversubscription, asynchrony management semantics, and active messages. Strong and weak scaling results were provided for comparison between the computation and simulation modalities.
Incorporating performance modeling into modern runtime systems resolves several issues when operating at Exascale while also simplifying co-design for application developers. While such an approach is new and mostly untested, it can ultimately remove one layer of separation between application development and performance modeling for approaches employing fine- and medium-grain task parallelism.
Acknowledgments. The authors acknowledge Benjamin Martin, Jackson DeBuhr, Ezra Kissel, Luke D'Alessandro, and Martin Swany for their technical assistance.
References
1. Hydrodynamics Challenge Problem, Lawrence Livermore National Laboratory. Technical Report LLNL-TR-490254
2. Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH). https://codesign.llnl.gov/lulesh.php
3. Alexandrov, A., Ionescu, M.F., Schauser, K.E., Scheiman, C.: LogGP: incorporating long messages into the LogP model: one step closer towards a realistic model for parallel computation. In: Proceedings of the Seventh Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA 1995, pp. 95-105. ACM, New York, NY, USA (1995)
4. Anderson, M., Brodowicz, M., Kulkarni, A., Sterling, T.: Performance modeling of gyrokinetic toroidal simulations for a many-tasking runtime system. In: Jarvis, S.A., Wright, S.A., Hammond, S.D. (eds.) High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation. LNCS, pp. 136-157. Springer, Heidelberg (2014)
5. Carrington, L., Laurenzano, M., Tiwari, A.: Inferring large-scale computation behavior via trace extrapolation. In: Large-Scale Parallel Processing Workshop (IPDPS 2013)
6. Carrington, L., Snavely, A., Gao, X., Wolter, N.: A performance prediction framework for scientific applications. In: ICCS Workshop on Performance Modeling and Analysis (PMA'03), pp. 926-935 (2003)
7. Clauss, P.-N., Stillwell, M., Genaud, S., Suter, F., Casanova, H., Quinson, M.: Single node on-line simulation of MPI applications with SMPI. In: 2011 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 664-675 (2011)
8. Culler, D., Karp, R., Patterson, D., Sahay, A., Schauser, K.E., Santos, E., Subramonian, R., von Eicken, T.: LogP: towards a realistic model of parallel computation. In: Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 1993, pp. 1-12. ACM, New York, NY, USA (1993)
9. Gao, G., Sterling, T., Stevens, R., Hereld, M., Zhu, W.: ParalleX: a study of a new parallel computation model. In: 2007 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 1-6 (2007)
10. Hoefler, T., Schneider, T., Lumsdaine, A.: LogGOPSim - simulating large-scale applications in the LogGOPS model. In: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, pp. 597-604. ACM, June 2010
11. Janssen, C.L., Adalsteinsson, H., Cranford, S., Kenny, J.P., Pinar, A., Evensky, D.A., Mayo, J.: A simulator for large-scale parallel computer architectures. IJDST 1(2), 57-73 (2010)
12. Sottile, M., Dakshinamurthy, A., Hendry, G., Dechev, D.: Semi-automatic extraction of software skeletons for benchmarking large-scale parallel applications. In: Proceedings of the 2013 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, SIGSIM-PADS 2013, pp. 1-10. ACM, New York, NY, USA (2013)
13. Spafford, K.L., Vetter, J.S.: Aspen: a domain specific language for performance modeling. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC 2012, pp. 84:1-84:11. IEEE Computer Society Press, Los Alamitos, CA, USA (2012)
14. Sterling, T.: Towards a ParalleX-enabled Exascale Architecture. Presentation to the DOE Architecture 2 Workshop, 10 August 2011
15. Totoni, E., Bhatele, A., Bohm, E., Jain, N., Mendes, C., Mokos, R., Zheng, G., Kale, L.: Simulation-based performance analysis and tuning for a two-level directly connected system. In: Proceedings of the 17th IEEE International Conference on Parallel and Distributed Systems, December 2011
16. Zheng, G., Wilmarth, T., Lawlor, O.S., Kale, L.V., Adve, S., Padua, D., Geubelle, P.: Performance modeling and programming environments for Petaflops computers and the Blue Gene machine. In: NSF Next Generation Systems Program Workshop, 18th International Parallel and Distributed Processing Symposium (IPDPS), p. 197. IEEE Press, Santa Fe, New Mexico, April 2004
Introduction
it comes to understanding the full range of complexity of this problem. One case in point is the recent study by Beckman et al. [2], in which the authors define and identify noise asynchrony as a key property in understanding how noise degrades application performance. Furthermore, they show that asynchronous noise has a much greater negative impact on the performance of global collective operations than synchronized noise.
We have observed a similar phenomenon in previous work, in which a classic bulk-synchronous nearest neighbor time stepping algorithm was compared to a new, resilient formulation [5]. In that case, application runtime in the presence of Gaussian random noise was consistently underestimated by our predictions. We attributed this discrepancy to a failure to properly account for the complex asynchronous-type properties characteristic of randomly distributed noise.
To our knowledge, no one has attempted to quantify asynchronous noise or its impact on the performance of bulk-synchronous codes. This realization has motivated our current study, in which we present a simple quantitative model for asynchronous noise. Furthermore, we demonstrate that the mechanisms underlying asynchronous noise result in bounding behavior of the performance of bulk-synchronous algorithms in the presence of arbitrary and non-deterministic noise.
This work provides a valuable and necessary first step towards developing closed-form analytical models of the runtimes of general scientific computations in the presence of computational noise. Such work can provide valuable insight to hardware vendors and to system software and middleware developers in designing the next generation of hardware architectures and runtime environments. Moreover, these insights have intrinsic value in themselves at a time when computational noise is becoming an increasingly prominent and necessary evil on the path to exascale.
A. Hammouda et al.

In order to properly examine how the performance of bulk-synchronous computations is impacted by asynchronous noise, it is necessary to rigorously define what is meant by noise broadly, and then specifically what is meant by asynchronous noise.

Table 1. The Defining Variables of Noise.

Symbol | Term        | Definition
τ      | Period      | the time between successive detours on a process
α      | Asynchrony  | the phase difference between the detour trains of adjacent processes, as a fraction
λ      | Lag         | the time before a process experiences its first detour
As in a previous study [5], we refer to each individual delay in an application's execution caused by an event external to the application itself as a detour. Noise then refers to the aggregate phenomenon of all detours over some period of time.
If every detour occurs with a specified period, asynchrony can be thought of as the extent to which adjacent processes experience a phase difference between each other's periods. This in turn affects the extent to which detours on neighboring processes overlap. Each process experiences its first detour after a lag, λ_i, defined by

    λ_i = λ          if process i is odd,
    λ_i = λ + αT     otherwise,                    (1)

where λ is a baseline lag, α is the asynchrony, and T is the detour duration. Here, 0 ≤ i ≤ nprocs − 1, and nprocs is the number of processes involved in a simulation. The full parameter space governing the manifestation of asynchronous noise is given in Table 1. For an illustration of asynchrony, see Fig. 1.
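The lag assignment in Eq. (1), with a baseline lag shifted by αT on every other rank, can be sketched as follows (the symbol names lam, alpha, and T and the numeric values are illustrative, and the exact form of the equation is our reading of the text):

```python
# Sketch of Eq. (1): odd-ranked processes experience their first detour
# after a baseline lag lam; even-ranked processes after lam + alpha*T,
# so the asynchrony alpha shifts every other process's detours in phase.

def first_detour(rank, lam, alpha, T):
    return lam if rank % 2 == 1 else lam + alpha * T

lags = [first_detour(i, lam=1.0, alpha=0.5, T=4.0) for i in range(4)]
print(lags)  # [3.0, 1.0, 3.0, 1.0]
```

With alpha = 0 all ranks share the same lag (fully synchronous detours); with alpha = 1 even ranks are offset by a full detour duration.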
Experiments
We now examine the impact of asynchronous noise on bulk-synchronous algorithms. In order to simulate both constant-frequency noise and Gaussian-distributed noise, we utilize a set of noise generation utilities developed initially for previous work [5]; documentation can be found with its source code online [9]. As a representative bulk-synchronous algorithm, we utilize an explicit-time implementation of the 2D heat equation, a simple representative stencil computation. Following [5], we refer to the traditional bulk-synchronous implementation as the classic algorithm. Each experiment is parameterized by the number of timesteps, nsteps, the number of processes, nprocs, and the computation time for a single timestep in seconds, C. While C is determined by underlying algorithmic features, it is the more useful quantity here since we are ultimately concerned with the runtimes of bulk-synchronous codes.
The experiments in this section are carried out on the Argonne Leadership Computing Facility's Cetus machine: an IBM BG/Q with 1600 MHz PowerPC A2 cores, 1 GB RAM per core, 16 cores per node, and a proprietary 5D torus network interconnect. Each experiment uses a stencil size of 2,500² points per process, which resulted in a classic algorithm step duration of C = 0.626 s. Furthermore, each experiment is run on 5,016 processes for nsteps = 100. All communication uses the eager protocol, and MPI asynchronous progress is enabled so as to avoid analyzing the complicating secondary impacts of asynchrony on rendezvous handshakes and other pieces of communication overhead.
3.1 Relative Runtime
[Figure 2: Relative Runtime (1-5) of the classic algorithm versus asynchrony α (0-0.8); curves: T = 5C and T = 10C.]
Fig. 2. The Relative Runtime of the classic algorithm for various levels of asynchrony, α. Relative Runtime is the runtime of each data point divided by the runtime of the very first data point for each curve (where α = 0, and T = 5C or T = 10C, respectively). In these experiments τ = T, and experiments where T = 5C and 10C are plotted.
In this experiment, we examine the effects that the frequency of detours has on the performance of the classic algorithm for both completely synchronous noise and completely asynchronous noise (α = 0.0 and α = 1.0, respectively). Here, frequency is simply the inverse of the period between detours (τ⁻¹). The results of these experiments are plotted in Fig. 3, which plots runtime against the period between detours.
[Figure 3: runtime (seconds, 50-350) of the classic algorithm versus detour period (10-16, in units of C).]
Fig. 3. The Runtime of the classic algorithm, given a range of constant-period detours. τ is plotted in units of C. Every simulation has the same detour duration throughout, with a value of T = 5C.
In this experiment, we return to the original question: can we explain a bulk-synchronous algorithm's performance in the presence of Gaussian-distributed noise? We replace the constant-frequency detours employed in previous experiments with randomly sampled spacings. Furthermore, the lag on each process (formerly given by Eq. (1)) is given by sampling from a uniform distribution seeded by the process rank. After the first detour, a Gaussian distribution is
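The noise-generation scheme just described can be sketched as follows (a sketch under stated assumptions: the function name, parameters, and the clipping of negative samples are ours, not details of the utilities in [9]):

```python
import random

# Sketch of the non-deterministic noise generator: each process's lag is
# drawn from a uniform distribution seeded by its rank, and subsequent
# detour spacings are drawn from a Gaussian with mean mu and std sigma.

def detour_times(rank, ndetours, mu, sigma, max_lag):
    rng = random.Random(rank)                # seeded by the process rank
    t = rng.uniform(0.0, max_lag)            # lag before the first detour
    times = [t]
    for _ in range(ndetours - 1):
        t += max(0.0, rng.gauss(mu, sigma))  # Gaussian spacing, clipped at 0
        times.append(t)
    return times

ts = detour_times(rank=0, ndetours=5, mu=10.0, sigma=2.0, max_lag=10.0)
assert len(ts) == 5 and all(b >= a for a, b in zip(ts, ts[1:]))
```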
[Figure 4: two panels, each plotting runtime (seconds, 50-350) of the classic algorithm versus detour period (10-16, in units of C).]
(a) Deterministic asynchrony compared to Gaussian noise without limits on the number of detours. (b) Deterministic asynchrony compared to Gaussian noise with limits on the number of detours.

Fig. 4. The Runtime of the classic algorithm, given a range of constant-period detours and random-period detours (with a constant period of τ and an average period of μ, respectively). τ and μ are plotted on the same axis in units of C. This allows for a comparison between deterministic noise simulations with predefined levels of asynchrony and nondeterministic noise simulations with less predictable levels of asynchrony. Every simulation has the same detour duration throughout, with a value of T = 5C.
randomly distributed detours. The illustration of Fig. 5 further shows how execution time that is completely noiseless under constant-frequency asynchronous noise is diminished by randomly occurring detours in the τ − T time window. This further explains why completely asynchronous noise does bound the performance of the classic algorithm for the case where τ = μ = T: in this case, the time window τ − T = 0. Understanding this data, and our hypothesis, requires further analysis and research.
Conclusions
This study has presented a model for asynchronous noise and examined the impact that such noise has on the runtimes of nearest-neighbor-synchronizing bulk-synchronous codes. The analysis of these runtimes has indicated that asynchrony acts as a bounding property on the performance of bulk-synchronous algorithms in the presence of arbitrary noise profiles, be they deterministic or non-deterministic. That having been said, the model of asynchrony developed in this study cannot explain all of the performance observed. The model's limitations need to be further explored and refined. The power of these results, and the promise of further research in this vein, is that we have identified a deterministic mechanism underlying what more often than not results from randomly occurring sources of performance degradation in HPC applications. Understanding this mechanism gives some insight into how best to work around it.
License
The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory (Argonne). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC02-06CH11357. The U.S. Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government.
Acknowledgements. This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-06CH11357. This research used the University of Delaware's Chimera computer, funded by U.S. National Science Foundation award CNS-0958512. S.F. Siegel was supported by NSF award CCF-0953210.
1 Technische Universität Dresden, 01062 Dresden, Germany
{tobias.hilbrich,michael.wagner2,wolfgang.nagel}@tu-dresden.de
2 RWTH Aachen University, 52056 Aachen, Germany
{protze,mueller}@rz.rwth-aachen.de
3 JARA High-Performance Computing, 52062 Aachen, Germany
4 Lawrence Livermore National Laboratory, Livermore, CA 94551, USA
{schulzm,bronis}@llnl.gov
Introduction
Fig. 1. (a) The layout of a GTI tool: application processes in layer l0 and tool places in the higher layers l1 and l2. (b) The three communication directions of a place: application, root, and intralayer.
We develop the Generic Tools Infrastructure (GTI) [8] to simplify the development of such scalable runtime tools, in particular tools that analyze large numbers of events (function invocations or communication events) in Message Passing Interface (MPI) [15] applications. These tools analyze events for use cases such as performance optimization or debugging. Performance analysis tools like Vampir [17] and Scalasca [5] use traces to store events during the runtime of an application and then apply a post-mortem analysis. However, tool-exclusive computing resources and a Tree-Based Overlay Network (TBON) abstraction allow tools built upon GTI to analyze such event data during the runtime of an application, in other words, online.
GTI uses extra processes as additional compute resources for the tool itself. These tool processes, called places in GTI, can analyze events outside of the critical path of the application. Additionally, GTI organizes places in hierarchy layers that can apply stepwise event analysis (a TBON layout), e.g., all application processes provide an event with an integer value and the hierarchy layers sum these events up until the root of the layout retrieves a global sum. This combination of event offloading, analysis outside the critical path, and hierarchic event analysis enables a wide range of scalable tools. Figure 1(a) illustrates the layout of a GTI tool for four application processes, represented as circles with labels T0,0 to T0,3, and three tool places T1,0, T1,1, and T2,0. The lines between the circles indicate the communication channels for events, e.g., the application process T0,0 would usually forward events to tool place T1,0 for analysis. The tool places can analyze events from the application processes, but can also use the communication capabilities of the layout to exchange information with each other.
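The stepwise, layered aggregation described above can be sketched as a small simulation. This is our own minimal illustration of the TBON idea, not GTI code; `tbon_reduce` and its `fan_in` parameter are hypothetical names.

```python
# Minimal sketch of a TBON reduction: each hierarchy layer sums the
# integer events of its children until the root holds the global sum.
# Names are illustrative, not GTI's actual API.

def tbon_reduce(leaf_values, fan_in):
    """Aggregate leaf events layer by layer with the given fan-in."""
    layer = list(leaf_values)
    while len(layer) > 1:
        # Each place in the next layer sums the events of up to
        # `fan_in` children from the layer below.
        layer = [sum(layer[i:i + fan_in])
                 for i in range(0, len(layer), fan_in)]
    return layer[0]  # value held by the root place

# Four application processes (cf. Fig. 1(a)) and a fan-in of two:
# T1,0 sums T0,0+T0,1, T1,1 sums T0,2+T0,3, and T2,0 sums both.
print(tbon_reduce([1, 2, 3, 4], fan_in=2))  # 10
```

Each layer performs only local, partial work, which is what lets the analysis scale with the number of application processes.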
The GTI-based tool MUST [7] analyzes all communication operations of an application to reveal MPI usage errors. The tool applies a comparatively expensive event analysis as part of its deadlock detection scheme. Thus, the event handling and analysis cost of MUST may exceed the original cost of the communication operations on the application. Under such a scenario, an online event analysis tool like MUST can consume increasing amounts of memory and may fail due to memory exhaustion. Even on a compute system with 24 GB of main memory per compute node, shared between 12 cores, MUST repeatedly exhausted memory for one benchmark application in a study of its deadlock detection capabilities [7]. This paper describes and studies two techniques for TBON-based event analysis tools to avoid memory exhaustion problems. Specifically, these techniques avoid storing data in files, since the use of the I/O subsystem imposes further challenges at scale [11,22]. This research may particularly enable new tool workflows for Exascale-level compute systems, which increase the challenges around massively parallel I/O. An increasing use of online tools could circumvent the challenges that these systems impose on traditional post-mortem tools.
Section 2 first presents related work, and Sect. 3 then details our assumptions for the communication channels of a TBON and refines our problem statement. Section 4 contains our first technique, a heuristic that provides tool places with a channel selection that offers a tunable trade-off between performance and memory consumption. Section 5 then describes our second technique, which temporarily pauses the execution of an application to let a tool catch up with its event analysis. We implement these techniques in our tool infrastructure GTI and evaluate them with the runtime MPI correctness tool MUST, which previously failed for some SPEC MPI2007 benchmarks. An application study with MPI2007 and the NAS Parallel Benchmarks (NPB) evaluates our techniques at up to 16,384 processes and shows that they avoid memory exhaustion in practice (Sect. 6).
Related Work
with application scale [14] are a related problem that can limit the applicability
of an online tool.
Storing traces in the file system is an alternative to our techniques, which target reduced memory needs during event analysis. Our analyses could store temporary event information in traces to avoid memory exhaustion. Tools such as Vampir [17] and Scalasca [5] successfully employ traces for their performance analysis. However, file systems can impose scalability challenges [11,22] as well. Various approaches exist to mitigate the effect of this bottleneck, e.g., trace reduction [21], trace compression [19], and I/O forwarding [11].
Figure 1(a) illustrates a TBON layout. For GTI, application processes and tool places use up to three different communication directions, as Fig. 1(b) illustrates. The application direction allows a place to receive events that travel from the application processes towards the root, the root direction allows a place to receive events that travel from the root towards the application processes (usually control and steering), and the intralayer direction provides GTI tools with a point-to-point communication means within a hierarchy layer. The latter communication direction facilitates tool analyses such as point-to-point message matching, for which pure TBON layouts could limit scalability [10]. The arrows in Fig. 1(b) illustrate that tool places can probe any communication channel from any of these three communication directions to receive a new event. Each communication channel is bidirectional and has a certain event capacity. That is, if an application process or a tool place sends an event over a channel, it can continue its execution before the receiver side has handled the event, as long as the capacity of the channel suffices to store the new event. If a communication channel reaches its capacity, it will block any subsequent send operations until the receiver side drains some events from the channel. In GTI, this capacity depends on the selection of the communication system, which can either be optimized for bandwidth, offering high capacities, or for latency, offering only low capacities.
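These capacity semantics behave like a bounded queue: a send completes immediately while the channel can still buffer the event, and would block once the capacity is reached. A minimal sketch using Python's standard `queue` module (the capacity value of 3 is arbitrary, and this is not GTI's actual communication system):

```python
from queue import Queue, Full

# A communication channel with a capacity of 3 events; put_nowait
# models a send that would have to block once the channel is full.
channel = Queue(maxsize=3)

for event in ("e0", "e1", "e2", "e3"):
    try:
        channel.put_nowait(event)  # sender continues immediately
    except Full:
        print(f"{event}: sender would block until the receiver drains")

channel.get()               # receiver drains one event ("e0")
channel.put_nowait("e3")    # now the blocked send can complete
```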
Analysis algorithms such as point-to-point message matching [10] or deadlock analysis [7], as well as tool infrastructure services such as order-preserving event aggregation [9], can consume increasing amounts of memory if newly received events do not satisfy certain conditions. In such scenarios, the channel selection of a tool place can heavily impact the memory consumption of a tool. We illustrate this with MPI point-to-point message matching as an example analysis that searches for pairs of send and receive events with matching message envelopes. If a new send/receive event arrives and no matching receive/send is available, then the analysis stores information on the new event in a matching table, i.e., memory consumption increases. Otherwise, if a new send/receive event completes a pair (a matching receive/send event was present in the matching table), the analysis can remove the latter event from the table. Thus, the memory consumption of the analysis decreases. This analysis enables correctness tools like MUST to implement MPI type matching checks that can reveal incorrect data transfers.
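As a sketch of this bookkeeping (our own simplification, not MUST's or GTI's implementation), the matching table can be modelled as two multisets of pending events keyed by the message envelope:

```python
from collections import Counter

class MatchTable:
    """Toy point-to-point matcher: stores unmatched sends and receives
    keyed by the message envelope (source, dest, tag, communicator)."""
    def __init__(self):
        self.pending_sends = Counter()
        self.pending_recvs = Counter()

    def handle(self, kind, envelope):
        mine, other = ((self.pending_sends, self.pending_recvs)
                       if kind == "send" else
                       (self.pending_recvs, self.pending_sends))
        if other[envelope] > 0:
            other[envelope] -= 1   # pair completed: memory decreases
        else:
            mine[envelope] += 1    # no partner yet: memory increases

    def size(self):
        return (sum(self.pending_sends.values())
                + sum(self.pending_recvs.values()))

t = MatchTable()
t.handle("send", (0, 1, 42, "comm_world"))
assert t.size() == 1           # unmatched send is stored
t.handle("recv", (0, 1, 42, "comm_world"))
assert t.size() == 0           # completed pair leaves the table
```

The order in which events arrive at the place, and hence the channel selection, directly determines how large `size()` grows.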
As an example, a single tool place could receive events from all application processes in order to match MPI point-to-point operations; in other words, the tool uses a TBON that consists of the application processes and a root. In that case, the single tool place exclusively uses the application communication direction and only needs to select which application process to receive an event from. A round-robin scheme efficiently handles homogeneous applications where all MPI processes execute similar events, such as the example pattern in Fig. 2(a). Given that all channels provide an event when probed, the matching table of the point-to-point matching analysis would store at most p operations for a round-robin channel selection. The analysis reaches this peak after it has handled an MPI_Isend event from each process. At the same time, application processes can exhibit different MPI operations, such as in the communication pattern of Fig. 2(b). This example uses process triples where two processes send to the third process, which in turn receives the two send operations. A round-robin scheme would behave poorly for this example, since one process in each triple issues twice as many operations as the other processes. The matching table could grow to iterations · (p/3) entries for unmatched send operations under the round-robin approach. In practice, functional decomposition and border processes for domain decompositions can cause different MPI operation workloads, such as in the example of Fig. 2(b).
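A small simulation (our own, with hypothetical names) makes this growth visible: it replays the triple pattern, probes the process queues round-robin, and tracks the peak size of the matching table, which grows with iterations · (p/3):

```python
from collections import deque, Counter

def simulate(p, iterations):
    """Round-robin channel selection over the triple pattern:
    in each triple (a, b, c), a and b send to c, and c receives twice."""
    assert p % 3 == 0
    queues = [deque() for _ in range(p)]
    for it in range(iterations):
        for base in range(0, p, 3):
            a, b, c = base, base + 1, base + 2
            queues[a].append(("send", (a, c, it)))
            queues[b].append(("send", (b, c, it)))
            queues[c].append(("recv", (a, c, it)))
            queues[c].append(("recv", (b, c, it)))
    table, peak = Counter(), 0
    while any(queues):
        for q in queues:            # round-robin: one event per process
            if not q:
                continue
            kind, env = q.popleft()
            partner = ("recv", env) if kind == "send" else ("send", env)
            if table[partner] > 0:
                table[partner] -= 1   # pair completed
            else:
                table[(kind, env)] += 1
            peak = max(peak, sum(table.values()))
    return peak

# The peak grows roughly as iterations * (p/3) for this pattern.
print(simulate(p=6, iterations=4))
```

The receivers fall one event behind per round, so unmatched sends accumulate until the senders' queues are drained.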
The previous example illustrated the impact of the communication pattern. The capacity of a communication channel, together with the number of synchronization points in the application, also impacts the memory consumption of tool analyses. Once a channel reaches its capacity, no further events can be sent over it, causing the application process to block. This in turn indirectly blocks other processes in their synchronization operations, leading to a cascading effect. Blocked processes can continue their execution once higher hierarchy layers of the tool drain some events from the communication channels.
Selection Heuristic
Application Pause
The channel selection heuristic attempts to receive events that will not increase memory usage, but bases its selection on past behavior. GTI incorporates a second technique to avoid memory exhaustion when the heuristic fails to restrict memory usage. GTI-based tools can request an application pause such that application processes will not generate new events. A place should invoke such a request if its memory usage exceeds a threshold. Once the application is paused, tool places can process all existing events to reduce their memory usage. For applications that synchronize within some regular interval, any intermediate execution state of the application should have a limited number of open operations (e.g., unmatched communications) for which analyses need to store information. As a result, the memory consumption of analyses can decrease towards the memory demand for these open operations, which should be far below the original threshold that caused a place to request an application pause. Once the memory usage of a place that requested an application pause decreases below a second, lower threshold, it will request that the application be resumed.
GTI handles this technique with events that any place can inject. These tool-specific events travel either along the application or the root communication direction. Four events implement the technique:

requestPause:
- A tool place injects this event if an analysis exceeds its memory threshold.
- Tool places forward these events towards the root of the TBON.

broadcastPause:
- The root of the TBON injects this event when it has received one more requestPause event than requestResume events.
- The root broadcasts the event towards the application processes.
- When an application process receives this event, it waits until it receives a broadcastResume event.

requestResume:
- Tool places inject this event if they injected a requestPause event and their memory usage decreases below the lower threshold.
- Tool places forward these events towards the root of the TBON.

broadcastResume:
- The root of the TBON injects this event when it has received as many requestResume events as requestPause events.
- The root broadcasts the event towards the application processes.

This handling continuously votes for an application pause. The root of the TBON manages the voting and holds an application pause until all places that previously requested a pause agree to resume the application. The implementation in GTI uses a scalable event aggregation on all levels of the TBON to combine requestPause and requestResume events.
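The root's part of this vote can be condensed into a few lines (a toy model with hypothetical names; GTI's actual implementation aggregates these events along the tree):

```python
class PauseVote:
    """Toy model of the root's pause/resume voting: the application
    stays paused while more requestPause than requestResume events
    have arrived."""
    def __init__(self):
        self.pending = 0      # requestPause count minus requestResume count
        self.paused = False

    def request_pause(self):
        self.pending += 1
        if self.pending == 1 and not self.paused:
            self.paused = True
            return "broadcastPause"   # root notifies application processes
        return None                   # already paused, no new broadcast

    def request_resume(self):
        self.pending -= 1
        if self.pending == 0 and self.paused:
            self.paused = False
            return "broadcastResume"  # all requesters agreed to resume
        return None

root = PauseVote()
assert root.request_pause() == "broadcastPause"    # first vote pauses
assert root.request_pause() is None                # already paused
assert root.request_resume() is None               # one vote still open
assert root.request_resume() == "broadcastResume"  # all places agreed
```

The pair of thresholds described above provides hysteresis: a place only votes to resume well below the level that triggered its pause request, which prevents rapid pause/resume oscillation.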
Application Study
We implement our techniques in GTI and use the distributed deadlock detection in MUST as an expensive tool analysis that keeps a queue of active MPI operations for deadlock detection. We use the size of this queue both to apply the penalty of our heuristic and to request an application pause, where we use threshold values of 10^6 events (pause) and 2 events (resume) in all runs with our techniques. As our kernel we select sp, since it combines high communication frequency with longer runtime. We use problem size D at up to 4,096 processes and size E at up to 16,384 processes; hence the dip at 8,192 in Fig. 3(a). Figure 3 shows the application slowdown (the runtime with MUST divided by the runtime of a reference run) and the maximum queue size of MUST's analysis for increasing scales. We compare five different channel selections, where we use two static approaches (from the previous version of GTI) and three selections with our new techniques that differ in their parameter choices. The static selection intra-root-app selects channels in rounds where it first tries to receive an event from the intralayer direction, afterwards, irrespective of whether it received an event, it tries to receive from the root direction, and finally it tries to receive from the application direction. This scheme is a compromise between a performance impact due to unnecessary probes and serving all three directions. The second static selection, app||intra-root, receives events from the application direction whenever possible and only investigates the other directions if no application event is available. This scheme tries to avoid blocked application processes that saturate their communication channel capacity, favoring low tool overhead. The selections with our techniques
A second set of experiments uses the Sierra system at Lawrence Livermore National Laboratory, a Linux cluster with 1,944 nodes of two six-core Xeon 5660 processors each (24 GB of main memory per node, and a QDR InfiniBand interconnect). We run the lref data set of the SPEC MPI2007 [16] (v2.0) benchmark suite on up to 2,048 cores on this system to study less homogeneous applications. In particular, these applications are derived from real-world applications and provide a challenging test case. We select the applications 121.pop2, 128.GAPgeofem, 137.lu, and 143.dleslie for our runs since they particularly stress MUST or even caused memory exhaustion previously. Figure 4(a) and (c) present application slowdown and maximum queue length for our previous version of GTI and MUST that uses the static selection intra-root-app. The irregular communications in both 121.pop2 and 128.GAPgeofem cause MUST to exhaust memory even at 256 processes. Figure 4(b) and (d) present application slowdowns and maximum queue sizes for our techniques with both parameters set to 1. The heuristic suffices to handle 121.pop2 at 256 processes without the application pause technique, i.e., it adapts better than intra-root-app to the communication pattern of this application. The application pause technique avoids memory exhaustion for the remaining runs of 121.pop2 and 128.GAPgeofem. The numbers above/below the bars in Fig. 4(d) indicate the number of pauses that each run uses. The figure also highlights that processing all remaining non-application events during an application pause does not cause excessive increases in the maximum queue size for the MPI2007 applications. The highest queue size for these runs was about 5% above the pause threshold.
Conclusions
exhaust memory otherwise, and application studies on two different compute systems show its practicability. In particular, this technique allows MUST to handle applications for which it previously failed, e.g., 121.pop2 and 128.GAPgeofem. Thus, our approach increases the applicability of runtime correctness tools such as MUST.

We implement both techniques in the open-source tool infrastructure GTI, which targets efficient development of online tools. Increased scalability and availability of online tools for tasks such as performance analysis and debugging are an essential step towards providing an alternative to trace-based tool workflows, which are increasingly impacted by I/O limitations.
Acknowledgments. We thank the ASC Tri-Labs and the Los Alamos National Laboratory for their friendly support. Part of this work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-CONF-652119). This work has been supported by the CRESTA project, which has received funding from the European Community's Seventh Framework Programme (ICT-2011.9.13) under Grant Agreement no. 287703.
References
1. Arnold, D.C., Ahn, D.H., de Supinski, B.R., Lee, G.L., Miller, B.P., Schulz, M.: Stack trace analysis for large scale debugging. In: International Parallel and Distributed Processing Symposium (2007)
2. Bailey, D.H., Dagum, L., Barszcz, E., Simon, H.D.: NAS parallel benchmark results. Technical report, IEEE Parallel and Distributed Technology (1992)
3. Besnard, J.-B., Perache, M., Jalby, W.: Event streaming for online performance measurements reduction. In: 42nd International Conference on Parallel Processing, ICPP 2013, pp. 985–994 (2013)
4. Buntinas, D., Bosilca, G., Graham, R.L., Vallee, G., Watson, G.R.: A scalable tools communications infrastructure. In: Proceedings of the 2008 22nd International Symposium on High Performance Computing Systems and Applications, HPCS 2008, pp. 33–39. IEEE Computer Society, Washington (2008)
9. Hilbrich, T., Müller, M.S., Schulz, M., de Supinski, B.R.: Order preserving event aggregation in TBONs. In: Cotronis, Y., Danalis, A., Nikolopoulos, D.S., Dongarra, J. (eds.) EuroMPI 2011. LNCS, vol. 6960, pp. 19–28. Springer, Heidelberg (2011)
10. Hilbrich, T., Protze, J., de Supinski, B.R., Schulz, M., Müller, M.S., Nagel, W.E.: Intralayer communication for tree-based overlay networks. In: 42nd International Conference on Parallel Processing (ICPP), Fourth International Workshop on Parallel Software Tools and Tool Infrastructures, pp. 995–1003. IEEE Computer Society Press, Los Alamitos (2013)
11. Ilsche, T., Schuchart, J., Cope, J., Kimpe, D., Jones, T., Knüpfer, A., Iskra, K., Ross, R., Nagel, W.E., Poole, S.: Enabling event tracing at leadership-class scale through I/O forwarding middleware. In: Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2012, pp. 49–60. ACM, New York (2012)
12. Jun, T.H., Watson, G.R.: Scalable Communication Infrastructure (2013). http://wiki.eclipse.org/PTP/designs/SCI. Accessed 30 April 2013
13. Krell Institute: The Component Based Tool Infrastructure (2014). http://sourceforge.net/projects/cbtf/. Accessed 19 January 2014
14. Lee, G.L., Ahn, D.H., Arnold, D.C., de Supinski, B.R., Legendre, M., Miller, B.P., Schulz, M., Liblit, B.: Lessons learned at 208K: towards debugging millions of cores. In: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC 2008, pp. 26:1–26:9. IEEE Press, Piscataway (2008)
15. Message Passing Interface Forum: MPI: A Message-Passing Interface Standard, Version 3.0 (2012). http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf. Accessed 27 November 2013
16. Müller, M.S., van Waveren, M., Lieberman, R., Whitney, B., Saito, H., Kumaran, K., Baron, J., Brantley, W.C., Parrott, C., Elken, T., Feng, H., Ponder, C.: SPEC MPI2007 - an application benchmark suite for parallel systems using MPI. Concurrency Comput. Pract. Exp. 22(2), 191–205 (2010)
17. Nagel, W.E., Arnold, A., Weber, M., Hoppe, H.C., Solchenbach, K.: VAMPIR: visualization and analysis of MPI resources. Supercomputer 12(1), 69–80 (1996)
18. Nataraj, A., Malony, A.D., Morris, A., Arnold, D.C., Miller, B.P.: A framework for scalable, parallel performance monitoring. Concurrency Comput. Pract. Exp. 22(6), 720–735 (2010)
19. Noeth, M., Mueller, F., Schulz, M., de Supinski, B.R.: Scalable compression and replay of communication traces in massively parallel environments. In: IEEE International Parallel and Distributed Processing Symposium, IPDPS 2007, pp. 69–70 (2007)
20. Roth, P.C., Arnold, D.C., Miller, B.P.: MRNet: a software-based multicast/reduction network for scalable tools. In: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, SC 2003. ACM, New York (2003)
21. Wagner, M., Knüpfer, A., Nagel, W.E.: Hierarchical memory buffering techniques for an in-memory event tracing extension to the open trace format 2. In: 42nd International Conference on Parallel Processing, ICPP 2013, pp. 970–976 (2013)
22. Wylie, B.J.N., Geimer, M., Mohr, B., Böhme, D., Szebenyi, Z., Wolf, F.: Large-scale performance analysis of Sweep3D with the Scalasca toolset. Parallel Process. Lett. 20(04), 397–414 (2010)
Abstract. We introduce novel ideas involving aspect-oriented instrumentation and Multi-Faceted Program Monitoring, as well as novel techniques for a selective and detailed event-based application performance analysis, with an eye toward exascale. We give special attention to the spatial, temporal, and level-of-detail aspects of the three important phases of compile-time filtering, application execution, and runtime filtering. We use an event-based monitoring approach to allow selected and focused performance analysis.

Keywords: Multi-Faceted program monitoring · Aspect-oriented instrumentation · Selective event tracing · Vampir · Performance analysis
Selective Monitoring
J. Doleschal et al.
Example of Use
Selective Visualisation
Fig. 3. Colour-coded performance visualisation of Gromacs monitored with the traditional monitoring approach, running on a Cray XC30 with four nodes, each node hosting one MPI process with six OpenMP CPU threads and two GPU CUDA streams, for an interval of 2.3 s. On each process, MPI functions, OpenMP regions, CUDA kernels, and application functions are monitored, and the amount of data monitored per node ranges from 523 MByte to 1126 MByte.

Fig. 4. Colour-coded performance visualisation of Gromacs monitored with the Multi-Faceted Program Monitoring approach, running on a Cray XC30 with four nodes, each node hosting one MPI process with six OpenMP CPU threads and two GPU CUDA streams, for the same interval. On each node, different levels of detail are monitored, and the amount of data was reduced by 70.5% to 97.2%.
The performance visualiser Vampir [4] allows the user to load and analyse spatially and temporally selected data, i.e., the user can select and deselect specific processes and threads for the analysis and analyse only selected phases of the monitoring run. For this, the native trace data has to be enriched with so-called snapshot information, i.e., information about the state of the application at a certain point of the measurement run, to enable a consistent stack view and consistent message matching of the trace information. With a strategy presented in [10], we additionally ensure a correct matching of send or receive events even in the presence of missing MPI message events.

Another level of selective trace analysis could be the analysis of trace data dependent on the level of detail. Using information about the stack level and the duration of events, the performance analyser and visualiser could regard or neglect performance information. This strategy will of course affect the inclusive and exclusive metric information of an event, but it allows analysing and visualising different levels of detail (coarse-grained vs. fine-grained information). In addition, in combination with a trace format organized in a similar way, like the hierarchical in-memory buffers [11], or with knowledge about the distribution of events over the different stack levels, this selected level-of-detail strategy can be used to load, analyse, and visualise only a given percentage of the originally monitored trace information.
Selective monitoring and visualisation are key prerequisites for a detailed exascale performance analysis. We will therefore research the strategies and techniques presented in this paper in more detail in the near future. The instrumentation prototype created with InterAspect encourages us to develop a production-quality GCC instrumentation plug-in for Score-P. It will have the least instrumentation overhead of any compiler-vendor-provided instrumentation we know of. Results for these measurements will be provided in the future.
Acknowledgment. This research has received funding from the European Community's Seventh Framework Programme (ICT-2011.9.13) under Grant Agreement no. 287703, cresta.eu.
References
1. Knüpfer, A., Rössel, C., an Mey, D., Biersdorff, S., Diethelm, K., Eschweiler, D., Geimer, M., Gerndt, M., Lorenz, D., Malony, A., Nagel, W.E., Oleynik, Y., Philippen, P., Saviankou, P., Schmidl, D., Shende, S., Tschüter, R., Wagner, M., Wesarg, B., Wolf, F.: Score-P: a joint performance measurement run-time infrastructure for Periscope, Scalasca, TAU, and Vampir. In: Brunst, H., Müller, M.S., Nagel, W.E., Resch, M.M. (eds.) Tools for High Performance Computing 2011, pp. 79–91. Springer, Heidelberg (2012)
2. Shende, S., Malony, A.D.: The TAU parallel performance system. Int. J. High Perform. Comput. Appl. 20(2), 287–331 (2006). SAGE Publications
3. BSC: Extrae User guide manual for version 2.5.0 (2014). http://www.bsc.es/sites/default/files/public/computer science/performance tools/extrae-2.5.0-user-guide.pdf
4. Knüpfer, A., Brunst, H., Doleschal, J., Jurenz, M., Lieber, M., Mickler, H., Müller, M.S., Nagel, W.E.: The Vampir performance analysis tool set. In: Tools for High Performance Computing, pp. 139–155 (2008)
5. Shende, S.S., Malony, A.D.: The TAU parallel performance system. Int. J. High Perform. Comput. Appl. 20, 287–331 (2006)
6. Mohr, B., Malony, A.D., Shende, S., Wolf, F.: Design and prototype of a performance tool interface for OpenMP. J. Supercomput. 23, 105–128 (2002)
7. Buck, B., Hollingsworth, J.K.: An API for runtime code patching. Int. J. High Perform. Comput. Appl. 14, 317–329 (2000)
8. Seyster, J., Dixit, K., Huang, X., Grosu, R., Havelund, K., Smolka, S.A., Stoller, S.D., Zadok, E.: InterAspect: aspect-oriented instrumentation with GCC. Formal Meth. Syst. Des. 41, 295–320 (2012)
9. Hess, B., Kutzner, C., van der Spoel, D., Lindahl, E.: GROMACS 4: algorithms for highly efficient, load-balanced, and scalable molecular simulation. J. Chem. Theory Comput. 4(3), 435–447 (2008)
10. Wagner, M., Doleschal, J., Nagel, W.E., Knüpfer, A.: Runtime message uniquification for accurate communication analysis on incomplete MPI event traces. In: Proceedings of the 20th European MPI Users' Group Meeting, EuroMPI 2013, pp. 123–128 (2013)
11. Wagner, M., Knüpfer, A., Nagel, W.E.: Hierarchical memory buffering techniques for an in-memory event tracing extension to the open trace format 2. In: 2013 42nd International Conference on Parallel Processing (ICPP), pp. 970–976 (2013)
Introduction
Related Work
Sur et al. developed efficient routines for personalized all-to-all exchange on InfiniBand clusters [10]. They use InfiniBand RDMA operations combined with hypercube algorithms and achieved speedup factors of three for short messages of 32 B on 16 nodes.
Li et al. use InfiniBand's virtual lanes to improve collective MPI operations in multi-core clusters [6]. The virtual lanes are used to balance multiple simultaneously active send requests and to increase the throughput for small messages. This implementation showed a performance improvement of 10–20 %.
Li et al. analyse the influence of synchronisation messages on the communication performance. Those messages are used in collective operations to control the exchange of large messages [7]. They found that contention among synchronisation messages accounts for a large portion of the operation's overall latency. Their algorithm optimises the exchange and achieved improvements of 25 % for messages between 32 and 64 kB in length.
Tu et al. propose a model of the memory hierarchy in multi-core clusters that uses horizontal and vertical levels [11]. Their experimental results show that this model can predict the communication costs of collective operations more accurately than was previously possible. They developed a methodology to optimize collective operations and demonstrated it with the implementation of a multi-core-aware broadcast operation.
The crystal router as developed by Fox et al. [4] is an algorithm that allows sending messages of arbitrary length between arbitrary nodes in a hypercube network. It is especially advantageous in irregular applications where the exact nature of the communication is not known before it occurs or where the communication pattern changes dynamically.
Communication operations in hypercube networks are often implemented by routing algorithms that iterate over the dimensions of the cube and execute one point-to-point communication operation per step with the partner node at the other end of the respective edge. As explained, for example, in [5], the result of the bitwise XOR function applied to the processor numbers of the sender and receiver nodes provides a routing path that can be used to transport the message. Therefore, in algorithms following this pattern, messages can be delivered from each node to every other node in at most d communication steps, where d is the dimensionality of the hypercube network. In our implementation, we interpret MPI processes as nodes of a hypercube network and use MPI ranks as processor numbers.
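To make this routing concrete, the following sketch computes such a path; the function name and the bit-by-bit loop are our illustration under the XOR rule described above, not code from the paper.

```cpp
#include <cstdint>
#include <vector>

// Route a message from rank `src` to rank `dst` in a d-dimensional
// hypercube: in step i, forward along channel i whenever bit i of
// src XOR dst is set. At most d hops are needed.
std::vector<uint32_t> route_path(uint32_t src, uint32_t dst, int d) {
    std::vector<uint32_t> path{src};
    uint32_t cur = src;
    for (int i = 0; i < d; ++i) {
        if (((src ^ dst) >> i) & 1u) {  // ranks differ in bit i
            cur ^= (1u << i);           // cross the edge of channel i
            path.push_back(cur);
        }
    }
    return path;  // last element equals dst
}
```

The number of hops equals the number of differing bits, so it never exceeds d.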
It has been proven that such a choice of paths provides load balancing in the communication of several typical applications, and that it is optimal if all processors are used in a load-balanced way [5]. The crystal router was developed to handle one typical situation of processes in hypercube networks. In each process, there is a set of messages that must be sent to other processes. Destination processes expect messages, but they know neither exactly how many messages will arrive nor from which processes they will be sent. Nevertheless, for many algorithms the communication typically happens in communication phases between computations, in a time-synchronised manner. One example is the irregularity in the communication of molecular dynamics algorithms. The real amount of data that has to be communicated between neighbouring subdomains is not known before the data exchange itself. Another example of slightly irregular communication can be found in finite element calculations where the meshes must be decomposed over several processors. This decomposition will be perfect only to a certain degree. Therefore, the communication between the nodes holding the different subdomains will show some load imbalance.
Algorithm 1 explains how the transport of messages between arbitrary processes works. First, all messages are stored in a buffer for outgoing messages of the sender process (msg_out). During the iteration over the different channels (i.e. the bits of the rank numbers), some messages will be transmitted in each iteration step according to their routing path. For that, those messages that must be transferred through a certain channel are copied from msg_out to a common transfer buffer (msg_next). The buffer msg_next of each process is exchanged through the active channel of the current iteration step with the respective buffer of a partner process. Thereafter, all messages that had to be routed from this partner over this channel can be found in msg_next, where they are inspected. Messages that are addressed to the receiving process are copied into the buffer for incoming messages (msg_in), from where they can be accessed by the application code later. Messages that have to be forwarded further in one of the following iteration steps are kept and put back into msg_out.
Algorithm 1. Pseudocode of the crystal router algorithm, adapted from [4].
begin crystal_router
    declare buffer msg_out;  /* buffer for messages to send    */
    declare buffer msg_in;   /* buffer for received messages   */
    declare buffer msg_next; /* buffer for messages to send    */
                             /* in the next communication step */
    for each msg in msg_out do
        if dest_rank(msg) == myrank then
            copy msg into msg_in;
    end for
    for each dimension of the hypercube i = 0,...,d-1 do
        for each message msg in msg_out do
            if bit i of (dest_rank(msg) XOR myrank) is set then
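A single-process sketch of the crystal router loop can illustrate the buffer handling; the Message type and buffer layout here are assumptions, and a real implementation would perform the msg_next exchange with one MPI sendrecv per channel rather than in-memory copies.

```cpp
#include <cstdint>
#include <vector>

struct Message { uint32_t dest; int payload; };

// Single-process sketch of the routing loop of Algorithm 1:
// outbox[r] plays the role of msg_out on rank r, inbox[r] of msg_in,
// and next[r] of msg_next. n = 2^d ranks form a d-dim hypercube.
void crystal_router(std::vector<std::vector<Message>>& outbox,
                    std::vector<std::vector<Message>>& inbox, int d) {
    const uint32_t n = 1u << d;
    // Deliver messages already addressed to their own rank.
    for (uint32_t r = 0; r < n; ++r) {
        std::vector<Message> keep;
        for (const Message& m : outbox[r])
            (m.dest == r ? inbox[r] : keep).push_back(m);
        outbox[r].swap(keep);
    }
    for (int i = 0; i < d; ++i) {
        std::vector<std::vector<Message>> next(n);  // msg_next buffers
        // Move messages whose route crosses channel i into msg_next.
        for (uint32_t r = 0; r < n; ++r) {
            std::vector<Message> keep;
            for (const Message& m : outbox[r])
                ((((m.dest ^ r) >> i) & 1u) ? next[r] : keep).push_back(m);
            outbox[r].swap(keep);
        }
        // Exchange msg_next with the channel-i partner, then sort the
        // received messages into msg_in or back into msg_out.
        for (uint32_t r = 0; r < n; ++r) {
            uint32_t partner = r ^ (1u << i);
            for (const Message& m : next[partner])
                (m.dest == r ? inbox[r] : outbox[r]).push_back(m);
        }
    }
}
```

Because each step clears one differing bit of dest XOR rank, every message reaches its destination after at most d channel exchanges, matching the routing argument above.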
MPI library of Lindgren. The benchmarked operation is a personalized all-to-all communication that is provided as MPI_Alltoallv. The crystal router based implementation is called Cr_Alltoallv. The benchmark has been set up in such a way that each MPI process communicates with its 26 nearest neighbours. The results for runs with 256 and 512 processes are shown in Fig. 1. The results for 1024 and 2048 processes are shown in Fig. 2. Finally, Fig. 3 provides results for 4096 and 8192 processes.
The crystal router based implementation Cr_Alltoallv is much faster than MPI_Alltoallv in runs of all sizes, especially for short, latency-bound messages. For example, 85 µs are needed for a Cr_Alltoallv operation that lets each rank exchange 8 bytes with its partner processes in a run with 256 processes. The operation takes 273 µs for 8192 processes. The ratio of these times is 1 : 3.2. The same operation needs 3227 µs for 256 processes and 187 000 µs with 8192 processes with the function MPI_Alltoallv. The ratio of the times is 1 : 58. This result demonstrates that sparse communication patterns involving all processes of a parallel program can be realised efficiently by the crystal router.
The speed advantage of the crystal router becomes smaller for longer messages. The speeds of the MPI system function and of the crystal router are almost equal for the longest messages of 128 kB in the smallest test with 256 processes. For this message length, the speed difference increases with increasing processor count and reaches a factor of 19 for the largest run utilising 8192 processes.
Furthermore, the benchmarks show that neither the number of communication partners nor the size of the stride noticeably influences the duration of the operation for the MPI system function. The crystal router implementation, in contrast, is more sensitive to these parameters. Figure 4 shows measurements for a varying stride length utilizing 2048 processes and transmitting messages
Fig. 1. Benchmark of personalized all-to-all communication implemented with the crystal router based function Cr_Alltoallv and the MPI function MPI_Alltoallv. Each process sends and receives data from 26 neighbouring processes. The measurements have been executed with 256 and 512 processes, respectively.
Fig. 2. Benchmark of personalized all-to-all communication implemented with the crystal router based function Cr_Alltoallv and the MPI function MPI_Alltoallv. Each process sends and receives data from 26 neighbouring processes. The measurements have been executed with 1024 and 2048 processes, respectively.
Fig. 3. Benchmark of personalized all-to-all communication implemented with the crystal router based function Cr_Alltoallv and the MPI function MPI_Alltoallv. Each process sends and receives data from 26 neighbouring processes. The measurements have been executed with 4096 and 8192 processes, respectively.
of 8 and 512 bytes, respectively. The crystal router needs an increasing runtime for increasing strides. This reflects that the increasing stride length between the communication partners causes increasing amounts of data that must be transferred to processes located on other NUMA nodes, on other sockets, and on other nodes. For example, the time needed for the communication operation with a stride of 24 (i.e. each process communicates only with processes that reside on other nodes) is, compared to a stride of 1, 59 % longer for messages of 8 bytes and 51 % longer for messages of 512 bytes. Such a
systematic trend could not be observed with the MPI routine. Its variability is clearly smaller than 10 %.
Figure 5 presents a benchmark that has been executed with 256 processes. Here, the number of communication partners of each process has been varied. The MPI system routine again does not show significant variations in its runtime. The crystal router implementation needs longer runtimes for an increasing number of communication partners per process. This result reflects the increasing communication volume that has to be processed by the constant number of processors.
We evaluated the original crystal router algorithm in an implementation of a personalized all-to-all communication on a recent computer architecture. It shows
References
1. EU FP7 project CRESTA. http://cresta-project.eu/
2. Swedish e-Science Research Centre (SeRC). http://www.e-science.se/
3. Alverson, R., Roweth, D., Kaplan, L.: The Gemini system interconnect. In: 2010 IEEE 18th Annual Symposium on High Performance Interconnects (HOTI), pp. 83–87, 18–20 August 2010
4. Fox, G.C., et al.: Solving Problems on Concurrent Processors: General Techniques and Regular Problems. Prentice Hall, Englewood Cliffs (1988)
5. Grama, A.: Introduction to Parallel Computing. Addison-Wesley, Harlow (2003)
6. Li, B., Huo, Z., Zhang, P., Meng, D.: Multiple virtual lanes-aware MPI collective communication in multi-core clusters. In: 2009 International Conference on High Performance Computing (HiPC), pp. 304–311, 16–19 December 2009
7. Li, Q., Huo, Z., Sun, N.: Optimizing MPI alltoall communication of large messages in multicore clusters. In: 2011 12th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), pp. 257–262, 20–22 October 2011
8. Schliephake, M., Aguilar, X., Laure, E.: Design and implementation of a runtime system for parallel numerical simulations on large-scale clusters. In: Procedia Computer Science, Proceedings of the International Conference on Computational Science, ICCS 2011, vol. 4, pp. 2105–2114 (2011)
Abstract. Vistle is a scalable distributed implementation of the visualization pipeline. Modules are realized as MPI processes on a cluster. Within a node, different modules communicate via shared memory. TCP is used for communication between clusters.
Vistle especially targets interactive visualization in immersive virtual environments. For low latency, a combination of parallel remote and local rendering is possible.
Keywords: Distributed visualization · Architecture · Hybrid parallel visualization · In-situ visualization · Virtual reality
Overview
Related Work
Data parallelism is available in several distributed systems based on the visualization pipeline: VisIt [2] and ParaView [13] rely on algorithms implemented by VTK [12] for many of their modules, while EnSight [3] has dedicated implementations. They all implement a client-server architecture, which only allows for restricted distributed processing: data objects can travel from one remote cluster server to a local display client system, but they cannot be routed between
© Springer International Publishing Switzerland 2015
S. Markidis and E. Laure (Eds.): EASC 2014, LNCS 8759, pp. 141–147, 2015.
DOI: 10.1007/978-3-319-15976-8_11
M. Aumüller
Process Model
Data Management
All data objects are created in shared memory managed by Boost.Interprocess [1]. This minimizes the communication overhead and the data replication necessary for Vistle's multi-process model. As the function pointers stored in the virtual function table of C++ classes are valid only within the address space of a single process, virtual methods cannot be called for objects residing in shared memory. For the class hierarchy of shared memory data objects, there is a parallel hierarchy of proxy accessor objects, as shown in Fig. 2. Polymorphic behavior is
Fig. 1. Process layout, control flow and data flow within a single cluster: controller and modules are realized as MPI processes. Within a node, shared memory queues are used to route control messages through the controller; if necessary, they are routed via MPI through rank 0 of the controller to other ranks. Downstream modules retrieve their input data from shared memory after being passed an object handle.
Fig. 2. Parallel class hierarchies for data objects residing in shared memory and accessor objects providing polymorphic behavior for modules.
restored by creating a corresponding proxy object within each process accessing a shared memory object. The lifetime of data objects is managed with reference counting. Caching of input objects for modules is implemented by simply keeping a reference to the objects.
The most important components of data objects are scalar arrays. They provide an interface compatible with the STL's vector [6]. As an optimization for the common case of storing large arrays of scalar values, they are not initialized during allocation, as most often these default values would have to be overwritten immediately. These arrays are reference counted individually, such that shallow copies of data objects are possible and data arrays can be referenced from several data objects. This allows, e.g., reusing the coordinate array for both an unstructured volume grid and a corresponding boundary surface.
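The effect of per-array reference counting can be illustrated with std::shared_ptr; Vistle itself places the arrays in Boost.Interprocess shared memory, and the class and function names below are hypothetical stand-ins, not Vistle's API.

```cpp
#include <memory>
#include <vector>

// A reference-counted scalar array, shared between data objects
// instead of being copied.
using FloatArray = std::shared_ptr<std::vector<float>>;

// Hypothetical stand-ins for Vistle data objects.
struct UnstructuredGrid { FloatArray coords; /* + connectivity  */ };
struct BoundarySurface  { FloatArray coords; /* + triangle list */ };

// Build a boundary surface as a shallow copy: the coordinate array is
// shared with the grid, only its reference count increases.
BoundarySurface extractBoundary(const UnstructuredGrid& grid) {
    return BoundarySurface{grid.coords};
}
```

Destroying either object merely decrements the count; the array is freed only when the last referencing data object goes away.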
The central instance for managing the execution is the controller. Its main task is to handle events and manage control flow between modules. Messages for this purpose are rather small and have a fixed maximum size. MPI is used for transmitting them from the controller's rank 0 to other ranks. Within a rank, they are forwarded using shared memory message queues. The controller polls MPI and the message queues in shared memory on the main thread. TCP is used for communicating them to user interfaces and other clusters. The messages are used to launch modules, trigger their execution, announce the availability of newly created data objects, transmit parameter values, and communicate the execution state of a module.
Workflow descriptions are stored as Python scripts and are interpreted by the controller.
Modules
Modules are implemented by deriving from the module base class. During construction, a module should define its input and output ports as well as its parameters. For every tuple of objects received at its inputs, the compute() method of a module is called. By default, compute() is only invoked on the node where the data object has been received. In order to avoid synchronization overhead, MPI communication is only possible if a module explicitly opts in to parallel invocation of compute() on all ranks. If only a final reduction operation has to be performed after all blocks of a data set have been processed, modules can implement a reduce() method. Compared to parallel invocation of compute(), this has lower synchronization overhead.
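The compute()/reduce() contract can be sketched as follows; this serial stand-in only illustrates the call order (once per block, then one final reduction), and all names other than compute() and reduce() are hypothetical, not Vistle's actual base class.

```cpp
#include <vector>

// Hypothetical serial stand-in for the module pattern: compute() is
// called once per received block, reduce() once after the last block.
class Module {
public:
    virtual ~Module() = default;
    void processBlocks(const std::vector<std::vector<double>>& blocks) {
        for (const auto& b : blocks) compute(b);
        reduce();  // single final reduction, cheap to synchronize
    }
protected:
    virtual void compute(const std::vector<double>& block) = 0;
    virtual void reduce() {}  // optional; default does nothing
};

// Example module: sums each block locally, combines the partial sums
// only in the final reduce() step.
class SumModule : public Module {
public:
    double total = 0;
protected:
    std::vector<double> partial;
    void compute(const std::vector<double>& block) override {
        double s = 0;
        for (double v : block) s += v;
        partial.push_back(s);
    }
    void reduce() override {
        for (double p : partial) total += p;
    }
};
```

In the MPI setting, deferring all cross-rank communication to reduce() means ranks synchronize once per execution rather than once per block.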
User Interfaces
User interfaces attach to or detach from a Vistle session dynamically at runtime. User interfaces connect to the controller's rank 0. For attaching to sessions started as a batch job on a system with limited network connectivity, the controller will connect to a proxy at a known location, to which user interfaces can attach instead. Upon connection, the current state of the session is communicated to the user interface. From then on, the state is tracked by observing Vistle messages that induce state changes. An arbitrary number of UIs can be attached at any time, thus facilitating simple collaboration. Graphical and command line/scripting user interfaces can be mixed at will. Their state always remains synchronized.
Graphical UIs provide an explicit representation of data flow: this makes the configured visualization pipeline easy to understand.
Rendering
First Results
The performance of the system was evaluated with the visualization of the simulation of a pump turbine. The simulation was conducted by the Institute of Fluid Mechanics and Hydraulic Machinery at the University of Stuttgart with OpenFOAM on 128 processors. Accordingly, the data set was decomposed into 128 blocks. This also limits the amount of parallelism that can be reached. Figure 3 shows runtime and parallel efficiency. Isosurface extraction is interactive at rates of more than 20/s, and the runtime does not increase until full parallelism is reached.
While this suggests that the approach is suitable for in-situ visualization, the impact on the performance of a simulation will have to be assessed specifically for each case: often, the simulation will have to be suspended while its state is captured, the visualization might compete for memory with the simulation, and the visualization will claim processor time slices from the simulation as it will be scheduled on the same cores. However, these costs are only relevant when in-situ visualization is actively used, as Vistle's modular design requires only a small component for interfacing with the visualization tool to remain in memory all the time.
Not all features described here are already implemented. The most significant gap is the lack of most distributed features: only user interfaces and display modules can run remotely. Also, support for structured grids is still missing. Current work includes the handling of halo cells in order to support algorithms which require data from neighboring cells. The next milestones we aim to achieve are to couple the system to OpenFOAM, to provide the infrastructure for algorithms which require tight coupling between the MPI processes of a module and, building on that, to implement a particle tracer for decomposed data sets that are spread across the nodes of a cluster. Additionally, the scalability of the system will be improved by making better use of OpenMP and acceleration hardware.
Acknowledgments. This work has been supported in part by the CRESTA project, which has received funding from the European Community's Seventh Framework Programme (ICT-2011.9.13).
References
1. Abrahams, D., et al.: Boost C++ Libraries. http://www.boost.org. Accessed 28 Jan 2014
2. Ahern, S., Childs, H., Brugger, E., Whitlock, B., Meredith, J.: VisIt: an end-user tool for visualizing and analyzing very large data. In: Proceedings of SciDAC (2011)
3. Frank, R., Krogh, M.F.: The EnSight visualization application. In: Bethel, E.W., Childs, H., Hansen, C. (eds.) High Performance Visualization: Enabling Extreme-Scale Scientific Insight, pp. 429–442. Chapman & Hall/CRC, Salt Lake City (2012)
4. Garth, C., Joy, K.I.: Fast, memory-efficient cell location in unstructured grids for visualization. IEEE Trans. Vis. Comput. Graph. 16(6), 1541–1550 (2010)
5. Hoberock, J., Bell, N.: Thrust: A Parallel Template Library (2010). http://thrust.github.io/, version 1.7.0. Accessed 28 Jan 2014
6. Josuttis, N.M.: The C++ Standard Library. A Tutorial and Reference, 2nd edn. Addison-Wesley Professional, Boston (2012)
7. Lo, L.T., Ahrens, J., Sewell, C.: PISTON: a portable cross-platform framework for data-parallel visualization operators. In: EGPGV, pp. 11–20 (2012)
8. Moreland, K.: A survey of visualization pipelines. IEEE Trans. Vis. Comput. Graph. 19(3), 367–378 (2013)
9. Moreland, K., Kendall, W., Peterka, T., Huang, J.: An image compositing solution at scale. In: 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 1–10 (2011)
10. Niebling, F., Aumüller, M., Kimble, S., Kopf, C., Woessner, U.: Vistle Repository on GitHub. https://github.com/vistle/vistle. Accessed 28 Jan 2014
11. Rantzau, D., Lang, U.: A scalable virtual environment for large scale scientific data analysis. Future Gener. Comput. Syst. 14(3–4), 215–222 (1998)
12. Schroeder, W., Martin, K., Lorensen, B.: The Visualization Toolkit. An Object-Oriented Approach to 3D Graphics. Kitware Inc., Clifton Park (2006)
13. Squillacote, A.: The ParaView Guide. Kitware Inc., Clifton Park (2008)
14. Wagner, C., Flatken, M., Chen, F., Gerndt, A., Hansen, C.D., Hagen, H.: Interactive hybrid remote rendering for multi-pipe powerwall systems. In: Geiger, C., Herder, J., Vierjahn, T. (eds.) Virtuelle und Erweiterte Realität - 9. Workshop der GI-Fachgruppe VR/AR, pp. 155–166. Shaker Verlag, Aachen (2012)
15. Whitlock, B., Favre, J.M., Meredith, J.S.: Parallel in situ coupling of simulation with a fully featured visualization system. In: EGPGV, pp. 101–109, April 2011
16. Wierse, A., Lang, U., Rühle, R.: A system architecture for data-oriented visualization. In: Lee, J.P., Grinstein, G.G. (eds.) Visualization WS 1993. LNCS, vol. 871, pp. 148–159. Springer, Heidelberg (1994). http://www.springerlink.com/index/10.1007/BFb0021151
Author Index
Bernabeu, Miguel O. 28
Bohan, P. Kevin 85
Brodowicz, Maciej 85
Brunst, Holger 122
Cebamanos, Luis 69
Chacra, David Abou 28
Coveney, Peter V. 28
de Supinski, Bronis R. 110
Doleschal, Jens 57, 122
Fischer, Paul 57
Forget, Benoit 39
Gong, Jing 57
Groen, Derek 28
Hammouda, Adam 100
Hart, Alistair 57, 69
Henningson, Dan 57
Henty, David 57, 69
Hess, Berk 3
Hilbrich, Tobias 110
Jaros, Jiri 28
Josey, Colin 39
Markidis, Stefano 57
Müller, Matthias S. 110
Nagel, Wolfgang E. 110, 122
Nash, Rupert W. 28
Páll, Szilárd 3
Peplinski, Adam 57
Protze, Joachim 110
Richardson, Harvey 69
Schlatter, Philipp 57
Schliephake, Michael 57, 130
Schulz, Martin 110
Siegel, Andrew R. 39
Siegel, Andrew 100
Siegel, Stephen 100
Sterling, Thomas 85
Tramm, John R. 39
Wagner, Michael 110
Wesarg, Bert 122
William, Thomas 122
Zhang, Bo 85
Ziegenbalg, Johannes 122