The concept of parallel programming is to develop code that runs efficiently across multiple central processing units (CPUs) and completes the task at hand much faster than could be achieved with a single CPU. This is in many ways independent of the programming language you choose to develop your project. Before delving into the details of parallel programming I thought I would give a brief overview of the available programming languages.
Programming languages are typically referred to as belonging to a particular generation, with each generation more user-friendly than the last. However, this typically comes at the cost of reduced performance. Third generation languages (3GL), including both procedural languages (C, FORTRAN) and object-oriented languages (C++), are at present the workhorses used for generating data in massively parallel scientific computing. Fourth generation languages (4GL) abstract away more of the details, with higher-level mathematical algorithms built in (e.g. linear algebra, optimisation, visualisation). This reduces the human time required to develop the code, but the programs typically run slower than 3GL codes. Example 4GL languages include Python, Matlab and R, which are often used to write codes to post-process data generated by 3GL codes. As an aside, to combat the reduced performance of 4GL programs one can replace the slow parts with code written in a 3GL (e.g. replace slow Python code with FORTRAN compiled using f2py, or with C compiled using cython).
Regardless of the choice of language, the first step in developing a parallel program is to
think about how to efficiently split up the problem at hand into
separate tasks. The slow part of a parallel code is the communication
between the processors, so it's best to think about how to
split up the problem in order to minimise the amount of communication
required. The same code is seen by each CPU, with each
CPU knowing only the total number of processors and its local
processor number. You then build the logic of the code so that it
behaves differently depending on the value of the local CPU number.
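This single program, multiple data pattern can be sketched in a few lines of C using the MPI library discussed further below; the file name, the printed messages and the choice of giving the input reading to CPU 0 are purely illustrative.

    /* spmd_hello.c: compile with mpicc, run with e.g. mpirun -np 4 ./spmd_hello */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);               /* start the MPI environment  */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* local processor number     */
        MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of processors */

        if (rank == 0) {
            /* only the first CPU takes this branch, e.g. to read the input */
            printf("I am CPU %d of %d and I will read the input\n", rank, size);
        } else {
            /* the remaining CPUs take the other branch */
            printf("I am CPU %d of %d and I will wait for my share of the work\n",
                   rank, size);
        }

        MPI_Finalize();                       /* shut the MPI environment down */
        return 0;
    }

Every CPU runs exactly the same executable; the only thing that distinguishes them is the value of rank returned by MPI_Comm_rank.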
For example, take the code used to calculate the principal components in a
previous post. The
first stage in calculating the principal components is to multiply
two flow fields together at each point in space, then sum together the result at all the points. This process needs to be
repeated for every combination of flow fields. There are two ways to
split up the problem. Say for example we have 100 flow fields
consisting of 1,000,000 points in space and we want to process them
over 4 CPUs. The first way to split up the problem would be to send
the first 100/4 = 25 flow fields to the first CPU, then the next 25 to the next CPU and so on, as illustrated in the figure below. However, to do the multiplication of the
1st field with the 26th field we would need to send all 1,000,000
points of the 1st field from CPU 1 to CPU 2, and so on for all of the
other combinations.
The
second way would be to send the first 1,000,000/4=250,000 points of all
of the fields to the first CPU, then the next 250,000 points to the
next CPU and so on, again as illustrated in the figure below. This way all of the multiplication combinations
can be handled without any communication. Each CPU does all of
the multiplications and sums the result within the domain it has access to. The local value of this sum
is then added together across all of the CPUs. This means for each field pair, only one number needs to be shared, which is a significant reduction in
the amount of communication required as compared to the previous approach.
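As a concrete sketch of this second decomposition, the C fragment below assumes each CPU already holds its own 250,000-point chunk of two of the fields, multiplies them point by point, sums within its local domain, and then combines the local sums across the CPUs with a single call. The MPI library and the MPI_Reduce routine used here are introduced in the next paragraph; the array names, placeholder data and chunk size are illustrative only.

    /* inner_product.c: second decomposition, each CPU holds a chunk of every field */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NLOCAL 250000   /* points per CPU: 1,000,000 points over 4 CPUs (illustrative) */

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* local chunks of two flow fields; in a real code these would be read from file */
        double *field_a = malloc(NLOCAL * sizeof(double));
        double *field_b = malloc(NLOCAL * sizeof(double));
        for (int i = 0; i < NLOCAL; i++) {
            field_a[i] = 1.0;   /* placeholder data */
            field_b[i] = 2.0;
        }

        /* multiply the two fields point by point and sum within the local domain */
        double local_sum = 0.0;
        for (int i = 0; i < NLOCAL; i++)
            local_sum += field_a[i] * field_b[i];

        /* add the local sums together across all CPUs; only one number per CPU is sent */
        double global_sum = 0.0;
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("inner product of the two fields = %f\n", global_sum);

        free(field_a);
        free(field_b);
        MPI_Finalize();
        return 0;
    }

Only one number per CPU crosses the network in the final step, in contrast to the 1,000,000 points per field that the first decomposition would have to move.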
In terms of the technical implementation, the main way of handling this communication is to use MPI (Message Passing Interface), a programming-language-independent communication protocol. To program with MPI you just need to include the appropriate library, whether you are coding in C, C++, FORTRAN, Python, or any of the other supported languages. In the first parallelisation approach discussed above, the snapshots are sent to and from each CPU using the MPI_Send and MPI_Recv commands. In the second approach, the individual sums are added together using the MPI_Reduce command. In fact, the "reduce" part of the "map/reduce" algorithm, used in database software such as Hadoop, performs the same operation as MPI_Reduce.
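To make the contrast between the two approaches concrete, here is a hedged sketch of the point-to-point communication the first decomposition requires: the CPU with rank 0 ships its copy of a 1,000,000-point field to the CPU with rank 1 (MPI numbers processors from zero) so that the cross multiplications can be performed there. The buffer size, message tag and placeholder data are illustrative.

    /* send_field.c: the first decomposition requires whole fields to be exchanged.
       Run with at least two MPI processes, e.g. mpirun -np 2 ./send_field        */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NPOINTS 1000000   /* points per flow field (illustrative) */

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *field = malloc(NPOINTS * sizeof(double));

        if (rank == 0) {
            for (int i = 0; i < NPOINTS; i++) field[i] = 1.0;   /* placeholder data */
            /* send all 1,000,000 points of this field from the 1st CPU to the 2nd */
            MPI_Send(field, NPOINTS, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* receive the field so the cross multiplications can be performed here */
            MPI_Recv(field, NPOINTS, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("CPU 1 received the field, first point = %f\n", field[0]);
        }

        free(field);
        MPI_Finalize();
        return 0;
    }

Repeating this exchange for every field combination is exactly the communication overhead that the second decomposition avoids.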
Modern High Performance Computing (HPC) clusters, however, are typically more complex and comprise many CPUs, each with multiple cores. Whilst MPI can also be used to handle the communication between the cores of a given CPU, a more efficient and faster way of handling this inter-core communication is to use OpenMP (Open Multi-Processing), in which the cores operate on memory that they share rather than passing messages. OpenMP is newer than MPI, and is currently supported in only C, C++ and FORTRAN.
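A minimal sketch of OpenMP in C is given below: a single pragma asks the cores of the node to divide the iterations of a loop among themselves, with the reduction clause combining each thread's partial sum at the end. The loop itself and the array names are illustrative.

    /* omp_sum.c: compile with e.g. gcc -fopenmp omp_sum.c */
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double field_a[N], field_b[N];
        double sum = 0.0;

        for (int i = 0; i < N; i++) { field_a[i] = 1.0; field_b[i] = 2.0; }  /* placeholder data */

        /* the iterations are divided among the cores of the node, which all share
           this memory; reduction(+:sum) adds the per-thread partial sums together */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += field_a[i] * field_b[i];

        printf("sum = %f computed with up to %d threads\n", sum, omp_get_max_threads());
        return 0;
    }

Note that no explicit send or receive calls are needed here, because all of the cores on the node can see the same arrays.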
Advanced scientific computing codes are programmed using hybrid MPI and OpenMP parallelisations. An example of a hybrid code is the one I have been using to undertake numerical simulations of wall-bounded turbulence, developed by my colleagues from the Universidad de Madrid [1,2]. This particular code solves the fluid flow (Navier-Stokes) equations for the flow over a flat surface in a three-dimensional rectangular volume. The physical domain is split into a series of blocks oriented in the flow direction, one for each MPI process. Within each block, OpenMP parallelisation is used to split the work up in the wall-normal direction. This allows each of the cores to solve the equations in different regions of the flow simultaneously, and to communicate with each other when necessary. This code has been benchmarked on various clusters and has been shown to speed up in proportion to the number of cores used to parallelise it [3].
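The nesting of the two levels of parallelism can be sketched as follows. This is not the actual turbulence code of [1,2], just an illustration of the structure: each MPI process owns a block of the domain in the flow direction, and OpenMP threads split the loop over the wall-normal points within that block. All array names and sizes are made up.

    /* hybrid_sketch.c: compile with e.g. mpicc -fopenmp hybrid_sketch.c */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    #define NX_LOCAL 64    /* streamwise points in this process's block (illustrative) */
    #define NY       128   /* wall-normal points (illustrative)                        */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        static double u[NX_LOCAL][NY];   /* one block of one velocity component */

        /* MPI level: this process only ever touches its own streamwise block.
           OpenMP level: the wall-normal loop is shared among the cores of the node. */
        for (int i = 0; i < NX_LOCAL; i++) {
            #pragma omp parallel for
            for (int j = 0; j < NY; j++) {
                u[i][j] = 0.0;   /* stand-in for the real update of the flow field */
            }
        }

        if (rank == 0)
            printf("%d MPI processes, each with up to %d OpenMP threads per loop\n",
                   size, omp_get_max_threads());

        MPI_Finalize();
        return 0;
    }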
The
video below is a fly-through of an example flow field generated by
this code which I presented at a recent conference [4]. You are flying in the direction of the flow. The
structures represent the individual eddies (or vortices) in the
simulation. It is actually a composite of two different simulations.
The green structures on the right are from the simulation of the flow
over a flat surface. The red structures on the left are from a
simulation in which I implemented a change to the code such that the
flow thinks that it is flowing over a curved surface, like that of an
aircraft wing, or wind turbine blade. You will notice at the end of the
domain the red vortex structures are located further away from the
surface. These simulations contain over 30 billion degrees of freedom, and were parallelised over 32,000 cores. This research was funded by the ARC, and the computational resources were provided by iVEC, NCI and PRACE.
One of the recent trends in HPC has been to develop clusters using graphics processing units (GPUs) instead of (or in addition to) CPUs. Whilst a CPU has several cores, a GPU has several hundred cores. GPUs are incredibly fast for linear algebra operations, as they are built to operate on all elements of a vector simultaneously; for example, multiplying the RGB (red, green, blue) values of the pixels in one image by those of another image. Programming for GPUs also requires a different framework, such as OpenCL (Open Computing Language). GPUs are not only used for graphics processing, but also for scientific computing more broadly. They are ideally suited to processes that do not require a great deal of communication between cores. They also have less on-board memory than a CPU. The types of simulations discussed in this post require a significant amount of communication between cores, so at this stage GPUs are not the ideal tool for these types of problems.
References:
[1] Simens, M. P., Jimenez, J., Hoyas, S. & Mizuno, Y., 2009, A high-resolution code for turbulent boundary layers, J. Comp. Phys., Vol. 228, pp. 4218-4231.
[2] Borrell, G., Sillero, J. & Jimenez, J., 2013, A code for direct numerical simulation of turbulent boundary layers at high Reynolds numbers in BG/P supercomputers, Comp. Fluids, Vol. 80, pp. 37-43.
[4] Kitsios, V., Atkinson, C., Sillero, J.A., Borrell, G., Gungor, A.G., Jiménez, J. & Soria, J., Boundary condition development for an equilibrium adverse pressure gradient turbulent boundary layer at the verge of separation, IUTAM Symposium on advances in computation modeling and control of transitional and turbulent flows, Goa, India, 15-18 December, 2014.