Introduction
Modern high-performance systems are highly parallel, and most non-trivial programs use some form of non-sequential execution. If you are not writing parallel programs, you are not unlocking the full potential of your hardware. This has been true since the mid-2000s, when clock speeds plateaued and hardware vendors started putting more cores into their CPUs instead.

At the same time, modern C++ [1] has become the standard language for programming these systems. In the coming years, standard C++ itself will increasingly be used to program accelerators as well, unlike CUDA and OpenCL, which rely on a separate kernel language or language extensions that must be compiled with a device compiler [2]. C++ is so prevalent in the modern HPC ecosystem that almost all deep learning frameworks, high-performance runtime systems, and compilers are implemented in it. Standard parallelism support in modern C++ is commonly called stdpar [3], or PSTL (Parallel STL). This blog assumes basic knowledge of modern C++ and some familiarity with HPC and heterogeneous programming.
C++, STL and stdpar
The STL (Standard Template Library) is the core of the C++ standard library, filled with useful, battle-tested, reusable components such as containers, iterators, and algorithms. C++ programmers write their programs using the STL as the building blocks of their systems.
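As a rough illustration of the "building block" idea, here is a minimal sketch that composes two STL algorithms. The helper names `saxpy` and `total` are made up for this example and do not come from any library.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper: y[i] = alpha * x[i] + y[i], built from std::transform.
void saxpy(float alpha, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());  // both ranges must have the same length
    std::transform(x.begin(), x.end(), y.begin(), y.begin(),
                   [alpha](float xi, float yi) { return alpha * xi + yi; });
}

// Hypothetical helper: sum of all elements, built from std::reduce (C++17).
float total(const std::vector<float>& v) {
    return std::reduce(v.begin(), v.end(), 0.0f);
}
```

Both `std::transform` and `std::reduce` also have C++17 overloads that take an execution policy (for example `std::execution::par` from `<execution>`) as their first argument. Those overloads are what stdpar builds on. Whether they actually run in parallel, and what you need to link against, depends on your compiler and standard library.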
Host vs Device
stdpar in Host compilers
stdpar in Device compilers
SYCL and C++
nvc++ stdpar
AdaptiveCpp stdpar and device offloading
Intel oneDPL stdpar and DPC++
Data Parallel Programming with C++ std::simd
C++26 introduces support for data-parallel vector types through std::simd [4]. An earlier version is already available in the std::experimental namespace (via the <experimental/simd> header), although there are some differences between the two. The Kokkos library also has experimental support for std::simd.
#include <experimental/simd>
#include <vector>
#include <cstddef>

namespace stdx = std::experimental;

// Element-wise addition: c[i] = a[i] + b[i] for i in [0, size)
void vaddf(const float *a, const float *b, float *c, std::size_t size) {
    using simd_t = stdx::native_simd<float>;
    constexpr std::size_t width = simd_t::size(); // lanes per native vector
    std::size_t i = 0;
    // main loop: one full vector of `width` elements per iteration
    for (; i + width <= size; i += width) {
        simd_t va(a + i, stdx::element_aligned); // load
        simd_t vb(b + i, stdx::element_aligned); // load
        simd_t vr = va + vb; // vector add
        vr.copy_to(c + i, stdx::element_aligned); // store
    }
    // residual elements
    for (; i < size; i++) {
        c[i] = a[i] + b[i];
    }
}
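The same load-and-compute pattern also works for reductions. Below is a minimal sketch of a dot product, assuming a standard library that ships `<experimental/simd>` (for example a recent libstdc++). The function name `dotf` is made up for this example.

```cpp
#include <experimental/simd>
#include <cstddef>

namespace stdx = std::experimental;

// Hypothetical helper: returns the sum of a[i] * b[i] for i in [0, size).
float dotf(const float *a, const float *b, std::size_t size) {
    using simd_t = stdx::native_simd<float>;
    constexpr std::size_t width = simd_t::size();

    simd_t acc(0.0f); // one partial sum per lane
    std::size_t i = 0;
    for (; i + width <= size; i += width) {
        simd_t va(a + i, stdx::element_aligned); // load
        simd_t vb(b + i, stdx::element_aligned); // load
        acc += va * vb;                          // lane-wise multiply-accumulate
    }

    float sum = stdx::reduce(acc); // horizontal add across lanes

    // residual elements
    for (; i < size; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Keeping one partial sum per lane and collapsing it once at the end avoids a horizontal reduction in every iteration. Note that this changes the order of the floating-point additions, so for general inputs the result can differ slightly from a plain scalar loop.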