Introduction
Modern high-performance systems are highly parallel, and almost every non-trivial program uses some form of non-sequential execution. If you are not writing parallel programs, you are not unlocking the full potential of your hardware. This has been true since the early 2000s, when clock speeds plateaued and hardware vendors started putting more cores into their CPUs.
At the same time, modern C++ [1] has become the standard language for programming these systems, and in the coming years standard C++ itself will also be used to program them (unlike CUDA and OpenCL, which rely on a separate kernel language that must be compiled with a device compiler [2]). How prevalent C++ is in the modern HPC ecosystem is evident from the fact that almost all deep learning frameworks, high-performance runtime systems, compilers, and similar tools are implemented in modern C++. Standard parallelism support in modern C++ is called stdpar [3], or PSTL (Parallel STL).
The goal of this blog post is to introduce parallel programming in C++ and to show how modern C++ features make it easier for engineers to build performant systems. The post is mainly about two things: PSTL (also known as stdpar) and SYCL (an alternative to CUDA/HIP). Later parts will also look into other C++ solutions such as C++26 std::simd and std::execution (also known as senders/receivers), Kokkos, HPX, oneTBB, and Taskflow.
C++, STL and stdpar
The STL, or Standard Template Library, is the core of the C++ standard library, filled with useful, battle-tested, reusable algorithms and data structures. C++ users write their programs with the STL as the building blocks of their systems, which allows for code reuse and frictionless development. stdpar, now known as PSTL, is the parallel implementation of some of the STL algorithms; it lets users compose their programs from parallel building blocks if they choose to (the default is sequential). PSTL support in the two most widely used C++ compilers, clang and gcc, is murky: gcc supports PSTL via Intel oneTBB, while clang has only experimental support for PSTL [4].
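To make the "parallel building blocks" idea concrete, here is a minimal sketch of two standard algorithms called with the std::execution::par policy. The helper names add_vectors and sum_elements are made up for illustration; only std::transform, std::reduce and the execution policy come from the standard library. Whether the calls actually run in parallel depends on the toolchain: with gcc, for example, the parallel policies typically need oneTBB installed and linked (commonly with -ltbb), so treat the build details as an assumption to verify on your own setup.

```cpp
#include <algorithm>
#include <cassert>
#include <execution>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical helper: element-wise sum of two equally sized vectors.
std::vector<float> add_vectors(const std::vector<float>& a,
                               const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> out(a.size());
    // Same call as the sequential std::transform, plus an execution policy
    // that permits the implementation to parallelise the work.
    std::transform(std::execution::par, a.begin(), a.end(), b.begin(),
                   out.begin(), std::plus<>{});
    return out;
}

// Hypothetical helper: sum of all elements.
float sum_elements(const std::vector<float>& v) {
    // std::reduce may combine elements in any order, which is what allows
    // it to be parallelised (unlike std::accumulate).
    return std::reduce(std::execution::par, v.begin(), v.end(), 0.0f);
}
```

Dropping the policy argument (or passing std::execution::seq) gives back the sequential version, which is why moving existing STL code to PSTL is usually a small change.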
Host vs Device
stdpar in Host compilers
stdpar in Device compilers
SYCL and C++
nvc++ stdpar
AdaptiveCpp stdpar and device offloading
Intel oneDPL stdpar and DPC++
Data Parallel Programming with C++ std::simd
C++26 introduces support for vector types through std::simd [5]. An earlier version of it can already be used under the std::experimental namespace, although there are some differences between the two. The Kokkos library also has experimental support for std::simd.
#include <experimental/simd>
#include <cstddef>
namespace stdx = std::experimental;
// Element-wise c[i] = a[i] + b[i] for i in [0, size)
void vaddf(const float *a, const float *b, float *c, std::size_t size) {
    using simd_t = stdx::native_simd<float>;
    const std::size_t width = simd_t::size(); // lanes per vector
    std::size_t i = 0;
    // main loop: one full vector per iteration
    for (; i + width <= size; i += width) {
        simd_t va(a + i, stdx::element_aligned); // load
        simd_t vb(b + i, stdx::element_aligned); // load
        simd_t vr = va + vb; // vector add
        vr.copy_to(c + i, stdx::element_aligned); // store
    }
    // residual elements
    for (; i < size; i++) {
        c[i] = a[i] + b[i];
    }
}
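The same load/compute pattern extends to reductions. As a rough sketch, the function below computes a dot product by accumulating per-lane partial sums and collapsing them at the end with stdx::reduce. The name dotf is made up for this example, and the code assumes a standard library that ships <experimental/simd> (recent libstdc++ does). Note that the vectorised result can differ slightly from a plain scalar loop for general inputs, because the additions happen in a different order.

```cpp
#include <experimental/simd>
#include <cassert>
#include <cstddef>

namespace stdx = std::experimental;

// Hypothetical helper: dot product of two float arrays of length size.
float dotf(const float* a, const float* b, std::size_t size) {
    assert(size == 0 || (a != nullptr && b != nullptr));
    using simd_t = stdx::native_simd<float>;
    const std::size_t width = simd_t::size();

    simd_t acc(0.0f);  // broadcast: every lane starts at zero
    std::size_t i = 0;
    for (; i + width <= size; i += width) {
        simd_t va(a + i, stdx::element_aligned);  // load
        simd_t vb(b + i, stdx::element_aligned);  // load
        acc += va * vb;                           // per-lane multiply-add
    }

    // Horizontal sum of the lanes, then the residual elements.
    float result = stdx::reduce(acc);
    for (; i < size; ++i) {
        result += a[i] * b[i];
    }
    return result;
}
```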