C++ native Standard Parallel Programming: PSTL and SYCL

Introduction

Modern high-performance systems are highly parallel, and most non-trivial programs use some form of non-sequential execution. If you are not writing parallel programs, you are not unlocking the full potential of your hardware. This has been true since the early 2000s, when clock speeds plateaued and hardware vendors started putting more cores into their CPUs. At the same time, modern C++¹ has become the standard language for programming these systems, and in the coming years plain standard C++ will be used to program them too, unlike CUDA and OpenCL, which rely on a separate kernel language that must be compiled with a device compiler². C++ is so prevalent in the modern HPC ecosystem that almost all deep learning frameworks, high-performance runtime systems, compilers, and similar infrastructure are implemented in it. Standard parallelism support in modern C++ is called stdpar³, or PSTL (Parallel STL). This post assumes basic knowledge of modern C++ and some knowledge of HPC and heterogeneous programming.

C++, STL and stdpar

The STL (Standard Template Library) is the core of the C++ standard library: a collection of useful, battle-tested, reusable containers, iterators, and algorithms. C++ programmers use the STL as the building blocks of their own systems.
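To make the "building block" idea concrete, here is a minimal sketch that expresses two common numerical kernels with STL algorithms instead of hand-written loops. The function names `saxpy` and `sum` are illustrative choices, not something defined by the standard library. It assumes a C++17 compiler, since `std::reduce` was added in C++17.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Illustrative helper: y[i] = a * x[i] + y[i], written with std::transform
// instead of a raw loop.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    std::transform(x.begin(), x.end(), y.begin(), y.begin(),
                   [a](float xi, float yi) { return a * xi + yi; });
}

// Illustrative helper: sum of all elements. Unlike std::accumulate,
// std::reduce is allowed to reorder the additions, which is what makes
// it suitable for parallel execution.
float sum(const std::vector<float>& v) {
    return std::reduce(v.begin(), v.end(), 0.0f);
}
```

Both algorithms also have C++17 overloads that take an execution policy such as `std::execution::par` (from `<execution>`) as their first argument. Those overloads are what stdpar builds on: the call site barely changes, but the implementation is then free to run the work in parallel.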

Host vs Device

stdpar in Host compilers

stdpar in Device compilers

SYCL and C++

nvc++ stdpar

AdaptiveCpp stdpar and device offloading

Intel oneDPL stdpar and DPC++

Data Parallel Programming with C++ `std::simd`

C++26 introduces support for data-parallel (vector) types through std::simd⁴. An earlier version is already usable today from the std::experimental namespace, via the <experimental/simd> header, although there are some differences between the two. The Kokkos library also has experimental support for std::simd.


#include <experimental/simd>
#include <cstddef>

namespace stdx = std::experimental;

// Element-wise c[i] = a[i] + b[i] for `size` floats.
void vaddf(const float *a, const float *b, float *c, std::size_t size) {
    using simd_t = stdx::native_simd<float>;
    const std::size_t width = simd_t::size(); // lanes per vector register
    std::size_t i = 0;
    // Process full vectors; `<=` so the last complete chunk is also vectorized.
    for (; i + width <= size; i += width) {
        simd_t va(a + i, stdx::element_aligned); // load
        simd_t vb(b + i, stdx::element_aligned); // load
        simd_t vr = va + vb; // vector add
        vr.copy_to(c + i, stdx::element_aligned); // store
    }
    // residual elements (fewer than `width` remain)
    for (; i < size; i++) {
        c[i] = a[i] + b[i];
    }
}
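
The same load, compute, and tail-loop pattern extends to reductions. As a sketch (the name `vdotf` is hypothetical and not part of any library), a dot product keeps one partial sum per lane and collapses the lanes at the end with `stdx::reduce`. It assumes a standard library that ships `<experimental/simd>`, such as a recent libstdc++.

```cpp
#include <cassert>
#include <cstddef>
#include <experimental/simd>
#include <vector>

namespace stdx = std::experimental;

// Hypothetical helper: dot product of two equally sized float vectors.
float vdotf(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    using simd_t = stdx::native_simd<float>;
    const std::size_t width = simd_t::size();
    const std::size_t size = a.size();

    simd_t acc(0.0f); // broadcast 0 to every lane: one partial sum per lane
    std::size_t i = 0;
    for (; i + width <= size; i += width) {
        simd_t va(a.data() + i, stdx::element_aligned); // load
        simd_t vb(b.data() + i, stdx::element_aligned); // load
        acc += va * vb; // lane-wise multiply and accumulate
    }
    float result = stdx::reduce(acc); // horizontal add across lanes
    // residual elements
    for (; i < size; ++i) {
        result += a[i] * b[i];
    }
    return result;
}
```

Because the lanes are summed separately, the additions happen in a different order than in a scalar loop, so floating-point results can differ slightly from a sequential implementation for general inputs.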

Footnotes

¹ Modern C++ here refers to C++11, C++14, C++17 and later.
² Device here means any accelerator, such as a GPU or, these days, an NPU.
³ I will use stdpar and PSTL interchangeably.
⁴ https://en.cppreference.com/w/cpp/experimental/simd.html