├── README.rst
├── se_and_openmp.rst
└── se_plugin_interface.rst

/README.rst:
--------------------------------------------------------------------------------

.. Using backticks indicates inline code.
.. default-role:: code

At Google we're doing a lot of work on parallel programming models for CPUs, GPUs and other platforms. One area where we're investing heavily is parallel libraries, especially those closely tied to compiler technology like runtime and math libraries. We would like to develop these in the open, and the natural place seems to be as a subproject in LLVM if others in the community are interested.

Initially, we'd like to open source our StreamExecutor runtime library, which is used for simplifying the management of data-parallel workflows on accelerator devices and can also be extended to support other hardware platforms. We'd like to teach Clang to use StreamExecutor when targeting CUDA and work on other integrations, but that makes much more sense if it is part of the LLVM project.

However, we think the LLVM subproject should be organized as a set of several libraries, with StreamExecutor as just the first instance. As just one example of how creating a unified parallelism subproject could help with code sharing, the StreamExecutor library contains some nice wrappers around the CUDA driver API and OpenCL API that create a unified API for managing all kinds of GPU devices. This unified GPU wrapper would be broadly applicable for libraries that need to communicate with GPU devices.

Of course, there is already an LLVM subproject for a parallel runtime library: OpenMP! So there is a question of how it would fit into this picture. Eventually, it might make sense to pull in the OpenMP project as a library in this proposed new subproject. In particular, there is a good chance that OpenMP and StreamExecutor could share code for offloading to GPUs and managing workloads on those devices. This is discussed at the end of the StreamExecutor documentation below. However, if it turns out that the needs of OpenMP are too specialized to fit well in a generic parallelism project, then it may make sense to leave OpenMP as a separate LLVM subproject so it can focus on serving the particular needs of OpenMP.

Documentation for the StreamExecutor library that is being proposed for open-sourcing is included below to give a sense of what it is and to provide context for how it might fit into a general parallelism LLVM subproject.

What do folks think? Is there general interest in something like this? If so, we can start working on getting a project in place and sketching out a skeleton for how it would be organized, as well as contributing StreamExecutor to it. We're happy to iterate on the particulars to figure out what works for the community.


=============================================
StreamExecutor Runtime Library Documentation
=============================================


What is StreamExecutor?
========================

**StreamExecutor** is a unified wrapper around the **CUDA** and **OpenCL** host-side programming models (runtimes). It lets host code target either CUDA or OpenCL devices with identically-functioning data-parallel kernels. StreamExecutor manages the execution of concurrent work targeting the accelerator similarly to how an Executor_ from the Google APIs client library manages the execution of concurrent work on the host.

.. _Executor: http://google.github.io/google-api-cpp-client/latest/doxygen/classgoogleapis_1_1thread_1_1Executor.html

StreamExecutor is currently used as the runtime for the vast majority of Google's internal GPGPU applications, and a snapshot of it is included in the open-source TensorFlow_ project, where it serves as the GPGPU runtime.

.. _TensorFlow: https://www.tensorflow.org

It is currently proposed that StreamExecutor itself be independently open-sourced. As part of that proposal, this document describes the basics of its design and explains why it would fit in well as an LLVM subproject.


-------------------
Key points
-------------------

StreamExecutor:

* abstracts the underlying accelerator platform (avoids locking you into a single vendor, and lets you write code without thinking about which platform you'll be running on).
* provides an open-source alternative to the CUDA runtime library.
* gives users a stream management model whose terminology matches that of the CUDA programming model.
* makes use of modern C++ to create a safe, efficient, easy-to-use programming interface.

StreamExecutor makes it easy to:

* move data between host and accelerator (and also between peer accelerators).
* execute data-parallel kernels written in the OpenCL or CUDA kernel languages.
* inspect the capabilities of a GPU-like device at runtime.
* manage multiple devices.


--------------------------------
Example code snippet
--------------------------------

The StreamExecutor API uses abstractions that will be familiar to those who have worked with other GPU APIs: **Streams**, **Timers**, and **Kernels**. Its API is *fluent*, meaning that it allows the user to chain together a sequence of related operations on a stream, as in the following code snippet:

.. code-block:: c++

  se::Stream stream(executor);
  se::Timer timer(executor);
  stream.InitWithTimer(&timer)
      .ThenStartTimer(&timer)
      .ThenLaunch(se::ThreadDim(dim_block_x, dim_block_y),
                  se::BlockDim(dim_grid_x, dim_grid_y),
                  my_kernel,
                  arg0, arg1, arg2)
      .ThenStopTimer(&timer)
      .BlockHostUntilDone();

The name of the kernel being launched in the snippet above is `my_kernel` and the arguments being passed to the kernel are `arg0`, `arg1`, and `arg2`. Kernels with any number of arguments of any types are supported, and the number and types of the arguments are checked at compile time, as the sketch below illustrates.
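
As a minimal sketch of that compile-time checking (the `AddKernel` alias is hypothetical; `TypedKernel`, `ThenLaunch`, and the `(float, float *)` kernel signature come from the detailed example later in this document):

.. code-block:: c++

  // A kernel handle declared, as in the detailed example below, to take a
  // float and a pointer to a float in device memory:
  using AddKernel = se::TypedKernel<float, float *>;

  // Given "AddKernel kernel(executor);" and a device allocation "result" as
  // in the example below, this launch type-checks and compiles:
  //
  //   stream.ThenLaunch(se::ThreadDim(), se::BlockDim(), kernel,
  //                     42.5f, result.ptr());
  //
  // whereas passing the wrong number of arguments, or e.g. an int in place
  // of the float, is rejected when the host code is compiled instead of
  // producing an error at kernel launch time.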


How does it work?
=======================


--------------------------------
Detailed example
--------------------------------

The following example shows how we can use StreamExecutor to create a `TypedKernel` instance, associate device code with that instance, and then use that instance to schedule work on an accelerator device.

.. code-block:: c++

  #include <cassert>

  #include "stream_executor.h"

  namespace se = streamexecutor;

  // A PTX string defining a CUDA kernel.
  //
  // This PTX string represents a kernel that takes two arguments: an input
  // value and an output pointer. The input value is a floating point number.
  // The output value is a pointer to a floating point value in device memory.
  // The output pointer is where the output from the kernel will be written.
  //
  // The kernel adds a fixed floating point value to the input and writes the
  // result to the output location.
  static constexpr const char *KERNEL_PTX = R"(
    .version 3.1
    .target sm_20
    .address_size 64
    .visible .entry add_mystery_value(
        .param .f32 float_literal,
        .param .u64 result_loc
    ) {
      .reg .u64 %rl<2>;
      .reg .f32 %f<2>;
      ld.param.f32 %f1, [float_literal];
      ld.param.u64 %rl1, [result_loc];
      add.f32 %f1, %f1, 123.0;
      st.f32 [%rl1], %f1;
      ret;
    }
  )";

  // The number of arguments expected by the kernel described in KERNEL_PTX.
  static constexpr int KERNEL_ARITY = 2;

  // The name of the kernel described in KERNEL_PTX.
  static constexpr const char *KERNEL_NAME = "add_mystery_value";

  // The value added to the input in the kernel described in KERNEL_PTX.
  static constexpr float MYSTERY_VALUE = 123.0f;

  int main(int argc, char *argv[]) {
    // Get a CUDA Platform object. (Other platforms such as OpenCL are also
    // supported.)
    se::Platform *platform =
        se::MultiPlatformManager::PlatformWithName("cuda").ValueOrDie();

    // Get a StreamExecutor for the chosen Platform. Multiple devices are
    // supported; we indicate here that we want to run on device 0.
    const int device_ordinal = 0;
    se::StreamExecutor *executor =
        platform->ExecutorForDevice(device_ordinal).ValueOrDie();

    // Create a MultiKernelLoaderSpec, which knows where to find the code for
    // our kernel. In this case, the code is stored in memory as a PTX string.
    //
    // Note that the arity and name specified here must match the arity and
    // name of the kernel defined in the PTX string.
    se::MultiKernelLoaderSpec kernel_loader_spec(KERNEL_ARITY);
    kernel_loader_spec.AddCudaPtxInMemory(KERNEL_PTX, KERNEL_NAME);

    // Next create a kernel handle, which we will associate with our kernel
    // code (i.e., the PTX string). The type of this handle is a bit verbose,
    // so we create an alias for it.
    //
    // This specific type represents a kernel that takes two arguments: a
    // floating point value and a pointer to a floating point value in device
    // memory.
    //
    // A type like this is nice to have because it enables static type
    // checking of kernel arguments when we enqueue work on a stream.
    using KernelType = se::TypedKernel<float, float *>;

    // Now instantiate an object of the specific kernel type we declared
    // above. The kernel object is not yet connected with the device code that
    // we want it to run (that happens with the call to GetKernel below), so
    // it cannot be used to execute work on the device yet.
    //
    // However, the kernel object is not completely empty when it is created.
    // From the StreamExecutor passed into its constructor it knows which
    // platform it is targeted for, and it also knows which device it will run
    // on.
    KernelType kernel(executor);

    // Use the MultiKernelLoaderSpec defined above to load the kernel code
    // onto the device pointed to by the kernel object and to make that kernel
    // object a handle to the kernel code loaded on that device.
    //
    // The MultiKernelLoaderSpec may contain code for several different
    // platforms, but the kernel object has an associated platform, so there
    // is no confusion about which code should be loaded.
    //
    // After this call the kernel object can be used to launch its kernel on
    // its device.
    executor->GetKernel(kernel_loader_spec, &kernel);

    // Allocate memory in the device memory space to hold the result of the
    // kernel call. This memory will be freed when this object goes out of
    // scope.
    se::ScopedDeviceMemory<float> result =
        executor->AllocateOwnedScalar<float>();

    // Create a stream on which to schedule device operations.
    se::Stream stream(executor);

    // Schedule a kernel launch on the new stream and block until the kernel
    // completes. The kernel call executes asynchronously on the device, so we
    // could do more work on the host before calling BlockHostUntilDone.
    const float kernel_input_argument = 42.5f;
    stream.Init()
        .ThenLaunch(se::ThreadDim(), se::BlockDim(), kernel,
                    kernel_input_argument, result.ptr())
        .BlockHostUntilDone();

    // Copy the result of the kernel call from device back to the host.
    float host_result = 0.0f;
    executor->SynchronousMemcpyD2H(result.cref(), sizeof(host_result),
                                   &host_result);

    // Verify that the correct result was computed.
    assert((kernel_input_argument + MYSTERY_VALUE) == host_result);
  }


--------------------------------
Kernel Loader Specs
--------------------------------

An instance of the class `MultiKernelLoaderSpec` is used to encapsulate knowledge of where the device code for a kernel is stored and what format it is in. Given a `MultiKernelLoaderSpec` and an uninitialized `TypedKernel`, calling the `StreamExecutor::GetKernel` method will load the code onto the device and associate the `TypedKernel` instance with that loaded code. So, in order to initialize a `TypedKernel` instance, it is first necessary to create a `MultiKernelLoaderSpec`.

A `MultiKernelLoaderSpec` supports a different method for adding device code for each combination of platform, format, and storage location. The following table shows some examples:

=========== ======= =========== =========================
Platform    Format  Location    Setter
=========== ======= =========== =========================
CUDA        PTX     disk        `AddCudaPtxOnDisk`
CUDA        PTX     memory      `AddCudaPtxInMemory`
CUDA        cubin   disk        `AddCudaCubinOnDisk`
CUDA        cubin   memory      `AddCudaCubinInMemory`
OpenCL      text    disk        `AddOpenCLTextOnDisk`
OpenCL      text    memory      `AddOpenCLTextInMemory`
OpenCL      binary  disk        `AddOpenCLBinaryOnDisk`
OpenCL      binary  memory      `AddOpenCLBinaryInMemory`
=========== ======= =========== =========================

The specific method used in the example is `AddCudaPtxInMemory`, but all the other methods are used similarly, as the sketch below shows for the on-disk case.
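
For instance, if the build system had written the PTX for the detailed example's kernel to a file, the spec could be built from disk instead. This is a minimal sketch: the file name is hypothetical, and the setter is assumed to mirror the (code, kernel name) argument order of `AddCudaPtxInMemory`:

.. code-block:: c++

  // Minimal sketch: register PTX stored in a file rather than in memory.
  // The path "add_mystery_value.ptx" is hypothetical.
  se::MultiKernelLoaderSpec kernel_loader_spec(/*arity=*/2);
  kernel_loader_spec.AddCudaPtxOnDisk("add_mystery_value.ptx",
                                      "add_mystery_value");

  // From here on, the spec is used exactly as in the in-memory example:
  //   executor->GetKernel(kernel_loader_spec, &kernel);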


------------------------------------
Compiler Support for StreamExecutor
------------------------------------


General strategies
-------------------

For illustrative purposes, the PTX code in the example is written by hand and appears as a string literal in the source code file, but it is far more typical for the kernel code to be expressed in a high-level language like CUDA C++ or OpenCL C and for the device machine code to be generated by a compiler.

There are several ways we can load compiled device code using StreamExecutor.

One possibility is that the build system could write the compiled device code to a file on disk. This can then be added to a `MultiKernelLoaderSpec` by using one of the `OnDisk` setters.

Another option is to add a feature to the compiler which embeds the compiled device code into the host executable and provides some symbol (probably with a name based on the name of the kernel) that allows the host code to refer to the embedded code data.

In fact, as discussed below, in the current use of StreamExecutor inside Google, the compiler goes even further and generates an instance of `MultiKernelLoaderSpec` for each kernel. This means the application author doesn't have to know anything about how or where the compiler decided to store the compiled device code, but instead gets a pre-made loader object that handles all those details.


Compiler-generated code makes things safe
--------------------------------------------

Two of the steps in the example above are dangerous because they lack static safety checks: instantiating the `MultiKernelLoaderSpec` and specializing the `TypedKernel` class template. This section discusses how compiler support for StreamExecutor can make these steps safe.

Instantiating a `MultiKernelLoaderSpec` requires specifying three things:

1. the kernel *arity* (number of parameters),
2. the kernel name,
3. a string containing the device machine code for the kernel (either as assembly, or some sort of object file).

The problem with this is that the kernel name and the number of parameters are already fully determined by the kernel's machine code. In the best case, the *arity* and name arguments passed to the `MultiKernelLoaderSpec` methods match the information in the machine code and are simply redundant, but in the worst case these arguments contradict the information in the machine code and we get a runtime error when we try to load the kernel.

The second unsafe operation is specifying the kernel parameter types as type arguments to the `TypedKernel` class template. The specified types must match the types defined in the kernel machine code, but again there is no compile-time checking that these types match. Failure to match these types will result in a runtime error when the kernel is launched.

We would like the compiler to perform these checks for the application author, so as to eliminate this source of runtime errors. In particular, we want the compiler to create an appropriate `MultiKernelLoaderSpec` instance and `TypedKernel` specialization for each kernel definition.

One of the main goals of open-sourcing StreamExecutor is to let us add this code generation capability to Clang, for use when the user has chosen StreamExecutor as their runtime for accelerator operations.

Google has been using an internally developed CUDA compiler based on Clang called **gpucc** that generates code for StreamExecutor in this way. The code below shows how the example above would be written using gpucc to generate the unsafe parts of the code.

The kernel is defined in a high-level language (CUDA C++ in this example) in its own file:

.. code-block:: c++

  // File: add_mystery_value.cu

  __global__ void add_mystery_value(float input, float *output) {
    *output = input + 42.0f;
  }

The host code is defined in another file:

.. code-block:: c++

  // File: example_host_code.cc

  #include <cassert>

  #include "stream_executor.h"

  // This header is generated by the gpucc compiler and it contains the
  // definitions of gpucc::kernel::AddMysteryValue and
  // gpucc::spec::add_mystery_value().
  //
  // The name of this header file is derived from the name of the file
  // containing the kernel code. The trailing ".cu" is replaced with ".gpu.h".
  #include "add_mystery_value.gpu.h"

  namespace se = streamexecutor;

  int main(int argc, char *argv[]) {
    se::Platform *platform =
        se::MultiPlatformManager::PlatformWithName("cuda").ValueOrDie();

    const int device_ordinal = 0;
    se::StreamExecutor *executor =
        platform->ExecutorForDevice(device_ordinal).ValueOrDie();

    // AddMysteryValue is an instance of TypedKernel generated by gpucc. The
    // template arguments are chosen by the compiler to match the parameters
    // of the add_mystery_value kernel.
    gpucc::kernel::AddMysteryValue kernel(executor);

    // gpucc::spec::add_mystery_value() is generated by gpucc. It returns a
    // MultiKernelLoaderSpec that knows how to find the compiled code for the
    // add_mystery_value kernel.
    executor->GetKernel(gpucc::spec::add_mystery_value(), &kernel);

    se::ScopedDeviceMemory<float> result =
        executor->AllocateOwnedScalar<float>();
    se::Stream stream(executor);

    const float kernel_input_argument = 42.5f;

    stream.Init()
        .ThenLaunch(se::ThreadDim(), se::BlockDim(), kernel,
                    kernel_input_argument, result.ptr())
        .BlockHostUntilDone();

    float host_result = 0.0f;
    executor->SynchronousMemcpyD2H(result.cref(), sizeof(host_result),
                                   &host_result);

    assert((kernel_input_argument + 42.0f) == host_result);
  }

This support from the compiler makes the use of StreamExecutor safe and easy.


Compiler support for triple angle bracket kernel launches
----------------------------------------------------------

For even greater ease of use, Google's gpucc CUDA compiler also supports an integrated mode that looks like NVIDIA's `CUDA programming model`_, which uses triple angle brackets (`<<<>>>`) to launch kernels.

.. _CUDA programming model: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#kernels

.. code-block:: c++
  :emphasize-lines: 22

  #include <cassert>

  #include "stream_executor.h"

  namespace se = streamexecutor;

  __global__ void add_mystery_value(float input, float *output) {
    *output = input + 42.0f;
  }

  int main(int argc, char *argv[]) {
    se::Platform *platform =
        se::MultiPlatformManager::PlatformWithName("cuda").ValueOrDie();

    const int device_ordinal = 0;
    se::StreamExecutor *executor =
        platform->ExecutorForDevice(device_ordinal).ValueOrDie();

    se::ScopedDeviceMemory<float> result = executor->AllocateOwnedScalar<float>();

    const float kernel_input_argument = 42.5f;
    add_mystery_value<<<1, 1>>>(kernel_input_argument, *result.ptr());

    float host_result = 0.0f;
    executor->SynchronousMemcpyD2H(result.cref(), sizeof(host_result),
                                   &host_result);

    assert((kernel_input_argument + 42.0f) == host_result);
  }

Under the hood, gpucc converts the triple angle bracket kernel call into a series of calls to the StreamExecutor library similar to the calls seen in the previous examples.

Clang currently supports the triple angle bracket kernel call syntax for CUDA compilation by replacing a triple angle bracket call with calls to the NVIDIA CUDA runtime library, but it would be easy to add a compiler flag to tell Clang to emit calls to the StreamExecutor library instead. There are several benefits to supporting this mode of compilation in Clang:

.. _benefits:

* StreamExecutor is a high-level, modern C++ API, so it is easier to use and less prone to error than the NVIDIA CUDA runtime and the OpenCL runtime.
* StreamExecutor will be open-source software, so GPU code will not have to depend on opaque binary blobs like the NVIDIA CUDA runtime library.
* Using StreamExecutor as the runtime would allow for easy extension of the triple angle bracket kernel launch syntax to support different accelerator programming models.


Supporting other platforms
===========================

StreamExecutor currently supports CUDA and OpenCL platforms out of the box, but it uses a platform plugin architecture that makes it easy to add new platforms at any time. The CUDA and OpenCL platforms are both implemented as platform plugins in this way, so they serve as good examples for future platform developers of how to write these kinds of plugins.


Canned operations
==================

StreamExecutor provides several predefined kernels for common data-parallel operations. The supported classes of operations are:

* BLAS: basic linear algebra subprograms,
* DNN: deep neural networks,
* FFT: fast Fourier transforms, and
* RNG: random number generation.

Here is an example of using a canned operation to perform random number generation:

.. code-block:: c++
  :emphasize-lines: 12-13,17,34-35

  #include <array>

  #include "cuda/cuda_rng.h"
  #include "stream_executor.h"

  namespace se = streamexecutor;

  int main(int argc, char *argv[]) {
    se::Platform *platform =
        se::MultiPlatformManager::PlatformWithName("cuda").ValueOrDie();

    se::PluginConfig plugin_config;
    plugin_config.SetRng(se::cuda::kCuRandPlugin);

    const int device_ordinal = 0;
    se::StreamExecutor *executor =
        platform->ExecutorForDeviceWithPluginConfig(device_ordinal, plugin_config)
            .ValueOrDie();

    const uint8 seed[] = {0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7,
                          0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf};
    constexpr uint64 random_element_count = 1024;

    using HostArray = std::array<float, random_element_count>;

    HostArray host_memory;
    const size_t data_size = host_memory.size() * sizeof(HostArray::value_type);

    se::ScopedDeviceMemory<float> device_memory =
        executor->AllocateOwnedArray<float>(random_element_count);

    se::Stream stream(executor);
    stream.Init()
        .ThenSetRngSeed(seed, sizeof(seed))
        .ThenPopulateRandUniform(device_memory.ptr())
        .BlockHostUntilDone();

    executor->SynchronousMemcpyD2H(*device_memory.ptr(), data_size,
                                   host_memory.data());
  }

Each platform plugin can define its own canned operation plugins for these operations or choose to leave any of them unimplemented.


Contrast with OpenMP
=====================

Recent versions of OpenMP also provide a high-level, easy-to-use interface for running data-parallel workloads on an accelerator device. One big difference between OpenMP's approach and that of StreamExecutor is that OpenMP generates both the kernel code that runs on the device and the host-side code needed to launch the kernel, whereas StreamExecutor only generates the host-side code. While the OpenMP model provides the convenience of allowing the author to write their kernel code in standard C/C++, the StreamExecutor model allows for the use of any kernel language (e.g. CUDA C++ or OpenCL C). This lets authors use platform-specific features that are only present in platform-specific kernel definition languages.

The philosophy of StreamExecutor is that performance is critical on the device, but less so on the host. As a result, no attempt is made to use a high-level device abstraction during device code generation. Instead, the high-level abstraction provided by StreamExecutor is used only for the host-side code that moves data and launches kernels. This host-side work is tedious and is not performance critical, so it benefits from being wrapped in a high-level library that can support a wide range of platforms in an easily extensible manner.
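
To make the contrast concrete, here is a minimal illustration (not part of StreamExecutor) of how the `add_mystery_value` computation might look in OpenMP's target offloading model, where a single standard C++ source yields both the device kernel and the host-side launch code:

.. code-block:: c++

  #include <cassert>

  int main() {
    const float input = 42.5f;
    float result = 0.0f;

    // The "target" construct offloads the enclosed region to the device, and
    // the compiler generates both the device code and the host-side launch
    // code. The "map" clauses describe the host/device data movement that a
    // StreamExecutor user would instead express with explicit memcpy calls.
    #pragma omp target map(to: input) map(from: result)
    {
      result = input + 42.0f;
    }

    assert(result == input + 42.0f);
    return 0;
  }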


Cooperation with OpenMP
========================

The Clang OpenMP community is currently in the process of `designing their implementation`_ of offloading support. They will want the compiler to convert the various standardized target-oriented OpenMP pragmas into device code to execute on an accelerator and host code to load and run that device code. StreamExecutor may provide a convenient API for OpenMP to use to generate their host-side code.

.. _designing their implementation: https://drive.google.com/a/google.com/file/d/0B-jX56_FbGKRM21sYlNYVnB4eFk/view

In addition to the benefits_ that all users of StreamExecutor enjoy over the alternative host-side runtime libraries, OpenMP and StreamExecutor may mutually benefit by sharing the work needed to support new platforms. If OpenMP makes use of StreamExecutor, then it should be simple for OpenMP to add support for any new platforms that StreamExecutor supports in the future. Similarly, for any platform OpenMP would like to target, OpenMP developers may add that support in StreamExecutor and take advantage of the platform expertise in the StreamExecutor community. The resulting new platform support would then be available not just within OpenMP, but also to any user of StreamExecutor.

Although OpenMP and StreamExecutor support different programming models, some of the work they perform under the hood will likely be very similar. By sharing code and domain expertise, both projects will be improved and strengthened as their capabilities are expanded. The StreamExecutor community looks forward to much collaboration and discussion with OpenMP about the best places and ways to cooperate.

--------------------------------------------------------------------------------

/se_and_openmp.rst:
--------------------------------------------------------------------------------

.. Using backticks indicates inline code.
.. default-role:: code

================================
StreamExecutor and libomptarget
================================


------------
Introduction
------------

**StreamExecutor** and **libomptarget** are libraries that are both meant to solve the problem of providing runtime support for offloading computational work to an accelerator device. The libomptarget library is already hosted within the OpenMP LLVM subproject, and there is currently a proposal to create another LLVM subproject containing StreamExecutor. To avoid maintaining duplicate functionality in LLVM, it has further been proposed that StreamExecutor implement its platform plugins as thin wrappers around libomptarget RTL instances. This document explains why that proposal does not work given the current APIs of the two libraries, and discusses cases where it might make sense.

Despite the similarities between the two libraries, the libomptarget RTL API does not support the notion of streams of execution, so it cannot be used to implement general StreamExecutor platforms.

If the libomptarget RTL interface is extended to support streams in the future, it may then become feasible to implement StreamExecutor on top of libomptarget, but even then there would still be a question of whether the amount of duplicate code saved by having StreamExecutor call into libomptarget would be enough to balance out the extra code that would be needed in StreamExecutor to adapt the libomptarget API to work with its own API.

To take the example of CUDA, both libomptarget and StreamExecutor have code for very similar wrappers around the CUDA driver API, but in each case this wrapper code is just meant to adapt the CUDA driver API to the API of the wrapper library.
It would not make sense to have StreamExecutor use libomptarget's CUDA wrapper, because then StreamExecutor would just have to add code to adapt from libomptarget's API rather than from CUDA's driver API. An extra layer of wrapping would be added and no reduction in code size or complexity would be achieved.

On the other hand, if there are cases where a runtime library that doesn't support streams is exposed only as a libomptarget RTL instance, then it would make sense for StreamExecutor to wrap the libomptarget implementation in order to provide support for that platform. For cases like those, the StreamExecutor implementation might insist that a `nullptr` is always passed for the stream argument, or StreamExecutor might introduce other methods that don't require a stream argument. The StreamExecutor project would be very open to changes like this.

For these reasons, it would make more sense at this time for StreamExecutor to keep its current implementations of the CUDA and OpenCL platforms (which support streams) rather than attempting to implement those platforms in terms of libomptarget.

The sections below describe the similarities and differences between the two library interfaces in more detail.


----------------------------------------
Comparison of runtime library interfaces
----------------------------------------

This section describes the parallels between the StreamExecutor platform plugin interface and the libomptarget RTL interface, and explains the significant differences that prevent StreamExecutor from implementing its platforms as thin wrappers around libomptarget RTL targets.


Storing handles to device code
==============================
StreamExecutor's `KernelBase` and libomptarget's `__tgt_offload_entry` are both types designed to hold references to device code loaded on a device. Both types store the name of the kernel they point to and a handle that can be used to refer to the loaded code. Additionally, the `KernelBase` class also stores the number of arguments expected by the kernel it points to.

While it would be possible to have `KernelBase` store some of its data as a `__tgt_offload_entry` internally, it would just add an extra layer of abstraction and wouldn't simplify any code.
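
For reference, the libomptarget entry type is a small C struct roughly along these lines (a sketch of its shape only; see libomptarget's headers for the authoritative definition):

.. code-block:: c++

  // Sketch of the shape of libomptarget's offload entry type.
  struct __tgt_offload_entry {
    void *addr;    // Address of the entry (kernel function or global).
    char *name;    // Name of the kernel or global variable.
    int64_t size;  // Size in bytes for globals; 0 indicates a function.
  };

  // By contrast, StreamExecutor's KernelBase also records the kernel's
  // arity, which is what enables the argument checking described elsewhere
  // in this document.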


Referencing device code blobs
=============================
StreamExecutor's `MultiKernelLoaderSpec` and libomptarget's `__tgt_device_image` types are both designed to be wrappers for `void *` pointers to compiled device code.

Whereas a `MultiKernelLoaderSpec` instance only manages the code for a single kernel function, a `__tgt_device_image` instance manages the code for any number of device functions and global variables. An instance of `MultiKernelLoaderSpec` can store code for the same kernel in several different forms. In particular, this allows a `MultiKernelLoaderSpec` to hold several different PTX versions of the code for different compute capabilities. In contrast, a single `__tgt_device_image` stores only one binary blob that must be loaded onto the device as a unit. A `MultiKernelLoaderSpec` can reference a file name rather than a memory pointer for its device code, whereas a `__tgt_device_image` is restricted to referencing memory pointers.

A `MultiKernelLoaderSpec` keeps track of the name of its kernel and the number of arguments that kernel takes. A `__tgt_device_image` keeps track of the names of its kernels, but not the number of arguments they take.

In StreamExecutor terms, a `__tgt_device_image` is like a combination of several `MultiKernelLoaderSpec` instances which all store their data in the same format, and a corresponding set of `KernelBase` objects.

Both `MultiKernelLoaderSpec` and `__tgt_device_image` work best when their instances are created by the compiler. The compiler can make sure the names of the kernels and the number of arguments (in the case of `MultiKernelLoaderSpec`) are set correctly. The compiler can also handle the creation of the device code and can set up the pointers in the wrapper class to point to that data.

The implementation of `__tgt_device_image` is already fully specified, so it cannot be implemented in terms of `MultiKernelLoaderSpec`. It is conceivable that `MultiKernelLoaderSpec` could be implemented as a set of `__tgt_device_image` instances with an additional field to keep track of the number of kernel arguments, but this wouldn't support the case of kernel code stored in a file. Even so, it doesn't seem like a good fit, because `__tgt_device_image` is just a handful of pointers and only a few of them would be used by `MultiKernelLoaderSpec`.


Loading device code onto a device
=================================
StreamExecutor's `GetKernel` method and libomptarget's `__tgt_rtl_load_binary` function are both used to load device code onto a device.

`GetKernel` takes a `MultiKernelLoaderSpec` and a `KernelBase` pointer, while `__tgt_rtl_load_binary` takes an argument of the analogous type, `__tgt_device_image`. The `GetKernel` method sets up its `KernelBase` argument to be a proper handle to the loaded code, whereas the `__tgt_rtl_load_binary` function returns a `__tgt_target_table`, which is really just an array of `__tgt_offload_entry`, so the return value is analogous to an array of `KernelBase` objects. These two methods are very close analogs.

It may be possible to implement `GetKernel` in terms of `__tgt_rtl_load_binary`, roughly as sketched below.
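
The following is a hypothetical sketch of that idea. It assumes an in-memory code blob, the `Status` values and the `KernelBase` setter are illustrative rather than real StreamExecutor API, and error handling is minimal:

.. code-block:: c++

  // Hypothetical sketch: a GetKernel-like operation built on top of
  // __tgt_rtl_load_binary.
  Status GetKernelViaRtl(int32_t device_id, const MultiKernelLoaderSpec &spec,
                         KernelBase *kernel) {
    // Wrap the spec's code blob in the type __tgt_rtl_load_binary expects.
    __tgt_device_image image;  // hypothetical: filled from the spec's blob

    // Load the image; the returned table lists every entry in the image.
    __tgt_target_table *table = __tgt_rtl_load_binary(device_id, &image);
    if (table == nullptr) return Status::InternalError;

    // Find the entry whose name matches the kernel this spec describes and
    // store its handle in the KernelBase. The arity recorded in the spec has
    // no counterpart in the table, so it is carried over from the spec.
    for (__tgt_offload_entry *e = table->EntriesBegin; e != table->EntriesEnd;
         ++e) {
      if (spec.kernel_name() == e->name) {
        kernel->SetHandleAndArity(e->addr, spec.arity());  // hypothetical
        return Status::OK;
      }
    }
    return Status::NotFound;
  }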


Managing device memory
======================
StreamExecutor has `void *Allocate(size_t)` for allocating device memory and `void Deallocate(DeviceMemoryBase *)` for deallocating device memory. The analogous functions in libomptarget are `void *__tgt_rtl_data_alloc(int32_t device_id, int64_t size)` and `int32_t __tgt_rtl_data_delete(int32_t device_id, void *target_ptr)`. These functions are basically identical, and either set could be implemented in terms of the other.

For copying data between the host and device, however, the functionality is not so similar. StreamExecutor has `Memcpy(StreamInterface *, void *, const DeviceMemoryBase &, size_t)` for copying from the device to the host, and `Memcpy(StreamInterface *, DeviceMemoryBase &, const void *, size_t)` for copying from the host to the device. On the other hand, libomptarget has `int32_t __tgt_rtl_data_submit(int32_t device_id, void *target_ptr, void *host_ptr, int64_t size)` and `int32_t __tgt_rtl_data_retrieve(int32_t device_id, void *host_ptr, void *target_ptr, int64_t size)`.

The single difference is that the StreamExecutor methods take a stream argument and the libomptarget functions do not. This is an extremely important difference, because asynchronous data movement is a central aspect of the StreamExecutor interface and has a very large effect on program performance. Without support for streams, it doesn't seem possible to implement the StreamExecutor memory copying functions in terms of their libomptarget counterparts.


Launching kernels on the device
===============================
StreamExecutor has the method `Launch(StreamInterface *, const ThreadDim &, const BlockDim &, const KernelBase &, KernelArgsArrayBase &)` and libomptarget has `__tgt_rtl_run_target_team_region`, which takes the device ID, a handle for the device code on the device, an array of pointers to the kernel arguments, the number of teams, and the number of threads.

The arguments are basically the same, except that the StreamExecutor method again takes a stream parameter, which allows for overlapping compute and data motion. Just as in the case of memory copies, this prevents the StreamExecutor kernel launch function from being implemented in terms of its libomptarget counterpart.

--------------------------------------------------------------------------------

/se_plugin_interface.rst:
--------------------------------------------------------------------------------

.. default-role:: code

=======================================================
StreamExecutor Plugin Interfaces
=======================================================
This is a sketch of the platform plugin interface in StreamExecutor. A developer who wants to support a new platform must implement the classes defined in the **Interface Types** section.


Basic Types
===========
These types are defined here in order to create the vocabulary needed for defining the interfaces below.

`Status`
    An object that signals whether an operation succeeded, and contains error information if the operation failed.

`ThreadDim`
    Dimensions of the thread collection over which a parallel operation is run. Really just three integers specifying three dimension sizes.

`BlockDim`
    Same as `ThreadDim` but for blocks of threads.

`DeviceMemoryBase`
    Wrapper type for a raw device memory pointer. Has methods to get the number of bytes held at the address, and to get the raw pointer itself.

`KernelBase`
    Holds a pointer to a `KernelInterface` (defined below). See `GetKernel` below for details of how such an object is created.

`KernelArgsArrayBase`
    An object that holds all the arguments to be passed to a kernel. It has a method to get the number of arguments, and a method to get a pointer to the array of argument addresses.

`MultiKernelLoaderSpec`
    An object that knows where the compiled device code for a given kernel is stored, and in which format. Supports device code stored in a file by storing the name of the file. Supports device code stored in memory by storing a pointer to the memory. The *Multi* in the name expresses the fact that it can store different memory pointers and file names for the same kernel, because it might store code for the same kernel in several different formats or compiled for different platforms (e.g. CUDA and OpenCL). See `GetKernel` below for details of how an object of this type is used to load a kernel.
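
As a rough sketch of the shapes of a few of these vocabulary types (simplified illustrations only; the real StreamExecutor declarations carry more state and methods):

.. code-block:: c++

  #include <cstddef>

  // Simplified sketches of the vocabulary types described above.
  class Status { /* success flag plus error information on failure */ };

  struct ThreadDim { unsigned x = 1, y = 1, z = 1; };  // thread dimensions
  struct BlockDim  { unsigned x = 1, y = 1, z = 1; };  // block dimensions

  class DeviceMemoryBase {
   public:
    void *opaque() const { return opaque_; }  // the raw device pointer
    size_t size() const { return size_; }     // bytes held at that address

   private:
    void *opaque_ = nullptr;
    size_t size_ = 0;
  };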
Supports device code stored in memory by 44 | storing a pointer to the memory. The *Multi* in the name expresses the fact 45 | that it can store different memory pointers and file names for the same 46 | kernel because it might store code for the same kernel in several different 47 | formats or compiled for different platforms (e.g. CUDA and OpenCL). See 48 | `GetKernel` below for details of how an object of this type is used to load 49 | a kernel. 50 | 51 | 52 | Interface Types 53 | ================= 54 | 55 | 56 | ----------------- 57 | StreamInterface 58 | ----------------- 59 | Opaque handle for a single stream corresponding to a specific device. Each 60 | platform-specific implementation will store its own host representation of a 61 | stream. For example, the CUDA implementation stores a `CUstream` as defined in 62 | the CUDA driver API. 63 | 64 | 65 | ----------------- 66 | KernelInterface 67 | ----------------- 68 | Opaque handle to device code for a single kernel loaded on a specific device. 69 | Each platform-specific implementation will store its own host representation of 70 | a kernel. For example, the CUDA implementation stores a `CUmodule` and a 71 | `CUfunction` as defined in the CUDA driver API. 72 | 73 | 74 | ------------------------- 75 | StreamExecutorInterface 76 | ------------------------- 77 | An object that manages a single accelerator device. In all the methods below, 78 | an implementation-specific `StreamExecutorInterface` can dig into the 79 | implementation-specific details of the `StreamInterface` and `KernelInterface` 80 | objects it deals with. So, for instance, when a CUDA `StreamInterface` is asked 81 | to launch a kernel and is passed a `StreamInterface` and a `KernelInterface` it 82 | can reach inside those objects to get the `CUstream`, `CUmodule`, and 83 | `CUfunction` instances they contain. 84 | 85 | Methods 86 | -------- 87 | `int PlatformDeviceCount()` 88 | Gets the number of devices this StreamExecutor can manage. 89 | 90 | `Status Init(int device_ordinal)` 91 | Takes an device ordinal integer and initializes the StreamExecutor to 92 | manage the device with that number. For CUDA this involves creating a 93 | context on the device. 94 | 95 | `Status GetKernel(const MultiKernelLoaderSpec &spec, KernelBase *kernel)` 96 | Loads the device code specified by `spec` onto the device managed by this 97 | StreamExecutor and sets up the kernel object pointed to by `kernel` to be a 98 | handle for the loaded device code. The `MultiKernelLoaderSpec` basically 99 | provides a `void*` pointer to the compiled device code, so the 100 | implementation of this method has to handle the loading of a binary blob 101 | onto the device and storing a handled to that loaded blob in a `KernelBase` 102 | instance. 103 | 104 | `StreamInterface *GetStreamImplementation()` 105 | Returns a new instance of a `StreamInterface` for this executor. 106 | 107 | `void *Allocate(size_t size)` 108 | Allocates the given number of bytes on the device. 109 | 110 | `void Deallocate(DeviceMemoryBase *mem)` 111 | Deallocates the memory on the device at this address. 112 | 113 | `Status Launch(StreamInterface *s, const ThreadDim &t, const BlockDim &b, const KernelBase &k, KernelArgsArrayBase &args)` 114 | Launches the kernel pointed to by `k` on the stream `s` with thread a block 115 | dimensions given by `t` and `b`, respectively, and passing args specified 116 | by `args`. 

`Status BlockHostUntilDone(StreamInterface *s)`
    Waits until all activity on the given stream is completed.

`Status SynchronizeAllActivity()`
    Waits until all activity on this device is completed.

`Status Memcpy(StreamInterface *s, void *host_dst, const DeviceMemoryBase &device_src, size_t size)`
    Copies data from device to host.

`Status Memcpy(StreamInterface *s, DeviceMemoryBase &device_dst, const void *host_src, size_t size)`
    Copies data from host to device.

`Status MemcpyDeviceToDevice(StreamInterface *s, DeviceMemoryBase *device_dst, const DeviceMemoryBase *device_src, size_t size)`
    Copies data from device to device.

`Status HostCallback(StreamInterface *s, std::function<void()> callback)`
    Executes a host function as part of the work scheduled on the given stream.
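
To tie these pieces together, here is a hedged skeleton of what a trivial platform plugin might look like: a "host" platform that runs everything synchronously on the CPU. All names, the `Status::OK` constant, and the base-class spellings are illustrative rather than the exact StreamExecutor headers:

.. code-block:: c++

  #include <cstddef>
  #include <cstdlib>
  #include <functional>

  // Illustrative skeleton of a platform plugin. Method names follow the
  // interface described above; the real base classes and Status type may
  // differ in detail.
  class HostStream : public StreamInterface {
    // A real GPU platform would store a platform stream handle here
    // (e.g. a CUstream); a synchronous host platform needs no state.
  };

  class HostExecutor : public StreamExecutorInterface {
   public:
    int PlatformDeviceCount() override { return 1; }  // just the host CPU

    Status Init(int device_ordinal) override { return Status::OK; }

    StreamInterface *GetStreamImplementation() override {
      return new HostStream;
    }

    void *Allocate(size_t size) override { return std::malloc(size); }

    void Deallocate(DeviceMemoryBase *mem) override {
      std::free(mem->opaque());
    }

    Status HostCallback(StreamInterface *s,
                        std::function<void()> callback) override {
      callback();  // synchronous platform: run the callback immediately
      return Status::OK;
    }

    // GetKernel, Launch, the Memcpy overloads, and the synchronization
    // methods would be implemented similarly, in terms of the host "device".
  };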