├── .gitignore ├── README.md ├── building-a-brain ├── BuildingABrain.ipynb └── README.md └── even-easier-cuda ├── An_Even_Easier_Introduction_to_CUDA.ipynb └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | **/*.ipynb_checkpoints/ 2 | .ipynb_checkpoints 3 | */.ipynb_checkpoints/* 4 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # NVIDIA DLI Notebooks 2 | 3 | This repository contains notebooks authored by the [NVIDIA DLI](https://nvidia.com/dli) for learning deep learning, data science, and accelerated computing. 4 | 5 | ## Deep Learning Notebooks 6 | 7 | - **[_Building a Brain in 10 Minutes_](https://github.com/NVDLI/notebooks/tree/master/building-a-brain):** This notebook explores the biological and psychological inspirations for the world's first neural networks. 8 | 9 | ## Accelerated Computing Notebooks 10 | 11 | - **[_An Even Easier Introduction to CUDA_](https://github.com/NVDLI/notebooks/tree/master/even-easier-cuda):** This notebook accompanies Mark Harris's popular blog post [_An Even Easier Introduction to CUDA_](https://developer.nvidia.com/blog/even-easier-introduction-cuda/), teaching you the basics of CUDA programming. 12 | 13 | ## NVIDIA DLI Catalog 14 | 15 | If you enjoy these notebooks, we recommend you check out the [DLI's full catalog of courses](https://nvidia.com/dli), which covers a much broader range of topics in much greater depth, with dedicated GPU resources and a more sophisticated programming environment. 16 | -------------------------------------------------------------------------------- /building-a-brain/README.md: -------------------------------------------------------------------------------- 1 | # Building a Brain in 10 Minutes 2 | 3 | ![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg) 4 | 5 | This notebook explores the biological and psychological inspirations for the world's first neural networks. It can be run directly in Google Colaboratory. 6 | 7 | ## Learning Objectives 8 | 9 | The goals of this exercise include: 10 | - Exploring how neural networks use data to learn 11 | - Understanding the math behind a neuron (see the short sketch at the end of this README) 12 | 13 | ## Prerequisites 14 | 15 | Anyone can run the code to see how it works, but to get the most out of this content, we recommend: 16 | - An understanding of fundamental programming concepts in [Python 3](https://wiki.python.org/moin/BeginnersGuide) such as functions, loops, dictionaries, and arrays. 17 | - An understanding of how to compute a [regression line](http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm). 18 | 19 | ## Followup Materials 20 | 21 | Want to learn how to build more state-of-the-art models? Check out the follow-up to this notebook, [Getting Started with Deep Learning](https://courses.nvidia.com/courses/course-v1:DLI+S-FX-01+V1/about), or our other online courses at [NVIDIA DLI](https://www.nvidia.com/en-us/training/online/).
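## The Math Behind a Neuron: A Quick Sketch

As a taste of what the notebook covers (this sketch is illustrative only and is not taken from the notebook itself; the weights, bias, and inputs below are made-up values), a single artificial neuron computes a weighted sum of its inputs plus a bias, then passes the result through an activation function:

```python
# Minimal sketch of a single artificial neuron (illustrative values only).
def neuron(inputs, weights, bias):
    # Weighted sum of the inputs, plus a bias term.
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    # Simple step activation: the neuron "fires" (returns 1) if the sum is positive.
    return 1 if z > 0 else 0

print(neuron(inputs=[0.5, 0.8], weights=[0.9, -0.2], bias=0.1))  # prints 1
```

The notebook builds on this idea, showing how the weights and bias are learned from data rather than chosen by hand.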
22 | -------------------------------------------------------------------------------- /even-easier-cuda/An_Even_Easier_Introduction_to_CUDA.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "An Even Easier Introduction to CUDA.ipynb", 7 | "provenance": [], 8 | "collapsed_sections": [] 9 | }, 10 | "kernelspec": { 11 | "name": "python3", 12 | "display_name": "Python 3" 13 | }, 14 | "language_info": { 15 | "name": "python" 16 | }, 17 | "accelerator": "GPU" 18 | }, 19 | "cells": [ 20 | { 21 | "cell_type": "markdown", 22 | "metadata": { 23 | "id": "2WkOA4mcN7Hj" 24 | }, 25 | "source": [ 26 | "# An Even Easier Introduction to CUDA" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": { 32 | "id": "vuOcUi0fvogW" 33 | }, 34 | "source": [ 35 | "This notebook accompanies Mark Harris's popular blog post [_An Even Easier Introduction to CUDA_](https://developer.nvidia.com/blog/even-easier-introduction-cuda/).\n", 36 | "\n", 37 | "If you enjoy this notebook and want to learn more, the [NVIDIA DLI](https://nvidia.com/dli) offers several in-depth CUDA Programming courses.\n", 38 | "\n", 39 | "For those of you just starting out, please consider [_Fundamentals of Accelerated Computing with CUDA C/C++_](https://courses.nvidia.com/courses/course-v1:DLI+C-AC-01+V1/about), which provides dedicated GPU resources, a more sophisticated programming environment, use of the [NVIDIA Nsight Systems™](https://developer.nvidia.com/nsight-systems) visual profiler, dozens of interactive exercises, detailed presentations, over 8 hours of material, and the ability to earn a DLI Certificate of Competency.\n", 40 | "\n", 41 | "Similarly, for Python programmers, please consider [_Fundamentals of Accelerated Computing with CUDA Python_](https://courses.nvidia.com/courses/course-v1:DLI+C-AC-02+V1/about).\n", 42 | "\n", 43 | "For more intermediate and advanced CUDA programming materials, please check out the _Accelerated Computing_ section of the NVIDIA DLI [self-paced catalog](https://www.nvidia.com/en-us/training/online/)." 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": { 49 | "id": "V1C6GK_MO5er" 50 | }, 51 | "source": [ 52 | "" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": { 58 | "id": "IcmbR8lZPLRv" 59 | }, 60 | "source": [ 61 | "This post is a super simple introduction to CUDA, the popular parallel computing platform and programming model from NVIDIA. I wrote a previous [“Easy Introduction”](https://developer.nvidia.com/blog/easy-introduction-cuda-c-and-c/) to CUDA in 2013 that has been very popular over the years. But CUDA programming has gotten easier, and GPUs have gotten much faster, so it’s time for an updated (and even easier) introduction.\n", 62 | "\n", 63 | "CUDA C++ is just one of the ways you can create massively parallel applications with CUDA. It lets you use the powerful C++ programming language to develop high performance algorithms accelerated by thousands of parallel threads running on GPUs. Many developers have accelerated their computation- and bandwidth-hungry applications this way, including the libraries and frameworks that underpin the ongoing revolution in artificial intelligence known as [Deep Learning](https://developer.nvidia.com/deep-learning).\n", 64 | "\n", 65 | "So, you’ve heard about CUDA and you are interested in learning how to use it in your own applications.
If you are a C or C++ programmer, this blog post should give you a good start. To follow along, you’ll need a computer with a CUDA-capable GPU (Windows, Mac, or Linux, and any NVIDIA GPU should do), or a cloud instance with GPUs (AWS, Azure, IBM SoftLayer, and other cloud service providers have them). You’ll also need the free [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit) installed.\n", 66 | "\n", 67 | "Let's get started!" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": { 73 | "id": "vDQ9ycz0Qfyf" 74 | }, 75 | "source": [ 76 | "" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": { 82 | "id": "wH9Rfms_QtXF" 83 | }, 84 | "source": [ 85 | "## Starting Simple" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": { 91 | "id": "n5-iUihBQvQt" 92 | }, 93 | "source": [ 94 | "We’ll start with a simple C++ program that adds the elements of two arrays with a million elements each." 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "metadata": { 100 | "id": "nc-gBqLDQ7AC" 101 | }, 102 | "source": [ 103 | "%%writefile add.cpp\n", 104 | "\n", 105 | "#include <iostream>\n", 106 | "#include <math.h>\n", 107 | "\n", 108 | "// function to add the elements of two arrays\n", 109 | "void add(int n, float *x, float *y)\n", 110 | "{\n", 111 | " for (int i = 0; i < n; i++)\n", 112 | " y[i] = x[i] + y[i];\n", 113 | "}\n", 114 | "\n", 115 | "int main(void)\n", 116 | "{\n", 117 | " int N = 1<<20; // 1M elements\n", 118 | "\n", 119 | " float *x = new float[N];\n", 120 | " float *y = new float[N];\n", 121 | "\n", 122 | " // initialize x and y arrays on the host\n", 123 | " for (int i = 0; i < N; i++) {\n", 124 | " x[i] = 1.0f;\n", 125 | " y[i] = 2.0f;\n", 126 | " }\n", 127 | "\n", 128 | " // Run kernel on 1M elements on the CPU\n", 129 | " add(N, x, y);\n", 130 | "\n", 131 | " // Check for errors (all values should be 3.0f)\n", 132 | " float maxError = 0.0f;\n", 133 | " for (int i = 0; i < N; i++)\n", 134 | " maxError = fmax(maxError, fabs(y[i]-3.0f));\n", 135 | " std::cout << \"Max error: \" << maxError << std::endl;\n", 136 | "\n", 137 | " // Free memory\n", 138 | " delete [] x;\n", 139 | " delete [] y;\n", 140 | "\n", 141 | " return 0;\n", 142 | "}" 143 | ], 144 | "execution_count": null, 145 | "outputs": [] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "metadata": { 150 | "id": "gw6DsX4uRHMg" 151 | }, 152 | "source": [ 153 | "Executing the above cell will save its contents to the file `add.cpp`.\n", 154 | "\n", 155 | "The following cell will compile this C++ program." 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "metadata": { 161 | "id": "gNpH54M_RbAU" 162 | }, 163 | "source": [ 164 | "%%shell\n", 165 | "g++ add.cpp -o add" 166 | ], 167 | "execution_count": null, 168 | "outputs": [] 169 | }, 170 | { 171 | "cell_type": "markdown", 172 | "metadata": { 173 | "id": "I6V2tGPYRi3l" 174 | }, 175 | "source": [ 176 | "Then run it:" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "metadata": { 182 | "id": "QmA4ACe5RuiU" 183 | }, 184 | "source": [ 185 | "%%shell\n", 186 | "./add" 187 | ], 188 | "execution_count": null, 189 | "outputs": [] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "metadata": { 194 | "id": "9IAWYlniR153" 195 | }, 196 | "source": [ 197 | "As expected, it prints that there was no error in the summation and then exits. Now I want to get this computation running (in parallel) on the many cores of a GPU.
It’s actually pretty easy to take the first steps.\n", 198 | "\n", 199 | "First, I just have to turn our `add` function into a function that the GPU can run, called a *kernel* in CUDA. To do this, all I have to do is add the specifier `__global__` to the function, which tells the CUDA C++ compiler that this is a function that runs on the GPU and can be called from CPU code." 200 | ] 201 | }, 202 | { 203 | "cell_type": "markdown", 204 | "metadata": { 205 | "id": "heY-lpzjSHfB" 206 | }, 207 | "source": [ 208 | "```cpp\n", 209 | "// CUDA Kernel function to add the elements of two arrays on the GPU\n", 210 | "__global__\n", 211 | "void add(int n, float *x, float *y)\n", 212 | "{\n", 213 | " for (int i = 0; i < n; i++)\n", 214 | " y[i] = x[i] + y[i];\n", 215 | "}\n", 216 | "```" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": { 222 | "id": "kozMbHdpSKNu" 223 | }, 224 | "source": [ 225 | "These `__global__` functions are known as *kernels*, and code that runs on the GPU is often called *device code*, while code that runs on the CPU is *host code*." 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": { 231 | "id": "VhnBGGU-SWiN" 232 | }, 233 | "source": [ 234 | "## Memory Allocation in CUDA" 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": { 240 | "id": "RvIDRBk2SbqA" 241 | }, 242 | "source": [ 243 | "To compute on the GPU, I need to allocate memory accessible by the GPU. [Unified Memory](https://developer.nvidia.com/blog/unified-memory-in-cuda-6/) in CUDA makes this easy by providing a single memory space accessible by all GPUs and CPUs in your system. To allocate data in unified memory, call `cudaMallocManaged()`, which returns a pointer that you can access from host (CPU) code or device (GPU) code. To free the data, just pass the pointer to `cudaFree()`.\n", 244 | "\n", 245 | "I just need to replace the calls to `new` in the code above with calls to `cudaMallocManaged()`, and replace calls to `delete []` with calls to `cudaFree`." 246 | ] 247 | }, 248 | { 249 | "cell_type": "markdown", 250 | "metadata": { 251 | "id": "IxCut_urS46H" 252 | }, 253 | "source": [ 254 | "```cpp\n", 255 | " // Allocate Unified Memory -- accessible from CPU or GPU\n", 256 | " float *x, *y;\n", 257 | " cudaMallocManaged(&x, N*sizeof(float));\n", 258 | " cudaMallocManaged(&y, N*sizeof(float));\n", 259 | "\n", 260 | " ...\n", 261 | "\n", 262 | " // Free memory\n", 263 | " cudaFree(x);\n", 264 | " cudaFree(y);\n", 265 | "```" 266 | ] 267 | }, 268 | { 269 | "cell_type": "markdown", 270 | "metadata": { 271 | "id": "2oEf2B-1S-1V" 272 | }, 273 | "source": [ 274 | "Finally, I need to *launch* the `add()` kernel, which invokes it on the GPU. CUDA kernel launches are specified using the triple angle bracket syntax `<<< >>>`. I just have to add it to the call to `add` before the parameter list." 275 | ] 276 | }, 277 | { 278 | "cell_type": "markdown", 279 | "metadata": { 280 | "id": "bqTJlvWLS7iW" 281 | }, 282 | "source": [ 283 | "```cpp\n", 284 | "add<<<1, 1>>>(N, x, y);\n", 285 | "```" 286 | ] 287 | }, 288 | { 289 | "cell_type": "markdown", 290 | "metadata": { 291 | "id": "RGf0ZiTOTTHU" 292 | }, 293 | "source": [ 294 | "Easy! 
I’ll get into the details of what goes inside the angle brackets soon; for now all you need to know is that this line launches one GPU thread to run `add()`.\n", 295 | "\n", 296 | "Just one more thing: I need the CPU to wait until the kernel is done before it accesses the results (because CUDA kernel launches don’t block the calling CPU thread). To do this I just call `cudaDeviceSynchronize()` before doing the final error checking on the CPU.\n", 297 | "\n", 298 | "Here’s the complete code:" 299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "metadata": { 304 | "id": "K8bYDM7kYT7S" 305 | }, 306 | "source": [ 307 | "%%writefile add.cu\n", 308 | "\n", 309 | "#include <iostream>\n", 310 | "#include <math.h>\n", 311 | "// Kernel function to add the elements of two arrays\n", 312 | "__global__\n", 313 | "void add(int n, float *x, float *y)\n", 314 | "{\n", 315 | " for (int i = 0; i < n; i++)\n", 316 | " y[i] = x[i] + y[i];\n", 317 | "}\n", 318 | "\n", 319 | "int main(void)\n", 320 | "{\n", 321 | " int N = 1<<20;\n", 322 | "\n", 323 | " float *x, *y;\n", 324 | "\n", 325 | " // Allocate Unified Memory – accessible from CPU or GPU\n", 326 | " cudaMallocManaged(&x, N*sizeof(float));\n", 327 | " cudaMallocManaged(&y, N*sizeof(float));\n", 328 | "\n", 329 | " // initialize x and y arrays on the host\n", 330 | " for (int i = 0; i < N; i++) {\n", 331 | " x[i] = 1.0f;\n", 332 | " y[i] = 2.0f;\n", 333 | " }\n", 334 | "\n", 335 | " // Run kernel on 1M elements on the GPU\n", 336 | " add<<<1, 1>>>(N, x, y);\n", 337 | "\n", 338 | " // Wait for GPU to finish before accessing on host\n", 339 | " cudaDeviceSynchronize();\n", 340 | "\n", 341 | " // Check for errors (all values should be 3.0f)\n", 342 | " float maxError = 0.0f;\n", 343 | " for (int i = 0; i < N; i++)\n", 344 | " maxError = fmax(maxError, fabs(y[i]-3.0f));\n", 345 | " std::cout << \"Max error: \" << maxError << std::endl;\n", 346 | "\n", 347 | " // Free memory\n", 348 | " cudaFree(x);\n", 349 | " cudaFree(y);\n", 350 | " \n", 351 | " return 0;\n", 352 | "}" 353 | ], 354 | "execution_count": null, 355 | "outputs": [] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "metadata": { 360 | "id": "TjLGGp0oYeEc" 361 | }, 362 | "source": [ 363 | "%%shell\n", 364 | "\n", 365 | "nvcc add.cu -o add_cuda\n", 366 | "./add_cuda" 367 | ], 368 | "execution_count": null, 369 | "outputs": [] 370 | }, 371 | { 372 | "cell_type": "markdown", 373 | "metadata": { 374 | "id": "6ATssEzEYqGx" 375 | }, 376 | "source": [ 377 | "This is only a first step, because as written, this kernel is only correct for a single thread, since every thread that runs it will perform the add on the whole array. Moreover, there is a [race condition](https://en.wikipedia.org/wiki/Race_condition) since multiple parallel threads would both read and write the same locations." 378 | ] 379 | }, 380 | { 381 | "cell_type": "markdown", 382 | "metadata": { 383 | "id": "3kKpDoZ-YzJ8" 384 | }, 385 | "source": [ 386 | "## Profile it!" 387 | ] 388 | }, 389 | { 390 | "cell_type": "markdown", 391 | "metadata": { 392 | "id": "r-BC-CWVZglt" 393 | }, 394 | "source": [ 395 | "I think the simplest way to find out how long the kernel takes to run is to run it with `nvprof`, the command line GPU profiler that comes with the CUDA Toolkit.
Just type `nvprof ./add_cuda` on the command line:" 396 | ] 397 | }, 398 | { 399 | "cell_type": "code", 400 | "metadata": { 401 | "id": "gtfQLWwYZpfV" 402 | }, 403 | "source": [ 404 | "%%shell\n", 405 | "\n", 406 | "nvprof ./add_cuda" 407 | ], 408 | "execution_count": null, 409 | "outputs": [] 410 | }, 411 | { 412 | "cell_type": "markdown", 413 | "metadata": { 414 | "id": "F9Dn4ZV-Z_UJ" 415 | }, 416 | "source": [ 417 | "The above will show the single call to `add`. Your timing may vary depending on the GPU allocated to you by Colab. To see the GPU currently allocated to you, run the following cell and look in the `Name` column, where you might see, for example, `Tesla T4`:" 418 | ] 419 | }, 420 | { 421 | "cell_type": "code", 422 | "metadata": { 423 | "id": "TrYmwVZfaPqz" 424 | }, 425 | "source": [ 426 | "%%shell\n", 427 | "\n", 428 | "nvidia-smi" 429 | ], 430 | "execution_count": null, 431 | "outputs": [] 432 | }, 433 | { 434 | "cell_type": "markdown", 435 | "metadata": { 436 | "id": "-MWYteAVadCs" 437 | }, 438 | "source": [ 439 | "Let's make it faster with parallelism." 440 | ] 441 | }, 442 | { 443 | "cell_type": "markdown", 444 | "metadata": { 445 | "id": "SaiMC73Falvb" 446 | }, 447 | "source": [ 448 | "## Picking up the Threads" 449 | ] 450 | }, 451 | { 452 | "cell_type": "markdown", 453 | "metadata": { 454 | "id": "KDFuBr_2apuJ" 455 | }, 456 | "source": [ 457 | "Now that you’ve run a kernel with one thread that does some computation, how do you make it parallel? The key is in CUDA’s `<<<1, 1>>>` syntax. This is called the execution configuration, and it tells the CUDA runtime how many parallel threads to use for the launch on the GPU. There are two parameters here, but let’s start by changing the second one: the number of threads in a thread block. CUDA GPUs run kernels using blocks of threads that are a multiple of 32 in size, so 256 threads is a reasonable size to choose." 458 | ] 459 | }, 460 | { 461 | "cell_type": "markdown", 462 | "metadata": { 463 | "id": "a2Pmyj0KavgB" 464 | }, 465 | "source": [ 466 | "```cpp\n", 467 | "add<<<1, 256>>>(N, x, y);\n", 468 | "```" 469 | ] 470 | }, 471 | { 472 | "cell_type": "markdown", 473 | "metadata": { 474 | "id": "oAYpH9Ctay5G" 475 | }, 476 | "source": [ 477 | "If I run the code with only this change, it will do the computation once per thread, rather than spreading the computation across the parallel threads. To do it properly, I need to modify the kernel. CUDA C++ provides keywords that let kernels get the indices of the running threads. Specifically, `threadIdx.x` contains the index of the current thread within its block, and `blockDim.x` contains the number of threads in the block. I’ll just modify the loop to stride through the array with parallel threads." 478 | ] 479 | }, 480 | { 481 | "cell_type": "markdown", 482 | "metadata": { 483 | "id": "TSiqhFK_a6N3" 484 | }, 485 | "source": [ 486 | "```cpp\n", 487 | "__global__\n", 488 | "void add(int n, float *x, float *y)\n", 489 | "{\n", 490 | " int index = threadIdx.x;\n", 491 | " int stride = blockDim.x;\n", 492 | " for (int i = index; i < n; i += stride)\n", 493 | " y[i] = x[i] + y[i];\n", 494 | "}\n", 495 | "```" 496 | ] 497 | }, 498 | { 499 | "cell_type": "markdown", 500 | "metadata": { 501 | "id": "_7mYcBzOa9zR" 502 | }, 503 | "source": [ 504 | "The `add` function hasn’t changed that much. In fact, setting `index` to 0 and `stride` to 1 makes it semantically identical to the first version.\n", 505 | "\n", 506 | "Here we save the file as `add_block.cu` and compile and run it in `nvprof` again."
507 | ] 508 | }, 509 | { 510 | "cell_type": "code", 511 | "metadata": { 512 | "id": "goCKY9QNbPZ-" 513 | }, 514 | "source": [ 515 | "%%writefile add_block.cu\n", 516 | "\n", 517 | "#include <iostream>\n", 518 | "#include <math.h>\n", 519 | "\n", 520 | "// Kernel function to add the elements of two arrays\n", 521 | "__global__\n", 522 | "void add(int n, float *x, float *y)\n", 523 | "{\n", 524 | " int index = threadIdx.x;\n", 525 | " int stride = blockDim.x;\n", 526 | " for (int i = index; i < n; i += stride)\n", 527 | " y[i] = x[i] + y[i];\n", 528 | "}\n", 529 | "\n", 530 | "int main(void)\n", 531 | "{\n", 532 | " int N = 1<<20;\n", 533 | " float *x, *y;\n", 534 | "\n", 535 | " // Allocate Unified Memory – accessible from CPU or GPU\n", 536 | " cudaMallocManaged(&x, N*sizeof(float));\n", 537 | " cudaMallocManaged(&y, N*sizeof(float));\n", 538 | "\n", 539 | " // initialize x and y arrays on the host\n", 540 | " for (int i = 0; i < N; i++) {\n", 541 | " x[i] = 1.0f;\n", 542 | " y[i] = 2.0f;\n", 543 | " }\n", 544 | "\n", 545 | " // Run kernel on 1M elements on the GPU\n", 546 | " add<<<1, 256>>>(N, x, y);\n", 547 | "\n", 548 | " // Wait for GPU to finish before accessing on host\n", 549 | " cudaDeviceSynchronize();\n", 550 | "\n", 551 | " // Check for errors (all values should be 3.0f)\n", 552 | " float maxError = 0.0f;\n", 553 | " for (int i = 0; i < N; i++)\n", 554 | " maxError = fmax(maxError, fabs(y[i]-3.0f));\n", 555 | " std::cout << \"Max error: \" << maxError << std::endl;\n", 556 | "\n", 557 | " // Free memory\n", 558 | " cudaFree(x);\n", 559 | " cudaFree(y);\n", 560 | " \n", 561 | " return 0;\n", 562 | "}" 563 | ], 564 | "execution_count": null, 565 | "outputs": [] 566 | }, 567 | { 568 | "cell_type": "code", 569 | "metadata": { 570 | "id": "l9cmfbcVbYgD" 571 | }, 572 | "source": [ 573 | "%%shell\n", 574 | "\n", 575 | "nvcc add_block.cu -o add_block\n", 576 | "nvprof ./add_block" 577 | ], 578 | "execution_count": null, 579 | "outputs": [] 580 | }, 581 | { 582 | "cell_type": "markdown", 583 | "metadata": { 584 | "id": "Fo5KaV3Nba7g" 585 | }, 586 | "source": [ 587 | "That’s a big speedup (compare the time for the `add` kernel by looking at the `GPU activities` field), but not surprising since I went from 1 thread to 256 threads. Let’s keep going to get even more performance." 588 | ] 589 | }, 590 | { 591 | "cell_type": "markdown", 592 | "metadata": { 593 | "id": "YtgQWOyMcPfn" 594 | }, 595 | "source": [ 596 | "## Out of the Blocks" 597 | ] 598 | }, 599 | { 600 | "cell_type": "markdown", 601 | "metadata": { 602 | "id": "wAoFGwmbcRbN" 603 | }, 604 | "source": [ 605 | "CUDA GPUs have many parallel processors grouped into Streaming Multiprocessors, or SMs. Each SM can run multiple concurrent thread blocks. As an example, a Tesla P100 GPU based on the [Pascal GPU Architecture](https://developer.nvidia.com/blog/inside-pascal/) has 56 SMs, each capable of supporting up to 2048 active threads. To take full advantage of all these threads, I should launch the kernel with multiple thread blocks.\n", 606 | "\n", 607 | "By now you may have guessed that the first parameter of the execution configuration specifies the number of thread blocks. Together, the blocks of parallel threads make up what is known as the *grid*. Since I have `N` elements to process, and 256 threads per block, I just need to calculate the number of blocks to get at least `N` threads. I simply divide `N` by the block size (being careful to round up in case `N` is not a multiple of `blockSize`)."
608 | ] 609 | }, 610 | { 611 | "cell_type": "markdown", 612 | "metadata": { 613 | "id": "AnI2II2ockgC" 614 | }, 615 | "source": [ 616 | "```cpp\n", 617 | "int blockSize = 256;\n", 618 | "int numBlocks = (N + blockSize - 1) / blockSize;\n", 619 | "add<<<numBlocks, blockSize>>>(N, x, y);\n", 620 | "```" 621 | ] 622 | }, 623 | { 624 | "cell_type": "markdown", 625 | "metadata": { 626 | "id": "ayq2MJZLctY0" 627 | }, 628 | "source": [ 629 | "" 630 | ] 631 | }, 632 | { 633 | "cell_type": "markdown", 634 | "metadata": { 635 | "id": "fZduP7RWc3Je" 636 | }, 637 | "source": [ 638 | "I also need to update the kernel code to take into account the entire grid of thread blocks. CUDA provides `gridDim.x`, which contains the number of blocks in the grid, and `blockIdx.x`, which contains the index of the current thread block in the grid. Figure 1 illustrates the approach to indexing into an array (one-dimensional) in CUDA using `blockDim.x`, `gridDim.x`, and `threadIdx.x`. The idea is that each thread gets its index by computing the offset to the beginning of its block (the block index times the block size: `blockIdx.x * blockDim.x`) and adding the thread’s index within the block (`threadIdx.x`). The code `blockIdx.x * blockDim.x + threadIdx.x` is idiomatic CUDA." 639 | ] 640 | }, 641 | { 642 | "cell_type": "markdown", 643 | "metadata": { 644 | "id": "6cI2WLEAeG5y" 645 | }, 646 | "source": [ 647 | "```cpp\n", 648 | "__global__\n", 649 | "void add(int n, float *x, float *y)\n", 650 | "{\n", 651 | " int index = blockIdx.x * blockDim.x + threadIdx.x;\n", 652 | " int stride = blockDim.x * gridDim.x;\n", 653 | " for (int i = index; i < n; i += stride)\n", 654 | " y[i] = x[i] + y[i];\n", 655 | "}\n", 656 | "```" 657 | ] 658 | }, 659 | { 660 | "cell_type": "markdown", 661 | "metadata": { 662 | "id": "83hC-rCLdPHC" 663 | }, 664 | "source": [ 665 | "The updated kernel also sets stride to the total number of threads in the grid (`blockDim.x * gridDim.x`). This type of loop in a CUDA kernel is often called a [*grid-stride*](https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/) loop.\n", 666 | "\n", 667 | "Save the file as `add_grid.cu` and compile and run it in `nvprof` again."
668 | ] 669 | }, 670 | { 671 | "cell_type": "code", 672 | "metadata": { 673 | "id": "a7w-DHBRdhUC" 674 | }, 675 | "source": [ 676 | "%%writefile add_grid.cu\n", 677 | "\n", 678 | "#include <iostream>\n", 679 | "#include <math.h>\n", 680 | "\n", 681 | "// Kernel function to add the elements of two arrays\n", 682 | "__global__\n", 683 | "void add(int n, float *x, float *y)\n", 684 | "{\n", 685 | " int index = blockIdx.x * blockDim.x + threadIdx.x;\n", 686 | " int stride = blockDim.x * gridDim.x;\n", 687 | " for (int i = index; i < n; i += stride)\n", 688 | " y[i] = x[i] + y[i];\n", 689 | "}\n", 690 | "\n", 691 | "int main(void)\n", 692 | "{\n", 693 | " int N = 1<<20;\n", 694 | " float *x, *y;\n", 695 | "\n", 696 | " // Allocate Unified Memory – accessible from CPU or GPU\n", 697 | " cudaMallocManaged(&x, N*sizeof(float));\n", 698 | " cudaMallocManaged(&y, N*sizeof(float));\n", 699 | "\n", 700 | " // initialize x and y arrays on the host\n", 701 | " for (int i = 0; i < N; i++) {\n", 702 | " x[i] = 1.0f;\n", 703 | " y[i] = 2.0f;\n", 704 | " }\n", 705 | "\n", 706 | " // Run kernel on 1M elements on the GPU\n", 707 | " int blockSize = 256;\n", 708 | " int numBlocks = (N + blockSize - 1) / blockSize;\n", 709 | " add<<<numBlocks, blockSize>>>(N, x, y);\n", 710 | "\n", 711 | " // Wait for GPU to finish before accessing on host\n", 712 | " cudaDeviceSynchronize();\n", 713 | "\n", 714 | " // Check for errors (all values should be 3.0f)\n", 715 | " float maxError = 0.0f;\n", 716 | " for (int i = 0; i < N; i++)\n", 717 | " maxError = fmax(maxError, fabs(y[i]-3.0f));\n", 718 | " std::cout << \"Max error: \" << maxError << std::endl;\n", 719 | "\n", 720 | " // Free memory\n", 721 | " cudaFree(x);\n", 722 | " cudaFree(y);\n", 723 | " \n", 724 | " return 0;\n", 725 | "}" 726 | ], 727 | "execution_count": null, 728 | "outputs": [] 729 | }, 730 | { 731 | "cell_type": "code", 732 | "metadata": { 733 | "id": "FhcrktW9dw34" 734 | }, 735 | "source": [ 736 | "%%shell\n", 737 | "\n", 738 | "nvcc add_grid.cu -o add_grid\n", 739 | "nvprof ./add_grid" 740 | ], 741 | "execution_count": null, 742 | "outputs": [] 743 | }, 744 | { 745 | "cell_type": "markdown", 746 | "metadata": { 747 | "id": "Y7Tz-xo3d1oX" 748 | }, 749 | "source": [ 750 | "That's another big speedup from running multiple blocks! (Note your results may vary from the blog post due to whatever GPU you've been allocated by Colab. If you notice your speedups for the final example are not as drastic as those in the blog post, check out #4 in the *Exercises* section below.)" 751 | ] 752 | }, 753 | { 754 | "cell_type": "markdown", 755 | "metadata": { 756 | "id": "Ja5CiQZpicHC" 757 | }, 758 | "source": [ 759 | "## Exercises" 760 | ] 761 | }, 762 | { 763 | "cell_type": "markdown", 764 | "metadata": { 765 | "id": "BEijwk25id3t" 766 | }, 767 | "source": [ 768 | "To keep you going, here are a few things to try on your own.\n", 769 | "\n", 770 | "1. Browse the [CUDA Toolkit documentation](https://docs.nvidia.com/cuda/index.html). If you haven’t installed CUDA yet, check out the [Quick Start Guide](https://docs.nvidia.com/cuda/cuda-quick-start-guide/index.html) and the installation guides. Then browse the [Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html) and the [Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html). There are also tuning guides for various architectures.\n", 771 | "2. Experiment with `printf()` inside the kernel. Try printing out the values of `threadIdx.x` and `blockIdx.x` for some or all of the threads.
Do they print in sequential order? Why or why not?\n", 772 | "3. Print the value of `threadIdx.y` or `threadIdx.z` (or `blockIdx.y`) in the kernel. (Likewise for `blockDim` and `gridDim`). Why do these exist? How do you get them to take on values other than 0 (1 for the dims)?\n", 773 | "4. If you have access to a [Pascal-based GPU](https://developer.nvidia.com/blog/inside-pascal/), try running `add_grid.cu` on it. Is performance better or worse than the K80 results? Why? (Hint: read about [Pascal’s Page Migration Engine and the CUDA 8 Unified Memory API](https://developer.nvidia.com/blog/beyond-gpu-memory-limits-unified-memory-pascal/).) For a detailed answer to this question, see the post [Unified Memory for CUDA Beginners](https://developer.nvidia.com/blog/unified-memory-cuda-beginners/)." 774 | ] 775 | }, 776 | { 777 | "cell_type": "markdown", 778 | "metadata": { 779 | "id": "NpWVnIPujp0K" 780 | }, 781 | "source": [ 782 | "## Where to From Here" 783 | ] 784 | }, 785 | { 786 | "cell_type": "markdown", 787 | "metadata": { 788 | "id": "nTyQePjlkRJ3" 789 | }, 790 | "source": [ 791 | "If you enjoyed this notebook and want to learn more, the [NVIDIA DLI](https://nvidia.com/dli) offers several in-depth CUDA Programming courses.\n", 792 | "\n", 793 | "For those of you just starting out, please consider [_Fundamentals of Accelerated Computing with CUDA C/C++_](https://courses.nvidia.com/courses/course-v1:DLI+C-AC-01+V1/about), which provides dedicated GPU resources, a more sophisticated programming environment, use of the [NVIDIA Nsight Systems™](https://developer.nvidia.com/nsight-systems) visual profiler, dozens of interactive exercises, detailed presentations, over 8 hours of material, and the ability to earn a DLI Certificate of Competency.\n", 794 | "\n", 795 | "Similarly, for Python programmers, please consider [_Fundamentals of Accelerated Computing with CUDA Python_](https://courses.nvidia.com/courses/course-v1:DLI+C-AC-02+V1/about).\n", 796 | "\n", 797 | "For more intermediate and advanced CUDA programming materials, please check out the _Accelerated Computing_ section of the NVIDIA DLI [self-paced catalog](https://www.nvidia.com/en-us/training/online/)." 798 | ] 799 | } 800 | ] 801 | } 802 | -------------------------------------------------------------------------------- /even-easier-cuda/README.md: -------------------------------------------------------------------------------- 1 | # An Even Easier Introduction to CUDA - Accompanying Notebook 2 | 3 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NVDLI/notebooks/blob/master/even-easier-cuda/An_Even_Easier_Introduction_to_CUDA.ipynb) 4 | 5 | This notebook accompanies Mark Harris's popular blog post [_An Even Easier Introduction to CUDA_](https://developer.nvidia.com/blog/even-easier-introduction-cuda/) and can be run directly in [Google Colaboratory](https://colab.research.google.com/github/NVDLI/notebooks/blob/master/even-easier-cuda/An_Even_Easier_Introduction_to_CUDA.ipynb). 6 | 7 | ## Learning Objectives 8 | 9 | In this notebook you will learn the basics of writing massively parallel CUDA kernels to run on NVIDIA GPUs.
By the time you complete it, you will be able to: 10 | 11 | - Launch massively parallel CUDA kernels on an NVIDIA GPU 12 | - Organize parallel thread execution for massive dataset sizes 13 | - Manage memory between the CPU and GPU 14 | - Profile your CUDA code to observe performance gains 15 | 16 | ## Prerequisites 17 | 18 | This notebook does not require you to write novel code, but to best understand its details, you should already have familiarity with: 19 | 20 | - Writing, compiling, and running C or C++ code 21 | 22 | ## Followup Materials 23 | 24 | If you enjoyed this notebook and want to learn more, the [NVIDIA DLI](https://nvidia.com/dli) offers several in-depth CUDA Programming courses. 25 | 26 | For those of you just starting out, please consider [_Fundamentals of Accelerated Computing with CUDA C/C++_](https://courses.nvidia.com/courses/course-v1:DLI+C-AC-01+V1/about), which provides dedicated GPU resources, a more sophisticated programming environment, use of the [NVIDIA Nsight Systems™](https://developer.nvidia.com/nsight-systems) visual profiler, dozens of interactive exercises, detailed presentations, over 8 hours of material, and the ability to earn a DLI Certificate of Competency. 27 | 28 | Similarly, for Python programmers, please consider [_Fundamentals of Accelerated Computing with CUDA Python_](https://courses.nvidia.com/courses/course-v1:DLI+C-AC-02+V1/about). 29 | 30 | For more intermediate and advanced CUDA programming materials, please check out the _Accelerated Computing_ section of the NVIDIA DLI [self-paced catalog](https://www.nvidia.com/en-us/training/online/). 31 | --------------------------------------------------------------------------------