├── .gitignore ├── 1-1-cuda_libraries.ipynb ├── 1-2-programming_models.ipynb ├── 1-3-memory_management.ipynb ├── 1-4-concurrent_computing.ipynb ├── 1-5-preparation.txt ├── 2-1-application_analysis_optimization.ipynb ├── 2-2-kernel_analysis_optimization.ipynb ├── LICENSE ├── Manifest.toml ├── Project.toml └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | .ipynb_checkpoints/ 2 | *.nsys-rep -------------------------------------------------------------------------------- /1-2-programming_models.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "8435688d", 7 | "metadata": {}, 8 | "outputs": [ 9 | { 10 | "name": "stdout", 11 | "output_type": "stream", 12 | "text": [ 13 | "\u001b[32m\u001b[1m Activating\u001b[22m\u001b[39m project at `~/Julia/doc/cscs_gpu_course`\n" 14 | ] 15 | }, 16 | { 17 | "name": "stderr", 18 | "output_type": "stream", 19 | "text": [ 20 | "┌ Warning: The active manifest file is an older format with no julia version entry. Dependencies may have been resolved with a different julia version.\n", 21 | "└ @ nothing /home/tim/Julia/doc/cscs_gpu_course/Manifest.toml:0\n" 22 | ] 23 | } 24 | ], 25 | "source": [ 26 | "using Pkg\n", 27 | "Pkg.DEFAULT_IO[] = stdout # Julia 1.6.1 bug (Pkg.jl#2542)\n", 28 | "Pkg.activate(@__DIR__)\n", 29 | "Pkg.instantiate()" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "id": "42a459ca", 35 | "metadata": {}, 36 | "source": [ 37 | "# Programming models\n", 38 | "\n", 39 | "There are different ways of programming (NVIDIA) GPUs in Julia, at different levels of abstraction." 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "id": "81e37034", 45 | "metadata": {}, 46 | "source": [ 47 | "## Array programming\n", 48 | "\n", 49 | "The easiest way to use a GPU is via vectorized array operations. Each of these operations will be backed by one or more GPU kernels, either natively written in Julia or from some application library. As long as your data is large enough, you should be able to get some nice speed-ups.\n", 50 | "\n", 51 | "For NVIDIA GPUs, you use the `CuArray` type from CUDA.jl, which serves a dual purpose:\n", 52 | "- a managed container for GPU memory\n", 53 | "- a way to dispatch to operations that execute on the GPU" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 2, 59 | "id": "eae46b3b", 60 | "metadata": {}, 61 | "outputs": [ 62 | { 63 | "data": { 64 | "text/plain": [ 65 | "2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:\n", 66 | " 1.0 2.0\n", 67 | " 3.0 4.0" 68 | ] 69 | }, 70 | "execution_count": 2, 71 | "metadata": {}, 72 | "output_type": "execute_result" 73 | } 74 | ], 75 | "source": [ 76 | "using CUDA\n", 77 | "A = CuArray([1. 2.; 3. 4.])" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "id": "36e6cdef", 83 | "metadata": {}, 84 | "source": [ 85 | "Memory management will be discussed in detail in a later notebook, but for now it's enough to remember that a CuArray is **a CPU object representing memory on the GPU**. It will be automatically freed when all references have been removed, and the garbage collector runs."
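As a quick illustration (a sketch, not a cell from the original notebook; it assumes the CUDA.jl helpers `CUDA.memory_status` and `CUDA.unsafe_free!` behave as in recent CUDA.jl versions), you don't have to wait for the garbage collector: memory can also be released eagerly:

```julia
using CUDA

temp = CUDA.rand(1024, 1024)   # allocates a Float32 matrix on the GPU
CUDA.memory_status()           # prints how much of the memory pool is currently in use
CUDA.unsafe_free!(temp)        # release the buffer eagerly; `temp` must not be used afterwards
```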
86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "id": "3739475e", 91 | "metadata": {}, 92 | "source": [ 93 | "The goal of `CuArray` is to make it easy to program GPUs using array operations:" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": 3, 99 | "id": "c1fd7735", 100 | "metadata": {}, 101 | "outputs": [ 102 | { 103 | "data": { 104 | "text/plain": [ 105 | "2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:\n", 106 | " 7.0 10.0\n", 107 | " 15.0 22.0" 108 | ] 109 | }, 110 | "execution_count": 3, 111 | "metadata": {}, 112 | "output_type": "execute_result" 113 | } 114 | ], 115 | "source": [ 116 | "# this will automatically use CUBLAS\n", 117 | "A * A" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": 4, 123 | "id": "9f8928fc", 124 | "metadata": {}, 125 | "outputs": [ 126 | { 127 | "data": { 128 | "text/plain": [ 129 | "2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:\n", 130 | " 1.0 4.0\n", 131 | " 9.0 16.0" 132 | ] 133 | }, 134 | "execution_count": 4, 135 | "metadata": {}, 136 | "output_type": "execute_result" 137 | } 138 | ], 139 | "source": [ 140 | "# whereas this operation will use a native broadcast kernel\n", 141 | "A .* A" 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "id": "319f0322", 147 | "metadata": {}, 148 | "source": [ 149 | "This works by specializing certain methods with a GPU-specialized implementation, either for:\n", 150 | "- compatibility: not all CPU implementations work on the GPU\n", 151 | "- performance: GPUs have a different programming model so might require optimized implementations\n", 152 | "\n", 153 | "This generally works pretty well, the goal is to get as close to the CPU `Array` type's functionality as possible, and entire applications have been built on top of CuArray's array functionality. So instead let's highlight what can go wrong if you don't call into a GPU-specialized implementation where you need one." 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "id": "4683bfb4", 159 | "metadata": {}, 160 | "source": [ 161 | "### Compatibility: Calling into C libraries\n", 162 | "\n", 163 | "A common issue arises when calling CPU-specific code, e.g. in some C library, using a GPU array. This generally does not work, because GPU pointers are not dereferencable on the CPU. 
To prevent this from crashing, we introduce a GPU-specific pointer type and disallow conversions:" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": 5, 169 | "id": "f2c6005b", 170 | "metadata": {}, 171 | "outputs": [ 172 | { 173 | "ename": "LoadError", 174 | "evalue": "ArgumentError: cannot take the CPU address of a CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}", 175 | "output_type": "error", 176 | "traceback": [ 177 | "ArgumentError: cannot take the CPU address of a CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}", 178 | "", 179 | "Stacktrace:", 180 | " [1] unsafe_convert(#unused#::Type{Ptr{Float64}}, x::CuArray{Float64, 2, CUDA.Mem.DeviceBuffer})", 181 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/src/array.jl:315", 182 | " [2] unsafe_convert(#unused#::Type{Ptr{Float32}}, a::CuArray{Float64, 2, CUDA.Mem.DeviceBuffer})", 183 | " @ Base ./pointer.jl:66", 184 | " [3] top-level scope", 185 | " @ ./In[5]:1", 186 | " [4] eval", 187 | " @ ./boot.jl:373 [inlined]", 188 | " [5] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)", 189 | " @ Base ./loading.jl:1196" 190 | ] 191 | } 192 | ], 193 | "source": [ 194 | "ccall(:whatever, Nothing, (Ptr{Float32},), A)" 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "id": "639c3981", 200 | "metadata": {}, 201 | "source": [ 202 | "In that case, either you need to use different (supported) array operations, or fix the implementation in CUDA.jl. Such a fix can mean using functions from a CUDA library, using existing operations, or writing your own kernel." 203 | ] 204 | }, 205 | { 206 | "cell_type": "markdown", 207 | "id": "a010f0f3", 208 | "metadata": {}, 209 | "source": [ 210 | "### Performance: Scalar iteration\n", 211 | "\n", 212 | "A key performance issue comes from the fact that a `CuArray` instance is a CPU object representing a chunk of memory on the GPU. That means we invoke the GPU for every CPU operation invoked on a CuArray. 
That is OK for vectorized array operations, where the GPU operation will have to do a bunch of work, but is very bad when you have CPU code performing a bunch of small scalar operations:" 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": 6, 218 | "id": "dba2b8d6", 219 | "metadata": {}, 220 | "outputs": [ 221 | { 222 | "name": "stderr", 223 | "output_type": "stream", 224 | "text": [ 225 | "┌ Warning: Performing scalar indexing on task Task (runnable) @0x00007fa7ed15e290.\n", 226 | "│ Invocation of getindex resulted in scalar indexing of a GPU array.\n", 227 | "│ This is typically caused by calling an iterating implementation of a method.\n", 228 | "│ Such implementations *do not* execute on the GPU, but very slowly on the CPU,\n", 229 | "│ and therefore are only permitted from the REPL for prototyping purposes.\n", 230 | "│ If you did intend to index this array, annotate the caller with @allowscalar.\n", 231 | "└ @ GPUArrays /home/tim/Julia/depot/packages/GPUArrays/3sW6s/src/host/indexing.jl:56\n" 232 | ] 233 | }, 234 | { 235 | "data": { 236 | "text/plain": [ 237 | "55" 238 | ] 239 | }, 240 | "execution_count": 6, 241 | "metadata": {}, 242 | "output_type": "execute_result" 243 | } 244 | ], 245 | "source": [ 246 | "A = CuArray(1:10)\n", 247 | "A_sum = zero(eltype(A))\n", 248 | "for I in eachindex(A)\n", 249 | " A_sum += A[I]\n", 250 | "end\n", 251 | "A_sum" 252 | ] 253 | }, 254 | { 255 | "cell_type": "markdown", 256 | "id": "ee038f2c", 257 | "metadata": {}, 258 | "source": [ 259 | "This kind of programming pattern, iterating the array and fetching one scalar at a time (hence 'scalar iteration'), is so slow that CUDA.jl warns about it. With the above snippet, the situation is actually even worse: Not only does every iteration require a GPU operation to fetch an element, the `getindex` call is also the only array operation, meaning that the actual summation won't even run on the GPU!\n", 260 | "\n", 261 | "The solution here is to use the `sum` function, which performs the entire operation in a single step.
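For instance (a minimal sketch, not a cell from the original notebook):

```julia
using CUDA

A = CuArray(1:10)
sum(A)   # 55: a single GPU reduction kernel instead of ten scalar reads
```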
More on these operations later.\n", 262 | "To disallow scalar iteration, use the `allowscalar` function:" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": 7, 268 | "id": "303d3f85", 269 | "metadata": {}, 270 | "outputs": [ 271 | { 272 | "ename": "LoadError", 273 | "evalue": "Scalar indexing is disallowed.\nInvocation of getindex resulted in scalar indexing of a GPU array.\nThis is typically caused by calling an iterating implementation of a method.\nSuch implementations *do not* execute on the GPU, but very slowly on the CPU,\nand therefore are only permitted from the REPL for prototyping purposes.\nIf you did intend to index this array, annotate the caller with @allowscalar.", 274 | "output_type": "error", 275 | "traceback": [ 276 | "Scalar indexing is disallowed.\nInvocation of getindex resulted in scalar indexing of a GPU array.\nThis is typically caused by calling an iterating implementation of a method.\nSuch implementations *do not* execute on the GPU, but very slowly on the CPU,\nand therefore are only permitted from the REPL for prototyping purposes.\nIf you did intend to index this array, annotate the caller with @allowscalar.", 277 | "", 278 | "Stacktrace:", 279 | " [1] error(s::String)", 280 | " @ Base ./error.jl:33", 281 | " [2] assertscalar(op::String)", 282 | " @ GPUArrays ~/Julia/depot/packages/GPUArrays/3sW6s/src/host/indexing.jl:53", 283 | " [3] getindex(xs::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}, I::Int64)", 284 | " @ GPUArrays ~/Julia/depot/packages/GPUArrays/3sW6s/src/host/indexing.jl:86", 285 | " [4] top-level scope", 286 | " @ In[7]:2", 287 | " [5] eval", 288 | " @ ./boot.jl:373 [inlined]", 289 | " [6] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)", 290 | " @ Base ./loading.jl:1196" 291 | ] 292 | } 293 | ], 294 | "source": [ 295 | "CUDA.allowscalar(false)\n", 296 | "A[1]" 297 | ] 298 | }, 299 | { 300 | "cell_type": "markdown", 301 | "id": "07d7532d", 302 | "metadata": {}, 303 | "source": [ 304 | "You should generally always disable scalar iteration like this! It is not disabled by default in interactive sessions because allowing it simplifies porting CPU code, and because it's easy to trigger scalar iteration from non-performance-sensitive paths (e.g.
display methods):" 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": 8, 310 | "id": "43200f9c", 311 | "metadata": {}, 312 | "outputs": [ 313 | { 314 | "data": { 315 | "text/plain": [ 316 | "1×10 adjoint(::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}) with eltype Int64:\n", 317 | " 1 2 3 4 5 6 7 8 9 10" 318 | ] 319 | }, 320 | "execution_count": 8, 321 | "metadata": {}, 322 | "output_type": "execute_result" 323 | } 324 | ], 325 | "source": [ 326 | "A'" 327 | ] 328 | }, 329 | { 330 | "cell_type": "code", 331 | "execution_count": 9, 332 | "id": "3b17085c", 333 | "metadata": {}, 334 | "outputs": [ 335 | { 336 | "ename": "ErrorException", 337 | "evalue": "Scalar indexing is disallowed.\nInvocation of getindex resulted in scalar indexing of a GPU array.\nThis is typically caused by calling an iterating implementation of a method.\nSuch implementations *do not* execute on the GPU, but very slowly on the CPU,\nand therefore are only permitted from the REPL for prototyping purposes.\nIf you did intend to index this array, annotate the caller with @allowscalar.", 338 | "output_type": "error", 339 | "traceback": [ 340 | "Scalar indexing is disallowed.\nInvocation of getindex resulted in scalar indexing of a GPU array.\nThis is typically caused by calling an iterating implementation of a method.\nSuch implementations *do not* execute on the GPU, but very slowly on the CPU,\nand therefore are only permitted from the REPL for prototyping purposes.\nIf you did intend to index this array, annotate the caller with @allowscalar.", 341 | "", 342 | "Stacktrace:", 343 | " [1] error(s::String)", 344 | " @ Base ./error.jl:33", 345 | " [2] assertscalar(op::String)", 346 | " @ GPUArrays ~/Julia/depot/packages/GPUArrays/3sW6s/src/host/indexing.jl:53", 347 | " [3] getindex", 348 | " @ ~/Julia/depot/packages/GPUArrays/3sW6s/src/host/indexing.jl:86 [inlined]", 349 | " [4] getindex", 350 | " @ ~/.cache/jl/installs/bin/linux/x64/1.7/julia-1.7-latest-linux-x86_64/share/julia/stdlib/v1.7/LinearAlgebra/src/adjtrans.jl:178 [inlined]", 351 | " [5] _getindex", 352 | " @ ./abstractarray.jl:1245 [inlined]", 353 | " [6] getindex", 354 | " @ ./abstractarray.jl:1218 [inlined]", 355 | " [7] getindex", 356 | " @ ./subarray.jl:276 [inlined]", 357 | " [8] isassigned(::SubArray{Int64, 2, LinearAlgebra.Adjoint{Int64, CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}}, true}, ::Int64, ::Int64)", 358 | " @ Base ./abstractarray.jl:553", 359 | " [9] alignment(io::IOContext{IOBuffer}, X::AbstractVecOrMat, rows::Vector{Int64}, cols::Vector{Int64}, cols_if_complete::Int64, cols_otherwise::Int64, sep::Int64)", 360 | " @ Base ./arrayshow.jl:67", 361 | " [10] _print_matrix(io::IOContext{IOBuffer}, X::AbstractVecOrMat, pre::String, sep::String, post::String, hdots::String, vdots::String, ddots::String, hmod::Int64, vmod::Int64, rowsA::UnitRange{Int64}, colsA::UnitRange{Int64})", 362 | " @ Base ./arrayshow.jl:204", 363 | " [11] print_matrix(io::IOContext{IOBuffer}, X::SubArray{Int64, 2, LinearAlgebra.Adjoint{Int64, CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}}, true}, pre::String, sep::String, post::String, hdots::String, vdots::String, ddots::String, hmod::Int64, vmod::Int64) (repeats 2 times)", 364 | " @ Base ./arrayshow.jl:169", 365 | " [12] print_array", 366 | " @ ./arrayshow.jl:355 [inlined]", 367 | " [13] show(io::IOContext{IOBuffer}, #unused#::MIME{Symbol(\"text/plain\")}, X::SubArray{Int64, 2, 
LinearAlgebra.Adjoint{Int64, CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}}, true})", 368 | " @ Base ./arrayshow.jl:396", 369 | " [14] limitstringmime(mime::MIME{Symbol(\"text/plain\")}, x::SubArray{Int64, 2, LinearAlgebra.Adjoint{Int64, CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}}, true})", 370 | " @ IJulia ~/Julia/depot/packages/IJulia/e8kqU/src/inline.jl:43", 371 | " [15] display_mimestring", 372 | " @ ~/Julia/depot/packages/IJulia/e8kqU/src/display.jl:71 [inlined]", 373 | " [16] display_dict(x::SubArray{Int64, 2, LinearAlgebra.Adjoint{Int64, CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}}, true})", 374 | " @ IJulia ~/Julia/depot/packages/IJulia/e8kqU/src/display.jl:102", 375 | " [17] #invokelatest#2", 376 | " @ ./essentials.jl:716 [inlined]", 377 | " [18] invokelatest", 378 | " @ ./essentials.jl:714 [inlined]", 379 | " [19] execute_request(socket::ZMQ.Socket, msg::IJulia.Msg)", 380 | " @ IJulia ~/Julia/depot/packages/IJulia/e8kqU/src/execute_request.jl:112", 381 | " [20] #invokelatest#2", 382 | " @ ./essentials.jl:716 [inlined]", 383 | " [21] invokelatest", 384 | " @ ./essentials.jl:714 [inlined]", 385 | " [22] eventloop(socket::ZMQ.Socket)", 386 | " @ IJulia ~/Julia/depot/packages/IJulia/e8kqU/src/eventloop.jl:8", 387 | " [23] (::IJulia.var\"#15#18\")()", 388 | " @ IJulia ./task.jl:411" 389 | ] 390 | } 391 | ], 392 | "source": [ 393 | "view(A', :, :)" 394 | ] 395 | }, 396 | { 397 | "cell_type": "markdown", 398 | "id": "ba9398ad", 399 | "metadata": {}, 400 | "source": [ 401 | "Because of how Julia's type system works, it's easy to trigger non GPU-specialized methods when using array wrappers. Still, for non-interactive code it's recommended to always disable scalar iteration." 402 | ] 403 | }, 404 | { 405 | "cell_type": "markdown", 406 | "id": "194c612d", 407 | "metadata": {}, 408 | "source": [ 409 | "### CuArray isn't device-compatible\n", 410 | "\n", 411 | "A more subtle result of `CuArray` being the CPU-side object is that these objects cannot be used directly on the GPU. Instead, a conversion to `CuDeviceArray` happens:" 412 | ] 413 | }, 414 | { 415 | "cell_type": "code", 416 | "execution_count": 10, 417 | "id": "bd79553d", 418 | "metadata": {}, 419 | "outputs": [ 420 | { 421 | "name": "stdout", 422 | "output_type": "stream", 423 | "text": [ 424 | "PTX CompilerJob of kernel #2(CuDeviceVector{Int64, 1}) for sm_75\n", 425 | "\n", 426 | "MethodInstance for (::var\"#2#3\")(::CuDeviceVector{Int64, 1})\n", 427 | " from (::var\"#2#3\")(A) in Main at In[10]:1\n", 428 | "Arguments\n", 429 | " #self#\u001b[36m::Core.Const(var\"#2#3\"())\u001b[39m\n", 430 | " A\u001b[36m::CuDeviceVector{Int64, 1}\u001b[39m\n", 431 | "Body\u001b[36m::Nothing\u001b[39m\n", 432 | "\u001b[90m1 ─\u001b[39m return Main.nothing\n", 433 | "\n" 434 | ] 435 | } 436 | ], 437 | "source": [ 438 | "@device_code_warntype @cuda (A->nothing)(A)" 439 | ] 440 | }, 441 | { 442 | "cell_type": "markdown", 443 | "id": "f6c8fd2d", 444 | "metadata": {}, 445 | "source": [ 446 | "Typically, this conversion is hidden and shouldn't affect you as an end user. 
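If you are curious what that conversion produces, you can apply it by hand; this is a small sketch assuming `cudaconvert`, the helper that `@cuda` itself uses on its arguments:

```julia
using CUDA

A = CuArray(1:10)
typeof(A)               # CuArray{Int64, 1, ...}: the host-side handle
typeof(cudaconvert(A))  # CuDeviceVector{Int64, 1}: what the kernel actually sees
```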
The only time you need to take care, is when embedding `CuArray`s in a structure:" 447 | ] 448 | }, 449 | { 450 | "cell_type": "code", 451 | "execution_count": 11, 452 | "id": "2c30c38c", 453 | "metadata": {}, 454 | "outputs": [ 455 | { 456 | "ename": "LoadError", 457 | "evalue": "GPU compilation of kernel #4(MyStruct) failed\nKernelError: passing and using non-bitstype argument\n\nArgument 2 to your kernel function is of type MyStruct, which is not isbits:\n .inner is of type CuArray which is not isbits.\n .storage is of type Union{Nothing, CUDA.ArrayStorage{B}} where B which is not isbits.\n .dims is of type Tuple{Vararg{Int64, N}} where N which is not isbits.\n\n", 458 | "output_type": "error", 459 | "traceback": [ 460 | "GPU compilation of kernel #4(MyStruct) failed\nKernelError: passing and using non-bitstype argument\n\nArgument 2 to your kernel function is of type MyStruct, which is not isbits:\n .inner is of type CuArray which is not isbits.\n .storage is of type Union{Nothing, CUDA.ArrayStorage{B}} where B which is not isbits.\n .dims is of type Tuple{Vararg{Int64, N}} where N which is not isbits.\n\n", 461 | "", 462 | "Stacktrace:", 463 | " [1] check_invocation(job::GPUCompiler.CompilerJob)", 464 | " @ GPUCompiler ~/Julia/depot/packages/GPUCompiler/AJD5L/src/validation.jl:66", 465 | " [2] macro expansion", 466 | " @ ~/Julia/depot/packages/GPUCompiler/AJD5L/src/driver.jl:332 [inlined]", 467 | " [3] macro expansion", 468 | " @ ~/Julia/depot/packages/TimerOutputs/SSeq1/src/TimerOutput.jl:252 [inlined]", 469 | " [4] macro expansion", 470 | " @ ~/Julia/depot/packages/GPUCompiler/AJD5L/src/driver.jl:331 [inlined]", 471 | " [5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)", 472 | " @ GPUCompiler ~/Julia/depot/packages/GPUCompiler/AJD5L/src/utils.jl:62", 473 | " [6] cufunction_compile(job::GPUCompiler.CompilerJob)", 474 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:326", 475 | " [7] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))", 476 | " @ GPUCompiler ~/Julia/depot/packages/GPUCompiler/AJD5L/src/cache.jl:89", 477 | " [8] cufunction(f::var\"#4#5\", tt::Type{Tuple{MyStruct}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})", 478 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:297", 479 | " [9] cufunction(f::var\"#4#5\", tt::Type{Tuple{MyStruct}})", 480 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:291", 481 | " [10] top-level scope", 482 | " @ ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:102", 483 | " [11] eval", 484 | " @ ./boot.jl:373 [inlined]", 485 | " [12] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)", 486 | " @ Base ./loading.jl:1196" 487 | ] 488 | } 489 | ], 490 | "source": [ 491 | "struct MyStruct\n", 492 | " inner::CuArray\n", 493 | "end\n", 494 | "B = MyStruct(A)\n", 495 | "@cuda (A->nothing)(B)" 496 | ] 497 | }, 498 | { 499 | "cell_type": "markdown", 500 | "id": "9f1f4aaa", 501 | "metadata": {}, 502 | "source": [ 503 | "Here, CUDA.jl makes it clear that a `CuArray` isn't GPU compatible because it's not an `isbits` type. The underlying reason is that the automatic conversion from `CuArray` to `CuDeviceArray` doesn't know about your `MyStruct` and how to convert it to something GPU-compatible. 
This conversion is done using Adapt.jl, and to make this code work you need to teach Adapt about how to convert `MyStruct` objects:" 504 | ] 505 | }, 506 | { 507 | "cell_type": "code", 508 | "execution_count": 12, 509 | "id": "ccabb456", 510 | "metadata": {}, 511 | "outputs": [ 512 | { 513 | "name": "stdout", 514 | "output_type": "stream", 515 | "text": [ 516 | "PTX CompilerJob of kernel #6(MyParametricStruct{CuDeviceVector{Int64, 1}}) for sm_75\n", 517 | "\n", 518 | "MethodInstance for (::var\"#6#7\")(::MyParametricStruct{CuDeviceVector{Int64, 1}})\n", 519 | " from (::var\"#6#7\")(A) in Main at In[12]:11\n", 520 | "Arguments\n", 521 | " #self#\u001b[36m::Core.Const(var\"#6#7\"())\u001b[39m\n", 522 | " A\u001b[36m::MyParametricStruct{CuDeviceVector{Int64, 1}}\u001b[39m\n", 523 | "Body\u001b[36m::Nothing\u001b[39m\n", 524 | "\u001b[90m1 ─\u001b[39m return Main.nothing\n", 525 | "\n" 526 | ] 527 | } 528 | ], 529 | "source": [ 530 | "# to store both a CuArray and a CuDeviceArray\n", 531 | "# our struct needs to be parametric\n", 532 | "struct MyParametricStruct{T<:AbstractArray}\n", 533 | " inner::T\n", 534 | "end\n", 535 | "\n", 536 | "using Adapt\n", 537 | "Adapt.adapt_structure(to, x::MyParametricStruct) = MyParametricStruct(adapt(to, x.inner))\n", 538 | "\n", 539 | "C = MyParametricStruct(A)\n", 540 | "@device_code_warntype @cuda (A->nothing)(C)" 541 | ] 542 | }, 543 | { 544 | "cell_type": "code", 545 | "execution_count": 13, 546 | "id": "85ee738d", 547 | "metadata": {}, 548 | "outputs": [ 549 | { 550 | "data": { 551 | "text/plain": [ 552 | "0.4086858880384474" 553 | ] 554 | }, 555 | "execution_count": 13, 556 | "metadata": {}, 557 | "output_type": "execute_result" 558 | } 559 | ], 560 | "source": [ 561 | "A = rand(1024, 1024)\n", 562 | "B = rand(1024, 1024)\n", 563 | "sqrt(sum((A-B).^2) / length(A))" 564 | ] 565 | }, 566 | { 567 | "cell_type": "markdown", 568 | "id": "61ecb474", 569 | "metadata": {}, 570 | "source": [ 571 | "## Kernel programming\n", 572 | "\n", 573 | "When an array operation is not supported, or you need to perform an operation that you cannot easily express using existing array abstractions, you might need to write your own kernel. Kernels are **scalar functions that are executed multiple times in parallel**. Each 'thread' runs on one of the many streaming multiprocessors (SMs) a GPU has, and threads running on a single SM are called a 'block'. Within an SM, some threads are always executed together; these form a 'warp' of 32 threads. Efficient communication between these entities is required to effectively use the GPU:\n", 574 | "\n", 575 | "- between blocks: global memory\n", 576 | "- within a block: shared memory\n", 577 | "- within a warp: via registers (shuffle)" 578 | ] 579 | }, 580 | { 581 | "cell_type": "markdown", 582 | "id": "a32753b5", 583 | "metadata": {}, 584 | "source": [ 585 | "Within kernels, most of the Julia language is supported, with the exception of functionality that requires the Julia runtime library. That does mean you cannot allocate memory or perform dynamic function calls, both of which are easy to do accidentally.\n", 586 | "\n", 587 | "At the same time, there are some special functions that only work in kernel context. Let's start by discussing those.
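Before going through those intrinsics one by one, a minimal end-to-end kernel may help make the model concrete (a sketch, not a cell from the original notebook; it uses the indexing functions introduced in the next section):

```julia
using CUDA

function vadd!(c, a, b)
    # global thread index, computed from the block and thread indices explained below
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)                 # guard: the last block may have excess threads
        @inbounds c[i] = a[i] + b[i]
    end
    return
end

a = CUDA.rand(10_000); b = CUDA.rand(10_000); c = similar(a)
@cuda threads=256 blocks=cld(length(c), 256) vadd!(c, a, b)
Array(c) ≈ Array(a) .+ Array(b)       # true
```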
588 | ] 589 | }, 590 | { 591 | "cell_type": "markdown", 592 | "id": "6c1e4c71", 593 | "metadata": {}, 594 | "source": [ 595 | "### Hardware indices\n", 596 | "\n", 597 | "You can fetch the thread, block and warp index using specific functions that query hardware indices:\n", 598 | "\n", 599 | "- `threadIdx()` and `blockDim()`: 3D\n", 600 | "- `blockIdx()` and `gridDim()`: 3D\n", 601 | "- `laneid()` and `warpsize()`\n", 602 | "\n", 603 | "When you don't need to care about which block a thread is part of, a very common index calculation is as follows:" 604 | ] 605 | }, 606 | { 607 | "cell_type": "code", 608 | "execution_count": 14, 609 | "id": "bf386c10", 610 | "metadata": {}, 611 | "outputs": [], 612 | "source": [ 613 | "function kernel()\n", 614 | " i = (blockIdx().x-1) * blockDim().x + threadIdx().x\n", 615 | " @cushow i\n", 616 | " return\n", 617 | "end\n", 618 | "@cuda threads=2 blocks=2 kernel();" 619 | ] 620 | }, 621 | { 622 | "cell_type": "markdown", 623 | "id": "d4247070", 624 | "metadata": {}, 625 | "source": [ 626 | "### Synchronization\n", 627 | "\n", 628 | "If threads are working together -- say, they are using the same global memory, or are communicating using shared memory or finer-grained intrinsics -- you may need to have threads wait on each other. Note that this is only possible **within a block**; different blocks generally cannot wait on one another.\n", 629 | "\n", 630 | "Let's look at a contrived example:" 631 | ] 632 | }, 633 | { 634 | "cell_type": "code", 635 | "execution_count": 15, 636 | "id": "93ff4d29", 637 | "metadata": {}, 638 | "outputs": [ 639 | { 640 | "name": "stdout", 641 | "output_type": "stream", 642 | "text": [ 643 | "i = 1\n", 644 | "i = 2\n", 645 | "i = 3\n", 646 | "i = 4\n" 647 | ] 648 | }, 649 | { 650 | "data": { 651 | "text/plain": [ 652 | "1-element Vector{Float32}:\n", 653 | " 42.0" 654 | ] 655 | }, 656 | "execution_count": 15, 657 | "metadata": {}, 658 | "output_type": "execute_result" 659 | } 660 | ], 661 | "source": [ 662 | "A = CUDA.zeros(512)\n", 663 | "\n", 664 | "function kernel(A)\n", 665 | " # simple kernel without multiple blocks\n", 666 | " i = threadIdx().x\n", 667 | " \n", 668 | " # first thread sets up the data\n", 669 | " if i == 1\n", 670 | " A[1] = 42\n", 671 | " end\n", 672 | " \n", 673 | " sync_threads()\n", 674 | " \n", 675 | " # other threads can now read this data\n", 676 | " if i != 1\n", 677 | " A[i] = A[1]\n", 678 | " end\n", 679 | " \n", 680 | " return\n", 681 | "end\n", 682 | "@cuda threads=length(A) kernel(A)\n", 683 | "unique(Array(A))" 684 | ] 685 | }, 686 | { 687 | "cell_type": "markdown", 688 | "id": "655f23cc", 689 | "metadata": {}, 690 | "source": [ 691 | "Note how we didn't put `sync_threads()` inside of the branch; All threads need to reach the synchronization point for the kernel to make progress. This makes it dangerous to synchronize from a branch, as the branch cannot be divergent within a block or the kernel would deadlock!" 692 | ] 693 | }, 694 | { 695 | "cell_type": "markdown", 696 | "id": "082f21b3", 697 | "metadata": {}, 698 | "source": [ 699 | "When coordinating within the warp, you may need the `sync_warp()` function. A detailed explanation of warp-level programming is out of scope for this notebook, refer to the [NVIDIA developer blog](https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/) for more information." 
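As a small taste of what such warp-level primitives look like (a sketch, not part of the original notebook; it assumes CUDA.jl's `shfl_down_sync` wrapper), a single warp can reduce a value held in registers without touching shared memory:

```julia
using CUDA

function warp_sum!(out, A)
    lane = threadIdx().x                 # one block of 32 threads == one warp
    val = A[lane]
    offset = 16
    while offset > 0
        # add the value held by the lane `offset` positions further down the warp
        val += CUDA.shfl_down_sync(0xffffffff, val, offset)
        offset >>= 1
    end
    if lane == 1
        out[] = val                      # the first lane now holds the warp's sum
    end
    return
end

A = CUDA.ones(32)
out = CUDA.zeros(1)
@cuda threads=32 warp_sum!(out, A)
Array(out)[]                             # 32.0
```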
700 | ] 701 | }, 702 | { 703 | "cell_type": "markdown", 704 | "id": "ade663de", 705 | "metadata": {}, 706 | "source": [ 707 | "### Atomic operations\n", 708 | "\n", 709 | "When you want to use the same global memory from different threads, you may want to use atomic operations. For example:" 710 | ] 711 | }, 712 | { 713 | "cell_type": "code", 714 | "execution_count": 16, 715 | "id": "cb9d0979", 716 | "metadata": {}, 717 | "outputs": [ 718 | { 719 | "data": { 720 | "text/plain": [ 721 | "253.92513f0" 722 | ] 723 | }, 724 | "execution_count": 16, 725 | "metadata": {}, 726 | "output_type": "execute_result" 727 | } 728 | ], 729 | "source": [ 730 | "A_sum = CUDA.zeros(1)\n", 731 | "A = CUDA.rand(512)\n", 732 | "\n", 733 | "function kernel(A, A_sum)\n", 734 | " i = threadIdx().x\n", 735 | " CUDA.@atomic A_sum[] += A[i]\n", 736 | " return\n", 737 | "end\n", 738 | "@cuda threads=length(A) kernel(A, A_sum)\n", 739 | "Array(A_sum)[]" 740 | ] 741 | }, 742 | { 743 | "cell_type": "markdown", 744 | "id": "05f3e656", 745 | "metadata": {}, 746 | "source": [ 747 | "You shouldn't overuse atomics though, as they generally serialize execution and thus are very expensive! But they may be useful for an initial implementation (i.e. before considering more fine-grained communication), or to reduce values from different blocks (because of the difficulty of synchronizing the grid)." 748 | ] 749 | }, 750 | { 751 | "cell_type": "markdown", 752 | "id": "98a5e5b7", 753 | "metadata": {}, 754 | "source": [ 755 | "### Output\n", 756 | "\n", 757 | "To help with implementing a kernel, there's a couple of helpful macros to generate output:" 758 | ] 759 | }, 760 | { 761 | "cell_type": "code", 762 | "execution_count": 17, 763 | "id": "f880a67d", 764 | "metadata": {}, 765 | "outputs": [], 766 | "source": [ 767 | "function kernel()\n", 768 | " i = threadIdx().x\n", 769 | " @cuprintf \"I'm thread %ld\\n\" Int(i)\n", 770 | " return\n", 771 | "end\n", 772 | "@cuda kernel();" 773 | ] 774 | }, 775 | { 776 | "cell_type": "markdown", 777 | "id": "f1cfed74", 778 | "metadata": {}, 779 | "source": [ 780 | "However, `@cuprintf` is a bit cumbersome, so we have `@cuprintln` trying to automatically generate an appropriate formatting string, while even supporting string interpolation:" 781 | ] 782 | }, 783 | { 784 | "cell_type": "code", 785 | "execution_count": 18, 786 | "id": "a0804fee", 787 | "metadata": {}, 788 | "outputs": [ 789 | { 790 | "name": "stdout", 791 | "output_type": "stream", 792 | "text": [ 793 | "I'm thread 1\n" 794 | ] 795 | } 796 | ], 797 | "source": [ 798 | "function kernel()\n", 799 | " i = threadIdx().x\n", 800 | " @cuprintln \"I'm thread $i\"\n", 801 | " return\n", 802 | "end\n", 803 | "@cuda kernel();" 804 | ] 805 | }, 806 | { 807 | "cell_type": "markdown", 808 | "id": "c6add9d3", 809 | "metadata": {}, 810 | "source": [ 811 | "And for quick debugging, we have a helpful `@cushow` you can surround expressions with:" 812 | ] 813 | }, 814 | { 815 | "cell_type": "code", 816 | "execution_count": 19, 817 | "id": "110e15db", 818 | "metadata": {}, 819 | "outputs": [ 820 | { 821 | "name": "stdout", 822 | "output_type": "stream", 823 | "text": [ 824 | "I'm thread 1\n" 825 | ] 826 | } 827 | ], 828 | "source": [ 829 | "function kernel()\n", 830 | " i = @cushow(threadIdx().x)\n", 831 | " return\n", 832 | "end\n", 833 | "@cuda kernel();" 834 | ] 835 | }, 836 | { 837 | "cell_type": "markdown", 838 | "id": "3b1fa90b", 839 | "metadata": {}, 840 | "source": [ 841 | "## When things go wrong\n", 842 | "\n", 843 | "Because some aspects of 
the Julia language are unsupported, you'll definitely be running into compilation errors when writing GPU device code. With CPU code, Julia being a dynamic language, errors are postponed to run-time, so if you have a typo in your code it will still compile but you will run into an error at run time.\n", 844 | "\n", 845 | "With GPU code, it's harder to report errors at run time (for one, they'd be generated by every thread, resulting in a multiplication of errors), so the GPU compiler generally refuses to compile when it encounters certain unsupported code patterns." 846 | ] 847 | }, 848 | { 849 | "cell_type": "markdown", 850 | "id": "72fc3201", 851 | "metadata": {}, 852 | "source": [ 853 | "
\n", 854 | " Note: Work is under way to improve the ability for GPU code to call into the CPU for, e.g., dynamic error reporting. That would make it possible to compile unsupported code and have it error at run time just like Julia code on the CPU.\n", 855 | "
" 856 | ] 857 | }, 858 | { 859 | "cell_type": "markdown", 860 | "id": "b9bfabaf", 861 | "metadata": {}, 862 | "source": [ 863 | "Let's demonstrate a couple of common errors:" 864 | ] 865 | }, 866 | { 867 | "cell_type": "markdown", 868 | "id": "e62360de", 869 | "metadata": {}, 870 | "source": [ 871 | "### Returning values from kernels\n", 872 | "\n", 873 | "Kernel functions cannot return anything; if you do so you'll run into a compilation error:" 874 | ] 875 | }, 876 | { 877 | "cell_type": "code", 878 | "execution_count": 20, 879 | "id": "a8c49d31", 880 | "metadata": {}, 881 | "outputs": [ 882 | { 883 | "ename": "LoadError", 884 | "evalue": "GPU compilation of kernel kernel() failed\nKernelError: kernel returns a value of type `Int64`\n\nMake sure your kernel function ends in `return`, `return nothing` or `nothing`.\nIf the returned value is of type `Union{}`, your Julia code probably throws an exception.\nInspect the code with `@device_code_warntype` for more details.\n", 885 | "output_type": "error", 886 | "traceback": [ 887 | "GPU compilation of kernel kernel() failed\nKernelError: kernel returns a value of type `Int64`\n\nMake sure your kernel function ends in `return`, `return nothing` or `nothing`.\nIf the returned value is of type `Union{}`, your Julia code probably throws an exception.\nInspect the code with `@device_code_warntype` for more details.\n", 888 | "", 889 | "Stacktrace:", 890 | " [1] check_method(job::GPUCompiler.CompilerJob)", 891 | " @ GPUCompiler ~/Julia/depot/packages/GPUCompiler/AJD5L/src/validation.jl:21", 892 | " [2] macro expansion", 893 | " @ ~/Julia/depot/packages/TimerOutputs/SSeq1/src/TimerOutput.jl:252 [inlined]", 894 | " [3] macro expansion", 895 | " @ ~/Julia/depot/packages/GPUCompiler/AJD5L/src/driver.jl:89 [inlined]", 896 | " [4] emit_julia(job::GPUCompiler.CompilerJob)", 897 | " @ GPUCompiler ~/Julia/depot/packages/GPUCompiler/AJD5L/src/utils.jl:62", 898 | " [5] cufunction_compile(job::GPUCompiler.CompilerJob)", 899 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:324", 900 | " [6] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))", 901 | " @ GPUCompiler ~/Julia/depot/packages/GPUCompiler/AJD5L/src/cache.jl:89", 902 | " [7] cufunction(f::typeof(kernel), tt::Type{Tuple{}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})", 903 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:297", 904 | " [8] cufunction(f::typeof(kernel), tt::Type{Tuple{}})", 905 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:291", 906 | " [9] top-level scope", 907 | " @ ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:102", 908 | " [10] eval", 909 | " @ ./boot.jl:373 [inlined]", 910 | " [11] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)", 911 | " @ Base ./loading.jl:1196" 912 | ] 913 | } 914 | ], 915 | "source": [ 916 | "kernel() = 42\n", 917 | "@cuda kernel()" 918 | ] 919 | }, 920 | { 921 | "cell_type": "markdown", 922 | "id": "c9277d91", 923 | "metadata": {}, 924 | "source": [ 925 | "That's easy enough, but as the error message hints to you can run into this error in an unexpected way:" 926 | ] 927 | }, 928 | { 929 | "cell_type": "code", 930 | "execution_count": 21, 931 | "id": "ac78b400", 932 | "metadata": {}, 933 | "outputs": [ 934 | { 935 | "ename": "LoadError", 936 | "evalue": "GPU compilation of kernel kernel() 
failed\nKernelError: kernel returns a value of type `Union{}`\n\nMake sure your kernel function ends in `return`, `return nothing` or `nothing`.\nIf the returned value is of type `Union{}`, your Julia code probably throws an exception.\nInspect the code with `@device_code_warntype` for more details.\n", 937 | "output_type": "error", 938 | "traceback": [ 939 | "GPU compilation of kernel kernel() failed\nKernelError: kernel returns a value of type `Union{}`\n\nMake sure your kernel function ends in `return`, `return nothing` or `nothing`.\nIf the returned value is of type `Union{}`, your Julia code probably throws an exception.\nInspect the code with `@device_code_warntype` for more details.\n", 940 | "", 941 | "Stacktrace:", 942 | " [1] check_method(job::GPUCompiler.CompilerJob)", 943 | " @ GPUCompiler ~/Julia/depot/packages/GPUCompiler/AJD5L/src/validation.jl:21", 944 | " [2] macro expansion", 945 | " @ ~/Julia/depot/packages/TimerOutputs/SSeq1/src/TimerOutput.jl:252 [inlined]", 946 | " [3] macro expansion", 947 | " @ ~/Julia/depot/packages/GPUCompiler/AJD5L/src/driver.jl:89 [inlined]", 948 | " [4] emit_julia(job::GPUCompiler.CompilerJob)", 949 | " @ GPUCompiler ~/Julia/depot/packages/GPUCompiler/AJD5L/src/utils.jl:62", 950 | " [5] cufunction_compile(job::GPUCompiler.CompilerJob)", 951 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:324", 952 | " [6] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))", 953 | " @ GPUCompiler ~/Julia/depot/packages/GPUCompiler/AJD5L/src/cache.jl:89", 954 | " [7] cufunction(f::typeof(kernel), tt::Type{Tuple{}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})", 955 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:297", 956 | " [8] cufunction(f::typeof(kernel), tt::Type{Tuple{}})", 957 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:291", 958 | " [9] top-level scope", 959 | " @ ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:102", 960 | " [10] eval", 961 | " @ ./boot.jl:373 [inlined]", 962 | " [11] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)", 963 | " @ Base ./loading.jl:1196" 964 | ] 965 | } 966 | ], 967 | "source": [ 968 | "function kernel()\n", 969 | " throw(42)\n", 970 | " return\n", 971 | "end\n", 972 | "@cuda kernel()" 973 | ] 974 | }, 975 | { 976 | "cell_type": "markdown", 977 | "id": "7a84dbe9", 978 | "metadata": {}, 979 | "source": [ 980 | "Even though we have a `return` at the end of our kernel, because of the unconditional throw a value of type `Union{}` is returned (the bottom type in the type lattice). As a result, any kernel that Julia figures out to be unconditionally throwing will trigger this error, a red herring for the actual problem with your kernel. 
For example:" 981 | ] 982 | }, 983 | { 984 | "cell_type": "code", 985 | "execution_count": 22, 986 | "id": "d43cdaee", 987 | "metadata": {}, 988 | "outputs": [ 989 | { 990 | "name": "stdout", 991 | "output_type": "stream", 992 | "text": [ 993 | "(threadIdx()).x = 1\n" 994 | ] 995 | }, 996 | { 997 | "ename": "LoadError", 998 | "evalue": "GPU compilation of kernel kernel(CuDeviceVector{Int64, 1}) failed\nKernelError: kernel returns a value of type `Union{}`\n\nMake sure your kernel function ends in `return`, `return nothing` or `nothing`.\nIf the returned value is of type `Union{}`, your Julia code probably throws an exception.\nInspect the code with `@device_code_warntype` for more details.\n", 999 | "output_type": "error", 1000 | "traceback": [ 1001 | "GPU compilation of kernel kernel(CuDeviceVector{Int64, 1}) failed\nKernelError: kernel returns a value of type `Union{}`\n\nMake sure your kernel function ends in `return`, `return nothing` or `nothing`.\nIf the returned value is of type `Union{}`, your Julia code probably throws an exception.\nInspect the code with `@device_code_warntype` for more details.\n", 1002 | "", 1003 | "Stacktrace:", 1004 | " [1] check_method(job::GPUCompiler.CompilerJob)", 1005 | " @ GPUCompiler ~/Julia/depot/packages/GPUCompiler/AJD5L/src/validation.jl:21", 1006 | " [2] macro expansion", 1007 | " @ ~/Julia/depot/packages/TimerOutputs/SSeq1/src/TimerOutput.jl:252 [inlined]", 1008 | " [3] macro expansion", 1009 | " @ ~/Julia/depot/packages/GPUCompiler/AJD5L/src/driver.jl:89 [inlined]", 1010 | " [4] emit_julia(job::GPUCompiler.CompilerJob)", 1011 | " @ GPUCompiler ~/Julia/depot/packages/GPUCompiler/AJD5L/src/utils.jl:62", 1012 | " [5] cufunction_compile(job::GPUCompiler.CompilerJob)", 1013 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:324", 1014 | " [6] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))", 1015 | " @ GPUCompiler ~/Julia/depot/packages/GPUCompiler/AJD5L/src/cache.jl:89", 1016 | " [7] cufunction(f::typeof(kernel), tt::Type{Tuple{CuDeviceVector{Int64, 1}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})", 1017 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:297", 1018 | " [8] cufunction(f::typeof(kernel), tt::Type{Tuple{CuDeviceVector{Int64, 1}}})", 1019 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:291", 1020 | " [9] top-level scope", 1021 | " @ ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:102", 1022 | " [10] eval", 1023 | " @ ./boot.jl:373 [inlined]", 1024 | " [11] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)", 1025 | " @ Base ./loading.jl:1196" 1026 | ] 1027 | } 1028 | ], 1029 | "source": [ 1030 | "function kernel(a)\n", 1031 | " if threadIdx().X == 1\n", 1032 | " a[] = 42\n", 1033 | " end\n", 1034 | " return\n", 1035 | "end\n", 1036 | "@cuda kernel(CuArray([1]))" 1037 | ] 1038 | }, 1039 | { 1040 | "cell_type": "markdown", 1041 | "id": "a0e01a04", 1042 | "metadata": {}, 1043 | "source": [ 1044 | "The unconditional throw happens because we have a typo: `threadIdx().X` instead of `threadIdx().x`. Julia's compiler detects this, and unconditionally lowers this to an exception. 
You can spot this with the CUDA.jl's code reflection macros, showing that the `getproperty` call is the last evaluated expression before everything devolves into a constant `Union{}`:" 1045 | ] 1046 | }, 1047 | { 1048 | "cell_type": "markdown", 1049 | "id": "96d1a4b0", 1050 | "metadata": {}, 1051 | "source": [ 1052 | "
\n", 1053 | " Note: Accurate reporting requires Julia 1.7.\n", 1054 | "
" 1055 | ] 1056 | }, 1057 | { 1058 | "cell_type": "code", 1059 | "execution_count": 23, 1060 | "id": "372062ee", 1061 | "metadata": {}, 1062 | "outputs": [ 1063 | { 1064 | "name": "stdout", 1065 | "output_type": "stream", 1066 | "text": [ 1067 | "PTX CompilerJob of kernel kernel(CuDeviceVector{Int64, 1}) for sm_75\n", 1068 | "\n", 1069 | "MethodInstance for kernel(::CuDeviceVector{Int64, 1})\n", 1070 | " from kernel(a) in Main at In[22]:1\n", 1071 | "Arguments\n", 1072 | " #self#\u001b[36m::Core.Const(kernel)\u001b[39m\n", 1073 | " a\u001b[36m::CuDeviceVector{Int64, 1}\u001b[39m\n", 1074 | "Body\u001b[36m::Union{}\u001b[39m\n", 1075 | "\u001b[90m1 ─\u001b[39m %1 = Main.threadIdx()\u001b[36m::NamedTuple{(:x, :y, :z), Tuple{Int32, Int32, Int32}}\u001b[39m\n", 1076 | "\u001b[90m│ \u001b[39m Base.getproperty(%1, :X)\n", 1077 | "\u001b[90m│ \u001b[39m Core.Const(:(%2 == 1))\n", 1078 | "\u001b[90m│ \u001b[39m Core.Const(:(Core.typeassert(%3, Core.Bool)))\n", 1079 | "\u001b[90m│ \u001b[39m Core.Const(:(Base.setindex!(a, 42)))\n", 1080 | "\u001b[90m└──\u001b[39m Core.Const(:(return nothing))\n", 1081 | "\n" 1082 | ] 1083 | }, 1084 | { 1085 | "ename": "LoadError", 1086 | "evalue": "GPU compilation of kernel kernel(CuDeviceVector{Int64, 1}) failed\nKernelError: kernel returns a value of type `Union{}`\n\nMake sure your kernel function ends in `return`, `return nothing` or `nothing`.\nIf the returned value is of type `Union{}`, your Julia code probably throws an exception.\nInspect the code with `@device_code_warntype` for more details.\n", 1087 | "output_type": "error", 1088 | "traceback": [ 1089 | "GPU compilation of kernel kernel(CuDeviceVector{Int64, 1}) failed\nKernelError: kernel returns a value of type `Union{}`\n\nMake sure your kernel function ends in `return`, `return nothing` or `nothing`.\nIf the returned value is of type `Union{}`, your Julia code probably throws an exception.\nInspect the code with `@device_code_warntype` for more details.\n", 1090 | "", 1091 | "Stacktrace:", 1092 | " [1] check_method(job::GPUCompiler.CompilerJob)", 1093 | " @ GPUCompiler ~/Julia/depot/packages/GPUCompiler/AJD5L/src/validation.jl:21", 1094 | " [2] macro expansion", 1095 | " @ ~/Julia/depot/packages/TimerOutputs/SSeq1/src/TimerOutput.jl:252 [inlined]", 1096 | " [3] macro expansion", 1097 | " @ ~/Julia/depot/packages/GPUCompiler/AJD5L/src/driver.jl:89 [inlined]", 1098 | " [4] emit_julia(job::GPUCompiler.CompilerJob)", 1099 | " @ GPUCompiler ~/Julia/depot/packages/GPUCompiler/AJD5L/src/utils.jl:62", 1100 | " [5] cufunction_compile(job::GPUCompiler.CompilerJob)", 1101 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:324", 1102 | " [6] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))", 1103 | " @ GPUCompiler ~/Julia/depot/packages/GPUCompiler/AJD5L/src/cache.jl:89", 1104 | " [7] cufunction(f::typeof(kernel), tt::Type{Tuple{CuDeviceVector{Int64, 1}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})", 1105 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:297", 1106 | " [8] cufunction(f::typeof(kernel), tt::Type{Tuple{CuDeviceVector{Int64, 1}}})", 1107 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:291", 1108 | " [9] macro expansion", 1109 | " @ ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:102 [inlined]", 1110 | " [10] top-level scope", 1111 | " @ 
~/Julia/depot/packages/GPUCompiler/AJD5L/src/reflection.jl:147", 1112 | " [11] eval", 1113 | " @ ./boot.jl:373 [inlined]", 1114 | " [12] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)", 1115 | " @ Base ./loading.jl:1196" 1116 | ] 1117 | } 1118 | ], 1119 | "source": [ 1120 | "@device_code_warntype @cuda kernel(CuArray([1]))" 1121 | ] 1122 | }, 1123 | { 1124 | "cell_type": "markdown", 1125 | "id": "a47bdbe7", 1126 | "metadata": {}, 1127 | "source": [ 1128 | "
\n", 1129 | " Note: When the error occurs in a child function, and not in the kernel itself, it can be difficult to spot the issue. It used to be possible to use Cthulhu.jl to interactively inspect the code generated for child functions, but that functionality is currently broken. Awaiting a fix, you can manually inspect code generated for a child function by using `CUDA.code_warntype`.\n", 1130 | "
" 1131 | ] 1132 | }, 1133 | { 1134 | "cell_type": "code", 1135 | "execution_count": 24, 1136 | "id": "d89a4ebf", 1137 | "metadata": {}, 1138 | "outputs": [ 1139 | { 1140 | "name": "stdout", 1141 | "output_type": "stream", 1142 | "text": [ 1143 | "PTX CompilerJob of kernel kernel(CuDeviceVector{Int64, 1}) for sm_75\n", 1144 | "\n", 1145 | "MethodInstance for kernel(::CuDeviceVector{Int64, 1})\n", 1146 | " from kernel(a) in Main at In[24]:1\n", 1147 | "Arguments\n", 1148 | " #self#\u001b[36m::Core.Const(kernel)\u001b[39m\n", 1149 | " a\u001b[36m::CuDeviceVector{Int64, 1}\u001b[39m\n", 1150 | "Body\u001b[36m::Union{}\u001b[39m\n", 1151 | "\u001b[90m1 ─\u001b[39m Main.child(a)\n", 1152 | "\u001b[90m└──\u001b[39m Core.Const(:(return %1))\n", 1153 | "\n" 1154 | ] 1155 | }, 1156 | { 1157 | "ename": "LoadError", 1158 | "evalue": "GPU compilation of kernel kernel(CuDeviceVector{Int64, 1}) failed\nKernelError: kernel returns a value of type `Union{}`\n\nMake sure your kernel function ends in `return`, `return nothing` or `nothing`.\nIf the returned value is of type `Union{}`, your Julia code probably throws an exception.\nInspect the code with `@device_code_warntype` for more details.\n", 1159 | "output_type": "error", 1160 | "traceback": [ 1161 | "GPU compilation of kernel kernel(CuDeviceVector{Int64, 1}) failed\nKernelError: kernel returns a value of type `Union{}`\n\nMake sure your kernel function ends in `return`, `return nothing` or `nothing`.\nIf the returned value is of type `Union{}`, your Julia code probably throws an exception.\nInspect the code with `@device_code_warntype` for more details.\n", 1162 | "", 1163 | "Stacktrace:", 1164 | " [1] check_method(job::GPUCompiler.CompilerJob)", 1165 | " @ GPUCompiler ~/Julia/depot/packages/GPUCompiler/AJD5L/src/validation.jl:21", 1166 | " [2] macro expansion", 1167 | " @ ~/Julia/depot/packages/TimerOutputs/SSeq1/src/TimerOutput.jl:252 [inlined]", 1168 | " [3] macro expansion", 1169 | " @ ~/Julia/depot/packages/GPUCompiler/AJD5L/src/driver.jl:89 [inlined]", 1170 | " [4] emit_julia(job::GPUCompiler.CompilerJob)", 1171 | " @ GPUCompiler ~/Julia/depot/packages/GPUCompiler/AJD5L/src/utils.jl:62", 1172 | " [5] cufunction_compile(job::GPUCompiler.CompilerJob)", 1173 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:324", 1174 | " [6] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))", 1175 | " @ GPUCompiler ~/Julia/depot/packages/GPUCompiler/AJD5L/src/cache.jl:89", 1176 | " [7] cufunction(f::typeof(kernel), tt::Type{Tuple{CuDeviceVector{Int64, 1}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})", 1177 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:297", 1178 | " [8] cufunction(f::typeof(kernel), tt::Type{Tuple{CuDeviceVector{Int64, 1}}})", 1179 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:291", 1180 | " [9] macro expansion", 1181 | " @ ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:102 [inlined]", 1182 | " [10] top-level scope", 1183 | " @ ~/Julia/depot/packages/GPUCompiler/AJD5L/src/reflection.jl:147", 1184 | " [11] eval", 1185 | " @ ./boot.jl:373 [inlined]", 1186 | " [12] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)", 1187 | " @ Base ./loading.jl:1196" 1188 | ] 1189 | } 1190 | ], 1191 | "source": [ 1192 | "kernel(a) = child(a)\n", 1193 | "@noinline child(a) = a[] = 
threadIdx().X\n", 1194 | "@device_code_warntype @cuda kernel(CuArray([1]))" 1195 | ] 1196 | }, 1197 | { 1198 | "cell_type": "code", 1199 | "execution_count": 25, 1200 | "id": "b7918f47", 1201 | "metadata": {}, 1202 | "outputs": [ 1203 | { 1204 | "name": "stdout", 1205 | "output_type": "stream", 1206 | "text": [ 1207 | "MethodInstance for child(::CuDeviceVector{Int64, 1})\n", 1208 | " from child(a) in Main at In[24]:2\n", 1209 | "Arguments\n", 1210 | " #self#\u001b[36m::Core.Const(child)\u001b[39m\n", 1211 | " a\u001b[36m::CuDeviceVector{Int64, 1}\u001b[39m\n", 1212 | "Body\u001b[36m::Union{}\u001b[39m\n", 1213 | "\u001b[90m1 ─\u001b[39m $(Expr(:meta, :noinline))\n", 1214 | "\u001b[90m│ \u001b[39m %2 = Main.threadIdx()\u001b[36m::NamedTuple{(:x, :y, :z), Tuple{Int32, Int32, Int32}}\u001b[39m\n", 1215 | "\u001b[90m│ \u001b[39m Base.getproperty(%2, :X)\n", 1216 | "\u001b[90m│ \u001b[39m Core.Const(:(Base.setindex!(a, %3)))\n", 1217 | "\u001b[90m└──\u001b[39m Core.Const(:(return %3))\n", 1218 | "\n" 1219 | ] 1220 | } 1221 | ], 1222 | "source": [ 1223 | "CUDA.code_warntype(child, (CuDeviceVector{Int64, 1},))" 1224 | ] 1225 | }, 1226 | { 1227 | "cell_type": "markdown", 1228 | "id": "98d97244", 1229 | "metadata": {}, 1230 | "source": [ 1231 | "
\n", 1232 | " Note: When using the non-macro versions of reflection utilities, you need to specify device-side types. For example, to inspect a kernel taking a `CuArray` you will need to specify `CuDeviceVector`.\n", 1233 | "
" 1234 | ] 1235 | }, 1236 | { 1237 | "cell_type": "markdown", 1238 | "id": "8e8ab320", 1239 | "metadata": {}, 1240 | "source": [ 1241 | "### Unsupported IR\n", 1242 | "\n", 1243 | "Other errors might result in a failure later during compilation:" 1244 | ] 1245 | }, 1246 | { 1247 | "cell_type": "code", 1248 | "execution_count": 26, 1249 | "id": "1edfc671", 1250 | "metadata": {}, 1251 | "outputs": [ 1252 | { 1253 | "ename": "LoadError", 1254 | "evalue": "InvalidIRError: compiling kernel kernel(CuDeviceVector{Int64, 1}) resulted in invalid LLVM IR\nReason: unsupported use of an undefined name (use of 'threadId')\nStacktrace:\n [1] \u001b[0m\u001b[1mkernel\u001b[22m\n\u001b[90m @ \u001b[39m\u001b[90m./\u001b[39m\u001b[90m\u001b[4mIn[26]:2\u001b[24m\u001b[39m\nReason: unsupported dynamic function invocation\nStacktrace:\n [1] \u001b[0m\u001b[1mkernel\u001b[22m\n\u001b[90m @ \u001b[39m\u001b[90m./\u001b[39m\u001b[90m\u001b[4mIn[26]:2\u001b[24m\u001b[39m\nReason: unsupported dynamic function invocation (call to getproperty)\nStacktrace:\n [1] \u001b[0m\u001b[1mkernel\u001b[22m\n\u001b[90m @ \u001b[39m\u001b[90m./\u001b[39m\u001b[90m\u001b[4mIn[26]:2\u001b[24m\u001b[39m\nReason: unsupported dynamic function invocation (call to ==)\nStacktrace:\n [1] \u001b[0m\u001b[1mkernel\u001b[22m\n\u001b[90m @ \u001b[39m\u001b[90m./\u001b[39m\u001b[90m\u001b[4mIn[26]:2\u001b[24m\u001b[39m", 1255 | "output_type": "error", 1256 | "traceback": [ 1257 | "InvalidIRError: compiling kernel kernel(CuDeviceVector{Int64, 1}) resulted in invalid LLVM IR\nReason: unsupported use of an undefined name (use of 'threadId')\nStacktrace:\n [1] \u001b[0m\u001b[1mkernel\u001b[22m\n\u001b[90m @ \u001b[39m\u001b[90m./\u001b[39m\u001b[90m\u001b[4mIn[26]:2\u001b[24m\u001b[39m\nReason: unsupported dynamic function invocation\nStacktrace:\n [1] \u001b[0m\u001b[1mkernel\u001b[22m\n\u001b[90m @ \u001b[39m\u001b[90m./\u001b[39m\u001b[90m\u001b[4mIn[26]:2\u001b[24m\u001b[39m\nReason: unsupported dynamic function invocation (call to getproperty)\nStacktrace:\n [1] \u001b[0m\u001b[1mkernel\u001b[22m\n\u001b[90m @ \u001b[39m\u001b[90m./\u001b[39m\u001b[90m\u001b[4mIn[26]:2\u001b[24m\u001b[39m\nReason: unsupported dynamic function invocation (call to ==)\nStacktrace:\n [1] \u001b[0m\u001b[1mkernel\u001b[22m\n\u001b[90m @ \u001b[39m\u001b[90m./\u001b[39m\u001b[90m\u001b[4mIn[26]:2\u001b[24m\u001b[39m", 1258 | "", 1259 | "Stacktrace:", 1260 | " [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(kernel), Tuple{CuDeviceVector{Int64, 1}}}}, args::LLVM.Module)", 1261 | " @ GPUCompiler ~/Julia/depot/packages/GPUCompiler/AJD5L/src/validation.jl:111", 1262 | " [2] macro expansion", 1263 | " @ ~/Julia/depot/packages/GPUCompiler/AJD5L/src/driver.jl:333 [inlined]", 1264 | " [3] macro expansion", 1265 | " @ ~/Julia/depot/packages/TimerOutputs/SSeq1/src/TimerOutput.jl:252 [inlined]", 1266 | " [4] macro expansion", 1267 | " @ ~/Julia/depot/packages/GPUCompiler/AJD5L/src/driver.jl:331 [inlined]", 1268 | " [5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)", 1269 | " @ GPUCompiler ~/Julia/depot/packages/GPUCompiler/AJD5L/src/utils.jl:62", 1270 | " [6] cufunction_compile(job::GPUCompiler.CompilerJob)", 1271 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:326", 1272 | " [7] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, 
compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))", 1273 | " @ GPUCompiler ~/Julia/depot/packages/GPUCompiler/AJD5L/src/cache.jl:89", 1274 | " [8] cufunction(f::typeof(kernel), tt::Type{Tuple{CuDeviceVector{Int64, 1}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})", 1275 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:297", 1276 | " [9] cufunction(f::typeof(kernel), tt::Type{Tuple{CuDeviceVector{Int64, 1}}})", 1277 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:291", 1278 | " [10] top-level scope", 1279 | " @ ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:102", 1280 | " [11] eval", 1281 | " @ ./boot.jl:373 [inlined]", 1282 | " [12] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)", 1283 | " @ Base ./loading.jl:1196" 1284 | ] 1285 | } 1286 | ], 1287 | "source": [ 1288 | "function kernel(a)\n", 1289 | " if threadId().x == 1\n", 1290 | " a[] = 42\n", 1291 | " end\n", 1292 | " return\n", 1293 | "end\n", 1294 | "@cuda kernel(CuArray([1]))" 1295 | ] 1296 | }, 1297 | { 1298 | "cell_type": "markdown", 1299 | "id": "32a496a1", 1300 | "metadata": {}, 1301 | "source": [ 1302 | "CUDA.jl tries its best to figure out why dynamic IR was generated, as can be seen in the error trace here. But again, we're also able to spot this error using reflection macros:" 1303 | ] 1304 | }, 1305 | { 1306 | "cell_type": "code", 1307 | "execution_count": 27, 1308 | "id": "72797926", 1309 | "metadata": {}, 1310 | "outputs": [ 1311 | { 1312 | "name": "stdout", 1313 | "output_type": "stream", 1314 | "text": [ 1315 | "PTX CompilerJob of kernel kernel(CuDeviceVector{Int64, 1}) for sm_75\n", 1316 | "\n", 1317 | "MethodInstance for kernel(::CuDeviceVector{Int64, 1})\n", 1318 | " from kernel(a) in Main at In[26]:1\n", 1319 | "Arguments\n", 1320 | " #self#\u001b[36m::Core.Const(kernel)\u001b[39m\n", 1321 | " a\u001b[36m::CuDeviceVector{Int64, 1}\u001b[39m\n", 1322 | "Body\u001b[36m::Nothing\u001b[39m\n", 1323 | "\u001b[90m1 ─\u001b[39m %1 = Main.threadId()\u001b[91m\u001b[1m::Any\u001b[22m\u001b[39m\n", 1324 | "\u001b[90m│ \u001b[39m %2 = Base.getproperty(%1, :x)\u001b[91m\u001b[1m::Any\u001b[22m\u001b[39m\n", 1325 | "\u001b[90m│ \u001b[39m %3 = (%2 == 1)\u001b[91m\u001b[1m::Any\u001b[22m\u001b[39m\n", 1326 | "\u001b[90m└──\u001b[39m goto #3 if not %3\n", 1327 | "\u001b[90m2 ─\u001b[39m Base.setindex!(a, 42)\n", 1328 | "\u001b[90m3 ┄\u001b[39m return nothing\n", 1329 | "\n" 1330 | ] 1331 | }, 1332 | { 1333 | "ename": "LoadError", 1334 | "evalue": "InvalidIRError: compiling kernel kernel(CuDeviceVector{Int64, 1}) resulted in invalid LLVM IR\nReason: unsupported use of an undefined name (use of 'threadId')\nStacktrace:\n [1] \u001b[0m\u001b[1mkernel\u001b[22m\n\u001b[90m @ \u001b[39m\u001b[90m./\u001b[39m\u001b[90m\u001b[4mIn[26]:2\u001b[24m\u001b[39m\nReason: unsupported dynamic function invocation\nStacktrace:\n [1] \u001b[0m\u001b[1mkernel\u001b[22m\n\u001b[90m @ \u001b[39m\u001b[90m./\u001b[39m\u001b[90m\u001b[4mIn[26]:2\u001b[24m\u001b[39m\nReason: unsupported dynamic function invocation (call to getproperty)\nStacktrace:\n [1] \u001b[0m\u001b[1mkernel\u001b[22m\n\u001b[90m @ \u001b[39m\u001b[90m./\u001b[39m\u001b[90m\u001b[4mIn[26]:2\u001b[24m\u001b[39m\nReason: unsupported dynamic function invocation (call to ==)\nStacktrace:\n [1] \u001b[0m\u001b[1mkernel\u001b[22m\n\u001b[90m @ 
\u001b[39m\u001b[90m./\u001b[39m\u001b[90m\u001b[4mIn[26]:2\u001b[24m\u001b[39m", 1335 | "output_type": "error", 1336 | "traceback": [ 1337 | "InvalidIRError: compiling kernel kernel(CuDeviceVector{Int64, 1}) resulted in invalid LLVM IR\nReason: unsupported use of an undefined name (use of 'threadId')\nStacktrace:\n [1] \u001b[0m\u001b[1mkernel\u001b[22m\n\u001b[90m @ \u001b[39m\u001b[90m./\u001b[39m\u001b[90m\u001b[4mIn[26]:2\u001b[24m\u001b[39m\nReason: unsupported dynamic function invocation\nStacktrace:\n [1] \u001b[0m\u001b[1mkernel\u001b[22m\n\u001b[90m @ \u001b[39m\u001b[90m./\u001b[39m\u001b[90m\u001b[4mIn[26]:2\u001b[24m\u001b[39m\nReason: unsupported dynamic function invocation (call to getproperty)\nStacktrace:\n [1] \u001b[0m\u001b[1mkernel\u001b[22m\n\u001b[90m @ \u001b[39m\u001b[90m./\u001b[39m\u001b[90m\u001b[4mIn[26]:2\u001b[24m\u001b[39m\nReason: unsupported dynamic function invocation (call to ==)\nStacktrace:\n [1] \u001b[0m\u001b[1mkernel\u001b[22m\n\u001b[90m @ \u001b[39m\u001b[90m./\u001b[39m\u001b[90m\u001b[4mIn[26]:2\u001b[24m\u001b[39m", 1338 | "", 1339 | "Stacktrace:", 1340 | " [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(kernel), Tuple{CuDeviceVector{Int64, 1}}}}, args::LLVM.Module)", 1341 | " @ GPUCompiler ~/Julia/depot/packages/GPUCompiler/AJD5L/src/validation.jl:111", 1342 | " [2] macro expansion", 1343 | " @ ~/Julia/depot/packages/GPUCompiler/AJD5L/src/driver.jl:333 [inlined]", 1344 | " [3] macro expansion", 1345 | " @ ~/Julia/depot/packages/TimerOutputs/SSeq1/src/TimerOutput.jl:252 [inlined]", 1346 | " [4] macro expansion", 1347 | " @ ~/Julia/depot/packages/GPUCompiler/AJD5L/src/driver.jl:331 [inlined]", 1348 | " [5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)", 1349 | " @ GPUCompiler ~/Julia/depot/packages/GPUCompiler/AJD5L/src/utils.jl:62", 1350 | " [6] cufunction_compile(job::GPUCompiler.CompilerJob)", 1351 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:326", 1352 | " [7] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))", 1353 | " @ GPUCompiler ~/Julia/depot/packages/GPUCompiler/AJD5L/src/cache.jl:89", 1354 | " [8] cufunction(f::typeof(kernel), tt::Type{Tuple{CuDeviceVector{Int64, 1}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})", 1355 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:297", 1356 | " [9] cufunction(f::typeof(kernel), tt::Type{Tuple{CuDeviceVector{Int64, 1}}})", 1357 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:291", 1358 | " [10] macro expansion", 1359 | " @ ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:102 [inlined]", 1360 | " [11] top-level scope", 1361 | " @ ~/Julia/depot/packages/GPUCompiler/AJD5L/src/reflection.jl:147", 1362 | " [12] eval", 1363 | " @ ./boot.jl:373 [inlined]", 1364 | " [13] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)", 1365 | " @ Base ./loading.jl:1196" 1366 | ] 1367 | } 1368 | ], 1369 | "source": [ 1370 | "@device_code_warntype @cuda kernel(CuArray([1]))" 1371 | ] 1372 | }, 1373 | { 1374 | "cell_type": "markdown", 1375 | "id": "a56a9cd8", 1376 | "metadata": {}, 1377 | "source": [ 1378 | "In this IR, the `::Any` indicates that type-unstable code will be generated. 
This is unsupported on the GPU, as it will result in calls to the runtime library." 1379 | ] 1380 | }, 1381 | { 1382 | "cell_type": "markdown", 1383 | "id": "7b66cd51", 1384 | "metadata": {}, 1385 | "source": [ 1386 | "### Passing invalid types\n", 1387 | "\n", 1388 | "Compilation errors are not limited to kernel functions, you can trigger some by using array abstractions in an invalid manner. For example:" 1389 | ] 1390 | }, 1391 | { 1392 | "cell_type": "code", 1393 | "execution_count": 28, 1394 | "id": "cc76faaf", 1395 | "metadata": {}, 1396 | "outputs": [ 1397 | { 1398 | "ename": "LoadError", 1399 | "evalue": "GPU compilation of kernel broadcast_kernel(CUDA.CuKernelContext, CuDeviceVector{Int64, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(identity), Tuple{Base.Broadcast.Extruded{Vector{Int64}, Tuple{Bool}, Tuple{Int64}}}}, Int64) failed\nKernelError: passing and using non-bitstype argument\n\nArgument 4 to your kernel function is of type Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(identity), Tuple{Base.Broadcast.Extruded{Vector{Int64}, Tuple{Bool}, Tuple{Int64}}}}, which is not isbits:\n .args is of type Tuple{Base.Broadcast.Extruded{Vector{Int64}, Tuple{Bool}, Tuple{Int64}}} which is not isbits.\n .1 is of type Base.Broadcast.Extruded{Vector{Int64}, Tuple{Bool}, Tuple{Int64}} which is not isbits.\n .x is of type Vector{Int64} which is not isbits.\n\n", 1400 | "output_type": "error", 1401 | "traceback": [ 1402 | "GPU compilation of kernel broadcast_kernel(CUDA.CuKernelContext, CuDeviceVector{Int64, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(identity), Tuple{Base.Broadcast.Extruded{Vector{Int64}, Tuple{Bool}, Tuple{Int64}}}}, Int64) failed\nKernelError: passing and using non-bitstype argument\n\nArgument 4 to your kernel function is of type Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(identity), Tuple{Base.Broadcast.Extruded{Vector{Int64}, Tuple{Bool}, Tuple{Int64}}}}, which is not isbits:\n .args is of type Tuple{Base.Broadcast.Extruded{Vector{Int64}, Tuple{Bool}, Tuple{Int64}}} which is not isbits.\n .1 is of type Base.Broadcast.Extruded{Vector{Int64}, Tuple{Bool}, Tuple{Int64}} which is not isbits.\n .x is of type Vector{Int64} which is not isbits.\n\n", 1403 | "", 1404 | "Stacktrace:", 1405 | " [1] check_invocation(job::GPUCompiler.CompilerJob)", 1406 | " @ GPUCompiler ~/Julia/depot/packages/GPUCompiler/AJD5L/src/validation.jl:66", 1407 | " [2] macro expansion", 1408 | " @ ~/Julia/depot/packages/GPUCompiler/AJD5L/src/driver.jl:332 [inlined]", 1409 | " [3] macro expansion", 1410 | " @ ~/Julia/depot/packages/TimerOutputs/SSeq1/src/TimerOutput.jl:252 [inlined]", 1411 | " [4] macro expansion", 1412 | " @ ~/Julia/depot/packages/GPUCompiler/AJD5L/src/driver.jl:331 [inlined]", 1413 | " [5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)", 1414 | " @ GPUCompiler ~/Julia/depot/packages/GPUCompiler/AJD5L/src/utils.jl:62", 1415 | " [6] cufunction_compile(job::GPUCompiler.CompilerJob)", 1416 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:326", 1417 | " [7] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))", 1418 | " @ GPUCompiler ~/Julia/depot/packages/GPUCompiler/AJD5L/src/cache.jl:89", 1419 | " [8] cufunction(f::GPUArrays.var\"#broadcast_kernel#17\", tt::Type{Tuple{CUDA.CuKernelContext, 
CuDeviceVector{Int64, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(identity), Tuple{Base.Broadcast.Extruded{Vector{Int64}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})", 1420 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:297", 1421 | " [9] cufunction", 1422 | " @ ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:291 [inlined]", 1423 | " [10] macro expansion", 1424 | " @ ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:102 [inlined]", 1425 | " [11] #launch_heuristic#236", 1426 | " @ ~/Julia/depot/packages/CUDA/rZxom/src/gpuarrays.jl:17 [inlined]", 1427 | " [12] copyto!", 1428 | " @ ~/Julia/depot/packages/GPUArrays/3sW6s/src/host/broadcast.jl:65 [inlined]", 1429 | " [13] copyto!", 1430 | " @ ./broadcast.jl:913 [inlined]", 1431 | " [14] materialize!", 1432 | " @ ./broadcast.jl:871 [inlined]", 1433 | " [15] materialize!(dest::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}, bc::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, typeof(identity), Tuple{Vector{Int64}}})", 1434 | " @ Base.Broadcast ./broadcast.jl:868", 1435 | " [16] top-level scope", 1436 | " @ In[28]:1", 1437 | " [17] eval", 1438 | " @ ./boot.jl:373 [inlined]", 1439 | " [18] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)", 1440 | " @ Base ./loading.jl:1196" 1441 | ] 1442 | } 1443 | ], 1444 | "source": [ 1445 | "CuArray([1]) .= [1]" 1446 | ] 1447 | }, 1448 | { 1449 | "cell_type": "markdown", 1450 | "id": "d3ea5b62", 1451 | "metadata": {}, 1452 | "source": [ 1453 | "Here, compilation fails because GPU code requires all arguments to be `isbits` (remember the difference between `CuArray` and `CuDeviceArray`)." 
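,
    "\n",
    "\n",
    "A minimal sketch of a fix (assuming the same session as above): make sure the right-hand side also lives on the GPU, so the broadcast kernel only receives isbits-compatible device-side arguments:\n",
    "\n",
    "```\n",
    "# upload the CPU vector first, then assign on the GPU\n",
    "CuArray([1]) .= CuArray([1])\n",
    "\n",
    "# or copy the host data over explicitly\n",
    "copyto!(CuArray([1]), [1])\n",
    "```"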
1454 | ] 1455 | }, 1456 | { 1457 | "cell_type": "markdown", 1458 | "id": "3740087b", 1459 | "metadata": {}, 1460 | "source": [ 1461 | "### Run-time exceptions\n", 1462 | "\n", 1463 | "Finally, to demonstrate that we _are_ working on dynamic error semantics: Some exceptions are already being reported at run time:" 1464 | ] 1465 | }, 1466 | { 1467 | "cell_type": "code", 1468 | "execution_count": 29, 1469 | "id": "c920c270", 1470 | "metadata": {}, 1471 | "outputs": [ 1472 | { 1473 | "name": "stdout", 1474 | "output_type": "stream", 1475 | "text": [ 1476 | "ERROR: a exception was thrown during kernel execution.\n", 1477 | " Run Julia on debug level 2 for device stack traces.\n" 1478 | ] 1479 | }, 1480 | { 1481 | "ename": "LoadError", 1482 | "evalue": "KernelException: exception thrown during kernel execution on device Quadro RTX 5000", 1483 | "output_type": "error", 1484 | "traceback": [ 1485 | "KernelException: exception thrown during kernel execution on device Quadro RTX 5000", 1486 | "", 1487 | "Stacktrace:", 1488 | " [1] check_exceptions()", 1489 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/src/compiler/exceptions.jl:34", 1490 | " [2] nonblocking_synchronize", 1491 | " @ ~/Julia/depot/packages/CUDA/rZxom/lib/cudadrv/context.jl:347 [inlined]", 1492 | " [3] device_synchronize()", 1493 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/lib/cudadrv/context.jl:335", 1494 | " [4] top-level scope", 1495 | " @ In[29]:8", 1496 | " [5] eval", 1497 | " @ ./boot.jl:373 [inlined]", 1498 | " [6] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)", 1499 | " @ Base ./loading.jl:1196" 1500 | ] 1501 | } 1502 | ], 1503 | "source": [ 1504 | "function kernel(a)\n", 1505 | " if threadIdx().x == 1\n", 1506 | " a[] += 42.1\n", 1507 | " end\n", 1508 | " return\n", 1509 | "end\n", 1510 | "@cuda kernel(CuArray([42]))\n", 1511 | "device_synchronize()" 1512 | ] 1513 | }, 1514 | { 1515 | "cell_type": "markdown", 1516 | "id": "032d63e4", 1517 | "metadata": {}, 1518 | "source": [ 1519 | "When such a run-time exception happens, the GPU will print, and any subsequent synchronization API call will fail and report the exception on the host too. For more information on the error, launch Julia with `-g2`:\n", 1520 | "\n", 1521 | "```\n", 1522 | "$ julia -g2\n", 1523 | "...\n", 1524 | "julia> @cuda kernel(CuArray([42]))\n", 1525 | "ERROR: a exception was thrown during kernel execution.\n", 1526 | "Stacktrace:\n", 1527 | " [1] Int64 at ./float.jl:812\n", 1528 | " [2] convert at ./number.jl:7\n", 1529 | " [3] setindex! at /home/tim/Julia/pkg/CUDA/src/device/array.jl:204\n", 1530 | " [4] setindex! 
at /home/tim/Julia/pkg/CUDA/src/device/array.jl:217\n", 1531 | " [5] kernel at ./REPL[1]:3\n", 1532 | "```" 1533 | ] 1534 | }, 1535 | { 1536 | "attachments": { 1537 | "image.png": { 1538 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAOgAAAA8CAYAAABsBvZBAAAABHNCSVQICAgIfAhkiAAADXBJREFUeF7tnc9vHEkVx5Pw44CATG4I0Lq9B4T4ZYeVOAFu74EbxOEPIHZOcGEdJBYukMmekBCbMRISFzbtBSE4sJkIJIRAuB3BYSVInAUECLFuI7iAIA4IkEAofD9JVbZS6erptqftHk896auqeu/Vq1ev6nVVz4yTY8cixQjECMQIxAjECMQINI/A8eZdYo+WIpDK7mMt2Y5mux+B38vFn/puxgT1I3J47Rc19L8Ob/g48iFH4Aca/wuH7EMcviICLFCkGIGHInAixqMTEZiXFy93wpPoRKciEBO0G8vxhNy41Q1XohddisCru+TMFPvCCfrNlufPGJeF62acBZWFsCPMCLvCqpHFIkYgRsCJwIbqr205ImuyT5JCPeGukJo2Re7UY7UjEYhX3G4sBMn5n5ZduS37W2aMRVPmzpibLY8fzccITGQEEnn9jQrP+Spsr3DNMo4lTtPcFaqeeO3YHF8EZmUKNKb4Dto4ZGPvcFoWbwSsFuLzfsh7InUS1SWuqZa4ts49LD72XrVvGl7hyHj/HHq6rtwTNW7iy0DgnXa3pPcp8fC9TFai3mlWIu8yYSkwH2SFMC+cFM4G9MSO1MUIXJJTJEwZLYrJRuZ6yqYeRSQwGyUX6MdJ6ZNNDh4MbRD2eSiwIcvIyrck7JUpTCCPpFsv8XtFvA2HX6jeL9GLrA5H4HvyjSdriEhgks1d6JCuy2dzFIJ/6pLA2GuL8BeUEQmZC2xoQP2oJOlQcyHmLhFrHkSWclVApAmKAAvmJ5HvPotMUoU2vq9v25kqJIJLAzVyjzeu5qwM4SenZBnNiwkscYq77bI+k8JblKOFULWWyPuTMqHo5/3T4zs1AsHG3xXY/GzqutSTYmKUl1XyPSjX5W3hoiMzKvsu1mTh2r6tTKYBErMQ/Aeinc2qKsPJnNr0ev0hTf1zNafPwpOgJFfohKppqjW1LVkOnfKJZGlrI+/fMCdgk4df2Yg8nHhI+cQtIfOZsd39CHxGLn64gZvr0iVJu/gk5gTBtydL5sPDBZ9zIfPk3Aw4XQ6TiCuJVQjuCZiqzZx6NZ3rS89+am67kJwDp39W01ZU60AEviUf3trADzbKlsCmWWnQ7yBUZ41fnEQu4fPQMCgLR4guc0kd3kFX8aFvBsUXW4eVCbtGVqdgTbDBwwpKBF4pNhz070lqUvwetGagWlJ7XHb/1MA2m4VNwPemzwokq//EbmBurKpJwBqJmwls2tTUrSptKLeMhiVX0qpPwF1zoTEWpDQQ7Mlp9ay/br/UGAzZKozcFokqX/Z4ob6e2v1mTNDSsBwI83Ua5Z8CT9wmREJ+SiBB54WuJKidQ+FNBv8ACUAyrTvyM6pvevpNmktStqfVqH6FFIBPfcPA1o5g/UlUnxHcd0p0Sea6lEsRRJrACHxQPl/eo98r6pftse9+uvXUOQ0Y4DTjYbMYkGfiF47slNG/VKI/LOG1zdrVAG4yEmPmw7wgHgS0iUGIeP92r7ghvdr8+GP52qEau+KcLL60B6skwDlheQ992Vy399BvXn04ObaENNAfWRUlEhaOAvOHcodHNRUY7yCJhwWnuzsHrr53DO+CSm4BtIdCyD/6QCTpWChecccSxj0ZeZd6fbVhT57mfWGpYT+rzsbiGteU2Lirwu6Ijn+QPAno0LfnyLAH5Q4vVZ2Hz7ZA3ZU5aq1XiZO9fpNs3HRIYBL5qYrRmd9mhbyxKJ6gjUM2tg7vlCU2fl1ic7BR2MC3a3Ziw8wa3b7KiwIbjvq4Cbu5kAYM98VPhA0hF0iC6wL9LHGN5CHEJqd+UEQ8zwvE56owEIhd7jiQqo7vVTQn4VgTNJ6gVeFuT8bme01D82yOFaFo0I9T6pawLeTCkpAJdhNxdWMjWnKTBd6OcMWRj6pmAX3GYM6JMTCrkn+D6ZJp24I5ssnPCvh8UJRoIMZmrvi5LCwIQwGCR5v42XjtGpkt4KNDTI8kEQTIlkdykmZSMyq/3mCC69JdaaCPKvpsIjeed9XmJN4r9dXx0ojOueSLnk6hNmNbWlMFnk9pgO/rjbPNiY1vNhmJz7aAj5ZSVQrTGKi0SfqKxv24XHMZ46hzgs4LDGrJDeQtMQshE/wnBn3oa/WZkJ2kqo9QJk4ioM+mWRW2jNYFlecExugJ2MmFvpAKjDMQLFkbtKkXAsGpGv9B5w5U3i0fiG0dIjaFcKWGMnGdEfoC8XxesOtD0jCmfz12E1jih8j29flVbRK4L2w4SqyrXT/msyCcLjGCj7nhL6k8qPW8Y/ybVfmCsCm475r4tCX0hUxgPi711Dgj4PNYicXBOMFiQcG6kJlRTqlcFvhKgDs693NL9KHvZVOymEywjNBlwU4K1j4TZqIsGMGYN218WhaeEwhUKpT5iB1L9MXGtnBWwG6X6bNy7mfCj0Y4yVzYMLlAXCA3aYghc7eUOnWqHxXsml1SnfivGgxUrgiJEKJCgiuOsK86flwMdTB8a9fqJeKzzqxjLrg2XVN90yhUsoZDV9hinTiTYHfMmOzVJoSfIGvSqY6uXXR0WcDPC88I/gIU4s0Ijwvbgku5GpsCfd2/4Hd11tRgcT4muPYZ/7aA3B9zXTzGTAVLVT5if0vA3qLAAneVvi3HPm58rfJxVcJTVQojZG5M+9IlnqwVcQJ1KZHiOSE1HXKVrE9h2mUFm35DaLIOrCFzzg3K7HaNl8iheWHYtmNs/rsCpU+5kfV9gdrIZo2cRSujTMwy+8dNv7IxFyXLPWNlNlyVlQp7nqnWmvxCCISIObNxD4NSDZrsYWASh76sCaAOL1LLEThR075djJ2AfiH+pvARwX/i8yRFRvL7BI9rxScFnkIu3VRj3eONamZGAXuHRRc0MO+YISJ5/x0StszPZb/Ywxi76kNfHiyAOrxILUegToLyxJwTnheuVPhDMpHIS54Od/usot/A9CMhtwTa9GEDVI0XMknC48fpkEKLfMb9tMCPEEL0Hgl+GRJGfoyAG4Gy70F5T+RDISgxOK9yVLJk0rks0N/qzqp+XCg7PcW+R32Bjf2UwIMAUM8FTl8S1aeQPfgk+YKAzRAtSbAaEOKvT9tiLPvMkjY/Yj8pvL1EZlnM70aFPIpiBB5EoCxBOSkvCmzUVLgqnBNs0j3o7FVIjmsCCUpisqlJsIHRK9v41gTJgl4qLApnTB3esuBTyBb8kMy1MVQDX31iDmX94Y+iN0jhA0bpbRXK85LZmJSpvUpMPmyLNB0RKDTNv4SmWpagVpdNuSGsC7zTrQijkpSNR4IuC32BRHtWqKJE
wsIgUwl6wpZwTmDcOgkitXtkT3/6V1HIZohfZQvZJwT+f0eusPyMj0Qvs8X76e/oEKA3iv/dgCyyj14EntaUOBRLqSpBbYfCVBZUjkrQm9LZEUhSTilQRWziTEg9pV211wSSm2S97clDzRkjuD6iDzabvKPiD3Oroj9LyP/xyYPhSYET9e9eB+b7X+F/Ht9tMtc3VcijaIoiUCdB7Uk0WzMu/AX5l4TnBL4oH0XzUjglhJKQ5PCp7GRCZ2AUV/0OXjtROx2h44rrJGhmOvxaJQnKSfoTb4x3qP1bjxebMQJuBHpqkGv3DoQ6CWoThKujvbblqqcCxk4Kc8KWAL0gkKCcpNuGV1XQn5NyxVM6o/YzQlky4ocl6pycJGcqnBdGnXb4av21dsZV2k9o+STXT9B58W6Ma6Bo50hGYKhZcVt98KMfrnokEolgQdtNmHUjIwn6RpZ5fTgBbeIwCB8QQSH78NEvhL5Awlj7G6rzIU5PgMps4CuJSFkI+DiL8iHT+zU+Pn2lxI8vive+En5kxQjYCKypsiskMGxC1QkPm58TALpap0NNHWySnCQjiQjhoH8Kur6SALZNvUv0ejnzD4H34AXPMd5R+Wc22/6vBrsUjzJfEjGXywQBXiZ+EZBFdoxAowjw4Pij8DevF/wfC00eio0GniDlC/KVhzDlokGukoftqmmnKgeGF2OmQEQaXwS+bzbWY47Jt6iejW+IibbEjcneluxEClXcmxF8EpNEnlo6MbUzb3fifJIL8UmuJTbkS057WquzmviO4L7CwJsRNgX/lYVknlqKCdrO0v/CmOVrFUsk6M/bGW6irM7J28uex6lpk6AuoTv0eLEZI7DvCPBJLSfB1xxLfP1U9Wdo+x50gg2sm3gtTvAcousTFAESkQTlk1yId6kfmtKwYuFEoFDdv9rGACkC8Yrbzjbg7z1fFvjdLcnJz/7uCHETPhrvRCz7/vmodMo5MUHb2QAk4q+EnvBmgXcp+17azoiTa9Vea/33z8md0Rg9jwk6xmB6pkhQiL8NJUGn+tNILzZuc8E08gqdqRXFBG1v6e1vcvnTsyeEW+0NNdGWU+P9xkTPoiXn6/xYvqWhj7xZm6Ccnrxj7Rz5GTebIO/mxCW+fzaLW9QeUwTYgLyLvijwG9xIr0SAuISQxEDFCBxUBH5jNiJ/ThepOgI80CJ5EYjvoO1uCXvN5V+Rj1QdAU7USF4EYoK2uyVigrYb32g9RmBfEeBvP/+6Lwux81RHIJ6g7S4/P06I33+2G+NoPUZgXxHge9BIMQIxAjECMQJHLQL/B7RjhpRDeOhiAAAAAElFTkSuQmCC" 1539 | } 1540 | }, 1541 | "cell_type": "markdown", 1542 | "id": "c166fc3e", 1543 | "metadata": {}, 1544 | "source": [ 1545 | "## Exercise: Matrix RMSE\n", 1546 | "\n", 1547 | "With all that out of the way, time for an exercise: Try to compute the RMSE of two matrices on the GPU using both array operations and using a GPU kernel (a single kernel, if possible):\n", 1548 | "\n", 1549 | "![image.png](attachment:image.png)" 1550 | ] 1551 | }, 1552 | { 1553 | "cell_type": "markdown", 1554 | "id": "dc563644", 1555 | "metadata": {}, 1556 | "source": [ 1557 | "For prototyping, you can develop the array operation on the CPU (one of the advantages of using array operations):" 1558 | ] 1559 | }, 1560 | { 1561 | "cell_type": "code", 1562 | "execution_count": 30, 1563 | "id": "5fa8342d", 1564 | "metadata": {}, 1565 | "outputs": [ 1566 | { 1567 | "data": { 1568 | "text/plain": [ 1569 | "0.4048518f0" 1570 | ] 1571 | }, 1572 | "execution_count": 30, 1573 | "metadata": {}, 1574 | "output_type": "execute_result" 1575 | } 1576 | ], 1577 | "source": [ 1578 | "A = CUDA.rand(10,10)\n", 1579 | "B = CUDA.rand(10,10)\n", 1580 | "sqrt(sum((A-B).^2) / length(A))" 1581 | ] 1582 | }, 1583 | { 1584 | "cell_type": "markdown", 1585 | "id": "d50845c3", 1586 | "metadata": {}, 1587 | "source": [ 1588 | "To 'port' this to the GPU, just change the type of the input arrays to `CuArray` and the computation of C just works:" 1589 | ] 1590 | }, 1591 | { 1592 | "cell_type": "code", 1593 | "execution_count": 31, 1594 | "id": "1defb88c", 1595 | "metadata": {}, 1596 | "outputs": [ 1597 | { 1598 | "data": { 1599 | "text/plain": [ 1600 | "0.4048518f0" 1601 | ] 1602 | }, 1603 | "execution_count": 31, 1604 | "metadata": {}, 1605 | "output_type": "execute_result" 1606 | } 1607 | ], 1608 | "source": [ 1609 | "A = CuArray(A)\n", 1610 | "B = CuArray(B)\n", 1611 | "sqrt(sum((A-B).^2) / length(A))" 1612 | ] 1613 | }, 1614 | { 1615 | "cell_type": "markdown", 1616 | "id": "29a8d1cf", 1617 | "metadata": {}, 1618 | "source": [ 1619 | "Now for a CUDA.jl kernel:" 1620 | ] 1621 | }, 1622 | { 1623 | "cell_type": "code", 1624 | "execution_count": 32, 1625 | "id": "2e2abb5a", 1626 | "metadata": {}, 1627 | "outputs": [ 1628 | { 1629 | "data": { 
1630 | "text/plain": [ 1631 | "1-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:\n", 1632 | " 0.40485176" 1633 | ] 1634 | }, 1635 | "execution_count": 32, 1636 | "metadata": {}, 1637 | "output_type": "execute_result" 1638 | } 1639 | ], 1640 | "source": [ 1641 | "function rmse_kernel(C, A, B) \n", 1642 | " i = threadIdx().x\n", 1643 | "\n", 1644 | " # initialize the memory\n", 1645 | " if i == 1\n", 1646 | " C[] = 0\n", 1647 | " end\n", 1648 | " sync_threads()\n", 1649 | " \n", 1650 | " # process an element on each thread\n", 1651 | " a = A[i]\n", 1652 | " b = B[i]\n", 1653 | " CUDA.@atomic C[] += (a-b)^2\n", 1654 | " sync_threads()\n", 1655 | " \n", 1656 | " # finalize the computation\n", 1657 | " if i == 1\n", 1658 | " C[1] = sqrt(C[] / length(A))\n", 1659 | " end\n", 1660 | " return\n", 1661 | "end\n", 1662 | "\n", 1663 | "C = similar(A, 1)\n", 1664 | "@cuda threads=length(A) rmse_kernel(C, A, B)\n", 1665 | "C" 1666 | ] 1667 | }, 1668 | { 1669 | "cell_type": "markdown", 1670 | "id": "5c0450ab", 1671 | "metadata": {}, 1672 | "source": [ 1673 | "This kernel only works when the array fits in a single block though:" 1674 | ] 1675 | }, 1676 | { 1677 | "cell_type": "code", 1678 | "execution_count": 33, 1679 | "id": "273206e1", 1680 | "metadata": {}, 1681 | "outputs": [ 1682 | { 1683 | "ename": "LoadError", 1684 | "evalue": "CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)", 1685 | "output_type": "error", 1686 | "traceback": [ 1687 | "CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)", 1688 | "", 1689 | "Stacktrace:", 1690 | " [1] throw_api_error(res::CUDA.cudaError_enum)", 1691 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/lib/cudadrv/error.jl:91", 1692 | " [2] macro expansion", 1693 | " @ ~/Julia/depot/packages/CUDA/rZxom/lib/cudadrv/error.jl:101 [inlined]", 1694 | " [3] cuLaunchKernel(f::CuFunction, gridDimX::UInt32, gridDimY::UInt32, gridDimZ::UInt32, blockDimX::UInt32, blockDimY::UInt32, blockDimZ::UInt32, sharedMemBytes::Int64, hStream::CuStream, kernelParams::Vector{Ptr{Nothing}}, extra::Ptr{Nothing})", 1695 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/lib/utils/call.jl:26", 1696 | " [4] #27", 1697 | " @ ~/Julia/depot/packages/CUDA/rZxom/lib/cudadrv/execution.jl:69 [inlined]", 1698 | " [5] macro expansion", 1699 | " @ ~/Julia/depot/packages/CUDA/rZxom/lib/cudadrv/execution.jl:33 [inlined]", 1700 | " [6] macro expansion", 1701 | " @ ./none:0 [inlined]", 1702 | " [7] pack_arguments(::CUDA.var\"#27#28\"{Bool, Int64, CuStream, CuFunction, CuDim3, CuDim3}, ::CUDA.KernelState, ::CuDeviceVector{Float32, 1}, ::CuDeviceMatrix{Float32, 1}, ::CuDeviceMatrix{Float32, 1})", 1703 | " @ CUDA ./none:0", 1704 | " [8] #launch#26", 1705 | " @ ~/Julia/depot/packages/CUDA/rZxom/lib/cudadrv/execution.jl:62 [inlined]", 1706 | " [9] #32", 1707 | " @ ~/Julia/depot/packages/CUDA/rZxom/lib/cudadrv/execution.jl:136 [inlined]", 1708 | " [10] macro expansion", 1709 | " @ ~/Julia/depot/packages/CUDA/rZxom/lib/cudadrv/execution.jl:95 [inlined]", 1710 | " [11] macro expansion", 1711 | " @ ./none:0 [inlined]", 1712 | " [12] convert_arguments", 1713 | " @ ./none:0 [inlined]", 1714 | " [13] #cudacall#31", 1715 | " @ ~/Julia/depot/packages/CUDA/rZxom/lib/cudadrv/execution.jl:135 [inlined]", 1716 | " [14] macro expansion", 1717 | " @ ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:204 [inlined]", 1718 | " [15] macro expansion", 1719 | " @ ./none:0 [inlined]", 1720 | " [16] call(::CUDA.HostKernel{typeof(rmse_kernel), Tuple{CuDeviceVector{Float32, 1}, CuDeviceMatrix{Float32, 1}, 
CuDeviceMatrix{Float32, 1}}}, ::CuDeviceVector{Float32, 1}, ::CuDeviceMatrix{Float32, 1}, ::CuDeviceMatrix{Float32, 1}; call_kwargs::Base.Pairs{Symbol, Int64, Tuple{Symbol, Symbol}, NamedTuple{(:threads, :blocks), Tuple{Int64, Int64}}})", 1721 | " @ CUDA ./none:0", 1722 | " [17] (::CUDA.HostKernel{typeof(rmse_kernel), Tuple{CuDeviceVector{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}}})(::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ::Vararg{Any}; threads::Int64, blocks::Int64, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})", 1723 | " @ CUDA ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:462", 1724 | " [18] top-level scope", 1725 | " @ ~/Julia/depot/packages/CUDA/rZxom/src/compiler/execution.jl:104", 1726 | " [19] eval", 1727 | " @ ./boot.jl:373 [inlined]", 1728 | " [20] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)", 1729 | " @ Base ./loading.jl:1196" 1730 | ] 1731 | } 1732 | ], 1733 | "source": [ 1734 | "A = CUDA.rand(1024,1024)\n", 1735 | "B = CUDA.rand(1024,1024)\n", 1736 | "@cuda threads=length(A) rmse_kernel(C, A, B)" 1737 | ] 1738 | }, 1739 | { 1740 | "cell_type": "markdown", 1741 | "id": "a3a151df", 1742 | "metadata": {}, 1743 | "source": [ 1744 | "We could just use multiple blocks, since our current implementation doesn't use any communication between threads. However, in a future notebook we _will_ add such communication, so for now let's make it so that we only need a single block to process this matrix.\n", 1745 | "\n", 1746 | "A good way to do so is to introduce a grid-stride loop. This has been explained in a previous notebook, so try adapting the kernel implementation:" 1747 | ] 1748 | }, 1749 | { 1750 | "cell_type": "code", 1751 | "execution_count": 34, 1752 | "id": "addd90a5", 1753 | "metadata": {}, 1754 | "outputs": [ 1755 | { 1756 | "data": { 1757 | "text/plain": [ 1758 | "1-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:\n", 1759 | " 0.40825698" 1760 | ] 1761 | }, 1762 | "execution_count": 34, 1763 | "metadata": {}, 1764 | "output_type": "execute_result" 1765 | } 1766 | ], 1767 | "source": [ 1768 | "function rmse_kernel(C, A, B) \n", 1769 | " # initialize the memory\n", 1770 | " if threadIdx().x == 1\n", 1771 | " C[1] = 0\n", 1772 | " end\n", 1773 | " sync_threads()\n", 1774 | " \n", 1775 | " # grid-stride loop so a single block can process the whole array\n", 1776 | " for i in threadIdx().x:blockDim().x:length(A)\n", 1777 | " a = A[i]\n", 1778 | " b = B[i]\n", 1779 | " CUDA.@atomic C[1] += (a-b)^2\n", 1780 | " end \n", 1781 | " sync_threads()\n", 1782 | " \n", 1783 | " # finalize the computation\n", 1784 | " if threadIdx().x == 1\n", 1785 | " C[1] = sqrt(C[1] / length(A))\n", 1786 | " end\n", 1787 | " return\n", 1788 | "end\n", 1789 | "\n", 1790 | "@cuda threads=256 rmse_kernel(C, A, B)\n", 1791 | "C" 1792 | ] 1793 | }, 1794 | { 1795 | "cell_type": "markdown", 1796 | "id": "b97e4dba", 1797 | "metadata": {}, 1798 | "source": [ 1799 | "## High-level kernel programming\n", 1800 | "\n", 1801 | "There are a couple of packages that aim to simplify kernel programming without resorting to array operations (which may result in extraneous kernel launches, more on that in some of the next notebooks)." 1802 | ] 1803 | }, 1804 | { 1805 | "cell_type": "markdown", 1806 | "id": "3946951a", 1807 | "metadata": {}, 1808 | "source": [ 1809 | "### Tullio.jl\n", 1810 | "\n", 1811 | "With Tullio, it's easy to write kernels using index notation. 
This makes it easy to express operations like our RMSE calculation in a single expression which typically will also be compiled to a single kernel:" 1812 | ] 1813 | }, 1814 | { 1815 | "cell_type": "code", 1816 | "execution_count": 35, 1817 | "id": "65b3b960", 1818 | "metadata": {}, 1819 | "outputs": [ 1820 | { 1821 | "data": { 1822 | "text/plain": [ 1823 | "10×10 Matrix{Float64}:\n", 1824 | " 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0\n", 1825 | " 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0\n", 1826 | " 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0\n", 1827 | " 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0\n", 1828 | " 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0\n", 1829 | " 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0\n", 1830 | " 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0\n", 1831 | " 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0\n", 1832 | " 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0\n", 1833 | " 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0" 1834 | ] 1835 | }, 1836 | "execution_count": 35, 1837 | "metadata": {}, 1838 | "output_type": "execute_result" 1839 | } 1840 | ], 1841 | "source": [ 1842 | "using Tullio\n", 1843 | "\n", 1844 | "A = ones(10, 10)\n", 1845 | "\n", 1846 | "# assigning with `:=` creates a new array\n", 1847 | "@tullio C[i,j] := A[i,j]" 1848 | ] 1849 | }, 1850 | { 1851 | "cell_type": "code", 1852 | "execution_count": 36, 1853 | "id": "b6296c7d", 1854 | "metadata": {}, 1855 | "outputs": [ 1856 | { 1857 | "data": { 1858 | "text/plain": [ 1859 | "10-element Vector{Float64}:\n", 1860 | " 10.0\n", 1861 | " 10.0\n", 1862 | " 10.0\n", 1863 | " 10.0\n", 1864 | " 10.0\n", 1865 | " 10.0\n", 1866 | " 10.0\n", 1867 | " 10.0\n", 1868 | " 10.0\n", 1869 | " 10.0" 1870 | ] 1871 | }, 1872 | "execution_count": 36, 1873 | "metadata": {}, 1874 | "output_type": "execute_result" 1875 | } 1876 | ], 1877 | "source": [ 1878 | "# dropping an index will sum across that dimension\n", 1879 | "@tullio C[i] := A[i,j]" 1880 | ] 1881 | }, 1882 | { 1883 | "cell_type": "code", 1884 | "execution_count": 37, 1885 | "id": "bb0c96ad", 1886 | "metadata": {}, 1887 | "outputs": [ 1888 | { 1889 | "data": { 1890 | "text/plain": [ 1891 | "0-dimensional Array{Float64, 0}:\n", 1892 | "2.0" 1893 | ] 1894 | }, 1895 | "execution_count": 37, 1896 | "metadata": {}, 1897 | "output_type": "execute_result" 1898 | } 1899 | ], 1900 | "source": [ 1901 | "# pipe operators can be used to apply functions 'outside' of the reduction\n", 1902 | "@tullio C[] := A[i,j] |> log10(_)" 1903 | ] 1904 | }, 1905 | { 1906 | "cell_type": "markdown", 1907 | "id": "08d63901", 1908 | "metadata": {}, 1909 | "source": [ 1910 | "With that explained, try to implement the matrix RMSE operation using index notation. 
Recall the original operation (first on CPU arrays):" 1911 | ] 1912 | }, 1913 | { 1914 | "cell_type": "code", 1915 | "execution_count": 38, 1916 | "id": "29407d2c", 1917 | "metadata": {}, 1918 | "outputs": [ 1919 | { 1920 | "data": { 1921 | "text/plain": [ 1922 | "0.40815849155534367" 1923 | ] 1924 | }, 1925 | "execution_count": 38, 1926 | "metadata": {}, 1927 | "output_type": "execute_result" 1928 | } 1929 | ], 1930 | "source": [ 1931 | "A = rand(1024, 1024)\n", 1932 | "B = rand(1024, 1024)\n", 1933 | "sqrt(sum((A-B).^2)/length(A))" 1934 | ] 1935 | }, 1936 | { 1937 | "cell_type": "markdown", 1938 | "id": "6b8fe8b4", 1939 | "metadata": {}, 1940 | "source": [ 1941 | "Now with Tullio:" 1942 | ] 1943 | }, 1944 | { 1945 | "cell_type": "code", 1946 | "execution_count": 39, 1947 | "id": "f183267d", 1948 | "metadata": {}, 1949 | "outputs": [ 1950 | { 1951 | "data": { 1952 | "text/plain": [ 1953 | "0-dimensional Array{Float64, 0}:\n", 1954 | "0.40815849155534617" 1955 | ] 1956 | }, 1957 | "execution_count": 39, 1958 | "metadata": {}, 1959 | "output_type": "execute_result" 1960 | } 1961 | ], 1962 | "source": [ 1963 | "@tullio C[] := (A[i,j] - B[i,j])^2 |> sqrt(_ / length(A))" 1964 | ] 1965 | }, 1966 | { 1967 | "cell_type": "markdown", 1968 | "id": "c2cc4201", 1969 | "metadata": {}, 1970 | "source": [ 1971 | "To use Tullio with GPU arrays, you need to install and import the relevant CUDA support packages:" 1972 | ] 1973 | }, 1974 | { 1975 | "cell_type": "code", 1976 | "execution_count": 40, 1977 | "id": "3fc9b8be", 1978 | "metadata": {}, 1979 | "outputs": [], 1980 | "source": [ 1981 | "using KernelAbstractions, CUDAKernels" 1982 | ] 1983 | }, 1984 | { 1985 | "cell_type": "code", 1986 | "execution_count": 41, 1987 | "id": "8463dec2", 1988 | "metadata": {}, 1989 | "outputs": [ 1990 | { 1991 | "data": { 1992 | "text/plain": [ 1993 | "0-dimensional CuArray{Float32, 0, CUDA.Mem.DeviceBuffer}:\n", 1994 | "0.40816107" 1995 | ] 1996 | }, 1997 | "execution_count": 41, 1998 | "metadata": {}, 1999 | "output_type": "execute_result" 2000 | } 2001 | ], 2002 | "source": [ 2003 | "A = CUDA.rand(1024, 1024)\n", 2004 | "B = CUDA.rand(1024, 1024)\n", 2005 | "@tullio C[] := (A[i,j] - B[i,j])^2 |> sqrt(_ / length(A))" 2006 | ] 2007 | }, 2008 | { 2009 | "cell_type": "markdown", 2010 | "id": "c0ed8649", 2011 | "metadata": {}, 2012 | "source": [ 2013 | "Tullio is great for quickly creating portable kernels (CPU, different GPU back-ends) for mathematical operations, and it can be seen as a generalization of broadcast." 2014 | ] 2015 | }, 2016 | { 2017 | "cell_type": "markdown", 2018 | "id": "24b863a6", 2019 | "metadata": {}, 2020 | "source": [ 2021 | "### KernelAbstractions.jl\n", 2022 | "\n", 2023 | "For a more flexible API, i.e. 
not restricted to Tullio's index notation, but still retaining Tullio's portability, you can consider the KernelAbstractions.jl framework that Tullio.jl is built on:" 2024 | ] 2025 | }, 2026 | { 2027 | "cell_type": "code", 2028 | "execution_count": 42, 2029 | "id": "8d513082", 2030 | "metadata": {}, 2031 | "outputs": [], 2032 | "source": [ 2033 | "using KernelAbstractions" 2034 | ] 2035 | }, 2036 | { 2037 | "cell_type": "code", 2038 | "execution_count": 43, 2039 | "id": "41110dd6", 2040 | "metadata": {}, 2041 | "outputs": [], 2042 | "source": [ 2043 | "@kernel function ka_kernel(A)\n", 2044 | " # simple kernel without multiple blocks\n", 2045 | " i = @index(Global, Linear)\n", 2046 | " \n", 2047 | " # first thread sets up the data\n", 2048 | " if i == 1\n", 2049 | " A[1] = 42\n", 2050 | " end\n", 2051 | " \n", 2052 | " @synchronize()\n", 2053 | " \n", 2054 | " # other threads can now read this data\n", 2055 | " if i != 1\n", 2056 | " A[i] = A[1]\n", 2057 | " end\n", 2058 | "end;" 2059 | ] 2060 | }, 2061 | { 2062 | "cell_type": "code", 2063 | "execution_count": 44, 2064 | "id": "fb65a2df", 2065 | "metadata": {}, 2066 | "outputs": [ 2067 | { 2068 | "data": { 2069 | "text/plain": [ 2070 | "2-element Vector{Float64}:\n", 2071 | " 42.0\n", 2072 | " 0.0" 2073 | ] 2074 | }, 2075 | "execution_count": 44, 2076 | "metadata": {}, 2077 | "output_type": "execute_result" 2078 | } 2079 | ], 2080 | "source": [ 2081 | "A = zeros(512)\n", 2082 | "\n", 2083 | "the_ka_kernel = ka_kernel(CPU(), 16)\n", 2084 | "event = the_ka_kernel(A, ndrange=size(A))\n", 2085 | "wait(event)\n", 2086 | "unique(A)" 2087 | ] 2088 | }, 2089 | { 2090 | "cell_type": "markdown", 2091 | "id": "35d2aee9", 2092 | "metadata": {}, 2093 | "source": [ 2094 | "The programming interface is now much closer to CUDA.jl's, while retaining platform portability!" 2095 | ] 2096 | }, 2097 | { 2098 | "cell_type": "code", 2099 | "execution_count": 45, 2100 | "id": "448c0832", 2101 | "metadata": {}, 2102 | "outputs": [ 2103 | { 2104 | "data": { 2105 | "text/plain": [ 2106 | "1-element Vector{Float32}:\n", 2107 | " 42.0" 2108 | ] 2109 | }, 2110 | "execution_count": 45, 2111 | "metadata": {}, 2112 | "output_type": "execute_result" 2113 | } 2114 | ], 2115 | "source": [ 2116 | "A = CUDA.zeros(512)\n", 2117 | "the_ka_kernel = ka_kernel(CUDADevice(), 16)\n", 2118 | "event = the_ka_kernel(A, ndrange=size(A))\n", 2119 | "wait(event)\n", 2120 | "unique(Array(A))" 2121 | ] 2122 | }, 2123 | { 2124 | "cell_type": "markdown", 2125 | "id": "0252a3bf", 2126 | "metadata": {}, 2127 | "source": [ 2128 | "The disadvantage of platform portability of course is that KernelAbstraction.jl's feature set is limited to the common denominator of all supported platforms. That means many CUDA features, like atomics or warp-level programming, are not supported. In addition, KernelAbstractions is built on Cassette.jl which will incur a significant compilation cost for nontrivial applications." 
2129 | ] 2130 | }, 2131 | { 2132 | "cell_type": "markdown", 2133 | "id": "a3ee6a32", 2134 | "metadata": {}, 2135 | "source": [ 2136 | "## Exercise: Batched matrix RMSE\n", 2137 | "\n", 2138 | "To extend our RMSE example to something more interesting (that we will use in later notebooks), let's extend the computation of the RMSE between two matrices to a batched version that computes `N` RMSEs:" 2139 | ] 2140 | }, 2141 | { 2142 | "cell_type": "code", 2143 | "execution_count": 46, 2144 | "id": "2249833f", 2145 | "metadata": {}, 2146 | "outputs": [], 2147 | "source": [ 2148 | "N = 16\n", 2149 | "A = CUDA.rand(1024, 1024, N)\n", 2150 | "B = CUDA.rand(1024, 1024, N)\n", 2151 | "CUDA.allowscalar(false)" 2152 | ] 2153 | }, 2154 | { 2155 | "cell_type": "code", 2156 | "execution_count": 47, 2157 | "id": "913fadfc", 2158 | "metadata": {}, 2159 | "outputs": [ 2160 | { 2161 | "data": { 2162 | "text/plain": [ 2163 | "16-element Vector{Float64}:\n", 2164 | " 0.40842095017433167\n", 2165 | " 0.40842196345329285\n", 2166 | " 0.40827152132987976\n", 2167 | " 0.40808382630348206\n", 2168 | " 0.4082046151161194\n", 2169 | " 0.407973051071167\n", 2170 | " 0.4077332019805908\n", 2171 | " 0.4081213176250458\n", 2172 | " 0.40818658471107483\n", 2173 | " 0.40828144550323486\n", 2174 | " 0.4080793261528015\n", 2175 | " 0.4081481993198395\n", 2176 | " 0.40807607769966125\n", 2177 | " 0.40809497237205505\n", 2178 | " 0.408576637506485\n", 2179 | " 0.4083676040172577" 2180 | ] 2181 | }, 2182 | "execution_count": 47, 2183 | "metadata": {}, 2184 | "output_type": "execute_result" 2185 | } 2186 | ], 2187 | "source": [ 2188 | "rmse(A, B) = sqrt(sum((A-B).^2)/length(A))\n", 2189 | "\n", 2190 | "rmses = Vector{Float64}(undef, N)\n", 2191 | "for i in 1:N\n", 2192 | " rmses[i] = rmse(A[:, :, i], B[:, :, i])\n", 2193 | "end\n", 2194 | "rmses" 2195 | ] 2196 | }, 2197 | { 2198 | "cell_type": "markdown", 2199 | "id": "394f1e57", 2200 | "metadata": {}, 2201 | "source": [ 2202 | "This is a pretty bad implementation, but we'll have a look at optimizing it in a future notebook. For now, let's just focus on a correct implementation." 2203 | ] 2204 | }, 2205 | { 2206 | "cell_type": "markdown", 2207 | "id": "e766193a", 2208 | "metadata": {}, 2209 | "source": [ 2210 | "First, let's try to extend the Tullio expression to correctly handle the batch dimension:" 2211 | ] 2212 | }, 2213 | { 2214 | "cell_type": "code", 2215 | "execution_count": 48, 2216 | "id": "49a01435", 2217 | "metadata": {}, 2218 | "outputs": [ 2219 | { 2220 | "data": { 2221 | "text/plain": [ 2222 | "16-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:\n", 2223 | " 0.4083086\n", 2224 | " 0.4083096\n", 2225 | " 0.4081559\n", 2226 | " 0.4079672\n", 2227 | " 0.40808588\n", 2228 | " 0.40785775\n", 2229 | " 0.40761772\n", 2230 | " 0.40800425\n", 2231 | " 0.40807247\n", 2232 | " 0.4081641\n", 2233 | " 0.4079632\n", 2234 | " 0.40804175\n", 2235 | " 0.407964\n", 2236 | " 0.40797997\n", 2237 | " 0.4084567\n", 2238 | " 0.40825087" 2239 | ] 2240 | }, 2241 | "execution_count": 48, 2242 | "metadata": {}, 2243 | "output_type": "execute_result" 2244 | } 2245 | ], 2246 | "source": [ 2247 | "@tullio C[k] := (A[i,j,k] - B[i,j,k])^2 |> sqrt(_ / (size(A,1)*size(A,2)))" 2248 | ] 2249 | }, 2250 | { 2251 | "cell_type": "markdown", 2252 | "id": "744937f4", 2253 | "metadata": {}, 2254 | "source": [ 2255 | "Note the manual length computation because Tullio doesn't like an additional `prod(size(A)[1:2])`." 
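,
    "\n",
    "\n",
    "For comparison, a sketch of the same batched RMSE using only array operations (reusing the `A` and `B` defined above); it typically launches a few more kernels than the Tullio version, but needs no custom kernel at all:\n",
    "\n",
    "```\n",
    "# reduce over the first two dimensions, then normalize and flatten per batch\n",
    "vec(sqrt.(sum((A .- B).^2; dims=(1,2)) ./ (size(A,1)*size(A,2))))\n",
    "```"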
2256 | ] 2257 | }, 2258 | { 2259 | "cell_type": "markdown", 2260 | "id": "0bf39d59", 2261 | "metadata": {}, 2262 | "source": [ 2263 | "Next, try to extend the grid-stride kernel implementation to handle multiple batches. We could just launch our kernel `N` times, but let's try and handle the batching *inside* the kernel. The easiest way to do so is to launch one block per batch and to fetch the batch number inside the kernel from the `blockIdx()` hardware indices.\n", 2264 | "\n", 2265 | "That poses a problem though, as we were using a linear index whereas we now need 3 indices (x, y, and batch). There are multiple possible solutions:\n", 2266 | "- generalize indexing to cartesian indices\n", 2267 | "- launch 2-dimensional blocks, and extend the grid-stride loop to cover both dimensions\n", 2268 | "- reshape the input to a 2D matrix (i.e. flatten the matrix dimensions)\n", 2269 | "\n", 2270 | "Let's start with reshaping, for simplicity:" 2271 | ] 2272 | }, 2273 | { 2274 | "cell_type": "code", 2275 | "execution_count": 49, 2276 | "id": "68295ac0", 2277 | "metadata": {}, 2278 | "outputs": [], 2279 | "source": [ 2280 | "A_flat = reshape(A, (prod(size(A)[1:2]),N))\n", 2281 | "B_flat = reshape(B, (prod(size(B)[1:2]),N))\n", 2282 | "C = similar(A, N);" 2283 | ] 2284 | }, 2285 | { 2286 | "cell_type": "code", 2287 | "execution_count": 50, 2288 | "id": "fbfa9594", 2289 | "metadata": {}, 2290 | "outputs": [ 2291 | { 2292 | "data": { 2293 | "text/plain": [ 2294 | "16-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:\n", 2295 | " 0.40830863\n", 2296 | " 0.40830982\n", 2297 | " 0.40815574\n", 2298 | " 0.40796724\n", 2299 | " 0.40808544\n", 2300 | " 0.40785766\n", 2301 | " 0.40761775\n", 2302 | " 0.40800443\n", 2303 | " 0.40807247\n", 2304 | " 0.40816417\n", 2305 | " 0.40796322\n", 2306 | " 0.40804172\n", 2307 | " 0.4079642\n", 2308 | " 0.40797964\n", 2309 | " 0.40845713\n", 2310 | " 0.4082506" 2311 | ] 2312 | }, 2313 | "execution_count": 50, 2314 | "metadata": {}, 2315 | "output_type": "execute_result" 2316 | } 2317 | ], 2318 | "source": [ 2319 | "function rmse_kernel(C, A, B) \n", 2320 | " batch = blockIdx().x\n", 2321 | "\n", 2322 | " # initialize the memory\n", 2323 | " if threadIdx().x == 1\n", 2324 | " C[batch] = 0\n", 2325 | " end\n", 2326 | " sync_threads()\n", 2327 | " \n", 2328 | " # grid-stride loop to process each batch in a block\n", 2329 | " for i in threadIdx().x:blockDim().x:size(A,1)\n", 2330 | " a = A[i, batch]\n", 2331 | " b = B[i, batch]\n", 2332 | " CUDA.@atomic C[batch] += (a-b)^2\n", 2333 | " end \n", 2334 | " sync_threads()\n", 2335 | " \n", 2336 | " # finalize the computation\n", 2337 | " if threadIdx().x == 1\n", 2338 | " C[batch] = sqrt(C[batch] / size(A,1))\n", 2339 | " end\n", 2340 | " return\n", 2341 | "end\n", 2342 | "\n", 2343 | "@cuda threads=256 blocks=N rmse_kernel(C, A_flat, B_flat)\n", 2344 | "C" 2345 | ] 2346 | }, 2347 | { 2348 | "cell_type": "markdown", 2349 | "id": "4e64a60c", 2350 | "metadata": {}, 2351 | "source": [ 2352 | "A much more general pattern for dealing with multiple independent datasets or batches within a single kernel (i.e. without launching multiple kernels, one for each batch, or without reshaping data) is to compute and pass separate cartesian indices to the kernel, and make sure those map into hardware indices the way we want. 
For example, here we have N-dimensional inputs whose last index represents the batch, so we can pass two separate cartesian indices:\n", 2353 | "- one representing the 'main' iteration space, where the last index doesn't count\n", 2354 | "- one representing the batches, having the samen dimensionality, but with only the last index set\n", 2355 | "\n", 2356 | "As we want each RMSE calculation between arrays from a single batch to happen within a single block (again, to simplify communication and synchronization), we should index the main cartesian indices object using a thread index, while using a block index for the batch indices. Within the kernel, we can then merge these two objects using the `max` operator to get a usable index. For more information on this technique, refer to the following blog post: https://julialang.org/blog/2016/02/iteration/." 2357 | ] 2358 | }, 2359 | { 2360 | "cell_type": "code", 2361 | "execution_count": 51, 2362 | "id": "fc242b28", 2363 | "metadata": {}, 2364 | "outputs": [ 2365 | { 2366 | "data": { 2367 | "text/plain": [ 2368 | "1024×1024×1 CartesianIndices{3, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}, Base.OneTo{Int64}}}:\n", 2369 | "[:, :, 1] =\n", 2370 | " CartesianIndex(1, 1, 1) … CartesianIndex(1, 1024, 1)\n", 2371 | " CartesianIndex(2, 1, 1) CartesianIndex(2, 1024, 1)\n", 2372 | " CartesianIndex(3, 1, 1) CartesianIndex(3, 1024, 1)\n", 2373 | " CartesianIndex(4, 1, 1) CartesianIndex(4, 1024, 1)\n", 2374 | " CartesianIndex(5, 1, 1) CartesianIndex(5, 1024, 1)\n", 2375 | " CartesianIndex(6, 1, 1) … CartesianIndex(6, 1024, 1)\n", 2376 | " CartesianIndex(7, 1, 1) CartesianIndex(7, 1024, 1)\n", 2377 | " CartesianIndex(8, 1, 1) CartesianIndex(8, 1024, 1)\n", 2378 | " CartesianIndex(9, 1, 1) CartesianIndex(9, 1024, 1)\n", 2379 | " CartesianIndex(10, 1, 1) CartesianIndex(10, 1024, 1)\n", 2380 | " CartesianIndex(11, 1, 1) … CartesianIndex(11, 1024, 1)\n", 2381 | " CartesianIndex(12, 1, 1) CartesianIndex(12, 1024, 1)\n", 2382 | " CartesianIndex(13, 1, 1) CartesianIndex(13, 1024, 1)\n", 2383 | " ⋮ ⋱ \n", 2384 | " CartesianIndex(1013, 1, 1) CartesianIndex(1013, 1024, 1)\n", 2385 | " CartesianIndex(1014, 1, 1) CartesianIndex(1014, 1024, 1)\n", 2386 | " CartesianIndex(1015, 1, 1) CartesianIndex(1015, 1024, 1)\n", 2387 | " CartesianIndex(1016, 1, 1) … CartesianIndex(1016, 1024, 1)\n", 2388 | " CartesianIndex(1017, 1, 1) CartesianIndex(1017, 1024, 1)\n", 2389 | " CartesianIndex(1018, 1, 1) CartesianIndex(1018, 1024, 1)\n", 2390 | " CartesianIndex(1019, 1, 1) CartesianIndex(1019, 1024, 1)\n", 2391 | " CartesianIndex(1020, 1, 1) CartesianIndex(1020, 1024, 1)\n", 2392 | " CartesianIndex(1021, 1, 1) … CartesianIndex(1021, 1024, 1)\n", 2393 | " CartesianIndex(1022, 1, 1) CartesianIndex(1022, 1024, 1)\n", 2394 | " CartesianIndex(1023, 1, 1) CartesianIndex(1023, 1024, 1)\n", 2395 | " CartesianIndex(1024, 1, 1) CartesianIndex(1024, 1024, 1)" 2396 | ] 2397 | }, 2398 | "execution_count": 51, 2399 | "metadata": {}, 2400 | "output_type": "execute_result" 2401 | } 2402 | ], 2403 | "source": [ 2404 | "Rmain = ntuple(i->i == ndims(A) ? 
Base.OneTo(1) : axes(A)[i], ndims(A)) |> CartesianIndices" 2405 | ] 2406 | }, 2407 | { 2408 | "cell_type": "code", 2409 | "execution_count": 52, 2410 | "id": "e9ee8b12", 2411 | "metadata": {}, 2412 | "outputs": [ 2413 | { 2414 | "data": { 2415 | "text/plain": [ 2416 | "1×1×16 CartesianIndices{3, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}, Base.OneTo{Int64}}}:\n", 2417 | "[:, :, 1] =\n", 2418 | " CartesianIndex(1, 1, 1)\n", 2419 | "\n", 2420 | "[:, :, 2] =\n", 2421 | " CartesianIndex(1, 1, 2)\n", 2422 | "\n", 2423 | "[:, :, 3] =\n", 2424 | " CartesianIndex(1, 1, 3)\n", 2425 | "\n", 2426 | ";;; … \n", 2427 | "\n", 2428 | "[:, :, 14] =\n", 2429 | " CartesianIndex(1, 1, 14)\n", 2430 | "\n", 2431 | "[:, :, 15] =\n", 2432 | " CartesianIndex(1, 1, 15)\n", 2433 | "\n", 2434 | "[:, :, 16] =\n", 2435 | " CartesianIndex(1, 1, 16)" 2436 | ] 2437 | }, 2438 | "execution_count": 52, 2439 | "metadata": {}, 2440 | "output_type": "execute_result" 2441 | } 2442 | ], 2443 | "source": [ 2444 | "Rbatch = ntuple(i->i != ndims(A) ? Base.OneTo(1) : axes(A)[i], ndims(A)) |> CartesianIndices" 2445 | ] 2446 | }, 2447 | { 2448 | "cell_type": "code", 2449 | "execution_count": 53, 2450 | "id": "112a6f22", 2451 | "metadata": {}, 2452 | "outputs": [ 2453 | { 2454 | "data": { 2455 | "text/plain": [ 2456 | "16-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:\n", 2457 | " 0.40830866\n", 2458 | " 0.4083102\n", 2459 | " 0.40815574\n", 2460 | " 0.40796766\n", 2461 | " 0.4080852\n", 2462 | " 0.40785813\n", 2463 | " 0.4076176\n", 2464 | " 0.4080042\n", 2465 | " 0.40807292\n", 2466 | " 0.4081641\n", 2467 | " 0.40796313\n", 2468 | " 0.4080418\n", 2469 | " 0.40796375\n", 2470 | " 0.40797976\n", 2471 | " 0.40845695\n", 2472 | " 0.4082506" 2473 | ] 2474 | }, 2475 | "execution_count": 53, 2476 | "metadata": {}, 2477 | "output_type": "execute_result" 2478 | } 2479 | ], 2480 | "source": [ 2481 | "function rmse_kernel(C, A, B, Rmain, Rbatch)\n", 2482 | " batch = blockIdx().x\n", 2483 | " Ibatch = Rbatch[batch]\n", 2484 | " \n", 2485 | " # initialize the memory\n", 2486 | " if threadIdx().x == 1\n", 2487 | " C[batch] = 0\n", 2488 | " end\n", 2489 | " sync_threads()\n", 2490 | " \n", 2491 | " # grid-stride loop to process each batch in a block\n", 2492 | " for i in threadIdx().x:blockDim().x:length(Rmain)\n", 2493 | " Imain = Rmain[i]\n", 2494 | " I = max(Imain, Ibatch)\n", 2495 | " a = A[I]\n", 2496 | " b = B[I]\n", 2497 | " CUDA.@atomic C[batch] += (a-b)^2\n", 2498 | " end \n", 2499 | " sync_threads()\n", 2500 | " \n", 2501 | " # finalize the computation\n", 2502 | " if threadIdx().x == 1\n", 2503 | " C[batch] = sqrt(C[batch] / length(Rmain))\n", 2504 | " end\n", 2505 | " return\n", 2506 | "end\n", 2507 | "\n", 2508 | "@cuda threads=256 blocks=N rmse_kernel(C, A, B, Rmain, Rbatch)\n", 2509 | "C" 2510 | ] 2511 | }, 2512 | { 2513 | "cell_type": "markdown", 2514 | "id": "7b022011", 2515 | "metadata": {}, 2516 | "source": [ 2517 | "We now have a fully general kernel that handles arbitrarily-sized inputs, treating the last dimension as the batch." 
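,
    "\n",
    "\n",
    "Note that `Rmain` and `Rbatch` encode the axes of the inputs, so when the shape of `A` and `B` changes they need to be rebuilt before launching. A small sketch (the same construction as above, wrapped in a hypothetical helper):\n",
    "\n",
    "```\n",
    "# hypothetical helper: recompute both index objects for a given batched array\n",
    "batch_indices(A) = (\n",
    "    CartesianIndices(ntuple(i -> i == ndims(A) ? Base.OneTo(1) : axes(A, i), ndims(A))),\n",
    "    CartesianIndices(ntuple(i -> i != ndims(A) ? Base.OneTo(1) : axes(A, i), ndims(A))))\n",
    "```\n",
    "\n",
    "With that, `Rmain, Rbatch = batch_indices(A)` before the `@cuda` launch keeps the indices in sync with the data."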
2518 | ] 2519 | }, 2520 | { 2521 | "cell_type": "code", 2522 | "execution_count": 54, 2523 | "id": "c6dd0822", 2524 | "metadata": {}, 2525 | "outputs": [ 2526 | { 2527 | "data": { 2528 | "text/plain": [ 2529 | "16-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:\n", 2530 | " 0.40225008\n", 2531 | " 0.4022747\n", 2532 | " 0.4023493\n", 2533 | " 0.40245104\n", 2534 | " 0.4025146\n", 2535 | " 0.40258238\n", 2536 | " 0.4026644\n", 2537 | " 0.40274608\n", 2538 | " 0.40289024\n", 2539 | " 0.40299597\n", 2540 | " 0.40308216\n", 2541 | " 0.40315735\n", 2542 | " 0.40317976\n", 2543 | " 0.4032032\n", 2544 | " 0.40324566\n", 2545 | " 0.40328714" 2546 | ] 2547 | }, 2548 | "execution_count": 54, 2549 | "metadata": {}, 2550 | "output_type": "execute_result" 2551 | } 2552 | ], 2553 | "source": [ 2554 | "A = CUDA.rand(10, 10, 10, 10, N)\n", 2555 | "B = CUDA.rand(10, 10, 10, 10, N)\n", 2556 | "@cuda threads=256 blocks=N rmse_kernel(C, A, B, Rmain, Rbatch)\n", 2557 | "C" 2558 | ] 2559 | } 2560 | ], 2561 | "metadata": { 2562 | "kernelspec": { 2563 | "display_name": "Julia 1.7", 2564 | "language": "julia", 2565 | "name": "julia-1.7" 2566 | }, 2567 | "language_info": { 2568 | "file_extension": ".jl", 2569 | "mimetype": "application/julia", 2570 | "name": "julia", 2571 | "version": "1.7.0" 2572 | } 2573 | }, 2574 | "nbformat": 4, 2575 | "nbformat_minor": 5 2576 | } 2577 | -------------------------------------------------------------------------------- /1-5-preparation.txt: -------------------------------------------------------------------------------- 1 | For tomorrow's sessions 2 | - register for an NVIDIA developer account at https://developer.nvidia.com/login 3 | - download nsight systems 2021.4.1.73 from https://developer.nvidia.com/gameworksdownload#?dn=nsight-systems-2021-4-1-73 4 | - install it (does not require admin priviliges, can live in a local folder) 5 | 6 | Also make sure you can execute NSight Systems on Piz Daint: 7 | export PATH=/scratch/snx3000/class99/nsight-systems-2021.4.1/bin:$PATH 8 | ncu --version 9 | If not, let us know! -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 Tim Besard 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /Manifest.toml: -------------------------------------------------------------------------------- 1 | # This file is machine-generated - editing it directly is not advised 2 | 3 | [[AbstractFFTs]] 4 | deps = ["LinearAlgebra"] 5 | git-tree-sha1 = "485ee0867925449198280d4af84bdb46a2a404d0" 6 | uuid = "621f4979-c628-5d54-868e-fcf4e3e8185c" 7 | version = "1.0.1" 8 | 9 | [[Adapt]] 10 | deps = ["LinearAlgebra"] 11 | git-tree-sha1 = "84918055d15b3114ede17ac6a7182f68870c16f7" 12 | uuid = "79e6a3ab-5dfb-504d-930d-738a2a938a0e" 13 | version = "3.3.1" 14 | 15 | [[ArgTools]] 16 | uuid = "0dad84c5-d112-42e6-8d28-ef12dabb789f" 17 | 18 | [[ArrayInterface]] 19 | deps = ["Compat", "IfElse", "LinearAlgebra", "Requires", "SparseArrays", "Static"] 20 | git-tree-sha1 = "d9352737cef8525944bf9ef34392d756321cbd54" 21 | uuid = "4fba245c-0d91-5ea0-9b3e-6abc04ee57a9" 22 | version = "3.1.38" 23 | 24 | [[Artifacts]] 25 | uuid = "56f22d72-fd6d-98f1-02f0-08ddc0907c33" 26 | 27 | [[AxisAlgorithms]] 28 | deps = ["LinearAlgebra", "Random", "SparseArrays", "WoodburyMatrices"] 29 | git-tree-sha1 = "66771c8d21c8ff5e3a93379480a2307ac36863f7" 30 | uuid = "13072b0f-2c55-5437-9ae7-d433b7a33950" 31 | version = "1.0.1" 32 | 33 | [[AxisArrays]] 34 | deps = ["Dates", "IntervalSets", "IterTools", "RangeArrays"] 35 | git-tree-sha1 = "d127d5e4d86c7680b20c35d40b503c74b9a39b5e" 36 | uuid = "39de3d68-74b9-583c-8d2d-e117c070f3a9" 37 | version = "0.4.4" 38 | 39 | [[BFloat16s]] 40 | deps = ["LinearAlgebra", "Printf", "Random", "Test"] 41 | git-tree-sha1 = "a598ecb0d717092b5539dbbe890c98bac842b072" 42 | uuid = "ab4f0b2a-ad5b-11e8-123f-65d77653426b" 43 | version = "0.2.0" 44 | 45 | [[Base64]] 46 | uuid = "2a0f44e3-6c83-55bd-87e4-b1978d98bd5f" 47 | 48 | [[BenchmarkTools]] 49 | deps = ["JSON", "Logging", "Printf", "Profile", "Statistics", "UUIDs"] 50 | git-tree-sha1 = "da2e31a77bdaa26a9951214842dabebb5016c08f" 51 | repo-rev = "104f4c1e210da1933ace369d6db1393cf23ac102" 52 | repo-url = "https://github.com/JuliaCI/BenchmarkTools.jl.git" 53 | uuid = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf" 54 | version = "1.2.0" 55 | 56 | [[CEnum]] 57 | git-tree-sha1 = "215a9aa4a1f23fbd05b92769fdd62559488d70e9" 58 | uuid = "fa961155-64e5-5f13-b03f-caf6b980ea82" 59 | version = "0.4.1" 60 | 61 | [[CUDA]] 62 | deps = ["AbstractFFTs", "Adapt", "BFloat16s", "CEnum", "CompilerSupportLibraries_jll", "ExprTools", "GPUArrays", "GPUCompiler", "LLVM", "LazyArtifacts", "Libdl", "LinearAlgebra", "Logging", "Printf", "Random", "Random123", "RandomNumbers", "Reexport", "Requires", "SparseArrays", "SpecialFunctions", "TimerOutputs"] 63 | git-tree-sha1 = "b703ede58945ebc836bad953d4f5aaeca2aa2114" 64 | repo-rev = "d3d5e7567d08e10b500f84d5c5b1a9785cc99083" 65 | repo-url = "https://github.com/JuliaGPU/CUDA.jl.git" 66 | uuid = "052768ef-5323-5732-b1bb-66c8b64840ba" 67 | version = "3.5.0" 68 | 69 | [[CUDAKernels]] 70 | deps = ["Adapt", "CUDA", "Cassette", "KernelAbstractions", "SpecialFunctions", "StaticArrays"] 71 | git-tree-sha1 = "3ec28af1d3680c3a3decfe8d90668033b5d7dda7" 72 | uuid = "72cfdca4-0801-4ab0-bf6a-d52aa10adc57" 73 | version = "0.3.1" 74 | 75 | [[Cassette]] 76 | git-tree-sha1 = "6ce3cd755d4130d43bab24ea5181e77b89b51839" 77 | uuid = "7057c7e9-c182-5462-911a-8362d720325c" 78 | version = "0.3.9" 79 | 80 | [[CatIndices]] 81 | deps = ["CustomUnitRanges", "OffsetArrays"] 82 | git-tree-sha1 = "a0f80a09780eed9b1d106a1bf62041c2efc995bc" 83 | uuid = "aafaddc9-749c-510e-ac4f-586e18779b91" 84 | version = 
"0.2.2" 85 | 86 | [[ChainRulesCore]] 87 | deps = ["Compat", "LinearAlgebra", "SparseArrays"] 88 | git-tree-sha1 = "3533f5a691e60601fe60c90d8bc47a27aa2907ec" 89 | uuid = "d360d2e6-b24c-11e9-a2a3-2a2ae2dbcce4" 90 | version = "1.11.0" 91 | 92 | [[ColorTypes]] 93 | deps = ["FixedPointNumbers", "Random"] 94 | git-tree-sha1 = "32a2b8af383f11cbb65803883837a149d10dfe8a" 95 | uuid = "3da002f7-5984-5a60-b8a6-cbb66c0b333f" 96 | version = "0.10.12" 97 | 98 | [[ColorVectorSpace]] 99 | deps = ["ColorTypes", "Colors", "FixedPointNumbers", "LinearAlgebra", "SpecialFunctions", "Statistics", "StatsBase"] 100 | git-tree-sha1 = "4d17724e99f357bfd32afa0a9e2dda2af31a9aea" 101 | uuid = "c3611d14-8923-5661-9e6a-0046d554d3a4" 102 | version = "0.8.7" 103 | 104 | [[Colors]] 105 | deps = ["ColorTypes", "FixedPointNumbers", "Reexport"] 106 | git-tree-sha1 = "417b0ed7b8b838aa6ca0a87aadf1bb9eb111ce40" 107 | uuid = "5ae59095-9a9b-59fe-a467-6f913c188581" 108 | version = "0.12.8" 109 | 110 | [[Compat]] 111 | deps = ["Base64", "Dates", "DelimitedFiles", "Distributed", "InteractiveUtils", "LibGit2", "Libdl", "LinearAlgebra", "Markdown", "Mmap", "Pkg", "Printf", "REPL", "Random", "SHA", "Serialization", "SharedArrays", "Sockets", "SparseArrays", "Statistics", "Test", "UUIDs", "Unicode"] 112 | git-tree-sha1 = "dce3e3fea680869eaa0b774b2e8343e9ff442313" 113 | uuid = "34da2185-b29b-5c13-b0c7-acf172513d20" 114 | version = "3.40.0" 115 | 116 | [[CompilerSupportLibraries_jll]] 117 | deps = ["Artifacts", "Libdl"] 118 | uuid = "e66e0078-7015-5450-92f7-15fbd957f2ae" 119 | 120 | [[ComputationalResources]] 121 | git-tree-sha1 = "52cb3ec90e8a8bea0e62e275ba577ad0f74821f7" 122 | uuid = "ed09eef8-17a6-5b46-8889-db040fac31e3" 123 | version = "0.3.2" 124 | 125 | [[CoordinateTransformations]] 126 | deps = ["LinearAlgebra", "StaticArrays"] 127 | git-tree-sha1 = "681ea870b918e7cff7111da58791d7f718067a19" 128 | uuid = "150eb455-5306-5404-9cee-2592286d6298" 129 | version = "0.6.2" 130 | 131 | [[CustomUnitRanges]] 132 | git-tree-sha1 = "1a3f97f907e6dd8983b744d2642651bb162a3f7a" 133 | uuid = "dc8bdbbb-1ca9-579f-8c36-e416f6a65cce" 134 | version = "1.0.2" 135 | 136 | [[DataAPI]] 137 | git-tree-sha1 = "cc70b17275652eb47bc9e5f81635981f13cea5c8" 138 | uuid = "9a962f9c-6df0-11e9-0e5d-c546b8b5ee8a" 139 | version = "1.9.0" 140 | 141 | [[DataStructures]] 142 | deps = ["Compat", "InteractiveUtils", "OrderedCollections"] 143 | git-tree-sha1 = "7d9d316f04214f7efdbb6398d545446e246eff02" 144 | uuid = "864edb3b-99cc-5e75-8d2d-829cb0a9cfe8" 145 | version = "0.18.10" 146 | 147 | [[Dates]] 148 | deps = ["Printf"] 149 | uuid = "ade2ca70-3891-5945-98fb-dc099432e06a" 150 | 151 | [[DelimitedFiles]] 152 | deps = ["Mmap"] 153 | uuid = "8bb1440f-4735-579b-a4ab-409b98df4dab" 154 | 155 | [[DiffRules]] 156 | deps = ["NaNMath", "Random", "SpecialFunctions"] 157 | git-tree-sha1 = "7220bc21c33e990c14f4a9a319b1d242ebc5b269" 158 | uuid = "b552c78f-8df3-52c6-915a-8e097449b14b" 159 | version = "1.3.1" 160 | 161 | [[Distances]] 162 | deps = ["LinearAlgebra", "Statistics", "StatsAPI"] 163 | git-tree-sha1 = "837c83e5574582e07662bbbba733964ff7c26b9d" 164 | uuid = "b4f34e82-e78d-54a5-968a-f98e89d6e8f7" 165 | version = "0.10.6" 166 | 167 | [[Distributed]] 168 | deps = ["Random", "Serialization", "Sockets"] 169 | uuid = "8ba89e20-285c-5b6f-9357-94700520ee1b" 170 | 171 | [[DocStringExtensions]] 172 | deps = ["LibGit2"] 173 | git-tree-sha1 = "b19534d1895d702889b219c382a6e18010797f0b" 174 | uuid = "ffbed154-4ef7-542d-bbb7-c09d3a79fcae" 175 | version = "0.8.6" 176 | 177 | [[Downloads]] 178 | 
deps = ["ArgTools", "LibCURL", "NetworkOptions"] 179 | uuid = "f43a241f-c20a-4ad4-852c-f6b1247861c6" 180 | 181 | [[EllipsisNotation]] 182 | deps = ["ArrayInterface"] 183 | git-tree-sha1 = "8041575f021cba5a099a456b4163c9a08b566a02" 184 | uuid = "da5c29d0-fa7d-589e-88eb-ea29b0a81949" 185 | version = "1.1.0" 186 | 187 | [[ExprTools]] 188 | git-tree-sha1 = "b7e3d17636b348f005f11040025ae8c6f645fe92" 189 | uuid = "e2ba6199-217a-4e67-a87a-7c52f15ade04" 190 | version = "0.1.6" 191 | 192 | [[FFTViews]] 193 | deps = ["CustomUnitRanges", "FFTW"] 194 | git-tree-sha1 = "cbdf14d1e8c7c8aacbe8b19862e0179fd08321c2" 195 | uuid = "4f61f5a4-77b1-5117-aa51-3ab5ef4ef0cd" 196 | version = "0.3.2" 197 | 198 | [[FFTW]] 199 | deps = ["AbstractFFTs", "FFTW_jll", "LinearAlgebra", "MKL_jll", "Preferences", "Reexport"] 200 | git-tree-sha1 = "463cb335fa22c4ebacfd1faba5fde14edb80d96c" 201 | uuid = "7a1cc6ca-52ef-59f5-83cd-3a7055c09341" 202 | version = "1.4.5" 203 | 204 | [[FFTW_jll]] 205 | deps = ["Artifacts", "JLLWrappers", "Libdl", "Pkg"] 206 | git-tree-sha1 = "c6033cc3892d0ef5bb9cd29b7f2f0331ea5184ea" 207 | uuid = "f5851436-0d7a-5f13-b9de-f02708fd171a" 208 | version = "3.3.10+0" 209 | 210 | [[FileIO]] 211 | deps = ["Pkg", "Requires", "UUIDs"] 212 | git-tree-sha1 = "2db648b6712831ecb333eae76dbfd1c156ca13bb" 213 | uuid = "5789e2e9-d7fb-5bc7-8068-2c6fae9b9549" 214 | version = "1.11.2" 215 | 216 | [[FixedPointNumbers]] 217 | deps = ["Statistics"] 218 | git-tree-sha1 = "335bfdceacc84c5cdf16aadc768aa5ddfc5383cc" 219 | uuid = "53c48c17-4a7d-5ca2-90c5-79b7896eea93" 220 | version = "0.8.4" 221 | 222 | [[GPUArrays]] 223 | deps = ["Adapt", "LinearAlgebra", "Printf", "Random", "Serialization", "Statistics"] 224 | git-tree-sha1 = "7772508f17f1d482fe0df72cabc5b55bec06bbe0" 225 | uuid = "0c68f7d7-f131-5f86-a1c3-88cf8149b2d7" 226 | version = "8.1.2" 227 | 228 | [[GPUCompiler]] 229 | deps = ["ExprTools", "InteractiveUtils", "LLVM", "Libdl", "Logging", "TimerOutputs", "UUIDs"] 230 | git-tree-sha1 = "77d915a0af27d474f0aaf12fcd46c400a552e84c" 231 | uuid = "61eb1bfa-7361-4325-ad38-22787b887f55" 232 | version = "0.13.7" 233 | 234 | [[Graphics]] 235 | deps = ["Colors", "LinearAlgebra", "NaNMath"] 236 | git-tree-sha1 = "1c5a84319923bea76fa145d49e93aa4394c73fc2" 237 | uuid = "a2bd30eb-e257-5431-a919-1863eab51364" 238 | version = "1.1.1" 239 | 240 | [[IdentityRanges]] 241 | deps = ["OffsetArrays"] 242 | git-tree-sha1 = "be8fcd695c4da16a1d6d0cd213cb88090a150e3b" 243 | uuid = "bbac6d45-d8f3-5730-bfe4-7a449cd117ca" 244 | version = "0.3.1" 245 | 246 | [[IfElse]] 247 | git-tree-sha1 = "debdd00ffef04665ccbb3e150747a77560e8fad1" 248 | uuid = "615f187c-cbe4-4ef1-ba3b-2fcf58d6d173" 249 | version = "0.1.1" 250 | 251 | [[ImageAxes]] 252 | deps = ["AxisArrays", "ImageCore", "Reexport", "SimpleTraits"] 253 | git-tree-sha1 = "794ad1d922c432082bc1aaa9fa8ffbd1fe74e621" 254 | uuid = "2803e5a7-5153-5ecf-9a86-9b4c37f5f5ac" 255 | version = "0.6.9" 256 | 257 | [[ImageContrastAdjustment]] 258 | deps = ["ColorVectorSpace", "ImageCore", "ImageTransformations", "Parameters"] 259 | git-tree-sha1 = "2e6084db6cccab11fe0bc3e4130bd3d117092ed9" 260 | uuid = "f332f351-ec65-5f6a-b3d1-319c6670881a" 261 | version = "0.3.7" 262 | 263 | [[ImageCore]] 264 | deps = ["AbstractFFTs", "Colors", "FixedPointNumbers", "Graphics", "MappedArrays", "MosaicViews", "OffsetArrays", "PaddedViews", "Reexport"] 265 | git-tree-sha1 = "db645f20b59f060d8cfae696bc9538d13fd86416" 266 | uuid = "a09fc81d-aa75-5fe9-8630-4744c3626534" 267 | version = "0.8.22" 268 | 269 | [[ImageDistances]] 270 | deps = 
["ColorVectorSpace", "Distances", "ImageCore", "ImageMorphology", "LinearAlgebra", "Statistics"] 271 | git-tree-sha1 = "6378c34a3c3a216235210d19b9f495ecfff2f85f" 272 | uuid = "51556ac3-7006-55f5-8cb3-34580c88182d" 273 | version = "0.2.13" 274 | 275 | [[ImageFiltering]] 276 | deps = ["CatIndices", "ColorVectorSpace", "ComputationalResources", "DataStructures", "FFTViews", "FFTW", "ImageCore", "LinearAlgebra", "OffsetArrays", "Requires", "SparseArrays", "StaticArrays", "Statistics", "TiledIteration"] 277 | git-tree-sha1 = "bf96839133212d3eff4a1c3a80c57abc7cfbf0ce" 278 | uuid = "6a3955dd-da59-5b1f-98d4-e7296123deb5" 279 | version = "0.6.21" 280 | 281 | [[ImageIO]] 282 | deps = ["FileIO", "Netpbm", "OpenEXR", "PNGFiles", "TiffImages", "UUIDs"] 283 | git-tree-sha1 = "a2951c93684551467265e0e32b577914f69532be" 284 | uuid = "82e4d734-157c-48bb-816b-45c225c6df19" 285 | version = "0.5.9" 286 | 287 | [[ImageMetadata]] 288 | deps = ["AxisArrays", "ColorVectorSpace", "ImageAxes", "ImageCore", "IndirectArrays"] 289 | git-tree-sha1 = "ae76038347dc4edcdb06b541595268fca65b6a42" 290 | uuid = "bc367c6b-8a6b-528e-b4bd-a4b897500b49" 291 | version = "0.9.5" 292 | 293 | [[ImageMorphology]] 294 | deps = ["ColorVectorSpace", "ImageCore", "LinearAlgebra", "TiledIteration"] 295 | git-tree-sha1 = "68e7cbcd7dfaa3c2f74b0a8ab3066f5de8f2b71d" 296 | uuid = "787d08f9-d448-5407-9aad-5290dd7ab264" 297 | version = "0.2.11" 298 | 299 | [[ImageQualityIndexes]] 300 | deps = ["ColorVectorSpace", "ImageCore", "ImageDistances", "ImageFiltering", "OffsetArrays", "Statistics"] 301 | git-tree-sha1 = "1198f85fa2481a3bb94bf937495ba1916f12b533" 302 | uuid = "2996bd0c-7a13-11e9-2da2-2f5ce47296a9" 303 | version = "0.2.2" 304 | 305 | [[ImageShow]] 306 | deps = ["Base64", "FileIO", "ImageCore", "Requires"] 307 | git-tree-sha1 = "c9df184bc7c2e665f971079174aabb7d18f1845f" 308 | uuid = "4e3cecfd-b093-5904-9786-8bbb286a6a31" 309 | version = "0.2.3" 310 | 311 | [[ImageTransformations]] 312 | deps = ["AxisAlgorithms", "ColorVectorSpace", "CoordinateTransformations", "IdentityRanges", "ImageCore", "Interpolations", "OffsetArrays", "Rotations", "StaticArrays"] 313 | git-tree-sha1 = "e4cc551e4295a5c96545bb3083058c24b78d4cf0" 314 | uuid = "02fcd773-0e25-5acc-982a-7f6622650795" 315 | version = "0.8.13" 316 | 317 | [[Images]] 318 | deps = ["AxisArrays", "Base64", "ColorVectorSpace", "FileIO", "Graphics", "ImageAxes", "ImageContrastAdjustment", "ImageCore", "ImageDistances", "ImageFiltering", "ImageMetadata", "ImageMorphology", "ImageQualityIndexes", "ImageShow", "ImageTransformations", "IndirectArrays", "OffsetArrays", "Random", "Reexport", "SparseArrays", "StaticArrays", "Statistics", "StatsBase", "TiledIteration"] 319 | git-tree-sha1 = "535bcaae047f017f4fd7331ee859b75f2b27e505" 320 | uuid = "916415d5-f1e6-5110-898d-aaa5f9f070e0" 321 | version = "0.23.3" 322 | 323 | [[Imath_jll]] 324 | deps = ["Artifacts", "JLLWrappers", "Libdl", "Pkg"] 325 | git-tree-sha1 = "87f7662e03a649cffa2e05bf19c303e168732d3e" 326 | uuid = "905a6f67-0a94-5f89-b386-d35d92009cd1" 327 | version = "3.1.2+0" 328 | 329 | [[IndirectArrays]] 330 | git-tree-sha1 = "c2a145a145dc03a7620af1444e0264ef907bd44f" 331 | uuid = "9b13fd28-a010-5f03-acff-a1bbcff69959" 332 | version = "0.5.1" 333 | 334 | [[Inflate]] 335 | git-tree-sha1 = "f5fc07d4e706b84f72d54eedcc1c13d92fb0871c" 336 | uuid = "d25df0c9-e2be-5dd7-82c8-3ad0b3e990b9" 337 | version = "0.1.2" 338 | 339 | [[IntelOpenMP_jll]] 340 | deps = ["Artifacts", "JLLWrappers", "Libdl", "Pkg"] 341 | git-tree-sha1 = 
"d979e54b71da82f3a65b62553da4fc3d18c9004c" 342 | uuid = "1d5cc7b8-4909-519e-a0f8-d0f5ad9712d0" 343 | version = "2018.0.3+2" 344 | 345 | [[InteractiveUtils]] 346 | deps = ["Markdown"] 347 | uuid = "b77e0a4c-d291-57a0-90e8-8db25a27a240" 348 | 349 | [[Interpolations]] 350 | deps = ["AxisAlgorithms", "ChainRulesCore", "LinearAlgebra", "OffsetArrays", "Random", "Ratios", "Requires", "SharedArrays", "SparseArrays", "StaticArrays", "WoodburyMatrices"] 351 | git-tree-sha1 = "61aa005707ea2cebf47c8d780da8dc9bc4e0c512" 352 | uuid = "a98d9a8b-a2ab-59e6-89dd-64a1c18fca59" 353 | version = "0.13.4" 354 | 355 | [[IntervalSets]] 356 | deps = ["Dates", "EllipsisNotation", "Statistics"] 357 | git-tree-sha1 = "3cc368af3f110a767ac786560045dceddfc16758" 358 | uuid = "8197267c-284f-5f27-9208-e0e47529a953" 359 | version = "0.5.3" 360 | 361 | [[InverseFunctions]] 362 | deps = ["Test"] 363 | git-tree-sha1 = "f0c6489b12d28fb4c2103073ec7452f3423bd308" 364 | uuid = "3587e190-3f89-42d0-90ee-14403ec27112" 365 | version = "0.1.1" 366 | 367 | [[IrrationalConstants]] 368 | git-tree-sha1 = "7fd44fd4ff43fc60815f8e764c0f352b83c49151" 369 | uuid = "92d709cd-6900-40b7-9082-c6be49f344b6" 370 | version = "0.1.1" 371 | 372 | [[IterTools]] 373 | git-tree-sha1 = "05110a2ab1fc5f932622ffea2a003221f4782c18" 374 | uuid = "c8e1da08-722c-5040-9ed9-7db0dc04731e" 375 | version = "1.3.0" 376 | 377 | [[JLLWrappers]] 378 | deps = ["Preferences"] 379 | git-tree-sha1 = "642a199af8b68253517b80bd3bfd17eb4e84df6e" 380 | uuid = "692b3bcd-3c85-4b1f-b108-f13ce0eb3210" 381 | version = "1.3.0" 382 | 383 | [[JSON]] 384 | deps = ["Dates", "Mmap", "Parsers", "Unicode"] 385 | git-tree-sha1 = "8076680b162ada2a031f707ac7b4953e30667a37" 386 | uuid = "682c06a0-de6a-54ab-a142-c8b1cf79cde6" 387 | version = "0.21.2" 388 | 389 | [[KernelAbstractions]] 390 | deps = ["Adapt", "Cassette", "InteractiveUtils", "MacroTools", "SpecialFunctions", "StaticArrays", "UUIDs"] 391 | git-tree-sha1 = "5e6c70389c1b1e40adb81664ca8cea6ce8127afc" 392 | uuid = "63c18a36-062a-441e-b654-da1e3ab1ce7c" 393 | version = "0.7.0" 394 | 395 | [[LLVM]] 396 | deps = ["CEnum", "LLVMExtra_jll", "Libdl", "Printf", "Unicode"] 397 | git-tree-sha1 = "46092047ca4edc10720ecab437c42283cd7c44f3" 398 | uuid = "929cbde3-209d-540e-8aea-75f648917ca0" 399 | version = "4.6.0" 400 | 401 | [[LLVMExtra_jll]] 402 | deps = ["Artifacts", "JLLWrappers", "Libdl", "Pkg"] 403 | git-tree-sha1 = "6a2af408fe809c4f1a54d2b3f188fdd3698549d6" 404 | uuid = "dad2f222-ce93-54a1-a47d-0025e8a3acab" 405 | version = "0.0.11+0" 406 | 407 | [[LazyArtifacts]] 408 | deps = ["Artifacts", "Pkg"] 409 | uuid = "4af54fe1-eca0-43a8-85a7-787d91b784e3" 410 | 411 | [[LibCURL]] 412 | deps = ["LibCURL_jll", "MozillaCACerts_jll"] 413 | uuid = "b27032c2-a3e7-50c8-80cd-2d36dbcbfd21" 414 | 415 | [[LibCURL_jll]] 416 | deps = ["Artifacts", "LibSSH2_jll", "Libdl", "MbedTLS_jll", "Zlib_jll", "nghttp2_jll"] 417 | uuid = "deac9b47-8bc7-5906-a0fe-35ac56dc84c0" 418 | 419 | [[LibGit2]] 420 | deps = ["Base64", "NetworkOptions", "Printf", "SHA"] 421 | uuid = "76f85450-5226-5b5a-8eaa-529ad045b433" 422 | 423 | [[LibSSH2_jll]] 424 | deps = ["Artifacts", "Libdl", "MbedTLS_jll"] 425 | uuid = "29816b5a-b9ab-546f-933c-edad1886dfa8" 426 | 427 | [[Libdl]] 428 | uuid = "8f399da3-3557-5675-b5ff-fb832c97cbdb" 429 | 430 | [[LinearAlgebra]] 431 | deps = ["Libdl"] 432 | uuid = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e" 433 | 434 | [[LogExpFunctions]] 435 | deps = ["ChainRulesCore", "DocStringExtensions", "InverseFunctions", "IrrationalConstants", "LinearAlgebra"] 436 | git-tree-sha1 
= "6193c3815f13ba1b78a51ce391db8be016ae9214" 437 | uuid = "2ab3a3ac-af41-5b50-aa03-7779005ae688" 438 | version = "0.3.4" 439 | 440 | [[Logging]] 441 | uuid = "56ddb016-857b-54e1-b83d-db4d58db5568" 442 | 443 | [[MKL_jll]] 444 | deps = ["Artifacts", "IntelOpenMP_jll", "JLLWrappers", "LazyArtifacts", "Libdl", "Pkg"] 445 | git-tree-sha1 = "5455aef09b40e5020e1520f551fa3135040d4ed0" 446 | uuid = "856f044c-d86e-5d09-b602-aeab76dc8ba7" 447 | version = "2021.1.1+2" 448 | 449 | [[MacroTools]] 450 | deps = ["Markdown", "Random"] 451 | git-tree-sha1 = "3d3e902b31198a27340d0bf00d6ac452866021cf" 452 | uuid = "1914dd2f-81c6-5fcd-8719-6d5c9610ff09" 453 | version = "0.5.9" 454 | 455 | [[MappedArrays]] 456 | git-tree-sha1 = "e8b359ef06ec72e8c030463fe02efe5527ee5142" 457 | uuid = "dbb5928d-eab1-5f90-85c2-b9b0edb7c900" 458 | version = "0.4.1" 459 | 460 | [[Markdown]] 461 | deps = ["Base64"] 462 | uuid = "d6f4376e-aef5-505a-96c1-9c027394607a" 463 | 464 | [[MbedTLS_jll]] 465 | deps = ["Artifacts", "Libdl"] 466 | uuid = "c8ffd9c3-330d-5841-b78e-0817d7145fa1" 467 | 468 | [[Missings]] 469 | deps = ["DataAPI"] 470 | git-tree-sha1 = "bf210ce90b6c9eed32d25dbcae1ebc565df2687f" 471 | uuid = "e1d29d7a-bbdc-5cf2-9ac0-f12de2c33e28" 472 | version = "1.0.2" 473 | 474 | [[Mmap]] 475 | uuid = "a63ad114-7e13-5084-954f-fe012c677804" 476 | 477 | [[MosaicViews]] 478 | deps = ["MappedArrays", "OffsetArrays", "PaddedViews", "StackViews"] 479 | git-tree-sha1 = "b34e3bc3ca7c94914418637cb10cc4d1d80d877d" 480 | uuid = "e94cdb99-869f-56ef-bcf0-1ae2bcbe0389" 481 | version = "0.3.3" 482 | 483 | [[MozillaCACerts_jll]] 484 | uuid = "14a3606d-f60d-562e-9121-12d972cd8159" 485 | 486 | [[NaNMath]] 487 | git-tree-sha1 = "bfe47e760d60b82b66b61d2d44128b62e3a369fb" 488 | uuid = "77ba4419-2d1f-58cd-9bb1-8ffee604a2e3" 489 | version = "0.3.5" 490 | 491 | [[Netpbm]] 492 | deps = ["ColorVectorSpace", "FileIO", "ImageCore"] 493 | git-tree-sha1 = "09589171688f0039f13ebe0fdcc7288f50228b52" 494 | uuid = "f09324ee-3d7c-5217-9330-fc30815ba969" 495 | version = "1.0.1" 496 | 497 | [[NetworkOptions]] 498 | uuid = "ca575930-c2e3-43a9-ace4-1e988b2c1908" 499 | 500 | [[OffsetArrays]] 501 | deps = ["Adapt"] 502 | git-tree-sha1 = "c0e9e582987d36d5a61e650e6e543b9e44d9914b" 503 | uuid = "6fe1bfb0-de20-5000-8ca7-80f57d26f881" 504 | version = "1.10.7" 505 | 506 | [[OpenEXR]] 507 | deps = ["Colors", "FileIO", "OpenEXR_jll"] 508 | git-tree-sha1 = "327f53360fdb54df7ecd01e96ef1983536d1e633" 509 | uuid = "52e1d378-f018-4a11-a4be-720524705ac7" 510 | version = "0.3.2" 511 | 512 | [[OpenEXR_jll]] 513 | deps = ["Artifacts", "Imath_jll", "JLLWrappers", "Libdl", "Pkg", "Zlib_jll"] 514 | git-tree-sha1 = "923319661e9a22712f24596ce81c54fc0366f304" 515 | uuid = "18a262bb-aa17-5467-a713-aee519bc75cb" 516 | version = "3.1.1+0" 517 | 518 | [[OpenLibm_jll]] 519 | deps = ["Artifacts", "Libdl"] 520 | uuid = "05823500-19ac-5b8b-9628-191a04bc5112" 521 | 522 | [[OpenSpecFun_jll]] 523 | deps = ["Artifacts", "CompilerSupportLibraries_jll", "JLLWrappers", "Libdl", "Pkg"] 524 | git-tree-sha1 = "13652491f6856acfd2db29360e1bbcd4565d04f1" 525 | uuid = "efe28fd5-8261-553b-a9e1-b2916fc3738e" 526 | version = "0.5.5+0" 527 | 528 | [[OrderedCollections]] 529 | git-tree-sha1 = "85f8e6578bf1f9ee0d11e7bb1b1456435479d47c" 530 | uuid = "bac558e1-5e72-5ebc-8fee-abe8a469f55d" 531 | version = "1.4.1" 532 | 533 | [[PNGFiles]] 534 | deps = ["Base64", "CEnum", "ImageCore", "IndirectArrays", "OffsetArrays", "libpng_jll"] 535 | git-tree-sha1 = "33ae7d19c6ba748d30c0c08a82378aae7b64b5e9" 536 | uuid = 
"f57f5aa1-a3ce-4bc8-8ab9-96f992907883" 537 | version = "0.3.11" 538 | 539 | [[PaddedViews]] 540 | deps = ["OffsetArrays"] 541 | git-tree-sha1 = "646eed6f6a5d8df6708f15ea7e02a7a2c4fe4800" 542 | uuid = "5432bcbf-9aad-5242-b902-cca2824c8663" 543 | version = "0.5.10" 544 | 545 | [[Parameters]] 546 | deps = ["OrderedCollections", "UnPack"] 547 | git-tree-sha1 = "34c0e9ad262e5f7fc75b10a9952ca7692cfc5fbe" 548 | uuid = "d96e819e-fc66-5662-9728-84c9c7592b0a" 549 | version = "0.12.3" 550 | 551 | [[Parsers]] 552 | deps = ["Dates"] 553 | git-tree-sha1 = "d911b6a12ba974dabe2291c6d450094a7226b372" 554 | uuid = "69de0a69-1ddd-5017-9359-2bf0b02dc9f0" 555 | version = "2.1.1" 556 | 557 | [[Pkg]] 558 | deps = ["Artifacts", "Dates", "Downloads", "LibGit2", "Libdl", "Logging", "Markdown", "Printf", "REPL", "Random", "SHA", "Serialization", "TOML", "Tar", "UUIDs", "p7zip_jll"] 559 | uuid = "44cfe95a-1eb2-52ea-b672-e2afdf69b78f" 560 | 561 | [[PkgVersion]] 562 | deps = ["Pkg"] 563 | git-tree-sha1 = "a7a7e1a88853564e551e4eba8650f8c38df79b37" 564 | uuid = "eebad327-c553-4316-9ea0-9fa01ccd7688" 565 | version = "0.1.1" 566 | 567 | [[Preferences]] 568 | deps = ["TOML"] 569 | git-tree-sha1 = "00cfd92944ca9c760982747e9a1d0d5d86ab1e5a" 570 | uuid = "21216c6a-2e73-6563-6e65-726566657250" 571 | version = "1.2.2" 572 | 573 | [[Printf]] 574 | deps = ["Unicode"] 575 | uuid = "de0858da-6303-5e67-8744-51eddeeeb8d7" 576 | 577 | [[Profile]] 578 | deps = ["Printf"] 579 | uuid = "9abbd945-dff8-562f-b5e8-e1ebf5ef1b79" 580 | 581 | [[ProgressMeter]] 582 | deps = ["Distributed", "Printf"] 583 | git-tree-sha1 = "afadeba63d90ff223a6a48d2009434ecee2ec9e8" 584 | uuid = "92933f4c-e287-5a05-a399-4b506db050ca" 585 | version = "1.7.1" 586 | 587 | [[REPL]] 588 | deps = ["InteractiveUtils", "Markdown", "Sockets", "Unicode"] 589 | uuid = "3fa0cd96-eef1-5676-8a61-b3b8758bbffb" 590 | 591 | [[Random]] 592 | deps = ["Serialization"] 593 | uuid = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c" 594 | 595 | [[Random123]] 596 | deps = ["Libdl", "Random", "RandomNumbers"] 597 | git-tree-sha1 = "0e8b146557ad1c6deb1367655e052276690e71a3" 598 | uuid = "74087812-796a-5b5d-8853-05524746bad3" 599 | version = "1.4.2" 600 | 601 | [[RandomNumbers]] 602 | deps = ["Random", "Requires"] 603 | git-tree-sha1 = "043da614cc7e95c703498a491e2c21f58a2b8111" 604 | uuid = "e6cf234a-135c-5ec9-84dd-332b85af5143" 605 | version = "1.5.3" 606 | 607 | [[RangeArrays]] 608 | git-tree-sha1 = "b9039e93773ddcfc828f12aadf7115b4b4d225f5" 609 | uuid = "b3c3ace0-ae52-54e7-9d0b-2c1406fd6b9d" 610 | version = "0.3.2" 611 | 612 | [[Ratios]] 613 | deps = ["Requires"] 614 | git-tree-sha1 = "01d341f502250e81f6fec0afe662aa861392a3aa" 615 | uuid = "c84ed2f1-dad5-54f0-aa8e-dbefe2724439" 616 | version = "0.4.2" 617 | 618 | [[Reexport]] 619 | git-tree-sha1 = "45e428421666073eab6f2da5c9d310d99bb12f9b" 620 | uuid = "189a3867-3050-52da-a836-e630ba90ab69" 621 | version = "1.2.2" 622 | 623 | [[Requires]] 624 | deps = ["UUIDs"] 625 | git-tree-sha1 = "4036a3bd08ac7e968e27c203d45f5fff15020621" 626 | uuid = "ae029012-a4dd-5104-9daa-d747884805df" 627 | version = "1.1.3" 628 | 629 | [[Rotations]] 630 | deps = ["LinearAlgebra", "Random", "StaticArrays", "Statistics"] 631 | git-tree-sha1 = "6a23472b6b097d66da87785b61137142ac104f94" 632 | uuid = "6038ab10-8711-5258-84ad-4b1120ba62dc" 633 | version = "1.0.4" 634 | 635 | [[SHA]] 636 | uuid = "ea8e919c-243c-51af-8825-aaa63cd721ce" 637 | 638 | [[Serialization]] 639 | uuid = "9e88b42a-f829-5b0c-bbe9-9e923198166b" 640 | 641 | [[SharedArrays]] 642 | deps = ["Distributed", "Mmap", 
"Random", "Serialization"] 643 | uuid = "1a1011a3-84de-559e-8e89-a11a2f7dc383" 644 | 645 | [[SimpleTraits]] 646 | deps = ["InteractiveUtils", "MacroTools"] 647 | git-tree-sha1 = "5d7e3f4e11935503d3ecaf7186eac40602e7d231" 648 | uuid = "699a6c99-e7fa-54fc-8d76-47d257e15c1d" 649 | version = "0.9.4" 650 | 651 | [[Sockets]] 652 | uuid = "6462fe0b-24de-5631-8697-dd941f90decc" 653 | 654 | [[SortingAlgorithms]] 655 | deps = ["DataStructures"] 656 | git-tree-sha1 = "b3363d7460f7d098ca0912c69b082f75625d7508" 657 | uuid = "a2af1166-a08f-5f64-846c-94a0d3cef48c" 658 | version = "1.0.1" 659 | 660 | [[SparseArrays]] 661 | deps = ["LinearAlgebra", "Random"] 662 | uuid = "2f01184e-e22b-5df5-ae63-d93ebab69eaf" 663 | 664 | [[SpecialFunctions]] 665 | deps = ["ChainRulesCore", "IrrationalConstants", "LogExpFunctions", "OpenLibm_jll", "OpenSpecFun_jll"] 666 | git-tree-sha1 = "f0bccf98e16759818ffc5d97ac3ebf87eb950150" 667 | uuid = "276daf66-3868-5448-9aa4-cd146d93841b" 668 | version = "1.8.1" 669 | 670 | [[StackViews]] 671 | deps = ["OffsetArrays"] 672 | git-tree-sha1 = "46e589465204cd0c08b4bd97385e4fa79a0c770c" 673 | uuid = "cae243ae-269e-4f55-b966-ac2d0dc13c15" 674 | version = "0.1.1" 675 | 676 | [[Static]] 677 | deps = ["IfElse"] 678 | git-tree-sha1 = "e7bc80dc93f50857a5d1e3c8121495852f407e6a" 679 | uuid = "aedffcd0-7271-4cad-89d0-dc628f76c6d3" 680 | version = "0.4.0" 681 | 682 | [[StaticArrays]] 683 | deps = ["LinearAlgebra", "Random", "Statistics"] 684 | git-tree-sha1 = "3c76dde64d03699e074ac02eb2e8ba8254d428da" 685 | uuid = "90137ffa-7385-5640-81b9-e52037218182" 686 | version = "1.2.13" 687 | 688 | [[Statistics]] 689 | deps = ["LinearAlgebra", "SparseArrays"] 690 | uuid = "10745b16-79ce-11e8-11f9-7d13ad32a3b2" 691 | 692 | [[StatsAPI]] 693 | git-tree-sha1 = "1958272568dc176a1d881acb797beb909c785510" 694 | uuid = "82ae8749-77ed-4fe6-ae5f-f523153014b0" 695 | version = "1.0.0" 696 | 697 | [[StatsBase]] 698 | deps = ["DataAPI", "DataStructures", "LinearAlgebra", "LogExpFunctions", "Missings", "Printf", "Random", "SortingAlgorithms", "SparseArrays", "Statistics", "StatsAPI"] 699 | git-tree-sha1 = "eb35dcc66558b2dda84079b9a1be17557d32091a" 700 | uuid = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91" 701 | version = "0.33.12" 702 | 703 | [[StringDistances]] 704 | deps = ["Distances", "StatsAPI"] 705 | git-tree-sha1 = "00e86048552d34bb486cad935754dd9516bdb46e" 706 | uuid = "88034a9c-02f8-509d-84a9-84ec65e18404" 707 | version = "0.11.1" 708 | 709 | [[TOML]] 710 | deps = ["Dates"] 711 | uuid = "fa267f1f-6049-4f14-aa54-33bafae1ed76" 712 | 713 | [[Tar]] 714 | deps = ["ArgTools", "SHA"] 715 | uuid = "a4e569a6-e804-4fa4-b0f3-eef7a1d5b13e" 716 | 717 | [[Test]] 718 | deps = ["InteractiveUtils", "Logging", "Random", "Serialization"] 719 | uuid = "8dfed614-e22c-5e08-85e1-65c5234f0b40" 720 | 721 | [[TestImages]] 722 | deps = ["AxisArrays", "ColorTypes", "FileIO", "OffsetArrays", "Pkg", "StringDistances"] 723 | git-tree-sha1 = "f91d170645a8ba6fbaa3ac2879eca5da3d92a31a" 724 | uuid = "5e47fb64-e119-507b-a336-dd2b206d9990" 725 | version = "1.6.2" 726 | 727 | [[TiffImages]] 728 | deps = ["ColorTypes", "DataStructures", "DocStringExtensions", "FileIO", "FixedPointNumbers", "IndirectArrays", "Inflate", "OffsetArrays", "PkgVersion", "ProgressMeter", "UUIDs"] 729 | git-tree-sha1 = "016185e1a16c1bd83a4352b19a3b136224f22e38" 730 | uuid = "731e570b-9d59-4bfa-96dc-6df516fadf69" 731 | version = "0.5.1" 732 | 733 | [[TiledIteration]] 734 | deps = ["OffsetArrays"] 735 | git-tree-sha1 = "5683455224ba92ef59db72d10690690f4a8dc297" 736 | uuid = 
"06e1c1a7-607b-532d-9fad-de7d9aa2abac" 737 | version = "0.3.1" 738 | 739 | [[TimerOutputs]] 740 | deps = ["ExprTools", "Printf"] 741 | git-tree-sha1 = "7cb456f358e8f9d102a8b25e8dfedf58fa5689bc" 742 | uuid = "a759f4b9-e2f1-59dc-863e-4aeb61b1ea8f" 743 | version = "0.5.13" 744 | 745 | [[Tullio]] 746 | deps = ["ChainRulesCore", "DiffRules", "LinearAlgebra", "Requires"] 747 | git-tree-sha1 = "0288b7a395fc412952baf756fac94e4f28bfec65" 748 | uuid = "bc48ee85-29a4-5162-ae0b-a64e1601d4bc" 749 | version = "0.3.2" 750 | 751 | [[UUIDs]] 752 | deps = ["Random", "SHA"] 753 | uuid = "cf7118a7-6976-5b1a-9a39-7adc72f591a4" 754 | 755 | [[UnPack]] 756 | git-tree-sha1 = "387c1f73762231e86e0c9c5443ce3b4a0a9a0c2b" 757 | uuid = "3a884ed6-31ef-47d7-9d2a-63182c4928ed" 758 | version = "1.0.2" 759 | 760 | [[Unicode]] 761 | uuid = "4ec0a83e-493e-50e2-b9ac-8f72acf5a8f5" 762 | 763 | [[WoodburyMatrices]] 764 | deps = ["LinearAlgebra", "SparseArrays"] 765 | git-tree-sha1 = "de67fa59e33ad156a590055375a30b23c40299d3" 766 | uuid = "efce3f68-66dc-5838-9240-27a6d6f5f9b6" 767 | version = "0.5.5" 768 | 769 | [[Zlib_jll]] 770 | deps = ["Libdl"] 771 | uuid = "83775a58-1f1d-513f-b197-d71354ab007a" 772 | 773 | [[libpng_jll]] 774 | deps = ["Artifacts", "JLLWrappers", "Libdl", "Pkg", "Zlib_jll"] 775 | git-tree-sha1 = "94d180a6d2b5e55e447e2d27a29ed04fe79eb30c" 776 | uuid = "b53b4c65-9356-5827-b1ea-8c7a1a84506f" 777 | version = "1.6.38+0" 778 | 779 | [[nghttp2_jll]] 780 | deps = ["Artifacts", "Libdl"] 781 | uuid = "8e850ede-7688-5339-a07c-302acd2aaf8d" 782 | 783 | [[p7zip_jll]] 784 | deps = ["Artifacts", "Libdl"] 785 | uuid = "3f19e933-33d8-53b3-aaab-bd5110c3b7a0" 786 | -------------------------------------------------------------------------------- /Project.toml: -------------------------------------------------------------------------------- 1 | [deps] 2 | Adapt = "79e6a3ab-5dfb-504d-930d-738a2a938a0e" 3 | BenchmarkTools = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf" 4 | CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba" 5 | CUDAKernels = "72cfdca4-0801-4ab0-bf6a-d52aa10adc57" 6 | ColorTypes = "3da002f7-5984-5a60-b8a6-cbb66c0b333f" 7 | FixedPointNumbers = "53c48c17-4a7d-5ca2-90c5-79b7896eea93" 8 | GPUCompiler = "61eb1bfa-7361-4325-ad38-22787b887f55" 9 | ImageIO = "82e4d734-157c-48bb-816b-45c225c6df19" 10 | Images = "916415d5-f1e6-5110-898d-aaa5f9f070e0" 11 | KernelAbstractions = "63c18a36-062a-441e-b654-da1e3ab1ce7c" 12 | TestImages = "5e47fb64-e119-507b-a336-dd2b206d9990" 13 | Tullio = "bc48ee85-29a4-5162-ae0b-a64e1601d4bc" 14 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # CSCS Julia GPU course material 2 | 3 | This repository contains the notebooks for the 2 last days of the [Julia GPU 4 | course at 5 | CSCS](https://www.cscs.ch/events/upcoming-events/event-detail/gpu-programming-with-julia/). 6 | --------------------------------------------------------------------------------