├── .gitattributes ├── README.md ├── images ├── .gitignore ├── Makefile ├── action-barrier-action-mem.png ├── action-barrier-action-mem.tex ├── action-barrier-action.png ├── action-barrier-action.tex ├── action-command.png ├── action-command.tex ├── deps-common.tex ├── pipeline-barrier.png └── pipeline-barrier.tex └── memory.md /.gitattributes: -------------------------------------------------------------------------------- 1 | * text=auto 2 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | This is an incomplete and almost certainly incorrect attempt to rephrase Vulkan's requirements on execution dependencies in a more precise form. 2 | 3 | The basic idea is: Every 'action command' and 'sync command' defines a collection of nodes in a dependency graph, defines some internal edges between those nodes, and defines some external edges between its own nodes and the nodes of other commands. In some cases we define internal nodes that are not strictly necessary but that reduce the number of external edges, making the behaviour easier to understand and visualize. Some sequences of n commands will still result in O(n^2) edges, so these rules should not be implemented literally - implementations may use any approach that gives observably equivalent behaviour. 4 | 5 | The goal is to specify the rules in a pseudo-mathematical way so that they're unambiguous (albeit not necessarily intuitive) for human readers, and so that an algorithm (e.g. implemented in a Vulkan layer) could follow the rules to detect race conditions and (ideally) to render a visualization so the human gets some idea of what they've done wrong. 6 | 7 | The (still unfinished) definition has ended up being quite verbose. To hopefully to make it a bit easier to follow, we can draw some diagrams to illustrate parts of the dependency graph. The internal nodes and edges in action commands and pipeline barriers should look like: 8 | 9 | ![](images/action-command.png) ![](images/pipeline-barrier.png) 10 | 11 | (Some of the pipeline stages are omitted from these diagrams, for clarity.) 12 | 13 | Pipeline barriers create external edges from a stage in a previous action command to the corresponding source stage in the pipeline barrier, and from a destination stage in the pipeline barrier to the corresponding stage in a subsequent action command. 14 | 15 | E.g. 16 | ```cpp 17 | vkCmdDraw(...); 18 | vkCmdPipelineBarrier(..., 19 | VK_PIPELINE_STAGE_VERTEX_SHADER_BIT | VK_PIPELINE_STAGE_TRANSFER_BIT, 20 | VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, 21 | ... 22 | ); 23 | vkCmdDraw(...); 24 | ``` 25 | creates an execution dependency graph like: 26 | 27 | ![](images/action-barrier-action.png) 28 | 29 | in which you can follow the arrows to see an execution dependency chain from e.g. draw 1's `VERTEX_SHADER` stage to draw 2's `TRANSFER` stage, meaning draw 1's `VERTEX_SHADER` must complete before draw 2's `TRANSFER` can start. However there is no execution dependency chain from draw 1's `FRAGMENT_SHADER` stage to any stage in draw 2, meaning the `FRAGMENT_SHADER` can run after or concurrently with any of draw 2's stages. 30 | 31 | When there are multiple arrows leading into a node, *all* of those dependencies must complete before the node can start. 
It is often easiest to read the graphs by starting at the node you're interested in, then reading backwards along the arrows, following every possible backwards chain of arrows, and the nodes you pass through will be all the ones that your initial node depends on. 32 | 33 | TODO: More diagrams, for events and subpasses and everything else. 34 | 35 | ### Definitions 36 | 37 | First we need to define a few terms. 38 | 39 | An "execution dependency order" is a partial order over elements of the following types: 40 | * (action command, stage) 41 | * (sync command, SRC or DST, stage) 42 | * (sync command, SRC or DST or POST_TRANS or PRE_TRANS) 43 | * (command buffer, SUBMIT or COMPLETE) 44 | * (subpass, SRC or DST, stage) 45 | * subpass dependency 46 | * layout transition 47 | 48 | Actions commands are: 49 | * vkCmdDraw (+Indirect, +Indexed) 50 | * vkCmdDispatch (+Indirect) 51 | * Transfer operations: 52 | * vkCmdCopy* 53 | * vkCmdClear* 54 | * vkCmdBlitImage 55 | * vkCmdUpdateBuffer 56 | * vkCmdFillBuffer 57 | * vkCmdWriteTimestamp 58 | * vkCmdResolveImage ??? 59 | * Other query stuff ??? 60 | 61 | Sync commands are: 62 | * vkCmdPipelineBarrier 63 | * vkCmdSetEvent 64 | * vkCmdWaitEvents 65 | * vkSetEvent, vkResetEvent ??? 66 | 67 | By "command" we really mean a single execution of a command. A command might be recorded once but its command buffer might be submitted many times, and each time will count as a separate execution with its own separate position in the ordering. 68 | 69 | *extractStages(mask)* converts a bitmask into a set of stages: 70 | * If *mask* & `TOP_OF_PIPE_BIT`, add `TOP_OF_PIPE` into the set. 71 | * If *mask* & `DRAW_INDIRECT_BIT`, add `DRAW_INDIRECT` into the set. 72 | * ... 73 | * If *mask* & `ALL_GRAPHICS_BIT`, add `DRAW_INDIRECT`, ..., `COLOR_ATTACHMENT_OUTPUT` into the set. 74 | * If *mask* & `ALL_COMMANDS_BIT`, add `DRAW_INDIRECT`, ..., `COLOR_ATTACHMENT_OUTPUT`, `COMPUTE_SHADER`, `TRANSFER` into the set. 75 | 76 | We define "command order" as follows: 77 | * Commands executed in command buffer submission *A* are considered earlier than commands executed in command buffer submission *B*, where *A* and *B* were submitted to the same queue and *A* was submitted first. 78 | * Commands executed in a single command buffer submission are ordered in the same order that they were recorded in that command buffer. Secondary command buffers are 'inlined' into their primary command buffers for the purpose of defining this order. 79 | 80 | ### Execution dependency order 81 | 82 | We define the execution dependency order '<' as follows: 83 | 84 | * For every `vkCmdPipelineBarrier` *barrier* that does not have `BY_REGION_BIT`: 85 | * If the barrier is not inside a render pass: 86 | * Let *A_a* be the set of all action commands preceding *barrier* in the current queue, in command order. 87 | * Let *A_s* be the set of all sync commands preceding *barrier* in the current queue, in command order. 88 | * Let *B_a* be the set of all action commands following *barrier* in the current queue, in command order. 89 | * Let *B_s* be the set of all sync commands following *barrier* in the current queue, in command order. 90 | * If the barrier is inside a render pass: 91 | * Let *A_a* be the set of all action commands preceding *barrier* in the current subpass, in command order. 92 | * Let *A_s* be the set of all sync commands preceding *barrier* in the current subpass, in command order. 
93 | * Let *B_a* be the set of all action commands following *barrier* in the current subpass, in command order.
94 | * Let *B_s* be the set of all sync commands following *barrier* in the current subpass, in command order.
95 | * For every *a* in *A_a*, and every *srcStage* in *extractStages(barrier.srcStageMask)*:
96 | * (*a*, *srcStage*) < (*barrier*, SRC, *srcStage*)
97 | * For every *a* in *A_s*, and every *srcStage* in *extractStages(barrier.srcStageMask)*:
98 | * (*a*, DST, *srcStage*) < (*barrier*, SRC, *srcStage*)
99 | * For every *srcStage* in *extractStages(barrier.srcStageMask)*:
100 | * (*barrier*, SRC, *srcStage*) < (*barrier*, SRC)
101 | * (*barrier*, SRC, TOP) < (*barrier*, SRC)
102 | * (*barrier*, SRC) < (*barrier*, PRE_TRANS) < (*barrier*, POST_TRANS) < (*barrier*, DST)
103 | * For every *dstStage* in *extractStages(barrier.dstStageMask)*:
104 | * (*barrier*, DST) < (*barrier*, DST, *dstStage*)
105 | * (*barrier*, DST) < (*barrier*, DST, BOTTOM)
106 | * For every *b* in *B_a*, and every *dstStage* in *extractStages(barrier.dstStageMask)*:
107 | * (*barrier*, DST, *dstStage*) < (*b*, *dstStage*)
108 | * For every *b* in *B_s*, and every *dstStage* in *extractStages(barrier.dstStageMask)*:
109 | * (*barrier*, DST, *dstStage*) < (*b*, SRC, *dstStage*)
110 | * Let *M_{transition}* be the set of all `VkImageMemoryBarrier` *imgMemBarrier* in *barrier*, where *imgMemBarrier.oldLayout* != *imgMemBarrier.newLayout*.
111 | * For every *transition* in *M_{transition}*:
112 | * (*barrier*, PRE_TRANS) < *transition* < (*barrier*, POST_TRANS)
113 |
114 | ---
115 |
116 | **NOTE**: This is defining that a pipeline barrier looks like:
117 |
118 | ![](images/pipeline-barrier.png)
119 |
120 | There are sets of source stages and destination stages. The stages included in `srcStageMask`/`dstStageMask` are connected to the barrier's internal SRC/DST. Image layout transitions happen in the middle of the pipeline barrier.
121 |
122 | The active source stages are connected to the corresponding stages of earlier commands (all earlier commands or, if the barrier is inside a subpass, earlier commands in the current subpass). The active destination stages are connected similarly to following commands. This means you can construct a chain of execution dependencies through multiple pipeline barriers, as long as they have the appropriate bits set in `srcStageMask`/`dstStageMask` to make the connection.
123 |
124 | ---
125 |
126 | * For every `vkCmdWaitEvents` *waitEvents*:
127 | * Let *B_a* be the set of all action commands following *waitEvents* in the current queue, in command order.
128 | * Let *B_s* be the set of all sync commands following *waitEvents* in the current queue, in command order.
129 | * (*waitEvents*, SRC) < (*waitEvents*, PRE_TRANS) < (*waitEvents*, POST_TRANS) < (*waitEvents*, DST)
130 | * For every *dstStage* in *extractStages(waitEvents.dstStageMask)*:
131 | * (*waitEvents*, DST) < (*waitEvents*, DST, *dstStage*)
132 | * (*waitEvents*, DST) < (*waitEvents*, DST, BOTTOM)
133 | * For every *b* in *B_a*, and every *dstStage* in *extractStages(waitEvents.dstStageMask)*:
134 | * (*waitEvents*, DST, *dstStage*) < (*b*, *dstStage*)
135 | * For every *b* in *B_s*, and every *dstStage* in *extractStages(waitEvents.dstStageMask)*:
136 | * (*waitEvents*, DST, *dstStage*) < (*b*, SRC, *dstStage*)
137 | * Let *M_{transition}* be the set of all `VkImageMemoryBarrier` *imgMemBarrier* in *waitEvents*, where *imgMemBarrier.oldLayout* != *imgMemBarrier.newLayout*.
138 | * For every *transition* in *M_{transition}*:
139 | * (*waitEvents*, PRE_TRANS) < *transition* < (*waitEvents*, POST_TRANS)
140 |
141 | * For every `vkCmdSetEvent` *setEvent* on some event object *event*:
142 | * Let *W* be the set of `vkCmdWaitEvents` commands, such that for each *waitEvents* in *W*:
143 | * *event* is included in *waitEvents*'s array of event objects to wait on.
144 | * *waitEvents* follows *setEvent* in command order.
145 | * There is no `vkCmdResetEvent` on *event* that is between *setEvent* and *waitEvents* in command order.
146 | * Let *A_a* be the set of all action commands, and *A_s* the set of all sync commands, preceding *setEvent* in the current queue, in command order.
147 | * For every *a* in *A_a*, and every *stage* in *extractStages(setEvent.stageMask)*: (*a*, *stage*) < (*setEvent*, SRC, *stage*)
148 | * For every *a* in *A_s*, and every *stage* in *extractStages(setEvent.stageMask)*: (*a*, DST, *stage*) < (*setEvent*, SRC, *stage*)
149 | * For every *stage* in *extractStages(setEvent.stageMask)*:
150 | * (*setEvent*, SRC, *stage*) < (*setEvent*, SRC)
151 | * (*setEvent*, SRC, TOP) < (*setEvent*, SRC)
152 | * For every *waitEvents* in *W*:
153 | * (*setEvent*, SRC) < (*waitEvents*, SRC)
154 |
155 | * TODO: vkSetEvent from host
156 |
157 | ---
158 |
159 | **NOTE**: A pair of `vkCmdSetEvent` and `vkCmdWaitEvents` is very similar to a `vkCmdPipelineBarrier` split in half.
160 |
161 | ---
162 |
163 | * For all action commands *c*:
164 | * For all stages *stage* (not including `TOP` or `BOTTOM`):
165 | * (*c*, `TOP`) < (*c*, *stage*) < (*c*, `BOTTOM`)
166 |
167 | ---
168 |
169 | **NOTE**: This is defining that an action command looks like:
170 |
171 | ![](images/action-command.png)
172 |
173 | i.e. a bunch of stages between `TOP` and `BOTTOM`. An action command will not necessarily do work in all of these stages.
174 |
175 | ---
176 |
177 | * For all action commands *c*:
178 | * Let *cmdBuf* be the primary command buffer containing *c*.
179 | * (*cmdBuf*, SUBMIT) < (*c*, `TOP`)
180 | * (*c*, `BOTTOM`) < (*cmdBuf*, COMPLETE)
181 |
182 | * For all sync commands *c* inside a command buffer:
183 | * Let *cmdBuf* be the primary command buffer containing *c*.
184 | * (*cmdBuf*, SUBMIT) < (*c*, SRC)
185 | * (*c*, DST) < (*cmdBuf*, COMPLETE)
186 |
187 | ---
188 |
189 | **NOTE**: This is saying that commands can't start before their corresponding command buffer is submitted, and the command buffer won't be considered complete until all its commands are complete.
190 |
191 | This is defined in terms of primary command buffers. For commands recorded in secondary command buffers, we use the primary command buffer that executes that secondary command buffer.
192 |
193 | ---
194 |
195 | * For every subpass *subpass*:
196 | * For every stage *stage*:
197 | * (*subpass*, SRC, *stage*) < (*subpass*, DST, *stage*)
198 | * For every action command *c* in *subpass*:
199 | * (*subpass*, SRC, *stage*) < (*c*, *stage*) < (*subpass*, DST, *stage*)
200 | * For every sync command *c* in *subpass*:
201 | * (*subpass*, SRC, *stage*) < (*c*, SRC, *stage*)
202 | * (*c*, DST, *stage*) < (*subpass*, DST, *stage*)
203 |
204 | * For every `VkSubpassDependency` *subpassDep* that does not have `BY_REGION_BIT`:
205 | * If *subpassDep.srcSubpass* = *subpassDep.dstSubpass*:
206 | * They must not both be equal to `VK_SUBPASS_EXTERNAL`.
207 | * This is a subpass self-dependency. The subpass can contain `vkCmdPipelineBarrier` commands (subject to certain validity requirements not described here), which create execution dependencies as described above.
208 | * Otherwise:
209 | * If *subpassDep.srcSubpass* is `VK_SUBPASS_EXTERNAL`:
210 | * Let *A_a* be the set of all action commands preceding the current render pass, in command order.
211 | * Let *A_s* be the set of all sync commands preceding the current render pass, in command order.
212 | * For every *a* in *A_a*, and every *srcStage* in *extractStages(subpassDep.srcStageMask)*:
213 | * (*a*, *srcStage*) < *subpassDep*
214 | * For every *a* in *A_s*, and every *srcStage* in *extractStages(subpassDep.srcStageMask)*:
215 | * (*a*, DST, *srcStage*) < *subpassDep*
216 | * Otherwise:
217 | * For every *srcStage* in *extractStages(subpassDep.srcStageMask)*:
218 | * (*subpassDep.srcSubpass*, DST, *srcStage*) < *subpassDep*
219 | * If *subpassDep.dstSubpass* is `VK_SUBPASS_EXTERNAL`:
220 | * Let *B_a* be the set of all action commands following the current render pass, in command order.
221 | * Let *B_s* be the set of all sync commands following the current render pass, in command order.
222 | * For every *b* in *B_a*, and every *dstStage* in *extractStages(subpassDep.dstStageMask)*:
223 | * *subpassDep* < (*b*, *dstStage*)
224 | * For every *b* in *B_s*, and every *dstStage* in *extractStages(subpassDep.dstStageMask)*:
225 | * *subpassDep* < (*b*, SRC, *dstStage*)
226 | * Otherwise:
227 | * For every *dstStage* in *extractStages(subpassDep.dstStageMask)*:
228 | * *subpassDep* < (*subpassDep.dstSubpass*, SRC, *dstStage*)
229 |
230 | ---
231 |
232 | **NOTE**: The definition of subpass SRC/DST stages is necessary because we might have execution dependency chains passing through a subpass which contains no commands. The SRC/DST stages give something for the dependency to be ordered relative to.
233 |
234 | ---
235 |
236 | * Transitivity: If *X* < *Y* and *Y* < *Z*, then *X* < *Z*.
237 |
238 | We also define the by-region execution dependency order '<\_{region}' as follows:
239 | * For every `vkCmdPipelineBarrier` *barrier* that has `BY_REGION_BIT`:
240 | * ... similar to the definition above ...
241 | * For every `VkSubpassDependency` *subpassDep* that has `BY_REGION_BIT`:
242 | * ... similar to the definition above ...
243 | * Transitivity: If *X* <\_{region} *Y* and *Y* <\_{region} *Z*, then *X* <\_{region} *Z*.
244 | * If *X* < *Y*, then *X* <\_{region} *Y*.
245 |
246 | We need some validity requirements for the event execution dependencies to make sense:
247 |
248 | * If there are two `vkCmdSetEvent` commands *set_1* and *set_2* on the same event, and neither *set_1* < *set_2* nor *set_2* < *set_1*, then behaviour is undefined.
249 | * If there are a `vkCmdSetEvent` *set* and a `vkCmdWaitEvents` *wait* on the same event, and neither *set* < *wait* nor *wait* < *set*, then behaviour is undefined.
250 | * If there are a `vkCmdSetEvent` *set* and a `vkCmdResetEvent` *reset* on the same event, and neither *set* < *reset* nor *reset* < *set*, then behaviour is undefined.
251 | * If there are a `vkCmdResetEvent` *reset* and a `vkCmdWaitEvents` *wait* on the same event, and neither *reset* < *wait* nor *wait* < *reset*, then behaviour is undefined.
252 |
253 | i.e. you must not have a race condition between two commands on the same event when the behaviour depends on the order in which they execute. (TODO: These are somewhat stricter than the current spec requires. Maybe it needs to be defined differently, so we allow multiple valid execution orders instead of simply saying it's undefined if there's more than one valid order.)
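
For example, recording the set and the wait into the same command buffer gives an unambiguous order (most parameters are elided as in the earlier examples; *event* is a placeholder `VkEvent` handle):

```cpp
vkCmdDispatch(...);                          // work that the wait will depend on
vkCmdSetEvent(..., event,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);   // stageMask: signal once compute work is done
// ... other commands, but no vkCmdResetEvent on 'event' ...
vkCmdWaitEvents(...,
    1, &event,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,    // srcStageMask
    VK_PIPELINE_STAGE_TRANSFER_BIT,          // dstStageMask
    0, nullptr, 0, nullptr, 0, nullptr);     // no memory barriers in this example
vkCmdCopyBuffer(...);                        // may only start once the compute work is done
```

Here the wait follows the set in command order with no intervening `vkCmdResetEvent`, so the rules above give (*setEvent*, SRC) < (*waitEvents*, SRC); only one execution order is possible and none of the undefined cases apply.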
254 |
255 | Finally we can say:
256 |
257 | * If *X* < *Y*, then the implementation must complete the work performed by *X* before starting the work performed by *Y*.
258 | * If *X* <\_{region} *Y*, then for every region (x,y,layer) in the framebuffer (or viewport or something?), the implementation must complete the work performed by *X* for that region, before starting the work performed by *Y* for that region.
259 | * In all other cases, the implementation may reorder and overlap work however it wishes.
260 |
261 | (TODO: define what "completion" actually means)
262 |
263 | Note that '<' is defined so that execution dependencies always go in the same direction as command order (unless there are bugs in the definition). That means an implementation could simply execute every command in command order, with no pipelining and no reordering, and would satisfy all the requirements above. Or it could choose to insert arbitrary sync points at which every previous command completes before any subsequent command starts, for example at the start/end of a command buffer, to limit the scope in which it has to track parallelism.
264 |
265 | ### Memory dependencies
266 |
267 | The concept we use for memory dependencies is that a device will have many internal caches - in the worst case a separate cache for every different access type in every pipeline stage. Cached writes must be flushed to main memory to make them available to other stages. Similarly, caches must be invalidated before reading recently-written data through them, to make the changes in main memory visible to that stage and prevent it from reading stale data from the cache; and they must also be invalidated before writing near recently-written data, since a partial cache line write will essentially perform a read and we again need to avoid using stale data. No memory dependencies are needed for read-after-read and write-after-read.
268 |
269 | (Implementations are not expected to literally use caches like this - e.g. if two stages have a shared cache then they could optimise away a flush/invalidate pair between those stages, or they could flush from independent L1 caches into a shared L2 cache instead of into main memory, or they might buffer memory accesses in something quite different to a data cache, etc. They just need to make sure the observable behaviour is compatible with what's described here, so that application developers can ignore those details and assume a simple cache model.)
270 |
271 | (NOTE: I'm using the terms "flush" and "invalidate" instead of "make available" and "make visible", even though they're slightly more low-level than intended, because they're much more conventional terms, I find it much easier to remember which way round they go, and they're easier to use in sentences.)
272 |
273 | We define four new groups of elements:
274 | * (READ, *c*, *stage*, *access*, *mem*)
275 | * (WRITE, *c*, *stage*, *access*, *mem*)
276 | * (FLUSH, *b*, *stage*, *access*, *mem*)
277 | * (INVALIDATE, *b*, *stage*, *access*, *mem*)
278 |
279 | *mem* is either a range of a `VkDeviceMemory`, or an image subresource, or the special value "GLOBAL" (only used in FLUSH/INVALIDATE). *access* is one of `VkAccessFlagBits`. *stage* is one of `VkPipelineStageFlagBits`. *c* is an action command. *b* is a barrier command (TODO: or other things that create memory dependencies).
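
As a rough illustration of how the layer mentioned in the introduction might represent these elements (all type and field names below are invented for this sketch, not part of Vulkan):

```cpp
#include <cstdint>
#include <vulkan/vulkan.h>

// Hypothetical bookkeeping types for a checking layer.
enum class ElementKind { READ, WRITE, FLUSH, INVALIDATE };

struct Mem {
    bool isGlobal;                  // the special GLOBAL value, or...
    VkDeviceMemory memory;          // ...a range of a VkDeviceMemory
    VkDeviceSize offset, size;
    // (image subresources omitted for brevity)
};

struct Element {
    ElementKind kind;
    uint64_t commandExecution;      // which execution of an action/sync command ('c' or 'b')
    VkPipelineStageFlagBits stage;  // a single pipeline stage
    VkAccessFlagBits access;        // a single access type
    Mem mem;
};
```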
280 |
281 | *memOverlap(mem_1, mem_2)* is the set of memory locations in the intersection between the two memory ranges, as defined by the current spec for memory aliasing. (TODO: might need to be more specific about cache line granularity for buffer accesses.) If one is GLOBAL then the intersection is the other one.
282 |
283 | *memIsSubset(mem_1, mem_2)* is true if (and only if) *mem_2* is 'larger' than (or equal to) *mem_1*. That means either *mem_2* is GLOBAL, or it is a containing range of the same `VkDeviceMemory`, or it is a containing image subresource range. (*memIsSubset* ignores aliasing.)
284 |
285 | These elements all participate in the execution dependency order '<', extending the definition of '<' above.
286 |
287 | Most READ and WRITE operations happen inside one of the stages of an action command. To represent them happening at the same time as a (*c*, *stage*), we define an equivalence relation '~' which means the READ/WRITE adopts the same position in the execution dependency order as its corresponding command stage:
288 |
289 | * If *X* ~ *Y* and *X* < *Z* then *Y* < *Z*.
290 | * If *X* ~ *Y* and *Z* < *X* then *Z* < *Y*.
291 |
292 | We need to define every possible memory access from every command:
293 |
294 | * For all transfer commands *c*:
295 | * (*c*, `TRANSFER`) ~ (READ, *c*, `TRANSFER`, `ACCESS_TRANSFER_READ`, whatever the source is)
296 | * (*c*, `TRANSFER`) ~ (WRITE, *c*, `TRANSFER`, `ACCESS_TRANSFER_WRITE`, whatever the destination is)
297 | * For all vkCmdDispatch (+Indirect) commands *c*:
298 | * (*c*, `COMPUTE_SHADER`) ~ (READ, *c*, `COMPUTE_SHADER`, `ACCESS_SHADER_READ`, whatever the source is) for each thing the compute shader reads
299 | * (*c*, `COMPUTE_SHADER`) ~ (WRITE, *c*, `COMPUTE_SHADER`, `ACCESS_SHADER_WRITE`, whatever the destination is) for each thing the compute shader writes
300 | * (*c*, `COMPUTE_SHADER`) ~ (READ, *c*, `COMPUTE_SHADER`, `ACCESS_UNIFORM_READ`, whatever the source is) for each uniform thing the compute shader reads
301 | * etc
302 | * For all vkCmdDraw (+Indirect, +Indexed) commands *c*:
303 | * ... loads of stuff depending on the pipeline state etc
304 | * For all vkCmd{Dispatch,Draw}Indirect commands *c*:
305 | * (*c*, `DRAW_INDIRECT`) ~ (READ, *c*, `DRAW_INDIRECT`, `ACCESS_INDIRECT_COMMAND_READ`, whatever the source is)
306 | * TODO: layout transitions sort of read/write
307 | * TODO: what about stuff like automatic clears in render passes?
308 | * TODO: host accesses
309 |
310 | FLUSH and INVALIDATE are created by memory barriers:
311 |
312 | * For every `vkCmdPipelineBarrier` *barrier*:
313 | * For every *memoryBarrier* in *barrier.pMemoryBarriers*:
314 | * For every *srcAccess* in *memoryBarrier.srcAccessMask*, and every *srcStage* in *extractStages(barrier.srcStageMask)*:
315 | * (*barrier*, SRC) < (FLUSH, *barrier*, *srcStage*, *srcAccess*, GLOBAL) < (*barrier*, PRE_TRANS)
316 | * For every *dstAccess* in *memoryBarrier.dstAccessMask*, and every *dstStage* in *extractStages(barrier.dstStageMask)*:
317 | * (*barrier*, POST_TRANS) < (INVALIDATE, *barrier*, *dstStage*, *dstAccess*, GLOBAL) < (*barrier*, DST)
318 | * For every *bufferMemoryBarrier* in *barrier.pBufferMemoryBarriers*:
319 | * ... similar but with a buffer range instead of GLOBAL
320 | * For every *imageMemoryBarrier* in *barrier.pImageMemoryBarriers*:
321 | * ...
similar but with an image subresource instead of GLOBAL 322 | * TODO: all the other ways of defining explicit and implicit memory barriers 323 | 324 | If we modify the earlier example to include some memory barriers like: 325 | ```cpp 326 | vkCmdDraw(...); 327 | vkCmdPipelineBarrier(..., 328 | VK_PIPELINE_STAGE_VERTEX_SHADER_BIT | VK_PIPELINE_STAGE_TRANSFER_BIT, 329 | VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT | VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, 330 | pMemoryBarriers = { { 331 | srcAccessMask = 0, 332 | dstAccessMask = VK_ACCESS_SHADER_READ_BIT 333 | } }, 334 | pImageMemoryBarriers = { { 335 | srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT, 336 | dstAccessMask = 0, 337 | image = ..., 338 | subresourceRange = ... 339 | } }, 340 | ... 341 | ); 342 | vkCmdDraw(...); 343 | ``` 344 | then we can illustrate it like: 345 | 346 | ![](images/action-barrier-action-mem.png) 347 | 348 | Here draw 1's `VERTEX_SHADER` stage is writing to some image subresource *img_1*, draw 2's `FRAGMENT_SHADER` is reading from *img_1*, and the pipeline barrier is doing a flush of *img_1* and a GLOBAL invalidate which both include the appropriate access types and stages that correspond to the WRITE/READ. 349 | 350 | Having defined these operations, we can now define the rules that applications must follow. Violating these rules will result in unpredictable reads from memory. 351 | 352 | Race conditions between writes and other read/write accesses are not permitted, because they would result in unpredictable behaviour depending on the scheduling of the commands: 353 | 354 | * For every *write* = (WRITE, *c_1*, *stage_1*, *access_1*, *mem_1*), *read* = (READ, *c_2*, *stage_2*, *access_2*, *mem_2*) such that *memOverlap(mem_1, mem_2)* is not empty: 355 | * The application must ensure either *write* < *read*, or *read* < *write*. (Note this is equivalent to requiring either (*c_1*, *stage_1*) < (*c_2*, *stage_2*) or vice versa, due to the equivalence between WRITEs/READs and command stages.) 356 | * For every *write_1* = (WRITE, *c_1*, *stage_1*, *access_1*, *mem_1*), *write_2* = (WRITE, *c_2*, *stage_2*, *access_2*, *mem_2*) such that *write_1* != *write_2*, and *memOverlap(mem_1, mem_2)* is not empty: 357 | * The application must ensure either *write_1* < *write_2*, or *write_2* < *write_1*. 358 | 359 | Between a write and a subsequent memory access, the memory must be flushed and invalidated: 360 | 361 | * If there exist *write* = (WRITE, *c_1*, *stage_1*, *access_1*, *mem_1*), *read* = (READ, *c_2*, *stage_2*, *access_2*, *mem_2*) such that *write* < *read*, and *mem* = *memOverlap(mem_1, mem_2)* is not empty: 362 | * There must be some *flush* = (FLUSH, *b_1*, *stage_1*, *access_1*, *mem_1b*), *invalidate* = (INVALIDATE, *b_2*, *stage_2*, *access_2*, *mem_2b*) such that *write* < *flush* < *invalidate* < *read*, and *memIsSubset(mem, mem_1b)* and *memIsSubset(mem, mem_2b)*. 363 | * (This is saying that for every region of memory that is both written by one stage and read by another stage, you must first flush at least that much (possibly more) and invalidate that much (possibly more) between the write and read.) 
364 | * If there exist *write_1* = (WRITE, *c_1*, *stage_1*, *access_1*, *mem_1*), *write_2* = (WRITE, *c_2*, *stage_2*, *access_2*, *mem_2*) such that *write_1* < *write_2*, and *mem* = *memOverlap(mem_1, mem_2)* is not empty: 365 | * There must be some *flush* = (FLUSH, *b_1*, *stage_1*, *access_1*, *mem_1b*), *invalidate* = (INVALIDATE, *b_2*, *stage_2*, *access_2*, *mem_2b*) such that *write_1* < *flush* < *invalidate* < *write_2*, and *memIsSubset(mem, mem_1b)* and *memIsSubset(mem, mem_2b)*. 366 | 367 | Additionally, you mustn't invalidate dirty memory (because that would leave the contents of RAM unpredictable, depending on whether the dirty lines were partially written back or not) - you must flush it first. This applies even when the invalidate is a different stage or access type, because some implementations might share caches between stages and access types and so the invalidate will still touch the dirty cache lines: 368 | 369 | * If there exist *write* = (WRITE, *c_1*, *stage_1*, *access_1*, *mem_1*), *invalidate* = (INVALIDATE, *b_2*, *stage_2*, *access_2*, *mem_2*) such that *write* < *invalidate*, and *mem* = *memOverlap(mem_1, mem_2)* is not empty: 370 | * There must be some *flush* = (FLUSH, *b_1*, *stage_1*, *access_1*, *mem_1b*) such that *write* < *flush* < *invalidate*, and *memIsSubset(mem, mem_1b)* and *memIsSubset(mem, mem_2)* 371 | 372 | And there must not be any race conditions between writes and invalidates, for the same reason: 373 | 374 | * For every *write* = (WRITE, *c_1*, *stage_1*, *access_1*, *mem_1*), *invalidate* = (INVALIDATE, *b_2*, *stage_2*, *access_2*, *mem_2*) such that *memOverlap(mem_1, mem_2)* is not empty: 375 | * The application must ensure either *write* < *invalidate*, or *invalidate* < *write*. 376 | 377 | (On the other hand, race conditions between writes and flushes are no problem.) 378 | 379 | TODO: by-region memory dependencies. 380 | 381 | TODO: cases where a stage in a command can be coherent with itself (Coherent in shaders, color attachment reads/writes, etc). 382 | 383 | 384 | 385 | 386 | ### TODO 387 | 388 | Transitions: The spec says: 389 | 390 | > Layout transitions that are performed via image memory barriers are automatically ordered against other layout transitions, including those that occur as part of a render pass instance. 391 | 392 | but I have no idea what that even means? 393 | 394 | Fences 395 | 396 | Semaphores 397 | 398 | Host events 399 | 400 | Host accesses 401 | 402 | QueueSubmit guarantees 403 | 404 | Semaphores, fences, events: be clear about how they're referring to the object at the time the command is executed (it might get deleted later) 405 | 406 | ... 
407 | -------------------------------------------------------------------------------- /images/.gitignore: -------------------------------------------------------------------------------- 1 | *.aux 2 | *.log 3 | *.pdf 4 | -------------------------------------------------------------------------------- /images/Makefile: -------------------------------------------------------------------------------- 1 | pngs = action-command.png pipeline-barrier.png action-barrier-action.png action-barrier-action-mem.png 2 | 3 | all: $(pngs) 4 | 5 | $(pngs): %.png: %.tex deps-common.tex 6 | pdflatex -shell-escape $< 7 | -------------------------------------------------------------------------------- /images/action-barrier-action-mem.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/philiptaylor/vulkan-sync/7e9508a666002d80a28aee53b50cc3afac6145f4/images/action-barrier-action-mem.png -------------------------------------------------------------------------------- /images/action-barrier-action-mem.tex: -------------------------------------------------------------------------------- 1 | \input{deps-common.tex} 2 | 3 | \tikzset{ 4 | action-mem/.pic={ 5 | % 6 | \node [pipestageinner-wrap] (vs) {}; 7 | \node [pipestageinner,below=of vs.north,yshift=+2mm] (vs-inner) {VERT}; 8 | \node [pipestageinner-mem,above=of vs.south,yshift=-2mm] (vs-mem) {WRITE $\mathit{img}_1$}; 9 | \node [pipestageinner,right=of vs] (fs) {FRAG}; 10 | \node [pipestageinner,right=of fs] (ca) {COLR}; 11 | \node [pipestageinner,right=of ca] (comp) {COMP}; 12 | \node [pipestageinner,right=of comp] (xfer) {XFER}; 13 | % 14 | \draw [memequiv] (vs-inner) -- (vs-mem); 15 | % 16 | \node [pipestage,above=of ca] (top) {TOP}; 17 | \node [pipestage,below=of ca] (bottom) {BOTTOM}; 18 | % 19 | \draw (top.south) 20 | edge [->,pipedep,in=90,out=270] (vs) 21 | edge [->,pipedep,in=90,out=270] (fs) 22 | edge [->,pipedep,in=90,out=270] (ca) 23 | edge [->,pipedep,in=90,out=270] (comp) 24 | edge [->,pipedep,in=90,out=270] (xfer); 25 | \draw (bottom.north) 26 | edge [<-,pipedep,in=270,out=90] (vs) 27 | edge [<-,pipedep,in=270,out=90] (fs) 28 | edge [<-,pipedep,in=270,out=90] (ca) 29 | edge [<-,pipedep,in=270,out=90] (comp) 30 | edge [<-,pipedep,in=270,out=90] (xfer); 31 | % 32 | \begin{scope}[on background layer] 33 | \node (bg) [draw,fit=(top) (bottom) (vs) (xfer),inner ysep=16pt] {}; 34 | \node [anchor=north west] at (bg.north west) {#1}; 35 | \end{scope} 36 | % 37 | }} 38 | 39 | \tikzset{ 40 | action-mem2/.pic={ 41 | % 42 | \node [pipestageinner] (vs) {VERT}; 43 | \node [pipestageinner-wrap,right=of vs] (fs) {}; 44 | \node [pipestageinner,below=of fs.north,yshift=+2mm] (fs-inner) {FRAG}; 45 | \node [pipestageinner-mem,above=of fs.south,yshift=-2mm] (fs-mem) {READ $\mathit{img}_1$}; 46 | \node [pipestageinner,right=of fs] (ca) {COLR}; 47 | \node [pipestageinner,right=of ca] (comp) {COMP}; 48 | \node [pipestageinner,right=of comp] (xfer) {XFER}; 49 | % 50 | \draw [memequiv] (fs-inner) -- (fs-mem); 51 | % 52 | \node [pipestage,above=of ca] (top) {TOP}; 53 | \node [pipestage,below=of ca] (bottom) {BOTTOM}; 54 | % 55 | \draw (top.south) 56 | edge [->,pipedep,in=90,out=270] (vs) 57 | edge [->,pipedep,in=90,out=270] (fs) 58 | edge [->,pipedep,in=90,out=270] (ca) 59 | edge [->,pipedep,in=90,out=270] (comp) 60 | edge [->,pipedep,in=90,out=270] (xfer); 61 | \draw (bottom.north) 62 | edge [<-,pipedep,in=270,out=90] (vs) 63 | edge [<-,pipedep,in=270,out=90] (fs) 64 | edge [<-,pipedep,in=270,out=90] (ca) 
65 | edge [<-,pipedep,in=270,out=90] (comp) 66 | edge [<-,pipedep,in=270,out=90] (xfer); 67 | % 68 | \begin{scope}[on background layer] 69 | \node (bg) [draw,fit=(top) (bottom) (vs) (xfer),inner ysep=16pt] {}; 70 | \node [anchor=north west] at (bg.north west) {#1}; 71 | \end{scope} 72 | % 73 | }} 74 | 75 | \tikzset{barrier-vsxf-top-mem/.pic={ 76 | % 77 | \node [pipestageinner] (src-top) {TOP}; 78 | \node [pipestageinner,right=of src-top] (src-bottom) {BOTT}; 79 | \node [pipestageinner,right=of src-bottom] (src-vs) {VERT}; 80 | \node [pipestageinner,right=of src-vs] (src-fs) {FRAG}; 81 | \node [pipestageinner,right=of src-fs] (src-ca) {COLR}; 82 | \node [pipestageinner,right=of src-ca] (src-comp) {COMP}; 83 | \node [pipestageinner,right=of src-comp] (src-xfer) {XFER}; 84 | \node [pipestageinner,yshift=-6cm] (dst-top) {TOP}; 85 | \node [pipestageinner,right=of dst-top] (dst-bottom) {BOTT}; 86 | \node [pipestageinner,right=of dst-bottom] (dst-vs) {VERT}; 87 | \node [pipestageinner,right=of dst-vs] (dst-fs) {FRAG}; 88 | \node [pipestageinner,right=of dst-fs] (dst-ca) {COLR}; 89 | \node [pipestageinner,right=of dst-ca] (dst-comp) {COMP}; 90 | \node [pipestageinner,right=of dst-comp] (dst-xfer) {XFER}; 91 | % 92 | \node [pipestage,node distance=8mm,below=of src-fs] (src) {SRC}; 93 | \node [pipestage,node distance=8mm,above=of dst-fs] (dst) {DST}; 94 | \node [pipestage,node distance=10mm,below=of src] (pretrans) {PRE\_TRANS}; 95 | \node [pipestage,node distance=10mm,above=of dst] (posttrans) {POST\_TRANS}; 96 | % 97 | \node [pipestagemem,node distance=12mm,below=of src-bottom] (flush1) {FLUSH, $\mathit{img}_1$, \texttt{VERT|XFER}, \texttt{SHADER\_WRITE}}; 98 | \node [pipestagemem,node distance=12mm,above=of dst-bottom] (invalidate1) {INVALIDATE, GLOBAL, \texttt{FRAG|TOP}, \texttt{SHADER\_READ}}; 99 | % 100 | \draw (src.west) edge [->,pipedep] (flush1.10); 101 | \draw (flush1.350) edge [->,pipedep] (pretrans.north west); 102 | \draw (posttrans.west) edge [->,pipedep] (invalidate1.10); 103 | \draw (invalidate1.350) edge [->,pipedep] (dst.north west); 104 | % 105 | \draw (src.north) 106 | edge [<-,pipedep,in=270,out=90] (src-top) 107 | edge [<-,pipedep,in=330,out=90] (src-vs) 108 | edge [<-,pipedep,in=270,out=90] (src-xfer); 109 | \draw (dst.south) 110 | edge [->,pipedep,in=90,out=270] (dst-top) 111 | edge [->,pipedep,in=30,out=270] (dst-bottom) 112 | edge [->,pipedep,in=90,out=270] (dst-fs); 113 | \draw [->,pipedep] (src) -- (pretrans); 114 | \draw [->,pipedep] (pretrans) -- (posttrans); 115 | \draw [->,pipedep] (posttrans) -- (dst); 116 | % 117 | \begin{scope}[on background layer] 118 | \node (bg) [draw,fit=(src-top) (dst-xfer),inner ysep=16pt] {}; 119 | \node [anchor=north west] at (bg.north west) {#1}; 120 | \end{scope} 121 | }} 122 | 123 | \begin{document} 124 | \begin{tikzpicture} 125 | 126 | \pic[] (draw1-) {action-mem=Draw 1}; 127 | \pic[yshift=-4cm] (barrier1-) {barrier-vsxf-top-mem=Pipeline barrier}; 128 | \pic[yshift=-14cm] (draw2-) {action-mem2=Draw 2}; 129 | 130 | \draw [->,pipedep2] (draw1-vs.south) -- (barrier1-src-vs.north); 131 | \draw [->,pipedep2] (draw1-xfer.south) -- (barrier1-src-xfer.north); 132 | \draw [->,pipedep2] (barrier1-dst-top.south) -- (draw2-top.north); 133 | \draw [->,pipedep2] (barrier1-dst-fs.south) -- (draw2-fs.north); 134 | 135 | \end{tikzpicture} 136 | \end{document} 137 | -------------------------------------------------------------------------------- /images/action-barrier-action.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/philiptaylor/vulkan-sync/7e9508a666002d80a28aee53b50cc3afac6145f4/images/action-barrier-action.png -------------------------------------------------------------------------------- /images/action-barrier-action.tex: -------------------------------------------------------------------------------- 1 | \input{deps-common.tex} 2 | 3 | \begin{document} 4 | \begin{tikzpicture} 5 | 6 | \pic[] (draw1-) {action=Draw 1}; 7 | \pic[yshift=-4cm] (barrier1-) {barrier-vsxf-top=Pipeline barrier}; 8 | \pic[yshift=-12cm] (draw2-) {action=Draw 2}; 9 | 10 | \draw [->,pipedep2] (draw1-vs.south) -- (barrier1-src-vs.north); 11 | \draw [->,pipedep2] (draw1-xfer.south) -- (barrier1-src-xfer.north); 12 | \draw [->,pipedep2] (barrier1-dst-top.south) -- (draw2-top.north); 13 | 14 | \end{tikzpicture} 15 | \end{document} 16 | -------------------------------------------------------------------------------- /images/action-command.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/philiptaylor/vulkan-sync/7e9508a666002d80a28aee53b50cc3afac6145f4/images/action-command.png -------------------------------------------------------------------------------- /images/action-command.tex: -------------------------------------------------------------------------------- 1 | \input{deps-common.tex} 2 | 3 | \begin{document} 4 | \begin{tikzpicture} 5 | 6 | \pic[] (draw1-) {action=Action command}; 7 | 8 | \end{tikzpicture} 9 | \end{document} 10 | -------------------------------------------------------------------------------- /images/deps-common.tex: -------------------------------------------------------------------------------- 1 | \documentclass[border=2pt,convert={density=150,outext=.png}]{standalone} 2 | \usepackage{mathpazo} 3 | \usepackage{tikz} 4 | \usetikzlibrary{arrows,decorations.pathmorphing,backgrounds,positioning,fit,shapes.misc,calc} 5 | % 6 | \tikzset{ 7 | pipestage/.style={ 8 | rounded rectangle, 9 | minimum width=20mm, 10 | inner sep=2pt, 11 | draw=black, 12 | node distance=15mm, 13 | font=\ttfamily 14 | }, 15 | pipestageinner/.style={ 16 | pipestage, 17 | minimum width=10mm, 18 | node distance=3mm, 19 | font=\ttfamily\small 20 | }, 21 | pipestagelayout/.style={ 22 | pipestage, 23 | minimum width=10mm, 24 | node distance=3mm, 25 | font=\scriptsize 26 | }, 27 | pipedep/.style={ 28 | red,semithick,>=stealth' 29 | }, 30 | pipedep2/.style={ 31 | red,very thick,>=stealth' 32 | }, 33 | pipepixdep/.style={ 34 | red,thick,densely dotted 35 | }, 36 | pipestageinner-wrap/.style={ 37 | pipestageinner, 38 | rectangle, 39 | rounded corners=2mm, 40 | minimum width=12mm, 41 | minimum height=14mm, 42 | }, 43 | pipestageinner-mem/.style={ 44 | pipestageinner, 45 | rectangle, 46 | rounded corners=2mm, 47 | draw=blue, 48 | text width=8mm, 49 | align=center, 50 | font=\rmfamily\scriptsize 51 | }, 52 | pipestagemem/.style={ 53 | pipestage, 54 | rectangle, 55 | rounded corners=2mm, 56 | draw=blue, 57 | text width=16mm, 58 | align=center, 59 | font=\rmfamily\scriptsize 60 | }, 61 | memequiv/.style={ 62 | blue,decorate,decoration={snake,amplitude=0.4mm,segment length=1mm} 63 | }, 64 | barrier/.style={ 65 | rectangle, 66 | minimum width=40mm, 67 | minimum height=8mm, 68 | inner sep=2pt, 69 | draw=black, 70 | font=\ttfamily 71 | }, 72 | barrierport/.style={ 73 | rectangle, 74 | minimum width=40mm, 75 | draw=black, 76 | inner sep=2pt, 77 | font=\ttfamily\footnotesize 
78 | }, 79 | } 80 | % 81 | \tikzset{action/.pic={ 82 | % 83 | \node [pipestageinner] (vs) {VERT}; 84 | \node [pipestageinner,right=of vs] (fs) {FRAG}; 85 | \node [pipestageinner,right=of fs] (ca) {COLR}; 86 | \node [pipestageinner,right=of ca] (comp) {COMP}; 87 | \node [pipestageinner,right=of comp] (xfer) {XFER}; 88 | % 89 | \node [pipestage,above=of ca] (top) {TOP}; 90 | \node [pipestage,below=of ca] (bottom) {BOTTOM}; 91 | % 92 | \draw (top.south) 93 | edge [->,pipedep,in=90,out=270] (vs) 94 | edge [->,pipedep,in=90,out=270] (fs) 95 | edge [->,pipedep,in=90,out=270] (ca) 96 | edge [->,pipedep,in=90,out=270] (comp) 97 | edge [->,pipedep,in=90,out=270] (xfer); 98 | \draw (bottom.north) 99 | edge [<-,pipedep,in=270,out=90] (vs) 100 | edge [<-,pipedep,in=270,out=90] (fs) 101 | edge [<-,pipedep,in=270,out=90] (ca) 102 | edge [<-,pipedep,in=270,out=90] (comp) 103 | edge [<-,pipedep,in=270,out=90] (xfer); 104 | % 105 | \begin{scope}[on background layer] 106 | \node (bg) [draw,fit=(top) (bottom) (vs) (xfer),inner ysep=16pt] {}; 107 | \node [anchor=north west] at (bg.north west) {#1}; 108 | \end{scope} 109 | % 110 | }} 111 | % 112 | \tikzset{barrier/.pic={ 113 | % 114 | \node [pipestageinner] (src-top) {TOP}; 115 | \node [pipestageinner,right=of src-top] (src-bottom) {BOTT}; 116 | \node [pipestageinner,right=of src-bottom] (src-vs) {VERT}; 117 | \node [pipestageinner,right=of src-vs] (src-fs) {FRAG}; 118 | \node [pipestageinner,right=of src-fs] (src-ca) {COLR}; 119 | \node [pipestageinner,right=of src-ca] (src-comp) {COMP}; 120 | \node [pipestageinner,right=of src-comp] (src-xfer) {XFER}; 121 | \node [pipestageinner,yshift=-5cm] (dst-top) {TOP}; 122 | \node [pipestageinner,right=of dst-top] (dst-bottom) {BOTT}; 123 | \node [pipestageinner,right=of dst-bottom] (dst-vs) {VERT}; 124 | \node [pipestageinner,right=of dst-vs] (dst-fs) {FRAG}; 125 | \node [pipestageinner,right=of dst-fs] (dst-ca) {COLR}; 126 | \node [pipestageinner,right=of dst-ca] (dst-comp) {COMP}; 127 | \node [pipestageinner,right=of dst-comp] (dst-xfer) {XFER}; 128 | % 129 | \node [pipestage,node distance=8mm,below=of src-fs] (src) {SRC}; 130 | \node [pipestage,node distance=8mm,above=of dst-fs] (dst) {DST}; 131 | \node [pipestage,node distance=2mm,below=of src] (pretrans) {PRE\_TRANS}; 132 | \node [pipestage,node distance=2mm,above=of dst] (posttrans) {POST\_TRANS}; 133 | \node [pipestagelayout,node distance=21mm,below=of src-vs] (layout1) {Layout transition}; 134 | \node [pipestagelayout,node distance=21mm,below=of src-ca] (layout2) {Layout transition}; 135 | % 136 | \draw (src.north) 137 | edge [<-,pipedep,in=270,out=90] (src-top) 138 | edge [<-,pipedep,in=330,out=90] (src-bottom) 139 | edge [<-,pipedep,in=330,out=90] (src-vs) 140 | edge [<-,pipedep,in=270,out=90] (src-fs) 141 | edge [<-,pipedep,in=210,out=90] (src-ca) 142 | edge [<-,pipedep,in=210,out=90] (src-comp) 143 | edge [<-,pipedep,in=270,out=90] (src-xfer); 144 | \draw (dst.south) 145 | edge [->,pipedep,in=90,out=270] (dst-top) 146 | edge [->,pipedep,in=30,out=270] (dst-bottom) 147 | edge [->,pipedep,in=30,out=270] (dst-vs) 148 | edge [->,pipedep,in=90,out=270] (dst-fs) 149 | edge [->,pipedep,in=150,out=270] (dst-ca) 150 | edge [->,pipedep,in=150,out=270] (dst-comp) 151 | edge [->,pipedep,in=90,out=270] (dst-xfer); 152 | \draw [->,pipedep] (src) -- (pretrans); 153 | \draw [->,pipedep] (pretrans) -- (posttrans); 154 | \draw [->,pipedep] (posttrans) -- (dst); 155 | \draw (pretrans.south) edge [->,pipedep,in=45,out=270] (layout1.north); 156 | \draw (pretrans.south) edge 
[->,pipedep,in=135,out=270] (layout2.north); 157 | \draw (layout1.south) edge [->,pipedep,in=90,out=315] (posttrans.north); 158 | \draw (layout2.south) edge [->,pipedep,in=90,out=225] (posttrans.north); 159 | % 160 | \begin{scope}[on background layer] 161 | \node (bg) [draw,fit=(src-top) (dst-xfer),inner ysep=16pt] {}; 162 | \node [anchor=north west] at (bg.north west) {#1}; 163 | \end{scope} 164 | % 165 | }} 166 | % 167 | \tikzset{barrier-vsxf-top/.pic={ 168 | % 169 | \node [pipestageinner] (src-top) {TOP}; 170 | \node [pipestageinner,right=of src-top] (src-bottom) {BOTT}; 171 | \node [pipestageinner,right=of src-bottom] (src-vs) {VERT}; 172 | \node [pipestageinner,right=of src-vs] (src-fs) {FRAG}; 173 | \node [pipestageinner,right=of src-fs] (src-ca) {COLR}; 174 | \node [pipestageinner,right=of src-ca] (src-comp) {COMP}; 175 | \node [pipestageinner,right=of src-comp] (src-xfer) {XFER}; 176 | \node [pipestageinner,yshift=-4cm] (dst-top) {TOP}; 177 | \node [pipestageinner,right=of dst-top] (dst-bottom) {BOTT}; 178 | \node [pipestageinner,right=of dst-bottom] (dst-vs) {VERT}; 179 | \node [pipestageinner,right=of dst-vs] (dst-fs) {FRAG}; 180 | \node [pipestageinner,right=of dst-fs] (dst-ca) {COLR}; 181 | \node [pipestageinner,right=of dst-ca] (dst-comp) {COMP}; 182 | \node [pipestageinner,right=of dst-comp] (dst-xfer) {XFER}; 183 | % 184 | \node [pipestage,node distance=8mm,below=of src-fs] (src) {SRC}; 185 | \node [pipestage,node distance=8mm,above=of dst-fs] (dst) {DST}; 186 | \node [pipestage,node distance=2mm,below=of src] (pretrans) {PRE\_TRANS}; 187 | \node [pipestage,node distance=2mm,above=of dst] (posttrans) {POST\_TRANS}; 188 | % 189 | \draw (src.north) 190 | edge [<-,pipedep,in=270,out=90] (src-top) 191 | edge [<-,pipedep,in=330,out=90] (src-vs) 192 | edge [<-,pipedep,in=270,out=90] (src-xfer); 193 | \draw (dst.south) 194 | edge [->,pipedep,in=90,out=270] (dst-top) 195 | edge [->,pipedep,in=30,out=270] (dst-bottom); 196 | \draw [->,pipedep] (src) -- (pretrans); 197 | \draw [->,pipedep] (pretrans) -- (posttrans); 198 | \draw [->,pipedep] (posttrans) -- (dst); 199 | % 200 | \begin{scope}[on background layer] 201 | \node (bg) [draw,fit=(src-top) (dst-xfer),inner ysep=16pt] {}; 202 | \node [anchor=north west] at (bg.north west) {#1}; 203 | \end{scope} 204 | % 205 | }} 206 | -------------------------------------------------------------------------------- /images/pipeline-barrier.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/philiptaylor/vulkan-sync/7e9508a666002d80a28aee53b50cc3afac6145f4/images/pipeline-barrier.png -------------------------------------------------------------------------------- /images/pipeline-barrier.tex: -------------------------------------------------------------------------------- 1 | \input{deps-common.tex} 2 | 3 | \begin{document} 4 | \begin{tikzpicture} 5 | 6 | \pic[xshift=10cm] (barrier1-) {barrier=Pipeline barrier}; 7 | 8 | \end{tikzpicture} 9 | \end{document} 10 | -------------------------------------------------------------------------------- /memory.md: -------------------------------------------------------------------------------- 1 | # Vulkan memory dependencies 2 | 3 | To understand memory barriers in Vulkan, it is helpful to understand the typical GPU memory architecture and why they need these barriers. Unfortunately there seems to be a lack of public documentation of how any of this works. 
4 | 5 | This document is an attempt to describe and explain the memory architecture and how Vulkan's memory model maps onto it. I'm not an expert and it may be full of mistakes (in which case I'd appreciate a pointer or explanation of how it really works!) so don't take it too literally. 6 | 7 | 8 | ## Overview of caches 9 | 10 | Skip this section if you are already familiar with how caches work. 11 | 12 | GPUs have access to a large amount of memory. In discrete GPUs this is dedicated VRAM on the graphics card, in integrated GPUs and SoCs it's usually the system RAM and is shared with the CPU. 13 | 14 | This memory is a very long way away from the various processing units in the GPU that need to operate on data from memory - usually it's on a separate chip entirely - so it takes a long time to access and costs a lot of power. 15 | 16 | Caches try to improve this performance. They rely on the observation that if you read a byte from memory you're likely to read it again very soon (e.g. a value in a uniform buffer that is read for every fragment); and if you read a byte from memory you're likely to read nearby bytes very soon (e.g. if you read one vertex's data from a vertex attribute buffer, you'll probably be reading the next vertex too). 17 | 18 | Caches sit on the interface between the processing units and memory. When a single byte is read, the cache will request a larger amount from VRAM (perhaps 32 bytes or more, called a cache line). The cache will then store that whole cache line, so it can respond to subsequent read requests directly from that cached data instead of talking to VRAM again. The cache has a limited size (from kilobytes up to a few megabytes), so it will occasionally forget ('evict') old cache lines to make room for new ones. 19 | 20 | Since the GPU contains many different types of processing units, and many parallel instances of each type, it has a large number and variety of caches that are each tuned for their particular requirements. 21 | 22 | The major difficulty comes when data in VRAM is modified, while it is still cached in one or more caches. When that memory location is read via the cache, the cache may return its stale cached data instead of the latest copy from VRAM, resulting in unexpected and unpredictable behaviour. There are hardware techniques (*cache coherence protocols*) that let the cache detect when memory has been modified so it can drop its stale data and re-read from VRAM, but these are complex and excessively expensive when there are as many caches as a GPU has. The alternative is to rely on software to send a signal to the cache, warning it that a specific range of memory has been modified so the cache needs to drop the stale cache lines. 23 | 24 | In OpenGL, the GL driver is (mostly) responsible for deciding when to send these signals. In Vulkan, that responsibility (mostly) lies with the application instead. The application has to use memory barriers to provide the signals, through `vkCmdPipelineBarrier`, subpass dependencies, and the other synchronisation features. If the application gets it wrong, it may still run correctly on some devices all of the time, and on some devices some of the time, but randomly fail on some other devices at the most inconvenient possible time. 25 | 26 | The description above only considers caches that are used for memory reads. Some processing units can perform writes too, so the caches need to handle these. 
There are two general techniques (with lots of minor variations): 27 | 28 | * Write-through cache: The write is passed directly down to the next level. Additionally, if the write touches a cache line that's currently in the cache, that copy of the cache line will be updated too. This means e.g. a shader can write to a variable then immediately read it back and get the correct value directly from the cache. 29 | * Write-back cache: The write is not passed down immediately. Instead the cache line is updated, and a *dirty* flag is set on it. At some time in the future, the cache will *flush* the dirty cache line down to VRAM and remove the dirty flag. This means e.g. a shader can write to a single variable many times, and all those writes will be handled efficiently by the cache, and only the final result will be written to VRAM. 30 | 31 | This means that when an application wants to write some data in one part of the GPU then read it in another part, it has to flush the writes from the first cache down into VRAM, then invalidate the second cache so it will re-fetch the latest value from VRAM. 32 | 33 | As an extra complexity, caches can be nested: a processing unit can read from a small fast Level 1 (L1) cache, which reads from a larger slower L2 cache, which reads from huge very slow VRAM. In some GPUs there can be five or more levels of caches on top of the RAM. To coordinate properly between writes and reads in different parts of the system, they might need to flush and invalidate multiple caches in the correct sequence. 34 | 35 | 36 | ## GPU memory architecture 37 | 38 | The memory architecture of a modern GPU might look a little bit like this: 39 | 40 | ``` 41 | .--------------------------. .--------------------------. 42 | | CU: | | CU: | 43 | | | | | 44 | | [WI] [WI] [WI] [WI] [WI] | | [WI] [WI] [WI] [WI] [WI] | .---------------. 45 | | | | | | Host CPU: | 46 | | [ALU] [ALU] [ALU] [L/S] | | [ALU] [ALU] [ALU] [L/S] | | | 47 | | | | | | [Core] [Core] | 48 | | [L1$] [T$] [U$] [SM] | | [L1$] [T$] [U$] [SM] | | | | | 49 | '---|-----|-----|----------' '---|-----|-----|----------' | [L1$] [L1$] | 50 | | | | | | | | _|_______|_ | 51 | .---'-----'-----'-----------------'-----'-----'----------. .-------. .-------. .----------. .---------. | [____L2$____] | 52 | | | | | | | | | | | | _____|_____ | 53 | | L2 cache | | ROP | | ROP | | Transfer | | Display | | [____L3$____] | 54 | | | | | | | | | | | | | | 55 | '---------------------------.----------------------------' '---.---' '---.---' '----.-----' '----.----' '-------|-------' 56 | | | | | | | 57 | .---------------------------'----------------------------------'---------'----------'------------'--------------'-------. 58 | | | 59 | | VRAM | 60 | | | 61 | '-----------------------------------------------------------------------------------------------------------------------' 62 | ``` 63 | 64 | (This doesn't correspond to any one particular real GPU, it's a mixture of different ones based on the scant public documentation, plus a few guesses and a few intentional simplifications and many mistakes. The description here is focused on discrete GPUs over integrated ones, but the same general concepts should apply with some variation to all reasonable GPUs.) 65 | 66 | At the top there are a number of *compute units* (CUs) (aka Streaming Multiprocessors, Execution Units, etc). 
Each CU contains the state for a number of *work items* (WIs, aka threads); the state includes an instruction pointer and a bunch of general purpose registers for thread-private data. Each CU contains a number of shader cores (ALUs for arithmetic operations, load/store units for memory operations, etc) that execute instructions on work items. 67 | 68 | In this example, each CU also contains a number of different caches: 69 | * L1 data cache (read-write, for general buffer accesses) 70 | * Texture cache (read-only, for texture samplers) 71 | * Uniform cache (read-only, for dynamically-uniform access patterns (i.e. where every work item in a subgroup is expected to read from the same location)) 72 | 73 | All these caches from all the CUs are connected to a single shared L2 cache. The L2 cache is connected to VRAM. 74 | 75 | Each CU also contains a block of shared memory, which can be accessed by all the work items in a work group. A work group is defined to fit in a single CU, so there is no need to share shared memory outside the CU. 76 | 77 | Additionally there are a number of *ROPs* (render output units / raster operations pipelines), which are responsible for all colour/depth/stencil framebuffer attachment accesses. They probably don't go through the L2 cache; instead they contain their own specialised caches for the attachments. If the device supports framebuffer compression, the ROP is responsible for accumulating a small tile of pixels in its cache and then compressing the tile before writing it to VRAM. 78 | 79 | The host CPU has some connection to VRAM (perhaps over a PCIe bus), behind the CPU's own L1/L2 caches. The presentation engine (i.e. display controller) reads from VRAM, and there might be a transfer DMA engine that reads and writes VRAM directly. 80 | 81 | ### Coherency 82 | 83 | In general, none of the GPU caches are coherent with each other - a write via one cache might not be seen by a later read from a different cache. They rely on invalidate/flush signals from software to maintain correct behaviour. 84 | 85 | One important case is that an application's shader might choose to write to two adjacent bytes in a buffer from two concurrent work items in different CUs. The L1 cache must make sure this works correctly even when the writes are in the same cache line - both writes must eventually reach VRAM correctly merged together, without any explicit synchronisation from the software. Options include: 86 | 87 | * Implement L1 as a write-through cache: every write will be immediately sent to L2 along with a byte mask (indicating precisely which bytes are meant to be updated). The L2 cache will update the appropriate bytes in its own copy of the cache line. The L1 cache can either update and keep its copy of the cache line, or evict it. 88 | * Implement L1 as a write-back cache with a per-byte dirty mask. Every write will update the L1's data and mask. When the L1 decides to evict a cache line, it checks whether any bit in the dirty mask is set; if so then it flushes the line first, by sending the data and byte mask to L2, so only the dirty bytes get updated in L2. It may also flush lines eagerly to reduce the amount of dirty tracking needed. (If it's very eager then this is a lot like a write-combining cache.) 89 | * Don't have an L1 cache. 
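
To make the second option more concrete, here is a toy model of a single write-back cache line with a per-byte dirty mask. This is purely illustrative (the names are invented and real hardware is far more complicated), but it shows why byte-level dirty tracking lets two L1 caches safely write different bytes of the same line:

```cpp
#include <cstdint>

// Toy model of one 64-byte cache line in a write-back L1 cache that keeps a
// per-byte dirty mask (option 2 above). Illustrative only.
struct CacheLine {
    static const int kLineSize = 64;
    uint8_t  data[kLineSize];
    uint64_t dirtyMask = 0;                    // bit i set => data[i] was written

    void writeByte(int offset, uint8_t value) {
        data[offset] = value;
        dirtyMask |= uint64_t(1) << offset;
    }

    // On eviction (or an explicit flush), only the dirty bytes are sent down
    // to L2, so writes to other bytes of the same line made through a
    // different CU's L1 are not overwritten with stale data.
    void flushTo(uint8_t *l2Line) {
        for (int i = 0; i < kLineSize; ++i)
            if (dirtyMask & (uint64_t(1) << i))
                l2Line[i] = data[i];
        dirtyMask = 0;
    }
};
```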
90 | 91 | The L2 cache should not be write-through (that would eliminate a lot of its performance benefit), and cannot afford a per-byte dirty mask (the cache is already very expensive, and the masks would cost an extra 12.5%), so it is likely to be write-back with per-cache-line dirty bits. 92 | 93 | A shader can access variables defined with the `Coherent` decoration in SPIR-V; those accesses are defined to be coherent between different work items (potentially in different CUs), when they are accessing the same bytes through the same buffer view or image view. That means reads and writes to these variables must always go down to the L2 level; they cannot be satisfied from L1. `Coherent` shader variables don't need to be coherent with any non-shader accesses, so they don't have to go all the way to VRAM. 94 | 95 | Atomic operations are defined to be coherent with `Coherent` variables, so they have to be implemented similarly. (In the hardware the atomic behaviour can't be supported by the compute unit itself - the atomic request is typically passed down to the L2 cache, which may receive requests from many work items simultaneously and will execute them all with the correct atomicity.) 96 | 97 | An application might write to one byte from a shader, and write to an adjacent byte from a transfer operation. This requires the transfer unit to be coherent with the L2 cache, which means there must be some hardware coherency support. 98 | 99 | 100 | ### Attachments 101 | 102 | Writes to framebuffer attachments through the ROPs are not coherent with L2, and the ROPs usually write tiles rather than individual pixels - it would break if a shader tried to write to some pixels via an attachment and adjacent pixels via a storage image. The specification therefore forbids (or at least tries to forbid) this case - an image mustn't be accessed by any non-attachment path while it's being used as an attachment in a render pass. (Specifically, the ROP caches need to be flushed/invalidated between use as an attachment and as a non-attachment.) 103 | 104 | Input attachments are a bit tricky: an attachment can be used as both colour (or equivalently depth/stencil) and input in a single subpass. If the colour components written via the colour attachment and read via the input attachment are different, this is specified to be fine as there is no feedback loop - it doesn't matter whether the writes are seen by the reads or not. If there is feedback, Vulkan requires an explicit pipeline barrier. This means it is possible for one device to implement input attachments as textures with appropriate flushing/invalidating at the barrier, while other devices (mainly tile-based renderers) might implement them through the same data path as colour attachments. 105 | 106 | 107 | ### Host CPU accesses 108 | 109 | Memory accesses from the host have to deal with the CPU caches, which are quite different from GPU caches. In Vulkan, memory types with the `HOST_VISIBLE` flag can have three caching modes: 110 | 111 | * `HOST_COHERENT`: Usually implemented with write-combining on the CPU. Writes are cached a little bit - bytes written to the same cache line are collected in a write buffer, and eventually the write buffer will be flushed to VRAM in a single memory transaction. If only a partial cache line has been written, it may be sent with a mask or split into multiple transactions, so only the relevant bytes will be updated in VRAM.
The timing of the "eventually" is important: pretty much any other operation that might be visible to the GPU (writes to hardware registers or uncached memory, atomics, etc) will automatically flush the write buffer, so the GPU will never receive a signal saying data is available before the data is actually available. Reads are fully uncached on the CPU and will always read from VRAM (so they are very expensive). 112 | 113 | * `HOST_CACHED`: Cached on the CPU. All reads and writes use the CPU's L1/L2/L3 caches. After the CPU has done some writes, `vkFlushMappedMemoryRanges` will flush dirty cache lines to VRAM. Before the CPU does any reads or writes, `vkInvalidateMappedMemoryRanges` will invalidate the CPU caches, to let it fetch the latest data from VRAM. Since the CPU doesn't have per-byte dirty flags, these flushes/invalidates can't work at a byte granularity - they work at `VkPhysicalDeviceLimits::nonCoherentAtomSize` granularity instead (typically the CPU cache line size of 64 bytes). Since the cache contains whole cache lines, if you write to only part of a cache line then the CPU will have to fetch the rest of the line from VRAM; that means this can actually be more expensive than write-combining memory for writes. 114 | 115 | * `HOST_CACHED` and `HOST_COHERENT`: Cached on the CPU, with some hardware mechanism to automatically ensure coherency between the GPU and CPU caches. 116 | 117 | 118 | ## Memory dependencies 119 | 120 | Vulkan defines the concept of memory dependencies or memory barriers to let the application provide any necessary cache-maintenance hints. 121 | 122 | Source memory dependencies are defined for writes of specific types (`srcAccessMask`) in specific pipeline stages (`srcStageMask`), and they make these writes *available* to subsequent accesses. 123 | 124 | Destination memory dependencies are defined for reads and writes of specific types (`dstAccessMask`) in specific pipeline stages (`dstStageMask`), and they make earlier available writes *visible* to subsequent accesses. 125 | 126 | To ensure the sequence of write, make-available, make-visible and read will occur in the right order, there must be execution dependencies between them. Both halves of the memory dependency can be provided in a single `vkCmdPipelineBarrier` command, or they can be provided in separate barriers as long as there is some execution dependency chain between them. 127 | 128 | At the hardware level, the combination of `accessMask` and `stageMask` identifies a specific cache or set of caches. Making a write 'available' means flushing a cache so the write ends up in some appropriate level of the memory hierarchy, and making it 'visible' means invalidating a cache so that subsequent reads will be satisfied from that same level of the memory hierarchy. 129 | 130 | Some care is needed when invalidating non-read-only caches: it is legal to make writes visible to a cache that already contains some dirty data in the same cache line, and that dirty data must not be lost. That means the invalidate must really be implemented as a flush-then-invalidate operation. 131 | 132 | There is no point ever specifying a `READ` access type in `srcAccessMask` - it doesn't make sense to flush a read, only a write.
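As a concrete illustration of pairing the two halves in a single barrier, here is a hedged sketch of the common read-after-write case - a compute shader writes into a buffer which a transfer command then reads (the command buffer `cmd`, buffers `src`/`dst` and the copy `region` are assumed to be set up elsewhere):

```cpp
// Make the compute shader's writes available (srcAccessMask = a WRITE type)
// and visible to the transfer stage's reads (dstAccessMask = a READ type).
VkMemoryBarrier barrier = {};
barrier.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
barrier.dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;

vkCmdDispatch(cmd, 64, 1, 1);                 // writes into 'src' via a storage buffer
vkCmdPipelineBarrier(cmd,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,     // srcStageMask: where the writes happened
    VK_PIPELINE_STAGE_TRANSFER_BIT,           // dstStageMask: where the reads will happen
    0,                                        // dependencyFlags
    1, &barrier,                              // one global memory barrier
    0, nullptr,                               // no buffer memory barriers
    0, nullptr);                              // no image memory barriers
vkCmdCopyBuffer(cmd, src, dst, 1, &region);   // reads 'src' in the TRANSFER stage
```

For a `READ` access type in `dstAccessMask` like this, the required cache operation is straightforward: invalidate whichever cache(s) that stage reads through.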
`WRITE` in `dstAccessMask` is trickier: for example in a write-through L1 cache it doesn't matter if there are stale cache lines, since writes will pass through stale cache lines just as well as fresh ones, so there is no need to invalidate the cache; but for a write-back L2 cache with per-cache-line dirty tracking it is still necessary to invalidate before a write, else stale bytes in the same cache line that don't get overwritten will later get flushed back to VRAM. 133 | 134 | Applications must therefore use `WRITE` access types in `srcAccessMask`, and `READ` and/or `WRITE` in `dstAccessMask` depending on whether it's a read-after-write or write-after-write situation. Operations like atomics and fragment blending must use `READ|WRITE` in `dstAccessMask`. 135 | 136 | We can look at our hypothetical GPU memory architecture and work out which cache operations are needed to implement memory dependencies. There are two reasonable-sounding choices as to which particular level of the memory hierarchy we should flush to: either VRAM or the L2 cache. We'll consider both options. We'll also make a few other assumptions: 137 | * The L1 data caches are write-through, so they never need flushing. 138 | * Input attachments are read like textures. 139 | 140 | We've omitted some shaders (geometry, tessellation, compute) that are identical to vertex shaders. We've also combined the early and late fragment test stages into a single column, since they behave identically. In all these tables, "-" means a `stageMask`/`accessMask` combination that doesn't make sense, since those stages never make accesses of those types; any memory dependencies on these combinations will be ignored. The `VK_ACCESS_MEMORY_...` types seem to be specified too vaguely to figure out what they're meant to do. 141 | 142 | Using VRAM as the level of coherency: 143 | 144 | | srcAccessMask \ srcStageMask | DRAW_INDIRECT | VERTEX_INPUT | VERTEX_SHADER | FRAGMENT_SHADER | FRAGMENT_TESTS | COLOR_ATTACHMENT_OUTPUT | TRANSFER | HOST | 145 | | --- | --- | --- | --- | --- | --- | --- | --- | --- | 146 | | VK_ACCESS_SHADER_WRITE_BIT | - | - | flush L2 | flush L2 | - | - | - | - | 147 | | VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT | - | - | - | - | - | flush ROP | - | - | 148 | | VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT | - | - | - | - | flush ROP | - | - | - | 149 | | VK_ACCESS_TRANSFER_WRITE_BIT | - | - | - | - | - | - | nothing | - | 150 | | VK_ACCESS_HOST_WRITE_BIT | - | - | - | - | - | - | - | nothing | 151 | | VK_ACCESS_MEMORY_WRITE_BIT | ??? | ??? | ??? | ??? | ??? | ??? | ??? | ??? 
| 152 | 153 | | dstAccessMask \ dstStageMask | DRAW_INDIRECT | VERTEX_INPUT | VERTEX_SHADER | FRAGMENT_SHADER | FRAGMENT_TESTS | COLOR_ATTACHMENT_OUTPUT | TRANSFER | HOST | 154 | | --- | --- | --- | --- | --- | --- | --- | --- | --- | 155 | | VK_ACCESS_INDIRECT_COMMAND_READ_BIT | invalidate L2,L1 | - | - | - | - | - | - | - | 156 | | VK_ACCESS_INDEX_READ_BIT | - | invalidate L2,L1 | - | - | - | - | - | - | 157 | | VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT | - | invalidate L2,L1 | - | - | - | - | - | - | 158 | | VK_ACCESS_UNIFORM_READ_BIT | - | - | invalidate L2,U$ | invalidate L2,U$ | - | - | - | - | 159 | | VK_ACCESS_INPUT_ATTACHMENT_READ_BIT | - | - | - | invalidate L2,T$ | - | - | - | - | 160 | | VK_ACCESS_SHADER_READ_BIT | - | - | invalidate L2,L1,T$ | invalidate L2,L1,T$ | - | - | - | - | 161 | | VK_ACCESS_SHADER_WRITE_BIT | - | - | invalidate L2 | invalidate L2 | - | - | - | - | 162 | | VK_ACCESS_COLOR_ATTACHMENT_READ_BIT | - | - | - | - | - | invalidate ROP | - | - | 163 | | VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT | - | - | - | - | - | invalidate ROP | - | - | 164 | | VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_READ_BIT | - | - | - | - | invalidate ROP | - | - | - | 165 | | VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT | - | - | - | - | invalidate ROP | - | - | - | 166 | | VK_ACCESS_TRANSFER_READ_BIT | - | - | - | - | - | - | nothing | - | 167 | | VK_ACCESS_TRANSFER_WRITE_BIT | - | - | - | - | - | - | nothing | - | 168 | | VK_ACCESS_HOST_READ_BIT | - | - | - | - | - | - | - | nothing | 169 | | VK_ACCESS_HOST_WRITE_BIT | - | - | - | - | - | - | - | nothing | 170 | | VK_ACCESS_MEMORY_READ_BIT | ??? | ??? | ??? | ??? | ??? | ??? | ??? | ??? | 171 | | VK_ACCESS_MEMORY_WRITE_BIT | ??? | ??? | ??? | ??? | ??? | ??? | ??? | ??? | 172 | 173 | Using L2 as the level of coherency: 174 | 175 | | srcAccessMask \ srcStageMask | DRAW_INDIRECT | VERTEX_INPUT | VERTEX_SHADER | FRAGMENT_SHADER | FRAGMENT_TESTS | COLOR_ATTACHMENT_OUTPUT | TRANSFER | HOST | 176 | | --- | --- | --- | --- | --- | --- | --- | --- | --- | 177 | | VK_ACCESS_SHADER_WRITE_BIT | - | - | nothing | nothing | - | - | - | - | 178 | | VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT | - | - | - | - | - | flush ROP, invalidate L2 | - | - | 179 | | VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT | - | - | - | - | flush ROP, invalidate L2 | - | - | - | 180 | | VK_ACCESS_TRANSFER_WRITE_BIT | - | - | - | - | - | - | invalidate L2 | - | 181 | | VK_ACCESS_HOST_WRITE_BIT | - | - | - | - | - | - | - | invalidate L2 | 182 | | VK_ACCESS_MEMORY_WRITE_BIT | ??? | ??? | ??? | ??? | ??? | ??? | ??? | ??? 
| 183 | 184 | | dstAccessMask \ dstStageMask | DRAW_INDIRECT | VERTEX_INPUT | VERTEX_SHADER | FRAGMENT_SHADER | FRAGMENT_TESTS | COLOR_ATTACHMENT_OUTPUT | TRANSFER | HOST | 185 | | --- | --- | --- | --- | --- | --- | --- | --- | --- | 186 | | VK_ACCESS_INDIRECT_COMMAND_READ_BIT | invalidate L1 | - | - | - | - | - | - | - | 187 | | VK_ACCESS_INDEX_READ_BIT | - | invalidate L1 | - | - | - | - | - | - | 188 | | VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT | - | invalidate L1 | - | - | - | - | - | - | 189 | | VK_ACCESS_UNIFORM_READ_BIT | - | - | invalidate U$ | invalidate U$ | - | - | - | - | 190 | | VK_ACCESS_INPUT_ATTACHMENT_READ_BIT | - | - | - | invalidate T$ | - | - | - | - | 191 | | VK_ACCESS_SHADER_READ_BIT | - | - | invalidate L1,T$ | invalidate L1,T$ | - | - | - | - | 192 | | VK_ACCESS_SHADER_WRITE_BIT | - | - | nothing | nothing | - | - | - | - | 193 | | VK_ACCESS_COLOR_ATTACHMENT_READ_BIT | - | - | - | - | - | flush L2, invalidate ROP | - | - | 194 | | VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT | - | - | - | - | - | flush L2, invalidate ROP | - | - | 195 | | VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_READ_BIT | - | - | - | - | flush L2, invalidate ROP | - | - | - | 196 | | VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT | - | - | - | - | flush L2, invalidate ROP | - | - | - | 197 | | VK_ACCESS_TRANSFER_READ_BIT | - | - | - | - | - | - | flush L2 | - | 198 | | VK_ACCESS_TRANSFER_WRITE_BIT | - | - | - | - | - | - | flush L2 | - | 199 | | VK_ACCESS_HOST_READ_BIT | - | - | - | - | - | - | - | flush L2 | 200 | | VK_ACCESS_HOST_WRITE_BIT | - | - | - | - | - | - | - | flush L2 | 201 | | VK_ACCESS_MEMORY_READ_BIT | ??? | ??? | ??? | ??? | ??? | ??? | ??? | ??? | 202 | | VK_ACCESS_MEMORY_WRITE_BIT | ??? | ??? | ??? | ??? | ??? | ??? | ??? | ??? | 203 | 204 | 205 | 206 | ## Aliasing 207 | 208 | Most of our discussion has been about bytes, and the ability to write to adjacent bytes from different processing units. For buffers, the relationship between bytes and buffer elements is obvious. Vulkan allows multiple buffers to be bound to overlapping (*aliased*) regions of memory with well-defined results. Writes to different bytes through different buffers should work with no extra requirements. Accesses to the same bytes through different buffers require memory barriers - e.g. if you write to a byte through a storage buffer, then read it back through an aliased uniform buffer, you need to ensure the uniform cache has been invalidated before the read. 209 | 210 | (TODO: Actually, doesn't that apply even when it's a single buffer bound as two separate descriptors? What prevents problems in that case?) 211 | 212 | Images are more complicated. Vulkan says that linear images in `PREINITIALIZED` or `GENERAL` layouts have a well-defined layout in memory (called *host-accessible*) - you can determine exactly which bytes correspond to which pixels by calling `vkGetImageSubresourceLayout`. In these cases you can access the same underlying memory through an image or a buffer with well-defined behaviour, since they're just accessing the same bytes through different routes, and aliasing works the same as with buffers. 213 | 214 | Linear images in other layouts, and optimal images, do not have a well-defined layout - any byte might correspond to any pixel, or even to multiple pixels (see the Framebuffer Compression section below).
In these cases, if you write a pixel through an image then try to read a byte through a buffer, the result is undefined; and similarly if you write a byte through a buffer then try to read it through an image. 215 | 216 | Since there is not necessarily a simple fixed mapping between pixels and bytes for optimal images, there may be difficulties when two different processing units try to write to nearby pixels of the same image. For example, if a shader writes a single pixel, and the shader hardware knows how to write compressed images, it must load the whole tile from L2, decompress it, update it, recompress it, and write the whole tile back. If a transfer simultaneously tries to write a different pixel in the same tile, and supports compression in the same way, one of the updates will be lost. To avoid this situation, the driver must avoid using compressed layouts whenever it is possible for multiple non-coherent units to write to an image (as indicated by its layout and usage flags). 217 | 218 | 219 | ## Framebuffer compression 220 | 221 | Framebuffers typically account for a significant bandwidth cost - they are large and are often read and written multiple times per frame (for blending and postprocessing). But they are rarely full of random noise; they usually contain areas of smooth colour that can be losslessly compressed fairly easily, allowing a significant bandwidth saving. 222 | 223 | In general, the idea is to split the framebuffer into tiles of maybe 8x8 pixels (256 bytes), where each tile corresponds to a reasonable multiple of the memory controller's transaction size - maybe it's a 256-bit memory bus, so each 256B of tile data is 8 transactions' worth. (The important thing about the transaction size is that bandwidth is counted in transactions, not bytes, so you want to send as few transactions as possible but fully utilise all the bytes in each transaction.) For each framebuffer there is also a small array of compression metadata, with a few bits per tile. 224 | 225 | The metadata determines how to interpret the tile data, and its states could include: 226 | 227 | * Uncompressed: the 256B of tile data contains the raw RGBA values of all 8x8 pixels. 228 | * 2:1 compressed: the first 128B of tile data contains a compressed representation of the 8x8 pixels. The second 128B contains garbage. That means reads/writes to this tile only need 4 memory transactions, not 8, reducing the bandwidth cost by half. 229 | * 4:1 compressed: the first 64B of tile data contains a compressed representation. 230 | * 8:1 compressed: the first 32B of tile data contains a compressed representation. (There's not much point compressing at 16:1 or more - it's still going to need at least one memory transaction per tile.) 231 | * Cleared to (0,0,0,0): All of the tile data is garbage; any attempts to read this tile will just get the constant colour (0,0,0,0) without reading any tile data. This also means the device can clear an entire image extremely efficiently by writing only to the metadata array; it doesn't have to touch the tile data at all. 232 | * Cleared to (0,0,0,255), (255,255,255,0), (255,255,255,255), maybe a few other useful constants: Same thing; you just spend more metadata bits to make the fast clear path support more colours. 233 | 234 | This means we need maybe 4 bits of metadata per tile, so it's only a ~0.2% increase in memory usage for all images, and a similar increase in bandwidth for incompressible data, to gain a significant decrease in bandwidth for compressible data.
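As a rough sketch of what the per-tile metadata might look like (the states and the exact encoding here are assumptions for illustration, not any particular GPU's scheme):

```cpp
#include <cstdint>

// Hypothetical 4-bit metadata state for one 8x8-pixel (256-byte) tile.
enum class TileState : uint8_t {
    Uncompressed,        // all 256 bytes are valid: 8 transactions on a 256-bit bus
    Compressed2to1,      // only the first 128 bytes are meaningful: 4 transactions
    Compressed4to1,      // only the first 64 bytes are meaningful: 2 transactions
    Compressed8to1,      // only the first 32 bytes are meaningful: 1 transaction
    ClearedTransparent,  // (0,0,0,0): no tile data is read at all
    ClearedOpaqueBlack,  // (0,0,0,255), and so on for a few more constants
};

// How many bytes of tile data a read of this tile actually has to fetch.
constexpr uint32_t bytesToFetch(TileState s) {
    switch (s) {
        case TileState::Uncompressed:   return 256;
        case TileState::Compressed2to1: return 128;
        case TileState::Compressed4to1: return 64;
        case TileState::Compressed8to1: return 32;
        default:                        return 0;   // fast-cleared: metadata only
    }
}
```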
235 | 236 | (In practice some framebuffer compression mechanisms are quite different to this, but they all involve having some metadata table that describes how and/or where to read the tile data.) 237 | 238 | The metadata might be stored alongside the image in its allocated device memory, or it might be stored in some special hidden memory. This means that opaque images are truly opaque: you can't use the host or a buffer-transfer command to copy their bytes to a different location and trust that they can still be used correctly as images, since you won't have copied the compression metadata. 239 | 240 | Incidentally, an image could be transitioned from uncompressed to compressed very cheaply by simply filling its metadata array with the "uncompressed" flag, leaving the existing tile data valid as it is. This greatly reduces the transition cost, at the expense of not optimising for any subsequent reads; but if the image is most likely to be read only once or never before being overwritten (which is typical for framebuffers), this seems a sensible optimisation. On the other hand, converting from compressed to uncompressed necessarily involves rewriting any tile data that was previously compressed, so it's worth trying hard to avoid decompression. 241 | 242 | 243 | ## Buffer image granularity 244 | 245 | To support framebuffer compression, some implementations may associate the compression metadata with regions of memory, not with specific image resources, so the compression will work transparently for any processing unit that tries to read that memory without the need to pass compression flags through some other channel. 246 | 247 | The compression-enabled flag and a pointer to the compression metadata can be conveniently stored inside the page tables, which are cached in TLBs. Any memory access will already have to do a TLB lookup for the virtual-to-physical translation, so it gets this compression information for free at the same time, and only has the extra cost of fetching the metadata when it knows the page is compressed. Pages are typically 64KB (it's not a coincidence that Vulkan's standard sparse image block shapes are 64KB), so each page contains 256 tiles and needs maybe 128 bytes of metadata, which is a reasonable size (comparable to the L1 cache line size and the TLB entry size). 248 | 249 | On the downside, that means that if you store a compressed image somewhere, the entire surrounding 64KB page will always be interpreted as compressed, even if you actually wanted to store some uncompressed image or buffer nearby - so you need to keep compressed images separated from other resources by 64KB. 250 | 251 | Beyond compression, a similar technique can be used for optimally-tiled images, with the tiling mode flags stored in the page table and the de-tiling performed automatically for any accesses to the page. That means optimal images can be on the same page as each other, but must be kept on separate pages from linear images and buffers. 252 | 253 | `bufferImageGranularity` is Vulkan's way of exposing this requirement to applications. Some GPUs report a granularity of 64KB, matching the page size. Others have no such requirement, because the compression and tiling flags are part of the resource state instead (along with image format and dimensions etc), so they report a granularity of 1 byte. Unfortunately the variable name is actively misleading: it's not about buffers vs images, it's about optimal images vs everything else.
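For applications doing their own sub-allocation, the requirement can be handled with a small amount of bookkeeping. A hedged sketch (the function and parameter names are made up for illustration; `bufferImageGranularity` comes from `VkPhysicalDeviceLimits` and is assumed here to be a power of two):

```cpp
#include <cstdint>

// Round 'value' up to a multiple of 'alignment' (alignment must be a power of two).
inline uint64_t alignUp(uint64_t value, uint64_t alignment) {
    return (value + alignment - 1) & ~(alignment - 1);
}

// Choose the offset for the next resource in a VkDeviceMemory block.
// 'Linear' here means buffers and linear-tiled images; 'non-linear' means
// optimally-tiled images. If two neighbouring resources are of different
// kinds, push the new one onto its own granularity-sized page.
uint64_t placeNextResource(uint64_t previousEnd,             // end of the previous resource
                           bool previousIsLinear,
                           bool nextIsLinear,
                           uint64_t requiredAlignment,        // from VkMemoryRequirements::alignment
                           uint64_t bufferImageGranularity)   // from VkPhysicalDeviceLimits
{
    uint64_t offset = alignUp(previousEnd, requiredAlignment);
    if (previousIsLinear != nextIsLinear && bufferImageGranularity > 1) {
        offset = alignUp(offset, bufferImageGranularity);
    }
    return offset;
}
```

On a device that reports a granularity of 1 byte this collapses to plain alignment, so the check costs nothing where it isn't needed.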
254 | 255 | 256 | ## References 257 | 258 | A few sources with some useful information or explanations: 259 | 260 | * NVIDIA: 261 | * http://arxiv.org/pdf/1509.02308.pdf ("Dissecting GPU Memory Hierarchy through Microbenchmarking") 262 | * http://www.eecg.toronto.edu/~myrto/gpuarch-ispass2010.pdf ("Demystifying GPU Microarchitecture through Microbenchmarking") 263 | * http://www.anandtech.com/show/8935/geforce-gtx-970-correcting-the-specs-exploring-memory-allocation/2 264 | * http://international.download.nvidia.com/pdf/tegra/Tegra-X1-whitepaper-v1.0.pdf 265 | * http://docs.nvidia.com/cuda/parallel-thread-execution/#cache-operators 266 | * http://envytools.readthedocs.io/en/latest/hw/memory/g80-vram.html 268 | * AMD: 269 | * https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf 270 | * http://gpuopen.com/vulkan-device-memory/ 271 | * Intel: 272 | * https://software.intel.com/en-us/file/the-compute-architecture-of-intel-processor-graphics-gen9-v1d0pdf 273 | * http://www.realworldtech.com/sandy-bridge-gpu/8/ 274 | * General: 275 | * https://fgiesen.wordpress.com/2011/07/12/a-trip-through-the-graphics-pipeline-2011-part-9/ 276 | * https://fgiesen.wordpress.com/2011/10/09/a-trip-through-the-graphics-pipeline-2011-part-13/ 277 | * https://fgiesen.wordpress.com/2013/01/29/write-combining-is-not-your-friend/ 278 | --------------------------------------------------------------------------------