├── README.md ├── intro.tex ├── main.tex ├── memory.bib ├── memory.tex ├── sw_stack.bib └── sw_stack.tex /README.md: -------------------------------------------------------------------------------- 1 | # A Trip Through The Graphics Pipeline 2012 2 | 3 | This repository contains the current version of my text for "A Trip Through The Graphics Pipeline 2012", 4 | a cleaned-up and slightly updated version of my series of blog posts from 2011, in source form (.tex 5 | files). I intend to release the completed text under a Creative Commons BY-NC-SA license. 6 | 7 | -Fabian "ryg" Giesen 8 | -------------------------------------------------------------------------------- /intro.tex: -------------------------------------------------------------------------------- 1 | \chapter{Introduction} 2 | \label{ch:intro} 3 | In July 2011, I spontaneously started writing ``A trip through the Graphics Pipeline 2011'', a series of 4 | blog posts that would eventually number 13~parts and 35000~words. The goal was to explain the nuts and bolts 5 | of the graphics pipeline as implemented in modern high-end (desktop) GPUs. The only planning that went into it 6 | was a two-page outline that contained nothing more than a rough breakdown into separate posts, with 7 | maybe four or five keywords per part. I wrote the parts in sequence and posted them as soon as they were 8 | ready---initially at a rate of one per day, but slowing down quite a bit towards the end. The big advantage 9 | was that I got immediate feedback, but it also meant that the posts were all essentially in draft state and 10 | never got properly revised. There were also some holes and structural problems with the original outline that 11 | I was stuck with. This text is the attempt to rectify that situation by revising and expanding upon the original 12 | material and collecting it all in one place. 13 | 14 | The intended audience is programmers with an interest in the low-level workings of modern GPUs. Familiarity with 15 | the basic ideas of modern CPU architectures (caches, pipelining, SIMD code, multi-core architectures) is assumed, 16 | and a basic understanding of how hardware in general works doesn't hurt. 17 | 18 | This text is broken down into chapters, roughly corresponding to the different topics covered in the original 19 | series of blog posts. The chapters are fairly self-contained and initially proceed in ``dataflow order''---from the 20 | application generating draw calls all the way down to pixels being written to the frame buffer, explaining the most 21 | commonly used path (Vertex and Pixel Shaders, but none of the other shader types) first. Later chapters cover Geometry 22 | Shaders and the Tessellation pipeline, both of which extend the basic graphics pipeline with more front-end 23 | (pre-rasterization) stages. 
24 | -------------------------------------------------------------------------------- /main.tex: -------------------------------------------------------------------------------- 1 | \documentclass[DIV10]{scrreprt} 2 | \usepackage{fourier} 3 | \usepackage{url} 4 | \usepackage[sectionbib]{chapterbib} 5 | \usepackage[sectionbib,square]{natbib} 6 | \usepackage{tikz} 7 | \usetikzlibrary{arrows,positioning} 8 | \definecolor{NavyBlue}{cmyk}{0.94,0.54,0,0} 9 | \usepackage[raiselinks=true,% 10 | colorlinks=true,% 11 | linkcolor=NavyBlue,% 12 | menucolor=NavyBlue,% 13 | citecolor=NavyBlue,% 14 | urlcolor=NavyBlue,% 15 | bookmarks=true,% 16 | bookmarksopenlevel=1,% 17 | bookmarksopen=true,% 18 | bookmarksnumbered=true,% 19 | hyperindex=true,% 20 | plainpages=false,% 21 | pdfpagelabels=true,% 22 | ]{hyperref} 23 | 24 | \newcommand{\code}[1]{\texttt{#1}} 25 | 26 | \setkomafont{title}{\bfseries} 27 | \setkomafont{disposition}{\bfseries} 28 | 29 | \begin{document} 30 | \bibliographystyle{plainnat} 31 | \setcitestyle{numbers} 32 | 33 | \title{A Trip Through The Graphics Pipeline 2012} 34 | \author{Fabian Giesen} 35 | \maketitle 36 | \tableofcontents 37 | 38 | % Include the chapters 39 | \include{intro} 40 | \include{sw_stack} 41 | \include{memory} 42 | 43 | \end{document} 44 | -------------------------------------------------------------------------------- /memory.bib: -------------------------------------------------------------------------------- 1 | @manual{membw-i7-3960x, 2 | Author = {Intel}, 3 | Title = {Intel Core i7-3960X Processor Extreme Edition}, 4 | Year = {2012}, 5 | URL = {http://ark.intel.com/products/63696/Intel-Core-i7-3960X-Processor-Extreme-Edition-(15M-Cache-up-to-3_90-GHz)} 6 | } 7 | @manual{membw-i7-2700k, 8 | Author = {Intel}, 9 | Title = {Intel Core i7-2700K Processor}, 10 | Year = {2012}, 11 | URL = {http://ark.intel.com/products/61275/Intel-Core-i7-2700K-Processor-(8M-Cache-up-to-3_90-GHz)} 12 | } 13 | @manual{membw-radeonhd-7970, 14 | Author = {Wikipedia}, 15 | Title = {Southern Islands (GPU family): Chipset table, Radeon HD 7970}, 16 | Year = {2012}, 17 | URL = {http://en.wikipedia.org/wiki/Southern_Islands_(GPU_family)#Chipset_table} 18 | } 19 | @manual{membw-geforce-gtx680, 20 | Author = {Wikipedia}, 21 | Title = {Comparison of Nvidia graphics processing units: GeForce 600 Series, GeForce GTX 680}, 22 | Year = {2012}, 23 | URL = {http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units#GeForce_600_Series} 24 | } 25 | @manual{memlat-i7, 26 | Author = {The Tech Report}, 27 | Title = {Memory subsystem performance}, 28 | Year = {2011}, 29 | URL = {http://techreport.com/articles.x/20188/5} 30 | } 31 | @manual{memlat-nvidia, 32 | Author = {Steve Rennich and NVIDIA Corporation}, 33 | Title = {Fundamental Optimizations: Global Memory}, 34 | Year = {2011}, 35 | URL = {http://www.stanford.edu/dept/ICME/docs/seminars/Rennich-2011-04-25.pdf} 36 | } 37 | @manual{elpida-gddr5, 38 | Author = {Elpida}, 39 | Title = {Introduction to GDDR5 SGRAM}, 40 | Year = {2010}, 41 | URL = {http://www.elpida.com/pdfs/E1600E10.pdf} 42 | } 43 | -------------------------------------------------------------------------------- /memory.tex: -------------------------------------------------------------------------------- 1 | \chapter{GPU Memory Subsystem and Host Interface} 2 | \label{ch:memory} 3 | \bibliographystyle{plainnat} 4 | 5 | Before a GPU can do anything with the DMA buffers written by the (kernel mode) driver, 6 | it needs to be able to read them first, which entails communication 
between CPU and GPU 7 | and interfacing with GPU memory. On the CPU side, processors have long since stopped 8 | passing every memory access to RAM chips, and instead employ a sophisticated memory 9 | hierarchy with multiple levels of caches.\footnote{Caches of various kinds are by far 10 | the biggest part of contemporary CPUs in terms of area.} GPU memory systems have very 11 | different goals and operate under different design constraints; while they still use 12 | caches, they generally tend to be smaller and favor different usage patterns. This unique 13 | type of memory architecture greatly influences the way shader cores handle loads and stores, 14 | and resonates through the whole design of modern GPUs. 15 | 16 | This chapter will also briefly cover the host interface, the part of the GPU that talks 17 | to the outside world (usually the CPU). 18 | 19 | \section{The memory subsystem} 20 | 21 | As a regular programmer, the most important thing to understand about the memory 22 | subsystem of GPUs is that, compared to a CPU, it is both incredibly fast and incredibly 23 | slow at the same time. Fast, in that GPU memory systems deliver incredible amounts of 24 | \emph{bandwidth}: At the time of writing, the fastest desktop CPU you can buy from Intel, 25 | the Core i7-3960X, has a theoretical peak memory bandwidth of 51.2~GB/s~\citep{membw-i7-3960x}; 26 | a more pedestrian (but still very much high-end) Core i7-2700K maxes out at 21~GB/s~\citep{membw-i7-2700k}. 27 | For comparison, recent high-end discrete GPUs from AMD and Nvidia boast memory bandwidths of 28 | 264~GB/s~\citep{membw-radeonhd-7970} and 192~GB/s~\citep{membw-geforce-gtx680}, respectively---a 29 | tremendous difference from the values we see for CPUs. 30 | 31 | At the same time, GPU memory subsystems are very slow, in the sense that they have high 32 | memory access latency. Again some numbers: memory access latency on Core~i7 level CPUs comes 33 | out at about 45ns~\citep{memlat-i7}, while Nvidia quotes about 400--800 cycles of memory access 34 | latency for Fermi generation chips~\citep{memlat-nvidia}. At the 1544MHz high-end Fermi parts 35 | typically run at, this means about 260--520ns---again a difference of more than 5$\times$ between high-end 36 | CPUs and GPUs, but this time in the opposite direction. 37 | 38 | This is not a coincidence: GPUs want high memory bandwidth and are willing to trade a significant 39 | increase in latency (and also power draw, which I will ignore in this text) in return. This is part 40 | of a general pattern: GPUs greatly favor throughput over latency, so when there is an opportunity to 41 | trade one for the other, GPUs will go for throughput every time. Of course, this is easier said than 42 | done: optimizing for throughput works differently than optimizing for latency does, and software 43 | developers tend to be far more familiar with the latter. We will see this subject come up again and again 44 | throughout this text. 45 | 46 | But first things first: how do we get high memory bandwidth in the first place? There's several ways. 47 | The first and conceptually easiest is to just increase memory clock rate. All other things being equal, 48 | if we increase the number of memory transfers per second, we will manage to access more memory in a second. 
49 | However, power draw goes up dramatically with increased clock frequencies, which limits practical clock rates; 50 | as of this writing, mainstream GPUs ship with memory clock rates of 1.5GHz or less.\footnote{GPU vendors 51 | sometimes like to quote memory \emph{data rate} instead of \emph{clock frequency}; for GDDR5, data rate is $4 \times$ 52 | the clock frequency, while for GDDR3 and DDR2/3 memory it is $2 \times$ the clock frequency.} 53 | 54 | If we can't increase the clock any further, the next option is to throw more hardware at the problem, by 55 | making the memory bus wider. To give a concrete example, a single GDDR5 memory module has an 56 | IO width of 32 bits~\citep{elpida-gddr5}, meaning that a GDDR5 chip transfers data 32 bits at a time.\footnote{Strictly 57 | speaking, GDDR5 memory also supports ``clamshell mode'' where two GDDR5 chips share the same address/command bus 58 | and only transfer 16 bits at a time; from the memory controller's point of view, normal and clamshell mode 59 | look basically the same, and I'll ignore the distinction here.} So if a GPU uses GDDR5 and has a 256-bit 60 | wide memory subsystem, that means the memory controller is connected to 8 memory chips. This in turn means that 61 | to max out memory bandwidth, there needs to be a transfer from or to each of those memory chips in every single clock 62 | cycle. 63 | 64 | Which leads to the third way to increase effective memory bandwidth: try not to waste any cycles! To 65 | maximize net bandwidth, GPU memory controllers buffer and reorder memory accesses very aggressively. 66 | Assuming that memory accesses are reasonably balanced across different memory chips, i.e.\@ no chip is used 67 | much more often than any other, we can get fairly close to actually getting one memory transfer per memory 68 | chip and clock if we're willing to queue memory accesses up for a while. Memory transfers tend to be generated 69 | in short, high-rate bursts; queuing allows us to smooth out these bumps and turn them into a fairly steady 70 | level of memory requests that then gets serviced by the memory chips. In general, the longer the queues are, 71 | the smoother things will go. The flipside is that longer queues automatically mean---and this is where the 72 | trade-off comes into play---longer latency. CPUs try to keep queue lengths short (at least while they can, 73 | i.e.\@ as long as the memory system isn't under high load), which gives them low latency but also costs them 74 | effective memory bandwidth. GPUs queue very aggressively, so they are really good at keeping their memory chips 75 | busy, but they pay for it with hundreds of cycles of extra latency. 76 | 77 | However, this only gets us half the way towards our goal of not wasting cycles. We're doing our best to make 78 | sure that the memory chips are busy every cycle; however, there's several ways that a DRAM chip can be busy, 79 | and most of them don't involve any transfers of payload data, so these cycles are essentially lost as far as 80 | bandwidth is concerned. All of these ``wasted'' cycles are caused by the internal organization of DRAM. 
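Before we get to that, let's quickly sanity-check the first two factors with a back-of-the-envelope calculation. The numbers below are just the example figures from above (a 256-bit GDDR5 interface at a 1.5GHz memory clock), not the specs of any particular product:
\[
256\,\mathrm{bits} \times (4 \times 1.5)\,\mathrm{GT/s} \,/\, (8\,\mathrm{bits/byte}) = 192\,\mathrm{GB/s},
\]
which lands right in the range of the GPU bandwidth figures quoted at the start of this section---provided, of course, that we actually manage a useful transfer on every single cycle.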
81 | 82 | \section{How DRAM works} 83 | 84 | \bibliography{memory}{} 85 | -------------------------------------------------------------------------------- /sw_stack.bib: -------------------------------------------------------------------------------- 1 | @manual{wddm, 2 | Author = {Microsoft}, 3 | Title = {Windows Vista Display Driver Model (WDDM) Reference}, 4 | Year = {2006}, 5 | URL = {http://msdn.microsoft.com/en-us/library/windows/hardware/ff570595(v=vs.85).aspx} 6 | } 7 | @manual{umd-ddi, 8 | Author = {Microsoft}, 9 | Title = {User-Mode Display Driver Functions}, 10 | Year = {2006}, 11 | URL = {http://msdn.microsoft.com/en-us/library/windows/hardware/ff570118(v=VS.85).aspx} 12 | } 13 | -------------------------------------------------------------------------------- /sw_stack.tex: -------------------------------------------------------------------------------- 1 | \chapter{The Software Stack} 2 | \label{ch:swstack} 3 | \bibliographystyle{plainnat} 4 | 5 | This section assumes a reasonably recent Windows version (Vista or later), 6 | which uses the WDDM~\citep{wddm} driver model. Older driver models (and other 7 | platforms) are somewhat different, but that's outside the scope of this 8 | text---I just picked WDDM because it's probably the most relevant model on 9 | PCs right now. The basic structure looks like this (see figure~\ref{fig:swstack:swstack}): 10 | 11 | \begin{figure}[h] 12 | \centering 13 | \begin{tikzpicture}[ 14 | every node/.style={ 15 | rectangle, 16 | minimum size=.25in, 17 | thick,draw=black 18 | }, 19 | edge/.style={stealth-stealth,thick}] 20 | \node (app) {Application}; 21 | \node (api) [below=of app] {D3D runtime} 22 | edge [edge] (app); 23 | \node (umd) [right=of api] {User-Mode Driver (UMD)} 24 | edge [edge] (api); 25 | \node (gkrnl) [below=of api] {Graphics kernel subsystem} 26 | edge [edge] (api); 27 | \node (kmd) [below=of gkrnl] {Kernel-Mode Driver (KMD)} 28 | edge [edge] (gkrnl); 29 | 30 | \end{tikzpicture} 31 | \caption{The software stack (D3D on WDDM).} 32 | \label{fig:swstack:swstack} 33 | \end{figure} 34 | 35 | \section{Application and API} 36 | 37 | It all starts with the application. On PC, all communication between an app and 38 | the GPU is mediated by the graphics API; apps may occasionally get direct 39 | access to memory that's GPU-addressable (such as Vertex Buffers or Textures), 40 | but on PC they can't directly generate native GPU commands\footnote{Not 41 | officially, anyway; since the UMD writes command buffers and runs in user mode, 42 | an app could conceivably figure out where the UMD stores its current write 43 | pointer and insert GPU commands manually, but that's not exactly supported or 44 | recommended behavior.}---all that has to go through the API and the driver. 45 | 46 | The API is the recipient of the app's resource creation, state-setting, and draw 47 | calls. The API runtime keeps track of the current state your app has set, 48 | validates parameters and does other error and consistency checking, manages 49 | user-visible resources, and may or may not validate shader code and shader 50 | linkage (it does in D3D, while in OpenGL this is handled at the driver level). 51 | It can also merge batches if possible and remove redundant state changes. It 52 | then packages it all up nicely and hands it over to the graphics driver---more 53 | precisely, the user-mode driver. 54 | 55 | \section{The User-Mode Driver (UMD)} 56 | 57 | This is where most of the ``magic'' on the CPU side happens. 
As the name 58 | suggests, it is user-mode code; it's running in the same context and address 59 | space as the app (and the API runtime) and has no elevated privileges 60 | whatsoever. It contains most of the complicated bits of a graphics driver, 61 | which also means that if an app crashes the driver, it will probably happen in 62 | there. That's a good thing: since it's all user-mode code running in the app's 63 | address space, it may take down the running program, but it's unlikely to 64 | affect other processes---contrast this with previous driver models, where a driver problem could 65 | (and frequently would) cause a blue screen and crash the whole system. 66 | The UMD is just a normal DLL. It's called ``nvd3dum.dll'' (Nvidia) or 67 | ``atiumd*.dll'' (AMD). It implements a lower-level API, the 68 | DDI~\citep{umd-ddi}, that is called by D3D; this API is fairly similar to the 69 | one you're seeing on the surface, but a bit 70 | more explicit about things like memory and state management. 71 | 72 | This module is where things like shader compilation happen. D3D passes a 73 | pre-validated shader token stream\footnote{For the curious, the format of 74 | D3D9-level Shader bytecode, i.e.\@ versions up to 3.0, is described in the 75 | regular WDDM docs that are a part of the WDK; documentation for later bytecode 76 | versions is only in the (non-public) WGF specs.} to the UMD---i.e.\@ it's 77 | already checked that the code is valid in the sense of being syntactically 78 | correct and obeying D3D constraints (using the right types, not using more 79 | textures/samplers than 80 | available, not exceeding the number of available constant buffers, stuff like 81 | that). This is compiled from HLSL code and usually has quite a number of 82 | high-level optimizations (various loop optimizations, dead-code elimination, 83 | constant propagation, predicating ifs etc.) applied to it---this is good news 84 | since it means the driver benefits from all these relatively costly 85 | optimizations that have been performed at compile time. However, it also has a 86 | bunch of lower-level optimizations (such as register allocation and loop 87 | unrolling) applied that drivers would rather do themselves; long story short, 88 | this usually just gets immediately turned into an intermediate representation 89 | (IR) and then compiled some more; shader hardware is close enough to D3D 90 | bytecode that compilation doesn't need to work wonders to give good results 91 | (and the HLSL compiler having done some of the high-yield and high-cost 92 | optimizations already definitely helps), but there's still lots of low-level 93 | details (such as HW resource limits and scheduling constraints) that D3D 94 | neither knows nor cares about, so this is not a trivial process. 95 | 96 | For games and other well-known apps, programmers at Nvidia/AMD tend to look at 97 | the shader code and write hand-optimized versions for whatever graphics 98 | architecture is current at the time, because driver compilers aren't perfect. 99 | This well-known app detection and shader substitution happens in the UMD too. 100 | Drivers also used to do somewhat sleazier things like using ``optimized'' 101 | shaders in benchmarks that used cheaper texture filtering or using simpler 102 | shaders for far-away objects, all to get better benchmark scores. Luckily we 103 | seem to have moved past that now (or at least the cheating has gotten less 104 | obvious). 
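To make the division of labor a bit more tangible, here's what the two compilation stages look like from the app side, sketched with the public D3D11/D3DCompiler API (the entry point name and shader profile are arbitrary sample choices). Everything past \code{CreatePixelShader} happens inside the UMD and has no public interface, so the comments there merely restate the steps described above:

\begin{verbatim}
// Sketch only: the app-visible half of shader compilation (D3D11, C++).
#include <d3d11.h>
#include <d3dcompiler.h>

ID3D11PixelShader *CreatePS(ID3D11Device *dev, const char *hlsl, size_t len)
{
    ID3DBlob *bytecode = NULL, *errors = NULL;
    ID3D11PixelShader *ps = NULL;

    // Stage 1 (offline, or at least app-side): HLSL -> D3D bytecode.
    // Loop transforms, dead-code elimination, constant propagation etc.
    // all happen in this step.
    HRESULT hr = D3DCompile(hlsl, len, NULL, NULL, NULL, "main", "ps_4_0",
                            0, 0, &bytecode, &errors);
    if (errors) errors->Release();   // ignore warning text in this sketch
    if (FAILED(hr)) return NULL;

    // Stage 2 (driver side): D3D validates the bytecode and hands it to
    // the UMD, which lowers it to its own IR and then to hardware ISA;
    // register allocation, scheduling and HW limits are dealt with there.
    // The driver may well defer this until the shader is first used.
    dev->CreatePixelShader(bytecode->GetBufferPointer(),
                           bytecode->GetBufferSize(), NULL, &ps);
    bytecode->Release();
    return ps;                       // NULL if creation failed
}
\end{verbatim}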
105 | 106 | To make matters more interesting, some of the API state may actually end up 107 | being compiled into the shader. To give an example, relatively exotic (or at 108 | least infrequently used) features such as texture borders are probably not 109 | implemented in full generality in the texture sampler, but emulated with extra 110 | code in the shader (or just not supported at all). This means that there's 111 | sometimes multiple versions of the same shader floating around, for different 112 | combinations of API states. And while D3D presents the fiction of completely 113 | independent Vertex, Pixel, Geometry etc. shaders, the driver may decide to 114 | compile e.g.\@ Vertex and Geometry shaders together to minimize the amount of 115 | glue code between them. 116 | 117 | Incidentally, this is also the reason why you'll often see a delay the first 118 | time you use a new shader or resource; a lot of the creation/compilation work 119 | is deferred by the driver and only executed when it's actually necessary (you 120 | wouldn't believe how much unused crap some apps create!). Graphics programmers 121 | know the other side of the story---if you want to make sure something is 122 | actually created (as opposed to just having memory reserved), you need to issue 123 | a dummy draw call that uses it to ``warm it up''. That's the price you have to 124 | pay for this kind of lazy evaluation. 125 | 126 | The UMD also gets to deal with fun things like all the D3D9 ``legacy'' shader 127 | versions and the fixed function pipeline---yes, all of that 128 | will get faithfully passed through by D3D. The 3.0 shader profile is fairly 129 | reasonable, but 2.0 is crufty and the various 1.x 130 | shader versions are seriously weird---remember 1.3 pixel shaders? Or, for that 131 | matter, the fixed-function vertex pipeline with vertex lighting and such? 132 | Support for all that's still there in D3D and the guts of every modern graphics 133 | driver, though of course everything just gets translated into regular shader 134 | code at this point. 135 | 136 | Then there's things like memory management. The UMD will get things like 137 | texture creation commands and need to provide space for them. Now, the UMD does 138 | not actually own or directly manage video memory; the OS manages video memory 139 | and the KMD (Kernel-Mode Driver) is responsible for whatever magic is necessary 140 | to actually allocate memory in the Aperture (a system-memory region accessible 141 | to the graphics card) or in Video Memory. But it is the UMD that needs to 142 | initiate such allocation requests, and since they're not cheap, it might also 143 | decide to bundle small resources into larger memory blocks and suballocate from 144 | them. 145 | 146 | The UMD can, however, write command buffers.\footnote{In the original (blog) 147 | version of this text, I used the terms ``command buffer'' and ``DMA buffer'' 148 | interchangeably; this version goes into more detail on WDDM, where the two names 149 | refer to separate concepts, so I'll have to be more precise with the 150 | terminology; apologies for any confusion this might cause.} A command buffer 151 | contains rendering commands in whatever format the hardware understands. The 152 | UMD will convert all the state-changing and rendering commands received from 153 | the API into this form, and also insert some new ones that are requisite glue 154 | (for example, shaders might need to be uploaded into a special memory area 155 | before they can be used for the first time). 
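What does such a command buffer actually look like? The real formats are proprietary and differ between vendors and even between GPU generations, but conceptually it's little more than a tightly packed stream of variable-length packets. A purely hypothetical encoding---none of the names or layouts below correspond to any real hardware---might look like this:

\begin{verbatim}
// Hypothetical command encoding; real formats are vendor-specific.
#include <cstdint>

enum CmdOpcode {
    CMD_SET_RENDER_STATE = 1,  // payload: state id, new value
    CMD_SET_SHADER       = 2,  // payload: GPU address of shader code
                               //          (not known yet -- patched later)
    CMD_DRAW_INDEXED     = 3   // payload: index count, start index
};

// Append one command (8-bit opcode, payload size in 32-bit words) to the
// command buffer the UMD is currently writing; returns new write position.
static uint32_t *EmitCmd(uint32_t *write_ptr, CmdOpcode op,
                         const uint32_t *payload, uint32_t count)
{
    *write_ptr++ = (static_cast<uint32_t>(op) & 0xff) | (count << 8); // header
    for (uint32_t i = 0; i < count; i++)
        *write_ptr++ = payload[i];                                    // payload
    return write_ptr;
}
\end{verbatim}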
156 | 157 | There's a problem here, though; as I mentioned, the UMD does not actually 158 | ``own'' the graphics card, nor does it get to manage video memory. So while the 159 | UMD is putting together the command buffer, it does not actually know where 160 | textures, vertex buffers etc.\@ will be in memory by the time the graphics card 161 | gets to see the commands. So it can't put the right offsets in just yet; 162 | instead, each command buffer comes with an associated ``allocation list'' that 163 | lists all blocks of memory that need to be in memory for the command buffer to 164 | run and a ``patch list'' that gives the locations in the command buffer where 165 | their eventual addresses have to be written to. These patches are applied later 166 | by the KMD. 167 | 168 | In general, drivers will try to put as much of the actual processing into the 169 | UMD as possible; the UMD is user-mode code, so anything that runs in it doesn't 170 | need any costly kernel-mode transitions, it can freely allocate system memory, 171 | farm work out to multiple threads, and so on: it's just a regular DLL (even 172 | though it's loaded by the API, not directly by the app). This has advantages 173 | for 174 | driver development too: if the UMD crashes, the app might crash with it, but 175 | not the whole system; it can just be replaced while the system is running (it's 176 | just a DLL!); it can be debugged with a regular debugger; and so on. So it's 177 | not only efficient, it's also convenient. 178 | 179 | However, there's a problem: there's one UMD instance per app that uses the 3D 180 | pipeline, but all of them are trying to access a single shared resource, the 181 | GPU (even in a dual-GPU configuration, only one of them is actually the main 182 | display). In a multi-tasking OS, there's two possible solutions to this 183 | problem. Either make sure to only grant exclusive access to the GPU to one app 184 | at a time (this was how D3D fullscreen mode originally worked), or introduce 185 | some layer that arbitrates access to the GPU and hands all processes their 186 | time-slices. And since by now even regular window management uses 3D 187 | functionality for rendering, the exclusive model isn't workable anymore. So we 188 | need something that acts as the gatekeeper for access to the GPU. In the case 189 | of WDDM, that something is the DirectX graphics kernel subsystem (aka 190 | \code{Dxgkrnl.sys}), an OS component which contains---among other things---the 191 | GPU scheduler and the video memory manager. In WDDM, the UMD doesn't talk to the 192 | kernel layer directly, but has to go through the D3D runtime again; however, 193 | the runtime basically just forwards the calls to the right kernel-mode entry 194 | points. 195 | 196 | \section{GPU scheduler and Video Memory Manager} 197 | 198 | The GPU scheduler is exactly that: it arbitrates access to the 3D pipeline 199 | using time-slicing, the same way that threads are multiplexed onto a (usually 200 | much smaller) number of logical processors. A context switch incurs, at the 201 | very least, some state switching on the GPU (which requires extra commands). 202 | And of course, different apps likely use very different sets of resources, so 203 | things might need to get swapped into (or out of) video memory. At any given 204 | point in time, only one process actually has its command buffers forwarded to 205 | the KMD; the remaining command buffers just get queued up. 
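Since allocation and patch lists will keep coming up, here's a heavily simplified sketch of what a single command-buffer submission from the UMD carries. The actual WDDM structures are \code{D3DDDI\_ALLOCATIONLIST} and \code{D3DDDI\_PATCHLOCATIONLIST}; the field names and layout below are illustrative only, not the real definitions:

\begin{verbatim}
// Illustrative only -- not the actual WDDM structure definitions.
#include <cstdint>
#include <cstddef>

struct AllocationRef {
    uint64_t allocation;       // handle of a resource the command buffer uses
    bool     written_by_gpu;   // read-only input vs. render target etc.
};

struct PatchLocation {
    uint32_t allocation_index; // index into the allocation list
    uint32_t patch_offset;     // where in the buffer the final GPU address
                               // of that allocation must be written
    uint32_t split_offset;     // point where the buffer may be split
                               // (more on splitting below)
};

struct Submission {
    const void          *command_buffer;  // commands, addresses still missing
    size_t               command_bytes;
    const AllocationRef *allocations;     // everything that must be resident
    size_t               allocation_count;
    const PatchLocation *patches;         // applied later by the KMD
    size_t               patch_count;
};
\end{verbatim}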
206 | 207 | Except it's not actually command buffers (in WDDM lingo) that get sent to the 208 | GPU at all; WDDM calls the buffers that actually get sent to the hardware ``DMA 209 | buffers''. The UMD calls the ``Render'' entry point with a command buffer, 210 | allocation list and patch list; all this eventually gets forwarded to the KMD, 211 | which has to validate that command buffer and turn it into 212 | a (hardware-executable) DMA buffer with a new allocation list and patch list. 213 | In theory, because there's this translation step, command buffers could be 214 | completely different from hardware DMA buffers; but in practice it's far more 215 | efficient (and also simpler) to keep the two very similar, if not identical. 216 | 217 | But before the DMA buffer can actually be submitted, the scheduler needs to 218 | ensure that all required resources are available. If everything mentioned in 219 | the allocation list is already in video memory, the DMA buffer is good to go. 220 | If not, however, the allocation list is sent to the video memory manager first. 221 | 222 | The video memory manager keeps track of the current contents of video memory. 223 | When presented with an allocation list, it has to figure out how to get all 224 | resources into GPU-accessible memory (either video memory or the Aperture). 225 | This will involve some uploads from system to video memory, but some resources 226 | that aren't used for a given DMA buffer might also get moved from video memory 227 | back to system memory. Or resources might be shuffled around in video memory 228 | (defragmented) to make sure there's a large enough hole for a large resource. 229 | Finally, the video memory manager might also find out that there's just no way 230 | to fit all allocations required by a given DMA buffer into memory at once, and 231 | request that it be split into several smaller pieces.\footnote{To facilitate 232 | DMA buffer splitting, patch lists contain not just patch locations, but also 233 | split locations so the video memory manager can determine split points without 234 | being able to read the (hardware-dependent) DMA buffer format.} If all went 235 | well, the video memory manager ends up with a list of resource movement 236 | commands: copy texture A from system memory to video memory, move vertex buffer 237 | B in video memory to compact free space, and so forth. It then calls a special 238 | KMD entry point to turn these commands into hardware-consumable DMA buffers 239 | that perform memory transfer operations (so-called ``paging buffers''). 240 | A rendering DMA buffer might take no paging buffers at all (if everything is 241 | already in GPU-accessible memory), or it might need several. 242 | 243 | Either way, after the video memory manager is done, the final locations for all 244 | resources are known, so the scheduler calls the KMD to patch the right offsets 245 | into the DMA buffer. And then, finally, the DMA buffers are actually ready to 246 | be submitted to the GPU: first the paging buffers, then the rendering DMA 247 | buffer. If the buffer was split, this page-then-render sequence is repeated 248 | several times. The KMD will then actually submit the DMA buffer to the GPU. 249 | 250 | In the other direction, once the GPU is finished with a DMA buffer, it's 251 | supposed to report back to the driver, which in turn pings the graphics kernel 252 | subsystem. 
The video memory manager needs this information to know when it's 253 | safe to actually free/recycle memory (dynamic vertex buffers, for example); 254 | allocations can't be freed or reused while the GPU might still reference them! 255 | 256 | \section{The Kernel-Mode Driver (KMD)} 257 | 258 | This is the component that actually deals with the hardware (WDDM-speak for 259 | this is ``Display miniport driver'', by the way, but I prefer KMD). There may 260 | be multiple 261 | UMD instances running at any one time, but there's only ever one KMD, and if 262 | that crashes, the graphics subsystem is in trouble; it used to be ``blue 263 | screen''-magnitude trouble, but starting with WDDM, Windows actually knows how 264 | to kill a crashed driver and reload it---progress! But since the video memory 265 | manager tries to keep resources in video memory (which might've been partially 266 | overwritten by the GPU reset that typically happens during KMD initialization), 267 | apps still need to re-create everything that wasn't cached in system memory. 268 | And of course, if the KMD didn't just crash, but went on scribbling all over 269 | kernel memory, all bets are off. 270 | 271 | From the discussion above, we know a lot of things that the KMD does by now: it 272 | handles resource allocation requests from the UMD (but not their physical 273 | placement, which is done by the Video Memory Manager). It can render command 274 | buffers into DMA buffers, build paging buffers, patch DMA buffers, and submit 275 | them to the GPU. It hands a list of different memory types (and their physical 276 | memory locations) to the video memory manager, and tells it what things can (or 277 | should) go where. It also keeps track of different contexts, which contain 278 | state information; this is necessary so the correct state can be restored on 279 | a context switch. 280 | 281 | There's also a lot of things we haven't seen yet. There's functions related to 282 | video mode switching, page flipping, the hardware mouse cursor, power-saving 283 | functionality, the hardware watchdog timer (which resets the GPU if it doesn't 284 | respond within a set time interval), hardware interrupt handling, and so 285 | forth. There's some functions that deal with swizzling ranges; we'll see 286 | texture swizzling later in section~\ref{sec:tex:swizzle}. And of course there's 287 | video overlays, which are special because they go through the DRM-infested 288 | ``protected media path'', to ensure that, say, a Blu-Ray player can push its 289 | precious decoded video pixels to the GPU without other user-mode processes 290 | getting their hands on the decompressed, decrypted data. 291 | 292 | But back to our DMA buffers: what does the KMD do with them once they've been 293 | submitted? The answer is one we'll be seeing a lot throughout this text: put them 294 | in another buffer. Or, more precisely, put their addresses into a buffer. 295 | Another DMA buffer, to be precise, though this one is usually quite small and 296 | organized as a ring buffer. Instead of rendering commands, the ring buffer just contains 297 | calls into the actual DMA buffers we just prepared. It's still just a buffer in 298 | (GPU-accessible) memory, though, and for anything to happen, the GPU must be 299 | told about it. 
There's usually two (GPU) hardware registers in play here: one 300 | is a read pointer, which is the current read position on the GPU side (this is 301 | how far it has consumed commands so far), and the other one is the write 302 | pointer, which is the location up to which the KMD has written commands to the 303 | ring buffer. Whenever the GPU is done with a DMA buffer, it'll return to look 304 | at the main ring buffer. If its current read pointer is different from the 305 | write pointer, there's more commands to process; if not, there's been no new 306 | DMA buffers submitted in the meantime, so the GPU will be idle for a while 307 | (until the write pointer changes). Communication with the GPU will be explained 308 | in more detail in chapter~\ref{ch:memory}. 309 | 310 | \section{Aside: OpenGL and other platforms} 311 | 312 | OpenGL is fairly similar to what I just described, except there's not as sharp 313 | a distinction between the API and UMD layer. And unlike D3D, the (GLSL) shader 314 | compilation is not handled by the API at all, it's all done by the driver. An 315 | unfortunate side effect is that there are as many GLSL frontends as there are 316 | 3D hardware vendors, all of them basically implementing the same spec, but with 317 | their own bugs and idiosyncrasies---everyone who's ever shipped a product using 318 | GL knows how much pain that causes to app authors. And it also means that the 319 | drivers 320 | have to do all the optimizations themselves whenever they get to see the 321 | shaders---including expensive optimizations. The D3D bytecode format is really 322 | a cleaner solution for this problem: there's only one compiler (so no slightly 323 | incompatible dialects between different vendors!) and it allows for some 324 | costlier data-flow analysis than you would normally do. 325 | 326 | Open Source implementations of GL tend to use either Mesa or Gallium3D, both of 327 | which have a single shared GLSL frontend that generates a device-independent IR 328 | and supports multiple pluggable backends for actual hardware. In other words, 329 | that space is fairly similar to the D3D model of sharing the front-end for 330 | different implementations and only specializing the low-level codegen that's 331 | actually different for every piece of hardware. 332 | 333 | \section{Further Reading} 334 | 335 | This section is just a coarse overview of the D3D10+/WDDM graphics stack; other 336 | implementations work somewhat differently, although the basic entities (driver, 337 | API, video memory manager, scheduler, command/DMA buffer dispatch) exist 338 | everywhere in some form or another. More details can be found, for example, in 339 | the official WDDM documentation~\citep{wddm}. 340 | 341 | \bibliography{sw_stack}{} 342 | --------------------------------------------------------------------------------