├── README.md ├── intro.tex ├── main.tex ├── memory.bib ├── memory.tex ├── sw_stack.bib └── sw_stack.tex /README.md: -------------------------------------------------------------------------------- 1 | # A Trip Through The Graphics Pipeline 2012 2 | 3 | This repository contains the current version of my text for "A Trip Through The Graphics Pipeline 2012", 4 | a cleaned-up and slightly updated version of my series of blog posts from 2011, in source form (.tex 5 | files). I intend to release the completed text under a Creative Commons BY-NC-SA license. 6 | 7 | -Fabian "ryg" Giesen 8 | -------------------------------------------------------------------------------- /intro.tex: -------------------------------------------------------------------------------- 1 | \chapter{Introduction} 2 | \label{ch:intro} 3 | In July 2011, I spontaneously started writing ``A trip through the Graphics Pipeline 2011'', a series of 4 | blog posts that would eventually number 13~parts and 35000~words. The goal was to explain the nuts and bolts 5 | of the graphics pipeline as implemented in modern high-end (desktop) GPUs. The only planning that went into it 6 | was a two-page outline that contained nothing more than a rough breakdown into separate posts, with 7 | maybe four or five keywords per part. I wrote the parts in sequence and posted them as soon as they were 8 | ready---initially at a rate of one per day, but slowing down quite a bit towards the end. The big advantage 9 | was that I got immediate feedback, but it also meant that the posts were all essentially in draft state and 10 | never got properly revised. There were also some holes and structural problems with the original outline that 11 | I was stuck with. This text is the attempt to rectify that situation by revising and expanding upon the original 12 | material and collecting it all in one place. 13 | 14 | The intended audience is programmers with an interest in the low-level workings of modern GPUs. Familiarity with 15 | the basic ideas of modern CPU architectures (caches, pipelining, SIMD code, multi-core architectures) is assumed, 16 | and a basic understanding of how hardware in general works doesn't hurt. 17 | 18 | This text is broken down into chapters, roughly corresponding to the different topics covered in the original 19 | series of blog posts. The chapters are fairly self-contained and initially proceed in ``dataflow order''---from the 20 | application generating draw calls all the way down to pixels being written to the frame buffer, explaining the most 21 | commonly used path (Vertex and Pixel Shaders, but none of the other shader types) first. Later chapters cover Geometry 22 | Shaders and the Tessellation pipeline, both of which extend the basic graphics pipeline with more front-end 23 | (pre-rasterization) stages. 
24 | -------------------------------------------------------------------------------- /main.tex: -------------------------------------------------------------------------------- 1 | \documentclass[DIV10]{scrreprt} 2 | \usepackage{fourier} 3 | \usepackage{url} 4 | \usepackage[sectionbib]{chapterbib} 5 | \usepackage[sectionbib,square]{natbib} 6 | \usepackage{tikz} 7 | \usetikzlibrary{arrows,positioning} 8 | \definecolor{NavyBlue}{cmyk}{0.94,0.54,0,0} 9 | \usepackage[raiselinks=true,% 10 | colorlinks=true,% 11 | linkcolor=NavyBlue,% 12 | menucolor=NavyBlue,% 13 | citecolor=NavyBlue,% 14 | urlcolor=NavyBlue,% 15 | bookmarks=true,% 16 | bookmarksopenlevel=1,% 17 | bookmarksopen=true,% 18 | bookmarksnumbered=true,% 19 | hyperindex=true,% 20 | plainpages=false,% 21 | pdfpagelabels=true,% 22 | ]{hyperref} 23 | 24 | \newcommand{\code}[1]{\texttt{#1}} 25 | 26 | \setkomafont{title}{\bfseries} 27 | \setkomafont{disposition}{\bfseries} 28 | 29 | \begin{document} 30 | \bibliographystyle{plainnat} 31 | \setcitestyle{numbers} 32 | 33 | \title{A Trip Through The Graphics Pipeline 2012} 34 | \author{Fabian Giesen} 35 | \maketitle 36 | \tableofcontents 37 | 38 | % Include the chapters 39 | \include{intro} 40 | \include{sw_stack} 41 | \include{memory} 42 | 43 | \end{document} 44 | -------------------------------------------------------------------------------- /memory.bib: -------------------------------------------------------------------------------- 1 | @manual{membw-i7-3960x, 2 | Author = {Intel}, 3 | Title = {Intel Core i7-3960X Processor Extreme Edition}, 4 | Year = {2012}, 5 | URL = {http://ark.intel.com/products/63696/Intel-Core-i7-3960X-Processor-Extreme-Edition-(15M-Cache-up-to-3_90-GHz)} 6 | } 7 | @manual{membw-i7-2700k, 8 | Author = {Intel}, 9 | Title = {Intel Core i7-2700K Processor}, 10 | Year = {2012}, 11 | URL = {http://ark.intel.com/products/61275/Intel-Core-i7-2700K-Processor-(8M-Cache-up-to-3_90-GHz)} 12 | } 13 | @manual{membw-radeonhd-7970, 14 | Author = {Wikipedia}, 15 | Title = {Southern Islands (GPU family): Chipset table, Radeon HD 7970}, 16 | Year = {2012}, 17 | URL = {http://en.wikipedia.org/wiki/Southern_Islands_(GPU_family)#Chipset_table} 18 | } 19 | @manual{membw-geforce-gtx680, 20 | Author = {Wikipedia}, 21 | Title = {Comparison of Nvidia graphics processing units: GeForce 600 Series, GeForce GTX 680}, 22 | Year = {2012}, 23 | URL = {http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units#GeForce_600_Series} 24 | } 25 | @manual{memlat-i7, 26 | Author = {The Tech Report}, 27 | Title = {Memory subsystem performance}, 28 | Year = {2011}, 29 | URL = {http://techreport.com/articles.x/20188/5} 30 | } 31 | @manual{memlat-nvidia, 32 | Author = {Steve Rennich and NVIDIA Corporation}, 33 | Title = {Fundamental Optimizations: Global Memory}, 34 | Year = {2011}, 35 | URL = {http://www.stanford.edu/dept/ICME/docs/seminars/Rennich-2011-04-25.pdf} 36 | } 37 | @manual{elpida-gddr5, 38 | Author = {Elpida}, 39 | Title = {Introduction to GDDR5 SGRAM}, 40 | Year = {2010}, 41 | URL = {http://www.elpida.com/pdfs/E1600E10.pdf} 42 | } 43 | -------------------------------------------------------------------------------- /memory.tex: -------------------------------------------------------------------------------- 1 | \chapter{GPU Memory Subsystem and Host Interface} 2 | \label{ch:memory} 3 | \bibliographystyle{plainnat} 4 | 5 | Before a GPU can do anything with the DMA buffers written by the (kernel mode) driver, 6 | it needs to be able to read them first, which entails communication 
between CPU and GPU 7 | and interfacing with GPU memory. On the CPU side, processors have long since stopped 8 | passing every memory access to RAM chips, and instead employ a sophisticated memory 9 | hierarchy with multiple levels of caches.\footnote{Caches of various kinds are by far 10 | the biggest part of contemporary CPUs in terms of area.} GPU memory systems have very 11 | different goals and operate under different design constraints; while they still use 12 | caches, they generally tend to be smaller and favor different usage patterns. This unique 13 | type of memory architecture greatly influences the way shader cores handle loads and stores, 14 | and resonates through the whole design of modern GPUs. 15 | 16 | This chapter will also briefly cover the host interface, the part of the GPU that talks 17 | to the outside world (usually the CPU). 18 | 19 | \section{The memory subsystem} 20 | 21 | As a regular programmer, the most important thing to understand about the memory 22 | subsystem of GPUs is that, compared to a CPU, it is both incredibly fast and incredibly 23 | slow at the same time. Fast, in that GPU memory systems deliver incredible amounts of 24 | \emph{bandwidth}: At the time of writing, the fastest desktop CPU you can buy from Intel, 25 | the Core i7-3960X, has a theoretical peak memory bandwidth of 51.2~GB/s~\citep{membw-i7-3960x}; 26 | a more pedestrian (but still very much high-end) Core i7-2700K maxes out at 21~GB/s~\citep{membw-i7-2700k}. 27 | For comparison, recent high-end discrete GPUs from AMD and Nvidia boast memory bandwidths of 28 | 264~GB/s~\citep{membw-radeonhd-7970} and 192~GB/s~\citep{membw-geforce-gtx680}, respectively---a 29 | tremendous difference from the values we see for CPUs. 30 | 31 | At the same time, GPU memory subsystems are very slow, in the sense that they have high 32 | memory access latency. Again some numbers: memory access latency on Core~i7 level CPUs comes 33 | out at about 45ns~\citep{memlat-i7}, while Nvidia quotes about 400--800 cycles of memory access 34 | latency for Fermi generation chips~\citep{memlat-nvidia}. At the 1544MHz high-end Fermi parts 35 | typically run at, this means about 260--520ns---again a difference of more than 5$\times$ between high-end 36 | CPUs and GPUs, but this time in the opposite direction. 37 | 38 | This is not a coincidence: GPUs want high memory bandwidth and are willing to trade a significant 39 | increase in latency (and also power draw, which I will ignore in this text) in return. This is part 40 | of a general pattern: GPUs greatly favor throughput over latency, so when there is an opportunity to 41 | trade one for the other, GPUs will go for throughput every time. Of course, this is easier said than 42 | done: optimizing for throughput works differently than optimizing for latency does, and software 43 | developers tend to be far more familiar with the latter. We will see this subject come up again and again 44 | throughout this text. 45 | 46 | But first things first: how do we get high memory bandwidth in the first place? There's several ways. 47 | The first and conceptually easiest is to just increase memory clock rate. All other things being equal, 48 | if we increase the number of memory transfers per second, we will manage to access more memory in a second. 
49 | However, power draw goes up dramatically with increased clock frequencies, which limits practical clock rates; 50 | as of this writing, mainstream GPUs ship with memory clock rates of 1.5GHz or less.\footnote{GPU vendors 51 | sometimes like to quote memory \emph{data rate} instead of \emph{clock frequency}; for GDDR5, data rate is $4 \times$ 52 | the clock frequency, while for GDDR3 and DDR2/3 memory it is $2 \times$ the clock frequency.} 53 | 54 | If we can't increase the clock any further, the next option is to throw more hardware at the problem, by 55 | making the memory bus wider. To give a concrete example, a single GDDR5 memory module has an 56 | IO width of 32 bits~\citep{elpida-gddr5}, meaning that a GDDR5 chip transfers data 32 bits at a time.\footnote{Strictly 57 | speaking, GDDR5 memory also supports ``clamshell mode'' where two GDDR5 chips share the same address/command bus 58 | and only transfer 16 bits at a time; from the memory controller's point of view, normal and clamshell mode 59 | look basically the same, and I'll ignore the distinction here.} So if a GPU uses GDDR5 and has a 256-bit 60 | wide memory subsystem, that means the memory controller is connected to 8 memory chips. This in turn means that 61 | to max out memory bandwidth, there needs to be a transfer from or to each of those memory chips in every single clock 62 | cycle. 63 | 64 | Which leads to the third way to increase effective memory bandwidth: try not to waste any cycles! To 65 | maximize net bandwidth, GPU memory controllers buffer and reorder memory accesses very aggressively. 66 | Assuming that memory accesses are reasonably balanced across different memory chips, i.e.\@ no chip is used 67 | much more often than any other, we can get fairly close to actually getting one memory transfer per memory 68 | chip and clock if we're willing to queue memory accesses up for a while. Memory transfers tend to be generated 69 | in short, high-rate bursts; queuing allows us to smooth out these bumps and turn them into a fairly steady 70 | level of memory requests that then gets serviced by the memory chips. In general, the longer the queues are, 71 | the smoother things will go. The flipside is that longer queues automatically mean---and this is where the 72 | trade-off comes into play---longer latency. CPUs try to keep queue lengths short (at least while they can, 73 | i.e.\@ as long as the memory system isn't under high load), which gives them low latency but also costs them 74 | effective memory bandwidth. GPUs queue very aggressively, so they are really good at keeping their memory chips 75 | busy, but they pay for it with hundreds of cycles of extra latency. 76 | 77 | However, this only gets us half the way towards our goal of not wasting cycles. We're doing our best to make 78 | sure that the memory chips are busy every cycle; however, there's several ways that a DRAM chip can be busy, 79 | and most of them don't involve any transfers of payload data, so these cycles are essentially lost as far as 80 | bandwidth is concerned. All of these ``wasted'' cycles are caused by the internal organization of DRAM. 
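Before we get to that, let's quickly sanity-check the first two factors with a back-of-the-envelope calculation. The numbers below are just the example figures from above (a 256-bit GDDR5 interface at a 1.5GHz memory clock), not the specs of any particular product:
\[
256\,\mathrm{bits} \times (4 \times 1.5)\,\mathrm{GT/s} \,/\, (8\,\mathrm{bits/byte}) = 192\,\mathrm{GB/s},
\]
which lands right in the range of the GPU bandwidth figures quoted at the start of this section---provided, of course, that we actually manage a useful transfer on every single cycle.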
81 | 82 | \section{How DRAM works} 83 | 84 | \bibliography{memory}{} 85 | -------------------------------------------------------------------------------- /sw_stack.bib: -------------------------------------------------------------------------------- 1 | @manual{wddm, 2 | Author = {Microsoft}, 3 | Title = {Windows Vista Display Driver Model (WDDM) Reference}, 4 | Year = {2006}, 5 | URL = {http://msdn.microsoft.com/en-us/library/windows/hardware/ff570595(v=vs.85).aspx} 6 | } 7 | @manual{umd-ddi, 8 | Author = {Microsoft}, 9 | Title = {User-Mode Display Driver Functions}, 10 | Year = {2006}, 11 | URL = {http://msdn.microsoft.com/en-us/library/windows/hardware/ff570118(v=VS.85).aspx} 12 | } 13 | -------------------------------------------------------------------------------- /sw_stack.tex: -------------------------------------------------------------------------------- 1 | \chapter{The Software Stack} 2 | \label{ch:swstack} 3 | \bibliographystyle{plainnat} 4 | 5 | This section assumes a reasonably recent Windows version (Vista or later), 6 | which uses the WDDM~\citep{wddm} driver model. Older driver models (and other 7 | platforms) are somewhat different, but that's outside the scope of this 8 | text---I just picked WDDM because it's probably the most relevant model on 9 | PCs right now. The basic structure looks like this (see figure~\ref{fig:swstack:swstack}): 10 | 11 | \begin{figure}[h] 12 | \centering 13 | \begin{tikzpicture}[ 14 | every node/.style={ 15 | rectangle, 16 | minimum size=.25in, 17 | thick,draw=black 18 | }, 19 | edge/.style={stealth-stealth,thick}] 20 | \node (app) {Application}; 21 | \node (api) [below=of app] {D3D runtime} 22 | edge [edge] (app); 23 | \node (umd) [right=of api] {User-Mode Driver (UMD)} 24 | edge [edge] (api); 25 | \node (gkrnl) [below=of api] {Graphics kernel subsystem} 26 | edge [edge] (api); 27 | \node (kmd) [below=of gkrnl] {Kernel-Mode Driver (KMD)} 28 | edge [edge] (gkrnl); 29 | 30 | \end{tikzpicture} 31 | \caption{The software stack (D3D on WDDM).} 32 | \label{fig:swstack:swstack} 33 | \end{figure} 34 | 35 | \section{Application and API} 36 | 37 | It all starts with the application. On PC, all communication between an app and 38 | the GPU is mediated by the graphics API; apps may occasionally get direct 39 | access to memory that's GPU-addressable (such as Vertex Buffers or Textures), 40 | but on PC they can't directly generate native GPU commands\footnote{Not 41 | officially, anyway; since the UMD writes command buffers and runs in user mode, 42 | an app could conceivably figure out where the UMD stores its current write 43 | pointer and insert GPU commands manually, but that's not exactly supported or 44 | recommended behavior.}---all that has to go through the API and the driver. 45 | 46 | The API is the recipient of the app's resource creation, state-setting, and draw 47 | calls. The API runtime keeps track of the current state your app has set, 48 | validates parameters and does other error and consistency checking, manages 49 | user-visible resources, and may or may not validate shader code and shader 50 | linkage (it does in D3D, while in OpenGL this is handled at the driver level). 51 | It can also merge batches if possible and remove redundant state changes. It 52 | then packages it all up nicely and hands it over to the graphics driver---more 53 | precisely, the user-mode driver. 54 | 55 | \section{The User-Mode Driver (UMD)} 56 | 57 | This is where most of the ``magic'' on the CPU side happens. 
As the name 58 | suggests, it is user-mode code; it's running in the same context and address 59 | space as the app (and the API runtime) and has no elevated privileges 60 | whatsoever. It contains most of the complicated bits of a graphics driver, 61 | which also means that if an app crashes the driver, it will probably happen in 62 | there. That's a good thing: since it's all user-mode code running in the app's 63 | address space, it may take down the running program, but it's unlikely to 64 | affect other processes---contrast this with previous driver models, where a driver problem could 65 | (and frequently would) cause a blue screen and crash the whole system. 66 | The UMD is just a normal DLL. It's called ``nvd3dum.dll'' (Nvidia) or 67 | ``atiumd*.dll'' (AMD). It implements a lower-level API, the 68 | DDI~\citep{umd-ddi}, that is called by D3D; this API is fairly similar to the 69 | one you're seeing on the surface, but a bit 70 | more explicit about things like memory and state management. 71 | 72 | This module is where things like shader compilation happen. D3D passes a 73 | pre-validated shader token stream\footnote{For the curious, the format of 74 | D3D9-level Shader bytecode, i.e.\@ versions up to 3.0, is described in the 75 | regular WDDM docs that are a part of the WDK; documentation for later bytecode 76 | versions is only in the (non-public) WGF specs.} to the UMD---i.e.\@ it's 77 | already checked that the code is valid in the sense of being syntactically 78 | correct and obeying D3D constraints (using the right types, not using more 79 | textures/samplers than 80 | available, not exceeding the number of available constant buffers, stuff like 81 | that). This is compiled from HLSL code and usually has quite a number of 82 | high-level optimizations (various loop optimizations, dead-code elimination, 83 | constant propagation, predicating ifs etc.) applied to it---this is good news 84 | since it means the driver benefits from all these relatively costly 85 | optimizations that have been performed at compile time. However, it also has a 86 | bunch of lower-level optimizations (such as register allocation and loop 87 | unrolling) applied that drivers would rather do themselves; long story short, 88 | this usually just gets immediately turned into an intermediate representation 89 | (IR) and then compiled some more; shader hardware is close enough to D3D 90 | bytecode that compilation doesn't need to work wonders to give good results 91 | (and the HLSL compiler having done some of the high-yield and high-cost 92 | optimizations already definitely helps), but there's still lots of low-level 93 | details (such as HW resource limits and scheduling constraints) that D3D 94 | neither knows nor cares about, so this is not a trivial process. 95 | 96 | For games and other well-known apps, programmers at Nvidia/AMD tend to look at 97 | the shader code and write hand-optimized versions for whatever graphics 98 | architecture is current at the time, because driver compilers aren't perfect. 99 | This well-known app detection and shader substitution happens in the UMD too. 100 | Drivers also used to do somewhat sleazier things like using ``optimized'' 101 | shaders in benchmarks that used cheaper texture filtering or using simpler 102 | shaders for far-away objects, all to get better benchmark scores. Luckily we 103 | seem to have moved past that now (or at least the cheating has gotten less 104 | obvious). 
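To make the division of labor a bit more tangible, here's what the two compilation stages look like from the app side, sketched with the public D3D11/D3DCompiler API (the entry point name and shader profile are arbitrary sample choices). Everything past \code{CreatePixelShader} happens inside the UMD and has no public interface, so the comments there merely restate the steps described above:

\begin{verbatim}
// Sketch only: the app-visible half of shader compilation (D3D11, C++).
#include <d3d11.h>
#include <d3dcompiler.h>

ID3D11PixelShader *CreatePS(ID3D11Device *dev, const char *hlsl, size_t len)
{
    ID3DBlob *bytecode = NULL, *errors = NULL;
    ID3D11PixelShader *ps = NULL;

    // Stage 1 (offline, or at least app-side): HLSL -> D3D bytecode.
    // Loop transforms, dead-code elimination, constant propagation etc.
    // all happen in this step.
    HRESULT hr = D3DCompile(hlsl, len, NULL, NULL, NULL, "main", "ps_4_0",
                            0, 0, &bytecode, &errors);
    if (errors) errors->Release();   // ignore warning text in this sketch
    if (FAILED(hr)) return NULL;

    // Stage 2 (driver side): D3D validates the bytecode and hands it to
    // the UMD, which lowers it to its own IR and then to hardware ISA;
    // register allocation, scheduling and HW limits are dealt with there.
    // The driver may well defer this until the shader is first used.
    dev->CreatePixelShader(bytecode->GetBufferPointer(),
                           bytecode->GetBufferSize(), NULL, &ps);
    bytecode->Release();
    return ps;                       // NULL if creation failed
}
\end{verbatim}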
105 | 106 | To make matters more interesting, some of the API state may actually end up 107 | being compiled into the shader. To give an example, relatively exotic (or at 108 | least infrequently used) features such as texture borders are probably not 109 | implemented in full generality in the texture sampler, but emulated with extra 110 | code in the shader (or just not supported at all). This means that there's 111 | sometimes multiple versions of the same shader floating around, for different 112 | combinations of API states. And while D3D presents the fiction of completely 113 | independent Vertex, Pixel, Geometry etc. shaders, the driver may decide to 114 | compile e.g.\@ Vertex and Geometry shaders together to minimize the amount of 115 | glue code between them. 116 | 117 | Incidentally, this is also the reason why you'll often see a delay the first 118 | time you use a new shader or resource; a lot of the creation/compilation work 119 | is deferred by the driver and only executed when it's actually necessary (you 120 | wouldn't believe how much unused crap some apps create!). Graphics programmers 121 | know the other side of the story---if you want to make sure something is 122 | actually created (as opposed to just having memory reserved), you need to issue 123 | a dummy draw call that uses it to ``warm it up''. That's the price you have to 124 | pay for this kind of lazy evaluation. 125 | 126 | The UMD also gets to deal with fun things like all the D3D9 ``legacy'' shader 127 | versions and the fixed function pipeline---yes, all of that 128 | will get faithfully passed through by D3D. The 3.0 shader profile is fairly 129 | reasonable, but 2.0 is crufty and the various 1.x 130 | shader versions are seriously weird---remember 1.3 pixel shaders? Or, for that 131 | matter, the fixed-function vertex pipeline with vertex lighting and such? 132 | Support for all that's still there in D3D and the guts of every modern graphics 133 | driver, though of course everything just gets translated into regular shader 134 | code at this point. 135 | 136 | Then there's things like memory management. The UMD will get things like 137 | texture creation commands and need to provide space for them. Now, the UMD does 138 | not actually own or directly manage video memory; the OS manages video memory 139 | and the KMD (Kernel-Mode Driver) is responsible for whatever magic is necessary 140 | to actually allocate memory in the Aperture (a system-memory region accessible 141 | to the graphics card) or in Video Memory. But it is the UMD that needs to 142 | initiate such allocation requests, and since they're not cheap, it might also 143 | decide to bundle small resources into larger memory blocks and suballocate from 144 | them. 145 | 146 | The UMD can, however, write command buffers.\footnote{In the original (blog) 147 | version of this text, I used the terms ``command buffer'' and ``DMA buffer'' 148 | interchangeably; this version goes into more detail on WDDM, where the two names 149 | refer to separate concepts, so I'll have to be more precise with the 150 | terminology; apologies for any confusion this might cause.} A command buffer 151 | contains rendering commands in whatever format the hardware understands. The 152 | UMD will convert all the state-changing and rendering commands received from 153 | the API into this form, and also insert some new ones that are requisite glue 154 | (for example, shaders might need to be uploaded into a special memory area 155 | before they can be used for the first time). 
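What does such a command buffer actually look like? The real formats are proprietary and differ between vendors and even between GPU generations, but conceptually it's little more than a tightly packed stream of variable-length packets. A purely hypothetical encoding---none of the names or layouts below correspond to any real hardware---might look like this:

\begin{verbatim}
// Hypothetical command encoding; real formats are vendor-specific.
#include <cstdint>

enum CmdOpcode {
    CMD_SET_RENDER_STATE = 1,  // payload: state id, new value
    CMD_SET_SHADER       = 2,  // payload: GPU address of shader code
                               //          (not known yet -- patched later)
    CMD_DRAW_INDEXED     = 3   // payload: index count, start index
};

// Append one command (8-bit opcode, payload size in 32-bit words) to the
// command buffer the UMD is currently writing; returns new write position.
static uint32_t *EmitCmd(uint32_t *write_ptr, CmdOpcode op,
                         const uint32_t *payload, uint32_t count)
{
    *write_ptr++ = (static_cast<uint32_t>(op) & 0xff) | (count << 8); // header
    for (uint32_t i = 0; i < count; i++)
        *write_ptr++ = payload[i];                                    // payload
    return write_ptr;
}
\end{verbatim}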
156 | 157 | There's a problem here, though; as I mentioned, the UMD does not actually 158 | ``own'' the graphics card, nor does it get to manage video memory. So while the 159 | UMD is putting together the command buffer, it does not actually know where 160 | textures, vertex buffers etc.\@ will be in memory by the time the graphics card 161 | gets to see the commands. So it can't put the right offsets in just yet; 162 | instead, each command buffer comes with an associated ``allocation list'' that 163 | lists all blocks of memory that need to be in memory for the command buffer to 164 | run and a ``patch list'' that gives the locations in the command buffer where 165 | their eventual addresses have to be written to. These patches are applied later 166 | by the KMD. 167 | 168 | In general, drivers will try to put as much of the actual processing into the 169 | UMD as possible; the UMD is user-mode code, so anything that runs in it doesn't 170 | need any costly kernel-mode transitions, it can freely allocate system memory, 171 | farm work out to multiple threads, and so on: it's just a regular DLL (even 172 | though it's loaded by the API, not directly by the app). This has advantages 173 | for 174 | driver development too: if the UMD crashes, the app might crash with it, but 175 | not the whole system; it can just be replaced while the system is running (it's 176 | just a DLL!); it can be debugged with a regular debugger; and so on. So it's 177 | not only efficient, it's also convenient. 178 | 179 | However, there's a problem: there's one UMD instance per app that uses the 3D 180 | pipeline, but all of them are trying to access a single shared resource, the 181 | GPU (even in a dual-GPU configuration, only one of them is actually the main 182 | display). In a multi-tasking OS, there's two possible solutions to this 183 | problem. Either make sure to only grant exclusive access to the GPU to one app 184 | at a time (this was how D3D fullscreen mode originally worked), or introduce 185 | some layer that arbitrates access to the GPU and hands all processes their 186 | time-slices. And since by now even regular window management uses 3D 187 | functionality for rendering, the exclusive model isn't workable anymore. So we 188 | need something that acts as the gatekeeper for access to the GPU. In the case 189 | of WDDM, that something is the DirectX graphics kernel subsystem (aka 190 | \code{Dxgkrnl.sys}), an OS component which contains---among other things---the 191 | GPU scheduler and the video memory manager. In WDDM, the UMD doesn't talk to the 192 | kernel layer directly, but has to go through the D3D runtime again; however, 193 | the runtime basically just forwards the calls to the right kernel-mode entry 194 | points. 195 | 196 | \section{GPU scheduler and Video Memory Manager} 197 | 198 | The GPU scheduler is exactly that: it arbitrates access to the 3D pipeline 199 | using time-slicing, the same way that threads are multiplexed onto a (usually 200 | much smaller) number of logical processors. A context switch incurs, at the 201 | very least, some state switching on the GPU (which requires extra commands). 202 | And of course, different apps likely use very different sets of resources, so 203 | things might need to get swapped into (or out of) video memory. At any given 204 | point in time, only one process actually has its command buffers forwarded to 205 | the KMD; the remaining command buffers just get queued up. 
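Since allocation and patch lists will keep coming up, here's a heavily simplified sketch of what a single command-buffer submission from the UMD carries. The actual WDDM structures are \code{D3DDDI\_ALLOCATIONLIST} and \code{D3DDDI\_PATCHLOCATIONLIST}; the field names and layout below are illustrative only, not the real definitions:

\begin{verbatim}
// Illustrative only -- not the actual WDDM structure definitions.
#include <cstdint>
#include <cstddef>

struct AllocationRef {
    uint64_t allocation;       // handle of a resource the command buffer uses
    bool     written_by_gpu;   // read-only input vs. render target etc.
};

struct PatchLocation {
    uint32_t allocation_index; // index into the allocation list
    uint32_t patch_offset;     // where in the buffer the final GPU address
                               // of that allocation must be written
    uint32_t split_offset;     // point where the buffer may be split
                               // (more on splitting below)
};

struct Submission {
    const void          *command_buffer;  // commands, addresses still missing
    size_t               command_bytes;
    const AllocationRef *allocations;     // everything that must be resident
    size_t               allocation_count;
    const PatchLocation *patches;         // applied later by the KMD
    size_t               patch_count;
};
\end{verbatim}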
206 | 207 | Except it's not actually command buffers (in WDDM lingo) that get sent to the 208 | GPU at all; WDDM calls the buffers that actually get sent to the hardware ``DMA 209 | buffers''. The UMD calls the ``Render'' entry point with a command buffer, 210 | allocation list and patch list; all this eventually gets forwarded to the KMD, 211 | which has to validate that command buffer and turn it into 212 | a (hardware-executable) DMA buffer with a new allocation list and patch list. 213 | In theory, because there's this translation step, command buffers could be 214 | completely different from hardware DMA buffers; but in practice it's far more 215 | efficient (and also simpler) to keep the two very similar, if not identical. 216 | 217 | But before the DMA buffer can actually be submitted, the scheduler needs to 218 | ensure that all required resources are available. If everything mentioned in 219 | the allocation list is already in video memory, the DMA buffer is good to go. 220 | If not, however, the allocation list is sent to the video memory manager first. 221 | 222 | The video memory manager keeps track of the current contents of video memory. 223 | When presented with an allocation list, it has to figure out how to get all 224 | resources into GPU-accessible memory (either video memory or the Aperture). 225 | This will involve some uploads from system to video memory, but some resources 226 | that aren't used for a given DMA buffer might also get moved from video memory 227 | back to system memory. Or resources might be shuffled around in video memory 228 | (defragmented) to make sure there's a large enough hole for a large resource. 229 | Finally, the video memory manager might also find out that there's just no way 230 | to fit all allocations required by a given DMA buffer into memory at once, and 231 | request that it be split into several smaller pieces.\footnote{To facilitate 232 | DMA buffer splitting, patch lists contain not just patch locations, but also 233 | split locations so the video memory manager can determine split points without 234 | being able to read the (hardware-dependent) DMA buffer format.} If all went 235 | well, the video memory manager ends up with a list of resource movement 236 | commands: copy texture A from system memory to video memory, move vertex buffer 237 | B in video memory to compact free space, and so forth. It then calls a special 238 | KMD entry point to turn these commands into hardware-consumable DMA buffers 239 | that perform memory transfer operations (so-called ``paging buffers''). 240 | A rendering DMA buffer might take no paging buffers at all (if everything is 241 | already in GPU-accessible memory), or it might need several. 242 | 243 | Either way, after the video memory manager is done, the final locations for all 244 | resources are known, so the scheduler calls the KMD to patch the right offsets 245 | into the DMA buffer. And then, finally, the DMA buffers are actually ready to 246 | be submitted to the GPU: first the paging buffers, then the rendering DMA 247 | buffer. If the buffer was split, this page-then-render sequence is repeated 248 | several times. The KMD will then actually submit the DMA buffer to the GPU. 249 | 250 | In the other direction, once the GPU is finished with a DMA buffer, it's 251 | supposed to report back to the driver, which in turn pings the graphics kernel 252 | subsystem. 
The video memory manager needs this information to know when it's 253 | safe to actually free/recycle memory (dynamic vertex buffers, for example); 254 | allocations can't be freed or reused while the GPU might still reference them! 255 | 256 | \section{The Kernel-Mode Driver (KMD)} 257 | 258 | This is the component that actually deals with the hardware (WDDM-speak for 259 | this is ``Display miniport driver'', by the way, but I prefer KMD). There may 260 | be multiple 261 | UMD instances running at any one time, but there's only ever one KMD, and if 262 | that crashes, the graphics subsystem is in trouble; it used to be ``blue 263 | screen''-magnitude trouble, but starting with WDDM, Windows actually knows how 264 | to kill a crashed driver and reload it---progress! But since the video memory 265 | manager tries to keep resources in video memory (which might've been partially 266 | overwritten by the GPU reset that typically happens during KMD initialization), 267 | apps still need to re-create everything that wasn't cached in system memory. 268 | And of course, if the KMD didn't just crash, but went on scribbling all over 269 | kernel memory, all bets are off. 270 | 271 | From the discussion above, we know a lot of things that the KMD does by now: it 272 | handles resource allocation requests from the UMD (but not their physical 273 | placement, which is done by the Video Memory Manager). It can render command 274 | buffers into DMA buffers, build paging buffers, patch DMA buffers, and submit 275 | them to the GPU. It hands a list of different memory types (and their physical 276 | memory locations) to the video memory manager, and tells it what things can (or 277 | should) go where. It also keeps track of different contexts, which contain 278 | state information; this is necessary so the correct state can be restored on 279 | a context switch. 280 | 281 | There's also a lot of things we haven't seen yet. There's functions related to 282 | video mode switching, page flipping, the hardware mouse cursor, power-saving 283 | functionality, the hardware watchdog timer (which resets the GPU if it doesn't 284 | respond within a set time interval), hardware interrupt handling, and so 285 | forth. There's some functions that deal with swizzling ranges; we'll see 286 | texture swizzling later in section~\ref{sec:tex:swizzle}. And of course there's 287 | video overlays, which are special because they go through the DRM-infested 288 | ``protected media path'', to ensure that, say, a Blu-Ray player can push its 289 | precious decoded video pixels to the GPU without other user-mode processes 290 | getting their hands on the decompressed, decrypted data. 291 | 292 | But back to our DMA buffers: what does the KMD do with them once they've been 293 | submitted? The answer is one we'll be seeing a lot throughout this text: put them 294 | in another buffer. Or, more precisely, put their addresses into a buffer. 295 | Another DMA buffer, to be precise, though this one is usually quite small and 296 | organized as a ring buffer. Instead of rendering commands, the ring buffer just contains 297 | calls into the actual DMA buffers we just prepared. It's still just a buffer in 298 | (GPU-accessible) memory, though, and for anything to happen, the GPU must be 299 | told about it. 
There's usually two (GPU) hardware registers in play here: one 300 | is a read pointer, which is the current read position on the GPU side (this is 301 | how far it has consumed commands so far), and the other one is the write 302 | pointer, which is the location up to which the KMD has written commands to the 303 | ring buffer. Whenever the GPU is done with a DMA buffer, it'll return to look 304 | at the main ring buffer. If its current read pointer is different from the 305 | write pointer, there's more commands to process; if not, there's been no new 306 | DMA buffers submitted in the meantime, so the GPU will be idle for a while 307 | (until the write pointer changes). Communication with the GPU will be explained 308 | in more detail in chapter~\ref{ch:memory}. 309 | 310 | \section{Aside: OpenGL and other platforms} 311 | 312 | OpenGL is fairly similar to what I just described, except there's not as sharp 313 | a distinction between the API and UMD layer. And unlike D3D, the (GLSL) shader 314 | compilation is not handled by the API at all, it's all done by the driver. An 315 | unfortunate side effect is that there are as many GLSL frontends as there are 316 | 3D hardware vendors, all of them basically implementing the same spec, but with 317 | their own bugs and idiosyncrasies---everyone who's ever shipped a product using 318 | GL knows how much pain that causes to app authors. And it also means that the 319 | drivers 320 | have to do all the optimizations themselves whenever they get to see the 321 | shaders---including expensive optimizations. The D3D bytecode format is really 322 | a cleaner solution for this problem: there's only one compiler (so no slightly 323 | incompatible dialects between different vendors!) and it allows for some 324 | costlier data-flow analysis than you would normally do. 325 | 326 | Open Source implementations of GL tend to use either Mesa or Gallium3D, both of 327 | which have a single shared GLSL frontend that generates a device-independent IR 328 | and supports multiple pluggable backends for actual hardware. In other words, 329 | that space is fairly similar to the D3D model of sharing the front-end for 330 | different implementations and only specializing the low-level codegen that's 331 | actually different for every piece of hardware. 332 | 333 | \section{Further Reading} 334 | 335 | This section is just a coarse overview of the D3D10+/WDDM graphics stack; other 336 | implementations work somewhat differently, although the basic entities (driver, 337 | API, video memory manager, scheduler, command/DMA buffer dispatch) exist 338 | everywhere in some form or another. More details can be found, for example, in 339 | the official WDDM documentation~\citep{wddm}. 340 | 341 | \bibliography{sw_stack}{} 342 | --------------------------------------------------------------------------------