├── Bifrost.adoc
├── Midgard.md
├── README.md
├── Utgard-GP.md
└── Utgard-PP.md


/Bifrost.adoc:
--------------------------------------------------------------------------------
  1 | = Introduction
  2 | 
  3 | This page aims to describe, in a hopefully understandable way, what I've discovered so far about the instruction set for ARM's Bifrost architecture. In addition to the technical details, I'll try to give a higher-level view of what's going on; that is, _why_ I think ARM did something, and the design tradeoffs involved, rather than just the instruction encoding and what it does. Some of this can be ascertained from patents, articles, etc., but some of it will necessarily be guesswork.
  4 | 
  5 | = Public information from ARM
  6 | ARM has publicly explained various parts of the ISA in various places. Obviously, any public information from ARM is incredibly useful in the reverse-engineering process. The two main sources of information are articles/presentations and patents. Articles and presentations are aimed at a general audience, and so they may not go into the detail required to actually understand the ISA, but they are good at giving a broad overview of ARM's intentions in designing the microarchitecture. Patents tend to be more specific, since ARM needs to describe the technical details more precisely for their patent to be accepted, but they're also full of legalese that can be confusing and tedious at times.
  7 | 
  8 | == Reading patents
  9 | *Note:* the usual IANAL disclaimer applies here. The information here is only meant to be used for reverse engineering, I can't vouch for its accuracy, don't use it for anything legally significant without consulting a lawyer first, etc. You've been warned!
 10 | 
 11 | Usually, for our purpose, it's better to skip the claims section and focus on the "embodiments" section. The claims section is important legally, since it governs when ARM can sue someone for patent infringement (in some sense, it's the "meat" of the patent), but we aren't concerned about that. The embodiments section, on the other hand, is supposed to provide the patent examiner and a potential judge/jury context about the invention it's disclosing by describing how ARM is actually using it in its own products. An "embodiment" here is just a way that the invention could actually be used in an actual, shipping product. Technically, how the patent is actually used in practice is irrelevant from a legal point of view, so you'll see language like "in one possible embodiment, ...", but it's usually obvious which "possible embodiment" is the one that ARM is actually using -- the patent will go into detail on only one of the possible embodiments, or explicitly talk about a "preferred embodiment" which is usually the one they're actually using.
 12 | 
 13 | == List o' Links
 14 | - http://www.anandtech.com/show/10375/arm-unveils-bifrost-and-mali-g71[Anandtech article], gives a general overview
 15 | - https://patents.google.com/patent/GB2540970A/en[UK Patent Application GB2540970A], on the Bifrost clause architecture
 16 | - https://patents.google.com/patent/US20160364209A1/en[US Patent Application US 2016-0364209 A1], describing a way to implement more efficient argument reduction for logarithms using a lookup table. It seems that the Itanium math libraries used the exact same technique, described in a paper http://www.cl.cam.ac.uk/~jrh13/papers/itj.pdf[here], specifically in the "novel reduction" section. Yes, this means the patent application is probably bogus. No, I'm not a lawyer.
 17 | - https://patents.google.com/patent/WO2017125700A1/en[WO2017125700A1]. This seems to describe encoding two different commutative operations with the same opcode, differentiating between them by whether the larger source comes first (what about when the sources are the same?). I haven't come across this.
 18 | 
 19 | = Overview
 20 | Bifrost is a clause-oriented architecture. Instructions are grouped into _clauses_ that consist of instructions that can be executed back-to-back without any scheduling decisions. A clause consists of a _clause header_ with information required for scheduling followed by a series of instructions and constants (immediates) referenced by the instructions. Instead of choosing instructions to run, the scheduler chooses clauses to run. As we'll see, instruction decoding is also per-clause instead of per-instruction. Instructions and immediates are packed into a clause and unpacked at run-time. That way, instructions can be packed to save space, and the unpacking cost is amortized over the entire clause.
 21 | 
 22 | Internally, each instruction consists of a register read/write stage, an FMA pipeline stage, and an ADD pipeline stage. The register stage is decoupled in the ISA from the functional units, which simplifies the decode hardware. The register stage bits describe which registers to read/write with various ports, while the FMA and ADD stages have only 3 bits for each source, which describe which port to read for that source. The result of both stages from the previous cycle also appears as another "port", allowing instructions to bypass the register file to save power and reduce register pressure, avoiding spilling. The register stage can also read a constant embedded in the clause or a uniform register, which is another "port." Finally, there's one last port, which is always 0 in the FMA unit but is the result of the FMA unit from the same instruction in the ADD unit. For example, an unfused multiply-add is done by doing a multiply in the FMA unit, using the 0 port for the sum, and then an addition in the ADD unit using the same port (here representing the result of the multiply) as one of its sources.
 23 | 
 24 | Note that in addition to the execution unit, which we've been describing, there are a few other units:
 25 | 
 26 | - Varying interpolation unit
 27 | - Attribute fetching unit
 28 | - Load/store unit
 29 | - Texture unit
 30 | 
 31 | The execution unit interacts with these fixed-function blocks through special, variable-latency instructions in the FMA and ADD units. They bypass the usual, fixed-latency mechanism for reading/writing registers, and as such instructions in the same clause can't assume that registers have been read/written after the instruction is done (TODO: verify this for registers being read, and not just written). Instead, any dependent instructions must be put into a separate clause with the appropriate dependencies in the clause header set.
 32 | 
 33 | = Clauses
 34 | Conceptually, each clause consists of a clause header, followed by one or more 78-bit instruction words and then zero or more 60-bit constants. (Constants are actually 64 bits, but they're loaded through the same port as uniform registers and share the same field in the instruction word. That field includes 7 bits to choose which uniform register to load, some of which would be unused for constants, so ARM decided to be clever and stick the bottom 4 bits of the constant in each instruction where the constant is loaded. The constants in the instruction stream are therefore only 60 bits.) But the instruction fetching hardware only works in 128-bit quadwords, so each clause has to be a multiple of 128 bits. To make the representation of the clauses as compact as possible while still making the decoding circuitry relatively simple, the instructions are packed so that two 128-bit quadwords can store three 78-bit instructions, or three 128-bit quadwords can store four instructions and a 60-bit constant. There were some bits left over, which seem to have been used to obviate the need to keep track of state between each word, simplifying the decoder and making it possible to decode the quadwords in parallel. Thus, the quadwords can be (almost) arbitrarily reordered while still retaining the meaning of the clause. (It's unknown whether this works in practice, but theoretically it could be done.) Each format fully describes which instruction(s) in the decoded clause the bits in the quadword represent, and whether one of those instructions is the last instruction.
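
To make the 60-bit trick concrete, here's a minimal Python sketch of how a 64-bit constant could be split up and put back together. It assumes the 60 bits stored in the clause are the upper 60 bits of the constant, with the instruction supplying the bottom 4; the function names are made up for illustration and don't come from any real tool.

```python
def split_constant(value):
    """Split a 64-bit constant into (stored60, low4).

    Assumption: the clause stores the upper 60 bits and the instruction
    word carries the bottom 4 bits.
    """
    if not 0 <= value < (1 << 64):
        raise ValueError("constant must fit in 64 bits")
    return value >> 4, value & 0xF


def rebuild_constant(stored60, low4):
    """Recombine the 60 stored bits with the 4 bits from the instruction."""
    if not 0 <= stored60 < (1 << 60) or not 0 <= low4 < (1 << 4):
        raise ValueError("field out of range")
    return (stored60 << 4) | low4
```

One consequence of this layout is that two instructions in the same clause could share a stored constant while differing in the bottom 4 bits.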
 35 | 
 36 | == Quadword formats
 37 | The bottom 8 bits of each 128-bit quadword are a "tag" that's read by the decoding circuitry. They describe how to interpret the rest of the word, as well as possibly containing some bits from some of the instructions to decode if there were more bits to spare. Each possible tag value is described below.
 38 | 
 39 | [cols="8*"]
 40 | |============================
 41 | | 0            | 0            | 1             | 0            | 1          3+| `I1`
 42 | | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp}  | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp}
 43 | |============================
 44 | 
 45 | This is the beginning of a clause with more than one instruction. The next 75 bits of the quadword are the low 75 bits of instruction 0, and `I1` gives the high 3 bits. The high 45 bits of the quadword contain the clause header.
 46 | 
 47 | [cols="8*"]
 48 | |============================
 49 | | 0            | `S`          | 0             | 0            | 1          3+| `I1`
 50 | | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp}  | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp}
 51 | |============================
 52 | 
 53 | This is the beginning of a clause with only one instruction. The next 75 bits of the quadword are the low 75 bits of instruction 0, and `I1` gives the high 3 bits. The high 45 bits of the quadword contain the clause header. If `S` is one, then the clause is finished after this quadword. Otherwise, there are later quadwords (that must contain constants).
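
As a rough illustration of the two "beginning of clause" formats, the sketch below pulls the tag, instruction 0, and the clause header out of a quadword held as a 128-bit Python integer. It assumes the tag tables list bits from bit 7 (left) down to bit 0 (right), so that `I1` is the low 3 bits of the tag; that ordering is my guess, and the function name is hypothetical.

```python
def decode_first_quadword(quadword):
    """Return (tag, instruction0, header) for the first quadword of a clause.

    Assumed layout: bits 0-7 tag, bits 8-82 low 75 bits of instruction 0,
    bits 83-127 the 45-bit clause header, I1 in the low 3 bits of the tag.
    """
    if not 0 <= quadword < (1 << 128):
        raise ValueError("quadword must fit in 128 bits")
    tag = quadword & 0xFF
    low75 = (quadword >> 8) & ((1 << 75) - 1)
    header = quadword >> 83
    i1 = tag & 0x7                      # high 3 bits of instruction 0
    instruction0 = (i1 << 75) | low75   # full 78-bit instruction word
    return tag, instruction0, header
```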
 54 | 
 55 | [cols="8*"]
 56 | |============================
 57 | | 0            | `S`          | 0             | 0            | 0            | 0            | 1            | 1
 58 | | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp}  | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp}
 59 | |============================
 60 | 
 61 | This quadword contains instruction 1, which is the final instruction. As usual, the next 75 bits are the low 75 bits of instruction 1, and the high 3 bits of the quadword contain the high 3 bits of instruction 1. As before, there may be following constants, in which case `S` is set to 0.
 62 | 
 63 | [cols="8*"]
 64 | |============================
 65 | | 0            | 0            | 1             | 0            | 0          3+| `I1`
 66 | | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp}  | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp}
 67 | |============================
 68 | 
 69 | This quadword contains instruction 1 and part of instruction 2. The next 75 bits are the low 75 bits of instruction 1, and `I1` contains its high 3 bits. After that, the next 45 bits are the low 45 bits of instruction 2.
 70 | 
 71 | [cols="8*"]
 72 | |============================
 73 | | 0            | `S`          | 0             | 0            | 0            | 1            | 0            | 0
 74 | | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp}  | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp}
 75 | |============================
 76 | 
 77 | This quadword contains part of instruction 2, which is the final instruction, plus a constant. The next 60 bits are the constant, after which 15 bits are unused, and then there are 30 bits for instruction 2. The high 3 bits of the quadword are the high 3 bits of instruction 2.
 78 | 
 79 | [cols="8*"]
 80 | |============================
 81 | | 0            | `S`          | 0             | 0            | 0            | 1            | 0            | 1
 82 | | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp}  | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp}
 83 | |============================
 84 | 
 85 | This quadword contains the final instruction 3 and part of instruction 2. The next 75 bits are the low bits of instruction 3, while the 30 bits after that are the next 30 bits of instruction 2. The high 3 bits of the quadword are the high 3 bits of instruction 2, while the next-highest are the high 3 bits of instruction 3.
 86 | 
 87 | [cols="8*"]
 88 | |============================
 89 | | 0            | 0            | 0             | 0            | 0            | 0            | 0            | 1
 90 | | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp}  | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp}
 91 | |============================
 92 | 
 93 | The same as the above, except that instruction 3 isn't the final instruction.
 94 | 
 95 | [cols="8*"]
 96 | |============================
 97 | | 1            | 0          3+| `I1`                                      3+| `I2`
 98 | | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp}  | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp}
 99 | |============================
100 | 
101 | The next 75 bits are the low 75 bits of instruction 3, and the high 3 bits of instruction 3 are given by `I1`. The next 30 bits give bits 45-74 of instruction 2, and `I2` gives the high 3 bits of instruction 2. Finally, the remaining high 15 bits are the low 15 bits of the first constant.
102 | 
103 | [cols="8*"]
104 | |============================
105 | | 0            | `S`          | 0             | 1            | 0          3+| `I1`
106 | | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp}  | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp}
107 | |============================
108 | 
109 | This must come after a word with the previous format, and gives instruction 4, which must be the final instruction, and the rest of the constant. The next 75 bits are the low 75 bits of instruction 4, and `I1` contains the high 3 bits. The remaining 45 bits give the high 45 bits of the first constant.
110 | 
111 | [cols="8*"]
112 | |============================
113 | | 0            | 1            | 1             | 0            | 0          3+| `I1`
114 | | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp}  | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp}
115 | |============================
116 | 
117 | This quadword contains instruction 4 and part of instruction 5. The next 75 bits are the low 75 bits of instruction 4, and `I1` contains its high 3 bits. After that, the next 45 bits are the low 45 bits of instruction 5.
118 | 
119 | [cols="8*"]
120 | |============================
121 | | 0            | `S`          | 0             | 0            | 0            | 1            | 1            | 1
122 | | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp}  | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp}
123 | |============================
124 | 
125 | This quadword contains the final instruction 6 and part of instruction 5. The next 75 bits are the low bits of instruction 6, while the 30 bits after that are the next 30 bits of instruction 5. The high 3 bits of the quadword are the high 3 bits of instruction 5, while the next-highest are the high 3 bits of instruction 6.
126 | 
127 | [cols="8*"]
128 | |============================
129 | | 0            | `S`          | 0             | 0            | 0            | 1            | 1            | 0
130 | | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp}  | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp}
131 | |============================
132 | 
133 | This quadword contains part of instruction 5, which is the final instruction, plus a constant. The next 60 bits are the constant, after which 15 bits are unused, and then there are 30 bits for instruction 5. The high 3 bits of the quadword are the high 3 bits of instruction 5.
134 | 
135 | [cols="8*"]
136 | |============================
137 | | 1            | 1          3+| `I1`                                      3+| `I2`
138 | | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp}  | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp}
139 | |============================
140 | 
141 | The next 75 bits are the low 75 bits of instruction 6, and the high 3 bits of instruction 6 are given by `I1`. The next 30 bits give bits 45-74 of instruction 5, and `I2` gives the high 3 bits of instruction 5. Finally, the remaining high 15 bits are the low 15 bits of the first constant.
142 | 
143 | [cols="8*"]
144 | |============================
145 | | 0            | `S`          | 0             | 1            | 1          3+| `I1`
146 | | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp}  | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp}
147 | |============================
148 | 
149 | This must come after a word with the previous format, and gives instruction 7, which must be the final instruction, and the rest of the constant. The next 75 bits are the low 75 bits of instruction 7, and `I1` contains the high 3 bits. The remaining 45 bits give the high 45 bits of the first constant.
150 | 
151 | [cols="8*"]
152 | |============================
153 | | 0            | `S`          | 1             | 1          4+| `pos`
154 | | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp}  | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp} | {nbsp}{nbsp}
155 | |============================
156 | 
157 | This format contains two 60-bit constants in the remaining 120 bits of the quadword. The `pos` bits describe where in the instruction stream the constants are. In particular, they encode the total number of instructions in the clause and the number of constants before the current two. Since the packing algorithm below can only produce some of these combinations, not every possible pair of instructions and constants is representable. The table below lists the ones that are currently known.
158 | 
159 | [options="header"]
160 | |============================
161 | | `pos` | Instructions | Constants
162 | | 0     | 1            | 0
163 | | 1     | 2            | 0
164 | | 2     | 4            | 0
165 | | 3     | 3            | 1
166 | | 4     | 5            | 1
167 | | 5     | 4            | 2
168 | | 6     | 7            | 0
169 | | 7     | 6            | 1
170 | | 8     | 5            | 3
171 | | 9     | 8            | 1
172 | | a     | 7            | 2
173 | | b     | 6            | 3
174 | | c     | 8            | 3
175 | | d     | 7            | 4
176 | |============================
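
The table above translates directly into a lookup. The sketch below is just that table as a Python dictionary plus a reverse lookup for an encoder; the names `POS_TABLE` and `encode_pos` are made up, and the unobserved values e and f are deliberately left out.

```python
# pos -> (instructions in the clause, constants before the current two)
POS_TABLE = {
    0x0: (1, 0), 0x1: (2, 0), 0x2: (4, 0), 0x3: (3, 1),
    0x4: (5, 1), 0x5: (4, 2), 0x6: (7, 0), 0x7: (6, 1),
    0x8: (5, 3), 0x9: (8, 1), 0xA: (7, 2), 0xB: (6, 3),
    0xC: (8, 3), 0xD: (7, 4),
}


def encode_pos(num_instructions, constants_before):
    """Find the `pos` value for a constant-pair quadword, if one is known."""
    for pos, entry in POS_TABLE.items():
        if entry == (num_instructions, constants_before):
            return pos
    raise ValueError("no known pos value for this combination")
```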
177 | 
178 | There is a trivial limit for the number of constants per clause: since each instruction can only load one (64-bit) constant, there can be at most as many constants as instructions. It seems that the blob compiler currently refuses to add more than 5 constants to an 8-instruction clause, although it's happy to use at least 6 constants for a 7-instruction clause. Beyond that, the limit isn't known. However, there are only two remaining values for `pos` that haven't been observed (e and f), so there isn't much space left for more constants.
179 | 
180 | == Algorithm for packing clauses
181 | 
182 | This section describes how the blob compiler seems to use these formats to pack as many instructions and constants into as few words as possible. There may be other equivalent ways to do it, but given how complicated all the different formats are, and that the hardware decoder and encoding algorithm were developed in tandem, it's probably best to stick with what the blob does.
183 | 
184 | First, we assign instructions to quadwords. We may assign an entire instruction to a quadword, or we may split an instruction across two quadwords.
185 | 
186 | - Assign the first instruction to the first quadword.
187 | - Assign the second instruction to the second quadword.
188 | - Split the third instruction across the second and third quadwords.
189 | - Assign the fourth instruction to the third quadword.
190 | - Assign the fifth instruction to the fourth quadword.
191 | - Split the sixth instruction across the fourth and fifth quadwords.
192 | - Assign the seventh instruction to the fifth quadword, and the eighth instruction to the sixth quadword.
193 | 
194 | Simply go down the list until there are no more instructions left.
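
Here's a small Python sketch of the instruction-assignment step. The slot list is read off the quadword formats described earlier (instruction 2 shares quadwords with instructions 1 and 3, instruction 5 with instructions 4 and 6), using 0-based instruction and quadword numbers. It assumes a clause holds at most 8 instructions, the largest count in the `pos` table, and the names are hypothetical.

```python
# For each instruction (0-based), the quadword(s) (0-based) holding its bits.
# Two entries mean the instruction is split across those two quadwords.
INSTRUCTION_SLOTS = [
    (0,), (1,), (1, 2), (2,),
    (3,), (3, 4), (4,), (5,),
]


def assign_instructions(count):
    """Return, for each quadword in order, the instructions it holds."""
    if not 1 <= count <= len(INSTRUCTION_SLOTS):
        raise ValueError("a clause is assumed to hold 1 to 8 instructions")
    quadwords = {}
    for instruction, slots in enumerate(INSTRUCTION_SLOTS[:count]):
        for quadword in slots:
            quadwords.setdefault(quadword, []).append(instruction)
    return [quadwords[q] for q in sorted(quadwords)]
```

For example, `assign_instructions(3)` leaves the last quadword holding only the second half of instruction 2, which is exactly the case where a constant can be dropped into the free 75 bits.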
195 | 
196 | Now, we assign constants to quadwords if we have any. We look at the last quadword and do the following:
197 | 
198 | - If it only contains an instruction that was split across two quadwords, then there are 75 bits free. Put the constant where the next instruction would have gone, and use the appropriate format to indicate that.
199 | - If it only contains one instruction, and the previous quadword has an instruction and a split instruction, then we can split the constant across the last two quadwords.
200 | 
201 | For any remaining constants, we simply add quadwords with two constants each. Note that in some cases, we need to add a "dummy" constant, even when the clause doesn't use any constants, because there's no format that does what we want. For example, say that we have a clause with 5 instructions and no constants. The fourth quadword is supposed to contain only instruction 4, which is the final instruction, but there is no format for that. Instead, we add a constant split across the third and fourth quadwords, since there is a format with a final instruction 4 and part of a constant. From a design point of view, this reduces the number of possible formats, which reduces the complexity of the decoder and means fewer bits are needed to describe the format.
202 | 
203 | == Clause Header
204 | 
205 | The clause header mainly contains information about "variable-latency" instructions like SSBO loads/stores/atomics, texture sampling, etc. that use separate functional units. There can be at most one variable-latency instruction per clause. It also indicates when execution should stop, and has some information about branching. The format of the header is as follows:
206 | 
207 | [options="header"]
208 | |============================
209 | | Field                        | Bits
210 | | unknown                      | 18
211 | | Register                     | 6
212 | | Scoreboard dependencies      | 8
213 | | Scoreboard entry             | 3
214 | | Instruction type             | 4
215 | | unknown                      | 1
216 | | Next clause instruction type | 4
217 | | unknown                      | 1
218 | |============================
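
The field widths above add up to the 45 header bits, which the following Python sketch uses to unpack a header. Note the big assumption: it treats the table as listing fields from the lowest bits upward, which the table itself doesn't say. The `unknown_a`/`unknown_b`/`unknown_c` names are placeholders I invented for the three unknown fields.

```python
# (name, width in bits), assumed to run from the low bits to the high bits.
HEADER_FIELDS = [
    ("unknown_a", 18),
    ("register", 6),
    ("scoreboard_dependencies", 8),
    ("scoreboard_entry", 3),
    ("instruction_type", 4),
    ("unknown_b", 1),
    ("next_instruction_type", 4),
    ("unknown_c", 1),
]


def unpack_header(header):
    """Split a 45-bit clause header into a dict of named fields."""
    if not 0 <= header < (1 << 45):
        raise ValueError("header must fit in 45 bits")
    fields = {}
    shift = 0
    for name, width in HEADER_FIELDS:
        fields[name] = (header >> shift) & ((1 << width) - 1)
        shift += width
    return fields
```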
219 | 
220 | === Register field
221 | 
222 | A lot of variable-latency instructions have to interact with the register file in ways that would be awkward to express in the usual manner, i.e. with the per-instruction register field. For example, the STORE instruction has to read up to 4 32-bit registers, which the usual pathways for reading a register can't handle -- they're designed for reading up to three 32-bit or 64-bit registers each cycle -- and it also needs to load a 64-bit address from registers. The LOAD instruction can't write to the register until the operation has finished, possibly well after the instruction executes. For cases like these, there's a "register" field in the clause header that lets the variable-latency instruction read/write one register, or a sequence of registers, in a manner different from the usual one. Since there can only be one variable-latency instruction per clause, this field isn't ambiguous about which instruction it applies to. If more than one register is being read from or written to, the number of registers must be a power of two, and the register field must be aligned to that power of two. For example, a two-register source could be R0-R1 (if the register field is 0), R2-R3 (register field is 2), R4-R5, etc. Or a four-register source could be R0-R3, R4-R7, etc.
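
The alignment rule can be written down as a quick check. This Python sketch takes the 6-bit register field and a register count and returns the registers that would be touched, rejecting combinations the rule above forbids; the function name is invented for illustration.

```python
def register_range(register_field, count):
    """List the registers a variable-latency instruction would access."""
    if not 0 <= register_field < 64:
        raise ValueError("register field is 6 bits")
    if count < 1 or count & (count - 1):
        raise ValueError("register count must be a power of two")
    if register_field % count:
        raise ValueError("register field must be aligned to the count")
    return list(range(register_field, register_field + count))
```

So `register_range(2, 2)` gives R2-R3, while `register_range(1, 2)` is rejected because R1-R2 isn't aligned.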
223 | 
224 | === Dependency tracking
225 | 
226 | No instructions depend upon a variable-latency instruction in the same clause, so all the intra-clause scheduling can be done by the compiler. On the other hand, instructions in one clause might depend on a variable-latency instruction in another clause, and the latency obviously can't be known beforehand, so some kind of inter-clause dependency tracking mechanism must exist. Bifrost uses a six-entry https://en.wikipedia.org/wiki/Scoreboarding[scoreboard], with the scoreboard entries and dependencies manually described by the compiler. Each clause has a scoreboard entry, which corresponds to a given bit in the scoreboard. When the clause is dispatched, the bit is set, and when the variable-latency instruction in the clause completes, the bit is cleared. Any later clause can set that entry's bit in its own "scoreboard dependencies" bitfield to halt execution until the variable-latency instruction completes.
227 | 
228 | As a concrete example, consider this program:
229 | 
230 | [source,glsl]
231 | ----
232 | layout(std430, binding = 0) buffer myBuffer {
233 |     int inVal1, inVal2, outVal;
234 | };
235 | 
236 | void main() {
237 |     outVal = inVal1 + inVal2;
238 | }
239 | ----
240 | 
241 | It might get translated into something like this, in assembly-like pseudocode (assuming for a second that the loads don't get combined):
242 | 
243 | [source]
244 | ----
245 | {
246 | LOAD.i32 R0, ptr + 0x0
247 | }
248 | {
249 | LOAD.i32 R1, ptr + 0x4
250 | }
251 | {
252 | ADD.i32 R0, R0, R1
253 | STORE.i32 R0, ptr + 0x8
254 | }
255 | ----
256 | 
257 | The third clause must depend on the first two, although the first two are independent and can be executed in any order. The dependency bits to express this would be:
258 | 
259 | [options="header"]
260 | |============================
261 | | Clause | Scoreboard entry | Scoreboard dependencies
262 | | 1 | 0 | 00000000
263 | | 2 | 1 | 00000000
264 | | 3 | 2 | 00000011
265 | |============================
266 | 
267 | Since the first two clauses have no dependencies, they will be started in-order, one immediately after the other. They will queue up two requests to the load/store unit, with scoreboard tags of 0 and 1, and set bits 0 and 1 of the scoreboard. The first load will clear bit 0 of the scoreboard (based on the tag that was sent with the load) when it is finished, and the second load will clear bit 1. The third clause has bits 0 and 1 set in the dependencies, so it will wait for bits 0 and 1 to clear before executing. Therefore, it won't run until both of the loads have been completed.
268 | 
269 | The final wrinkle in all of this is that the scoreboard dependencies encoded in a clause are actually the dependencies that must be satisfied before the _next_ clause can execute. So in the above example, the actual encoding for the clauses would look like:
270 | 
271 | [options="header"]
272 | |============================
273 | | Clause | Scoreboard entry | Scoreboard dependencies
274 | | 1 | 0 | 00000000
275 | | 2 | 1 | 00000011
276 | | 3 | 2 | 00000000
277 | |============================
278 | 
279 | The first clause in a program implicitly has no dependencies. This scheme makes it possible to determine whether the next clause can be run before actually fetching it, presumably simplifying the hardware scheduler a little.
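
The shift from "logical" dependencies to what's actually encoded can be sketched in a few lines of Python. The input is a list of clauses in program order, each with its scoreboard entry and the set of entries it needs to wait on; the output is what each header would carry, with every wait mask moved up to the previous clause. The function name and data layout are my own invention.

```python
def encode_dependencies(clauses):
    """clauses: list of (scoreboard_entry, entries_to_wait_on) in program order.

    Returns (scoreboard_entry, dependency_mask) pairs as stored in each
    clause header, where the mask describes the waits of the next clause.
    """
    if clauses and clauses[0][1]:
        raise ValueError("the first clause implicitly has no dependencies")
    encoded = []
    for i, (entry, _) in enumerate(clauses):
        # The header of clause i carries the waits of clause i + 1.
        next_waits = clauses[i + 1][1] if i + 1 < len(clauses) else ()
        mask = 0
        for dep in next_waits:
            mask |= 1 << dep
        encoded.append((entry, mask))
    return encoded
```

Feeding in the three-clause example, `[(0, set()), (1, set()), (2, {0, 1})]`, reproduces the second table: the mask `0b11` ends up on clause 2 rather than clause 3.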
280 | 
281 | In addition to the normal 6 scoreboard entries available for clauses to wait on other clauses, there are two more entries reserved for tile operations. Bit 6 is cleared when depth and stencil values have been written for earlier fragments, so that the depth and stencil tests can safely proceed. The ATEST instruction (see patent) must wait on this bit. Bit 7 is cleared when blending has been completed for earlier fragments and the results written to the tile buffer, so that blending is possible. The BLEND instruction must wait on this bit. The blob also makes BLEND wait on bit 6, but I don't think that's necessary since it also waits on ATEST which waits on bit 6. These scoreboard entries provide similar functionality to the branch-on-no-dependency instruction on Midgard.
282 | 
283 | === Instruction type
284 | 
285 | The "instruction type" and "next clause instruction type" fields tell whether the clause has a variable-latency instruction, and if it does, which kind. Unsurprisingly, the "next clause instruction type" field applies to the next clause to be executed. If the clause doesn't have any variable-latency instructions, then the whole scoreboarding mechanism is skipped -- the clause is always executed immediately and it never sets or clears any scoreboard bits.
286 | 
287 | [options="header"]
288 | |============================
289 | | Value | Instruction type
290 | | 0     | no variable-latency instruction
291 | | 5     | SSBO store
292 | | 6     | SSBO load
293 | |============================
294 | 
295 | TODO: fill out this table
296 | 
297 | = Instructions
298 | 
299 | Now that we know how instructions and constants are packed into clauses, let's take a look at the instructions themselves. Each instruction is executed in 3 stages: the register read/write stage, the FMA stage, and the ADD stage, each of which corresponds to a part of the 78-bit instruction word. We'll describe what they do, and the format of their part of the instruction word, in the next sections. We'll go by the order they're executed, as well as the order of the bits in the instruction word.
300 | 
301 | == Register read/write (35 bits)
302 | 
303 | As the name suggests, this stage reads from and writes to the register file. The current instruction reads from the register file at the same time that the previous instruction writes to the register file. Thus, the field contains both reads from the current instruction and writes from the previous instruction. Presumably, the scheduler makes this happen by interleaving the execution of multiple clauses from different quads. It only executes one instruction from a given quad every 3 cycles, so that the register write phase of one instruction happens at the same time as the register read phase of the next. Of course, it's possible that the FMA and ADD stages take more than 1 cycle, and more threads are interleaved as a consequence; this is a microarchitectural decision that's not visible to us. The result is that a write to a register that's immediately read by the next instruction won't work, but that's never necessary anyways thanks to the passthrough sources detailed later.
304 | 
305 | The register file has four ports, two read ports, a read/write port, and a write port. Thus, up to 3 registers can be read during an instruction. These ports are represented directly in the instruction word, with a field for telling each port the register address to use. There are three outputs, corresponding to the three read ports, and two inputs, corresponding to the FMA and ADD results from the previous stage. The ports are controlled through what the ARM patent calls the "register access descriptor," which is a 4-bit entry that says what each of the ports should do. Finally, there is the uniform/const port, which is responsible for loading uniform registers and constants embedded in the clause. Note that the uniforms and constants share the same port, which means that only one uniform or one constant (but not both) can be loaded for an instruction. This port supplies 64 bits of data, though, so two 32-bit parts of the same 64-bit value can be accessed in the same instruction.
306 | 
307 | The format of the register part of the instruction word is as follows:
308 | 
309 | [options="header"]
310 | |============================
311 | | Field               | Bits
312 | | Uniform/const       | 8
313 | | Port 2 (write)      | 6
314 | | Port 3 (read/write) | 6
315 | | Port 0 (read)       | 5
316 | | Port 1 (read)       | 6
317 | | Control             | 4
318 | |============================
319 | 
320 | Control is what ARM calls the "register access descriptor." To save bits, if the Control field is 0, then Port 1 is disabled, and the field for Port 1 instead contains the "real" Control field in the upper 4 bits. Of the two remaining low bits of the Port 1 field, bit 1 is set to 1 if Port 0 is disabled, and bit 0 is reused as the high bit of Port 0, allowing you to still access all 64 registers. If the Control field isn't 0, then both Port 0 and Port 1 are always enabled. In this way, the Control field only needs to describe how Port 2 and Port 3 are configured, except for the magic 0 value, reducing the number of bits required.
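
The Control-is-0 special case is easier to follow in code. The Python sketch below decodes Port 0, Port 1, and the effective Control value from the 35-bit register field. It assumes the fields are laid out from the low bits upward in the order of the table (so Port 0 starts at bit 20, Port 1 at bit 25, Control at bit 31), and that "bit 1" and "bit 0" refer to the low bits of the Port 1 field; both are assumptions on my part, and the function name is hypothetical. When both ports are enabled, the raw field values are returned without further interpretation.

```python
def decode_read_ports(register_word):
    """Return a dict with the effective control value and Port 0/1 settings.

    A port set to None is disabled. Field positions are an assumption:
    Port 0 at bits 20-24, Port 1 at bits 25-30, Control at bits 31-34.
    """
    if not 0 <= register_word < (1 << 35):
        raise ValueError("register field must fit in 35 bits")
    port0 = (register_word >> 20) & 0x1F
    port1 = (register_word >> 25) & 0x3F
    control = (register_word >> 31) & 0xF
    if control != 0:
        # Both read ports are enabled; the raw field values are returned.
        return {"control": control, "port0": port0, "port1": port1}
    # Control == 0: Port 1 is disabled and its field is repurposed.
    real_control = port1 >> 2                # upper 4 bits of the Port 1 field
    port0_disabled = bool(port1 & 0b10)      # bit 1: Port 0 disabled
    port0 |= (port1 & 0b1) << 5              # bit 0: high bit of Port 0
    return {
        "control": real_control,
        "port0": None if port0_disabled else port0,
        "port1": None,
    }
```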
321 | 
322 | Before we get to the actual format of the Control field, though, we need to describe one more subtlety. Each instruction's register field contains the writes for the previous instruction, but what about the writes of the last instruction in the clause? Clauses should be entirely self-contained, so we can't look at the first instruction in the next clause. The answer turns out to be that the first instruction in the clause contains the writes for the last instruction. There are a few extra values for the control field, marked "first instruction," which are only used for the first instruction of a clause. The reads are processed normally, but the writes are delayed until the very end of the clause, after the last instruction. The list of values for the control field is below:
323 | 
324 | [options="header"]
325 | |============================
326 | | Value | Meaning
327 | | 1     | Write FMA with Port 2
328 | | 3     | Write FMA with Port 2, read with Port 3
329 | | 4     | read with Port 3
330 | | 5     | Write ADD with Port 2
331 | | 6     | Write ADD with Port 2, read with Port 3
332 | | 8     | Nothing, first instruction
333 | | 9     | Write FMA, first instruction
334 | | 11    | Nothing
335 | | 12    | read with Port 3, first instruction
336 | | 15    | Write FMA with Port 2, write ADD with Port 3
337 | |============================
338 | 
339 | Unlike the other ports, the uniform/const port always loads 64 bits at a time. If an FMA or ADD instruction only needs 32 bits of data, the high 32 bits or low 32 bits are selected later in the source field, described below.
340 | 
341 | The uniform/const bits describe what the uniform/const port should load. If the high bit is set, then the low 7 bits describe which pair of 32-bit uniform registers to load. For example, 10000001 would load from uniform registers 2 and 3. If the high bit isn't set, then the next-highest 3 bits indicate what 64-bit immediate to load, while the low 4 bits contain the low 4 bits of the constant. The mapping from bits to constants is a little strange:
342 | 
343 | [options="header"]
344 | |============================
345 | | Field value | Constant loaded
346 | | 4           | 0
347 | | 5           | 1
348 | | 6           | 2
349 | | 7           | 3
350 | | 2           | 4
351 | | 3           | 5
352 | | 0           | special (see below)
353 | | 1           | unknown (also special?)
354 | |============================
355 | 
356 | The uniform/const port also supports loading a few "special" 64-bit constants that aren't inline immediates, or loaded through the normal uniform mechanism. These usually have to do with fixed-function hardware like blending, alpha testing, etc. The known ones are listed below:
357 | 
358 | [options="header"]
359 | |============================
360 | | Field value | Special constant
361 | | 05 | Alpha-test data (used with ATEST)
362 | | 06 | gl_FragCoord pointer
363 | | 08-0f | Blend descriptors 0-7 (used with BLEND to indicate which output to blend with)
364 | |============================
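As a rough sketch, the 8-bit uniform/const field could be decoded as below. The names are illustrative, and treating the "special" values 05, 06 and 08-0f as the whole 8-bit field (selector bits 0, low nibble as listed) is an assumption based on the two tables above.

```python
# 3-bit selector -> index of the 64-bit immediate, from the table above.
IMMEDIATE_INDEX = {4: 0, 5: 1, 6: 2, 7: 3, 2: 4, 3: 5}


def decode_uniform_const(field):
    """Classify the 8-bit uniform/const field."""
    if field & 0x80:
        pair = field & 0x7F
        # A pair index selects two adjacent 32-bit uniform registers.
        return ("uniform", 2 * pair, 2 * pair + 1)

    selector = (field >> 4) & 0x7
    low4 = field & 0xF
    if selector in IMMEDIATE_INDEX:
        # low4 carries the low 4 bits of the constant itself.
        return ("immediate", IMMEDIATE_INDEX[selector], low4)
    if selector == 0:
        # Assumed: the whole field value names the special constant.
        return ("special", field)
    return ("unknown", field)
```

For instance, `decode_uniform_const(0b10000001)` reports uniform registers 2 and 3, matching the example in the text.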
365 | 
366 | The gl_FragCoord pointer is a pointer to an array, indexed by gl_SampleID in R61 (see the varying interpolation section), of 16-bit vec2's that when loaded with a normal LOAD instruction, gives the sample (xy) position used for calculating gl_FragCoord.
367 | 
368 | == Source fields
369 | 
370 | When the FMA and ADD stages want to use the result of the register stage, they do so through a 3-bit source field in the instruction word. There are as many source fields as there are sources for each operation. The following table shows the meaning of this field:
371 | 
372 | [options="header"]
373 | |============================
374 | | Field value | Meaning
375 | | 0           | Port 0
376 | | 1           | Port 1
377 | | 2           | Port 3
378 | .2+| 3        | FMA: always 0
379 | | ADD: result of FMA unit from same instruction
380 | | 4           | Low 32 bits of uniform/const
381 | | 5           | High 32 bits of uniform/const
382 | | 6           | Result of FMA from previous instruction (FMA passthrough)
383 | | 7           | Result of ADD from previous instruction (ADD passthrough)
384 | |============================
385 | 
386 | == FMA (23 bits)
387 | 
388 | Both the FMA and ADD units have various instruction formats. The high bits are always an opcode, of varying length. They must have the property that no opcode from one format is a prefix for another opcode in a different format. This guarantees that no instruction is ambiguous. Since there's no format tag, it would seem that decoding which format each instruction has is complicated, although it's possible that some trick is used to speed it up. In the disassembler, we just try matching each opcode with the actual one, masking off irrelevant bits. I'm only going to list the categories here, and not the actual opcodes; I'll leave the latter to the disassembler source code, simply because there are a lot of them and it's tedious to type them all up, and error-prone too. The disassembler is at least easy to test, so the chances of making a mistake are lower.
389 | 
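The matching strategy described above might look roughly like the following Python sketch. The opcode values are invented purely for illustration (the real ones are in the disassembler source); only the field widths come from the format tables in this section.

```python
FMA_BITS = 23

# (name, opcode, opcode width in bits, number of 3-bit sources).
# Hypothetical opcodes, chosen so that none is a prefix of another.
FORMATS = [
    ("hypothetical_one_src", 0xE0000, 20, 1),
    ("hypothetical_two_src", 0x0A000, 17, 2),
    ("hypothetical_three_src", 0x00123, 14, 3),
]


def decode_fma(word):
    """Return (format name, source fields), or None if nothing matches."""
    for name, opcode, width, n_src in FORMATS:
        # Shift away the source bits so only the opcode remains.
        if word >> (FMA_BITS - width) == opcode:
            srcs = [(word >> (3 * i)) & 0x7 for i in range(n_src)]
            return name, srcs
    return None


# Because the opcodes are prefix-free, at most one entry can match,
# so the order of FORMATS does not affect the result.
```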
390 | === One Source (FMAOneSrc)
391 | 
392 | [options="header"]
393 | |============================
394 | | Field   | Bits
395 | | Src0    | 3
396 | | Opcode  | 20
397 | |============================
398 | 
399 | === Two Source (FMATwoSrc)
400 | 
401 | [options="header"]
402 | |============================
403 | | Field   | Bits
404 | | Src0    | 3
405 | | Src1    | 3
406 | | Opcode  | 17
407 | |============================
408 | 
409 | === Floating-point Comparisons (FMAFcmp)
410 | 
411 | [options="header"]
412 | |============================
413 | | Field               | Bits
414 | | Src0                | 3
415 | | Src1                | 3
416 | | Src1 absolute value | 1
417 | | unknown             | 1
418 | | Src1 negate         | 1
419 | | unknown             | 3
420 | | Src0 absolute value | 1
421 | | Comparison op       | 3
422 | | Opcode              | 7
423 | |============================
424 | 
425 | Where the comparison ops are given by:
426 | 
427 | [options="header"]
428 | |============================
429 | | Value   | Meaning
430 | | 0 | Ordered Equal
431 | | 1 | Ordered Greater Than
432 | | 2 | Ordered Greater Than or Equal
433 | | 3 | Unordered Not-Equal
434 | | 4 | Ordered Less Than
435 | | 5 | Ordered Less Than or Equal
436 | |============================
437 | 
438 | === Two Source with Floating-point Modifiers (FMATwoSrcFmod)
439 | 
440 | [options="header"]
441 | |============================
442 | | Field               | Bits
443 | | Src0                | 3
444 | | Src1                | 3
445 | | Src1 absolute value | 1
446 | | Src0 negate         | 1
447 | | Src1 negate         | 1
448 | | unknown             | 3
449 | | Src0 absolute value | 1
450 | | unknown             | 2
451 | | Outmod              | 2
452 | | Opcode              | 6
453 | |============================
454 | 
455 | The output modifier (Outmod) is given by:
456 | 
457 | [options="header"]
458 | |============================
459 | | Value   | Meaning
460 | | 0 | Nothing
461 | | 1 | max(output, 0)
462 | | 2 | clamp(output, -1, 1)
463 | | 3 | saturate - clamp(output, 0, 1)
464 | |============================
465 | 
466 | === Three Source (FMAThreeSrc)
467 | 
468 | [options="header"]
469 | |============================
470 | | Field   | Bits
471 | | Src0    | 3
472 | | Src1    | 3
473 | | Src2    | 3
474 | | Opcode  | 14
475 | |============================
476 | 
477 | === Three Source with Floating Point Modifiers (FMAThreeSrcFmod)
478 | 
479 | [options="header"]
480 | |============================
481 | | Field               | Bits
482 | | Src0                | 3
483 | | Src1                | 3
484 | | Src2                | 3
485 | | unknown             | 3
486 | | Src0 absolute value | 1
487 | | unknown             | 2
488 | | Outmod              | 2
489 | | Src0 negate         | 1
490 | | Src1 negate         | 1
491 | | Src1 absolute value | 1
492 | | Src2 absolute value | 1
493 | | Opcode              | 2
494 | |============================
495 | 
496 | === Four Source (FMAFourSrc)
497 | 
498 | [options="header"]
499 | |============================
500 | | Field   | Bits
501 | | Src0    | 3
502 | | Src1    | 3
503 | | Src2    | 3
504 | | Src3    | 3
505 | | Opcode  | 11
506 | |============================
507 | 
508 | == ADD (20 bits)
509 | 
510 | The instruction formats for ADD are similar, except it can only have up to 2 sources instead of FMA's four sources.
511 | 
512 | 
513 | === One Source (ADDOneSrc)
514 | 
515 | [options="header"]
516 | |============================
517 | | Field   | Bits
518 | | Src0    | 3
519 | | Opcode  | 17
520 | |============================
521 | 
522 | === Two Source (ADDTwoSrc)
523 | 
524 | [options="header"]
525 | |============================
526 | | Field   | Bits
527 | | Src0    | 3
528 | | Src1    | 3
529 | | Opcode  | 14
530 | |============================
531 | 
532 | === Two Source with Floating-point Modifiers (ADDTwoSrcFmod)
533 | 
534 | [options="header"]
535 | |============================
536 | | Field               | Bits
537 | | Src0                | 3
538 | | Src1                | 3
539 | | Src1 absolute value | 1
540 | | Src0 negate         | 1
541 | | Src1 negate         | 1
542 | | unknown             | 2
543 | | Outmod              | 2
544 | | unknown             | 2
545 | | Src0 absolute value | 1
546 | | Opcode              | 4
547 | |============================
548 | 
549 | === Floating-point Comparisons (ADDFcmp)
550 | 
551 | [options="header"]
552 | |============================
553 | | Field               | Bits
554 | | Src0                | 3
555 | | Src1                | 3
556 | | Comparison op       | 3
557 | | unknown             | 2
558 | | Src0 absolute value | 1
559 | | Src1 absolute value | 1
560 | | Src0 negate         | 1
561 | | Opcode              | 6
562 | |============================
563 | 
564 | = Special instructions
565 | 
566 | This section describes some of the instructions that interact with fixed-function units like the texture sampler, varying interpolation unit, etc. They oftentimes use a special format, since they have to pass extra information to the fixed-function block.
567 | 
568 | == Texture instructions
569 | 
570 | A single instruction encoding is used for every texture operation. What operation to do, and what texture/sampler to use, is encoded in a 32-bit control word which is loaded from the uniform/immediate port. The intention is that it's an immediate, basically used to extend the instruction. Trying to encode everything in the instruction wouldn't go too well, simply because there are too many possible combinations.
571 | 
572 | The texture instruction is encoded similarly to a normal two-source instruction, with the bottom 6 bits devoted to sources, except that bit 6 is also used to encode whether the control word comes from the low 32 bits or high 32 bits of the port. The instruction can read up to 6 registers, with two coming from normal sources and four coming from the special per-clause register (hence they must be adjacent, starting on a multiple of 4). The texture unit then writes the result to the same per-clause register. Hence, those 4 registers are (sometimes) read and then later written.
573 | 
574 | The format of the control word is as follows:
575 | 
576 | [options="header"]
577 | |============================
578 | | Field                        | Bits
579 | | Sampler Index/Indirects      | 4
580 | | Texture Index                | 7
581 | | Separate Sampler and Texture | 1
582 | | Filter                       | 1
583 | | unknown                      | 2
584 | | Texel Offset                 | 1
585 | | Shadow                       | 1
586 | | Array                        | 1
587 | | Texture Type                 | 2
588 | | Compute LOD                  | 1
589 | | No LOD Bias                  | 1
590 | | Calculate Gradients          | 1
591 | | unknown                      | 1
592 | | Result Type                  | 4
593 | | unknown                      | 4
594 | |============================
595 | 
596 | Unlike other architectures, where samplers and textures are just handles passed around, Bifrost seems to use an older binding-table-based approach. There are two tables, one for samplers and one for textures, containing all the state for each, and the texture instruction chooses an index into each table. If the "Separate Sampler and Texture" bit is set, then the "Texture Index" supplies the index into the texture table, and "Sampler Index/Indirects" provides the index into the sampler table. If the bit is unset, then by default, both indices come from the "Texture Index" field. However, the "Sampler Index/Indirects" field then provides a bitmask of indirect sources. If the low bit is set, the texture index is "indirect," i.e. it is obtained from a register source. Similarly, if the second-lowest bit is set, the sampler index is indirect. The next two bits are unknown, and always observed as 11. Note that indirect indices are only possible if "Separate Sampler and Texture" is unset -- if it is set, then both indices are always taken directly from their respective fields in the control word.
597 | 
598 | The "filter" bit is unset for `texelFetch` and `textureGather` operations, where the normal bilinear filtering pipeline is bypassed. The "texel offset" is used for `textureOffset`, and specifies an extra register source for the offset. "Shadow" is used for shadow textures, and implies an extra source for the depth reference. "Array" is similarly used for array textures, and implies another source for the array index. Texture Type indicates the type of texture, as in the following table:
599 | 
600 | [options="header"]
601 | |============================
602 | | Field Value | Type
603 | | 0 | Cube
604 | | 1 | Buffer
605 | | 2 | 2D
606 | | 3 | 3D
607 | |============================
608 | 
609 | The next three bits select between `texture`, `texture` with an LOD bias passed, `textureLod`, and `textureGrad`. All of these provide different ways of changing the computed Level of Detail before sampling the texture. The usage is as follows:
610 | 
611 | [options="header"]
612 | |============================
613 | | GLSL operation          | Compute LOD | No LOD bias | Calculate Gradients
614 | | plain `texture`         | 1 | 1 | 1
615 | | `texture` with LOD bias | 1 | 0 | 1
616 | | `textureLod`            | 0 | 0 | 1
617 | | `textureGrad`           | 1 | 1 | 0
618 | |============================
619 | 
620 | The actual gradients for `textureGrad` would take up too many registers, so they get stored in a separate earlier texture instruction with some unknown fields changed.
621 | 
622 | Finally, the result type is interpreted based on the following table:
623 | 
624 | [options="header"]
625 | |============================
626 | | Value | Result Type
627 | | 4 | 32-bit floating point
628 | | 14 | 32-bit signed integer
629 | | 15 | 32-bit unsigned integer
630 | |============================
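To tie the control-word tables together, here is a hedged Python sketch that splits the 32-bit word into named fields and resolves where the texture and sampler indices come from. It assumes the fields are packed from the least significant bit in table order; all identifiers are illustrative.

```python
# (field name, width in bits), assumed to start at the least significant bit.
TEXTURE_CONTROL_LAYOUT = [
    ("sampler_index_or_indirects", 4),
    ("texture_index", 7),
    ("separate_sampler_and_texture", 1),
    ("filter", 1),
    ("unknown0", 2),
    ("texel_offset", 1),
    ("shadow", 1),
    ("array", 1),
    ("texture_type", 2),
    ("compute_lod", 1),
    ("no_lod_bias", 1),
    ("calculate_gradients", 1),
    ("unknown1", 1),
    ("result_type", 4),
    ("unknown2", 4),
]
TEXTURE_TYPES = {0: "Cube", 1: "Buffer", 2: "2D", 3: "3D"}
RESULT_TYPES = {4: "f32", 14: "i32", 15: "u32"}


def decode_texture_control(word):
    """Split the 32-bit control word into a dict of raw field values."""
    fields = {}
    shift = 0
    for name, width in TEXTURE_CONTROL_LAYOUT:
        fields[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return fields


def texture_and_sampler(fields):
    """Return (texture, sampler), each ("direct", index) or ("indirect", None)."""
    tex = fields["texture_index"]
    aux = fields["sampler_index_or_indirects"]
    if fields["separate_sampler_and_texture"]:
        return ("direct", tex), ("direct", aux)
    # Otherwise aux is a bitmask: bit 0 = texture indirect, bit 1 = sampler indirect.
    texture = ("indirect", None) if aux & 1 else ("direct", tex)
    sampler = ("indirect", None) if aux & 2 else ("direct", tex)
    return texture, sampler
```

An indirect index is reported without a value because it lives in a register source, not in the control word.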
631 | 
632 | TODO: compact texture instructions, dual texture instructions
633 | 
634 | == Varying Interpolation
635 | 
636 | The varying interpolation unit is responsible for, well, interpolating varyings. This gets more complicated with support for per-sample shading and centroid interpolation, which is where most of the complexity comes from. In per-sample shading, the fragment shader is run for each sample, compared to traditional multisampling where the fragment shader is run once per pixel, and samples have different colors only if they are covered by different triangles (or with alpha-to-coverage). In that case, the coordinates used for sampling have to come from the current sample. If the fragment shader is run per-pixel as usual, then the coordinates are normally taken from the center of the pixel. But this doesn't work so well for multisampling, where the sample might not actually lie on the triangle if the triangle doesn't go through the center of the pixel. So, there's an alternate "centroid" sampling method that chooses the coordinates based on the covered samples. Finally, the shader can specify an offset for the coordinates through `interpolateAtOffset` in GLSL. After all this, the result is the coordinates of the point to be sampled. These coordinates are converted to barycentric coordinates, and then the usual barycentric interpolation is applied to the varyings at all three corners of the triangle to get the final result.
637 | 
638 | On Bifrost, this is all handled through a single instruction. Varyings are always vec4-aligned, i.e. the address is always expressed in 128-bit units, even though there are instructions for interpolating single floats, vec2's, vec3's and vec4's. The instruction always has at least one source, which is always R61. Before the shader starts, R61 is filled out as follows: the low 16 bits contain the sample mask (`gl_SampleMaskIn`), the next 4 bits contain the sample ID if doing per-sample shading (`gl_SampleID`), and the rest of the bits are unknown (always 0?). The sample mask is needed for doing centroid sampling, while the sample ID is needed for doing per-sample shading.
639 | 
640 | The format of the instruction is as follows:
641 | 
642 | [options="header"]
643 | |============================
644 | | Field                        | Bits
645 | | Src0                         | 3
646 | | Address                      | 5
647 | | Components                   | 2
648 | | Interpolation Type           | 2
649 | | Reuse previous coordinates   | 1
650 | | Flat shading                 | 1
651 | | Opcode (=10100)              | 5
652 | |============================
653 | 
654 | The address field requires some explanation. Only addresses under 20 can be encoded directly. This table shows exactly how the field is decoded.
655 | 
656 | [options="header"]
657 | |============================
658 | | Bit pattern | Meaning
659 | | `0xxxx`       | Interpolate at address `0xxxx`.
660 | | `100xx`       | Interpolate at address `100xx`.
661 | | `10110`       | gl_FragCoord.w
662 | | `10111`       | gl_FragCoord.z
663 | | `11xxx`       | The address is indirect, given by an additional source indicated by `xxx`.
664 | |============================
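The address table can be expressed as a small Python function. This is a sketch with illustrative names; the patterns `10100` and `10101` are not covered by the table, so they are reported as unknown.

```python
def decode_varying_address(addr):
    """Classify the 5-bit address field of the interpolation instruction."""
    addr &= 0x1F
    if addr < 20:
        # Covers both 0xxxx (0-15) and 100xx (16-19).
        return ("direct", addr)
    if addr == 0b10110:
        return ("gl_FragCoord.w", None)
    if addr == 0b10111:
        return ("gl_FragCoord.z", None)
    if addr >> 3 == 0b11:
        # The low 3 bits are a regular source field naming the register.
        return ("indirect", addr & 0x7)
    return ("unknown", addr)
```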
665 | 
666 | To get the number of components, add one to the components field.
667 | 
668 | The interpolation field has the following meaning:
669 | 
670 | [options="header"]
671 | |============================
672 | | Value | Meaning
673 | | 0 | Force per-fragment sampling (when per-sample shading is enabled).
674 | | 1 | Centroid sampling.
675 | | 2 | Normal -- per-sample shading if enabled through GL state, otherwise pixel center.
676 | | 3 | Use explicit coordinates loaded through a previous instruction.
677 | |============================
678 | 
679 | If the coordinates computed would be the same as the previous interpolation instruction, the "reuse previous coordinates" bit can be set. Finally, the "flat shading" bit enables flat shading, where the varying is chosen from one of the triangle corners based on the GL provoking vertex rules.
680 | 


--------------------------------------------------------------------------------
/Midgard.md:
--------------------------------------------------------------------------------
  1 | # Midgard Architecture
  2 | 
  3 | ## Publicly available information
  4 | 
  5 | ARM has been a lot more open this time with the architecture behind the T6xx. For a good overview with some slides from ARM, see [this Anandtech article](http://www.anandtech.com/show/6136/arm-announces-8core-2nd-gen-malit600-gpus). T6xx is the first Mali unified architecture; unlike the Mali 200/400, the vertex and fragment shaders use the same pipelines. There are 3 separate pipelines: ALU, Load/Store, and Texture lookup (A, L, and T in the verbose output of the compiler). The Mali-600 target for the compiler (T604, T622, T624, T628) has 2 ALU's and so can execute 2 ALU ops per cycle, and the T650 target (T658, T678) has 4 ALU's.
  6 | 
  7 | ## Patents
  8 | 
  9 | [Data processing apparatus and method for processing a received workload in order to generate result data](https://www.google.com/patents/US20120304194?hl=en&sa=X&ei=lpT0UZP-MPjk4AOZu4HwDg&ved=0CDsQ6AEwAQ)
 10 | 
 11 | [Processing order with integer inputs and floating point inputs](https://www.google.com/patents/US20120299935?hl=en&sa=X&ei=lpT0UZP-MPjk4AOZu4HwDg&ved=0CDQQ6AEwAA)
 12 | 
 13 | [Floating-point vector normalisation](https://www.google.com/patents/WO2012038708A1?cl=en&hl=en&sa=X&ei=lpT0UZP-MPjk4AOZu4HwDg&ved=0CEIQ6AEwAg)
 14 | 
 15 | [Vector floating point argument reduction](https://www.google.com/patents/US20120078987?hl=en&sa=X&ei=p5r0UbSJHtXh4AOqo4HYCw&ved=0CF0Q6AEwBjgK)
 16 | 
 17 | [Next-instruction-type field](https://www.google.com/patents/EP2585906A1?cl=en&hl=en&sa=X&ei=SZn0UdHEOrj54AOOsYHIBg&ved=0CDsQ6AEwAQ)
 18 | 
 19 | [Generating and resolving pixel values within a graphics processing pipeline](https://www.google.com/patents/US8059144?hl=en&sa=X&ei=lpP0UcbvDvOt4AP_7IDgAg&ved=0CD0Q6AEwAQ)
 20 | 
 21 | [Number format pre-conversion instructions](https://www.google.com/patents/US20120215822?hl=en&sa=X&ei=1pn0UYb8Orix4APMnICQBg&ved=0CFIQ6AEwBA)
 22 | 
 23 | [Graphics processing](https://www.google.com/patents/US20120223946?hl=en&sa=X&ei=p5r0UbSJHtXh4AOqo4HYCw&ved=0CEgQ6AEwAzgK)
 24 | 
 25 | [Embedded opcode within an intermediate value passed between instructions](https://www.google.com/patents/US20120204006?hl=en&sa=X&ei=p5r0UbSJHtXh4AOqo4HYCw&ved=0CHIQ6AEwCTgK)
 26 | 
 27 | [Processor with a flag to switch between using dedicated hardware to execute a function and executing the function in software](https://patents.google.com/patent/GB2481819A/en) (Describes the blending mechanism)
 28 | 
 29 | ## Instruction format
 30 | 
 31 | It appears that the shader binaries are the same between the T600 and T650 targets; the only difference is in how many cycles it takes to execute the ALU instructions (T650 takes half as many due to having twice as many ALU's). The shader consists of a stream of instructions. There are 3 types of instruction words, corresponding to the three pipelines; instruction words can appear in any order. Each instruction is always started only after the previous instruction has fully completed, and like on the Mali 200 PP the pipeline is barreled so a number of threads, potentially with different shaders, are running at once (see next-instruction-type patent). Instruction words are always a multiple of 4 words (128 bits). They can be parsed by reading the type from the lowest 4 bits of the first word:
 32 | 
 33 |     3 - Texture (4 words)
 34 |     5 - Load/Store (4 words)
 35 |     8 - ALU (4 words)
 36 |     9 - ALU (8 words)
 37 |     A - ALU (12 words)
 38 |     B - ALU (16 words)
 39 | 
 40 | The next 4 bits (bits 4-7) store the type of the next instruction (presumably for prefetch purposes), except if either the instruction is the last instruction or it's the second-to-last and the last instruction is an ALU instruction, in which case the value of 1 is used.
 41 | 
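A shader binary could be walked along these lines. This is a minimal Python sketch, assuming the binary has already been split into a list of 32-bit words; the function name and return shape are made up for illustration.

```python
# Low 4 bits of the first word -> (pipeline, size in 32-bit words).
WORD_TYPES = {
    0x3: ("Texture", 4),
    0x5: ("Load/Store", 4),
    0x8: ("ALU", 4),
    0x9: ("ALU", 8),
    0xA: ("ALU", 12),
    0xB: ("ALU", 16),
}


def split_instruction_words(words):
    """Split a list of 32-bit words into (pipeline, next_type, words) tuples."""
    out = []
    pos = 0
    while pos < len(words):
        tag = words[pos] & 0xF
        if tag not in WORD_TYPES:
            raise ValueError(f"unknown instruction type {tag:#x} at word {pos}")
        pipeline, size = WORD_TYPES[tag]
        next_type = (words[pos] >> 4) & 0xF   # bits 4-7
        out.append((pipeline, next_type, words[pos:pos + size]))
        pos += size
    return out
```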
 42 | ## Comparison to Mali-200 PP
 43 | 
 44 | It seems like the architecture for the T6xx pipeline is based off of the Mali-200 PP pipeline. Both are barreled with a relatively deep pipeline that can execute a number of threads, possibly with different shaders/uniforms/other state at the same time. The main difference is that the single, large pipeline is broken down into 3 smaller pipelines. This simplifies the logic; the ALU pipelines don't need to know how to access memory, the Load/Store and texture pipelines don't need to access work registers or uniform registers, and only the texture pipeline needs to have logic for synchronizing threads in a group and exchanging values for computing derivatives. There are more work registers, and now there's a uniform register file, in addition to normal uniform buffers accessed through the load/store pipeline, and perhaps a Radeon-like register sharing mechanism (the compiler now reports the number of work registers and uniform registers used).
 45 | 
 46 | ## Registers
 47 | 
 48 | The ALU pipeline can read/write to 32 128-bit registers, which can be divided into 4 32-bit (highp in GLSL) components (one vec4) or 8 16-bit (mediump) components (two vec4's). Some of the registers, however, are dedicated to special purposes (see below) and are read-only or write-only.
 49 | 
 50 | ## Special Registers
 51 | 
 52 |     r24 - can mean "unused" for 1-src instructions, or a pipeline register
 53 |     r26 - inline constant
 54 |     r27 - load/store offset when used as output register
 55 |     r28-r29 - texture pipeline results
 56 |     r31.w - conditional select input when written to in scalar add ALU
 57 | 
 58 | r0 - r23 is divided into two spaces: work registers and uniform registers. A configurable number of registers can be devoted to each; if there are N uniform registers, then r0 - r(23-N) are work registers and r(24-N)-r23 are uniform registers.
 59 | 
 60 | ## ALU Words
 61 | 
 62 | The first (32-bit) word is a control word which, in addition to the usual 8-bit tag in bits 0-7, contains a bitfield describing which ALU's in the pipeline are in use. There are 5 ALU's, in addition to a framebuffer write (?) and branch unit.
 63 | 
 64 |     0-3: instruction word type (0x8-0xB)
 65 |     4-7: next instruction word type
 66 |     17: vector multiply ALU (48 bits)
 67 |     19: scalar add ALU (32 bits)
 68 |     21: vector add ALU (48 bits)
 69 |     23: scalar multiply ALU (32 bits)
 70 |     25: LUT / multiply ALU 2 (48 bits)
 71 |     26: compact output write/branch (16 bits)
 72 |     27: extended output write/branch (48 bits)
 73 | 
 74 | It's not clear why only every other bit is used for the ALU's (fp64?).
 75 | 
 76 | After the control word comes a series of 16-bit words, one for each enabled ALU (up to 5) which control the input and output registers for each ALU. After that come the actual fields for each ALU/unit, whose sizes are noted in the table above. The instruction word is then padded with 0's to make sure it is a multiple of 4 words. Finally, embedded constants may be inserted, which consist of 4 32-bit numbers, interpreted as 4 IEEE 32-bit floats if the input is a float.
 77 | 
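The layout described above implies a simple size calculation, sketched here in Python. It is an estimate under the stated rules only: it counts the control word, one 16-bit register word per enabled ALU, and the per-unit fields, then pads to 128 bits; embedded constants are not included, and the names are illustrative.

```python
# (control-word bit, unit name, field size in bits, has a 16-bit register word)
UNITS = [
    (17, "vector multiply", 48, True),
    (19, "scalar add", 32, True),
    (21, "vector add", 48, True),
    (23, "scalar multiply", 32, True),
    (25, "LUT / multiply 2", 48, True),
    (26, "compact output write/branch", 16, False),
    (27, "extended output write/branch", 48, False),
]


def alu_word_layout(control):
    """Return (enabled unit names, used bits, padded size in 32-bit words)."""
    enabled = [u for u in UNITS if control & (1 << u[0])]
    bits = 32                                         # the control word itself
    bits += 16 * sum(1 for u in enabled if u[3])      # register words
    bits += sum(u[2] for u in enabled)                # per-unit fields
    padded_bits = -(-bits // 128) * 128               # round up to 128 bits
    return [u[1] for u in enabled], bits, padded_bits // 32
```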
 78 | ### Register word format
 79 | 
 80 |     0-4: input 1 register
 81 |     5-9: input 2 register / inline constant
 82 |     10-14: output register
 83 |     15: input 2 inline constant
 84 | 
 85 | The register 2 inline constant is a way to store a 16-bit float directly in the instruction. The upper 5 bits (15-11) are stored where the input 2 register would normally go, and the lower 11 bits (0-10) are stored in the ALU field as defined below. The constant is splattered across all 4 components of the input. This is much more compact than the normal embedded constants, but much more limited as well.
 86 | 
 87 | ### Vector ALU word format
 88 | 
 89 | The vector multiply, add, and LUT ALU's share the same instruction format.
 90 | 
 91 |     0-7: opcode
 92 |     8-9: input/output mode
 93 |         1 - half (16-bit)
 94 |         2 - full (32-bit)
 95 |     10: input 1 abs
 96 |     11: input 1 neg
 97 |     if input/output mode is half:
 98 |         12: input 1 replicate lower half-register
 99 |         13: input 1 replicate upper half-register
100 |     otherwise:
101 |         12: input 1 half-register selection (high or low)
102 |         13: unused
103 |     14: input 1 half-register (when output is a full register)
104 |     15-22: input 1 swizzle
105 |     23: input 2 abs
106 |     24: input 2 neg
107 |     if "input 2 inline constant" set:
108 |         25-35: input 2 inline constant low 11 bits
109 |         25-27: inline const 8-10
110 |         28-35: inline const 0-7
111 |     otherwise:
112 |         if input/output mode is half:
113 |             25: input 2 replicate lower half-register
114 |             26: input 2 replicate upper half-register
115 |         otherwise:
116 |             25: input 2 half-register selection (high or low)
117 |             26: unused
118 |         27: input 2 half-register (when output is a full register)
119 |         28-35: input 2 swizzle
120 |     36-37: output size override
121 |         0 - half, write to lower half
122 |         1 - half, write to upper half
123 |         2 - normal
124 |         Note: I've only seen this for comparison instructions that compare two full floats or ints and need to return a half float
125 |     38-39: output modifier
126 |         0 - none
127 |         1 - clamp positive
128 |         2 - output integer
129 |         3 - saturate
130 |     40-47: write mask
131 |         2 bits for each output when 32-bit, 1 bit when 16-bit
132 | 
133 | When the register mode is set to half, the operation is performed on the high and low half-registers at the same time. The low 4 bits of the write mask control what components of the low half-register are written, and the high 4 bits control the high half-register. Normally, the operation is performed on the input 1 low register and input 2 low register to produce the output low register, and on the input 1 high register and input 2 high register to produce the high register. This can be overridden, however, by the "input 1/2 replicate lower/upper half-register" bits which cause the given half-register to be used as an input to both operations at once.
134 | 
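The inline constant for a vector ALU is split between the register word and the ALU field, so reassembling it takes a little bit-shuffling. The Python sketch below follows the bit positions listed above; it assumes the upper 5 bits sit in the input 2 register field in their natural order, and the function name is hypothetical.

```python
import struct


def vector_inline_constant(register_word, alu_word):
    """Return the inline constant as a float, or None if bit 15 is clear."""
    if not (register_word >> 15) & 1:
        return None
    upper5 = (register_word >> 5) & 0x1F      # constant bits 11-15
    bits_8_10 = (alu_word >> 25) & 0x7        # ALU bits 25-27
    bits_0_7 = (alu_word >> 28) & 0xFF        # ALU bits 28-35
    raw = (upper5 << 11) | (bits_8_10 << 8) | bits_0_7
    # Reinterpret the 16 raw bits as an IEEE half-precision float.
    return struct.unpack("<e", struct.pack("<H", raw))[0]
```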
135 | ### Scalar ALU word format
136 | 
137 | The scalar multiply and add ALU's have the same format as well.
138 | 
139 |     0-7: opcode
140 |     8: input 1 abs
141 |     9: input 1 negate
142 |     10: input 1 size (0 = half, 1 = full)
143 |     if input 1 size = full
144 |         11: unused
145 |         12-13: input 1 component
146 |     otherwise:
147 |         11-12: input 1 component
148 |         13: input 1 half-register selection (high or low)
149 |     if "input 2 inline constant" set:
150 |         14-24: input 2 inline constant low 11 bits
151 |         14-15: inline const 9-10
152 |         16: inline const 8
153 |         17-19: inline const 5-7
154 |         20-24: inline const 0-4
155 |     otherwise:
156 |         14: input 2 abs
157 |         15: input 2 negate
158 |         16: input 2 size (0 = half, 1 = full)
159 |         17-18: input 2 component
160 |         19-24: unknown
161 |     25: unknown
162 |     26-27: output modifier
163 |         0 - none
164 |         1 - clamp positive
165 |         2 - output integer
166 |         3 - saturate
167 |     28: output size
168 |         0 - half
169 |         1 - full
170 |     if output size = full
171 |         29: unused
172 |         30-31: output component
173 |     otherwise:
174 |         29-30: output component
175 |         31: output half-register selection (high or low)
176 | 
177 | ### Opcodes
178 | 
179 |     10 - fadd
180 |     14 - fmul
181 |     28 - fmin
182 |     2C - fmax
183 |     30 - fmov
184 |     36 - ffloor
185 |     37 - fceil
186 |     3C - fdot3
187 |     3D - fdot3r
188 |     3E - fdot4
189 |     3F - freduce
190 |     40 - iadd
191 |     46 - isub
192 |     58 - imul
193 |     7B - imov
194 |     80 - feq
195 |     81 - fne
196 |     82 - flt (less than)
197 |     83 - fle (less than or equal)
198 |     99 - f2i
199 |     A0 - ieq
200 |     A1 - ine
201 |     A4 - ilt
202 |     A5 - ile
203 |     B8 - i2f
204 |     C5 - csel (conditional select)
205 |     E8 - fatan_pt2
206 |     F0 - frcp (reciprocal)
207 |     F2 - frsqrt (inverse square root, 1/sqrt(x))
208 |     F3 - fsqrt (square root)
209 |     F4 - fexp2 (2^x)
210 |     F5 - flog2
211 |     F6 - fsin
212 |     F7 - fcos
213 |     F9 - fatan_pt1
214 |         Note: for sin and cos, the input needs to be divided by pi
215 | 
216 | 
217 | Pseudocode for how atan/atan2 is implemented:
218 | 
219 |     vec4 temp1.xzw = fatan_pt1(x, y); //Note: a vec4 temporary is required, although the write mask is xzw so the y component isn't affected
220 |     float result = fatan_pt2(temp1.x, temp1.z * temp1.w);
221 | 
222 | To do atan instead of atan2, replace y with 1.0. asin and acos are implemented just like in the Mali 200 PP.
223 | 
224 | ### Compact branch/writeout
225 | 
226 | This field is used for encoding branches, as well as framebuffer writes. So far, we've only figured out branches. The branch offsets possible are rather limited. If you need a greater offset, you need to use the normal encoding instead. The bottom three bits are the opcode:
227 | 
228 |     001 - unconditional branch
229 |     010 - conditional branch
230 |     111 - branch or write to framebuffer (see patent)
231 | 
232 | The rest of the 16-bit word is different depending on the opcode. For conditional branches and framebuffer writes, it looks like this:
233 | 
234 |     0-2: opcode (=010)
235 |     3-6: Type of instruction to branch to
236 |         Note: this complements the next-instruction-type field that every instruction has, which gives the type of the next instruction to execute if there is no branching.
237 |     7-13: Offset to branch to
238 |         This gives the low 7 bits of the offset in units of quadwords (16 bytes) relative to the next instruction that would be executed. The offset is sign-extended to the full range.
239 |     14-15: Branch condition
240 |         01 - branch if r31.w is false
241 |         10 - branch if r31.w is true
242 |         11 - branch if writeout dependencies not yet met
243 | 
244 | For unconditional branches, it looks like this:
245 | 
246 |     0-2: opcode (=001)
247 |     3-6: Type of instruction to branch to
248 |         Same as conditional branches
249 |     7-8: unknown
250 |         Always 01 so far.
251 |     9-15: Offset to branch to
252 |         Same as conditional branches.
253 |         
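
The compact branch layouts above can be sketched as a small decoder. This is only an illustration of the bit positions listed in this section: the function and key names are hypothetical, and treating the unconditional offset (bits 9-15) as a sign-extended 7-bit value is an assumption based on the "same as conditional branches" note.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)


def decode_compact_branch(word):
    """Split a 16-bit compact branch word into its documented fields."""
    opcode = word & 0x7
    target_type = (word >> 3) & 0xF
    if opcode == 0b001:
        # Unconditional: bits 7-8 unknown, offset in bits 9-15 (assumed signed).
        offset = sign_extend((word >> 9) & 0x7F, 7)
        condition = None
    else:
        # Conditional branch / framebuffer write: offset in bits 7-13.
        offset = sign_extend((word >> 7) & 0x7F, 7)
        condition = (word >> 14) & 0x3
    return {
        "opcode": opcode,
        "target_type": target_type,
        "offset_quadwords": offset,
        "offset_bytes": offset * 16,  # offsets are in units of 16 bytes
        "condition": condition,
    }
```

For example, a word built from opcode 010, target type 5, offset field 0x7E, and condition 10 decodes to an offset of -2 quadwords.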
254 | ### Extended branch/writeout
255 | 
256 | This is just a larger form of the above, with more space for the offset field in order to encode larger offsets. All of the fields have the same meaning. It looks like this:
257 | 
258 |     0-2: Opcode
259 |     3-6: Type of instruction to branch to
260 |     7-8: Unknown
261 |         Always 01 so far.
262 |     9-31: Branch offset
263 |         Also sign-extended.
264 |     32-33: Branch condition
265 |     34-47: 7 copies of branch condition above (Haven't seen anything different so far)
266 | 
267 | ## Load/store words
268 | 
269 | The load/store word consists of the standard 8-bit tag, followed by two 60-bit instructions whose format is described below. Each instruction can load or store up to 128 bits at once.
270 | 
271 |     0-7: opcode
272 |         03 - noop (no load/store)
273 |         94 - load attribute (32-bit)
274 |         95 - load attribute (16-bit)
275 |         98 - load varying (32-bit)
276 |         99 - load varying (16-bit)
277 |         AC - load uniform (16-bit)
278 |         B0 - load uniform (32-bit)
279 |         D4 - store varying (32-bit)
280 |         D5 - store varying (16-bit)
281 |     8-12: source/destination register
282 |     13-16: mask
283 |     17-24: swizzle
284 |     25-50: unknown
285 |     51-59: load/store address
286 | 
287 | The mask and swizzle act like a move instruction. For example, a load with a mask of xzw and a swizzle of xywz means "take the x, w, and z components of the input and move them into the x, z, and w components of the register respectively."
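
The move-like behaviour of the mask and swizzle can be modelled in a few lines. This is a sketch only: the function name is hypothetical, and the mask and swizzle are passed as component strings because the exact bit encoding of the two fields is not spelled out here.

```python
COMPONENTS = "xyzw"


def masked_swizzled_move(dest, src, mask, swizzle):
    """Return dest after writing the swizzled src through the write mask.

    dest, src: 4-element sequences (x, y, z, w).
    mask:      string of components to write, e.g. "xzw".
    swizzle:   4-character string; swizzle[i] names the source component
               that feeds destination component i.
    """
    result = list(dest)
    for i, name in enumerate(COMPONENTS):
        if name in mask:
            result[i] = src[COMPONENTS.index(swizzle[i])]
    return result
```

With a mask of `"xzw"` and a swizzle of `"xywz"`, the destination's x, z, and w receive the source's x, w, and z, and the destination's y is left alone.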
288 | 
289 | TODO: indirect access
290 | 
291 | TODO: uniform buffers
292 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # mali-isa-docs
2 | A collection of reverse-engineered documentation for the instruction sets of various generations of Mali GPUs. This is mostly my own work, and since it is not "official," there may be inaccuracies, missing things, etc.
3 | 
4 | Documentation for Bifrost is [here](Bifrost.adoc).
5 | 
6 | Documentation for Midgard is [here](Midgard.md).
7 | 
8 | Utgard wasn't a unified architecture, so there are two completely different engines for vertex shaders and fragment shaders, the former called the GP and the latter the PP. Documentation for the GP is [here](Utgard-GP.md) and documentation for the PP is [here](Utgard-PP.md).
9 | 


--------------------------------------------------------------------------------
/Utgard-GP.md:
--------------------------------------------------------------------------------
  1 | # Utgard GP
  2 | 
  3 | The GP (used for vertex shaders on Utgard) has a scalar VLIW architecture. Each instruction has a field for 2 addition ALU's, 2 multiplication ALU's, a complex ALU, a passthrough ALU, an attribute load unit, a register load unit, a uniform/temporary load unit, and a varying/register/temporary store unit. Instructions are fixed-length - each instruction consists of 4 words. Constants are implemented internally by uniforms.
  4 | 
  5 | Unlike a normal CPU, there are no explicit output registers for the ALU's, nor are there any explicit input registers. Instead, the input field(s) for each ALU can directly reference the ALU results from previous instructions (see below). However, there are 16 registers that can be used when two instructions cannot be scheduled so that one references the result of the other (either directly, or through one or more passthroughs), or for special cases such as loops. Only one (4-component) register may be loaded & stored per instruction, and storing registers and temporaries shares some of the same fields as storing varyings.
  6 | 
  7 | # Temporaries
  8 | 
  9 | Akin to the fragment shader, there are also temporaries, which unlike registers can be indexed using a base register/input. They share the same namespace and method of loading as uniforms. Storing temporaries uses the same fields as storing a register/varying does, except that the "temporary store flag" is enabled, the unknown field is changed, and the complex ALU is used to select the store address instead of the "varying/register store 0" and "varying/register store 1" fields, which are set to 0.
 10 | 
 11 | ## Output Transformation
 12 | 
 13 | gl_Position is implemented internally as a varying; it seems that it is hard-coded to varying 0. The compiler implements some transforms internally to convert the value calculated for gl_Position in the shader to the actual value sent to the hardware. In particular (in pseudocode):
 14 | 
 15 |     uniform vec4 gl_mali_ViewportTransform[2];
 16 |     gl_position_actual.w = clamp(1.0 / gl_Position.w, -1e10, 1e10);
 17 |     gl_position_actual.xyz  = gl_Position.xyz * gl_position_actual.w * gl_mali_ViewportTransform[0].xyz + gl_mali_ViewportTransform[1].xyz;
 18 | 
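
The gl_Position transform above can be checked numerically with a short sketch. The function name and the viewport values used in the usage note are hypothetical; the handling of w = 0 (treating the reciprocal as an infinity that the clamp then limits) is an assumption about what the hardware would do.

```python
import math


def clamp(value, low, high):
    return max(low, min(value, high))


def transform_position(position, viewport_transform):
    """Apply the gl_Position transform from the pseudocode above.

    position:           (x, y, z, w) as written by the shader.
    viewport_transform: pair of xyz tuples, matching
                        gl_mali_ViewportTransform[0] and [1].
    """
    x, y, z, w = position
    scale, offset = viewport_transform
    # Assumption: 1/0 behaves like a signed infinity and is then clamped.
    reciprocal = 1.0 / w if w != 0.0 else math.copysign(math.inf, w)
    inv_w = clamp(reciprocal, -1e10, 1e10)
    xyz = tuple(c * inv_w * s + o for c, s, o in zip((x, y, z), scale, offset))
    return xyz + (inv_w,)
```

For instance, `transform_position((2.0, 4.0, 6.0, 2.0), ((400.0, 300.0, 0.5), (400.0, 300.0, 0.5)))` yields `(800.0, 900.0, 2.0, 0.5)`.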
 19 | gl_PointSize is also implemented internally as a varying. However, its position doesn't appear to be fixed. There are also some transforms involved:
 20 | 
 21 |     uniform vec4 gl_mali_PointSizeParameters;
 22 |     gl_PointSize_actual = clamp(gl_PointSize, gl_mali_PointSizeParameters.x, gl_mali_PointSizeParameters.y) * gl_mali_PointSizeParameters.z;
 23 | 
 24 | ## Complex functions
 25 | 
 26 | Complex functions are implemented using multiply ALU opcodes 1 and 3, as well as various complex ALU opcodes. The computation looks like this:
 27 | 
 28 |     output = mul.op1(complex(input), mul.op3(input, input), complex(input), input)
 29 | 
 30 | For exp2, it looks like this (note the addition of pass.op4):
 31 | 
 32 |     output = mul.op1(complex.exp2(pass.op4(input)), mul.op3(pass.op4(input), pass.op4(input)), complex.exp2(pass.op4(input)), pass.op4(input))
 33 | 
 34 | and finally for log2 it looks like this:
 35 | 
 36 |     output = pass.op5(mul.op1(complex.log2(input), mul.op3(input, input), complex.log2(input), input))
 37 | 
 38 | It would seem that pass.op5 performs the opposite of pass.op4.
 39 | 
 40 | ## Inputs to the ALU's
 41 | 
 42 | These are the known inputs:
 43 | 
 44 |     0-3:   Register 0 Output [0, current] (Register/Attribute)
 45 |     4-7:   Register 1 Output [0, current] (Register)
 46 |     8:     Unused, same as 21? (seen in m200_hw_workarounds.c nop shader)
 47 |     9-11:  Unknown
 48 |     12-15: Load Result [0, current] (Uniform/Temporary)
 49 |     16,17: Accumulator 0,1 Output [-1, last instruction]
 50 |     18,19: Multiplier 0,1 Output [-1, last instruction]
 51 |     20:    Passthrough Output [-1, last instruction]
 52 |     21:    Unused/nop (i.e. this ALU is not used during this instruction)
 53 |     22:    Complex Output [-1, last instruction]
 54 |     22:    Identity/Passthrough (0 for add, 1 for multiply)
 55 |              Accumulator 0,1 Input 1: add(a, -ident) means pass(a)
 56 |              Multiplier  0,1 Input 1: mul(a,  ident) means pass(a)
 57 |     23:    Passthrough Output [-2, two instructions ago]
 58 |     24,25: Accumulator 0,1 Output [-2, two instructions ago]
 59 |     26,27: Multiplier 0,1 Output [-2, two instructions ago]
 60 |     28-31: Register 0 Output [-1, last instruction] (Register/Attribute)
 61 | Note: If attribute_load_en is disabled then the attribute slot can be used to load registers too.
 62 | 
 63 | ## Latencies
 64 | 
 65 | Temporaries have a latency of 4 instructions, i.e. writes take 4 cycles to appear. Registers have a similar latency of 3 instructions. Writes to address registers 1-3 have a latency of 4 instructions. Writes to address register 0 (temporary store) have no latency though, so it can be set in the same instruction as the temporary store itself. The complex1 operation has a latency of 2 cycles.
 66 | 
 67 | Instruction format:
 68 | 
 69 |     0-4:   Multiply 0 Input A
 70 |     5-9:   Multiply 0 Input B
 71 |     10-14: Multiply 1 Input A (Wide-Operation Input C)
 72 |     15-19: Multiply 1 Input B (Wide-Operation Input D)
 73 |     20:    Multiply 0 Output Negate
 74 |     21:    Multiply 1 Output Negate
 75 |     22-26: Accumulator 0 Input A
 76 |     27-31: Accumulator 0 Input B
 77 |     32-36: Accumulator 1 Input A
 78 |     37-41: Accumulator 1 Input B
 79 |     42:    Accumulator 0 Input A Negate
 80 |     43:    Accumulator 0 Input B Negate
 81 |     44:    Accumulator 1 Input A Negate
 82 |     45:    Accumulator 1 Input B Negate
 83 |     46-54: Load Address (Uniform/Temporary)
 84 |     55-57: Load Offset (Uniform/Temporary)
 85 |         0   - Address Register 0? (Never seen)
 86 |         1   - Address Register 1
 87 |         2   - Address Register 2
 88 |         3   - Address Register 3
 89 |         4-6 - Unknown (Never seen)
 90 |         7   - Unused (No offset)
 91 |     58-61: Register 0 Address (Register/Attribute)
 92 |     62:    Register 0 Attribute (Load attribute in Register 0 unit)
 93 |     63-66: Register 1 Address
 94 |     67: Store 0 Temporary (Store Temporary in Store 0)
 95 |     68: Store 1 Temporary (Store Temporary in Store 1)
 96 |     69: Branch
 97 |     70: Branch Target Low (< 0x100)
 98 |     71-73: Store 0 Input X (Register/Varying/Temporary)
 99 |     74-76: Store 0 Input Y (Register/Varying/Temporary)
100 |     77-79: Store 1 Input Z (Register/Varying/Temporary)
101 |     80-82: Store 1 Input W (Register/Varying/Temporary)
102 |         0 - Accumulator 0 Output
103 |         1 - Accumulator 1 Output
104 |         2 - Multiplier 0 Output
105 |         3 - Multiplier 1 Output
106 |         4 - Passthrough Output
107 |         5 - Unknown
108 |         6 - Complex Output
109 |         7 - Unused (Don't store)
110 |     83-85: Accumulator (0 & 1) opcode
111 |         0 - add
112 |         1 - floor
113 |         2 - sign
114 |         3 - unknown
115 |         4 - greater-equal/step (a >= b)
116 |         5 - less-than (src0 < src1)
117 |         6 - min/logical and (a && b)
118 |         7 - max/logical or (a || b)
119 |         note: abs(a) is implemented as max(a, -a)
120 |     86-89: Complex OpCode
121 |         0 - unused
122 |         2 - exp2 (Partial)
123 |         3 - log2 (Partial)
124 |         4 - inverse sqrt (Partial)
125 |         5 - reciprocal (Partial)
126 |         9 - passthrough
127 |         10 - Set Address Register 0 & Address Register 1 from result of passthrough unit
128 |         12 - Set Address Register 0 (Temporary Store address)
129 |         13 - Set Address Register 1
130 |         14 - Set Address Register 2
131 |         15 - Set Address Register 3
132 |     90-93: Store 0 Address (Varying/Register/Temporary)
133 |     94:    Store 0 Varying (Store Varying in Store 0)
134 |     95-98: Store 1 Address (Varying/Register/Temporary)
135 |     99:    Store 1 Varying (Store Varying in Store 1)
136 |     100-102: Multiply (0 & 1) OpCode
137 |         0 - multiply (out = a * b)
138 |         1 - complex 1 (inverse, inverse sqrt, etc.)
139 |             takes all four inputs as arguments
140 |             This instruction has a latency of 2 cycles.
141 |         3 - complex 2 (inverse, inverse sqrt, etc.)
142 |             takes first two inputs as arguments,
143 |             the other two are normal (multiply)
144 |         4 - select (out = (b ? a : c), wide operation)
145 |         5-7: unknown
146 |     103-105: Passthrough OpCode
147 |         2 - passthrough (out = in)
148 |         6 - clamp (out = max(min(in, uniform.x), uniform.y))
149 |         0-1,3-5,7: unknown
150 |     106-110: Complex Input
151 |     111-115: Passthrough Input
152 |     116-119: Unknown (Treating as flags)
153 |         0 - Normal
154 |         12 - Temporary Write
155 |         13 - Branch
156 |     120-127: Branch Target (Absolute address, bit 9 is the inverse of Branch Target Low)
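
To read a field out of the format above, the 4 instruction words can be treated as one 128-bit value. This is a minimal sketch with a hypothetical helper name, and it assumes that word 0 holds bits 0-31, word 1 holds bits 32-63, and so on; that word ordering is not stated here.

```python
def gp_field(words, low, high):
    """Extract bits low..high (inclusive) from a 4-word GP instruction.

    Assumption: words[0] holds bits 0-31, words[1] bits 32-63, etc.
    """
    instruction = sum((w & 0xFFFFFFFF) << (32 * i) for i, w in enumerate(words))
    width = high - low + 1
    return (instruction >> low) & ((1 << width) - 1)
```

For example, `gp_field(words, 83, 85)` would return the accumulator opcode and `gp_field(words, 120, 127)` the branch target.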
157 | 
158 | 
159 | 
160 | ## Lima Vertex Pipeline
161 | [<img src="http://img857.imageshack.us/img857/6044/limavertexpipeline.png">](http://img857.imageshack.us/img857/6044/limavertexpipeline.png)
162 | 


--------------------------------------------------------------------------------
/Utgard-PP.md:
--------------------------------------------------------------------------------
  1 | # PP architecture
  2 | 
  3 | The architecture is a large pipeline made up of a number of vector and scalar units which can be enabled and disabled by the control word. In particular, there are two vector ALU's, one of which can do addition, another which can do multiplication. There are also two similar scalar ALU's, and the "combiner" capable of executing scalar-vector multiplies as well as various complex/transcendental functions (sqrt, rcp, sin, cos, exp2, etc.). Furthermore, there are varying, uniform/temporary, and texture load units, a temporary store unit, and a branching unit for implementing jumps. Each unit can affect/produce results which are used by all later units, as if all 6 registers are passed between each unit in the pipeline (see "Lima Fragment Pipeline" below). The only exception is that each vector ALU unit is executed in parallel with its scalar counterpart - so the vector and scalar multiply ALU's run in parallel, and so do the vector and scalar addition ALU's. Furthermore, to reduce register pressure, there are a number of "pipeline registers". A pipeline register is a direct connection between two units in the pipeline, in addition to the normal registers which are passed between every unit. For more details on registers (including pipeline registers), see the "Registers" section below. To overcome the pipeline stall issues inherent in such a long pipeline (128 stages for Mali-200, see [this page](http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka12787.html)), the architecture is likely barrelled and interleaves execution of a large number of fragments at once, and scheduling is done by the machine in order to minimize stalls.
  4 | The instruction stream is compressed down from a maximum of 18 words per instruction, dependent on which units are in use.
  5 | The remaining bits give each unit individual instructions and constants.
  6 | 
  7 | The encodings are variable length 32-bit aligned.
  8 | 
  9 | ## Control Word Encoding
 10 | 
 11 | Fragment shaders begin with a 32-bit control word.
 12 | The control word dictates what units will be used in each instruction.
 13 | The lowest 5 bits of the control word determine the number of 32-bit DWORDs in the entire instruction "word" (including the control word).
 14 | 
 15 | ## Speculation on derivatives
 16 | 
 17 | Apparently, most GPUs process fragments in groups of 2x2; I suspect ours does this as well. Normally, each shader in the group is isolated from the others, except for derivatives. Derivatives are an extension in GLES 2.0 (OES_standard_derivatives), but Mali does support it. Usually, derivatives are accomplished by each pixel swapping the value passed into the function with either its vertical (dFdy) or horizontal (dFdx) neighbor in the group, and then computing the difference with its own value. Furthermore, all 4 shaders have to execute this swap-and-difference instruction at once; I believe control bit 6 synchronizes the shaders (I'm not sure what happens when each shader takes a different side of a branch). Bit 6 is enabled for texture fetching because the processor computes the level of detail by computing the screen-space derivative of the texture coordinates, and therefore needs to compute derivatives as part of fetching the texture.
 18 | 
 19 | ### Bitfield Description
 20 | 
 21 |      0...4: {  } Instruction Length
 22 |      5:     {0 } Output to gl_FragColor and end program.
 23 |      6:     {0 } Inter-thread synchronization?
 24 |      7:     {34} Varying Fetch
 25 |      8:     {62} Texture Sampler
 26 |      9:     {41} Uniform/Temporary Fetch
 27 |     10:     {43} Vec4 Multiply ALU
 28 |     11:     {30} Scalar Multiply ALU
 29 |     12:     {44} Vec4 Addition ALU
 30 |     13:     {31} Scalar Addition ALU
 31 |     14:     {30} Vec4-Scalar Multiply/Transcendental Scalar ALU
 32 |     15:     {41} Temporary Write/Framebuffer Read
 33 |     16:     {73} Branch/Discard
 34 |     17:     {64} Vec4 Constant Fetch 0
 35 |     18:     {64} Vec4 Constant Fetch 1
 36 |     19..24: {  } Next Instruction Length (prefetch)
 37 |     25:     {  } Prefetch enable
 38 |     28..31  {  } Unknown (=0)
 39 |     
 40 |     
 41 |     {n} means this bit set adds n bits to the instruction 
 42 | 
 43 | Prefetch enable is set for all instructions which can possibly reach the next instruction, i.e. all instructions except for the last instruction and discard instructions. For all instructions, the next instruction length is set to the instruction length of the next instruction (duh) except for the last instruction, where it's set to 0; presumably this is for prefetching, since a value that is too high still works but a value that is too low crashes. If you add the n's together for each enabled unit, round up to the nearest 32-bit word, and add 1 word for the control word, the result should equal the bottom 5 bits.
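
The length rule can be written out as a short calculation. This is a sketch using the {n} values from the bitfield table above; the names `UNIT_BITS` and `instruction_length_words` are hypothetical.

```python
# Bits added to the instruction by each control-word bit ({n} in the table).
UNIT_BITS = {
    5: 0, 6: 0, 7: 34, 8: 62, 9: 41, 10: 43, 11: 30,
    12: 44, 13: 31, 14: 30, 15: 41, 16: 73, 17: 64, 18: 64,
}


def instruction_length_words(control_word):
    """Expected value of control bits 0-4 for the enabled units."""
    total_bits = sum(
        size for bit, size in UNIT_BITS.items() if control_word & (1 << bit)
    )
    unit_words = (total_bits + 31) // 32  # round up to whole 32-bit words
    return unit_words + 1  # plus one word for the control word itself


def length_field_is_consistent(control_word):
    """Check the stored length (bits 0-4) against the enabled units."""
    return (control_word & 0x1F) == instruction_length_words(control_word)
```

An instruction that only enables the varying fetch (bit 7, 34 bits) needs 2 words of unit data plus the control word, so its length field should be 3.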
 44 | 
 45 | 
 46 | ### Vector Swizzling
 47 | These 8 bits encode how to swizzle a 4-component vector in the vec4 addition and multiplication opcodes. The bits are grouped into 4 groups of 2. Bits 0 and 1 tell what goes in the first element, 2 and 3 what goes in the second element, etc. For example:
 48 | 
 49 |      00 = 00 00 00 00 = .xxxx
 50 |      04 = 00 00 01 00 = .xyxx
 51 |      10 = 00 01 00 00 = .xxyx
 52 |      14 = 00 01 01 00 = .xyyx
 53 |      24 = 00 10 01 00 = .xyzx
 54 |      40 = 01 00 00 00 = .xxxy
 55 |      44 = 01 00 01 00 = .xyxy
 56 |      50 = 01 01 00 00 = .xxyy
 57 |      54 = 01 01 01 00 = .xyyy
 58 |      E4 = 11 10 01 00 = .xyzw (identity)
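
The table above follows directly from reading the byte two bits at a time, starting from the least significant bits. A minimal sketch (the function name is hypothetical):

```python
def decode_swizzle(byte):
    """Turn an 8-bit swizzle descriptor into a component string."""
    # Bits 0-1 select the first element, bits 2-3 the second, and so on.
    return "".join("xyzw"[(byte >> (2 * i)) & 0x3] for i in range(4))
```

`decode_swizzle(0xE4)` gives `"xyzw"` (the identity) and `decode_swizzle(0x24)` gives `"xyzx"`, matching the table.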
 59 | 
 60 | 
 61 | ## Opcode formatting
 62 | 
 63 | The opcodes are placed in the order of the bits in the control word: Varying loading (bit 7) will always come first if enabled, followed by texture sampling (bit 8), then uniform loading (bit 9), etc.
 64 | 
 65 | To encode the opcode for each enabled unit in the control word, start with bit 0 of the first word. Then, for each unit, append the opcode to the current word. If the opcode overflows the current word, then start at bit 0 of the next one.
 66 | 
 67 | Note that, because the bits are numbered right-to-left but the words are numbered left-to-right, this means that the opcodes usually look "broken" when looking at the instruction word-for-word. For example, a varying load followed by some other opcode might look like this:
 68 | 
 69 | aaaa aaaa aaaa aaaa aaaa aaaa aaaa aaaa  bbbb bbbb bbbb bbbb bbbb bbbb bbbb bbAA ...
 70 | 
 71 | where the varying load opcode is actually:
 72 | AA aaaa aaaa aaaa aaaa aaaa aaaa aaaa aaaa
 73 | 
 74 | Note that bits 32-33 of the opcode "overflowed" into bits 0-1 of the next word.
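
The packing rule can be sketched by concatenating the unit opcodes into one long bit stream, least significant bit first, and then slicing it into 32-bit words. The function name is hypothetical, and the sketch only covers the unit opcodes, not the control word itself.

```python
def pack_fields(fields):
    """Pack (value, width_in_bits) pairs into a list of 32-bit words.

    Fields are appended in the order given, each one starting at the
    first free bit; a field that overflows a word continues at bit 0
    of the next word.
    """
    stream = 0
    position = 0
    for value, width in fields:
        stream |= (value & ((1 << width) - 1)) << position
        position += width
    word_count = (position + 31) // 32
    return [(stream >> (32 * i)) & 0xFFFFFFFF for i in range(word_count)]
```

Packing a 34-bit varying load followed by a 30-bit opcode puts the low 32 bits of the varying load in the first word, and its top two bits in bits 0-1 of the second word, below the next opcode.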
 75 | 
 76 | ## Registers
 77 | 
 78 | It appears there are 6 vec4 registers actually used out of up to 12 possible, which are encoded as four bits in a control field. The last 4 registers are read-only "pipeline registers" and are hardcoded as:
 79 | 
 80 |     12 - Vec4 constant 0
 81 |     13 - Vec4 constant 1
 82 |     14 - Texture sampler result
 83 |     15 - Uniform fetch result
 84 | 
 85 | Also, each of these registers can be divided into 4 scalar registers when used as scalar inputs or outputs, for 24 scalar registers and 16 read-only special registers, which are encoded as six bits in a control field. Furthermore, when these registers run out there is a larger number of "temporary variables", which can be read using the uniform/temporary fetch (bit 9) and written using the temporary write (bit 15).  When writing to a vector register, you can specify a "write mask" or "output mask". Conceptually, all 4 components of the result are calculated, but only the components specified by the write mask are actually written to the register.
 86 | 
 87 | There also exists various "pipeline registers" (four of them listed above) which are only used within one particular instruction (i.e. within the pipeline) and are used as the inputs/outputs of various units. The pipeline registers can be grouped into:
 88 | 
 89 | * Constants
 90 | * Texture Sample registers
 91 | * Uniform fetch register
 92 | * ALU registers (^vmul, ^fmul or ^v0, ^s0)
 93 | 
 94 | ## Deciphered Opcodes
 95 | 
 96 |     control[7], Varying Fetch
 97 |     
 98 |     00 mmmm dddd iiii iiOO 00oo oo00 0aa0 ssss
 99 |     Or, for loading from a register (used for loading texture coordinates from a register):
100 |     00 mmmm dddd SSSS SSSS ANrr rr00 0000 01pp
101 |     Or, for normalizing a vec3 input register (uses most of the same fields as loading from a register):
102 |     00 mmmm dddd SSSS SSSS ANrr rr00 1000 1010
103 |     
104 |     m - Mask, (0001 = float, 0011 = vec2, 0111 = vec3, 1111 = vec4)
105 |     d - Destination Register
106 |         Note: writing to register 15 discards the output (used for loading texture coordinates)
107 |     i - Varying Index
108 |     a - alignment
109 |         It seems that varyings (floats) can be loaded in aligned groups of 1, 2, or 4.
110 |         This specifies how many to load at once. Note that the alignment affects the addressing;
111 |         for example, loading from an index of x at an alignment of 4 is equivalent to loading from 2*x
112 |          and 2*x+1 at an alignment of 2.
113 |         00 - no alignment (load 1 float)
114 |         01 - alignment by 2 (load 2 floats)
115 |         11 - alignment by 4 (load 4 floats)
116 |     A - absolute value input modifier
117 |     N - negate input modifier
118 |     r - input register
119 |     o - Offset register - vector part
120 |         1111: no offset (0)
121 |     O - Offset register - scalar part
122 |         Note: it seems the offset register is formed by ooooOO.
123 |         However, I haven't been able to test this theory because I haven't gotten the compiler
124 |         to produce a value for O other than 11. 
125 |     s - source:
126 |         00pp - normal varying
127 |         01pp - register (see second instruction format)
128 |         1000 - varying, input to textureCube()
129 |         1001 - register, input to textureCube()
130 |         1010 - vec3 normalize (see third instruction format)
131 |         1011 - gl_FragCoord
132 |         1100 - gl_PointCoord
133 |         1101 - gl_FrontFacing
134 |     S - swizzle descriptor for source
135 |     p - perspective (used for texture2DProj)
136 |         00 - normal
137 |         10 - divide by z
138 |         11 - divide by w
139 | 
140 |     Note: for gl_FragCoord and gl_PointCoord, the shader has to apply some transforms to get the right value.
141 |     For gl_FragCoord, it looks like this in pseudocode:
142 |     gl_FragCoord.xyz = gl_FragCoord_orig.xyz;
143 |     gl_FragCoord.w = 1.0 / gl_FragCoord_orig.w;
144 | 
145 |     And for gl_PointCoord:
146 |     uniform vec4 gl_mali_PointCoordScaleBias; //created by the compiler
147 |     gl_PointCoord = gl_mali_PointCoordScaleBias.zw + gl_mali_PointCoordScaleBias.xy * gl_PointCoord_orig;
148 | 
149 |     control[8], Texture fetch
150 | 
151 |     00 1110 0100 0000 0000 01ss ssss ssss ssot tttt 0000 0lb0 0000 cccc ccrr rrrr
152 | 
153 |     The coordinates for the texture fetch are always the output of the varying load.
154 | 
155 |     s - sampler index
156 |     o - sampler index register offset enable
157 |     c - sampler index offset register
158 |     t - sampler type
159 |         00000 - sampler2D
160 |         11111 - samplerCube
161 |     l - lod register enable
162 |     b - explicit lod
163 |         If true, the LOD register specifies the actual LOD.
164 |         If false, the LOD register specifies an offset applied to the normally-calculated LOD.
165 |     r - lod register
166 | 
167 |     control[9], Uniform/Temporary Fetch
168 | 
169 |     i iiii iiii iiii iiio rrrr rr00 0000 aa00 0000 00ss
170 | 
171 |     i - source index
172 |     a - alignment
173 |         00 - 1-aligned
174 |         01 - 2-aligned
175 |         10 - 4-aligned
176 |     s - source
177 |         00 - uniform
178 |         11 - temporary
179 |     o - register offset enable
180 |     r - offset register
181 |     
182 |     control[10], Vec4 Multiply ALU
183 | 
184 |     ooo ooMM mmmm dddd CCaa aaaa aaAA AADD bbbb bbbb BBBB
185 | 
186 |     Everything except the opcode is the same as the vec4 addition ALU.
187 | 
188 |     Opcode:
189 |         00xxx - arg0 * arg1 * 2^x, where x is in two's-complement format
190 |         01000 - not(arg0)
191 |         01001 - and(arg0, arg1)
192 |         01010 - or(arg0, arg1)
193 |         01011 - xor(arg0, arg1)
194 |         01100 - notEqual(arg0, arg1)
195 |         01101 - lessThan(arg0, arg1)
196 |         01110 - lessThanEqual(arg0, arg1)
197 |         01111 - equal(arg0, arg1)
198 |         10000 - min(arg0, arg1)
199 |         10001 - max(arg0, arg1)
200 |         11111 - arg1 (passthrough)
201 | 
202 |     control[11], Scalar Multiply ALU
203 | 
204 |     oo oooM Medd dddd AAaa aaaa BBbb bbbb
205 | 
206 |     o - opcode:
207 |         00xxx - arg0 * arg1 * 2^x where x is in two's complement format
208 |         01000 - not(arg0)
209 |         01001 - and(arg0, arg1)
210 |         01010 - or(arg0, arg1)
211 |         01011 - xor(arg0, arg1)
212 |         01100 - notEqual(arg0, arg1)
213 |         01101 - lessThan(arg0, arg1)
214 |         01110 - lessThanEqual(arg0, arg1)
215 |         10001 - max(arg0, arg1)
216 |         10000 - min(arg0, arg1)
217 |         11111 - arg1 (passthrough)
218 |      
219 |     e - output enable, set to 0 when outputting to scalar accumulate (above)   
220 |     d - destination register
221 |     A - argument 0 modifiers, bit 0 is abs() and bit 1 is negate
222 |     a - argument 0 register
223 |     B - argument 1 modifiers
224 |     b - argument 1 register
225 |     M - output modifier:
226 |         00 - passthrough
227 |         01 - saturate - clamp(output, 0.0, 1.0)
228 |         10 - max(0.0, output)
229 |         11 - round to integer
230 | 
231 |     control[12], Vec4 Addition ALU
232 | 
233 |     iooo ooMM mmmm dddd CCaa aaaa aaAA AADD bbbb bbbb BBBB
234 |     
235 |     i - whether to get Argument 1 from the vector multiplication ALU (above)
236 |     o - opcode:
237 |         00000 - arg0 + arg1
238 |         00100 - fract(arg1)
239 |         01000 - notEqual(arg0, arg1)
240 |         01100 - floor(arg1)
241 |         01101 - ceil(arg1)
242 |         01011 - equal(arg0, arg1)
243 |         01001 - lessThan(arg0, arg1)
244 |         01010 - lessThanEqual(arg0, arg1)
245 |         01111 - max(arg0, arg1)
246 |         01110 - min(arg0, arg1)
247 |         10000 - sum3 - dest.xyzw = sum of first 3 components of arg1
248 |         10001 - sum4 - dest.xyzw = sum of all components of arg1
249 |             Note: for sum3 and sum4, the output is broadcast to all channels - 
250 |             you can use the write mask to select which component to write to
251 |         10100 - dFdx(arg0, arg1)
252 |         10101 - dFdy(arg0, arg1)
253 |             Note: dFdx(x) is actually implemented as dFdx(-x, x) (same for dFdy)
254 |             See "Speculation on derivatives" above
255 |         11111 - arg1 (passthrough)
256 |     m - Mask, same as varying fetch opcode
257 |     d - Destination register
258 |     C - Argument 0 modifier
259 |     a - Argument 0 Swizzle descriptor
260 |     A - Argument 0 Source
261 |     D - Argument 1 modifier
262 |     b - Argument 1 Swizzle descriptor
263 |     B - Argument 1 Source
264 |     M - output modifier:
265 |         00 - don't round
266 |         01 - saturate - clamp(output, 0.0, 1.0)
267 |         10 - max(0.0, output)
268 |         11 - round to integer
269 | 
270 |     Modifiers:
271 |     bit 0 - absolute value
272 |     bit 1 - negate
273 |     
274 |     control[13], Scalar Addition ALU
275 | 
276 |     ioo ooor r1dd dddd AAaa aaaa BBbb bbbb
277 | 
278 |     i - whether to get Argument 1 from Scalar Multiply (above)
279 |     o - opcode:
280 |         00000 - arg0 + arg1
281 |         00100 - fract(arg1)
282 |         01100 - floor(arg1)
283 |         01101 - ceil(arg1)
284 |         10100 - dFdx(arg0, arg1)
285 |         10101 - dFdy(arg0, arg1)
286 |             Note: dFdx(x) is actually implemented as dFdx(-x, x) (same for dFdy)
287 |             See "Speculation on derivatives" above
288 |         10111 - sel
289 |             Seems to select between two input values based on ^fmul: (^fmul ? arg1 : arg0).
290 |         11111 - arg1 (passthrough)
291 |     d - destination
292 |     A - argument 0 modifiers, bit 0 is abs() and bit 1 is negate
293 |     a - argument 0 register
294 |     B - argument 1 modifiers
295 |     b - argument 1 register
296 |     M - output modifier:
297 |         00 - passthrough
298 |         01 - saturate - clamp(output, 0.0, 1.0)
299 |         10 - max(0.0, output)
300 |         11 - round to integer
301 |     
302 |     control[14], Vec4-Scalar Multiply/Transcendental Scalar ALU
303 | 
304 |     dd dddd MMss ssss na00 0000 00oo oo00
305 | 
306 |     d - destination
307 |     M - output modifier
308 |     s - source
309 |     n - negate source
310 |     a - take absolute value of source
311 |     o - operation:
312 |         0x0 - 0000 - reciprocal (1.0 / x)
313 |         0x1 - 0001 - nop/passthrough?
314 |         0x2 - 0010 - sqrt
315 |         0x3 - 0011 - inversesqrt
316 |         0x4 - 0100 - exp2
317 |         0x5 - 0101 - log2
318 |         0x6 - 0110 - sin
319 |         0x7 - 0111 - cos
320 |         0x8 - 1000 - atan_pt1 (see below)
321 |         0x9 - 1001 - atan2_pt1 (see below)
322 | 
323 |     Or, for float * vec4:
324 |     dd ddmm mmaa aaaa AAbb bbBB BBBB BB11
325 | 
326 |     d - destination
327 |     m - mask (for destination)
328 |     a - scalar source
329 |     A - scalar modifier (negative, absolute value)
330 |     b - vec4 source
331 |     B - vec4 source swizzle descriptor
332 | 
333 |     Note - for sin and cos, the input is multiplied by the constant 1/(2*pi), presumably to
334 |     simplify the hardware.
335 | 
336 |     atan is divided into two instructions, which we'll call atan_pt1 and atan_pt2.
337 |     atan_pt1 takes the (scalar) input and produces a 3-component vector.
338 |     atan_pt2 takes the vector and produces the final output.
339 | 
340 |     Unlike atan, atan2 needs an additional multiply between atan2_pt1 and atan_pt2:
341 |     $temp.xyz = atan2_pt1 y, x;
342 |     $temp.x *= $temp.y;
343 |     result = atan_pt2 $temp;
344 | 
345 |     asin and acos are implemented using atan2, as follows:
346 |     asin(x) = atan2(x, sqrt(1 - x^2))
347 |     acos(x) = atan2(sqrt(1 - x^2), x)
348 | 
349 |     atan_pt1:
350 | 
351 |     dd ddmm mmaa aaaa AAbb bbbb BBoo oo01
352 |     d - destination (vector)
353 |     m - write/output mask (always 0111 in this case)
354 |     a - scalar 0 source
355 |     A - scalar 0 modifier
356 |     b - scalar 1 source
357 |     B - scalar 1 modifier
358 |     o - opcode
359 |         0x8 - 1000 - atan_pt1(src0)
360 |         0x9 - 1001 - atan2_pt1(src0, src1)
361 | 
362 |     atan_pt2:
363 | 
364 |     dd dddd 0000 0000 00aa aaAA AAAA AA10
365 |     d - destination (scalar)
366 |     a - source (vector)
367 |     A - swizzle descriptor
368 | 
369 |     control[15], Temporary Write/Framebuffer Read
370 | 
371 |     Temporary Write:
372 | 
373 |     i iiii iiii iiii iiio rrrr rr00 0000 aass ssss 00dd
374 | 
375 |     i - destination index
376 |     a - alignment
377 |         0 - float
378 |         1 - vec2
379 |         2 - vec4
380 |     s - source register
381 |         If the alignment is set to vec4, then only the upper 4 bits are used.
382 |     d - destination
383 |         11 - temporary
384 |     o - register offset enable
385 |     r - offset register
386 | 
387 |     Framebuffer Read:
388 | 
389 |     0 0000 0000 0000 0000 0000 0000 0000 10dd dd00 11ss
390 | 
391 |     d - destination register
392 |     s - source
393 |         11 - gl_FBColor
394 |         10 - gl_FBDepth
395 |         Note: since gl_FBDepth is a float, and the alignment is set to 1,
396 |         this instruction will always set the x component of the specified destination register.
397 | 
398 |     control[16], Branch/Discard
399 | 
400 |     Branch:
401 |     0 0011 tttt tttt tttt tttt tttt tttt ttt0 0000 0000 0000 0000 0000 0ccc aaaa aabb bbbb 0000
402 | 
403 |     Discard:
404 |     0 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0111 1111 0000 0000 0000 0011
405 | 
406 |     c - condition:
407 |         bit 0 - jump if a > b
408 |         bit 1 - jump if a = b
409 |         bit 2 - jump if a < b
410 |         The jump is taken if any condition whose bit is set is true.
411 |         For example, a condition code of 011 means "jump if a >= b", and 111 is an unconditional jump.
412 | 
413 |     t - target address, relative to the start of the current instruction
414 | 
415 |     control[17] and control[18], Vec4 constant fetch
416 | 
417 |     This opcode simply consists of 4
418 |     [half-floats](http://en.wikipedia.org/wiki/Half-precision_floating-point_format)
419 |     to be loaded. Bits 0-15 store the first value, bits 16-31 the second one, etc.
420 | 
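The branch condition rule in control[16] can be modelled in a few lines of Python. This is only an illustrative sketch: the name `branch_taken` and the `COND_*` constants are made up for this example, and it assumes the operands are ordinary floats. How the hardware treats NaN inputs is not covered by the notes above, so the model makes no claim about it.

```python
# Hypothetical model of the 3-bit branch condition field (names are illustrative).
COND_GT = 0b001  # bit 0 - jump if a > b
COND_EQ = 0b010  # bit 1 - jump if a == b
COND_LT = 0b100  # bit 2 - jump if a < b


def branch_taken(cond: int, a: float, b: float) -> bool:
    """Return True if any comparison selected by a set bit in `cond` holds."""
    return bool(
        (cond & COND_GT and a > b)
        or (cond & COND_EQ and a == b)
        or (cond & COND_LT and a < b)
    )
```

In this model, `branch_taken(0b011, a, b)` behaves like `a >= b`, and `0b111` is taken for any pair of ordinary (non-NaN) values, matching the examples given in the encoding notes.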
421 | 
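The vec4 constant fetch layout (control[17] and control[18]) can be decoded with Python's standard `struct` module, whose `e` format character is an IEEE 754 half-float. The sketch below assumes the four constants have already been extracted from the instruction stream into a single 64-bit integer with bit 0 as the least significant bit; the function name `decode_vec4_constants` is hypothetical.

```python
import struct


def decode_vec4_constants(word: int) -> tuple:
    """Split a 64-bit field into four half-floats, lowest 16 bits first.

    Assumes `word` holds the constant payload with bits 0-15 as the first value.
    """
    raw = (word & 0xFFFFFFFFFFFFFFFF).to_bytes(8, "little")
    # "<4e": little-endian, four IEEE 754 binary16 values
    return struct.unpack("<4e", raw)
```

For example, `decode_vec4_constants(0x3800C00040003C00)` yields `(1.0, 2.0, -2.0, 0.5)`, since `0x3C00`, `0x4000`, `0xC000` and `0x3800` are the half-float encodings of those values.
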
422 | ## Lima Fragment Pipeline
423 | [<img src="http://img850.imageshack.us/img850/1056/limapipelinehorizontal2.png">](http://img850.imageshack.us/img850/1056/limapipelinehorizontal2.png)
424 | 


--------------------------------------------------------------------------------