├── .gitignore
├── LICENSE
├── README.md
└── img
    ├── ExponentialVS.png
    ├── PlaneClusters.png
    ├── UniformVS.png
    ├── aabb_overlap.png
    ├── camera.png
    ├── compute_workgroups.png
    ├── fancy_sponza.png
    ├── flat_demo.png
    ├── frustum.png
    ├── sponza_demo2.png
    ├── t=.png
    └── t=highlighted.png


/.gitignore:
--------------------------------------------------------------------------------
1 | .idea/
2 | /blend
3 | /old.md


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2024 DaveH355
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # Clustered Shading
  2 | 
  3 | Clustered shading is a technique that allows for efficient rendering of
  4 | thousands of dynamic lights in 3D perspective games. It can be integrated into both forward and deferred rendering pipelines with
  5 | minimal intrusion.
  6 | 
  7 | # Showcase
  8 | 
  9 | ![sponza_demo](img/sponza_demo2.png)
 10 | ![flat_demo](img/flat_demo.png)
 11 | *(top) 512 lights (bottom) 1024 lights | both scenes rendered using clustered deferred on an Intel CometLake-H GT2 iGPU @60 fps*
 12 | 
 13 | # Overview
 14 | 
 15 | The traditional method for dynamic lighting is to loop over every light in the scene to
 16 | shade a single fragment. This is a huge performance limitation as there can be millions of fragments to shade on a modern display.
 17 | 
 18 | What if we could just loop over the lights we know will affect a given fragment? Enter clustered shading.
 19 | 
 20 | > [!IMPORTANT]
 21 | > Clustered shading divides the view frustum into 3D blocks (clusters)
 22 | > and assigns lights to each based on the light's influence. If a light is too far away, it is not
 23 | > visible to the cluster. Then in the shading step, a fragment retrieves the light list for the cluster it's in.
 24 | > This increases efficiency by only considering lights that are very likely to affect the fragment.
 25 | 
 26 | Clustered shading can be thought of as the natural evolution of traditional dynamic lighting. It's not a very well-known
 27 | technique. Since its [introduction](https://www.cse.chalmers.se/~uffe/clustered_shading_preprint.pdf) in 2012, clustered shading has mostly stayed in the realm of research papers and behind the doors of big game studios.
 28 | My goal is to present a simple implementation with clear reasoning behind every decision. Something that might fit
 29 | in a [LearnOpenGL](https://learnopengl.com/) article.
 30 | 
 31 | We'll be using OpenGL 4.3 and C++. I'll assume you have working knowledge of both.
 32 | 
 33 | > [!TIP]
 34 | > If you are viewing this in dark mode on GitHub, I recommend trying out light mode high contrast for easier reading.
 35 | > Text can appear blurry to the eye in dark mode.
 36 | 
 37 | ## Step 1: Splitting the view frustum into clusters
 38 | 
 39 | <p align="center">
 40 |   <img src="img/frustum.png">
 41 | </p>
 42 | 
 43 | *The view frustum and camera position (center of projection) form a pyramid-like shape*
 44 | 
 45 | The definition of the view frustum is the space between the `zNear` and `zFar` planes. This is the part of the world the camera can "see". Shading is only done
 46 | on fragments that end up in the frustum.
 47 | 
 48 | **Our goal is to divide this volume into a 3D grid of clusters.** We'll define the clusters in view space so they are always relative to where the camera is.
 49 | 
 50 | ### Division Scheme (Z)
 51 | 
 52 | <p align="middle">
 53 |   <img src="img/UniformVS.png"/>
 54 |   <img src="img/ExponentialVS.png" />
 55 | </p>
 56 | 
 57 | *uniform division (left) and exponential division (right)*
 58 | 
 59 | There are two main ways to divide the frustum along the depth: uniform and exponential division.
 60 | 
 61 | The exponential division lets us cover the same depth range with fewer slices: slices are thin near the camera and grow with distance. We generally don't mind if this causes a lot of lights to
 62 | be assigned to those far-out clusters. Under perspective projection, an object covers less of the screen the further away it is,
 63 | so there are fewer fragments to shade out there.
 64 | 
 65 | So exponential division it is. We'll use the equation below which closely matches the image on the right.
 66 | 
 67 | ```math
 68 | \LARGE
 69 | Z=\text{Near}_z\left(\frac{\text{Far}_z}{\text{Near}_z}\right)\Huge^{\frac{\text{slice}}{\text{numslices}}}
 70 | ```
 71 | 
 72 | - $\text{Near}_z$ and $\text{Far}_z$ represent the near and far planes
 73 | - $\text{slice}$ is the current slice index
 74 | - $\text{numslices}$ is the total number of slices to divide with.
 75 | 
 76 | This equation gives the positive depth $Z$ from the camera of a slice boundary, where $Z$ is some value between the near and far planes.
 77 | 
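To get a feel for the numbers, here is a small CPU-side sketch of the same equation. `sliceDepth` is a hypothetical helper name of mine, not something from the shaders in this article, and it only mirrors the formula above.

```cpp
#include <cmath>

// Positive depth from the camera of slice boundary `slice`, following
// Z = near * (far / near)^(slice / numSlices).
// NOTE: sliceDepth is a hypothetical helper used for illustration only.
inline float sliceDepth(float zNear, float zFar, unsigned int slice, unsigned int numSlices)
{
    const float exponent = static_cast<float>(slice) / static_cast<float>(numSlices);
    return zNear * std::pow(zFar / zNear, exponent);
}
```

With the near and far planes used in the benchmarks later on (0.1 and 400) and 24 slices, boundary 0 sits on the near plane, boundary 24 sits on the far plane, and boundary 12 lands at roughly 6.32. Each slice is about 1.41 times as deep as the one before it, so half of the slices are spent on the first few units in front of the camera.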
 78 | ### Division Scheme (XY)
 79 | 
 80 | In addition to slicing the frustum along the depth, we also need to divide it along the x and y axes.
 81 | What subdivision scheme to use is up to you. If your near and far planes are very far apart, you'll want more depth slices.
 82 | 
 83 | A good place to start is 16x9x24 (x, y, z-depth) which is what [DOOM 2016](https://advances.realtimerendering.com/s2016/Siggraph2016_idTech6.pdf) uses.
 84 | I personally use 12x12x24 to show the division scheme can be anything you choose.
 85 | 
 86 | ### Cluster shape
 87 | 
 88 | The simplest way to represent the shape of the clusters is an AABB (Axis Aligned Bounding Box).
 89 | Unfortunately, a side effect is that the AABBs must overlap to cover the frustum shape.
 90 | This image shows that.
 91 | 
 92 | <p align="center">
 93 | <img src="img/aabb_overlap.png" width="90%">
 94 | </p>
 95 | 
 96 | *The points used to create the AABB cause overlapping boundaries*
 97 | 
 98 | This still gives good results performance wise. You could choose a more accurate shape
 99 | and improve shading time as lights are better assigned to their clusters.
100 | But what you're ultimately doing is trading faster shading for slower culling.
101 | 
102 | In fact, I'll make a bold claim: This algorithm does not benefit from more accurate cluster shapes or distributions of clusters.
103 | The reason being, there is not a lot of room for optimization without complicating and slowing down the cluster creation or culling step.
104 | 
105 | ### Implementation
106 | 
107 | We use a compute shader to build the cluster grid. This is all fully functioning code, taken straight from my OpenGL playground project.
108 | 
109 | <details open>
110 |   <summary>GLSL</summary>
111 | 
112 | ```glsl
113 | #version 430 core
114 | layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;
115 | 
116 | struct Cluster
117 | {
118 |     vec4 minPoint;
119 |     vec4 maxPoint;
120 |     uint count;
121 |     uint lightIndices[100];
122 | };
123 | 
124 | layout(std430, binding = 1) restrict buffer clusterSSBO {
125 |     Cluster clusters[];
126 | };
127 | 
128 | uniform float zNear;
129 | uniform float zFar;
130 | 
131 | uniform mat4 inverseProjection;
132 | uniform uvec3 gridSize;
133 | uniform uvec2 screenDimensions;
134 | 
135 | vec3 screenToView(vec2 screenCoord);
136 | vec3 lineIntersectionWithZPlane(vec3 startPoint, vec3 endPoint, float zDistance);
137 | 
138 | /*
139 |  context: glViewport is referred to as the "screen"
140 |  clusters are built based on a 2d screen-space grid and depth slices.
141 |  Later when shading, it is easy to figure what cluster a fragment is in based on
142 |  gl_FragCoord.xy and the fragment's z depth from camera
143 | */
144 | void main()
145 | {
146 |     uint tileIndex = gl_WorkGroupID.x + (gl_WorkGroupID.y * gridSize.x) +
147 |             (gl_WorkGroupID.z * gridSize.x * gridSize.y);
148 |     vec2 tileSize = vec2(screenDimensions) / vec2(gridSize.xy); // float division; uvec2 / uvec2 would truncate
149 | 
150 |     // tile in screen-space
151 |     vec2 minTile_screenspace = gl_WorkGroupID.xy * tileSize;
152 |     vec2 maxTile_screenspace = (gl_WorkGroupID.xy + 1) * tileSize;
153 | 
154 |     // convert tile to view space sitting on the near plane
155 |     vec3 minTile = screenToView(minTile_screenspace);
156 |     vec3 maxTile = screenToView(maxTile_screenspace);
157 | 
158 |     float planeNear =
159 |         zNear * pow(zFar / zNear, gl_WorkGroupID.z / float(gridSize.z));
160 |     float planeFar =
161 |         zNear * pow(zFar / zNear, (gl_WorkGroupID.z + 1) / float(gridSize.z));
162 | 
163 |     // the line goes from the eye position in view space (0, 0, 0)
164 |     // through the min/max points of a tile to intersect with a given cluster's near-far planes
165 |     vec3 minPointNear =
166 |         lineIntersectionWithZPlane(vec3(0, 0, 0), minTile, planeNear);
167 |     vec3 minPointFar =
168 |         lineIntersectionWithZPlane(vec3(0, 0, 0), minTile, planeFar);
169 |     vec3 maxPointNear =
170 |         lineIntersectionWithZPlane(vec3(0, 0, 0), maxTile, planeNear);
171 |     vec3 maxPointFar =
172 |         lineIntersectionWithZPlane(vec3(0, 0, 0), maxTile, planeFar);
173 | 
174 |     clusters[tileIndex].minPoint = vec4(min(minPointNear, minPointFar), 0.0);
175 |     clusters[tileIndex].maxPoint = vec4(max(maxPointNear, maxPointFar), 0.0);
176 | }
177 | 
178 | // Returns the intersection point of an infinite line and a
179 | // plane perpendicular to the Z-axis
180 | vec3 lineIntersectionWithZPlane(vec3 startPoint, vec3 endPoint, float zDistance)
181 | {
182 |     vec3 direction = endPoint - startPoint;
183 |     vec3 normal = vec3(0.0, 0.0, -1.0); // plane normal
184 | 
185 |     // skip check if the line is parallel to the plane.
186 | 
187 |     float t = (zDistance - dot(normal, startPoint)) / dot(normal, direction);
188 |     return startPoint + t * direction; // the parametric form of the line equation
189 | }
190 | vec3 screenToView(vec2 screenCoord)
191 | {
192 |     // normalize screenCoord to [-1, 1] and
193 |     // set the NDC depth of the coordinate to be on the near plane. This is -1 by
194 |     // default in OpenGL
195 |     vec4 ndc = vec4(screenCoord / screenDimensions * 2.0 - 1.0, -1.0, 1.0);
196 | 
197 |     vec4 viewCoord = inverseProjection * ndc;
198 |     viewCoord /= viewCoord.w;
199 |     return viewCoord.xyz;
200 | }
201 | ```
202 | 
203 | </details>
204 | 
205 | <details>
206 |   <summary>C++</summary>
207 | 
208 | ```cpp
209 | namespace Compute
210 | {
211 | constexpr unsigned int gridSizeX = 12;
212 | constexpr unsigned int gridSizeY = 12;
213 | constexpr unsigned int gridSizeZ = 24;
214 | constexpr unsigned int numClusters = gridSizeX * gridSizeY * gridSizeZ;
215 | 
216 | struct alignas(16) Cluster
217 | {
218 |   glm::vec4 minPoint;
219 |   glm::vec4 maxPoint;
220 |   unsigned int count;
221 |   unsigned int lightIndices[100];
222 | };
223 | 
224 | unsigned int clusterGridSSBO;
225 | 
226 | void init_ssbos()
227 | {
228 |   // clusterGridSSBO
229 |   {
230 |     glGenBuffers(1, &clusterGridSSBO);
231 |     glBindBuffer(GL_SHADER_STORAGE_BUFFER, clusterGridSSBO);
232 | 
233 |     // NOTE: we only need to allocate memory. No need for initialization because
234 |     // comp shader builds the AABBs.
235 |     glBufferData(GL_SHADER_STORAGE_BUFFER, sizeof(Cluster) * numClusters,
236 |                  nullptr, GL_STATIC_COPY);
237 |     glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, clusterGridSSBO);
238 |   }
239 | }
240 | 
241 | Shader clusterComp;
242 | 
243 | void cull_lights_compute(const Camera &camera)
244 | {
245 |   auto [width, height] = Core::get_framebuffer_size();
246 | 
247 |   // build AABBs every frame
248 |   clusterComp.use();
249 |   clusterComp.set_float("zNear", camera.near);
250 |   clusterComp.set_float("zFar", camera.far);
251 |   clusterComp.set_mat4("inverseProjection", glm::inverse(camera.projection));
252 |   clusterComp.set_uvec3("gridSize", {gridSizeX, gridSizeY, gridSizeZ});
253 |   clusterComp.set_uvec2("screenDimensions", {width, height});
254 | 
255 |   glDispatchCompute(gridSizeX, gridSizeY, gridSizeZ);
256 |   glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
257 | }
258 | 
259 | void init()
260 | {
261 |   init_ssbos();
262 |   // load shaders
263 |   clusterComp = Shader("clusterShader.comp");
264 | }
265 | 
266 | ```
267 | 
268 | </details>
269 | 
270 | I encourage you to stop here and adapt this code into your game or engine and study it! See if you can spot the exponential formula from earlier.
271 | 
272 | We divide the screen into tiles and convert the points to view space sitting on the camera near plane. This
273 | essentially leaves us with a divided near plane. Then for each min and max point of a tile on the near plane,
274 | we draw a line from the origin through that point and intersect it with the ***cluster's*** near and far planes.
275 | The intersection points together form the bound of the AABB.
276 | 
277 | > [!NOTE]
278 | > `screenDimensions` is more accurately thought of as the dimensions of the `glViewport` in which lighting will be done.
279 | 
280 | And a few notes on the C++ side:
281 | 
282 | 1. `glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);` ensures the writes to the SSBO (Shader Storage Buffer Object) by the compute shader are visible to the next shader.
283 | 
284 | 2. `alignas(16)` is used to correctly match the C++ struct memory layout with how the SSBO expects it.
285 | 
286 |    According to the [OpenGL Spec](https://registry.khronos.org/OpenGL/specs/gl/glspec46.core.pdf#page=169&zoom=146,46,581), the `std430` memory layout requires
287 |    the base alignment of a structure (and of each element in an array of structures) to be the largest base alignment of any of its members.
288 | 
289 |    ```cpp
290 |    struct alignas(16) Cluster
291 |    {
292 |      glm::vec4 minPoint; // 16 bytes
293 |      glm::vec4 maxPoint; // 16 bytes
294 |      unsigned int count; // 4 bytes
295 |      unsigned int lightIndices[100]; // 400 bytes
296 |    };
297 |    ```
298 | 
299 |    The member with the largest base alignment in this struct is a vec4 at 16 bytes. Since we are storing an array of Cluster, each element must be aligned
300 |    to 16 bytes. If you don't know what memory alignment is, don't worry.
301 |    We basically need to add padding bytes to make the total struct size a multiple of 16 bytes (the 436 bytes of members become 448).
302 |    We can manually add some dummy variables or let the compiler handle it with `alignas`.
303 | 
304 |    > If you are not storing an array of structures in `std430`, and as long as you [stay away from vec3](https://stackoverflow.com/q/38172696/19633924), you *probably* don't need to worry about alignment.
305 | 
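The padding arithmetic from the `alignas` note can be checked at compile time. This is a sketch, not part of the original project: `Vec4` is a stand-in I made up so the snippet compiles without GLM, under the assumption that `glm::vec4` is four tightly packed floats.

```cpp
#include <cstddef>

// Stand-in for glm::vec4 (assumption: four tightly packed 32-bit floats).
struct Vec4 { float x, y, z, w; };

struct alignas(16) Cluster
{
    Vec4 minPoint;                  // 16 bytes, offset 0
    Vec4 maxPoint;                  // 16 bytes, offset 16
    unsigned int count;             // 4 bytes,  offset 32
    unsigned int lightIndices[100]; // 400 bytes, offset 36
};

// The members add up to 436 bytes; alignas(16) pads the struct to 448,
// which is the array stride std430 uses for this struct.
static_assert(sizeof(unsigned int) == 4, "sketch assumes a 32-bit unsigned int");
static_assert(alignof(Cluster) == 16, "Cluster must be 16-byte aligned");
static_assert(offsetof(Cluster, count) == 32, "count follows the two vec4s");
static_assert(offsetof(Cluster, lightIndices) == 36, "indices follow count with no gap");
static_assert(sizeof(Cluster) == 448, "436 bytes of members padded up to a multiple of 16");
```

If one of these assertions ever fires after you change the struct, the C++ side and the shader no longer agree on where each cluster starts in the buffer.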
306 | ## Step 2: Assigning Lights to Clusters (Culling)
307 | 
308 | Our goal now is to cull the lights by assigning them to clusters based on the light influence. The most common type of light used in games is the point light.
309 | A point light has a position and radius which define a sphere of influence.
310 | 
311 | We brute force test every light against every cluster. If there is an intersection between the sphere and AABB, the light is visible to the cluster,
312 | and it is appended to the cluster's local list.
313 | 
314 | Let's look at the cluster struct.
315 | 
316 | ```glsl
317 |   struct Cluster
318 |   {
319 |     vec4 minPoint; // min point of AABB in view space
320 |     vec4 maxPoint; // max point of AABB in view space
321 |     uint count;
322 |     uint lightIndices[100]; // elements point directly to global light list
323 |   };
324 | ```
325 | 
326 | - The min and max define the AABB of this cluster like before.
327 | 
328 | - `lightIndices` contains the lights visible to this cluster. We hardcode a max of 100 lights visible to a cluster at any time. You'll see this number used a few times elsewhere.
329 |   If you want to increase the number, make sure to change it everywhere.
330 | 
331 | - `count` keeps track of how many lights are visible. It tells us how many entries to read from the `lightIndices` array.
332 | 
333 | We'll use another compute shader to cull the lights. Compute shaders are just so awesome because they are general purpose.
334 | They are great for parallel tasks such as ours: testing thousands of lights for intersection against thousands of clusters.
335 | 
336 | Let's have each compute shader thread process a single cluster.
337 | 
338 | <details>
339 |   <summary>Is it really called a thread?</summary>
340 | 
341 | - Strictly speaking, no, compute shaders don't use the term "thread" like CPUs. Compute shaders have workgroups and invocations. Each workgroup has its own invocations (called workgroup size or local_size).
342 |    Each invocation is an independent execution of the main function. But it's helpful to think of invocations as threads.
343 | 
344 |    Read more about compute shaders on the [OpenGL Wiki](https://www.khronos.org/opengl/wiki/Compute_Shader).
345 | 
346 | </details>
347 | 
348 | ### Implementation
349 | 
350 | <details open>
351 | <summary>GLSL</summary>
352 | 
353 | ```glsl
354 | #version 430 core
355 | 
356 | #define LOCAL_SIZE 128
357 | layout(local_size_x = LOCAL_SIZE, local_size_y = 1, local_size_z = 1) in;
358 | 
359 | struct PointLight
360 | {
361 |     vec4 position;
362 |     vec4 color;
363 |     float intensity;
364 |     float radius;
365 | };
366 | 
367 | struct Cluster
368 | {
369 |     vec4 minPoint;
370 |     vec4 maxPoint;
371 |     uint count;
372 |     uint lightIndices[100];
373 | };
374 | 
375 | layout(std430, binding = 1) restrict buffer clusterSSBO
376 | {
377 |     Cluster clusters[];
378 | };
379 | 
380 | layout(std430, binding = 2) restrict buffer lightSSBO
381 | {
382 |     PointLight pointLight[];
383 | };
384 | 
385 | uniform mat4 viewMatrix;
386 | 
387 | bool testSphereAABB(uint i, Cluster c);
388 | 
389 | // each invocation of main() is a thread processing a cluster
390 | void main()
391 | {
392 |     uint lightCount = pointLight.length();
393 |     uint index = gl_WorkGroupID.x * LOCAL_SIZE + gl_LocalInvocationID.x;
394 |     Cluster cluster = clusters[index];
395 | 
396 |     // we need to reset count because culling runs every frame.
397 |     // otherwise it would accumulate.
398 |     cluster.count = 0;
399 | 
400 |     for (uint i = 0; i < lightCount; ++i)
401 |     {
402 |         if (testSphereAABB(i, cluster) && cluster.count < 100)
403 |         {
404 |             cluster.lightIndices[cluster.count] = i;
405 |             cluster.count++;
406 |         }
407 |     }
408 |     clusters[index] = cluster;
409 | }
410 | 
411 | bool sphereAABBIntersection(vec3 center, float radius, vec3 aabbMin, vec3 aabbMax)
412 | {
413 |     // closest point on the AABB to the sphere center
414 |     vec3 closestPoint = clamp(center, aabbMin, aabbMax);
415 |     // squared distance between the sphere center and closest point
416 |     float distanceSquared = dot(closestPoint - center, closestPoint - center);
417 |     return distanceSquared <= radius * radius;
418 | }
419 | 
420 | // this just unpacks data for sphereAABBIntersection
421 | bool testSphereAABB(uint i, Cluster cluster)
422 | {
423 |     vec3 center = vec3(viewMatrix * pointLight[i].position);
424 |     float radius = pointLight[i].radius;
425 | 
426 |     vec3 aabbMin = cluster.minPoint.xyz;
427 |     vec3 aabbMax = cluster.maxPoint.xyz;
428 | 
429 |     return sphereAABBIntersection(center, radius, aabbMin, aabbMax);
430 | }
431 | ```
432 | 
433 | </details>
434 | 
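If you want to sanity check the intersection test outside the GPU, a CPU mirror of `sphereAABBIntersection` is easy to write. The sketch below assumes nothing beyond the standard library; `Vec3` and `clampf` are hypothetical stand-ins for the GLSL `vec3` type and `clamp` built-in.

```cpp
#include <algorithm>

// Hypothetical stand-in for GLSL's vec3.
struct Vec3 { float x, y, z; };

// Hypothetical stand-in for GLSL's clamp() on a single component.
inline float clampf(float v, float lo, float hi)
{
    return std::max(lo, std::min(v, hi));
}

// Same logic as the shader: find the closest point on the AABB to the
// sphere center and compare its squared distance with the squared radius.
inline bool sphereAABBIntersection(Vec3 center, float radius, Vec3 aabbMin, Vec3 aabbMax)
{
    const Vec3 closest{clampf(center.x, aabbMin.x, aabbMax.x),
                       clampf(center.y, aabbMin.y, aabbMax.y),
                       clampf(center.z, aabbMin.z, aabbMax.z)};
    const float dx = closest.x - center.x;
    const float dy = closest.y - center.y;
    const float dz = closest.z - center.z;
    return dx * dx + dy * dy + dz * dz <= radius * radius;
}
```

A sphere whose center is inside the box always passes, because the closest point is then the center itself and the distance is zero.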
435 | Now let's update the C++ code, mainly to create the lights. How exactly this is done is different for everyone,
436 | but the following suits the basic purpose.
437 | 
438 | <details>
439 | <summary>C++</summary>
440 | 
441 | The important part is to create and fill the light SSBO. Note the use of `alignas` in the PointLight struct definition.
442 | 
443 | ```cpp
444 | struct alignas(16) PointLight
445 | {
446 |   glm::vec4 position;
447 |   glm::vec4 color;
448 |   float intensity;
449 |   float radius;
450 | };
451 | 
452 | int main()
453 | {
454 |   unsigned int lightSSBO = 0; // buffer handle; store it wherever your engine keeps GL objects
455 |   std::mt19937 rng{std::random_device{}()};
456 | 
457 |   constexpr int numLights = 512;
458 |   std::uniform_real_distribution<float> distXZ(-100.0f, 100.0f);
459 |   std::uniform_real_distribution<float> distY(0.0f, 55.0f);
460 | 
461 |   std::vector<PointLight> lightList;
462 |   lightList.reserve(numLights);
463 |   for (int i = 0; i < numLights; i++)
464 |   {
465 |     PointLight light{};
466 |     float x = distXZ(rng);
467 |     float y = distY(rng);
468 |     float z = distXZ(rng);
469 | 
470 |     glm::vec4 position(x, y, z, 1.0f);
471 | 
472 |     light.position = position;
473 |     light.color = {1.0, 1.0, 1.0, 1.0};
474 |     light.intensity = 1;
475 |     light.radius = 5.0f;
476 | 
477 |     lightList.push_back(light);
478 |   }
479 | 
480 |   glGenBuffers(1, &lightSSBO);
481 |   glBindBuffer(GL_SHADER_STORAGE_BUFFER, lightSSBO);
482 | 
483 |   glBufferData(GL_SHADER_STORAGE_BUFFER, lightList.size() * sizeof(PointLight),
484 |                lightList.data(), GL_DYNAMIC_DRAW);
485 | 
486 |   glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 2, lightSSBO);
487 | }
488 | 
489 | ```
490 | 
491 | ```cpp
492 | namespace Compute
493 | {
494 | constexpr unsigned int gridSizeX = 12;
495 | constexpr unsigned int gridSizeY = 12;
496 | constexpr unsigned int gridSizeZ = 24;
497 | constexpr unsigned int numClusters = gridSizeX * gridSizeY * gridSizeZ;
498 | 
499 | struct alignas(16) Cluster
500 | {
501 |   glm::vec4 minPoint;
502 |   glm::vec4 maxPoint;
503 |   unsigned int count;
504 |   unsigned int lightIndices[100];
505 | };
506 | 
507 | unsigned int clusterGridSSBO;
508 | 
509 | void init_ssbos()
510 | {
511 |   // clusterGridSSBO
512 |   {
513 |     glGenBuffers(1, &clusterGridSSBO);
514 |     glBindBuffer(GL_SHADER_STORAGE_BUFFER, clusterGridSSBO);
515 | 
516 |     // NOTE: we only need to allocate memory. No need for initialization because
517 |     // comp shader builds the AABBs.
518 |     glBufferData(GL_SHADER_STORAGE_BUFFER, sizeof(Cluster) * numClusters,
519 |                  nullptr, GL_STATIC_COPY);
520 |     glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, clusterGridSSBO);
521 |   }
522 | }
523 | 
524 | Shader clusterComp;
525 | Shader cullLightComp;
526 | 
527 | void cull_lights_compute(const Camera &camera)
528 | {
529 |   auto [width, height] = Core::get_framebuffer_size();
530 | 
531 |   // build AABBs, doesn't need to run every frame but fast
532 |   clusterComp.use();
533 |   clusterComp.set_float("zNear", camera.near);
534 |   clusterComp.set_float("zFar", camera.far);
535 |   clusterComp.set_mat4("inverseProjection", glm::inverse(camera.projection));
536 |   clusterComp.set_uvec3("gridSize", {gridSizeX, gridSizeY, gridSizeZ});
537 |   clusterComp.set_uvec2("screenDimensions", {width, height});
538 | 
539 |   glDispatchCompute(gridSizeX, gridSizeY, gridSizeZ);
540 |   glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
541 | 
542 |   // cull lights
543 |   cullLightComp.use();
544 |   cullLightComp.set_mat4("viewMatrix", camera.view);
545 | 
546 |   glDispatchCompute(27, 1, 1);
547 |   glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
548 | }
549 | 
550 | void init()
551 | {
552 |   init_ssbos();
553 |   // load shaders
554 |   clusterComp = Shader("clusterShader.comp");
555 |   cullLightComp = Shader("clusterCullLightShader.comp");
556 | }
557 | ```
558 | 
559 | </details>
560 | 
561 | The compute shader has 128 "threads" per workgroup. We dispatch 27 workgroups for a total of 3456 threads.
562 | This is to fit the design of each thread processing a single cluster. Remember we have 12x12x24 = 3456 clusters. If you change anything, make sure to change your dispatch to match the total thread count
563 | with the number of clusters.
564 | 
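Rather than hardcoding 27, the workgroup count can be derived from the grid. This is a minimal sketch; `localSize` and `workgroupCount` are names I am introducing here, and `localSize` has to be kept in sync with `LOCAL_SIZE` in the cull shader by hand.

```cpp
constexpr unsigned int localSize = 128; // must match LOCAL_SIZE in the cull shader

constexpr unsigned int gridSizeX = 12;
constexpr unsigned int gridSizeY = 12;
constexpr unsigned int gridSizeZ = 24;
constexpr unsigned int numClusters = gridSizeX * gridSizeY * gridSizeZ;

// Hypothetical helper: number of workgroups needed to cover `items`
// invocations, rounding up.
constexpr unsigned int workgroupCount(unsigned int items, unsigned int groupSize)
{
    return (items + groupSize - 1) / groupSize;
}

static_assert(workgroupCount(numClusters, localSize) == 27, "12x12x24 clusters need 27 groups of 128");
static_assert(numClusters % localSize == 0, "cull shader has no bounds check, so the division must be exact");
```

The result would be passed as the first argument of `glDispatchCompute`. The second assertion matters: if the cluster count did not divide evenly, the rounded-up dispatch would launch extra invocations that index past the end of the cluster array, and the cull shader as written has no early return to stop them.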
565 | Also, since each thread processes its own cluster and writes to its own part of the SSBO memory, we don't need to use any shared memory or atomic operations!
566 | This keeps the compute shader as parallel as possible.
567 | 
568 | ## Step 3: Consumption in fragment shader
569 | 
570 | Now that we have built our cluster grid and assigned lights to clusters, we can finally consume this data.
571 | 
572 | We calculate the cluster a fragment is in, retrieve the light list for that cluster, and do cool lighting.
573 | This is basically reversing the calculations in the cluster compute shader to solve for the xyz indexes of a cluster.
574 | 
575 | Your lighting shader should look something like this:
576 | 
577 | ```glsl
578 | #version 430 core
579 | 
580 | // PointLight and Cluster struct definitions
581 | //...
582 | // bind to light and cluster ssbo
583 | // same as cull compute shader
584 | 
585 | uniform float zNear;
586 | uniform float zFar;
587 | uniform uvec3 gridSize;
588 | uniform uvec2 screenDimensions;
589 | 
590 | out vec4 FragColor;
591 | 
592 | void main()
593 | {
594 |     //view space position of a fragment. Replace with your implementation
595 |     vec3 FragPos = texture(gPosition, TexCoords).rgb;
596 | 
597 |     // Locating which cluster this fragment is part of
598 |     uint zTile = uint((log(abs(FragPos.z) / zNear) * gridSize.z) / log(zFar / zNear));
599 |     vec2 tileSize = vec2(screenDimensions) / vec2(gridSize.xy); // float division; uvec2 / uvec2 would truncate
600 |     uvec3 tile = uvec3(gl_FragCoord.xy / tileSize, zTile);
601 |     uint tileIndex =
602 |         tile.x + (tile.y * gridSize.x) + (tile.z * gridSize.x * gridSize.y);
603 | 
604 |     uint lightCount = clusters[tileIndex].count;
605 | 
606 |     for (uint i = 0; i < lightCount; ++i)
607 |     {
608 |         uint lightIndex = clusters[tileIndex].lightIndices[i];
609 |         PointLight light = pointLight[lightIndex];
610 |         // do cool lighting
611 |     }
612 | }
613 | ```
614 | 
615 | Here FragPos is the view space position of the fragment.
616 | The absolute value of `FragPos.z` gives us the positive Z depth of the fragment from the camera.
617 | Remember, that's exactly the left hand side of the exponential equation from earlier.
618 | 
619 | Solving that earlier equation for the slice results in the z index of the cluster.
620 | 
621 | ```glsl
622 | uint zTile = uint((log(abs(FragPos.z) / zNear) * gridSize.z) / log(zFar / zNear));
623 | ```
624 | 
625 | Finding the xy index of the cluster is very simple. We have the screen coordinates of the fragment from `gl_FragCoord.xy`, so we just need to divide by
626 | the tileSize. Again, this is the reverse of what the cluster compute shader does.
627 | 
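The z index calculation is easy to test on the CPU. The sketch below is my own mirror of the `zTile` expression, with two additions the shader above does not have: depths at or in front of the near plane go to slice 0, and the result is clamped to the last slice. `depthToSlice` and `sliceBoundary` are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>

// Forward equation: positive depth of a (possibly fractional) slice position.
inline float sliceBoundary(float zNear, float zFar, float slice, unsigned int numSlices)
{
    return zNear * std::pow(zFar / zNear, slice / static_cast<float>(numSlices));
}

// Inverse: slice = log(depth / near) * numSlices / log(far / near).
// `depth` is the positive distance from the camera, i.e. abs(FragPos.z).
inline unsigned int depthToSlice(float depth, float zNear, float zFar, unsigned int numSlices)
{
    if (depth <= zNear)
        return 0; // addition: avoids converting a negative value to unsigned
    const float slice = std::log(depth / zNear) * static_cast<float>(numSlices) /
                        std::log(zFar / zNear);
    // addition: a depth exactly on the far plane would otherwise yield numSlices
    return std::min(static_cast<unsigned int>(slice), numSlices - 1);
}
```

Feeding the midpoint of every slice through `sliceBoundary` and then `depthToSlice` should return the slice you started with, which is a quick way to confirm the two equations really are inverses of each other.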
628 | ## Common Problems
629 | 
630 | A common problem is flickering artifacts. This could be either:
631 | 
632 | 1. Your light is affecting fragments outside its defined radius. This causes uneven lighting. Try adding a range check to your attenuation.
633 | 
634 | 2. There are too many lights visible to a single cluster. Remember we hardcoded a max of 100 lights per cluster at any time. If this limit is hit, further intersecting lights
635 |    will be ignored, and their assignment will become unpredictable. This can happen at further out clusters, since the exponential division causes those clusters to be very large.
636 | 
637 |    **Solution:** Increase the light limit. The only cost is more GPU memory. You can also add a check in your lighting shader
638 |    to output a warning color.
639 | 
640 |    ```glsl
641 |     uint lightCount = clusters[tileIndex].count;
642 |     if (lightCount > 95) {
643 |         //getting close to limit. Output red color and dip
644 |         FragColor = vec4(1.0f, 0.0f, 0.0f, 1.0f);
645 |         return;
646 |     }
647 |    ```
648 | 
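For the first cause above, one possible range check is to multiply the falloff by a window that reaches exactly zero at the light radius. The function below is only a sketch of that idea written in C++ so it can be tested on the CPU. The specific falloff (an inverse-square term with a `1 - (d/r)^4` window) is one common choice and an assumption on my part, not what the demo scenes necessarily use, and `windowedAttenuation` is a hypothetical name. It assumes `radius` is greater than zero.

```cpp
#include <algorithm>

// Attenuation that is guaranteed to be zero at and beyond `radius`, so a light
// can never brighten a fragment in a cluster it was not assigned to.
inline float windowedAttenuation(float distance, float radius)
{
    if (distance >= radius)
        return 0.0f; // explicit range check

    const float ratio = distance / radius;
    const float ratio2 = ratio * ratio;
    const float window = std::max(0.0f, 1.0f - ratio2 * ratio2); // 1 - (d/r)^4
    return (window * window) / (distance * distance + 1.0f);
}
```

The same arithmetic ports directly to GLSL. Whatever falloff you pick, the property to preserve is that the light contributes nothing at its radius, because that radius is exactly what the culling step tested against.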
649 | ## Benchmarks
650 | 
651 | The following benchmarks were measured using `glFinish()` and regular C++ clocks on
652 | my Linux machine with an Intel CometLake-H GT2 iGPU. I found the integrated GPU results were more in line with what I expected.
653 | They also show the competitiveness of the algorithm on low-end hardware.
654 | 
655 | The scene uses clustered shading with deferred rendering **without** any optimizations like frustum culling.
656 | 
657 | - 12x12x24 cluster grid
658 | - Camera near and far planes (0.1, 400)
659 | - Light XZ positions allowed to range (-100, 100) and vertical Y (0, 55)
660 | - 1920x1080 resolution
661 | - Sponza model
662 | 
663 | | Lights (radius)     | Building Clusters | Light Assignment | Shading |
664 | |---------------------|-------------------|------------------|---------|
665 | | 512 lights (13.0f)  | 0.28 ms           | 0.95 ms          | 5.23 ms |
666 | | 1,024 lights (7.0f) | 0.27 ms           | 1.50 ms          | 3.71 ms |
667 | | 2,048 lights (3.0f) | 0.42 ms           | 2.61 ms          | 2.84 ms |
668 | | 4,096 lights (2.0f) | 0.29 ms           | 5.15 ms          | 3.28 ms |
669 | 
670 | ### Optimization
671 | 
672 | The benchmarks show constructing the cluster grid takes constant time, while shading perf is largely affected by light radius.
673 | However, a bottleneck starts to appear in assigning lights to clusters. This makes sense since we are brute force testing every light against every cluster. We need some way to reduce the number of lights being tested.
674 | 
675 | One way is to build a BVH (Bounding Volume Hierarchy) over the lights and traverse it in the culling step. This
676 | can produce good results, but IMO it overcomplicates things. Clustered shading is already an optimization technique, and I have doubts about spiraling into a rabbit hole of optimizing the optimizers.
677 | 
678 | The easiest solution here is to frustum cull the lights and update the light SSBO every frame. Thus, we only test lights that are
679 | in the view frustum. Frustum culling is fast and already standard in many games.
680 | 
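Here is a rough sketch of what frustum culling the lights could look like on the CPU. Everything in it is an assumption for illustration: the `Sphere` and `Frustum` types, the function names, and the premise that light positions have already been transformed into view space (camera at the origin, looking down -Z) with a symmetric perspective projection.

```cpp
#include <cmath>
#include <vector>

struct Sphere { float x, y, z, radius; };                   // view-space center and radius
struct Frustum { float zNear, zFar, tanHalfFovY, aspect; }; // symmetric perspective frustum

// Conservative sphere vs. frustum test: returns false only when the sphere is
// completely outside one of the six planes.
inline bool sphereInFrustum(const Sphere& s, const Frustum& f)
{
    const float depth = -s.z; // positive distance in front of the camera
    if (depth + s.radius < f.zNear) return false; // entirely in front of the near plane
    if (depth - s.radius > f.zFar) return false;  // entirely beyond the far plane

    const float tanX = f.tanHalfFovY * f.aspect;
    const float tanY = f.tanHalfFovY;
    const float invLenX = 1.0f / std::sqrt(1.0f + tanX * tanX);
    const float invLenY = 1.0f / std::sqrt(1.0f + tanY * tanY);

    // Signed distances to the side planes (positive means inside).
    if ((tanX * depth - s.x) * invLenX < -s.radius) return false; // right
    if ((tanX * depth + s.x) * invLenX < -s.radius) return false; // left
    if ((tanY * depth - s.y) * invLenY < -s.radius) return false; // top
    if ((tanY * depth + s.y) * invLenY < -s.radius) return false; // bottom
    return true;
}

// Keeps only the lights that may touch the frustum.
inline std::vector<Sphere> cullLights(const std::vector<Sphere>& lights, const Frustum& f)
{
    std::vector<Sphere> visible;
    visible.reserve(lights.size());
    for (const Sphere& light : lights)
        if (sphereInFrustum(light, f))
            visible.push_back(light);
    return visible;
}
```

The surviving lights would then be re-uploaded to the light SSBO each frame before the cull shader runs. The test is conservative: a sphere just outside a frustum corner can still pass, which only costs a little extra work in the cluster assignment and never drops a light that should be visible.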
681 | ## Further Reading
682 | 
683 | - [Clustered Deferred and Forward Shading - 2012](https://www.cse.chalmers.se/~uffe/clustered_shading_preprint.pdf) A research paper where
684 |   clustered shading was first introduced.
685 | - [A Primer On Efficient Rendering Algorithms & Clustered Shading - 2018](http://www.aortiz.me/2018/12/21/CG.html) An excellent blog post much of this tutorial is based on.
686 | - [Practical Clustered Shading - 2015](http://www.humus.name/Articles/PracticalClusteredShading.pdf) Presentation by Avalanche Studios
687 | - [Simple Alternative to Clustered Shading for Thousands of Lights - 2015](https://worldoffries.wordpress.com/2015/02/19/simple-alternative-to-clustered-shading-for-thousands-of-lights/)
688 |    Alternative to clustered shading by building a BVH and traversing it directly in the shading step.
689 | 
690 | -----------
691 | *Questions, typos, or corrections? Feel free to open an issue or pull request!*
692 | 


--------------------------------------------------------------------------------
/img/ExponentialVS.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DaveH355/clustered-shading/3589aefe1baf7f37180a6f1853f4c2456805e005/img/ExponentialVS.png


--------------------------------------------------------------------------------
/img/PlaneClusters.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DaveH355/clustered-shading/3589aefe1baf7f37180a6f1853f4c2456805e005/img/PlaneClusters.png


--------------------------------------------------------------------------------
/img/UniformVS.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DaveH355/clustered-shading/3589aefe1baf7f37180a6f1853f4c2456805e005/img/UniformVS.png


--------------------------------------------------------------------------------
/img/aabb_overlap.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DaveH355/clustered-shading/3589aefe1baf7f37180a6f1853f4c2456805e005/img/aabb_overlap.png


--------------------------------------------------------------------------------
/img/camera.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DaveH355/clustered-shading/3589aefe1baf7f37180a6f1853f4c2456805e005/img/camera.png


--------------------------------------------------------------------------------
/img/compute_workgroups.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DaveH355/clustered-shading/3589aefe1baf7f37180a6f1853f4c2456805e005/img/compute_workgroups.png


--------------------------------------------------------------------------------
/img/fancy_sponza.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DaveH355/clustered-shading/3589aefe1baf7f37180a6f1853f4c2456805e005/img/fancy_sponza.png


--------------------------------------------------------------------------------
/img/flat_demo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DaveH355/clustered-shading/3589aefe1baf7f37180a6f1853f4c2456805e005/img/flat_demo.png


--------------------------------------------------------------------------------
/img/frustum.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DaveH355/clustered-shading/3589aefe1baf7f37180a6f1853f4c2456805e005/img/frustum.png


--------------------------------------------------------------------------------
/img/sponza_demo2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DaveH355/clustered-shading/3589aefe1baf7f37180a6f1853f4c2456805e005/img/sponza_demo2.png


--------------------------------------------------------------------------------
/img/t=.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DaveH355/clustered-shading/3589aefe1baf7f37180a6f1853f4c2456805e005/img/t=.png


--------------------------------------------------------------------------------
/img/t=highlighted.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DaveH355/clustered-shading/3589aefe1baf7f37180a6f1853f4c2456805e005/img/t=highlighted.png


--------------------------------------------------------------------------------