In shader programming, you often need every pixel in a compute shader group (tile) to iterate over the same array in memory. Tiled deferred lighting is the most common case: an 8x8 tile loops over the light list culled for that tile.
Simplified HLSL code looks like this:
Buffer<float4> lightDatas;
Texture2D<uint2> lightStartCounts;
RWTexture2D<float4> output;

[numthreads(8, 8, 1)]
void CSMain(uint3 tid : SV_DispatchThreadID, uint3 gid : SV_GroupID)
{
    uint2 lightStartCount = lightStartCounts[gid.xy];
    float4 lightAccumulator = 0;
    for (uint i = 0; i < lightStartCount.y; ++i)
    {
        uint lightIndex = lightStartCount.x + i;
        // In real impl light data would be position, radius, color, etc.
        float4 lightData = lightDatas[lightIndex]; // 64-wide mem load. All lanes load from same address.
        lightAccumulator += lightData;
    }
    output[tid.xy] = lightAccumulator;
}
The problem with this code is that we load the same light data for each lane. If you would have used constant buffer instead of Buffer
Fortunately, with SM 6.0 wave intrinsics we can do better. We can load 32 (Nvidia) or 64 (AMD) lights at once using a single load instruction and then use WaveReadLaneAt to broadcast light data from one lane to all lanes, one lane at a time. This reduces the number of load instructions by 32x / 64x. For example, a tile with 128 lights takes 128 load instructions per wave in the loop above, but only 4 (Nvidia) or 2 (AMD) with this approach. There is also no register bloat, since each lane now holds unique data instead of a replicated value. Register file usage is as good as with AMD scalar loads.
Optimized code looks like this:
// Integer division rounding up. Helper used by the outer loop below.
uint divRoundUp(uint x, uint y)
{
    return (x + y - 1) / y;
}

[numthreads(8, 8, 1)]
void CSMainWave(uint3 tid : SV_DispatchThreadID, uint3 gid : SV_GroupID)
{
    const uint WAVE_LANES = WaveGetLaneCount(); // Usually 64 or 32. Compile time constant.
    // Use the lane index instead of SV_GroupIndex: with 32-wide waves an 8x8 group
    // spans two waves, and the second wave must still load lights 0..31 of each batch.
    const uint lane = WaveGetLaneIndex();
    uint2 lightStartCount = lightStartCounts[gid.xy];
    float4 lightAccumulator = 0;
    for (uint i = 0; i < divRoundUp(lightStartCount.y, WAVE_LANES); ++i)
    {
        // Load 32/64 unique items at once in outer loop. 32x/64x reduction in mem loads.
        float4 lightData64 = lightDatas[lightStartCount.x + lane + i * WAVE_LANES];
        // The last batch may be partial: only broadcast lanes that hold lights of this tile.
        uint loopIters = min(WAVE_LANES, lightStartCount.y - WAVE_LANES * i);
        for (uint i2 = 0; i2 < loopIters; ++i2)
        {
            // In real impl light data would be position, radius, color, etc.
            float4 lightData = WaveReadLaneAt(lightData64, i2); // Broadcast current light to all lanes
            lightAccumulator += lightData;
        }
    }
    output[tid.xy] = lightAccumulator;
}
This code will actually beat scalar-optimized code on AMD, since one vector load is faster than 64 scalar loads. Another advantage is that this optimization works for all buffer and texture types, not just for constant/raw/structured buffers.
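The loops above accumulate a single float4 per light to keep the example short. As a rough sketch of how the same pattern might look with the position, radius and color data mentioned in the code comments, the version below assumes a made-up layout of two consecutive float4 entries per light, a hypothetical `worldPositions` input texture, and a placeholder distance falloff. None of these come from the original code; only the load-then-broadcast structure does.

```hlsl
// Assumed layout (not from the original): two float4 entries per light.
//   [2 * n + 0] = position.xyz, radius
//   [2 * n + 1] = color.rgb, unused
Buffer<float4> lightDatas;
Texture2D<uint2> lightStartCounts;
Texture2D<float4> worldPositions;   // hypothetical per-pixel world position input
RWTexture2D<float4> output;

[numthreads(8, 8, 1)]
void CSMainWaveLit(uint3 tid : SV_DispatchThreadID, uint3 gid : SV_GroupID)
{
    const uint waveLanes = WaveGetLaneCount();
    const uint lane = WaveGetLaneIndex();
    uint2 lightStartCount = lightStartCounts[gid.xy];
    float3 pixelPos = worldPositions[tid.xy].xyz;
    float3 accum = 0;

    for (uint base = 0; base < lightStartCount.y; base += waveLanes)
    {
        // Each lane loads both float4s of its own light: two vector loads per batch.
        uint myLight = lightStartCount.x + base + lane;
        float4 posRadius = lightDatas[2 * myLight + 0];
        float4 color = lightDatas[2 * myLight + 1];

        // The last batch may be partial.
        uint batch = min(waveLanes, lightStartCount.y - base);
        for (uint j = 0; j < batch; ++j)
        {
            // Broadcast light j of this batch to all lanes.
            float4 pr = WaveReadLaneAt(posRadius, j);
            float3 c = WaveReadLaneAt(color, j).rgb;

            // Placeholder falloff, only here to give the loop some per-pixel work.
            float dist = distance(pixelPos, pr.xyz);
            float atten = saturate(1.0 - dist / max(pr.w, 1e-6));
            accum += c * atten * atten;
        }
    }
    output[tid.xy] = float4(accum, 1);
}
```

The register cost grows with the light size: each lane now holds two float4 values per batch, but they are still unique per lane rather than replicated.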
According to my GPU buffer benchmark (https://github.com/sebbbi/perftest), new Nvidia Maxwell/Pascal/Volta/Turing drivers employ a uniform address optimization for all buffer and texture types. I have seen 20x-27x performance increases in cases where all threads inside a loop load from a uniform address. My guess is that this Nvidia optimization, employed in their new compiler, is similar to the optimization I described above. I don’t have access to their PIX plugin to validate my claim. Maybe someone with access to that plugin can validate my assumption (if NDA is not a problem).
Recent AMD drivers employ a similar optimization for global atomic add. This can be validated with publicly available tools such as RenderDoc. If all lanes add C or 0, the compiler first does WavePrefixCountBits (MBCNT) and then a single-lane global atomic, reducing serialization on the atomic counter by 64x.
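The same aggregation can also be written by hand with wave intrinsics. The following is a minimal sketch of that idea, not the code the AMD compiler actually generates. The `counter` and `appended` buffers, the `AllocateSlot` helper and the `CSAppend` entry point are hypothetical names made up for illustration, and the "keep" predicate is a placeholder.

```hlsl
RWByteAddressBuffer counter;        // hypothetical: one uint counter at byte offset 0
RWStructuredBuffer<uint> appended;  // hypothetical output list

// Returns a unique slot for every lane that passes wantsSlot = true,
// using one global atomic per wave instead of one per lane.
// Must be called by all lanes that share the same control flow.
uint AllocateSlot(bool wantsSlot)
{
    // Number of lower-indexed active lanes that also want a slot.
    uint laneOffset = WavePrefixCountBits(wantsSlot);
    // Total number of slots needed by this wave.
    uint waveTotal = WaveActiveCountBits(wantsSlot);

    uint waveBase = 0;
    if (WaveIsFirstLane())
    {
        // Single-lane global atomic: reserves waveTotal slots at once.
        counter.InterlockedAdd(0, waveTotal, waveBase);
    }
    // Broadcast the reserved base from the first lane to the whole wave.
    waveBase = WaveReadLaneFirst(waveBase);
    return waveBase + laneOffset;
}

[numthreads(64, 1, 1)]
void CSAppend(uint3 tid : SV_DispatchThreadID)
{
    bool keep = (tid.x & 1) != 0;   // placeholder predicate
    uint slot = AllocateSlot(keep);
    if (keep)
    {
        appended[slot] = tid.x;
    }
}
```

Lanes that pass `false` still get a return value, but it is not a reserved slot and should not be used, which is why the write is guarded by `keep`.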
I hope that AMD also implements the optimization above in their shader compiler. In AMD's case, this optimization is actually super good, since in most cases AMD also gets the full memory coalescing gain from it, resulting in a 4x load issue rate.
Shader playground link: http://shader-playground.timjones.io/04cb6af828fe0e1a48d651e25d06cffb