首页 WaveLane
文章
取消

WaveLane

In shader programming, you often run into a problem where you want to iterate an array in memory over all pixels in a compute shader group (tile). Tiled deferred lighting is the most common case. 8x8 tile loops over a light list culled for that tile.

Simplified HLSL code looks like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
Buffer<float4> lightDatas;
Texture2D<uint2> lightStartCounts;
RWTexture2D<float4> output;

[numthreads(8, 8, 1)]
void CSMain(uint3 tid : SV_DispatchThreadID, uint3 gid : SV_GroupID)
{
    uint2 lightStartCount = lightStartCounts[gid.xy];    
    float4 lightAccumulator = 0;
    
    for (uint i = 0; i < lightStartCount.y; ++i)
    {
        uint lightIndex = lightStartCount.x + i;
        
        // In real impl light data would be position, radius, color, etc.
        float4 lightData = lightDatas[lightIndex];  // 64-wide mem load. All lanes load from same address.
        lightAccumulator += lightData;
    }
    
    output[tid.xy] = lightAccumulator;
}

The problem with this code is that we load the same light data for each lane. If you would have used constant buffer instead of Buffer, the compiler would have generated special code utilizing GPU specific uniform load facility like constant buffer hardware (Nvidia) or scalar unit (AMD). But this can be problematic, since constant buffers have 64KB size limit, and constant buffer array elements must be float4/int4 (16 byte alignment). Not even enough space to hold 1080p tile light lists.

Fortunately with SM 6.0 wave intrinsics we can do better. We can load 32 (Nvidia) or 64 (AMD) ligths at once using a single load instruction and then use WaveReadLaneAt to broadcast light data from one lane to all lanes, one lane at a time. This reduces the number of load instructions by 32x / 64x. Also there’s no register bloat, since each loaded lane has now unique data, instead of replicated value. Register file usage is as good as with AMD scalar loads.

Optimized code looks like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
[numthreads(8, 8, 1)]
void CSMainWave(uint3 tid : SV_DispatchThreadID, uint3 gid : SV_GroupID, uint gix : SV_GroupIndex)
{
    const uint WAVE_LANES = WaveGetLaneCount(); // Usually 64 or 32. Compile time constant.
    
    uint2 lightStartCount = lightStartCounts[gid.xy];
    float4 lightAccumulator = 0;

    for (uint i = 0; i < divRoundUp(lightStartCount.y, WAVE_LANES); ++i)
    {
        // Load 32/64 unique items at once in outer loop. 32x/64x reduction in mem loads.
        float4 lightData64 = lightDatas[lightStartCount.x + gix + i * WAVE_LANES];

        uint loopIters = min(WAVE_LANES, lightStartCount.y - WAVE_LANES * i);
        for (uint i2 = 0; i2 < loopIters; ++i2)
        {
            // In real impl light data would be position, radius, color, etc.
            float4 lightData = WaveReadLaneAt(lightData64, i2); // Broadcast current light to all lanes
            lightAccumulator += lightData;
        }
    }
    
    output[tid.xy] = lightAccumulator;
}

This code actually will beat scalar optimized code on AMD, since one vector load is faster than 64 scalar loads. Another advantage is that this optimization works for all buffer and texture types, not just for constant/raw/structured buffers.

According to my GPU buffer benchmark (https://github.com/sebbbi/perftest), new Nvidia Maxwell/Pascal/Volta/Turing drivers employ uniform address optimization for all buffer and texture types. I have seen 20x-27x performance increases for cases where all threads inside a loop load from uniform address. My guess is this Nvidia optimization employed in their new compiler is similar to the optimization I described above. I don’t have access to their PIX plugin to validate my claim. Maybe someone with access to that plugin can validate my assumption (if NDA is not a problem).

Recent AMD drivers employ a similar optimization for global atomic add. This can be validated with publicly available tools such as RenderDoc. If all lanes add C or 0, the compiler first does WavePrefixCountBits (MBCNT) and then a single lane global atomic, reducing serialization to atomic counter by 64x.

I hope that AMD also implements the optimization above to their shader compiler. In AMDs case, this optimization is actually super good, since in most cases AMD gets also full memory coalescing gain from this optimizationg, resulting in 4x load issue rate.

Shader playground link: http://shader-playground.timjones.io/04cb6af828fe0e1a48d651e25d06cffb

reference

本文由作者按照 CC BY 4.0 进行授权

三角函数的 FP64 近似

VS 中一键 Attach 到指定进程