Hierarchical Z-Buffer Occlusion Culling (Updated 07.15.2011)
Published February 27, 2011
Updates:
- Fixed some code in the baking step
- Rewrote culling shader to use a variable number of samples
So here's my take on the GPU-based Hierarchical Z-Buffer Occlusion Culling technique that you might have already heard about. Implementations have already been presented at rastergrid.com and nickdarnell.com, so I'll skip most of the explanation.
If you're like me and believe in geometry culling, you might feel tempted to try it in your engine. In the past few days, I've been experimenting with it to see how it competes with the existing occlusion culling algorithm in my engine: CHC (Coherent Hierarchical Culling) (more about this later).
There are a lot of things one can do wrong in HIZ, and the smallest mistake can have a dramatic effect: incorrect culling. One of the implementations linked above indeed falls victim to such an error. Since I assumed it to be correct, it took me a while to spot the mistake. Let me explain:
The implementation at rastergrid.com computes the mip level using half the view size of the bounding rectangle of the object. In code it looks like this:
float LOD = ceil( log2( max( ViewSizeX, ViewSizeY ) / 2.0 ) );
This assumption is correct for this case:
But a simple translation of the same rectangle will result in a different coverage:
Suddenly 6 or even 9 texels of that mip level are covered by the view size rectangle! The texels in the middle won't be sampled, and this results in false culling. A quick check showed that about half the rectangles on the screen are affected by these cases.
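To make the problem concrete, here is a small CPU-side sketch in C++. It is not taken from either implementation; the helper names lodHalfSize and texelsTouched are made up for illustration, and sizes and positions are assumed to be given in level-0 pixels. It reproduces the formula quoted above and counts how many texels of the chosen mip level a rectangle overlaps along one axis.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Mip level as computed by the formula quoted above (half the larger extent).
// Hypothetical helper, sizes are in level-0 pixels.
int lodHalfSize(float viewSizeX, float viewSizeY)
{
    assert(viewSizeX > 0.0f && viewSizeY > 0.0f);
    return static_cast<int>(std::ceil(std::log2(std::max(viewSizeX, viewSizeY) / 2.0f)));
}

// Number of texels of mip level `lod` overlapped along one axis by the
// half-open interval [minPx, minPx + sizePx), measured in level-0 pixels.
int texelsTouched(float minPx, float sizePx, int lod)
{
    assert(lod >= 0 && sizePx > 0.0f);
    const float texel = static_cast<float>(1 << lod);   // texel width in pixels
    const int first = static_cast<int>(std::floor(minPx / texel));
    const int last  = static_cast<int>(std::ceil((minPx + sizePx) / texel)) - 1;
    return last - first + 1;
}
```

For an 8x8 pixel rectangle the formula picks mip 2, where one texel spans 4 pixels. Starting at x = 0, texelsTouched reports 2 texels per axis (a 2x2 block). Shift the same rectangle by 2 pixels and it reports 3 per axis, so in 2D up to 3x3 = 9 texels are covered and the ones in the middle are never read.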
In my implementation I'm using a different way to compute the mip level, which accounts for these cases. Since the last update of this journal entry, I also support a higher number of samples, because I found out that taking more samples actually increases performance (culling gets better and you don't have to compute the whole mip map chain). But before we jump into the code, I want to give my opinion on the algorithm.
In theory, the algorithm has a lot of advantages compared to occlusion query based algorithms:
- Less stalling (just once for fetching the results from the GPU (dx9))
- One draw call for culling many, many objects
Cons:
[s]- Very coarse culling for objects that are big in screen space. Can make it pretty useless in a lot of cases. For more effective culling you would probably have to take 16 or more samples.[/s] (Fixed this issue with the last update. My assumption was correct.)
[s]- Baking the HIZ map isn't exactly free, even with ping-pong optimizations and all that. You can use a coarser map, but then the culling becomes even less effective. Rendering the HIZ map actually added more constant overhead to my engine than my CHC implementation ever did.[/s] (Haven't compared this since the last update)
Now it's time for some code. Except for the mip calculation and some other tweaks, the implementation is similar to the one at rastergrid.com. I'm using DirectX 9 for this one.
Implementation
HIZ culling vertex shader:
VS_outputAABB vs_HIZ(VS_inputAABB input)
{
    VS_outputAABB output;

    // get the dimensions of the AABB in world space
    const float boxWidth = input.aabbMax.x - input.aabbMin.x;
    const float boxHeight = input.aabbMax.y - input.aabbMin.y;
    const float3 boxDepth = float3(0, 0, input.aabbMax.z - input.aabbMin.z);

    // build the 8 box corners in world space
    float3 boxConers[8];
    boxConers[0] = input.aabbMin.xyz + float3(0, boxHeight, 0);
    boxConers[1] = input.aabbMin.xyz + float3(boxWidth, boxHeight, 0);
    boxConers[2] = input.aabbMin.xyz;
    boxConers[3] = input.aabbMin.xyz + float3(boxWidth, 0, 0);
    boxConers[4] = boxConers[0] + boxDepth;
    boxConers[5] = boxConers[1] + boxDepth;
    boxConers[6] = boxConers[2] + boxDepth;
    boxConers[7] = boxConers[3] + boxDepth;

    // compute the screen space rectangle of the AABB
    // init the rectangle with the opposite max values
    output.rectSS.x = 1.0f;   // left
    output.rectSS.z = -1.0f;  // right
    output.rectSS.y = -1.0f;  // top
    output.rectSS.w = 1.0f;   // bottom

    float depth = 1.0f;
    for(int i = 0; i < 8; ++i)
    {
        // transform the corner into view space, then into clip space
        float4 cornerVS = mul(float4(boxConers[i], 1.0f), MatView);
        float4 cornerSS = mul(cornerVS, MatProj);

        // perspective divide
        boxConers[i] = cornerSS.xyz / cornerSS.w;

        // get the min depth of all corners (linear view space depth)
        depth = min(depth, cornerVS.z / FARCLIPDIST);

        // get the max coverage of the screen space box
        output.rectSS.x = min(boxConers[i].x, output.rectSS.x);
        output.rectSS.z = max(boxConers[i].x, output.rectSS.z);
        output.rectSS.y = max(boxConers[i].y, output.rectSS.y);
        output.rectSS.w = min(boxConers[i].y, output.rectSS.w);
    }

    // transform to normalized screen space coords
    output.rectSS += 1.0f;
    output.rectSS *= 0.5f;

    // clamp values to 0-1
    output.rectSS = saturate(output.rectSS);
    output.depth = depth;

    // render the point to its position in the output texture
    input.pos /= HizBufSize;
    output.pos.x = input.pos;
    output.pos.y = 1.0f;
    output.pos.xy -= 0.5f;
    output.pos.xy *= 2.0f;
    output.pos.zw = 1.0f;

    return output;
}
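When debugging the vertex shader, it can help to run the same projection on the CPU and compare the rectangles. The following C++ sketch mirrors the steps above under a few assumptions of mine: matrices are row-major and applied as v * M (as mul(v, M) does in the shader), the types Vec3, Vec4, Mat4 and ScreenRect and the function projectAabb are hypothetical names, and the box is assumed to lie completely in front of the camera (the sketch asserts on w <= 0, a case the shader above does not treat specially).

```cpp
#include <algorithm>
#include <array>
#include <cassert>

struct Vec3 { float x, y, z; };
struct Vec4 { float x, y, z, w; };
using Mat4 = std::array<float, 16>;   // row-major, used as v * M (row vectors)

// left/top/right/bottom in [0,1] screen space, minDepth = view z / far clip
struct ScreenRect { float left, top, right, bottom, minDepth; };

Vec4 mul(const Vec4& v, const Mat4& m)
{
    return { v.x * m[0] + v.y * m[4] + v.z * m[8]  + v.w * m[12],
             v.x * m[1] + v.y * m[5] + v.z * m[9]  + v.w * m[13],
             v.x * m[2] + v.y * m[6] + v.z * m[10] + v.w * m[14],
             v.x * m[3] + v.y * m[7] + v.z * m[11] + v.w * m[15] };
}

ScreenRect projectAabb(const Vec3& mn, const Vec3& mx,
                       const Mat4& view, const Mat4& proj, float farClip)
{
    assert(farClip > 0.0f);
    // start with the opposite extremes, like the shader does
    ScreenRect r = { 1.0f, -1.0f, -1.0f, 1.0f, 1.0f };
    for (int i = 0; i < 8; ++i)
    {
        // bit 0/1/2 of i selects min or max on x/y/z
        const Vec4 corner = { (i & 1) ? mx.x : mn.x,
                              (i & 2) ? mx.y : mn.y,
                              (i & 4) ? mx.z : mn.z, 1.0f };
        const Vec4 vs = mul(corner, view);
        const Vec4 cs = mul(vs, proj);
        assert(cs.w > 0.0f);   // assumption: box is fully in front of the camera
        const float x = cs.x / cs.w;
        const float y = cs.y / cs.w;
        r.minDepth = std::min(r.minDepth, vs.z / farClip);
        r.left   = std::min(r.left, x);
        r.right  = std::max(r.right, x);
        r.top    = std::max(r.top, y);
        r.bottom = std::min(r.bottom, y);
    }
    // map from [-1,1] to [0,1] and clamp, as saturate() does
    auto toUnit = [](float v) { return std::min(1.0f, std::max(0.0f, (v + 1.0f) * 0.5f)); };
    r.left = toUnit(r.left);   r.right  = toUnit(r.right);
    r.top  = toUnit(r.top);    r.bottom = toUnit(r.bottom);
    return r;
}
```

With identity view and projection matrices and a far clip distance of 1, a box from (-0.5, -0.5, 0.2) to (0.5, 0.5, 0.4) should come out as left 0.25, right 0.75, top 0.75, bottom 0.25 and a minimum depth of 0.2.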
HIZ culling pixel shader:
PS_output ps_HIZ(in VS_outputAABB input)
{
    PS_output output = (PS_output)0;

    // numSamples is the number of samples to take. This also defines how deep your mip chain has to be:
    // maxMipNeeded = max(0, ceil(log2(max(HizDimX, HizDimY) / numSamples)))
    // numSamples should be a power of 2
    output.col = 1.0f;

    const float widthSS = (input.rectSS.z - input.rectSS.x);
    const float heightSS = (input.rectSS.y - input.rectSS.w);
    const float maxSizeSS = max(widthSS * HizDim.x, heightSS * HizDim.y) / numSamples;
    const float mip = max(0, ceil(log2(maxSizeSS)));
    const float2 bOffset = 0.5f / HizDim;

    float HIZdepth = 0;
    float yPos = 1.0f - input.rectSS.y;
    const float stepX = widthSS / (numSamples + 1);
    const float stepY = heightSS / (numSamples + 1);

    // even mips are read from texDepthMap0, odd mips from texDepthMap1
    const bool mSampler = (mip % 2) == 0;

    for(int y = 0; y < numSamples + 1; ++y)
    {
        float xPos = input.rectSS.x;
        for(int x = 0; x < numSamples + 1; ++x)
        {
            const float2 nCoords0 = float2(xPos, yPos) + bOffset;
            if(mSampler)
                HIZdepth = max(HIZdepth, tex2Dlod(texDepthMap0, float4(nCoords0, 0, mip)).x);
            else
                HIZdepth = max(HIZdepth, tex2Dlod(texDepthMap1, float4(nCoords0, 0, mip)).x);
            xPos += stepX;
        }
        yPos += stepY;
    }

    // the object is occluded if its nearest point is behind the farthest stored depth
    if(input.depth > HIZdepth)
        output.col = 0.0f;

    return output;
}
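The mip selection in this shader is easy to check on the CPU. The C++ sketch below is only an illustration with made-up helper names (selectMip, maxMipNeeded, sampleStepInTexels); it restates the formulas from the shader and its comment, with sizes given as fractions of the screen and the HIZ map dimensions in texels. The last helper expresses the distance between two neighbouring samples in texels of the selected mip, which is the quantity that decides whether a texel can slip through between samples.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Mip level used for sampling: max(0, ceil(log2(maxSizeInTexels / numSamples))).
// widthSS/heightSS are in [0,1], hizW/hizH are the level-0 dimensions of the HIZ map.
int selectMip(float widthSS, float heightSS, float hizW, float hizH, int numSamples)
{
    assert(numSamples > 0);
    const float maxSize = std::max(widthSS * hizW, heightSS * hizH) / numSamples;
    if (maxSize <= 1.0f)
        return 0;                     // clamp to mip 0, also avoids log2(0)
    return static_cast<int>(std::ceil(std::log2(maxSize)));
}

// Deepest mip that can ever be requested: a rectangle covering the whole screen.
int maxMipNeeded(float hizW, float hizH, int numSamples)
{
    return selectMip(1.0f, 1.0f, hizW, hizH, numSamples);
}

// Distance between neighbouring samples along one axis, in texels of `mip`.
float sampleStepInTexels(float sizeSS, float hizDim, int numSamples, int mip)
{
    assert(numSamples > 0 && mip >= 0);
    const float stepPx = sizeSS * hizDim / static_cast<float>(numSamples + 1);
    return stepPx / static_cast<float>(1 << mip);
}
```

For a 512x256 HIZ map and numSamples = 4, a rectangle covering a quarter of the screen width and an eighth of its height gets mip 5, and the chain never needs to go deeper than mip 7. The sample step in that case is 0.8 texels. Since a texel of the selected mip is at least size / numSamples wide and the step is size / (numSamples + 1), neighbouring samples are always less than one texel apart, so no texel in the middle of the rectangle is skipped.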
Baking HIZ map pixel shader:
PS_output ps_depthMap(in VS_output input, in float2 PositionSS : VPOS)
{
    PS_output output = (PS_output)0;

    // LastMipInfo holds width, height and index of the mip level being read
    const float width = LastMipInfo.x;
    const float height = LastMipInfo.y;
    const float mip = LastMipInfo.z;
    const float2 texelDim = 1.0f / float2(width, height);

    // the 2x2 block of source texels that maps to this output texel
    const float2 nCoords0 = float2((PositionSS.x * 2.0f) / width, (PositionSS.y * 2.0f) / height) + 0.5f * texelDim;
    const float2 nCoords1 = float2(nCoords0.x + texelDim.x, nCoords0.y);
    const float2 nCoords2 = float2(nCoords0.x, nCoords0.y + texelDim.y);
    const float2 nCoords3 = float2(nCoords1.x, nCoords2.y);

    // true for the last column/row if the source mip has an odd size
    const bool oddX = OddSize.x && PositionSS.x * 2 == width - 3;
    const bool oddY = OddSize.y && PositionSS.y * 2 == height - 3;

    float4 vTexels;
    vTexels.x = tex2Dlod(texDepthMap0, float4(nCoords0, 0, mip)).x;
    vTexels.y = tex2Dlod(texDepthMap0, float4(nCoords1, 0, mip)).x;
    vTexels.z = tex2Dlod(texDepthMap0, float4(nCoords2, 0, mip)).x;
    vTexels.w = tex2Dlod(texDepthMap0, float4(nCoords3, 0, mip)).x;
    output.col = max(max(vTexels.x, vTexels.y), max(vTexels.z, vTexels.w));

    // include the extra column of an odd-sized source
    if(oddX)
    {
        const float extra1 = tex2Dlod(texDepthMap0, float4(nCoords0 + float2(texelDim.x * 2.0f, 0), 0, mip)).x;
        const float extra2 = tex2Dlod(texDepthMap0, float4(nCoords0 + float2(texelDim.x * 2.0f, texelDim.y), 0, mip)).x;
        output.col = max(output.col, max(extra1, extra2));
    }

    // include the extra row of an odd-sized source
    if(oddY)
    {
        const float extra1 = tex2Dlod(texDepthMap0, float4(nCoords0 + float2(0, texelDim.y * 2.0f), 0, mip)).x;
        const float extra2 = tex2Dlod(texDepthMap0, float4(nCoords0 + float2(texelDim.x, texelDim.y * 2.0f), 0, mip)).x;
        output.col = max(output.col, max(extra1, extra2));
    }

    // include the extra corner texel if both dimensions are odd
    if(oddX && oddY)
        output.col = max(output.col, tex2Dlod(texDepthMap0, float4(nCoords0 + texelDim * 2.0f, 0, mip)).x);

    return output;
}
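The odd-size handling is the part of the baking step that is easiest to get wrong, so here is a CPU reference in C++ that could be used to validate the shader output. It is a sketch under my own assumptions, not the engine's code: DepthMip and downsampleMax are hypothetical names, depth values are stored row by row, and each level is half the size of the previous one, rounded down but never below 1. As in the shader, the last column and row of the destination absorb the leftover texels of an odd-sized source.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One mip level of the HIZ map, depth values stored row by row.
struct DepthMip
{
    int w, h;
    std::vector<float> d;
};

// Builds the next mip level by taking the maximum (farthest) depth of each
// 2x2 block. For odd source sizes the last destination column/row also covers
// the leftover third source column/row, so no source texel is dropped.
DepthMip downsampleMax(const DepthMip& src)
{
    assert(src.w > 0 && src.h > 0);
    assert(src.d.size() == static_cast<std::size_t>(src.w) * static_cast<std::size_t>(src.h));

    const int dw = std::max(1, src.w / 2);
    const int dh = std::max(1, src.h / 2);
    DepthMip dst{ dw, dh, std::vector<float>(static_cast<std::size_t>(dw) * static_cast<std::size_t>(dh)) };

    for (int y = 0; y < dh; ++y)
    {
        for (int x = 0; x < dw; ++x)
        {
            const int x0 = 2 * x;
            const int y0 = 2 * y;
            // the last column/row extends to the edge of the source
            const int x1 = (x == dw - 1) ? src.w - 1 : x0 + 1;
            const int y1 = (y == dh - 1) ? src.h - 1 : y0 + 1;

            float m = src.d[static_cast<std::size_t>(y0) * src.w + x0];
            for (int sy = y0; sy <= y1; ++sy)
                for (int sx = x0; sx <= x1; ++sx)
                    m = std::max(m, src.d[static_cast<std::size_t>(sy) * src.w + sx]);

            dst.d[static_cast<std::size_t>(y) * dw + x] = m;
        }
    }
    return dst;
}
```

A quick sanity check is a 3x3 source whose only far value sits in the bottom right corner: the 1x1 result has to carry that value. If it does not, the odd row or column is being lost, which would lead to false culling later on.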
If you find any errors, I'd highly appreciate it if you pointed them out.