Performance of gaussian blur with linear sampling

Started by November 21, 2019 06:13 AM
7 comments, last by _Flame_ 5 years, 2 months ago

Hello. According to the article "Efficient Gaussian blur with linear sampling", it is better to reduce the number of loop iterations (texture fetches) in the Gaussian blur fragment shader by using bilinear interpolation.
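For readers who have not seen the technique, here is a minimal Python sketch of the idea rather than the article's or the poster's actual shader. It merges each pair of neighbouring discrete taps into one bilinear fetch placed between the two texels, so a 5-tap pass needs only 3 fetches. The function names, the 1D `sample_linear` stand-in for the GPU's bilinear fetch, and the binomial weights 6/16, 4/16, 1/16 are all assumptions made for illustration.

```python
import math

def merge_taps(weights):
    """weights[i] is the discrete weight at texel offset i (symmetric kernel).
    Assumes an even number of side taps, so they pair up as (1,2), (3,4), ...
    Returns (offsets, merged_weights) for the centre tap plus one side."""
    offsets = [0.0]
    merged = [weights[0]]
    for i in range(1, len(weights) - 1, 2):
        w = weights[i] + weights[i + 1]
        # Weighted average position: bilinear filtering then reproduces
        # the two original weights exactly.
        offsets.append((i * weights[i] + (i + 1) * weights[i + 1]) / w)
        merged.append(w)
    return offsets, merged

def sample_linear(row, x):
    """1D stand-in for a bilinear texture fetch with clamp-to-edge."""
    i = math.floor(x)
    f = x - i
    a = row[min(max(i, 0), len(row) - 1)]
    b = row[min(max(i + 1, 0), len(row) - 1)]
    return a * (1 - f) + b * f

def blur_discrete(row, weights):
    """One blur pass with one fetch per discrete tap (5 fetches for 5 taps)."""
    out = []
    for x in range(len(row)):
        acc = weights[0] * sample_linear(row, x)
        for i in range(1, len(weights)):
            acc += weights[i] * (sample_linear(row, x + i) + sample_linear(row, x - i))
        out.append(acc)
    return out

def blur_linear(row, offsets, merged):
    """The same pass with merged taps (3 fetches for 5 taps)."""
    out = []
    for x in range(len(row)):
        acc = merged[0] * sample_linear(row, x)
        for o, w in zip(offsets[1:], merged[1:]):
            acc += w * (sample_linear(row, x + o) + sample_linear(row, x - o))
        out.append(acc)
    return out

# Hypothetical 5-tap binomial kernel; not the weights from the original post.
weights = [6 / 16, 4 / 16, 1 / 16]
offsets, merged = merge_taps(weights)
print(offsets, merged)
```

Both passes produce the same result up to floating-point rounding; the merged version just trades two fetches per side for one filtered fetch at a fractional offset.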

I did some experiments, and it is indeed better, but only if the framebuffer texture format is not wide. I get a big performance improvement (about 25%) with this approach when I use the GL_RGB16F texture format. But when I use GL_RGB32F, performance instead drops by about the same 25%. Could someone comment on that?

I am experimenting on an NVIDIA P1000 video card.

 

BTW, I use apitrace to see the performance difference for a specific shader program.

Texture sampling performance is also predicated on bandwidth. So going from 16-bit float to 32-bit float is theoretically doubling your bandwidth. So it would be unrealistic to expect the same performance given the bandwidth difference. 

1 hour ago, cgrant said:

Texture sampling performance is also predicated on bandwidth. So going from 16-bit float to 32-bit float is theoretically doubling your bandwidth. So it would be unrealistic to expect the same performance given the bandwidth difference.

Do you mean that linear sampling is worse than more loop iterations in the shader because of a bandwidth bottleneck?

It'd be interesting to see how you measured.

I highly doubt that using a 5x5 filter realised via discretised loads (not samples) on a RGB32F could possibly be faster than a 3x3 filter realised via bilinear samples (not loads) on the same RGB32F texture.

The reason is that the underlying memory will be organised in such a fashion that the linear samples will hit the caches quite as effectively as the discretised loads, the amount of memory transferred will be the same, plus the fixed GPU texture sampling hardware will return the mix of the four samples for free, compared to wasting ALU instructions on doing it yourself.

 

The texture filtering units on most GPU's out in the wild have varying cycle counts for different formats. It's not at all uncommon to have 1/2 rate for 64bpp formats and 1/4 rate for 128bpp formats. Generally you want to avoid 128bpp formats anyway, since they are rarely necessary in graphics and consume a lot of memory + bandwidth.
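To make the rates above concrete, here is a toy Python model, a sketch rather than a description of any specific GPU. It only encodes the rates quoted in this post (full rate assumed up to 32bpp, 1/2 rate at 64bpp, 1/4 rate at 128bpp); the function name and the rate table are hypothetical, and how a driver actually stores GL_RGB16F or GL_RGB32F (for example, whether it pads them to four channels) is not something the thread establishes.

```python
from fractions import Fraction

# Assumed filtering rates per texel size, taken from the figures quoted above.
# Real hardware may differ; check vendor documentation for a given GPU.
FILTER_RATE = {
    32: Fraction(1, 1),   # e.g. RGBA8
    64: Fraction(1, 2),   # e.g. RGBA16F
    128: Fraction(1, 4),  # e.g. RGBA32F
}

def relative_filter_cost(samples, bits_per_texel):
    """Filtered fetches per pixel, expressed in full-rate fetch units."""
    return samples / FILTER_RATE[bits_per_texel]

for bpp in (32, 64, 128):
    print(bpp, relative_filter_cost(3, bpp))
```

Under these assumptions the same 3 filtered fetches cost twice as much at 128bpp as at 64bpp, on top of the extra bandwidth.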

9 hours ago, MJP said:

The texture filtering units on most GPU's out in the wild have varying cycle counts for different formats. It's not at all uncommon to have 1/2 rate for 64bpp formats and 1/4 rate for 128bpp formats. Generally you want to avoid 128bpp formats anyway, since they are rarely necessary in graphics and consume a lot of memory + bandwidth.

That makes sense. It would be far-fetched to expect the same filtering cycle count for all formats. Is there documentation for this?

@pcmaster It is not 5x5 and 3x3 but 5 + 5 and 3 + 3.

Yeah, I'm sorry about that 5x5. But my argument (guesstimate) still holds.

Maybe MJP will point us to some documentation that says that 128bpp formats (such as RGBA32F) have 1/4 rate, which is definitely true for the AMD GCN. I remember having read it but I cannot seem to find it in the AMD GCN ISA whitepaper nor in the AMD GCN Architecture whitepaper (same for the newer AMD RDNA).

Nevertheless, the "AMD RDNA Architecture" whitepaper on page 21 says:

Quote

the texture sampling and interpolation for pixels using FP16 per channel has doubled and is on par with INT8 data

This suggests that the previous architecture (GCN) had 1/1 rate for int8 (such as rgba8), 1/2 rate for fp16 or int16, and 1/4 rate for fp32 or int32. It is probably also stated explicitly somewhere, but I couldn't find it in 10 minutes :(

If I understood correctly, the root cause is not the sampling itself but the interpolation. With the interpolation shader we reduce the number of texture reads but add interpolation.

I use apitrace to measure shader program performance. Here are screenshots with the results. The shaders responsible for the Gaussian filtering are outlined in black. The "Avg GPU time" column is the one to look at: it shows how much time rendering with a shader took per frame. There are 2 shaders because the blur is done in 2 passes (vertical and horizontal).

 

5 + 5 gaussian blur, GL_RGB16F (screenshot)

3 + 3 gaussian blur with interpolation, GL_RGB16F (screenshot)

5 + 5 gaussian blur, GL_RGB32F (screenshot)

3 + 3 gaussian blur with interpolation, GL_RGB32F (screenshot)

 

Summary:

5 + 5 gaussian blur GL_RGB16F: 27.6 and 24.5

3 + 3 gaussian blur with interpolation GL_RGB16F: 19.5 and 17.4

 

5 + 5 gaussian blur GL_RGB32F: 43.5 and 48.8

3 + 3 gaussian blur with interpolation GL_RGB32F: 49.3 and 55.9

 

We see that in the case of GL_RGB32F there is a definite performance drop.
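As a quick check of the summary, the small Python sketch below adds up the two passes for each variant and computes the relative change. The numbers are the "Avg GPU time" values listed above (units as reported by apitrace); the dictionary layout and function name are just illustrative.

```python
# "Avg GPU time" per pass, copied from the summary above.
timings = {
    "GL_RGB16F": {"discrete": (27.6, 24.5), "linear": (19.5, 17.4)},
    "GL_RGB32F": {"discrete": (43.5, 48.8), "linear": (49.3, 55.9)},
}

def relative_change_percent(baseline, candidate):
    """Change of the summed two-pass time, in percent of the baseline."""
    base = sum(baseline)
    return (sum(candidate) - base) / base * 100.0

for fmt, t in timings.items():
    change = relative_change_percent(t["discrete"], t["linear"])
    print(f"{fmt}: {change:+.1f}% with linear sampling")
```

On these figures, linear sampling takes roughly 29% less time for GL_RGB16F and roughly 14% more for GL_RGB32F.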
