How to use dynamic branching to skip unnecessary instructions

2 comments, last by zubetto 2 years, 1 month ago

I would like to use dynamic branching to skip unnecessary instructions. Please consider two functions:

float computeFirst(float s)
{
   [branch] if(abs(s) > 1.0)
       return -1.0;
   
   // a bunch of instructions
   
   return acos(s); // acos just for example
}

float computeSecond(float s)
{
   [branch] if(abs(s) > 1.0)
   {
       return -1.0;
   }
   else    
   {
       // a bunch of instructions
       
       return acos(s); // acos just for example
   }
}

Are these functions equivalent? Both use a dynamic branch, but do they work the same way, and are the unnecessary instructions actually skipped when all the pixels in a warp take the same branch?

Using Shader Playground, I found that these two functions compile differently:

// computeFirst
ps_5_0
dcl_globalFlags refactoringAllowed
dcl_input_ps linear v0.x
dcl_output o0.xyzw
dcl_temps 2
lt r0.x, l(1.000000), |v0.x|
if_nz r0.x
 mov r0.y, l(-1.000000)
endif  
add r0.z, -|v0.x|, l(1.000000)
sqrt r0.z, r0.z
mad r0.w, |v0.x|, l(-0.018729), l(0.074261)
mad r0.w, r0.w, |v0.x|, l(-0.212114)
mad r0.w, r0.w, |v0.x|, l(1.570729)
mul r1.x, r0.z, r0.w
mad r1.x, r1.x, l(-2.000000), l(3.141593)
lt r1.y, v0.x, -v0.x
and r1.x, r1.y, r1.x
mad r0.z, r0.w, r0.z, r1.x
movc o0.x, r0.x, r0.y, r0.z
mov o0.yzw, l(0,0,0,0)
ret  
// Approximately 17 instruction slots used

// computeSecond
ps_5_0
dcl_globalFlags refactoringAllowed
dcl_input_ps linear v0.x
dcl_output o0.xyzw
dcl_temps 2
lt r0.x, l(1.000000), |v0.x|
if_nz r0.x
 mov r0.x, l(-1.000000)
else  
 add r0.y, -|v0.x|, l(1.000000)
 sqrt r0.y, r0.y
 mad r0.z, |v0.x|, l(-0.018729), l(0.074261)
 mad r0.z, r0.z, |v0.x|, l(-0.212114)
 mad r0.z, r0.z, |v0.x|, l(1.570729)
 mul r0.w, r0.y, r0.z
 mad r0.w, r0.w, l(-2.000000), l(3.141593)
 lt r1.x, v0.x, -v0.x
 and r0.w, r0.w, r1.x
 mad r0.x, r0.z, r0.y, r0.w
endif  
mov o0.x, r0.x
mov o0.yzw, l(0,0,0,0)
ret  
// Approximately 18 instruction slots used
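
The post doesn't show the entry point that was compiled. Judging by the declarations in the listings (a single input v0.x, the output o0.xyzw, and mov o0.yzw, l(0,0,0,0)), a wrapper along these lines would be consistent with them. The entry point name main and the TEXCOORD0 semantic are assumptions, not taken from the original post:

```hlsl
// Assumed test harness for the ps_5_0 listings (not shown in the original post).
float computeFirst(float s)
{
    [branch] if (abs(s) > 1.0)
        return -1.0;

    return acos(s); // acos just for example
}

// Hypothetical entry point: "main" and TEXCOORD0 are placeholders.
float4 main(float s : TEXCOORD0) : SV_Target
{
    // Only x carries the result; y, z and w are written as zero,
    // which corresponds to "mov o0.yzw, l(0,0,0,0)" in the listings.
    return float4(computeFirst(s), 0.0, 0.0, 0.0);
}
```

Swapping computeFirst for computeSecond in the wrapper should be enough to compare the two versions side by side.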

In computeFirst, the dynamic branch looks useless: it only wraps the mov of -1.0, and the acos instructions after the endif always execute, so nothing is ever skipped. Am I misunderstanding something, or are these two compiled versions really equivalent?


They're definitely equivalent in terms of the end result, but the compiler is clearly handling the first case differently in its codegen. fxc seems to be using an extremely weird interpretation of the if/return section: it generates a branch that only sets a temporary to -1, then evaluates the rest of the function anyway and uses a movc to select the result based on the value that was originally branched on. HLSL isn't really fully spec'd, so it would be hard to argue whether that interpretation is valid or not, but I agree it's definitely weird and undesirable. In my experience fxc has always handled early return statements strangely, and I used to avoid them entirely. The newer dxc compiler is much better about these sorts of things, but that's obviously not an option unless you're using D3D12.
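
One way to avoid the early return is to give the function a single exit and keep the expensive work inside the branch body. The following is a minimal sketch of that idea. computeSingleExit is a hypothetical name, and its fxc output has not been checked here, so whether it compiles to a real if block (as computeSecond does) should be confirmed in the disassembly:

```hlsl
// Sketch only: single-exit variant of computeFirst with no early return.
float computeSingleExit(float s)
{
    // Default result for out-of-range input.
    float result = -1.0;

    // Written as the negation of the original test rather than
    // abs(s) <= 1.0, so the source-level condition is the exact
    // complement of "abs(s) > 1.0" (this matters for NaN input).
    [branch] if (!(abs(s) > 1.0))
    {
        // a bunch of instructions

        result = acos(s); // acos just for example
    }

    return result;
}
```

This is the same structure as computeSecond with the cheap case folded into the initial value, so there is no code after the branch that the compiler could hoist out of it.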

MJP said:
The newer dxc compiler is much better about these sorts of things, but that's obviously not an option unless you're using D3D12

I'm using HLSL in Custom Expressions in UE4 Materials, so my understanding of things like the API, shader model, etc. is pretty basic. In UE4, I can set DirectX 12 as the default RHI. If I do this, does it mean that the material's HLSL will be compiled with dxc instead of fxc, and that I'll have to deal with DXIL (which looks completely unreadable) instead of DXBC when I want to debug or optimize my code? Or maybe I don't have to care about optimization with dxc?

Regarding the original question, it seems that computeFirst is optimized better without the [branch] attribute. I replaced the comment line with actual expressions, and now the compiled result depends on how many instructions can be skipped by dynamic branching:

float computeFirst(float s)
{
   if(abs(s) > 1.0)
       return -1.0;
   
   s = acos(s) / acos(-1.0);
   s = acos(s) / acos(-1.0);
   s = acos(s) / acos(-1.0);
   s = acos(s) / acos(-1.0);
   s = acos(s) / acos(-1.0); // if you comment out this line, dynamic branching will not be used
   
   return s;
}

ps_5_0
dcl_globalFlags refactoringAllowed
dcl_input_ps linear v0.x
dcl_output o0.xyzw
dcl_temps 1
ge r0.x, l(1.000000), |v0.x|
if_nz r0.x
 add r0.x, -|v0.x|, l(1.000000)
 sqrt r0.x, r0.x
 mad r0.y, |v0.x|, l(-0.018729), l(0.074261)
 mad r0.y, r0.y, |v0.x|, l(-0.212114)
 mad r0.y, r0.y, |v0.x|, l(1.570729)
 mul r0.z, r0.x, r0.y
 mad r0.z, r0.z, l(-2.000000), l(3.141593)
 lt r0.w, v0.x, -v0.x
 and r0.z, r0.w, r0.z
 mad r0.x, r0.y, r0.x, r0.z
 mul r0.y, r0.x, l(0.318310)
 mad r0.z, -r0.x, l(0.318310), l(1.000000)
 sqrt r0.z, r0.z
 mad r0.x, r0.x, l(-0.005962), l(0.074261)
 mad r0.x, r0.x, r0.y, l(-0.212114)
 mad r0.x, r0.x, r0.y, l(1.570729)
 mul r0.x, r0.z, r0.x
 mul r0.y, r0.x, l(0.318310)
 mad r0.z, -r0.x, l(0.318310), l(1.000000)
 sqrt r0.z, r0.z
 mad r0.x, r0.x, l(-0.005962), l(0.074261)
 mad r0.x, r0.x, r0.y, l(-0.212114)
 mad r0.x, r0.x, r0.y, l(1.570729)
 mul r0.x, r0.z, r0.x
 mul r0.y, r0.x, l(0.318310)
 mad r0.z, -r0.x, l(0.318310), l(1.000000)
 sqrt r0.z, r0.z
 mad r0.x, r0.x, l(-0.005962), l(0.074261)
 mad r0.x, r0.x, r0.y, l(-0.212114)
 mad r0.x, r0.x, r0.y, l(1.570729)
 mul r0.x, r0.z, r0.x
 mul r0.y, r0.x, l(0.318310)
 mad r0.z, -r0.x, l(0.318310), l(1.000000)
 sqrt r0.z, r0.z
 mad r0.x, r0.x, l(-0.005962), l(0.074261)
 mad r0.x, r0.x, r0.y, l(-0.212114)
 mad r0.x, r0.x, r0.y, l(1.570729)
 mul r0.x, r0.z, r0.x
 mul r0.x, r0.x, l(0.318310)
else  
 mov r0.x, l(-1.000000)
endif  
mov o0.x, r0.x
mov o0.yzw, l(0,0,0,0)
ret  
// Approximately 47 instruction slots used
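
For comparison, HLSL also has a [flatten] attribute, the counterpart of [branch]: it asks the compiler to evaluate both sides and select the result instead of emitting a jump. With no attribute, the choice is left to the compiler, which matches the behavior observed above. A short sketch of both variants follows. The function names are hypothetical, and no disassembly is given because these were not part of the original thread:

```hlsl
// No attribute: the compiler decides between a real branch and a select,
// apparently based on how much work the branch would skip.
float computeAuto(float s)
{
    float result = -1.0;
    if (!(abs(s) > 1.0))
        result = acos(s) / acos(-1.0);
    return result;
}

// [flatten]: request that both sides are evaluated and the result selected,
// so no instructions are skipped even when the whole warp is out of range.
float computeFlattened(float s)
{
    float result = -1.0;
    [flatten] if (!(abs(s) > 1.0))
        result = acos(s) / acos(-1.0);
    return result;
}
```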

