
local variable == SLOW?

Started by October 17, 2000 05:41 PM
33 comments, last by jho 24 years, 2 months ago
I don't think any memory is allocated or freed in the function; it's just the function's stack pointer being adjusted to make room for the next 40 bytes (yes, that takes time, but not much).
The problem is just the call to the constructor. Perhaps a global vector could help there, I don't know (you'd probably have to set the vector to [0,0,0] anyway, and then there's a function call).

Yep, you can make global functions inline, Stoffel. Just write inline in front of the function,

like this:

inline void matrixmult(const matrix3D& M,const vector3D& V,vector3D& result)
{
vector3D temp;
...
}

That's it.
It should be as fast as matrix::mult (because matrix::mult also receives a matrix* this; you just can't see it).
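
To make the suggestion above concrete, here is a minimal sketch of such an inline free function. The vector3D and matrix3D layouts are assumptions (the thread never shows them); the element[row][3] translation column is inferred from the expanded multiply posted later in the thread. Note that result has to be a non-const reference for the function to write to it.

```cpp
// Hypothetical stand-ins for the thread's types; the real layouts are not shown.
struct vector3D {
    float x, y, z;
};

struct matrix3D {
    float element[3][4];  // assumed: 3x3 rotation plus a translation column
};

// Free function marked inline. The matrix is passed explicitly, which is what
// a member function does implicitly through its hidden this pointer.
inline void matrixmult(const matrix3D& M, const vector3D& V, vector3D& result)
{
    // Compute into a temporary so the call still works if V and result
    // refer to the same object.
    vector3D temp;
    temp.x = V.x * M.element[0][0] + V.y * M.element[0][1] +
             V.z * M.element[0][2] + M.element[0][3];
    temp.y = V.x * M.element[1][0] + V.y * M.element[1][1] +
             V.z * M.element[1][2] + M.element[1][3];
    temp.z = V.x * M.element[2][0] + V.y * M.element[2][1] +
             V.z * M.element[2][2] + M.element[2][3];
    result = temp;
}
```

Whether the compiler really inlines this still depends on its own heuristics and on the definition being visible at the call site.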

we wanna play, not watch the pictures

quote:
if it works in debug it WILL work in release, so why would you use an 'if' statement that just do extra checking on code you already know is correct?


Provided you never alias a vector, you're correct. But if you store pointers to things that contain vectors, it's very hard to make sure you never have two pointers to the same thing. It won't happen very often, so it could end up being a hard-to-find bug.

It's probably not the best idea to stick an almost pointless if inside a function that's called so often, though.
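
A small sketch of the aliasing problem, assuming the 'if' under discussion is a guard of the form &v == &out. Vec3, Mat3, mult_unchecked and mult_checked are hypothetical names chosen for illustration, not the poster's code.

```cpp
// Hypothetical minimal types for illustration only.
struct Vec3 { float x, y, z; };
struct Mat3 { float m[3][3]; };

// Writes straight into out. If v and out are the same object, out.x is
// overwritten before v.x is read for the y and z rows, giving a wrong result.
inline void mult_unchecked(const Mat3& M, const Vec3& v, Vec3& out)
{
    out.x = M.m[0][0] * v.x + M.m[0][1] * v.y + M.m[0][2] * v.z;
    out.y = M.m[1][0] * v.x + M.m[1][1] * v.y + M.m[1][2] * v.z;
    out.z = M.m[2][0] * v.x + M.m[2][1] * v.y + M.m[2][2] * v.z;
}

// Same multiply with the aliasing guard: copy the input only when needed.
inline void mult_checked(const Mat3& M, const Vec3& v, Vec3& out)
{
    if (&v == &out) {
        Vec3 copy = v;               // input and output alias: work from a copy
        mult_unchecked(M, copy, out);
    } else {
        mult_unchecked(M, v, out);
    }
}
```

With a matrix that swaps x and y, calling mult_unchecked(M, a, a) on (1, 2, 3) yields (2, 2, 3) instead of (2, 1, 3). The guard costs one pointer comparison per call, which is the trade-off being argued about here.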

Guys, I'm pulling my hair out.
I'm doing benchmarking in release mode now, running the same thing over with slightly different code

In a particular frequently called function, I make this call:
inline bool camera::TransformLine(const vector3D &V1, const vector3D &V2, line2D &result)
{
R_matrix.mult(V1, V1P);
R_matrix.mult(V2, V2P);
...other stuff...
}
I get a speed of 227k (some number of ticks)
(FYI: V1P and V2P are globals, if I declare Vector3D V1P, V2P inside the function, the speed is about 231K == a little slower)

if I copy and "expand" my ::mult code explicitly, it's faster.

// R_matrix.mult(V1, V1P);
// R_matrix.mult(V2, V2P);

V1P.x = V1.x * R_matrix.element[0][0] +
V1.y * R_matrix.element[0][1] +
V1.z * R_matrix.element[0][2] +
R_matrix.element[0][3];
V1P.y = V1.x * R_matrix.element[1][0] +
V1.y * R_matrix.element[1][1] +
V1.z * R_matrix.element[1][2] +
R_matrix.element[1][3];
V1P.z = V1.x * R_matrix.element[2][0] +
V1.y * R_matrix.element[2][1] +
V1.z * R_matrix.element[2][2] +
R_matrix.element[2][3];

V2P.x = V2.x * R_matrix.element[0][0] +
V2.y * R_matrix.element[0][1] +
V2.z * R_matrix.element[0][2] +
R_matrix.element[0][3];
V2P.y = V2.x * R_matrix.element[1][0] +
V2.y * R_matrix.element[1][1] +
V2.z * R_matrix.element[1][2] +
R_matrix.element[1][3];
V2P.z = V2.x * R_matrix.element[2][0] +
V2.y * R_matrix.element[2][1] +
V2.z * R_matrix.element[2][2] +
R_matrix.element[2][3];

I get a speed of 213k. I consider this quite a difference on my 700 MHz machine, and it may be more on slower machines.

I don't understand why, though: my mult function is already inlined. Why is there still an overhead? Is VC++ disabling the inlining for me? 213 vs 227 is about a 6% difference, and I could really use that speed!


Edited by - jho on October 19, 2000 7:16:21 PM
The "inline" keyword is only a hint to the compiler to inline the function. If

-the function contains for, while, or do loops (on some compilers)
-the function is recursive
-the compiler hasn't yet seen the function's definition
-the compiler thinks the function is too complex

it will not inline the function.

I guess you have to determine if the compiler is really inlining the code. Try:

-replacing inline with a macro and see if the speed goes up
-I remember hearing about a compiler command in VC++ that forces the compiler to inline, but I don't remember what it is
-check the assembly code to see if the function is actually inlined

Hope this helps!
quote:
-replacing inline with a macro and see if the speed goes up

Don't do this. This is exactly the same thing as unrolling the function calls into your calling function, which you said you've already done.
quote:
-I remember hearing about a compiler command in VC++ that forces the compiler to inline, but I don't remember what it is

There's an MS-specific keyword called __forceinline. If the function cannot be inlined, it generates a level 1 warning. I'd use it only for testing purposes (to see that it's in fact being inlined). I don't like leaving non-standard code in my products.
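
One way to keep the non-standard keyword out of the rest of the code is to hide it behind a macro. This is only a sketch: FORCE_INLINE and dot3 are hypothetical names, and the GCC/Clang branch uses their always_inline attribute, which is likewise a compiler extension.

```cpp
// FORCE_INLINE is a hypothetical macro name, not part of any standard.
#if defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define FORCE_INLINE inline __attribute__((always_inline))
#else
#  define FORCE_INLINE inline   // fall back to the plain hint elsewhere
#endif

// Example function using the macro; dot3 is an illustrative helper.
FORCE_INLINE float dot3(const float a[3], const float b[3])
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}
```

Switching the macro back to plain inline for release builds is then a one-line change.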

Silly question: are you including the definition of your function in the .H file? Inlined functions must be in the .H file.

Another silly question: are you getting your speed results from just 1 run or multiple runs? System loads can change performance. Also, are you certain this is the only thing you changed between the two runs?
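
A minimal sketch of timing over multiple runs, so that one noisy run doesn't decide the comparison. It uses std::chrono, which is newer than the compilers discussed in this thread; best_time_ns is a hypothetical helper name.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <limits>

// Runs work() the given number of times and returns the fastest run in
// nanoseconds. Taking the minimum filters out runs disturbed by system load.
template <typename Fn>
std::int64_t best_time_ns(int runs, Fn&& work)
{
    using clock = std::chrono::steady_clock;
    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (int i = 0; i < runs; ++i) {
        const auto start = clock::now();
        work();
        const auto stop = clock::now();
        const std::int64_t ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
        best = std::min(best, ns);
    }
    return best;
}
```

Calling it once per code variant, for example best_time_ns(20, [&] { /* variant under test */ }), gives numbers that are easier to compare than single measurements.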
quote:
Silly question: are you including the definition of your function in the .H file? Inlined functions must be in the .H file.


Damn, that's not silly at all. I didn't know that. So I must not have been running anything inline up to this point, because I didn't put the code in the .h file.

I also looked at the MSDN help; there seem to be 100,000 reasons for the compiler to disable inlining for you.
Stoeffl, when you said that:

char buf[80];
for()
{
}

is the same as

for()
{
char buf[80];
}

you were partially wrong. I haven't tested it with the VC compiler, it might optimize it, but I'm talking in general. Both algorithms are not the same and the first one is faster than the second one.

Okay, the first piece of code allocates memory only once for the whole loop, whereas in the second example two additional instructions are needed (sub/add) in each iteration of the for() loop.

sub sp, 80
mov cx, 10
for:
...
loop for
add sp, 80

that's what you get from the first C code.

what you get from the second example, is the following:

mov cx, 10
for:
sub sp, 80
...
add sp, 80
loop for

Okay, the code looks the same, except that the instructions are moved around a little. Well, that's basically what would make your code slower, unless the compiler you're using optimizes it to the first version.

Time passes...

I tested it with VC and it seems like it optimizes it, from looking at the ASM listing. What I was referring to was without any optimizations on the compiler's side. Or let's just say you were writing the code in ASM. Then the way you write it matters, since you can't get any lower than that (using an assembler).

-------------------------------
I'll screw up whoever screws around with the gamedev forum!

..-=gLaDiAtOr=-..
quote:
Stoeffl, when you said that *snip code* you were partially wrong. I haven't tested it with the VC compiler, it might optimize it, but I'm talking in general. Both algorithms are not the same and the first one is faster than the second one.

I am completely, 100% correct. Please re-read my post.

Let's look at how your example is wrong. You say the code:
for (i=0; i<10; i++)
{
  char s[80];
  ...
}

..is compiled to the following instructions:
mov cx, 10
for:
sub sp, 80
...
add sp, 80
loop for


You show that stack space is allocated each time the declaration point is passed in the algorithm (char s[80]). If this were the case, what would the following program do?:
while (1)
{
  char s[80];
  sprintf (s, "Hello world!\n");
}


Using your logic, this would eventually cause a stack overflow. This is not the case, regardless of optimization. It will loop forever, but s already has memory allocated for it at the beginning of the function, and no extra memory allocation occurs here.

Let me put it in other terms. There are two performance hits when you create an object:
1) Memory allocation
2) Construction

ALL stack-local variables are allocated memory at the same time: the beginning of the function. This is regardless of their size and quantity; they are all allocated in a single step, which I went through great pains to show in explicit, gory detail.

Therefore, the only cost for declarations of stack variables, even if they're in a loop, is construction cost. Since char and int have no construction costs, there is no difference in efficiency between declaring the variable outside or inside the loop; the only difference is the scope of the variable.
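
One rough way to see this for yourself is to record the address of a loop-local buffer on every iteration. The sketch below does that; distinct_buffer_addresses is a hypothetical helper name. On mainstream compilers the buffer typically lands at the same address each time, although that is an implementation detail rather than a guarantee you can rely on.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <set>

// Declares a buffer inside the loop body and counts how many different
// addresses it ever occupies across all iterations.
std::size_t distinct_buffer_addresses(int iterations)
{
    std::set<std::uintptr_t> seen;
    for (int i = 0; i < iterations; ++i) {
        char s[80];  // declared inside the loop, as in the example above
        std::snprintf(s, sizeof s, "iteration %d", i);
        seen.insert(reinterpret_cast<std::uintptr_t>(s));
    }
    return seen.size();
}
```

If the stack really grew on every pass through the declaration, the count would climb with the iteration count.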

When you say, "In general, the algorithms are not the same and the first one is faster", it depends:
- If, by "in general", you mean some other language's implementation, then yes, there may or may not be a difference.
- If, by "in general", you mean C or C++, then you are wrong; they are exactly the same.

Please take your time to learn this about C and C++. It's an extremely important concept.
Note that Stoffel's assembly listing was made from an unoptimized compile. The six instructions pushing and popping esi, edi, and ebx would have been removed in an optimized build because those registers are not used in the function.
Stoffel, very good point.. and very good example with the infinite loop.

You say there are two performance hits, memory allocation and construction.

So let's see if I understand this,

if I have
A)
for (...)
{
myclass A;
//something
}

and
B)
myclass A;
for (..)
{
//something
}


Then the memory allocation is exactly the same for both versions; however, the first version will take the performance hit for construction multiple times, am I right?

Built-in types have no construction cost? If I have a blank constructor for my class, is it considered to have no construction cost?
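
The two versions above can be compared with a class that counts its own constructions. This is only a sketch: Counted and the two helper functions are hypothetical names, and it counts constructor calls without measuring time.

```cpp
// Hypothetical class that records how many times its constructor runs.
struct Counted {
    static int constructions;
    float x, y, z;
    Counted() : x(0), y(0), z(0) { ++constructions; }
};
int Counted::constructions = 0;

// Version A: the object is declared inside the loop body.
int constructions_inside_loop(int iterations)
{
    Counted::constructions = 0;
    for (int i = 0; i < iterations; ++i) {
        Counted a;      // constructed (and destroyed) on every iteration
        a.x += 1.0f;
    }
    return Counted::constructions;
}

// Version B: the object is declared once, before the loop.
int constructions_outside_loop(int iterations)
{
    Counted::constructions = 0;
    Counted a;          // constructed exactly once
    for (int i = 0; i < iterations; ++i) {
        a.x += 1.0f;
    }
    return Counted::constructions;
}
```

For 10 iterations, version A runs the constructor 10 times and version B runs it once. An empty inline constructor usually compiles down to nothing in an optimized build, but that is up to the compiler and is not something this sketch measures.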




