searching math libs to beat

Started by
23 comments, last by conman 20 years, 7 months ago
I have written a matrix class using expression templates which works really well, I think, and I'm now searching for other matrix libraries so I can compare the produced assembler code and calculation time. I've heard of one called Blitz++, but the links are broken... So I hope someone here knows some good (non-commercial) ones I can beat. constantin
Try D3DX.
This link to Blitz++ works just fine for me.
You can also try the Matrix Template Library
Or the BLAS routines, including Boost's C++ implementation of those.
Or you could try LAPACK.
Or even the Template Numerical Toolkit.

A Fortran77 compiler may be necessary to use some of those libraries. No aliasing = better performance, which is one of the reasons why Fortran is still used in numerical computations. (That, and the sheer number of libraries available.)


[ Start Here ! | How To Ask Smart Questions | Recommended C++ Books | C++ FAQ Lite | Function Ptrs | CppTips Archive ]
[ Header Files | File Format Docs | LNK2001 | C++ STL Doc | STLPort | Free C++ IDE | Boost C++ Lib | MSVC6 Lib Fixes ]

[edited by - Fruny on September 22, 2003 8:17:35 PM]
"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." — Brian W. Kernighan
Thanks! They will do.
>No aliasing = better performance Which is one of the
>reasons why Fortran is still used in numerical
>computations.

Does aliasing really matter if you do something as simple as matrix-vector products or BLAS operations (vector additions, etc.)? I've never looked into this, but writing out simple BLAS routines in asm resulted in code 0% faster than what VC++ gave.

But granted, there are more complicated problems in which aliasing might be more of a burden...

>(That and the sheer number of libraries available).

That's another thing to beat then, I guess

- Mikko Kauppila
What exactly does 'aliasing' mean?
My matrix class is fast because it does no unnecessary memory operations with temporary objects when evaluating an expression like C = A+(B*D).
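The idea described above can be sketched roughly like this (hypothetical names, written in later-standard C++ for brevity, not conman's actual class): `a + b` builds a lightweight expression node instead of a temporary matrix, and the elements are only computed, one loop, at assignment time.

```cpp
#include <cstddef>

// Minimal expression-template sketch: Sum<L, R> records the operands by
// reference; no temporary array is ever materialized for a + b.
template <std::size_t N>
struct Vec {
    double data[N];
    double  operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i)       { return data[i]; }

    template <class E>
    Vec& operator=(const E& e) {            // lazy, element-wise evaluation
        for (std::size_t i = 0; i < N; ++i) data[i] = e[i];
        return *this;
    }
};

// Expression node: holds references, computes elements on demand.
template <class L, class R>
struct Sum {
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

template <std::size_t N>
Sum<Vec<N>, Vec<N>> operator+(const Vec<N>& a, const Vec<N>& b) {
    return {a, b};
}

template <class L, class R, std::size_t N>
Sum<Sum<L, R>, Vec<N>> operator+(const Sum<L, R>& a, const Vec<N>& b) {
    return {a, b};
}
```

With this scheme, `r = a + b + c` compiles down to a single element-wise loop with no intermediate arrays, which is exactly the "no temporary objects" effect claimed above.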


[edited by - conman on September 23, 2003 9:14:38 AM]
Can you post an asm sample that your template method produces? I can probably tell directly how far it is from the optimal asm solution.

For instance, cross product and matrix mul.

"Coding math tricks in asm is more fun than Java"
Ok, this is my assembler output of a matrix multiplication:

Matrix<2, 2, int> A,B,C;
A = B*C;

produces:
mov	eax, DWORD PTR _B$[esp+76]
mov	edx, DWORD PTR _C$[esp+68]
mov	ecx, DWORD PTR _C$[esp+72]
mov	esi, DWORD PTR _B$[esp+68]
mov	edi, eax
imul	edi, ecx
mov	ebx, edx
imul	ebx, esi
add	edi, ebx
mov	ebx, DWORD PTR _B$[esp+72]
mov	DWORD PTR _A$[esp+68], edi
mov	edi, DWORD PTR _B$[esp+80]
mov	ebp, edi
imul	ebp, ecx
mov	ecx, ebx
imul	ecx, edx
add	ebp, ecx
mov	ecx, DWORD PTR _C$[esp+80]
mov	edx, ecx
imul	ecx, edi
imul	edx, eax
mov	eax, DWORD PTR _C$[esp+76]
mov	DWORD PTR _A$[esp+72], ebp
mov	ebp, eax
imul	eax, ebx
imul	ebp, esi
add	edx, ebp
add	eax, ecx
mov	DWORD PTR _A$[esp+76], edx
mov	DWORD PTR _A$[esp+80], eax

The same one with double:

Matrix<2, 2, double> A,B,C;
A = B*C;

produces:

fld	QWORD PTR _B$[esp+104]
fmul	QWORD PTR _C$[esp+104]
fld	QWORD PTR _C$[esp+112]
fmul	QWORD PTR _B$[esp+120]
faddp	ST(1), ST(0)
fstp	QWORD PTR _A$[esp+104]
fld	QWORD PTR _B$[esp+128]
fmul	QWORD PTR _C$[esp+112]
fld	QWORD PTR _B$[esp+112]
fmul	QWORD PTR _C$[esp+104]
faddp	ST(1), ST(0)
fstp	QWORD PTR _A$[esp+112]
fld	QWORD PTR _C$[esp+120]
fmul	QWORD PTR _B$[esp+104]
fld	QWORD PTR _C$[esp+128]
fmul	QWORD PTR _B$[esp+120]
faddp	ST(1), ST(0)
fstp	QWORD PTR _A$[esp+120]
fld	QWORD PTR _C$[esp+128]
fmul	QWORD PTR _B$[esp+128]
fld	QWORD PTR _C$[esp+120]
fmul	QWORD PTR _B$[esp+112]
faddp	ST(1), ST(0)
fstp	QWORD PTR _A$[esp+128]


What do you say?
Aliasing: when more than one variable (e.g. two pointers) can refer to a given memory location.
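To make that concrete, here is a small hypothetical routine in the spirit of a BLAS-style loop. Because C++ allows `out` and `in` to point into the same array, the compiler cannot keep `in[]` values cached in registers across the stores to `out[]`; Fortran (and C99's `restrict`) lets it assume no overlap, and the overlapping call below really does produce different numbers.

```cpp
#include <cstddef>

// Writes out[i] = in[i] + in[i+1] for i in [0, n); in must hold n+1
// elements. The compiler must assume out and in may alias, so each
// in[] value is reloaded from memory after every store to out[].
void add_adjacent(double* out, const double* in, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] + in[i + 1];   // may read a value written above
}
```

Called with disjoint arrays on {1, 2, 3, 4} this yields {3, 5, 7}; called over the same buffer with `out = buf + 1`, the second output becomes 6 rather than 5, because the first store feeds back into a later load. That observable difference is exactly why the optimizer has to be conservative.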

I won't comment on the int routine, since it's probably not used in 3D games. There are some unnecessary variable "renamings" (like mov edi, eax, probably).

Well, this obviously gets rid of the useless temporaries on the stack (Temp = B+C; A = Temp;) generated by the usual C++ operator implementations.

However, the parallelism is still not exploited. The best code attempts to run 2, 3 or 4 separate calculation "threads" (like A[0][0] and A[0][1]) in parallel and groups the fstps to avoid waiting states and stalls. You probably still lose 20% performance here, but that's much better than typical code. I don't know for sure, but the Intel compiler would probably recombine the FPU instructions better. I suppose you used Visual C++ or gcc here.

Hmm, also, in general don't use doubles as local variables (on the stack). There is a 50% chance that the stack is not aligned on a 64-bit boundary, which is a performance killer. Prefer float, or use custom heap allocation for your double matrices.
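One way to do that custom heap allocation (a sketch with hypothetical helper names; MSVC also ships `_aligned_malloc`/`_aligned_free` for the same job): over-allocate, round the pointer up to a 16-byte boundary, and stash the original pointer just before the block so it can be freed later.

```cpp
#include <cstdlib>
#include <cstdint>
#include <cstddef>

// Allocate n doubles aligned to a 16-byte boundary. We over-allocate by
// 15 bytes of alignment slack plus room for one void*, round up, and
// store malloc's original pointer immediately before the aligned block.
double* alloc_aligned(std::size_t n) {
    void* raw = std::malloc(n * sizeof(double) + 15 + sizeof(void*));
    std::uintptr_t p = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
    p = (p + 15) & ~static_cast<std::uintptr_t>(15);  // round up to 16
    reinterpret_cast<void**>(p)[-1] = raw;            // remember for free()
    return reinterpret_cast<double*>(p);
}

void free_aligned(double* p) {
    std::free(reinterpret_cast<void**>(p)[-1]);       // free original block
}
```

With the matrix storage obtained this way, every double (and any future SSE2 load) sits on an aligned address regardless of where the stack happens to be.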

fld	QWORD PTR _B$[esp+104]
fmul	QWORD PTR _C$[esp+104]
fld	QWORD PTR _C$[esp+112]
fmul	QWORD PTR _B$[esp+120]
; stall (1 or 2 cycles)
faddp	ST(1), ST(0)
; stall (2 cycles ?)
fstp	QWORD PTR _A$[esp+104]
etc...
"Coding math tricks in asm is more fun than Java"

