searching math libs to beat

Started by
23 comments, last by conman 20 years, 7 months ago
I have written a matrix class using expression templates which works really well, I think, and I'm now searching for other matrix libraries so I can compare the produced assembler code and calculation time. I've heard of one called Blitz++, but the links are broken... So I hope someone here knows some good (non-commercial) ones I can beat. constantin
Try D3DX.
This link to Blitz++ works just fine for me.
You can also try the Matrix Template Library
Or the BLAS routines, including Boost's C++ implementation of those.
Or you could try LAPACK.
Or even the Template Numerical Toolkit.

A Fortran77 compiler may be necessary to use some of those libraries. No aliasing = better performance, which is one of the reasons why Fortran is still used in numerical computations. (That, and the sheer number of libraries available.)


[ Start Here ! | How To Ask Smart Questions | Recommended C++ Books | C++ FAQ Lite | Function Ptrs | CppTips Archive ]
[ Header Files | File Format Docs | LNK2001 | C++ STL Doc | STLPort | Free C++ IDE | Boost C++ Lib | MSVC6 Lib Fixes ]

[edited by - Fruny on September 22, 2003 8:17:35 PM]
"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." — Brian W. Kernighan
Thanks! They will do.
>No aliasing = better performance Which is one of the
>reasons why Fortran is still used in numerical
>computations.

Does aliasing really matter if you do something as simple as matrix-vector products or BLAS operations (vector additions, etc.)? I've never looked into this, but writing out simple BLAS routines in asm resulted in code 0% faster than what VC++ gave.

But granted, there are more complicated problems in which aliasing might be more of a burden...

>(That and the sheer number of libraries available).

That's another thing to beat then, I guess

- Mikko Kauppila
What exactly does 'aliasing' mean?
My matrix class is fast because it does no unnecessary memory operations with temporary objects when evaluating an expression like C = A+(B*D).
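The idea described above can be sketched roughly like this (hypothetical names, written in later-standard C++ for brevity, not conman's actual class): `a + b` builds a lightweight expression node instead of a temporary matrix, and the elements are only computed, one loop, at assignment time.

```cpp
#include <cstddef>

// Minimal expression-template sketch: Sum<L, R> records the operands by
// reference; no temporary array is ever materialized for a + b.
template <std::size_t N>
struct Vec {
    double data[N];
    double  operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i)       { return data[i]; }

    template <class E>
    Vec& operator=(const E& e) {            // lazy, element-wise evaluation
        for (std::size_t i = 0; i < N; ++i) data[i] = e[i];
        return *this;
    }
};

// Expression node: holds references, computes elements on demand.
template <class L, class R>
struct Sum {
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

template <std::size_t N>
Sum<Vec<N>, Vec<N>> operator+(const Vec<N>& a, const Vec<N>& b) {
    return {a, b};
}

template <class L, class R, std::size_t N>
Sum<Sum<L, R>, Vec<N>> operator+(const Sum<L, R>& a, const Vec<N>& b) {
    return {a, b};
}
```

With this scheme, `r = a + b + c` compiles down to a single element-wise loop with no intermediate arrays, which is exactly the "no temporary objects" effect claimed above.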


[edited by - conman on September 23, 2003 9:14:38 AM]
Can you post an asm sample that your template method produces? I can probably tell directly how far it is from the optimal asm solution.

For instance, cross product and matrix mul.

"Coding math tricks in asm is more fun than Java"
Ok, this is my assembler output of a matrix multiplication:

Matrix<2, 2, int> A,B,C;
A = B*C;

produces:
mov	eax, DWORD PTR _B$[esp+76]
mov	edx, DWORD PTR _C$[esp+68]
mov	ecx, DWORD PTR _C$[esp+72]
mov	esi, DWORD PTR _B$[esp+68]
mov	edi, eax
imul	edi, ecx
mov	ebx, edx
imul	ebx, esi
add	edi, ebx
mov	ebx, DWORD PTR _B$[esp+72]
mov	DWORD PTR _A$[esp+68], edi
mov	edi, DWORD PTR _B$[esp+80]
mov	ebp, edi
imul	ebp, ecx
mov	ecx, ebx
imul	ecx, edx
add	ebp, ecx
mov	ecx, DWORD PTR _C$[esp+80]
mov	edx, ecx
imul	ecx, edi
imul	edx, eax
mov	eax, DWORD PTR _C$[esp+76]
mov	DWORD PTR _A$[esp+72], ebp
mov	ebp, eax
imul	eax, ebx
imul	ebp, esi
add	edx, ebp
add	eax, ecx
mov	DWORD PTR _A$[esp+76], edx
mov	DWORD PTR _A$[esp+80], eax

The same one with double:

Matrix<2, 2, double> A,B,C;
A = B*C;

produces:

fld	QWORD PTR _B$[esp+104]
fmul	QWORD PTR _C$[esp+104]
fld	QWORD PTR _C$[esp+112]
fmul	QWORD PTR _B$[esp+120]
faddp	ST(1), ST(0)
fstp	QWORD PTR _A$[esp+104]
fld	QWORD PTR _B$[esp+128]
fmul	QWORD PTR _C$[esp+112]
fld	QWORD PTR _B$[esp+112]
fmul	QWORD PTR _C$[esp+104]
faddp	ST(1), ST(0)
fstp	QWORD PTR _A$[esp+112]
fld	QWORD PTR _C$[esp+120]
fmul	QWORD PTR _B$[esp+104]
fld	QWORD PTR _C$[esp+128]
fmul	QWORD PTR _B$[esp+120]
faddp	ST(1), ST(0)
fstp	QWORD PTR _A$[esp+120]
fld	QWORD PTR _C$[esp+128]
fmul	QWORD PTR _B$[esp+128]
fld	QWORD PTR _C$[esp+120]
fmul	QWORD PTR _B$[esp+112]
faddp	ST(1), ST(0)
fstp	QWORD PTR _A$[esp+128]


What do you say?
Aliasing: when more than one variable (e.g. two pointers) can refer to a given memory location.
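To make that concrete, here is a small hypothetical routine in the spirit of a BLAS-style loop. Because C++ allows `out` and `in` to point into the same array, the compiler cannot keep `in[]` values cached in registers across the stores to `out[]`; Fortran (and C99's `restrict`) lets it assume no overlap, and the overlapping call below really does produce different numbers.

```cpp
#include <cstddef>

// Writes out[i] = in[i] + in[i+1] for i in [0, n); in must hold n+1
// elements. The compiler must assume out and in may alias, so each
// in[] value is reloaded from memory after every store to out[].
void add_adjacent(double* out, const double* in, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] + in[i + 1];   // may read a value written above
}
```

Called with disjoint arrays on {1, 2, 3, 4} this yields {3, 5, 7}; called over the same buffer with `out = buf + 1`, the second output becomes 6 rather than 5, because the first store feeds back into a later load. That observable difference is exactly why the optimizer has to be conservative.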

I won't comment on the int routine, since it's probably not used in 3D games. There are some unnecessary variable "renamings" (like mov edi, eax, probably).

Well, this obviously gets rid of the useless temporaries on the stack (Temp = B+C; A = Temp;) generated by the usual C++ operator implementations.

However, the parallelism is still not exploited. The best code attempts to run 2, 3 or 4 separate calculation "threads" (like A[0][0] and A[0][1]) in parallel and groups the fstps to avoid waiting states and stalls. You probably still lose 20% performance here, but that's much better than typical code. I don't know for sure, but the Intel compiler would probably recombine the FPU instructions better. I suppose you used Visual C++ or gcc here.

Hmm, also, in general don't use doubles as local variables (on the stack). There is a 50% chance that the stack is not aligned on a 64-bit boundary, which is a performance killer. Prefer float, or use custom heap allocation for your double matrices.
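One way to do that custom heap allocation (a sketch with hypothetical helper names; MSVC also ships `_aligned_malloc`/`_aligned_free` for the same job): over-allocate, round the pointer up to a 16-byte boundary, and stash the original pointer just before the block so it can be freed later.

```cpp
#include <cstdlib>
#include <cstdint>
#include <cstddef>

// Allocate n doubles aligned to a 16-byte boundary. We over-allocate by
// 15 bytes of alignment slack plus room for one void*, round up, and
// store malloc's original pointer immediately before the aligned block.
double* alloc_aligned(std::size_t n) {
    void* raw = std::malloc(n * sizeof(double) + 15 + sizeof(void*));
    std::uintptr_t p = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
    p = (p + 15) & ~static_cast<std::uintptr_t>(15);  // round up to 16
    reinterpret_cast<void**>(p)[-1] = raw;            // remember for free()
    return reinterpret_cast<double*>(p);
}

void free_aligned(double* p) {
    std::free(reinterpret_cast<void**>(p)[-1]);       // free original block
}
```

With the matrix storage obtained this way, every double (and any future SSE2 load) sits on an aligned address regardless of where the stack happens to be.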

fld	QWORD PTR _B$[esp+104]
fmul	QWORD PTR _C$[esp+104]
fld	QWORD PTR _C$[esp+112]
fmul	QWORD PTR _B$[esp+120]
; stall (1 or 2 cycles)
faddp	ST(1), ST(0)
; stall (2 cycles ?)
fstp	QWORD PTR _A$[esp+104]
etc...
"Coding math tricks in asm is more fun than Java"

