Before I get ahead of myself, this is in C (with assembly being permissible).
I've got a need to copy two buffers. The destination buffer is an int32 buffer, and the input buffer is either a uint8 or a uint16 buffer. Right now I'm just doing a normal loop, i.e.:
int i;
for (i = 0; i < length; ++i)
{
    int_buffer[i] = uchar_buffer[i];
}
// or
for (i = 0; i < length; ++i)
{
    int_buffer[i] = ushort_buffer[i];
}
Is there any way to speed this up perhaps? I'm okay with resorting to this method if there aren't any faster ways, but I'm particularly curious if there's some trick in C that can be used, or perhaps a specific assembly instruction. Thanks!
You could Duff it, yeah. If the thought of doing that makes you cringe, then just unrolling the loops a little helps a lot too. You'd do something like:
while (length >= 8)
{
dst[0] = src[0];
.
.
.
dst[7] = src[7];
dst += 8;
src += 8;
length -= 8;
}
Copy anything left over (regular for loop)
Experimenting with different multiples, I've found that 16 is quite good to use, giving slightly better performance (this was years ago, copying color-indexed data to a 32-bit DirectDraw surface via a palette lookup). My experience is that going higher slows things down again; yours may differ.
If you use 16, be aware that there may be one multiple of 8 left over, so unrolling that can help too (it will be very marginal though).
If you pad your arrays to always be the relevant multiple that can make things a lot simpler.
Either way it can get you a lot of code and if you need to do this more than once for different data types you might want to template it in C++.
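In plain C, a macro can stand in for the template. The following is only a sketch: the macro name DEFINE_WIDENING_COPY and the function names copy_u8_to_i32 and copy_u16_to_i32 are made up for illustration, and the unroll factor of 8 is just the one used in the pseudocode above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper macro: generates an unrolled-by-8 widening copy
   from `src_type` elements into int32_t elements. */
#define DEFINE_WIDENING_COPY(name, src_type)                        \
    void name(int32_t *dst, const src_type *src, size_t length)     \
    {                                                               \
        /* Main loop: 8 elements per iteration. */                  \
        while (length >= 8) {                                       \
            dst[0] = src[0]; dst[1] = src[1];                       \
            dst[2] = src[2]; dst[3] = src[3];                       \
            dst[4] = src[4]; dst[5] = src[5];                       \
            dst[6] = src[6]; dst[7] = src[7];                       \
            dst += 8;                                               \
            src += 8;                                               \
            length -= 8;                                            \
        }                                                           \
        /* Tail: the 0..7 elements left over. */                    \
        for (; length > 0; --length)                                \
            *dst++ = *src++;                                        \
    }

DEFINE_WIDENING_COPY(copy_u8_to_i32, uint8_t)
DEFINE_WIDENING_COPY(copy_u16_to_i32, uint16_t)
```

Each use of the macro expands to one ordinary function, so the two variants are called like any other: copy_u8_to_i32(int_buffer, uchar_buffer, length).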
Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.
I wish I could use C++ in this (not that there's anything specific in C++ that would make it faster though), but it's strictly C. I thought about SIMD, but I've never actually used it, so I don't want to take the time to figure out how to do it (with the potential of doing it wrong) without knowing the kinds of benefits it would provide.
Anyway, thanks for reminding me of the existence of Duff's Device! I'll look into it and see if it helps.
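For reference, a minimal sketch of Duff's device applied to the uint8 to int32 case might look like the following. The function name copy_u8_to_i32_duff is hypothetical, and whether this beats a plainly unrolled loop depends on the compiler and CPU, so it is worth measuring before adopting it.

```c
#include <stddef.h>
#include <stdint.h>

/* Duff's device: the switch jumps into the middle of an 8-way unrolled
   do/while loop so the leftover (length % 8) elements are handled on the
   first pass. The case fallthrough is intentional. */
void copy_u8_to_i32_duff(int32_t *dst, const uint8_t *src, size_t length)
{
    size_t passes;

    if (length == 0)
        return;                     /* the do/while would copy 8 otherwise */

    passes = (length + 7) / 8;      /* number of trips through the loop */
    switch (length % 8) {
    case 0: do { *dst++ = *src++;
    case 7:      *dst++ = *src++;
    case 6:      *dst++ = *src++;
    case 5:      *dst++ = *src++;
    case 4:      *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--passes > 0);
    }
}
```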
One related question: I tried once to do this:
int i;
for (i = length; i--;)
{
*int_buffer++ = *uchar_buffer++;
}
But it actually ran slower like that. I didn't look at the disassembly, but I couldn't think of an explanation. Seems like it should at least be just as fast, if not (marginally) faster.
I've benchmarked these methods, and have generally found that *dst++ = *src++ is always slower than array indexing. The compiler is obviously doing something different, but I haven't yet dug into the underlying asm to find out what's going on.
To run, because the timing uses rdtsc, use: "start /affinity 0x01 test.exe 1500", or similar. It's important to force it to run on a single core. Or find a better profiler, as long as it's one that can report individual cycles.
-------------
copy_ref is dumb reference implementation.
copy_2 is unrolled loop.
copy_3 is one read per 4 writes
copy_4 is one read per 8 writes
copy_5 is Duff's device.
-----
Results 1024 elements
(32-bit executable, VS):
copy_2 and copy_3 are nearly tied, with copy_2 taking a slight lead. Duff's device comes a solid third, and copy_4 is worse than naive.
64-bit executable
copy_4 wins, copy_3 is second, copy_2 third, then Duff's device and finally naive.
----
Conclusions:
For a 64-bit app, minimizing memory reads through a register-wide temporary wins.
For a 32-bit app, unrolling alone is good enough, but to remain consistent, minimizing reads is just as good.
So copy_3 or copy_4 would be fastest, copy_2 for 32 bit apps. May vary on memory performance and CPU. And compiler. And OS. And weather. And phase of moon.
Gotcha: copy_3 and copy_4 may depend on endianness, so they aren't portable. Use the unrolled loop to be on the safe side and avoid the hassle; it looks like it's only 10% slower.
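To illustrate the endianness gotcha, here is a guess at what "one read per 4 writes" can look like. This is not the benchmarked copy_3 source, which isn't reproduced in this post; the function name copy_u8_to_i32_packed is made up, and the shift order assumes a little-endian machine.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One 32-bit read feeds four int32_t writes.
   ASSUMPTION: little-endian byte order, so src[0] ends up in the low byte
   of `packed`. On a big-endian machine the four shifts would have to be
   reversed, which is why this approach isn't portable as written. */
void copy_u8_to_i32_packed(int32_t *dst, const uint8_t *src, size_t length)
{
    while (length >= 4) {
        uint32_t packed;
        /* memcpy keeps the 4-byte load legal for unaligned src;
           compilers typically turn it into a single load. */
        memcpy(&packed, src, sizeof packed);
        dst[0] = (int32_t)(packed & 0xFF);
        dst[1] = (int32_t)((packed >> 8) & 0xFF);
        dst[2] = (int32_t)((packed >> 16) & 0xFF);
        dst[3] = (int32_t)(packed >> 24);
        dst += 4;
        src += 4;
        length -= 4;
    }
    /* Tail: the 0..3 elements left over. */
    for (; length > 0; --length)
        *dst++ = *src++;
}
```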
-----
memcpy() is probably your best bet. It checks whether the buffers are aligned and uses SSE when possible, so it's unlikely you'll get better performance with your own function. The following function does almost exactly what the standard memcpy() does in the ideal case:
Normally I'd just be using memcpy, but the input and output buffers aren't the same type, so wouldn't it mess up the copy? Does your memcpySSE function work if the dst is an int* buffer and src is a uint8* buffer (or a uint16* buffer)? I need the values copied, not necessarily the bytes.
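Since memcpy copies bytes unchanged, it can't widen each uint8 value into its own int32. If SSE is on the table, a widening copy could be sketched with SSE2 unpack intrinsics as below. This is an illustration only, not the memcpySSE function mentioned above: the name copy_u8_to_i32_sse2 and the HAVE_SSE2 macro are made up, unaligned loads and stores are used so no alignment is assumed, and the code falls back to a plain loop when SSE2 isn't detected at compile time.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define HAVE_SSE2 1      /* hypothetical feature macro for this sketch */
#endif

/* Zero-extends 16 uint8_t values to 16 int32_t values per iteration. */
void copy_u8_to_i32_sse2(int32_t *dst, const uint8_t *src, size_t length)
{
#ifdef HAVE_SSE2
    const __m128i zero = _mm_setzero_si128();
    while (length >= 16) {
        __m128i bytes = _mm_loadu_si128((const __m128i *)src);
        /* Interleaving with zero widens 8-bit lanes to 16-bit lanes. */
        __m128i lo16 = _mm_unpacklo_epi8(bytes, zero);  /* src[0..7]  */
        __m128i hi16 = _mm_unpackhi_epi8(bytes, zero);  /* src[8..15] */
        /* Interleaving again widens 16-bit lanes to 32-bit lanes. */
        _mm_storeu_si128((__m128i *)(dst + 0),  _mm_unpacklo_epi16(lo16, zero));
        _mm_storeu_si128((__m128i *)(dst + 4),  _mm_unpackhi_epi16(lo16, zero));
        _mm_storeu_si128((__m128i *)(dst + 8),  _mm_unpacklo_epi16(hi16, zero));
        _mm_storeu_si128((__m128i *)(dst + 12), _mm_unpackhi_epi16(hi16, zero));
        src += 16;
        dst += 16;
        length -= 16;
    }
#endif
    /* Scalar tail, and the whole copy when SSE2 is unavailable. */
    for (; length > 0; --length)
        *dst++ = *src++;
}
```

For a uint16 source, only the second unpack stage (_mm_unpacklo_epi16 and _mm_unpackhi_epi16 against zero) would be needed, processing 8 elements per load.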