On 07/26/2011 03:15 AM, Maurizio De Cecco wrote:
>> So are you now considering using some #ifdef to select float/4 instead of
>> double/8 vectors in jMax, or just changing all of them?
>
> Well, at the moment on gcc the performance with vector types is the same
> as without vector types, so i'll leave the Linux version without vector
> types (the code is #ifdef'ed).
When I was playing around with this last night... the best performance
came from your non-optimized, non-vectored code.
Why?
Because GCC translated it to optimized, vectored code.
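For illustration, the kind of plain loop GCC can vectorize on its own might look like the sketch below. The name `add3_scalar` and the exact signature are assumptions, not the code from the original test. With GCC, auto-vectorization is normally enabled by `-O3` (or `-O2 -ftree-vectorize`); on a 32-bit x86 target it may also need `-msse2`, and recent GCC versions can report what was vectorized with `-fopt-info-vec`.

```c
#include <stddef.h>

/* Hypothetical scalar version: out[i] = a[i] + b[i].
 * The restrict qualifiers promise the compiler that the three buffers
 * do not overlap, which removes one obstacle to auto-vectorization. */
void add3_scalar(const float *restrict a, const float *restrict b,
                 float *restrict out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```

Whether the compiler actually emits SIMD instructions for this depends on the GCC version, target, and flags, so it is worth checking the generated assembly rather than assuming.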
> By the way, I forgot to mention that all my tests were at 64 bits;
> I'll try later on a 32 bit Ubuntu.
I was on 32 bit Ubuntu. Also, with GCC the 64-bit optimizer is known to
be better at optimising SIMD code.
Because I'm a sucker for these kinds of diversions, I came up with a
scheme that shaved about 1 second off your test (on my machine). It
assumes that `vecsize` is a power of two (and at least 16, since each
pass handles 16 floats). The idea is to hold values in the processor
registers and access each buffer one cache line at a time (a cache
line is 64 bytes on x86... 16 floats).
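/* GCC vector extension: four packed floats (16 bytes, one SSE register). */
typedef float v4sf __attribute__((vector_size(16)));

/* Computes arg2[i] = arg0[i] + arg1[i]. Assumes all three buffers are
 * 16-byte aligned and that vecsize is a multiple of 16. */
static inline void add3_vec(float * restrict arg0, float * restrict arg1,
                            float * restrict arg2, unsigned int vecsize)
{
    v4sf *v0, *v1, *v2;
    v4sf c0, c1, c2, c3, c4, c5, c6, c7;
    const unsigned cache_size = 4; /* v4sf vectors per 64-byte cache line */

    v0 = (v4sf*)arg0;
    v1 = (v4sf*)arg1;
    v2 = (v4sf*)arg2;
    vecsize /= 4*cache_size; /* number of 16-float blocks */

    while(vecsize--) {
        /* load one cache line from each input into registers */
        c0 = *v0++;
        c1 = *v0++;
        c2 = *v0++;
        c3 = *v0++;
        c4 = *v1++;
        c5 = *v1++;
        c6 = *v1++;
        c7 = *v1++;
        /* add and store one cache line of output */
        *v2++ = c0 + c4;
        *v2++ = c1 + c5;
        *v2++ = c2 + c6;
        *v2++ = c3 + c7;
    }
}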
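If the length is not guaranteed to be a multiple of 16, or the buffers may not be 16-byte aligned, the same 16-floats-per-pass idea could be wrapped roughly as in the sketch below. The helper name `add3_any` is hypothetical, and the use of `memcpy` for loads and stores is a substitution made here to avoid the alignment and aliasing assumptions of the pointer casts; it is not part of the scheme above, and no performance claim is made for it.

```c
#include <stddef.h>
#include <string.h>

/* GCC/Clang vector extension: four packed floats (16 bytes). */
typedef float v4sf __attribute__((vector_size(16)));

/* Hypothetical helper: out[i] = a[i] + b[i] for any n.
 * Full blocks of 16 floats are handled as four v4sf additions;
 * the remaining 0..15 elements fall back to a scalar loop. */
void add3_any(const float *restrict a, const float *restrict b,
              float *restrict out, size_t n)
{
    size_t i = 0;

    for (; i + 16 <= n; i += 16) {
        v4sf x[4], y[4];
        /* memcpy makes no alignment or aliasing assumptions about a, b */
        memcpy(x, a + i, sizeof x);
        memcpy(y, b + i, sizeof y);
        for (int k = 0; k < 4; k++)
            x[k] = x[k] + y[k];      /* element-wise vector add */
        memcpy(out + i, x, sizeof x);
    }

    /* scalar tail for the leftover elements */
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}
```

Compilers usually turn fixed-size `memcpy` calls like these into plain vector loads and stores, but that is worth confirming in the generated assembly for the target in question.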
-gabriel
_______________________________________________
Linux-audio-dev mailing list
Linux-audio-dev@email-addr-hidden
http://lists.linuxaudio.org/listinfo/linux-audio-dev
Received on Tue Jul 26 16:15:01 2011