Re: [LAD] vectorization

From: Fons Adriaensen <fons@email-addr-hidden>
Date: Tue May 06 2008 - 10:55:45 EEST

On Tue, May 06, 2008 at 09:21:09AM +0200, Jens M Andreasen wrote:

> On Tue, 2008-05-06 at 00:24 +0200, Fons Adriaensen wrote:
> > After each iteration, call an empty function, separately compiled,
> > that takes all three vectors as arguments (and _not_ as const *
> > of course). No more tricks. The overhead is peanuts compared
> > to the calculation.
>
> You mean like this:
>
> // defined in empty.c as return 0;
> extern int empty(void*a,void*b,void*d);
>
> And then call it at the end of iteration:
>
> for (j = 0; j < n; ++j)
> {
>     for (i = 0; i < N; ++i)
>         cxD[i] += cxA[i] * cxB[i];
>
>     /* opaque call, so the compiler must really run every pass */
>     empty(&cxA, &cxB, &cxD);
> }
> fprintf(stderr, "> clock: %ld ms %s\n", (long)((clock() - clk) / 1000), s);

Yes.
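
For anyone who wants to try this in a single file, here is a minimal sketch of the same idea. It is not the code from this thread: the call goes through a volatile function pointer rather than a separately compiled empty.c, and the names empty_impl and time_mac are made up for illustration.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>
#include <time.h>

/* Stand-in for the function in empty.c: does nothing and returns 0. */
static int empty_impl(void *a, void *b, void *d)
{
    (void)a; (void)b; (void)d;
    return 0;
}

/* Calling through a volatile pointer forces a real call on every pass.
 * The compiler cannot see the callee, so it has to assume that all
 * three vectors may be read or modified by it. */
static int (*volatile empty)(void *, void *, void *) = empty_impl;

/* Run `passes` complex MAC passes over `len` elements and return the
 * CPU time used, in milliseconds. */
static double time_mac(float complex *d, float complex *a,
                       float complex *b, size_t len, unsigned passes)
{
    assert(d && a && b);

    clock_t start = clock();
    for (unsigned j = 0; j < passes; ++j) {
        for (size_t i = 0; i < len; ++i)
            d[i] += a[i] * b[i];
        empty(a, b, d);   /* non-const pointers, as in the original advice */
    }
    return 1000.0 * (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Dividing by CLOCKS_PER_SEC instead of a fixed 1000 keeps the result in milliseconds on any platform.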

> Well, that certainly did level out everything. For n = 1000:
>
> > clock: 64680 ms (_Complex)
> > clock: 61990 ms (cvec_t)
> > clock: 71060 ms (original float array[N][2])

QED :-)
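
A rough cross-check of those numbers, as a sketch only. It assumes 4-byte floats, and that every pass streams a, b and d in from main memory and writes d back, which ignores whatever the cache and write buffers really do. The helper name and the constant are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed traffic per complex MAC on float data: read a, b and d,
 * write d back, 8 bytes (two floats) each. */
enum { BYTES_PER_CMAC = 4 * 8 };

/* Bytes moved per second, assuming every pass touches main memory
 * for each element. */
static double effective_bandwidth(size_t elements, unsigned passes,
                                  size_t bytes_per_element, double seconds)
{
    assert(seconds > 0.0);
    return (double)elements * passes * (double)bytes_per_element / seconds;
}
```

With N = 1024 * 1024, n = 1000 and the 64680 ms reported for _Complex, effective_bandwidth(1024 * 1024, 1000, BYTES_PER_CMAC, 64.68) comes out at roughly 0.5 GB/s. That is the same order of magnitude as the 800 MB/s theoretical peak of PC100 memory, which fits a loop limited by memory rather than by arithmetic.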

> This measures the terrible latency I have between main memory and cache.

Which will determine performance for every algorithm that

- is working on a data set that is larger than the cache,
- does not produce multiple results from the same inputs.

A complex MAC uses each data point twice, so it gains a little.
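
To make the "twice" visible, here is the complex MAC written out in real arithmetic. This is only an illustrative sketch; the cpx type and the cmac function are hypothetical and not any of the variants timed above.

```c
#include <stddef.h>

/* Hypothetical interleaved complex type, for illustration only. */
typedef struct { float re, im; } cpx;

/* d[i] += a[i] * b[i], spelled out in real operations. */
static void cmac(cpx *d, const cpx *a, const cpx *b, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        /* Four input values are loaded per element ... */
        float ar = a[i].re, ai = a[i].im;
        float br = b[i].re, bi = b[i].im;

        /* ... and each of them appears once in the real part
         * and once in the imaginary part. */
        d[i].re += ar * br - ai * bi;
        d[i].im += ar * bi + ai * br;
    }
}
```

That is eight floating point operations on four loaded inputs, against two operations on two inputs for a real MAC, so the complex version does a bit more work per byte fetched.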

> Changing back N from (1024 * 1024) to
> #define N 1024
>
> .. and then increasing n to a million again (this should be safe now?) -

It is safe now, but with such a small data size the code is still not
representative of real-life use of a very simple operation such as a
MAC loop. In practice you also have to generate the data and use the
result before a second iteration can start, and this will trash the
cache even for smaller data sizes. Even more so if the code runs in
a jack callback and other processes alternate with it.
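
One way to approximate that in a benchmark is to walk a large scratch buffer between passes, so that the vectors are no longer in the cache when the next pass starts. The following is a sketch under assumptions: the 64-byte line size is a guess that depends on the CPU, the scratch buffer has to be chosen larger than the cache, and evict_cache is a made-up name.

```c
#include <stddef.h>

/* Assumed cache line size in bytes; the real value depends on the CPU. */
#define LINE_SIZE 64

/* Touch one byte per assumed cache line of `scratch`. With a buffer
 * larger than the cache this should push most other data out of it.
 * The volatile qualifier and the returned checksum keep the compiler
 * from dropping the accesses. */
static unsigned long evict_cache(volatile unsigned char *scratch, size_t size)
{
    unsigned long sum = 0;

    for (size_t i = 0; i < size; i += LINE_SIZE) {
        scratch[i]++;         /* write, so the line is really owned */
        sum += scratch[i];    /* read back and accumulate */
    }
    return sum;
}
```

Called once after each pass of the MAC loop, with the eviction time measured separately and subtracted, this should give figures closer to what a callback sees when other code has run in between.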

> so we can pretend not to be limited by PC100 - yields with icc:
>
> > clock: 16510 ms (_Complex)
> > clock: 6090 ms (cvec_t)
> > clock: 12800 ms (original float array[N][2])
>
> .. and with gcc:
>
> > clock: 13820 ms (_Complex)
> > clock: 6330 ms (cvec_t)
> > clock: 13420 ms (original float array[N][2])
>
> Very even I would say.

Yes, gcc now performs quite well compared to the Intel compiler.

Ciao,

-- 
FA
Laboratorio di Acustica ed Elettroacustica
Parma, Italia
Lascia la spina, cogli la rosa.
Received on Tue May 6 16:15:02 2008
