linux-audio-dev: Re: [LAD] vectorization

From: Jussi Laako <jussi@email-addr-hidden>
Date: Wed May 07 2008 - 01:45:26 EEST

Fons Adriaensen wrote:
> Which will determine performance for every algorithm that
>
> - is working on a data set that is larger than the cache,
> - does not produce multiple results from the same inputs.

Here are some results with empty() included...

N=1024, n=1000000, gcc:
> clock: 16500 ms (_Complex)
> clock: 26760 ms (cvec_t)
> clock: 15820 ms (original float array[N][2])
> clock: 13700 ms (asm on float array)

N=(1024*1024), n=1000, gcc:
> clock: 8410 ms (_Complex)
> clock: 9360 ms (cvec_t)
> clock: 8500 ms (original float array[N][2])
> clock: 10540 ms (asm on float array)

And if I remove "-fprefetch-loop-arrays", it degrades to:
> clock: 12800 ms (_Complex)
> clock: 10010 ms (cvec_t)
> clock: 13800 ms (original float array[N][2])
> clock: 10510 ms (asm on float array)

And the non-vectorized version ("normal x86-64 code"):
> clock: 12840 ms (_Complex)
> clock: 22830 ms (cvec_t)
> clock: 12880 ms (original float array[N][2])
> clock: 10470 ms (asm on float array)

The asm code I used doesn't include prefetch instructions, because the
data sets I work on at once are smaller. Vectorization improves the
cvec_t layout case significantly.
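
For readers who have not followed the whole thread, here is a rough C
sketch of the kind of complex multiply-accumulate loop being timed, in
three data layouts. It is not the benchmark code itself. In particular,
split_cvec is a hypothetical stand-in with separate real and imaginary
arrays; the actual cvec_t definition is not shown in this message and
may differ. The function names are made up for illustration.

```c
#include <complex.h>
#include <stddef.h>

/* Interleaved layout, as in "float array[N][2]":
 * x[i][0] is the real part, x[i][1] the imaginary part. */
static void mac_interleaved(size_t n, float (*a)[2], float (*b)[2],
                            float (*acc)[2])
{
    for (size_t i = 0; i < n; i++) {
        acc[i][0] += a[i][0] * b[i][0] - a[i][1] * b[i][1];
        acc[i][1] += a[i][0] * b[i][1] + a[i][1] * b[i][0];
    }
}

/* Hypothetical split layout (an assumption, not the thread's cvec_t):
 * all real parts in one array, all imaginary parts in another. */
typedef struct {
    float *re;
    float *im;
} split_cvec;

static void mac_split(size_t n, const split_cvec *a, const split_cvec *b,
                      split_cvec *acc)
{
    for (size_t i = 0; i < n; i++) {
        acc->re[i] += a->re[i] * b->re[i] - a->im[i] * b->im[i];
        acc->im[i] += a->re[i] * b->im[i] + a->im[i] * b->re[i];
    }
}

/* C99 complex type; in memory this is the same as the interleaved
 * layout, but the compiler sees the multiply as one operation. */
static void mac_c99(size_t n, const float complex *a,
                    const float complex *b, float complex *acc)
{
    for (size_t i = 0; i < n; i++)
        acc[i] += a[i] * b[i];
}
```

All three compute acc[i] += a[i] * b[i]; only the memory layout the
vectorizer has to work with differs.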

> It is safe now, but with such a small data size the code is still not
> representative of real life use of a very simple operation such as a
> MAC loop. In practice you also have to generate the data and use the

There are several use cases where the data set is rather small and is
used in several subsequent loops, so the cache can help.
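
As a minimal illustration of that pattern (made up for this message,
not taken from any real code): a small block gets windowed in place and
then reread by a second loop. BLOCK, apply_window, block_energy and
process_block are hypothetical names, and the block size is just an
assumption chosen so that the data would typically fit in L1 cache.

```c
#include <stddef.h>

enum { BLOCK = 1024 };  /* assumed block size: 1024 floats = 4 KiB */

/* Pass 1: apply a window to the block in place. */
static void apply_window(float *x, const float *w, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x[i] *= w[i];
}

/* Pass 2: sum of squares over the same block. */
static float block_energy(const float *x, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * x[i];
    return sum;
}

/* Runs both passes on one block; the second loop rereads the data
 * that the first loop has just written. */
static float process_block(float *x, const float *w)
{
    apply_window(x, w, BLOCK);
    return block_energy(x, BLOCK);
}
```

With blocks this small, the second loop should mostly hit cache rather
than main memory, which is a different situation from streaming once
over a data set larger than the cache.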

After profiling, I've identified a number of algorithms which
benefit significantly from handwritten vectorized asm.

BR,

- Jussi

Received on Wed May 7 04:15:02 2008
