Re: [linux-audio-dev] Traps in floating point code

New Message Reply About this list Date view Thread view Subject view Author view Other groups

Subject: Re: [linux-audio-dev] Traps in floating point code
From: Erik de Castro Lopo (erikd-lad_AT_mega-nerd.com)
Date: Fri Jul 02 2004 - 01:41:34 EEST


On Thu, 1 Jul 2004 15:15:22 -0700 (PDT)
Dan Hollis <goemon_AT_sasami.anime.net> wrote:

> On Fri, 2 Jul 2004, Erik de Castro Lopo wrote:
> > On Thu, 01 Jul 2004 18:18:41 +0200
> > Benno Senoner <sbenno_AT_gardena.net> wrote:
> > > Eric what do you think ? can something like that be coded efficiently
> > > using SSE/SSE2 ?
> > Probably not. There are some algorithms which simply can't be vectorized.
>
> Would an opteron be any better at it?

No, its not an issue of the CPU. The problem is that SSE and SSE2 on
the x86 extended family and Altivec on the PowerPC all require that
the algorithm can be refactored to allow blocks of data to be
processed at time. However, the refactoring can in some cases slow
the algorithm down to an extent that it ends up slower on the vector
processor than on the standard FPU.

Here's an example (where a, b anc c are arrays of floats) :

     for (k = 0 ; k < 128 ; k++)
        a [k] = b [k] * c [k] ;

which can be written as:

     for (k = 0 ; k < 128 / 4 ; k+ = 4)
     { a [k] = b [k] * c [k] ;
        a [k + 1] = b [k + 1] * c [k + 1] ;
        a [k + 2] = b [k + 2] * c [k + 2] ;
        a [k + 3] = b [k + 3] * c [k + 3] ;
        }

This can then be vectorized so that four entries in the b array can
be added to four entries in the c array in a single instruction and
then assigned to four entries in the a array in another single
instruction. like:

     for (k = 0 ; k < 128 / 4 ; k+ = 4)
     { load_four_b_values () ; /* single instruction. */
        load_four_c_values () ; /* single instruction. */
        add_four_b_to_four_c_vals () ; /* single instruction. */
        store_4_a_values () ; /* single instruction. */
        }

The above should run at close to four times the speed of the original.

Now look at a new example:

     a [0] = b [0] * c [0];
     for (k = 1 ; k < 128 ; k++)
        a [k] = a [k - 1] + b [k] * c [k] ;

The problem here is that each a [k] depends on the value of a [k - 1].
Refactoring this to be vectorized is likely to require so much
shuffling of data that it ends up slower than the non-vectorized
version.

Erik

-- 
+-----------------------------------------------------------+
  Erik de Castro Lopo  nospam_AT_mega-nerd.com (Yes it's valid)
+-----------------------------------------------------------+
"This is like creating laws against blasphemy and then complaining that
unbelievers can't come up with any logical argument against the existence
of God"  -- www.infoanarchy.org on the Digital Millenium Copyright Act


New Message Reply About this list Date view Thread view Subject view Author view Other groups

This archive was generated by hypermail 2b28 : Fri Jul 02 2004 - 01:34:32 EEST