Re: [linux-audio-dev] Traps in floating point code


Subject: Re: [linux-audio-dev] Traps in floating point code
From: Ruben van Royen (ruben_AT_guidedbees.com)
Date: Thu Jul 01 2004 - 20:59:11 EEST


First of all, I was not yet talking about vectorizing your code, which is often
hard, especially for a compiler. But SSE can be used on scalars as well (as
you probably know).
The fact is that the Intel Pentium 4 optimization guide says that SSE code is
generally as fast as or faster than regular FP code, and truncation to integer
in particular is faster. Denormals (which started all of this) can also be
handled faster by SSE math by turning on a mode flag that makes input
denormals behave as zeros. This is of course not IEEE compliant, but it is
exactly what you were doing in your code.
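
A minimal sketch of the two points above, assuming an x86 compiler that
provides <xmmintrin.h> and scalar math compiled to SSE (the default on
x86-64). The helper names and the two MXCSR_* constants are made up for
illustration; the bit positions are the documented FTZ (bit 15) and DAZ
(bit 6) bits of the MXCSR register. It also assumes the CPU supports DAZ,
since setting an unsupported MXCSR bit faults.

```c
#include <xmmintrin.h>  /* SSE: _mm_getcsr, _mm_setcsr, _mm_set_ss, _mm_cvttss_si32 */

#define MXCSR_FTZ 0x8000u  /* bit 15: flush denormal results to zero (illustrative name) */
#define MXCSR_DAZ 0x0040u  /* bit 6: treat denormal inputs as zero (illustrative name) */

/* Turn on FTZ and DAZ for the calling thread and return the previous
 * MXCSR value so the caller can restore it with _mm_setcsr(). */
static unsigned int enable_denormal_flush(void)
{
    unsigned int old = _mm_getcsr();
    _mm_setcsr(old | MXCSR_FTZ | MXCSR_DAZ);
    return old;
}

/* Truncate toward zero with a single SSE conversion (CVTTSS2SI);
 * no x87 rounding-mode switch is involved. */
static int truncate_to_int(float x)
{
    return _mm_cvttss_si32(_mm_set_ss(x));
}

/* Multiply through volatiles so the product is computed at run time,
 * under whatever MXCSR mode is active, rather than folded by the compiler. */
static float scale_sample(float tiny, float gain)
{
    volatile float a = tiny;
    volatile float b = gain;
    return a * b;
}
```

With the flags enabled, scale_sample(1e-40f, 1e20f) should return exactly 0
because the denormal input is read as zero; after restoring the old MXCSR
value the same call gives a small positive number again.
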
The reasons for SSE code being slower than FP code could be:
        The addition is pipelined in the FP unit, but not in the SSE unit.
        Incorrect alignment might incur a higher penalty for SSE.

Ruben

On Thursday 01 July 2004 18:18, Benno Senoner wrote:
> Jens M Andreasen wrote:
> >Why not just use modf?
> >
> > double fullindex, increment, integer, fraction;
> > // int i;
> >
> > fullindex += increment;
> > fraction = modf(fullindex, &integer);
> > // i = integer;
> >
> >C99 has float and long double versions as well.
>
> The problem with modf is that it is slow (it generates "call modf", which
> involves a subroutine call, even with -ffast-math), and the
> integer index is still a double that needs to be converted to an int,
> so you need to perform either a fistl
> or lrintf().
>
> I benchmarked my code against the modf() code and mine is 3 times faster.
> As said, in my code the fract part might become 1.0 in some cases
> or even drop a bit below 0; for interpolation it still works
> perfectly because of the continuity of polynomial interpolation.
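
The continuity argument can be illustrated with the linear case. This is
only a sketch, not the LinuxSampler code: lerp_sample is a hypothetical
helper, and the caller is assumed to guarantee that buf[i + 1] exists.

```c
#include <stddef.h>

/* Linear interpolation between buf[i] and buf[i + 1].
 * frac is nominally in [0, 1), but nothing breaks at the edges:
 *  - frac == 1.0 yields buf[i + 1], the same value that index i + 1
 *    produces with frac == 0.0, so the output stays continuous;
 *  - a frac slightly below 0 extrapolates along the same segment,
 *    which is off by an amount proportional to |frac|. */
static float lerp_sample(const float *buf, size_t i, float frac)
{
    return buf[i] + frac * (buf[i + 1] - buf[i]);
}
```

For example, with buf = {0, 2, 6, 4}, lerp_sample(buf, 1, 1.0f) and
lerp_sample(buf, 2, 0.0f) both give 6.
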
> In LinuxSampler all we want is fast interpolation, and the current code
> we use is efficient.
> We will explore the possibility of using fixed-point indexes (int/fract
> part, e.g. 16/16 bit); perhaps we will squeeze out a bit more,
> but on the other hand it has the problem that if you need LFOs and other
> kinds of pitch modulation you have to convert
> indexes from float/double to fixed point, which might be a bit
> time-consuming, especially because we have sample-accurate
> modulation and envelopes. We will see what can be done; for now the goal
> is to get a perfectly working sampler engine.
> A 5-10% performance increase is not that important for now, especially
> if fixed-point indexes turn out to be a big PITA.
> (I cannot say yet whether this is true or not; one has to implement and
> benchmark the stuff within the whole sampler engine
> to get real numbers. Synthetic benchmarks are not always enough because
> in a complex audio app you do lots of other stuff besides
> resampling.)
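
For reference, a 16/16 fixed-point index could look roughly like the sketch
below. All names are hypothetical, and it assumes a non-negative pitch, a
sample shorter than 65536 frames (the 16-bit integer part), and an input
buffer long enough for every in[i + 1] read. The float-to-fixed conversion
in pitch_to_fixed is the step that would have to run whenever the pitch is
modulated.

```c
#include <stddef.h>
#include <stdint.h>

#define FRAC_BITS 16
#define FRAC_ONE  (1u << FRAC_BITS)   /* 1.0 in 16.16 format */
#define FRAC_MASK (FRAC_ONE - 1u)     /* selects the fractional bits */

/* Convert a non-negative pitch ratio to a 16.16 step, rounding to nearest. */
static uint32_t pitch_to_fixed(double pitch)
{
    return (uint32_t)(pitch * FRAC_ONE + 0.5);
}

/* Linear-interpolating resampler driven by a 16.16 position.
 * The integer index is a shift, the fraction is a mask: no float-to-int
 * conversion happens inside the loop. */
static void resample_fixed(const float *in, float *out, size_t n_out,
                           uint32_t pos, uint32_t step)
{
    for (size_t k = 0; k < n_out; ++k) {
        uint32_t i = pos >> FRAC_BITS;
        float frac = (float)(pos & FRAC_MASK) * (1.0f / FRAC_ONE);
        out[k] = in[i] + frac * (in[i + 1] - in[i]);
        pos += step;
    }
}
```

With in = {0, 1, 2, 3, 4, 5} and a pitch of 1.5, the first three outputs
are 0, 1.5 and 3.
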
>
>
> Regarding SSE/SSE2:
> I performed various benchmarks (pure C/C++) with gcc 3.3/3.4 and the
> latest Intel compiler with SSE vectorization optimizations:
> the fact is, as Eric said, that SSE/SSE2 is slower than the regular FPU in
> some cases.
> Resampling (polynomial, e.g. linear, cubic) seems to be such a case. Not much
> can be parallelized (icc is not able to vectorize
> anything in the code below), so you either use an alternative algorithm
> handcrafted for SSE or you had better stay with the
> regular FPU.
> I'm not sure if it is possible to achieve decent speed increases with
> SSE; perhaps it would help to keep track
> of 4 indexes,
> e.g.
>
> double fullindex1, fullindex2, fullindex3, fullindex4;
> double fract1, fract2, fract3, fract4;
> int intindex1, intindex2, intindex3, intindex4;
> double pitch; // e.g. 1.0 plays the audio at normal speed
>
> for(...) {
> // fullindex1 is the official index (incremented by pitch at each
> // iteration); fullindex2,3,4 are the indexes needed
>
> // can the following 3 lines be parallelized ?
>
> fullindex2 = fullindex1 + pitch;
> fullindex3 = fullindex1 + 2.0 * pitch;
> fullindex4 = fullindex1 + 3.0 * pitch;
>
> // AFAIK SSE2 can do 2 double_to_int with one instruction
> // see CVTPD2PI , http://folk.uio.no/botnen/intel/vt/reference/vc50.htm
> // so at least 2x speedup would be achieved (only on P4+ CPUs)
> // Athlon XP does not support SSE2 and SSE can only do 2 float to int using
> // see http://folk.uio.no/botnen/intel/vt/reference/vc57.htm
>
> intindex1 = double_to_int(fullindex1);
> intindex2 = double_to_int(fullindex2);
> intindex3 = double_to_int(fullindex3);
> intindex4 = double_to_int(fullindex4);
>
> // can be parallelized using SSE (doing 4 ops per instruction)
> fract1 = fullindex1 - intindex1;
> fract2 = fullindex2 - intindex2;
> fract3 = fullindex3 - intindex3;
> fract4 = fullindex4 - intindex4;
>
> // can be parallelized using SSE
> outputsamplebuf[0] = samplebuf[intindex1] + fract1 *
> (samplebuf[intindex1 + 1] - samplebuf[intindex1]);
> outputsamplebuf[1] = samplebuf[intindex2] + fract2 *
> (samplebuf[intindex2 + 1] - samplebuf[intindex2]);
> outputsamplebuf[2] = samplebuf[intindex3] + fract3 *
> (samplebuf[intindex3 + 1] - samplebuf[intindex3]);
> outputsamplebuf[3] = samplebuf[intindex4] + fract4 *
> (samplebuf[intindex4 + 1] - samplebuf[intindex4]);
>
> // advance fullindex1 by 4 times pitch because we processed 4 samples
> fullindex1 += 4.0*pitch;
>
> }
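
One way the loop above might be written with intrinsics is sketched below.
It is an assumption-laden illustration, not a benchmarked answer: it uses
float positions instead of doubles (so four fit in one register, at the
cost of precision on long samples), it needs SSE2 for the packed
float-to-int conversion (so it would not run on an Athlon XP), and the
sample fetches remain scalar because there is no gather instruction.
resample4_sse2 and its parameters are hypothetical names; the caller must
ensure every in[i + 1] read is in range.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (pulls in the SSE ones as well) */
#include <stddef.h>

/* Linear interpolation, four output samples per iteration.
 * Only full groups of four are written; pitch is constant for the call. */
static void resample4_sse2(const float *in, float *out, size_t n_out,
                           float start, float pitch)
{
    /* Lane k holds start + k * pitch (k = 0..3). */
    const __m128 lanes = _mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f);
    __m128 pos = _mm_add_ps(_mm_set1_ps(start),
                            _mm_mul_ps(lanes, _mm_set1_ps(pitch)));
    const __m128 step = _mm_set1_ps(4.0f * pitch);

    for (size_t k = 0; k + 4 <= n_out; k += 4) {
        /* Four truncating float-to-int conversions in one instruction. */
        __m128i idx = _mm_cvttps_epi32(pos);
        /* Four fractional parts in one subtraction. */
        __m128 frac = _mm_sub_ps(pos, _mm_cvtepi32_ps(idx));

        /* Scalar "gather" of the eight neighbouring input samples. */
        int i[4];
        _mm_storeu_si128((__m128i *)i, idx);
        __m128 a = _mm_set_ps(in[i[3]], in[i[2]], in[i[1]], in[i[0]]);
        __m128 b = _mm_set_ps(in[i[3] + 1], in[i[2] + 1],
                              in[i[1] + 1], in[i[0] + 1]);

        /* out = a + frac * (b - a), four lanes at once. */
        _mm_storeu_ps(out + k,
                      _mm_add_ps(a, _mm_mul_ps(frac, _mm_sub_ps(b, a))));
        pos = _mm_add_ps(pos, step);
    }
}
```

Whether this beats the plain FPU loop would have to be measured; the index
math and the interpolation are parallel, but the eight scalar loads per
iteration may well dominate.
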
>
> The disadvantage is that you must keep pitch constant for 4 samples, but
> this is not a big problem.
> (In theory we could add a different pitch value to each fullindex1-4
> variable, so it would not be so hard
> to lift that restriction.)
>
> Eric, what do you think? Can something like that be coded efficiently
> using SSE/SSE2?
> (I'd prefer SSE because the Athlon XP supports those instructions
> too, while SSE2 is only supported by
> P4+ or AMD64 CPUs.)
>
>
> Regarding pure SSE math: try to compile this benchmark with gcc 3.3/3.4
> using -mfpmath=sse (together with -msse or -msse2);
> you will see that speed will suck compared to using the regular FPU.
> Intel's icc will do less damage, but it will still be slower than pure
> FPU code.
>
> http://www.linuxdj.com/benno/rspeed4.tgz
>
> Using SSE is not always a panacea for audio apps.
>
> cheers,
> Benno
> http://www.linuxsampler.org



This archive was generated by hypermail 2b28 : Thu Jul 01 2004 - 20:53:22 EEST