Re: [LAD] GCC Vector extensions

From: Stéphane Letz <letz@email-addr-hidden>
Date: Thu Jul 21 2011 - 15:22:06 EEST

>
> On 07/20/2011 10:27 AM, Maurizio De Cecco wrote:
>> I am playing around with the GCC and Clang vector extensions, on Linux and
>> Mac OS X, and I am getting some strange behaviour.
>>
>> I am working on jMax Phoenix. Its dsp engine, in its current state,
>> is very memory bound; it is built by aggregating very small
>> granularity operations, like vector sum or multiply, each of them
>> executed independently, reading from and writing to memory.
>>
>> I tried to implement all these 'primitive' operations using the vector
>> types.
>>
>> On clang/MacOSX I get an impressive improvement in performance,
>> around 4x on the operations, even just using the vector types for
>> copying data; my impression is that the compiler uses some kind of vector
>> load/store instruction that properly uses the available memory bandwidth,
>> but unfortunately I do not know more about the x86 architecture.
>>
>> On gcc/Linux (gcc 4.5.2), the same code produces a *slow down* of around
>> 2.5x.
>>
>> Well, does anybody have an idea why?
>>
>> I am actually running Linux (Ubuntu 11.04) under a VMWare virtual
>> machine; I do not know if this has any implications.
>
> Maybe. A better comparison would be: clang/Linux vs. gcc/Linux and
> clang/MacOSX vs gcc/MacOSX compiled binaries.
>
> Also, as Dan already pointed out: gcc has a whole lot of optimization
> flags which are not enabled by default. Try '-O3 -msse2 -ffast-math'.
> '-ftree-vectorizer-verbose=2' is handy while optimizing code.
>
> have fun,
> robin

Or you can use LLVM to *directly* generate vector code, as in the following example, the result of some experiments done with Faust and its LLVM backend:

block_code8: ; preds = %block_code8.block_code8_crit_edge, %block_code3
  %20 = phi float* [ %15, %block_code3 ], [ %.pre11, %block_code8.block_code8_crit_edge ]
  %21 = phi float* [ %14, %block_code3 ], [ %.pre10, %block_code8.block_code8_crit_edge ]
  %22 = phi float* [ %16, %block_code3 ], [ %.pre9, %block_code8.block_code8_crit_edge ]
  %indvar = phi i32 [ 0, %block_code3 ], [ %indvar.next, %block_code8.block_code8_crit_edge ]
  %nextindex1 = shl i32 %indvar, 2
  %nextindex = add i32 %nextindex1, 4
  %23 = sext i32 %nextindex1 to i64
  %24 = getelementptr float* %22, i64 %23
  %25 = getelementptr float* %21, i64 %23
  %26 = bitcast float* %25 to <4 x float>*
  %27 = load <4 x float>* %26, align 1
  %28 = getelementptr float* %20, i64 %23
  %29 = bitcast float* %28 to <4 x float>*
  %30 = load <4 x float>* %29, align 1
  %31 = fadd <4 x float> %27, %30
  %32 = bitcast float* %24 to <4 x float>*
  store <4 x float> %31, <4 x float>* %32, align 1
  %33 = icmp ult i32 %nextindex, %18
  br i1 %33, label %block_code8.block_code8_crit_edge, label %exit_block6

In this block, the float* pointers are "bitcast" to <4 x float>* vector pointers, the vectors of 4 floats are loaded, manipulated with the LLVM IR vector versions of add, mul, etc., then stored back.

The LLVM IR is still generated with the "conservative" "align 1" option, since the backend cannot yet guarantee that the data is always aligned. The resulting SSE code will then use MOVUPS (Move Unaligned Packed Single-Precision Floating-Point Values). The next step is to generate loads with the natural 16-byte alignment of the vector type, like:

 %27 = load <4 x float>* %26, align 16

so that MOVAPS (Move Aligned Packed Single-Precision Floating-Point Values) is used instead.

We already see some nice speed improvements, but the Faust vector LLVM backend is still not yet complete...

Stéphane

_______________________________________________
Linux-audio-dev mailing list
Linux-audio-dev@email-addr-hidden
http://lists.linuxaudio.org/listinfo/linux-audio-dev
Received on Thu Jul 21 16:15:04 2011

This archive was generated by hypermail 2.1.8 : Thu Jul 21 2011 - 16:15:04 EEST