linux-audio-user: Re: [LAU] optimizing jackd build

From: Mike Taht <mike.taht@email-addr-hidden>
Date: Mon Apr 09 2007 - 18:18:10 EEST

On 4/9/07, Sampo Savolainen <v2@email-addr-hidden> wrote:
>
> On Mon, 2007-04-09 at 16:26 +0200, Dragan Noveski wrote:
> > Mike Taht wrote:
> > > does jack say it's running SSE on startup?
> > now, since i recompiled with --enable-dynsimd it says:
> >
> > ...
> > JACK tmpdir identified as [/dev/shm/]
> > SSE2 detected
> > load = 0.2297 max usecs: 40.000, spare = 10626.000
> > ...
> >
> > looks like a nice hint??
> >
> > what is about processor type and architecture?
> > are there more hints about optimizing?

I'm kind of writing this as a general note to other audio authors.

One big win on large data sets (lots of tracks) that overrun your available
cache is to align the floating point data to the cache line size rather than
just to the SIMD width. On several of my oprofile runs and test cases with
ardour + jack where I was regularly overrunning L2 cache, cache alignment sped
up SIMD calculations by 9-40%.

On a Linux system, to find the optimal alignment for your data, run

    grep cache_alignment /proc/cpuinfo

and then go looking through whatever code you are running for calls to
"posix_memalign". For example, jack uses a default alignment of 16, when 64
or 128 would be more appropriate for modern processors.
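
To make that concrete, here is a minimal sketch of a cache-aligned allocation. The helper names (alloc_cache_aligned, is_cache_aligned) and the CACHE_LINE_BYTES constant are made up for illustration and are not part of jack; 64 is only an assumed value, so substitute whatever cache_alignment reports on your machine.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* ask libc to declare posix_memalign */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed cache line size in bytes; use the cache_alignment value
 * from /proc/cpuinfo for your own CPU. */
#define CACHE_LINE_BYTES 64

/* Hypothetical helper (not a jack function): allocate nframes floats
 * starting on a cache line boundary. Returns NULL on failure.
 * Release the buffer with free(). */
static float *alloc_cache_aligned(size_t nframes)
{
    void *buf = NULL;

    /* posix_memalign requires a power-of-two alignment that is also
     * a multiple of sizeof(void *). */
    assert((CACHE_LINE_BYTES & (CACHE_LINE_BYTES - 1)) == 0);

    if (posix_memalign(&buf, CACHE_LINE_BYTES, nframes * sizeof(float)) != 0)
        return NULL;
    return (float *)buf;
}

/* Hypothetical helper: true if p starts on a cache line boundary. */
static int is_cache_aligned(const void *p)
{
    return ((uintptr_t)p % CACHE_LINE_BYTES) == 0;
}
```

A 64-byte boundary is also a 16-byte boundary, so a buffer allocated this way still satisfies the alignment the aligned SSE loads and stores need.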

Cache alignment is an easy test, I hope more people try it....

> Enabling SSE throughout the project via compiler flags makes the
> resulting code /depend/ on SSE. In other words, running that on a
> platform with no SSE will result in "Illegal instruction (core dumped)".

At present the major bottlenecks in ardour and jackd have all been reduced
considerably by extensively oprofiling the hot spots and replacing them, where
possible, with SSE. In typical situations the SSE routines are still near
the top of the runtime for those two programs. Without SSE, the "normal"
equivalents are in general responsible for 3-8x more of the total runtime.
Recent example: SSE optimizations to ardour cut total cpu usage for an
extreme test case (116 tracks, 40+ busses) from 77% down to 35% in the 64
samples/period case, and from 35% to 12% in the 1024 samples/period case.

SSE routines could be used to speed up graphics as well, particularly in
RGBA situations.

Linux's Oprofile subsystem is wonderful as it has low overhead and can run
on smp'd rt kernels.

At present the major bottlenecks left for ardour and jack are very much down
in the noise floor. A typical user now spends more cpu time in plugins than
in those two core programs. (Thus I've been oprofiling a few plugins
heavily and, among other things, now have an SSE'd comb_run routine... and
hopefully will announce some sped-up plugins soon.)

> In projects like jackd and Ardour there are places which can be improved
> vastly via SSE code. Creating a framework which can enable SSE / etc.
> per the platform the binary is ran makes it possible for distributions
> to include optimized versions of the software which will work on any x86
> platform.
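
For readers wondering what such a framework looks like, here is a rough sketch of run-time selection through a function pointer. It is not the actual jackd or Ardour code: every name in it (mix_fn, mix_scalar, mix_sse, select_mix) is invented for illustration, and it leans on the GCC builtins __builtin_cpu_init and __builtin_cpu_supports, which only exist in GCC releases much newer than the 4.1 discussed in this thread. The point is that the CPU test runs once at startup, not once per period.

```c
#include <stddef.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define SKETCH_X86 1
#include <xmmintrin.h>
#endif

/* Hypothetical signature for a "mix src into dst" routine. */
typedef void (*mix_fn)(float *dst, const float *src, size_t n);

/* Portable fallback that runs on any CPU. */
static void mix_scalar(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
}

#ifdef SKETCH_X86
/* SSE version. The target attribute lets this one function use SSE
 * even when the rest of the file is built without -msse. */
__attribute__((target("sse")))
static void mix_sse(float *dst, const float *src, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 d = _mm_loadu_ps(dst + i);
        __m128 s = _mm_loadu_ps(src + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(d, s));
    }
    for (; i < n; i++)      /* leftover samples */
        dst[i] += src[i];
}
#endif

/* Call once at startup and keep the returned pointer. */
static mix_fn select_mix(void)
{
#ifdef SKETCH_X86
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse"))
        return mix_sse;
#endif
    return mix_scalar;
}
```

The process callback then calls through the stored pointer, so a binary built this way should not die with "Illegal instruction" on a CPU without SSE; it just takes the scalar path.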

I note that x86_64 comes with SSE and SSE2 by default, and that taking
branches (e.g. determining at run time whether SSE1 or SSE2 is available)
is expensive, particularly at low period sizes (64). So it would be nice if
more audio code, when compiled for x86_64, used SSE by default, without the
run-time test.
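
A sketch of what I mean, assuming a compiler that defines __SSE__ whenever SSE is enabled for the build (gcc does this by default on x86_64). The function name mix_buffers is made up for illustration. The choice is made by the preprocessor, so nothing is tested at run time.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical mix routine: dst[i] += src[i] for n samples.
 * The SSE path is chosen at compile time, not at run time. */
static void mix_buffers(float *dst, const float *src, size_t n)
{
    size_t i = 0;

    /* Sanity check only; compiled out when built with -DNDEBUG. */
    assert(dst != NULL && src != NULL);

#if defined(__SSE__)
    /* Four floats per iteration. Unaligned loads keep the sketch
     * safe for any buffer; aligned buffers could use _mm_load_ps. */
    for (; i + 4 <= n; i += 4) {
        __m128 d = _mm_loadu_ps(dst + i);
        __m128 s = _mm_loadu_ps(src + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(d, s));
    }
#endif
    /* Scalar tail, or the whole loop when SSE is not compiled in. */
    for (; i < n; i++)
        dst[i] += src[i];
}
```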

> I have heard good things about the current development branch of gcc,
> but gcc 4.1 still has a _long_ way to go when it comes to vectorizing
> (=writing code using parallel SIMD instructions, in other words SSE).

gcc 4.3 - which is a long way from working - has a vectorizer which
understands the type conversions so critical to SSE usage. (Sampo's famed
assembler SSE peak code uses a clever type conversion.)

To enable automatic vectorization in gcc 4.1.X, pass
-ftree-vectorize
and see what it is doing with -ftree-vectorizer-verbose=5

And weep. For example, in the zillions of lines of code in Ardour 2, only 2
loops get automatically vectorized with gcc 4.1.X.
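
For comparison, the sort of loop that gives an auto-vectorizer the best chance is about as simple as this one: a plain counted loop, no type conversions, and restrict-qualified pointers so the compiler knows the buffers do not overlap. The function name apply_gain is made up for illustration, and I am not claiming any particular gcc version will vectorize it; run it through the verbose flag and see what your compiler reports.

```c
#include <stddef.h>

/* Build (C99 is needed for restrict and the loop-scoped counter):
 *   gcc -std=c99 -O2 -ftree-vectorize -ftree-vectorizer-verbose=5 -c gain.c
 * gain.c is just an assumed file name for this sketch. */

/* Hypothetical example: dst[i] = src[i] * gain for n samples. */
static void apply_gain(float *restrict dst, const float *restrict src,
                       float gain, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * gain;
}
```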

> Hand written assembler is still many orders faster than what gcc is
> capable of doing. In Ardour peak computation (for both metering and
> waveform displaying) is written in SSE (the first part in pure assembly,
> the second in a C-level abstraction which is almost 1:1 assembly). Both
> functions are more than 20x faster in raw performance than what gcc 4.1
> can do.

Not only that, but writing SSE code is FUN! It's one of the few cases left
in this world where the human can still be smarter than the compiler!
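
To give a taste of it, here is a small peak-finding sketch written with SSE intrinsics. To be clear, this is NOT Ardour's peak code (that is Sampo's assembler), and the name compute_peak_sketch is made up. It only illustrates one common idea: the absolute value of a float can be had by clearing its sign bit with a bitwise operation, with no branch per sample.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical sketch: return max(current, max |buf[i]|) over n samples. */
static float compute_peak_sketch(const float *buf, size_t n, float current)
{
    size_t i = 0;

#if defined(__SSE__)
    /* -0.0f has only the sign bit set, so andnot(sign, v) clears
     * the sign bit of every lane, giving |v|. */
    const __m128 sign = _mm_set1_ps(-0.0f);
    __m128 peak = _mm_set1_ps(current);
    float lanes[4];
    int k;

    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_andnot_ps(sign, _mm_loadu_ps(buf + i));
        peak = _mm_max_ps(peak, v);
    }

    /* Reduce the four lane maxima to a single value. */
    _mm_storeu_ps(lanes, peak);
    for (k = 0; k < 4; k++)
        if (lanes[k] > current)
            current = lanes[k];
#endif

    /* Scalar tail, or the whole buffer when SSE is not compiled in. */
    for (; i < n; i++) {
        float a = buf[i] < 0.0f ? -buf[i] : buf[i];
        if (a > current)
            current = a;
    }
    return current;
}
```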

> Sampo
>
>
> _______________________________________________
> Linux-audio-user mailing list
> Linux-audio-user@email-addr-hidden
> http://lists.linuxaudio.org/mailman/listinfo.cgi/linux-audio-user
>

-- 
Mike Taht
PostCards From the Bleeding Edge
http://the-edge.blogspot.com

Received on Mon Apr 9 20:15:11 2007

This archive was generated by hypermail 2.1.8 : Mon Apr 09 2007 - 20:15:12 EEST