Re: [linux-audio-dev] Performance problems caused by dlopen()

Subject: Re: [linux-audio-dev] Performance problems caused by dlopen()
From: Will Benton (willb_AT_cs.wisc.edu)
Date: Sun Oct 06 2002 - 03:58:28 EEST


On Saturday, October 5, 2002, at 04:30 PM, nick wrote:

> I keep hearing about how aligning XXX on XX boundaries (or similar)
> gives huge performance increases etc..
>
> Where's a good place to start finding out about these techniques?

Techniques for improving cache performance should be covered in any
MS-level compilers class, so you may want to search for course notes on
the web. Here are a few other things to check out:

* "Compiler transformations for high-performance computing"
    http://citeseer.nj.nec.com/bacon93compiler.html
    This is a survey paper that covers compiler optimizations,
    emphasizing techniques for implementing high-performance scientific
    code; you'll probably find that it shares several things in common
    with high-performance audio code (except for the RT requirements of
    audio and the massive data sizes of scientific code, of course).
    You can probably implement most of these by hand or coax gcc to
    implement some of them for you.

* http://oprofile.sf.net
    OProfile is a profiler that uses the processor's hardware
    performance counters. On most modern processors, you can get a
    counter for I-cache and D-cache misses.

* _Advanced Compiler Design and Implementation_
    by Steven Muchnick

* _Modern Compiler Implementation in {Java|ML|C}_ (pick one)
    by Andrew Appel

I don't remember exactly how big cache lines are on the x86 (it depends
on the processor), but page-aligning data (i.e. address % 4096 == 0)
should be a good start, since a page-aligned address is also aligned to
any smaller power-of-two cache line size. You also want to try to have
inner loops operate only on data that fits in the cache. There is a
technique called "loop tiling" that restructures loops so that they
work on one cache-sized block of data at a time. As an aside, you can
usually get loop performance gains by loop unrolling or software
pipelining, but you'll want to make sure that the code for your inner
loop still fits in the I-cache.
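
To make the alignment and tiling ideas concrete, here is a minimal
sketch in C. The names alloc_page_aligned and transpose_tiled, the
4096-byte page size, and the 16-element tile edge are all my own
assumptions for illustration; the right tile size depends on the
processor and is something you would have to measure.

```c
#define _POSIX_C_SOURCE 200112L  /* for posix_memalign */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_BYTES 4096  /* assumed page size, as in address % 4096 == 0 */
#define TILE 16          /* tile edge in elements; a guess, tune per CPU */

/* Allocate count floats starting on a page boundary; NULL on failure.
 * Release the buffer with free(). */
float *alloc_page_aligned(size_t count)
{
    void *p = NULL;
    if (posix_memalign(&p, PAGE_BYTES, count * sizeof(float)) != 0)
        return NULL;
    assert((uintptr_t)p % PAGE_BYTES == 0);
    return p;
}

/* Transpose the n x n row-major matrix src into dst, one TILE x TILE
 * block at a time, so the rows being read and the rows being written
 * can stay in the cache while each block is processed. */
void transpose_tiled(float *dst, const float *src, size_t n)
{
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
            for (size_t i = ii; i < ii + TILE && i < n; i++)
                for (size_t j = jj; j < jj + TILE && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}
```

The untiled version is just the two inner loops running over the whole
matrix; it computes the same result but strides through dst a full row
apart on every write, which is what tiling is meant to avoid.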

The big thing to remember is that once you start getting into this
stuff, you're getting into optimizations that are specific not only to
one architecture, but to a particular processor. Optimizations that
work on a PIII might not improve performance on an Athlon, and
optimizations for either may well result in code that is slower on a
Celeron (and vice versa). Therefore, most of the serious low-level
stuff is best left to a compiler backend. If you're willing to do
serious autoconf work, though (I think fftw does something like this),
you can probably implement some good stuff by hand.
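
One way the build-time approach could look, as a rough sketch: the
configure script picks a tuning value for the target processor and
passes it to the compiler as a macro. TUNE_UNROLL and apply_gain are
hypothetical names of mine, not anything autoconf or fftw defines, and
whether unrolling by 4 helps at all has to be measured per processor.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical knob, e.g. set by configure via -DTUNE_UNROLL=4.
 * Only the values 1 and 4 are handled in this sketch. */
#ifndef TUNE_UNROLL
#define TUNE_UNROLL 4
#endif

/* Multiply every sample in buf by gain, in place. */
void apply_gain(float *buf, size_t n, float gain)
{
    assert(buf != NULL || n == 0);
    size_t i = 0;
#if TUNE_UNROLL == 4
    /* Unrolled body: four samples per iteration. */
    for (; i + 4 <= n; i += 4) {
        buf[i]     *= gain;
        buf[i + 1] *= gain;
        buf[i + 2] *= gain;
        buf[i + 3] *= gain;
    }
#endif
    /* Remainder loop (or the whole loop when TUNE_UNROLL is 1). */
    for (; i < n; i++)
        buf[i] *= gain;
}
```

gcc can often do this kind of unrolling itself (see -funroll-loops),
which is part of why leaving it to the compiler is usually the better
default.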

In general, you can get huge performance increases by respecting the
memory hierarchy. That is to say, if your code (or the code your
compiler generates) respects the fact that
     REGISTERS are an order of magnitude faster than
     L1 CACHE which is an order of magnitude faster than
     L2 CACHE which is an order of magnitude faster than
     MAIN MEMORY which is several orders of magnitude faster than
     DISK...
then you'll have code that's much, much faster than a naive translation.
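
A small illustration of what "respecting the memory hierarchy" can mean
in practice, as a sketch with made-up function names: both functions
below compute the same sum over a row-major matrix, but the first walks
memory sequentially while the second jumps a whole row between
consecutive accesses. On a matrix much larger than the cache the second
would be expected to take far more cache misses; I have not measured
it, so treat the difference as something to check with a profiler.

```c
#include <assert.h>
#include <stddef.h>

/* Sum a rows x cols row-major matrix in memory order: consecutive
 * accesses touch adjacent addresses, so each cache line fetched is
 * fully used before moving on. */
long long sum_row_major(const int *m, size_t rows, size_t cols)
{
    assert(m != NULL || rows * cols == 0);
    long long sum = 0;  /* local accumulator; can live in a register */
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            sum += m[r * cols + c];
    return sum;
}

/* Same sum, but column by column: consecutive accesses are cols
 * elements apart, so each one may land on a different cache line. */
long long sum_col_major(const int *m, size_t rows, size_t cols)
{
    assert(m != NULL || rows * cols == 0);
    long long sum = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            sum += m[r * cols + c];
    return sum;
}
```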

However, a lot of this is probably overkill for an effects plugin or
pattern-based softsynth. :-)

best,
wb

