[linux-audio-dev] Re: Realtime restrictions when running multiple audio apps .. Was: Re: disksampler ...

Subject: [linux-audio-dev] Re: Realtime restrictions when running multiple audio apps .. Was: Re: disksampler ...
From: David Olofson (david_AT_gardena.net)
Date: Mon Jul 17 2000 - 03:25:59 EEST


Hi!

I think I've lost track of the threads and lists a little lately, but
here's a looong post on multithreading vs. low latency RT...

On Fri, 14 Jul 2000, Benno Senoner wrote:
> [ I am CCing Victor Yodaiken and it would be nice if he could comment on the
> RTOS related topics in this mail ]
[...]
> > I admit that there might (and probably will be) problems in using any
> > existing soundserver. But you can always fix and tune things, as long as
> > the basic design is good enough.
>
> basic design is the key here .... my model requires _cooperation_ from the
> audio app (loading these as separate plugins and imposing restrictions
> on the operations each module can do and so on ... ) , while existing
> soundservers work out of the box using approaches like artsdsp or esddsp.

Yes... Design is indeed the key point here.

The whole idea with multithreading designs is to "avoid" the
singlethreaded, synchronized (round-robin scheduling style, if you
like) designs that would be required otherwise. This is all well and
good for a great many problems, in multiuser systems in particular,
and makes it a lot easier (or rather, possible) to do things like
time sharing, preemptive task switching (for smoother and fairer
sharing between programs doing massive, time consuming operations)
and secure, protected environments.

However, as we're dealing with real time applications and low latency
buffered I/O, some nasty side effects [that normally cause no
significant problems] begin to become serious issues. There are
basically two things that make real time applications different from
other applications, and both of these affect the way APIs and
"underlying structures" (ie engines, hosts etc) should be designed
and implemented:

1) Low latency (few and small buffers) in a time-shared environment
   requires scheduling and time sharing fairness of an accuracy that's
   way beyond what anything but a TRUE RTOS kernel (like RTL, RTAI or
   QNX) can provide. Since the reschedules will be frequent, they need
   to be extremely fast, and without significant latency peaks. To
   keep the threads from stepping on each other's toes, causing
   buffer over/underruns, priorities need to be set carefully, and
   scheduling policies have to be rather advanced. EDF (Earliest
   Deadline First) is an absolute requirement for allowing any
   substantial use of CPU power from within several RT threads, and
   that kind of scheduling requires extra info from the applications
   (that is, complexity in the apps as well) to work properly.

2) Point 2) is basically that you may ignore point 1): even if it
   is *possible*, and in some cases even viable, to design a monster
   of the kind required, it makes no sense for RT multimedia. It buys
   you little, gives you loads of non-obvious problems, and you can't
   even dream about doing it under anything but a full-blown RTOS
   kernel. To keep the latency peaks that anything but a fully
   preemptive RTK gives from building up and causing drop-outs, the
   scheduling of the concurrent RT threads needs to be almost as tight
   as direct callbacks / round-robin style scheduling, which basically
   renders every blocking/sync point an explicit task switch. Further,
   the scheduler has to be really smart to find out not only who
   depends on whom, but also which chain of sleeping threads needs to
   be scheduled first, so as not to miss a deadline later on.
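
To make the "extra info from the applications" in point 1) concrete,
here is a minimal sketch of the EDF rule in C. The rt_task struct and
both function names are hypothetical and not taken from RTL, RTAI or
QNX. It assumes independent periodic tasks on a single CPU whose
deadlines equal their periods, and it ignores scheduling overhead and
the dependency chains mentioned in point 2).

```c
#include <stddef.h>

/* Hypothetical task descriptor; the field names are illustrative only. */
typedef struct {
    const char   *name;
    unsigned long deadline_us;  /* absolute deadline, in microseconds */
    unsigned long cost_us;      /* worst-case execution time per period */
    unsigned long period_us;    /* how often the task has to run */
    int           runnable;     /* nonzero if the task has work pending */
} rt_task;

/* EDF rule: among the runnable tasks, run the one whose deadline is
 * nearest. Returns the index of that task, or -1 if none is runnable. */
int edf_pick_next(const rt_task *tasks, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (!tasks[i].runnable)
            continue;
        if (best < 0 ||
            tasks[i].deadline_us < tasks[(size_t)best].deadline_us)
            best = (int)i;
    }
    return best;
}

/* Total CPU utilization: the sum of cost/period over all tasks.
 * Under the assumptions stated above, EDF can meet every deadline
 * only if this value does not exceed 1.0. */
double edf_utilization(const rt_task *tasks, size_t n)
{
    double u = 0.0;
    for (size_t i = 0; i < n; i++)
        u += (double)tasks[i].cost_us / (double)tasks[i].period_us;
    return u;
}
```

Note that cost_us and period_us are exactly the kind of information
a plain SCHED_FIFO thread never gives the kernel, which is where the
added complexity in the applications comes from.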

Sure, multithreaded RT *could* be implemented, and perhaps it could
even work! However, consider two of my takes on how to do it, before
you think "Great! So maybe we can have applications run the
processing locally after all! :-)";

Model 1
-------
The easy way. This can be done under Linux/lowlatency, and I
actually think Benno Senoner has a somewhat working implementation
of it lying around already.

It could perhaps deliver acceptable performance with some help from
a simple kernel module, but it's already a setback compared to the
"central engine" approach, since every extra thread involved means an
extra switch, and thus an extra chance for the kernel to do something
silly, like not rescheduling for "a while".

a) View every application RT thread as a client to a central mixer
   thread/daemon. The threads block on only one thing; the "port"
   they use to communicate with the central mixer.

b) Require that the applications obey the central mixer's suggestions
   regarding buffer size and number of buffers, so that the mixer can
   act as a scheduler, waking the application threads up as required.

c) Make the central mixer run at highest SCHED_FIFO priority, and make
   sure that applications CANNOT get equal or higher priority. This
   allows the central mixer to simply ignore or shut down misbehaving
   client applications, and never to drop-out itself.

d) Give the central mixer control over the priorities of the RT
   threads. This allows the mixer to set priorities so that clients
   with different buffer sizes/latency requirements can run nicely
   together.

This design practically forces the RT threads of the applications to
run like callback style plugins.
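
Points c) and d) could be sketched roughly as follows. This is only
an illustration of the priority policy: the mixer_client struct and
the function names are hypothetical, and the rule "smaller buffer
means higher priority" is one plausible choice rather than something
prescribed above. Only sched_get_priority_max() and
sched_get_priority_min() are real POSIX calls.

```c
#include <sched.h>
#include <stddef.h>

/* Hypothetical client record kept by the mixer; names are illustrative. */
typedef struct {
    unsigned period_frames;  /* buffer size the mixer told the client to use */
    int      priority;       /* SCHED_FIFO priority chosen by the mixer */
} mixer_client;

/* The mixer keeps the top SCHED_FIFO priority for itself (point c). */
int mixer_priority(void)
{
    return sched_get_priority_max(SCHED_FIFO);
}

/* Give each client a priority strictly below the mixer's (point d).
 * Clients with smaller buffers have tighter deadlines, so they rank
 * higher. Priorities are clamped at the SCHED_FIFO minimum. */
void assign_client_priorities(mixer_client *c, size_t n)
{
    int top = mixer_priority() - 1;
    int low = sched_get_priority_min(SCHED_FIFO);

    for (size_t i = 0; i < n; i++) {
        /* rank = number of clients with a strictly smaller buffer */
        int rank = 0;
        for (size_t j = 0; j < n; j++)
            if (c[j].period_frames < c[i].period_frames)
                rank++;
        c[i].priority = (top - rank > low) ? top - rank : low;
    }
}
```

The computed values would still have to be applied to the client
threads, for instance with sched_setscheduler(), and that requires
suitable privileges.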

The advantages are that these "plugins" run in separate threads
inside separate applications, and that when they severely violate
the RT rules, they simply get ignored by the central mixer, rather
than making the whole "central engine" stall.

The disadvantages are that you still need an API similar to a plugin
API, that this API will be quite a bit more complicated and have more
overhead, and finally, that the latency will be higher and less
reliable than that of a single-audio-thread central engine. (Note
that all the usual lock-free communication stuff is required anyway,
so you're *not* getting away with simpler code just because you keep
it all inside your application!) Further, you still have the buffer
size restrictions for the streaming just as with callback style
plugins, since the scheduling still relies on it. Allowing all kinds
of scheduling rates for different client applications under heavy
load would be an insane nightmare, even if you tried to do it under
RTL, where you could practically disregard scheduling latencies. The
scheduling *policy* is the real problem here, not the latencies.

My conclusion: Doable, but since you still have to code your threads
like callback plugins, most of the point is lost.

Model 2
-------
The design to use when lots of CPU time, multiple threads, and full
control is required, and a true hard RT kernel can be used. This is
what application developers without control engineering experience
would consider a nice solution. I believe this is what most
application developers are dreaming about, as I've seen many, many
arguments based on ideas stemming from this design philosophy, and
also many applications relying on this to eventually work on a
"normal" OS.

a) Let every application do what it wants the way it wants to do it.
   Applications that cannot run hard RT due to design errors will
   just not work properly, but won't disturb the system. (*1)

b) In particular, RT threads are allowed to use any buffer sizes/
   scheduling rates they like, and even vary the buffer size as
   they like, since there is no central engine that relies on, or
   expects anything.

c) You *still* have to perform streaming I/O using sensible read/
   write sizes for performance reasons, since every operation has
   to go through the sync/blocking code, just like the
   read()/write() operations of Linux. Looping one sample at a time
   will fry your CPU for nothing, and that'll probably be the case
   for a few years to come. (*2)

d) If you need to use substantial amounts of CPU time, you'll
   probably have to tell the OS about the bandwidth and CPU time
   requirements, so that scheduling can be set up properly.

e) Unless the OS has sufficient information to run EDF scheduling,
   it'll have to resort to high frequency time slicing/sharing, or
   some other, perhaps not yet invented scheduling policy, since the
   deadlines are too tight for anything like normal time sharing.

(*1) Hardly possible to do automatically, as either the user or
     the OS has to find out who's eating the CPU occasionally,
     and lower the priority of that thread. You will already have
     had a few drop-outs in random RT applications before you (or
     the OS) have tracked down the overprioritized offender.

(*2) Eventually, the overhead involved with read()ing/write()ing a
     single sample at a time might become a non-issue because of the
     enormous number of operations performed on every single
     sample, in relation to the overhead of the kernel calls. This
     does not apply to all areas of signal processing, of course, and
     most probably never will.
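
The cost behind point c) and note (*2) is easy to put into numbers.
The two helpers below are a hypothetical sketch; the 44.1 kHz sample
rate used in the figures afterwards is an assumption, as no rate is
given above.

```c
#include <assert.h>

/* Number of read()/write() style calls per second needed to move
 * rate_hz frames per second in chunks of frames_per_call frames. */
double calls_per_second(double rate_hz, unsigned frames_per_call)
{
    assert(frames_per_call > 0);
    return rate_hz / (double)frames_per_call;
}

/* Time between two calls, in microseconds. This is the budget within
 * which the thread has to be woken up, do its work and block again. */
double call_period_us(double rate_hz, unsigned frames_per_call)
{
    assert(rate_hz > 0.0);
    return 1e6 * (double)frames_per_call / rate_hz;
}
```

At 44.1 kHz, one frame per call means 44100 calls per second, with
about 22.7 us between them. With 128 frames per call it is about 345
calls per second, with about 2.9 ms between them.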

This one looks a bit nicer than the central engine approach, or model
1 above, as long as YOU don't have to implement the scheduler! :-)

Advantages: RT applications and plugins are basically the same thing;
just threads running in a UN*X style OS, using something very similar
to pipes for streaming.

Disadvantages: Almost impossible to implement in real life. Many and
non-obvious problems (deadlocks, missed deadlines due to
miscalculated thread execution times, misbehaving threads that in
unobvious ways manage to steal the CPU time of other threads, which
in turn miss deadlines,...) plague all kinds of systems that are
designed this way. Still, at the end of the day, all the fundamental
issues are still there. You can *NEVER* disobey the RT programming
rules and get away with it in a longer perspective, no matter how
smart an OS or API you use.

My conclusion: Not viable for low latency multimedia. It can already
be done to some extent with sufficient buffering and lots of small
buffers, or FIFOs, but even under those circumstances, it gets
virtually impossible to make such a setup run reliably, especially if
the total RT CPU load is high. To provide 2-3 ms audio latency with
this model, we need overkill hardware to make up for the overhead
(many small buffers), and a hard RT kernel with peak latencies in the
µs range.

> > This reminds me of LADSPA development. We managed to keep it
> > simple-and-stupid, and now we have a plugin API. Many said, that
> > development will continue, and we will see LADSPA-v2. Ok, this will
> > probably happen, but as we all know, nothing has happened yet. I think
> > this is the reason why we should try to develop the current designs rather
> > than start from scratch.

True. However, in some cases, the design of the existing solutions
definitely renders some very desirable features impossible to
implement. Low latency processing and clients running in separate
threads don't mix too well on any current OS.

So is the choice either to use one of these two ways of constructing
environments, or to force everyone to switch to a completely new
system that's designed around the requirements for RT?

Well, not quite! Of course it's possible to build a system around an RT
engine that supports remote controlled loading and running of
plugins, and then implement the current audio APIs that most
applications use, on top of that.

There will have to be those extra 2-3 ms of latency, as the engine
has to mix all inputs (although it might bypass the mixer and let a
*single* application talk "directly" to the drivers), but
applications with too high requirements to put up with that should
really be ported to the new API anyway! (It seems that lots of
applications have to be corrected to take advantage of
Linux/lowlatency anyway, so what's the big deal, really? And let's
not even think about what you need to do to get acceptable latencies
on Other Platforms...!)

> Developing a non-trivial project always takes
> > time. The sampler-sw project is a different thing, as we don't have any
> > existing (free) implementations at our use.
>
> Yes it takes time, but for now we can always pipe let's say the sound-output
> of the sw-sampler into arts using arts as soundserver , so what's the problem ?

Latency. But that's an *implementation* problem; not an API problem,
and can thus be fixed later, as long as no serious API design flaw
prevents it. That is, as long as the sampler has a sensible API, it
can run anywhere, including under RTL!

> Adding the "manually-scheduled audioclients" model to arts can be done
> but I prefer to encapsulate the stuff into a small and _lean_ app which is
> easily maintainable and where the chances are bigger that we can produce
> an efficient audio scheduler.

Yep. Thanks to the requirements on RT audio applications, the
application will probably be similar in design to a callback style
plugin anyway (the mixing thread, that is), so porting it to some
suitable plugin API shouldn't be much of an issue.

> > And as for the high-end <-> low-end separation, I just don't see much
> > sense in it. Developers can do what they want, but I as a user,
> > want all my apps to work without glitches. And surprising or not, many "low-end" toy
> > programs are much more suitable for creative use than these "professional
> > apps". It's a damn shame if a program has latency problems, but if it
> > produces nice sounds, I'll still use it - one way or another.
>
> I agree and this is why I proposed to write a sort of rtsoundserverdsp module
> ( similar to artsdsp or esddsp) which can fetch the audio from an existing
> soundserver.

Good. Will work now, and will work great when other applications
start to support whatever API(s) is eventually agreed upon.

[...]
> The only risk of overall dropouts arises when the "low-end" application
> runs SCHED_FIFO and blocks the CPU for several msecs.

This "little" problem is similar to the problems I mentioned above.
Very hard to do anything about, unless we can get more control over
the applications that run RT. IMHO, RT scheduling should not be a
root privilege, but rather a resource with multiple levels of access,
mapping to how high priorities can be used. Then it's just a matter
of users or sysadmins installing the applications properly, so that
trusted applications can and do use higher priorities than "low-end"
RT applications.
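
As a rough illustration of RT scheduling as "a resource with multiple
levels of access", the sketch below maps an access level to a
SCHED_FIFO priority ceiling. Everything here is hypothetical: the
rt_access levels, the function names and the way the range is split
are assumptions made for the example, and no such interface exists in
the kernels discussed in this thread.

```c
#include <sched.h>

/* Hypothetical access levels, as a sysadmin might assign at install time. */
typedef enum { RT_NONE = 0, RT_LOW_END, RT_TRUSTED, RT_ENGINE } rt_access;

/* Highest SCHED_FIFO priority an application at the given level may
 * request, or -1 if it gets no RT scheduling at all. */
int rt_priority_ceiling(rt_access level)
{
    int lo = sched_get_priority_min(SCHED_FIFO);
    int hi = sched_get_priority_max(SCHED_FIFO);

    switch (level) {
    case RT_ENGINE:  return hi;
    case RT_TRUSTED: return lo + (hi - lo) / 2;
    case RT_LOW_END: return lo;
    default:         return -1;
    }
}

/* Clamp a requested priority into the range allowed for the level.
 * Returns -1 if the level grants no RT scheduling. */
int rt_grant_priority(rt_access level, int requested)
{
    int lo = sched_get_priority_min(SCHED_FIFO);
    int ceiling = rt_priority_ceiling(level);

    if (ceiling < 0)
        return -1;
    if (requested < lo)
        return lo;
    return requested > ceiling ? ceiling : requested;
}
```

With a policy like this, a "low-end" application asking for the
maximum priority would end up at the bottom of the SCHED_FIFO range,
below every trusted application.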

> The solution of the problem is: do not run "untrusted" (eg the apps where you
> don't know that they meet the RT programming restrictions etc) as root.

What I don't like about this is that the applications may still
benefit from SCHED_FIFO, even though they occasionally do something
silly. It's the only safe way to do it right now, though.

[...]

> I know it is much easier to implement a simple MP3 player using single
> threading like this:
>
> while(1) {
> read() MP3 frame from disk
> decode_MP3_frame();
> write() audio fragment to audio device
> }
> }
>
> rather than doing with two separate threads, one for audio and one
> for disk IO. (audio having higher priority than disk)

Nice example of Very Bad RT programming, indeed! :-)

This one illustrates a problem that's very, VERY hard to fix outside
the applications, and no matter how many programmers claim that "it'll
eventually be a non-issue": THIS IS NEVER GOING TO HAPPEN!

Well, *perhaps* when everyone's using solid state disks with the same
access time as the system RAM, and no block based file system layer
whatsoever, but to me, that sounds rather unlikely to ever happen.
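
The two-thread alternative mentioned in the quoted text usually comes
down to a lock-free FIFO between the disk thread and the audio
thread. Below is a minimal single-producer/single-consumer sketch
using C11 atomics. The names and the RING_FRAMES size are
assumptions, and the code is not taken from any of the applications
discussed here. It assumes exactly one writer thread and one reader
thread, and a ring that starts out zero-initialized.

```c
#include <stdatomic.h>
#include <stddef.h>

#define RING_FRAMES 4096u  /* assumed size; must be a power of two */

/* Single-producer/single-consumer ring: the disk thread writes, the
 * audio thread reads. Neither side ever takes a lock. */
typedef struct {
    float         data[RING_FRAMES];
    atomic_size_t head;  /* total frames written (disk thread only) */
    atomic_size_t tail;  /* total frames read (audio thread only) */
} ring;

/* Disk thread: copy in as many decoded frames as fit and return the
 * count. Blocking in read() elsewhere is fine at this thread's priority. */
size_t ring_write(ring *r, const float *src, size_t n)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    size_t space = RING_FRAMES - (head - tail);

    if (n > space)
        n = space;
    for (size_t i = 0; i < n; i++)
        r->data[(head + i) & (RING_FRAMES - 1)] = src[i];
    atomic_store_explicit(&r->head, head + n, memory_order_release);
    return n;
}

/* Audio thread: always fills dst with exactly n frames and never
 * blocks. Missing data is replaced by silence. Returns the number of
 * frames that were real data. */
size_t ring_read(ring *r, float *dst, size_t n)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    size_t avail = head - tail;
    size_t got = n < avail ? n : avail;

    for (size_t i = 0; i < got; i++)
        dst[i] = r->data[(tail + i) & (RING_FRAMES - 1)];
    for (size_t i = got; i < n; i++)
        dst[i] = 0.0f;  /* underrun: output silence instead of waiting */
    atomic_store_explicit(&r->tail, tail + got, memory_order_release);
    return got;
}
```

The point of the design is that a slow disk costs the audio thread a
moment of silence at worst. In the single-threaded loop, the same
slow read() stalls the write() to the audio device.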

> All other approaches like running 5 separate audio apps communicating via
> pipe/socket to a soundserver and hoping that EACH task (clients plus
> soundserver) will never miss the 700usec processing cycle is not a sure thing.
> Jun Sun from Montavista has measured 1msec scheduling latencies under high load
> and I fear that the client/server model is not up to the task to ensure these
> 700usec processing cycles.

It could be done under RTL. However, see model 1 above, and also note
that it's not as simple as "running the RT threads under RTL". This
is a BIG hack, no matter how it's done, and if it's going to look
like SCHED_FIFO in user space, it's even more work.

Would it be worth the effort? Well, I'm still considering going back
to RTL and see if I can help out, but it takes a serious need for
lower latency than 3 ms AND for the things that RTL doesn't provide
right now before I go down that route. Currently, I'm doing just fine
with Linux/lowlatency for 2+ ms latency and pure RTL kernel modules
for hardcore sub ms RT stuff. Sure, I'd love to see a single solution
that can do both, but...

> Assume that you want to design an audio app in a "right" way like using
> multithreaded approach: assume each application has 3 threads
> (audio, MIDI , disk) multiply this with 5 concurrent applications and you get
> 15 SCHED_FIFO threads all fighting for the CPU, and the 5 audiothreads plus
> 5 midi threads will all fight to ALL achieve <1msec scheduling latencies.
> If one of these 10threads fails then there will be some glitch
> (either audio or MIDI timing variation).

Also, see above. Not only would there be 15 threads to schedule in
this example; some of the threads would also consume considerable
amounts of CPU time, which makes the distinction between scheduling
a task, and a task *finishing* very important. SCHED_FIFO has no
means whatsoever to deal with this. (Well, possibly except for thread
priorities, but that's nowhere near adequate when all CPU hungry
threads have about the same schedule timing profile.)

> With my model the number of threads will be always 3
> (or perhaps the disk-threads could be duplicated since they do not
> have so tight timing constraints)
> and scheduling the various plugins (with a simple subroutine call) has
> virtually no cost compared to full context switches of dozens of processes.

Besides, running plugins that way makes it very easy to control who
does what when, and making sure plugins get executed in the right
order WRT dependencies in the net. The host basically says "process N
samples NOW!", and when the plugin returns, this work has been done.
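
A minimal sketch of that "process N samples NOW!" contract, in C. The
plugin struct and the two example plugins are hypothetical; this is
not LADSPA, MuCoS or any other real plugin API. The host runs the
chain in place, so it assumes every plugin tolerates its input and
output buffers being the same.

```c
#include <stddef.h>

/* Hypothetical callback-style plugin. */
typedef struct plugin {
    /* Process exactly n samples from in to out, then return.
     * Must not block, allocate memory or make system calls. */
    void (*process)(struct plugin *self, const float *in, float *out,
                    size_t n);
    float param;
} plugin;

/* Example plugin: multiply each sample by param. */
void gain_process(plugin *self, const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * self->param;
}

/* Example plugin: add param to each sample. */
void offset_process(plugin *self, const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] + self->param;
}

/* Host: run a chain of plugins in place over one buffer. The array
 * order is the dependency order, so each plugin sees the output of
 * the one before it, and the work is done when the call returns. */
void host_run_chain(plugin *chain, size_t count, float *buf, size_t n)
{
    for (size_t i = 0; i < count; i++)
        chain[i].process(&chain[i], buf, buf, n);
}
```

There is no scheduler involved at all: each "task switch" is a plain
function call, and the execution order is whatever the host decides.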

If the host wants to be more fool-proof, it could use (partially)
separate address spaces for the plugins and a watchdog timer to keep
nasty plugins from freezing or crashing the entire engine. This could
probably be implemented completely in user space, or more
efficiently using helper modules in kernel space, or whatever, but
it doesn't affect the plugins or the plugin API.

> > So, what I suggest now:
> >
> > - specify our requirements for the soundserver (*)
>
> in order to design/tune an efficient soundserver model , we need
> realworld apps which interact with it.
> I will use the disksampler as testbed.
> (who stops you to run 5 disksamplers with 3msec latencies ? :-)
> believe it or not but it _IS_ possible with the above model)

Yeah, and that's why Steinberg do it in a similar way with ReWire.
Dirty hack or not; it's the only sane way of doing it, even with a
kernel such as Linux/lowlatency.

Besides, several so-called operating systems have done *all* of their
"multitasking" in this way, and there, the applications weren't
exactly as deterministic as the average audio plugin... And RT audio
programming pretty much has to be done in this style anyway, for
various other reasons, some of which have been mentioned too many
times (if possible! ;-) already.

> > - check the current status of aRts, esd, X-audio, etc
>
> You are trying to chase the "one size fits all" approach but I am convinced
> that this is not the right way to go.

Right. Some require RT or even hard RT, and they'll simply have to
deal with the real world, follow the rules that all control
engineers, game programmers and others have been following for
"ages", and use the APIs that allow them to do this in a civilized
manner. (As opposed to hijacking the whole system, bypassing the OS
and similar approaches.)

As to the rest [users and programmers not requiring hard RT / low
latency], they'll most likely be most happy if they can use their
favourite APIs and get some extra flexibility in the signal routing,
application integration and networking areas.

> > - make a decision whether to use an existing soundserver or
> > start a new project
>
> I hope that I have given enough points to answer this question.
>
> I neither want to spread FUD nor am I a "wheel-reinventer" because the
> software has to be written by me because I do not trust other people's work.
>
> I say only that by keeping things simple and modular, we have the biggest
> chance to achieve the best realtime performance and smallest maintainability
> overhead.

Just a few reasons to start a new project:

* Nothing similar exists.

* Similar projects exist, but none are similar enough to do your job
  without "adjusting" the goals.

* The leaders of current projects don't want to change the goals of
  their projects.

* Existing projects have too many participants with too different
  views and no leader strong enough to ensure that the project will
  keep going in the right direction.

* No existing project is "right" enough to warrant a fork.

* Existing projects have too complicated and/or messy code, making it
  easier to hack new code than to reuse their code.

* You enjoy leading a project and/or design and/or coding cool things,
  and you think that you might come up with a better solution.

[...]
> I hope that I made the point clear that in order to achieve high performance
> we have to accept some compromises; there is almost no way around it.
> And developers / users of high-end software can live with these restrictions.
>
> It would be nice to hear comments from other list members
> (David Olofson , Paul , Stephan etc)

Well, I hope my long, boring post explains *why*, from a technical
perspective, we need to go this way (or some very similar way) to get
usable performance at an acceptable cost. Hardware cannot get
infinitely fast, and neither RT kernels nor nice APIs are magic, so
we just have to deal with it, and come up with a usable "compromise".

> PS: do not expect doing 2msec audio over network when running X11 clients,
> that is asking too much since all involved processes on both machines would need
> to run SCHED_FIFO , plus you have to ensure the network doesn't get
> congested

Well, there are good NICs, crossover cables and RTL, if you *really*
want to do it. :-)

//David

.- M u C o S --------------------------------. .- David Olofson ------.
| A Free/Open Multimedia | | Audio Hacker |
| Plugin and Integration Standard | | Linux Advocate |
`------------> http://www.linuxdj.com/mucos -' | Open Source Advocate |
.- A u d i a l i t y ------------------------. | Singer |
| Rock Solid Low Latency Signal Processing | | Songwriter |
`---> http://www.angelfire.com/or/audiality -' `-> david_AT_linuxdj.com -'


This archive was generated by hypermail 2b28 : Mon Jul 17 2000 - 10:56:58 EEST