[linux-audio-dev] Re: RT watchdog .. Was: Re: Lowlatency rpm problem


Subject: [linux-audio-dev] Re: RT watchdog .. Was: Re: Lowlatency rpm problem
From: David Olofson (david.olofson_AT_reologica.se)
Date: Wed Oct 11 2000 - 17:16:01 EEST

Next message: Jamie Lokier: "[linux-audio-dev] Re: lowish-latency patch for 2.4.0-test9"
Previous message: Benno Senoner: "[linux-audio-dev] RT watchdog .. Was: Re: Lowlatency rpm problem"

Wed, 11 Oct 2000 Benno Senoner wrote:
> On Wed, 11 Oct 2000, David Olofson wrote:
>
> >
> > One thing that worries me is this; How to find out if a thread is
> > deadlocked, or if it's just running/sleeping with very short
> > intervals, without a kernel hack? Do the existing APIs provide
> > reliable info in such cases? (I don't want to rely on the threads
> > reporting to the daemon in any explicit way! Not reliable.)
> >
> > Preferably, this should be done without any kernel hacks, extra
> > modules or other special requirements.
>
> not sure what's the desired reaction speed of your watchdog.
> (e.g. you want a task with 5 msec timeslices to be killed after no more than
> 5-6 msec in order to cause minimal audio disruption)

Basically, the idea is to prevent a freaked-out RT application from taking down
the system. As to RT performance, the system is most likely already beyond
help long before the watchdog has a chance to see what's happening.

> Anyway I think such a fast reaction speed is not needed because if
> a task freezes then it is buggy or if you overbook the CPU for a few msec,
> one might not intend to kill the softsynth that caused this.

Exactly.

> I think the best way for the watchdog to see if all realtime processes are
> still alive, would be:
>
> - the RT process must periodically set a flag to 1 in a shared mem area which is
> then checked periodically (let's say every 1-2 secs) by the watchdog and
> then reset to zero.
> If the watchdog finds that the flag is zero during the next check, then it means
> that the RT process froze in some part of the code (without executing the
> main loop).
> On the other hand you might fear that the RT process did not freeze but could
> be in a state where the CPU is constantly overbooked, thus the audio write
> loop never blocks, causing in practice a freeze of the non RT threads.

Yep, that's why I don't want to rely on such a scheme. :-)
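
A minimal sketch of the flag scheme quoted above, to make the weakness concrete. The names heartbeat_t, heartbeat_beat() and heartbeat_check() are made up for illustration, and the flag is assumed to live in memory shared between the RT process and the watchdog (for instance via shm_open() and mmap(), not shown). Note that an RT loop that is overloaded but still spinning keeps setting the flag, so this check alone cannot tell "healthy" from "hogging the CPU":

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical heartbeat slot; assumed to sit in shared memory. */
typedef struct {
    atomic_int alive;   /* set to 1 by the RT process, cleared by the watchdog */
} heartbeat_t;

/* Called by the RT process once per pass through its main loop. */
void heartbeat_beat(heartbeat_t *hb)
{
    assert(hb != NULL);
    atomic_store(&hb->alive, 1);
}

/* Called by the watchdog every 1-2 seconds.
 * Returns true if the RT process set the flag since the previous check,
 * and clears the flag for the next interval in the same atomic step. */
bool heartbeat_check(heartbeat_t *hb)
{
    assert(hb != NULL);
    return atomic_exchange(&hb->alive, 0) == 1;
}
```

The exchange reads and clears in one operation, so a beat that arrives between "read" and "reset" cannot get lost.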

> In this case I would adopt the same flagging strategy, but this time using a
> second, non-RT thread (which belongs to the watchdog).

It should probably be a SCHED_FIFO thread with the lowest priority - otherwise
you'll have no idea if it's the RT threads alone that keep it from running, or
if it's just some heavy non-RT job that's hogging the CPU, causing this low-prio
watchdog thread not to answer in time...
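
Roughly, putting the canary thread at the bottom of SCHED_FIFO could look like the sketch below. The function names are hypothetical; sched_get_priority_min() and sched_setscheduler() are the standard POSIX calls. It assumes Linux, where a pid of 0 means the caller, and the switch normally fails with EPERM unless the process has the privileges to use RT scheduling:

```c
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stddef.h>

/* Fill *param with the lowest SCHED_FIFO priority the system offers.
 * Returns 0 on success, -1 if the priority cannot be queried. */
int lowest_fifo_param(struct sched_param *param)
{
    assert(param != NULL);
    int prio = sched_get_priority_min(SCHED_FIFO);
    if (prio == -1)
        return -1;
    param->sched_priority = prio;
    return 0;
}

/* Switch the caller to SCHED_FIFO at the lowest priority.
 * Returns 0 on success or an errno value such as EPERM. */
int become_lowest_fifo(void)
{
    struct sched_param param;
    if (lowest_fifo_param(&param) != 0)
        return EINVAL;
    if (sched_setscheduler(0, SCHED_FIFO, &param) == -1)
        return errno;
    return 0;
}
```

At that priority any other SCHED_FIFO thread preempts the canary, while ordinary non-RT load cannot, which is the distinction the watchdog needs.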

Or; if it's not a SCHED_FIFO thread, what is a sensible/reliable timeout before
assuming overload? We don't want the watchdog to kill a working RT application
just because it thinks the non-RT threads get too little CPU time. Then again,
it might be a good idea to have the watchdog look for this too, just in case.
Very long timeouts, though!

Oh, well... This is really a user/sysadmin thing, and should be in the config
file!

How about CPU time quotas for the RT threads? (So you can reserve some 10-30%
for GUI, disk I/O etc, and smash up a nice warning if the engine gets too close
to the limit, rather than after overloading the system. Better than setting
overload == 100%, since there is actually some headroom when the watchdog
starts signalling overload to the engine. :-)
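
As a sketch of what I mean by a quota with headroom: the watchdog measures how much CPU time the RT threads used in a window and compares it against two configured levels. Everything here (the names, the two-threshold layout, the percentages in the comments) is an assumption for illustration, not an existing API; how the CPU time is actually measured is left out:

```c
#include <assert.h>

typedef enum { LOAD_OK, LOAD_WARN, LOAD_OVER } load_state_t;

/* Classify RT CPU usage over one measurement window.
 * rt_cpu_us: CPU time consumed by the RT threads in the window
 * window_us: wall-clock length of the window
 * quota_pct: share of the CPU the RT threads may use (e.g. 80,
 *            leaving 20% for GUI, disk I/O etc)
 * warn_pct:  share at which the engine gets a warning (<= quota_pct) */
load_state_t classify_rt_load(long long rt_cpu_us, long long window_us,
                              int quota_pct, int warn_pct)
{
    assert(window_us > 0 && warn_pct <= quota_pct);
    long long used_pct = rt_cpu_us * 100 / window_us;
    if (used_pct >= quota_pct)
        return LOAD_OVER;   /* quota exhausted: the watchdog may act */
    if (used_pct >= warn_pct)
        return LOAD_WARN;   /* close to the limit: warn the engine */
    return LOAD_OK;
}
```

With quota_pct below 100, LOAD_OVER is reported while the rest of the system still has CPU time left to react.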

> If the non-RT thread does not give any signs of life
> (basically it does while(1) { i_am_alive=1; sleep(1); } )
> then assume that we are likely to be in a CPU overbooking situation
> (assuming the RT threads still periodically inform the watchdog that they
> are running ok) In this case it is hard to figure out WHO is causing the mess.

Yeah, especially if there are several RT threads...

> Perhaps one could simply see which thread consumes the biggest amount of
> CPU (like top does), and then kill this thread.
> But it is not 100% foolproof.

<dream mode>
No, it's not really fair. Assume we're in a multi-user RT server: The user
threads should not be thrown out at random just because the system is
overloaded!

First, there should be some overload margin, and second, the users that are
trying to use more CPU than they have the right to do should face the
consequences.

This would be the perfect scenario, and shouldn't be a problem to implement -
</dream mode>
as an alternative scheduler module for RTL or RTAI! Well, it might be possible
to do in the standard Linux scheduler as well, but I'm afraid that would
require some brutal hack to implement hard RT deadline timeouts on a
per-thread basis. Without that, you just can't keep misbehaving RT threads from
breaking the deadlines of other RT threads; only possibly make sure it doesn't
happen more than a few times...
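
To illustrate the difference between "kill the biggest consumer" and the fairness rule from the dream-mode part above, here is a small sketch that picks the user who exceeds their own entitlement by the most. The struct and function are hypothetical, and per-user CPU accounting is assumed to exist already:

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int used_pct;      /* CPU share the user's threads actually consumed */
    int entitled_pct;  /* CPU share the user has the right to use */
} user_load_t;

/* Return the index of the user that exceeds its entitlement by the
 * largest amount, or -1 if every user is within its share. */
int worst_offender(const user_load_t *users, size_t n)
{
    assert(users != NULL || n == 0);
    int worst = -1;
    int worst_excess = 0;
    for (size_t i = 0; i < n; i++) {
        int excess = users[i].used_pct - users[i].entitled_pct;
        if (excess > worst_excess) {
            worst_excess = excess;
            worst = (int)i;
        }
    }
    return worst;
}
```

A user running at 50% with a 60% entitlement is left alone, while one at 25% with a 20% entitlement is singled out, even though the first one tops the "top" listing.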

> Anyway such a solution would be much more than a windoze solution can offer.
> (and windows does not have a concept of stability, thus the watchdog talks
> on that platform would be useless :-)) )

This sounds terribly anti-M$, but in my experience this is all too true... If a
DirectX plugin crashes, you may get a black screen, a total system freeze or an
instant reboot. (And that's *WITHOUT* anything like RT scheduling or anything
else that suggests that you're bypassing system security to gain something
useful.) So, why safeguard against freezes (that can't happen anyway, since
there's no SCHED_FIFO) when the real problem is that you blow up the system
before you have a chance to get to the deadlocked thread state...?

BTW, I'm still running 2.2.10-lowlatency here, and the uptime is 30 days. I've
still not had a kernel crash on this machine, except when f*cking up my own
kernel drivers. The only problem I have seen with this kernel is that PPP
freeze, and I don't use PPP here...

Anything newer that seems stable, while still delivering 2.2.10-ll kind of
performance?

(I need to patch a kernel with both LL and RTL or RTAI, so it'd be best if I
could use a very recent 2.2.x kernel. Then again, it's not a very good idea
starting to beta test a kernel when I'm supposed to set up a rock solid OS +
software in order to track down a hardware problem... To be ready for action
within a few days. *hrmpf* I might get away with RTAI/RTL only for now, though.)

And, as before;

> > (Please, CC to do_AT_reologica.se, as that address isn't subscribed to
> > these lists.)

(At work right now.)

David Olofson
Programmer
Reologica Instruments AB
david.olofson_AT_reologica.se


This archive was generated by hypermail 2b28 : Wed Oct 11 2000 - 18:52:26 EEST