Re: my take on the "virtual studio" (monolith vs plugins) ... was Re: [linux-audio-dev] ardour, LADSPA, a marriage


Subject: Re: my take on the "virtual studio" (monolith vs plugins) ... was Re: [linux-audio-dev] ardour, LADSPA, a marriage
From: David Olofson (david_AT_gardena.net)
Date: Sun Nov 19 2000 - 08:45:11 EET


On Fri, 17 Nov 2000, Benno Senoner wrote:
[...]
> > > Plus the system needs to provide more than one datatype since float is
> > > not going to solve all our problems.
> >
> > Right; I have a few models for this by now, ranging from very, very
> > basic designs, such as enumerating a fixed set of API built-in
> > datatypes, nearly all the way to IDL.
> >
> > I think I'll stop slightly above the fixed datatype set variant;
> > there will be a bunch of built-in datatypes as of the 1.0 API, but
> > hosts should be able to use translator plugins or libraries to deal
> > with newer or custom types. (Note: No need to understand types as
> > long as you're connecting ports with the same type ID - just move
> > the raw data.)
>
> I propose to use my dynamic datatype stuff I implemented some time
> ago.
> Infinite number of datatypes, plus host does not need to know how
> to interpret the data, as long as it finds plugins where
> output_datatype of plug A matches input_datatype of plug B.
> (of course plugs are required to support at least one of the
> "builtin" datatypes (eg. float,double,int16,int32) otherwise
> the system will not be usable)

Yes, if it's going into LAD(S)PA - although it will work in very
similar ways with MuCoS.

The version I've been hacking on basically differs from yours only
in the way datatypes are identified. I'm using ID codes rather than
strings (faster and simpler type matching when connecting). This will
of course mean that datatypes need to be managed centrally to avoid
clashes (like plugin IDs), but then again, strings aren't entirely
safe either...

Anyway, how many new data types/month are we expecting to see? :-)

My thought was that keeping the footprint and overhead of the type
matching system down would be nice, while countless custom datatypes
popping up all over doesn't seem like a likely or desirable scenario
(*) - that's why I prefer numbers to strings.

(*) How useful are complex datatypes anyway? They can't be converted
    into other, more generic datatypes in any useful way, and thus
    they can only be used for internal communication between plugin
    packages. That kind of stuff should be *possible*, but strongly
    discouraged, IMHO, as it contradicts the very fundamental ideas
    behind a standard plugin API.

Well, that's all - it's basically just a matter of what a "datatype
descriptor" is; other than that, we're thinking along the same
lines. :-)
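
Just to make this concrete, here's roughly what I have in mind for
the descriptor side - purely a hypothetical sketch, not actual MuCoS
or LADSPA code, and all the names are made up:

/* Hypothetical sketch - type IDs are registered centrally,
 * like plugin IDs. */
typedef unsigned long dt_type_id;

typedef struct {
    dt_type_id    id;          /* e.g. DT_FLOAT32, DT_INT16, ... */
    unsigned long frame_size;  /* bytes per frame, for raw copying */
} dt_descriptor;

/* Host side: two ports can be connected if the IDs match; the host
 * never has to understand the data itself - just move it. */
static int dt_can_connect(const dt_descriptor *out, const dt_descriptor *in)
{
    return out->id == in->id;
}

Your version would basically be the same thing with a string in
place of the ID.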

> > I'm *not* going to include low level binary format info in the data
> > type descriptors, as it's simply not worth the effort. I can't see
> > plugins adding that many new data types that hosts will ever need to
> > worry about translating, so it's not worth the incredible added
> > complexity.
>
> Agreed: eventual translation is up to translation plugins (if any)
> (all nicely demonstrated in my dynamic-type-LADSPA experiments).

Yep. (Except for the very complex formats that cannot be translated
anyway - but that will be a rare special case, rather than what
datatypes are commonly used for, I hope.)

> > This ain't easy, I tell ya'... :-) Anyway, I'm open to further
> > discussion. I have a design that I think is quite sensible more or
> > less designed, and 50% coded, but I'm open to ideas. (Of course, the
> > point with the first release is to take out the chainsaws and axes
> > and bash away - this is far too complex to be solved at the first
> > attempt at an implementation.)
>
> Perhaps me and David should keep iterating (discussing privately)
> until we have some design to show up ? (at least in form of a diagram, detailed
> description or working code).
>
> Or should we keep this public ?
> But I warn you that this could involve months of discussion,
> (perhaps a bit boring for some of you) and N models thrown away and
> rebuilt from scratch.
>
> This discussion is somewhat related to LADSPA but as said the goal of the
> API is not meant as a replacement of existing plugin APIs.
> It is a sort of managing/routing API.

Yep; my versions are primarily intended to be used as a part of
MuCoS, but there's nothing preventing the final solution from being
a generic Simple Datatype Matching System, strapped onto anything
anything that needs to deal with this kind of stuff.

So far, it's basically about a library + conversion plugin library,
but the conversion library could actually be a bunch of very
simplified "plugins" that just convert N frames of datatype A into
datatype B - no plugin API needed for that; even LADSPA is overkill.

Advantage: These could easily be used by any host or communication
subsystem, regardless of native plugin API and stuff. You could hack
small programs that communicate with LADSPA, MuCoS, VST, ... hosts
and other rather complicated beasts, while not having to use any of
those APIs - understanding the datatype descriptors and using the
converter library is all that's required.
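
To be concrete (again a purely hypothetical sketch; the names are
invented), a "converter" could be as simple as this - no plugin API
at all:

/* Hypothetical converter entry point: translate 'frames' frames of
 * datatype A into datatype B, nothing more. */
typedef void (*dt_convert)(const void *src, void *dst,
                           unsigned long frames);

/* Example: 16 bit integer -> 32 bit float */
static void conv_int16_to_float32(const void *src, void *dst,
                                  unsigned long frames)
{
    const short *in = (const short *)src;
    float *out = (float *)dst;
    unsigned long i;

    for (i = 0; i < frames; i++)
        out[i] = (float)in[i] * (1.0f / 32768.0f);
}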

> [ ... SMP ... ]
>
>
> > I'm not talking about splitting the net in partial chains that
> > are connected end-to-end, but running the same net on all CPUs,
> > interleaved.
>
> I disagree about the efficiency of this model:
> ok, it has the advantage of keeping all CPUs busy, but the
> performance hit may be big (cache etc) , plus as you mentioned, clusters
> are almost impossible to run using this model.
>
> I prefer this method of load balancing:
>
> you have M plugins and distribute them among N CPUs.
>
> M is usually much higher than N.
> You will say that there are plugs which are very time consuming,
> while others are very lightweight.
> But this issue can be solved quite easily.
> Give each plugin a CPU usage index.
> (similar to the Mhz numbers Steve posted for his plugs)
> That way we can distribute the load among CPUs quite optimally.

The problem is that distributing plugins that way results in
inter-thread dependencies.

Ex: Two plugins per CPU; two CPUs; 4 plugins total:

 Net 1: input -> A -> B -> C -> D -> output

 Net 2: input ---> A ---> output
               |-> B -|
               |-> C -|
               `-> D -´

Serial model:
 Net 1: CPU1: A1A1A1A1B1B1B1B1A2A2A2A2B2B2B2B2
        CPU2: ----------------C1C1C1C1D1D1D1D1
    Fragment: |------------------------------|
                              ^
                          Critical; CPU 2 cannot run before CPU 1 is
                          done with plugin B.

        Note that the total latency needs to be one fragment higher
        than on a single CPU system running this net, WITHOUT
        considering the sync point. If the scheduling latency can peak
        at one fragment, that means another fragment is required.

 Net 2: CPU1: A1A1A1A1B1B1B1B1A2A2A2A2B2B2B2B2
        CPU2: C1C1C1C1D1D1D1D1C2C2C2C2D2D2D2D2
    Fragment: |------------------------------|
                              ^
                   Note that the outputs of the plugins cannot use
                   the same buffer and run_adding(), as that would
                   result in CPU collisions and corrupt data. Extra
                   buffers + a mixer plugin must be added. And the
                   mixer plugin has to wait for all inputs to be
                   available...

        The plugins A-D are not chained, and can thus run in
        parallel. However, the mixer plugin needs the output from all
        plugins, which means that a CPU synchronization is required
        before the mixer can run. That is, the total latency will
        increase by peak_scheduling_latency + mixer execution time.

Parallel model:
 Net 1: CPU1: A1A1A1A1B1B1B1B1C1C1C1C1D1D1D1D1A3A3A3A3B3B3B3B3
        CPU2: ----------------A2A2A2A2B2B2B2B2C2C2C2C2D2D2D2D2
    Fragment: |------------------------------|----------------
                              ^
                   This is the most critical timing point; plugin A
                   must not run on both CPUs at once. However, as
                   long as there are more plugins than CPUs, there
                   will be a non-zero margin here; ie we can use per
                   plugin spinlocks, which will never spin unless
                   one CPU is delayed for more than the margin time.

        As to latency, note that it takes only *one* fragment after
        all plugins have processed buffer 1, until all plugins have
        processed buffer 2. That is, to get a buffer, you only need
        *one* CPU to have finished one cycle; not all CPUs. On a dual
        CPU system, the signal alternates between the two CPUs, so
        that every other buffer is processed on CPU 1, and the other
        buffers on CPU 2.

 Net 2: CPU1: A1A1A1A1B1B1B1B1C1C1C1C1D1D1D1D1A3A3A3A3B3B3B3B3
        CPU2: ----------------A2A2A2A2B2B2B2B2C2C2C2C2D2D2D2D2
    Fragment: |------------------------------|----------------

        This is identical to Net 1; the parallel model isn't affected
        by the structure of the net, which is very handy... :-)
        (This is not true for feed-back loops, however!)

Also note that there's one more critical sync point for both models,
passing buffers to the audio output.

In the serial model, the output will consist of data originating
from all CPUs, and this data must first be available, and then mixed
down by the thread that communicates with the audio card.

In the parallel model, the CPUs deliver *complete* buffers (since all
CPUs run all plugins), so the output code only needs to sync with one
CPU at a time. As no further processing is required, and as there
are no per-CPU special cases (they all run the same net), the audio
output code can be shared, so that all CPUs use the same code - just
add a spin-lock to ensure that only one CPU at a time is in that
area, in case someone should hit a latency peak. In other words,
when a CPU is done with one cycle, it just writes the resulting
output to the audio interface, and then leaves it for the next CPU
to use - which will happen approximately one buffer's duration later.
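
In code, the inner loop of each CPU could look roughly like this -
just a sketch under a pile of assumptions: plugin_run(),
audio_write() and the plugin/lock arrays are made up, and I'm using
POSIX spinlocks purely for illustration:

#include <pthread.h>

#define NUM_PLUGINS 16

/* Hypothetical plugin type and entry points - not a real API. */
struct plugin;
extern void plugin_run(struct plugin *p, float *buf, unsigned long frames);
extern void audio_write(const float *buf, unsigned long frames);

extern struct plugin *plugins[NUM_PLUGINS];
extern pthread_spinlock_t plugin_lock[NUM_PLUGINS]; /* one per plugin */
extern pthread_spinlock_t output_lock;              /* audio output */

/* One cycle of the parallel model: every CPU runs the *whole* net,
 * one buffer behind the previous CPU.  With more plugins than CPUs
 * there is a non-zero margin, so these locks should normally never
 * actually spin. */
static void worker_cycle(float *buf, unsigned long frames)
{
    int i;

    for (i = 0; i < NUM_PLUGINS; i++) {
        pthread_spin_lock(&plugin_lock[i]);   /* keep the CPUs apart */
        plugin_run(plugins[i], buf, frames);
        pthread_spin_unlock(&plugin_lock[i]);
    }

    /* Each CPU delivers a *complete* buffer - just hand it over,
     * one CPU at a time. */
    pthread_spin_lock(&output_lock);
    audio_write(buf, frames);
    pthread_spin_unlock(&output_lock);
}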

> The fun starts when there are multiple applications running on a SMP
> box: each "application" (= a shared lib), would have to distribute their
> private plugs / processing algorithms in a way that the global CPU load
> is distributed evenly.

Yep.

> It's not an impossible task, but it requires careful design.
> (in practice we need a call where the app asks
> app: "hey soundserver: I want to add a plug, on which CPU should it run on ?"
> soundserver "hmm .. let me look: this plus uses X Mhz and CPU 3 is only lightly
> loaded, so use CPU 3"

If the application doesn't host plugins by itself at all, there's no
problem, as the soundserver would just figure out automatically and
transparently where to put the plugin. That's the *real* advantage of
a higher level API. :-)
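
(To illustrate the "figure out automatically" part - a trivial,
hypothetical placement rule along the lines of your usage index
idea; the soundserver just keeps a running load figure per CPU:)

/* Hypothetical: each plugin carries a "CPU usage index" (cf. Steve's
 * MHz figures); the soundserver puts a new plugin on the CPU with
 * the lowest accumulated index. */
static int pick_cpu(const unsigned long load[], int num_cpus)
{
    int cpu, best = 0;

    for (cpu = 1; cpu < num_cpus; cpu++)
        if (load[cpu] < load[best])
            best = cpu;
    return best;
}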

> This is a simplification, since I'm still unsure whether to use a two level model, or
> place plugins and apps in the same pot:
>
> On the other hand using the single level model would have the advantage
> that applications do not need to implement a LADSPA host over and over again.

Exactly; and that's probably a good idea if we're going to mess with
SMP implementations... (It's not trivial to just strap SMP on, even
if some outside "manager" tells you which CPU to run every plugin
on.)

> eg: Ardour needs to run a tree of plugins on some of its tracks:
> it could build the plugin-net using functions supplied by the server and then
> with a single callback, the server would process all the data.
> That way it would become much easier to distribute the load.

Yep.

> What I want to avoid is that apps like Ardour would be forced to use a totally
> different model in order to use the "virtual studio" API.

Yeah, that's the hard part...

OTOH, how different can properly working RT plugin hosts be? How
different would you *want* them to be, if they're supposed to run
simultaneously, and still function properly?

What I'm saying is; a properly implemented host with a nice API
should be able to do anything an application could do with its own
host, at least as long as the actual work has to be done in a single
thread, as is the case with low latency applications.

> But yes, I begin liking the idea of this "delegate the processing of LADSPA
> chains to the server" model.

Yeah; you're not *allowed* to mess with the nets or plugins in non
thread-safe ways anyway, so why would you want to run a host locally
in the application in the first place?

The plugin API is the interface through which you communicate with
plugins - if that doesn't cut it, fix the API! Nothing but the
simplest setups will ever work otherwise, and the totally integrated,
low latency real time studio will remain a dream... (Well, unless
we're all going to use the same single, monolithic, universal
application.)

> One of my concerns is that besides LADSPA, DAW applications may have to
> perform other CPU intensive stuff (which cannot be covered by LADSPA, and
> perhaps needs additional datatypes) and here we will need an additional API.
> (MAIA etc etc)

Can't be helped - we either figure out what that API should look
like, or we resort to non-cooperative applications to do that kind of
stuff.

> But this will not hinder anyone, since all stuff that can be implemented in
> LADSPA can be done here, without the need to learn a new API and/or to
> change your application in order to support it)

Yes, I think so too. This is actually a more generic model than the
cycle based paradigm that LADSPA is built around, so it should
actually be *more* flexible than binding your application to any
specific plugin API meant for running RT networks locally.

> ( ... clusters ... )
>
> > nodes, non-constant data flying around! Also note that some plugins
> > will have huge problems running in this kind of environment...)
>
> Exactly, therefore I propose to avoid this model, since the advantages are
> only marginal (if at all).

See above; latency and not having to split networks at all.

Anyway, this model assumes that plugin state data can be moved around
just like audio data, and for plugin APIs without explicit support
for that, it basically means it's limited to machines with hardware
level shared memory networks - or SMP machines.

> We are still assuming number of plugins much greater than number of
> processing units (CPUs or networked machines)

That's usually the case, but is it usually the case that the total
amount of state data is bigger than the amount of audio data
transported during one cycle...?

> > Well, we are not talking sub 5 ms for a cluster for yet a while, but
> > it should beat the current MacOS and Windows software solutions. (And
> > eat them alive when it comes to processing power of course, but
> > that's of little value if the latency is unworkable.)
>
> But I am pretty confident that 10msec clusters are practical on 100Mbit
> LANs (switched and dedicated to the audio cluster, without 1000 people running
> remote X11 sessions in the background).
> And that should cause some ohhs and ahhs among Win/Mac folks.

Yeah, I think so too.

And if you really need that kind of power for real time processing,
you're either going to pay $$,$$$+ for a monster SMP machine, or
you're going to have to accept some extra latency. Considering the
latency figures we're talking about here, you'd have to be *very*
serious about RT response on *everything* to go the SMP way... (You'd
be at least the kind of guy who uses several GB of RAM instead of HDs
for recording, just to avoid the seek delays.)

> > I'm more into this "a plugin can be a net" thing... How about the
> > soundserver being the top-level host, while *all* applications
> > actually are plugins? Any points in their nets that the application
> > would like to connect to other applications, would be published in
> > the form of ports on that application's plugin interface. The
> > soundserver then connects ports just like any other plugin host.
>
> Yes this sounds interesting, but as said we will need an API which
> lets the application build the plugin-net it desires, which is then
> effectively run by the soundserver (so that it can distribute the load among
> CPUs or even among nodes of a cluster (you may think that a cluster is much
> more complex than a SMP box, but as long as you minimize intercommunication
> (which is wanted anyway, even on shared-mem architectures), the handling is
> quite similar).

Yeah, it's actually just a matter of the timing of CPU<->CPU vs.
node<->node communication. For this kind of application, we're not
directly depending on the latency of every single transaction, so it
works better than for the scientific computations that this kind of
system was designed for. (And failed to be all that good at.)

> And of course the application can publish the "connectors" it desires.
> (internal input/outputs or the ins and out of "private" plugins)
>
> for example a simple HD recording app would look as follows:
> (assume the soundserver calls this callback once per fragment)
>
> hdr_callback() {
>
> - fetch_soundcard_input_buffers() ( basically it consists of reading buffer
> where the soundserver placed the current fragment(s) (for each input
> channel))
>
> - fetch_input_buffers_from_disk_thread() (this is needed if we need to process
>
> - execute_LADSPA_process_chain()
>
> - send resulting data to soundcard outputs & disk thread (for tracks which
> are being recorded)
>
> }
>
> I'm not sure if my model misses something, but I'd like to have the confirmation
> that it can accommodate stuff like Ardour's punch-in/out stuff.
> (Paul described how it works some time ago, and it is far from trivial, since
> you need to buffer stuff and overwrite tracks on the right place, in order
> to compensate the disk buffering issues).

There is one problem: What if the hdr_callback() has both inputs and
outputs connected to outputs/inputs of another such module? Say;

        input -> Chain A -> HDR -> Chain B -> output

where Chain A and Chain B are running under the same host. That
would mean that there is no correct execution order for the plugin
host and the HDR, and thus either Chain A or Chain B will have
another buffer of latency.

(Of course, if the two chains depended on each other, we would
have a feed-back loop, and this couldn't be avoided anyway.)

I can see two solutions for that:

1) Just accept that multiple independent public chains inside a
   single host is a very stupid thing to do, and tell people not to
   do it.

2) Drop the "net-in-a-plugin" structure on the engine level, so that
   the engine can optimize the entire net, regardless of where
   individual plugins actually belong.

Both solutions have their advantages and disadvantages;

1) + Simple
        + Easy to understand
        + Easy to add support for new plugin APIs
        - Doesn't let the soundserver clean up the nets for stupid
          applications (or users!)

2) + Global optimizations possible
        + Short feedback loops across application boundaries possible
        - Complex
        - The soundserver can only do this for plugin APIs that it
          knows natively

1) is the obvious choice for a multi-API system, while 2) gets more
interesting if we're mainly going to deal with one or two plugin
APIs.

Of course, it's still possible to wrap *single* plugins and then
optimize the nets as usual (which partly invalidates the API related
points), but that's not as efficient as wrapping entire chains.

> Anyway if we can develop such a model where processing / mixing / routing
> can be delegated to the virtual studio API, then
> HDR apps become nothing more than a disk streaming engine, a GUI
> and a bunch of plugins which do the processing.
> (and as said the mixer plugin can be reused over and over again -
> long live code reuse!).

A few ideas:

        * Applications need to be able to build and analyze nets.

        * App/host shared memory is needed for efficient streaming.

        * A buffered streaming protocol is required, so that external
          applications won't stall the RT engine. (Useful for games,
          HDR, mp3 players etc...)

        * The above protocol should do something sensible if it sees
          an underrun - it's not allowed to crash, play crap data,
          stall or anything like that!

        * Applications must be able to control plugins using
          buffered, sample accurate events. (Sequencers, automation
          etc, where you don't want the sequencer engine as a plugin
          for some reason.)

        * The above event system should have an "ASAP" mode, where
          events are delivered at the beginning of the next cycle
          after they arrive. (This is how you'd trigger sound effects
          in a game and that kind of stuff. Works for simple
          softsynths as well, in particular when playing on the PC
          keyboard.) One possible event layout is sketched below.
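
Here's the kind of event record I have in mind - purely a sketch,
with made-up field names:

/* Hypothetical event layout - buffered, sample accurate delivery,
 * plus an "ASAP" flag meaning "at the start of the next cycle". */
enum { EV_COMMAND, EV_CHANGE };   /* see the command/change split below */

typedef struct event {
    unsigned long when;     /* timestamp in sample frames */
    int           asap;     /* nonzero: ignore 'when', deliver ASAP */
    int           type;     /* EV_COMMAND or EV_CHANGE */
    int           target;   /* control index / operation selector */
    float         value;    /* new value or argument */
    struct event  *next;    /* events are queued and delivered in
                             * batches across the RT boundary */
} event;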

> And as I mentioned there are ZERO context switches in this model,
> making its scalability and latency excellent.
>
> PS: like LADSPA nets, there will be event-nets, which will be handled
> in a similar way.
> And of course there will be event-processing plugins, which will allow cool
> things like connecting the MIDI sequencer to a remapping plugin which
> transforms pitchbend messages into NRPNs to drive the LP filter of a
> standalone synth. (the possibilities are endless)
>
> waiting for more food for thought ....

Ok, here's some, regarding events; :-)

        events == (commands + changes)

I have come to the conclusion that events must be split into two
classes with distinct semantics. Commands basically
say "Do A at time T!", while changes say "Change X to A at time T!"

The fundamental difference is that you can specify additional rules
for how to manipulate, read and write parameters using changes,
whereas commands just trigger "operations". The problem is that these
operations cannot be strictly defined in the API without making them
useless, so what they actually do, and how, is to remain the secret
of the plugins that implement them.

MIDI parallel:

        * Pitchbend and CCs are "changes".

        * Note-On and Note-Off are commands.

Examples of differences:

        * You can read a CC back, but you can't "read" a command in
          any sensible way. (You could check if a certain command is
          still affecting the output for example, but that's not the
          same thing.)

        * You can't use alternative ways of issuing commands, such as
          "Ramp from A to B!", as there's no clear definition of what
           a plugin in general should do with such requests.

        * You *can* assume that a change event results in some
          variable being set to a certain value. As a result, it
          makes sense to support reading and writing these variables
          to implement a form of plugin preset management.
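
On the plugin side, the split could boil down to something like this
(hypothetical code, building on the event sketch above; struct plugin
and plugin_do_command() are just stand-ins):

/* Hypothetical plugin instance - just enough to show the idea. */
struct plugin {
    float controls[32];   /* readable/writable "change" targets */
};

extern void plugin_do_command(struct plugin *p, int opcode, float arg);

/* A "change" sets a control variable that the host can read back
 * (which is what makes preset management possible); a "command"
 * just triggers a plugin-private operation, with no readable state
 * of its own. */
static void handle_event(struct plugin *p, const event *ev)
{
    switch (ev->type) {
    case EV_CHANGE:
        p->controls[ev->target] = ev->value;
        break;
    case EV_COMMAND:
        plugin_do_command(p, ev->target, ev->value);
        break;
    }
}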

//David

.- M u C o S -------------------------. .- David Olofson --------.
| A Free/Open Source                  | | Audio Hacker           |
| Plugin and Integration Standard     | | Linux Advocate         |
| for                                 | | Open Source Advocate   |
| Professional and Consumer           | | Singer                 |
| Multimedia                          | | Songwriter             |
`-----> http://www.linuxdj.com/mucos -' `---> david_AT_linuxdj.com -'


