Re: [LAU] Open Source Audio Interface

From: Len Ovens <len@email-addr-hidden>
Date: Wed Sep 03 2014 - 00:16:26 EEST

On Mon, 1 Sep 2014, Len Ovens wrote:

> On Tue, 2 Sep 2014, Kazakore wrote:
>
>>> Madi can be sub-ms RT (return trip)
>>
>> Really? Coax, optical or both? I used MADI routers in my work but that was
>> more about sharing multi-channel audio across sites miles apart than
>> low-latency monitoring... (I also have to sadly admit the MADI system is
>> one of the ones I knew the least about by the time I left that job :( )
>
> It depends on distance, but it has almost no overhead compared to most other
> formats (besides analog and aes3). It depends on the dsp time used to split
> it up and route.

After reading some more about Ethernet, it becomes easier to see why MADI
can have much lower latency than any Ethernet transport.

MADI uses a network physical-layer standard, but it does not use some of
the other parts. The MADI tx/rx buffer contains one whole MADI frame at a
time, and the frame itself carries no extra data beyond the AES3 payload,
so each channel is 32 bits long. There is no routing information or error
correction to calculate beyond the AES3 parity bit; MADI is a physical
point-to-point protocol. MADI does not rely on OS drivers to deal with any
of this, but rather uses its own hardware to process the bit stream as it
enters the card. This means that the audio data from the ADC can reach a
DAC at the other end within the same frame if the channel count is less
than full, or by the next word clock if not. So the latency is effectively
one audio word. When used as a computer interface, the card will of course
store more than one frame, as the audio driver requires. However, this is
software latency and not required by the protocol itself.
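
To put a number on that, here is a back-of-envelope sketch in C, using
the MADI figures as I understand them (64 channels of 32 bits, clocked
out once per 48 kHz word):

#include <stdio.h>

/* Back-of-envelope MADI timing at 48 kHz, 64 channels.  Each channel
 * subframe is 32 bits (AES3-size payload), and the whole frame is
 * clocked out once per sample period, so worst-case transport latency
 * is about one audio word. */
int main(void)
{
    const double fs = 48000.0;          /* sample rate in Hz */
    const int channels = 64;            /* a full MADI frame */
    const int bits_per_channel = 32;    /* AES3-size subframe */

    double word_period_us = 1e6 / fs;
    double payload_mbps = fs * channels * bits_per_channel / 1e6;

    printf("word period:  %.2f us\n", word_period_us);   /* ~20.83 us */
    printf("payload rate: %.3f Mbit/s\n", payload_mbps); /* ~98.3 */
    return 0;
}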

Ethernet, on the other hand, uses hardware that we do not control. There
needs to be an OS driver to make it work. Because of the variety of
hardware and drivers, any audio protocol has to sit at least at layer 2.
This adds routing information, and restricts data to 1500 bytes (46 audio
channels) per packet. That in itself is not such a big deal and would only
add one word of latency if the whole thing were done in hardware. However,
it is dealt with by the OS at both ends, which has other things it needs
to do, so the latency of the OS affects this too. This includes latency
going through switches (and their OS) as well as scheduling around other
network traffic. The possibility of collisions also exists, so dealing
with audio in chunks of words makes sense. What this means in practice is
that, as a computer interface, there may be no difference between MADI and
a layer-2 Ethernet transport.
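
For a feel of the wire timing, a small sketch (my own arithmetic,
assuming standard Ethernet framing overhead) comparing a full-size
frame's time on the wire to one 48 kHz word period:

#include <stdio.h>

/* How long does one full-size Ethernet frame occupy the wire, compared
 * to one 48 kHz word period?  Framing overhead (preamble 8, header 14,
 * FCS 4, inter-frame gap 12 bytes) is standard. */
int main(void)
{
    const double word_us = 1e6 / 48000.0;           /* ~20.8 us */
    const int wire_bytes = 8 + 14 + 1500 + 4 + 12;  /* 1538 bytes, full frame */

    double t100  = wire_bytes * 8 / 100.0;   /* us on a 100 Mbit/s link */
    double t1000 = wire_bytes * 8 / 1000.0;  /* us on a 1000 Mbit/s link */

    printf("48 kHz word period: %.2f us\n", word_us);
    printf("full frame at  100M: %.1f us (~%.1f words)\n", t100, t100 / word_us);
    printf("full frame at 1000M: %.1f us (~%.2f words)\n", t1000, t1000 / word_us);
    return 0;
}

At 100M a full frame occupies roughly six word periods, which is part of
why chunking makes sense; at 1000M it fits inside a single word.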

Layer 3 (IP-based) transport adds one more level of software. It expects
to deal with other network traffic too. It has another OS-controlled layer
of software that checks packet order and may delay a packet if it thinks
one has arrived out of order. It assumes data integrity is more important
than latency. Convenience is a factor as well, because domain names can be
used to set the link up without any other user input. This again increases
latency. Latency can be tuned, but that takes user action. Netjack does
this very successfully, but to work well, other traffic needs to be
minimized.

In order to use Ethernet hardware in any standard fashion, layer 2 is the
minimum the audio protocol can run at. This means the protocol needs to
know the MAC address of the other end point. While it would be possible
for the user to enter this info, that would make the protocol hard to use
and should be avoided. A method of discovering other audio end points
would be better. The setup of the system does not have to take place at
layer 2 but could be done higher up.
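
A minimal sketch of what discovery could look like, assuming a raw
packet socket on Linux, the IEEE "local experimental" EtherType 0x88B5,
and an invented "hello" opcode (needs root, and eth0 is assumed):

#include <arpa/inet.h>
#include <linux/if_packet.h>
#include <net/ethernet.h>
#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical discovery broadcast: shout a "who is out there?" frame
 * so audio endpoints can learn each other's MACs without the user
 * typing them in.  Everything beyond the socket API here is sketch. */
#define AUDIO_DISCOVERY_ETHERTYPE 0x88B5

int main(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(AUDIO_DISCOVERY_ETHERTYPE));
    if (fd < 0) { perror("socket"); return 1; }

    struct ifreq ifr;
    memset(&ifr, 0, sizeof ifr);
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);   /* assumed interface */
    if (ioctl(fd, SIOCGIFHWADDR, &ifr) < 0) { perror("ioctl"); return 1; }

    unsigned char frame[ETH_ZLEN];                 /* 60-byte minimum frame */
    memset(frame, 0, sizeof frame);
    memset(frame, 0xff, 6);                        /* broadcast destination */
    memcpy(frame + 6, ifr.ifr_hwaddr.sa_data, 6);  /* our source MAC */
    frame[12] = AUDIO_DISCOVERY_ETHERTYPE >> 8;
    frame[13] = AUDIO_DISCOVERY_ETHERTYPE & 0xff;
    frame[14] = 0x01;                              /* "hello" opcode, invented */

    struct sockaddr_ll addr;
    memset(&addr, 0, sizeof addr);
    addr.sll_family   = AF_PACKET;
    addr.sll_protocol = htons(AUDIO_DISCOVERY_ETHERTYPE);
    addr.sll_ifindex  = if_nametoindex("eth0");
    addr.sll_halen    = 6;
    memset(addr.sll_addr, 0xff, 6);

    if (sendto(fd, frame, sizeof frame, 0,
               (struct sockaddr *)&addr, sizeof addr) < 0)
        perror("sendto");
    close(fd);
    return 0;
}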

Capabilities of different physical Ethernet interfaces would need to be
addressed. The protocol would need to know the speed of the slowest link
if switches are involved. Any new installation will be using 1000M or
higher, but the minimum would be a 100M link. A 10M link would not work
for per-word packets, because the minimum packet size for Ethernet is 84
bytes (with guard space), which limits the packet rate, and with it the
word-synced sample rate, to about 14k. By using a separate protocol it
would be possible to use a 10M link at a higher latency (4 audio frames at
a time). I suppose that, considering ALSA (or JACK, anyway) seems to deal
in 16*2 words at a time anyway, the whole protocol could work this way;
not so much to support 10M as to allow direct tunneling of IP traffic
without splitting up packets.
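
Running those numbers (my arithmetic, using the 84-byte minimum frame):

#include <stdio.h>

/* Why a 10M link cannot carry one packet per audio word: the minimum
 * Ethernet frame is 84 bytes on the wire (with guard space), which
 * caps the packet rate, and with it the word-synced sample rate. */
int main(void)
{
    const double link = 10e6;             /* 10 Mbit/s */
    const int min_frame_bits = 84 * 8;    /* 672 bits per minimum frame */

    double max_pps = link / min_frame_bits;
    printf("max packet rate at 10M: %.0f/s (~14k)\n", max_pps);
    printf("words per packet needed for 48 kHz: %.1f (round up to 4)\n",
           48000.0 / max_pps);
    return 0;
}

That gives about 14,880 packets per second, hence the ~14k figure, and
about 4 words per packet to reach 48k.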

My thought is something like this:
We control all network traffic. Let's try for 4 words of audio. For sync
purposes, at each word boundary a short audio packet of 10 channels is
sent. This would be close to the minimum Ethernet packet size. Then there
should be room for one full-size Ethernet packet; in fact, even at 100M
the small sync packet could contain more than 10 channels. (I have
basically said 10M would not be supported, but if no network traffic were
carried, then 10M could do 3 or 4 channels with no word sync.) So:
Word 1 - audio sync plus 10 tracks - one full network traffic packet
Word 2 - audio sync plus 10 tracks - one full audio packet, 40 tracks,
                                         split between words 1 and 2
Word 3 - audio sync plus 10 tracks - one full audio packet, 40 tracks,
                                         split between words 2 and 3
Word 4 - audio sync plus 10 tracks - one full audio packet, 40 tracks,
                                         split between words 3 and 4
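
To sanity-check the budget, a rough calculator (assumptions mine: 48 kHz
words, an 84-byte minimum frame for the sync packet, 38 bytes of framing
overhead per packet, and a hypothetical 40-track audio packet carrying 4
words per track):

#include <stdio.h>

/* Rough per-word bit budget for the scheme above.  Assumptions are
 * mine: 48 kHz words, an 84-byte minimum frame for the sync packet,
 * 38 bytes of framing overhead per packet, and a hypothetical 40-track
 * audio packet carrying 4 words per track (40 * 4 * 4 bytes). */
int main(void)
{
    const double fs = 48000.0;
    const int sync_bits = 84 * 8;                   /* minimum frame on the wire */
    const int audio_bits = (40 * 4 * 4 + 38) * 8;   /* one full audio packet */

    const double speeds[] = { 100e6, 1000e6 };
    for (int i = 0; i < 2; i++)
        printf("%5.0fM: %6.0f bits per word; sync %d; audio packet %d "
               "(split over 2 words)\n",
               speeds[i] / 1e6, speeds[i] / fs, sync_bits, audio_bits);
    return 0;
}

On 100M the per-word budget is tight once the sync packet is out; on
1000M there is room to spare.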

This would allow network traffic at ~20M and 40 tracks with 4 words of
latency. 1000M would allow much better network performance and more
channels. I don't know how this would affect CPU use. I haven't mentioned
MIDI or other control, but there is space, time-wise, to add it to the
audio sync packet. As this is an open spec, I would probably use MIDI or
OSC as the control for levels and routing. I have yet to run the numbers,
but the ten tracks with sync is not maxed out; it may go as high as 15 or
16 while still leaving one full network packet at each word on 4x the
network speed. The thing is, this could be controlled on the fly: the user
could choose how many channels they wish to use. The ice1712 gave 12/10
i/o on the d44 and d66 as well as the d1010, but in this case that would
not happen; only the channels needed would show, and the user could choose
to make physical inputs 7/8 look like ALSA 1/2 very easily.
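
The remapping itself is trivial. A sketch, with invented names, of the
kind of channel map that would let physical 7/8 show up as 1/2:

#include <stdio.h>

/* Sketch of a user-facing channel map: expose only the channels in
 * use and let physical inputs 7/8 appear as channels 1/2.  All names
 * here are invented for illustration. */
#define MAX_CH 64

static int chan_map[MAX_CH];  /* user channel -> physical channel */
static int n_user;            /* number of channels exposed */

static void map_channel(int user, int physical)
{
    chan_map[user] = physical;
    if (user + 1 > n_user)
        n_user = user + 1;
}

static void remap(const int *phys_frame, int *user_frame)
{
    for (int u = 0; u < n_user; u++)
        user_frame[u] = phys_frame[chan_map[u]];
}

int main(void)
{
    map_channel(0, 6);   /* physical input 7 -> user channel 1 */
    map_channel(1, 7);   /* physical input 8 -> user channel 2 */

    int phys[MAX_CH] = {0};
    phys[6] = 111;       /* pretend samples */
    phys[7] = 222;

    int user[MAX_CH];
    remap(phys, user);
    printf("user 1 = %d, user 2 = %d\n", user[0], user[1]);
    return 0;
}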

The driver could store more than one network traffic packet at a time,
and if the queued packets are smaller than the full 1500-byte size, it
could send two in the same window when they are short enough.
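
The decision is a simple size check, something like this (the framing
overhead is standard, the budget figure is an invented example):

#include <stdio.h>

/* Sketch: decide whether two queued tunnelled packets both fit in the
 * current word's forwarding window.  The 38-byte framing overhead is
 * standard; the budget figure is an invented example. */
static int both_fit(int len_a, int len_b, int budget_bytes)
{
    const int overhead = 38;  /* preamble + header + FCS + gap */
    return (len_a + overhead) + (len_b + overhead) <= budget_bytes;
}

int main(void)
{
    int budget = 1538;        /* the window for one full-size frame */
    printf("two 600-byte packets fit: %s\n",
           both_fit(600, 600, budget) ? "yes" : "no");
    return 0;
}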

In this whole exercise, I am trading throughput to gain lower latency and
(more) predictable timing. Because of hardware differences, and the fact
that the actual hardware is serviced outside our control, I don't think
the interface could be used as a sync source. As is the case now, two
boxes would require external sync to truly be in sync. Daisy-chained boxes
could be close enough without external sync to not need resampling, but
not close enough to deal with two mics on the same audio.

Power over the Cat5 cable should not be required IMO, but it may make
sense to specify it anyway, so that if someone does add it, it is
interchangeable. :P

This is not meant to replace things like netjack, which encapsulates
audio/MIDI/transport all in one, or remote content protocols. This is a
local solution meant generally for one room or hall. If other uses are
found, that is a plus.

Does this make any sense? All calculations were done for 24-bit/48k
audio. As with ADAT, AES3 and MADI, channels could be paired if 96k is
required... though I think it is flexible enough on a 1000M (even 100M)
line that higher rates could be sent natively.

Enough rambling.

--
Len Ovens
www.ovenwerks.net