Subject: Re: [linux-audio-dev] hdr disk throughput
From: David Olofson (david_AT_gardena.net)
Date: Fri Mar 17 2000 - 21:09:17 EST
On Thu, 16 Mar 2000, Benno Senoner wrote:
> On Thu, 16 Mar 2000, Paul Barton-Davis wrote:
> >
> > Using Andrew Clausen's test case, it looks to me as though a regular,
> > moderately fragmented ext2 filesystem on an Ultra 2 disk with nominal
> > 5.2ms seek time can support a lower bound of 14MB/sec throughput from
> > the disk. That's enough to handle about 38 32 bit/48kHz tracks with
> > simultaneous read/write. On the other hand, it can't support even 24
> > 32/96kHz tracks.
> >
> > Why 32 bits ? Because the data from most 24bit cards comes in 32 bit
> > chunks, and unless you want your CPU to burn cycles converting 3 byte
> > values into 4 bytes ones and back again, I suggest you just burn some
> > more disk space, and be glad that your HDR program is ready for 32 bit
> > data when it really shows up ;)
> >
>
> Wasting 25% of the total bandwidth by writing 32 bits instead of 24 is a
> lot: instead of 38 32-bit tracks you would get 50 24-bit tracks, which is
> roughly equivalent to your required 24 tracks at 96kHz.
>
> Ok 32 <---> 24 bit conversion burns power but it has to be determined how
> much power this burns, compared to the disk bandwidth saving.
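For reference, the track counts quoted above can be reproduced with a
few lines of integer arithmetic. This is only a sketch: max_tracks() is
a made-up name, and it assumes that "14MB/sec" means 14 * 2^20 bytes per
second and that every track is read and written at the same time, so it
costs twice its raw data rate.

```c
#include <assert.h>

/* Hypothetical helper: how many tracks fit into a given disk throughput
 * when each track is both read and written (2x its data rate).
 * Assumption: throughput is given in plain bytes per second. */
unsigned long long max_tracks(unsigned long long disk_bytes_per_sec,
                              unsigned long long sample_rate_hz,
                              unsigned long long bytes_per_sample)
{
    /* Guard against division by zero. */
    assert(sample_rate_hz > 0 && bytes_per_sample > 0);

    unsigned long long per_track = 2ULL * sample_rate_hz * bytes_per_sample;
    return disk_bytes_per_sec / per_track;
}
```

Under those assumptions, 14 * 2^20 bytes/sec gives 38 tracks at 32
bit/48kHz and 50 tracks at 24 bit/48kHz, but only 19 tracks at 32
bit/96kHz, which matches the figures in the quoted text.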
BTW, there is MMX for that kind of job. The G4 has even better
instructions that would be great for this kind of byte shuffling.
(IIRC, there is one that builds a 128 bit word by picking bytes out
of a 128 bit source, according to a byte-by-byte description in an
argument. I don't know if the G4 executes it in a normal instruction
cycle, but that operation is quite simple to implement in hardware.)
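To make the idea concrete, here is a plain C model of such a byte
permute. It is not the actual G4 instruction or any real intrinsic;
permute16() and PACK24_CTRL are names invented for this sketch, and the
control pattern assumes the padding byte is the last byte of each 4 byte
sample in memory.

```c
#include <stdint.h>

/* Scalar model of a 16 byte permute: result byte i is picked from
 * src[ctrl[i]]. Indices are masked to 0..15. A temporary is used so
 * that dst may alias src. */
void permute16(uint8_t dst[16], const uint8_t src[16],
               const uint8_t ctrl[16])
{
    uint8_t tmp[16];
    int i;

    for (i = 0; i < 16; i++)
        tmp[i] = src[ctrl[i] & 15];
    for (i = 0; i < 16; i++)
        dst[i] = tmp[i];
}

/* Control pattern that squeezes four 4 byte samples into 12 bytes by
 * skipping bytes 3, 7, 11 and 15. The last four result bytes are
 * "don't care" filler. */
const uint8_t PACK24_CTRL[16] = {
    0, 1, 2,   4, 5, 6,   8, 9, 10,   12, 13, 14,   0, 0, 0, 0
};
```

With hardware support, the whole inner loop of a 32 -> 24 bit conversion
would boil down to one such permute per four samples, plus whatever it
takes to merge the 12 byte results into full words.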
> I am not sure if Intel CPUs allow writing 32bit values at unaligned data
> locations (not a multiple of 4), but using this technique,
> the only overhead you get while writing 32bit samples as 24bit values
> to disk is the slower unaligned mem access.
It does, (as does 68020+ - this got me pretty confused when hunting
a bug on my 68030 Amiga...), but you should use just about *any*
method before you do that, especially on Pentium and older CPUs. When
writing less than 64 bits to memory, the CPU has to read back the
memory words that will be affected, modify the right bytes and then
write it back. Write caching is virtually nonexistent on Pentium, and
limited on P-II and Celeron, so you basically hit the memory access
time directly here.
Preferably, use MMX and full 64 bit read/write all the time. And
don't mix 64 bit and 32 bit operations! MMX doesn't handle that
well...
To avoid MMX, use a few variables (CPU regs, actually), and shift the
data around. I'm not sure if the compiler will optimize this kind of
code well, but I can't see why it shouldn't, as long as there are
enough registers to handle the inner loop.
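A rough sketch of what "shifting the data around" could look like in C.
The function name pack24_words() is made up, and the sketch assumes the
24 relevant bits sit in the top of each 32 bit sample and that the
sample count is a multiple of 4. Four input words become three output
words, and every memory access is a full, aligned 32 bit one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack n 32 bit samples (upper 24 bits relevant) into 3*n/4 words.
 * On a little endian CPU the resulting memory image is a stream of
 * 24 bit samples, least significant byte first. */
void pack24_words(uint32_t *out, const uint32_t *in, size_t n)
{
    size_t i;

    assert(n % 4 == 0);            /* assumption of this sketch */

    for (i = 0; i < n; i += 4) {
        /* Drop the low padding byte of each sample. */
        uint32_t s0 = in[i]     >> 8;
        uint32_t s1 = in[i + 1] >> 8;
        uint32_t s2 = in[i + 2] >> 8;
        uint32_t s3 = in[i + 3] >> 8;

        out[0] = s0         | (s1 << 24);  /* s0 + low byte of s1     */
        out[1] = (s1 >> 8)  | (s2 << 16);  /* rest of s1 + low 16 of s2 */
        out[2] = (s2 >> 16) | (s3 << 8);   /* top byte of s2 + s3     */
        out += 3;
    }
}
```

The four temporaries and three results should fit in registers even on
x86, so the loop itself only touches memory for the aligned loads and
stores.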
>
> Let's take an example: assume we have several 32-bit words A, B, C, etc.,
> each organized as 4 bytes (I use big-endian here; I am not sure whether
> little-endian makes this impossible). A3 A2 A1 A0 is the 32-bit word A,
> and A3 A2 A1 (3 bytes) contain the relevant 24-bit data.
> In this case we just write the 32-bit word B at offset 3 instead of offset
> 4, which overwrites A0, the byte that doesn't carry relevant data.
You just have to write backwards on little endian CPUs. It's just
that accessing memory backwards is the worst thing you can do - most
cache controllers handle this poorly, if at all, so you'll probably
lose the read cache as well. Or, your bus will be set on fire with
the CPU sleeping...
> Or alternatively use (treat the input and output buffers as arrays of chars):
>
> outbuf[outpos] = inbuf[inpos];
> outbuf[outpos+1] = inbuf[inpos+1];
> outbuf[outpos+2] = inbuf[inpos+2];
> inpos += 4;
> outpos += 3;
Hmm... You could try compiling that (perhaps with 2 or 3 samples per
loop instead of one) to asm with full optimization, and see what the
compiler makes out of it. :-)
gcc does generate pretty nice code for things like accessing parts
of words and dwords, so you can usually get away nicely with just
*knowing* how to code things in asm, but then write them in C.
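For anyone who wants to try that, here is the quoted byte copy wrapped
into a complete function with two samples per loop. It is only a sketch:
pack24_bytes() is a made-up name, and like the quoted snippet it assumes
the padding byte is the last byte of each 4 byte sample in memory.

```c
#include <stddef.h>

/* Copy the first 3 bytes of every 4 byte sample from inbuf to outbuf,
 * two samples per iteration. outbuf must hold 3 * nsamples bytes. */
void pack24_bytes(unsigned char *outbuf, const unsigned char *inbuf,
                  size_t nsamples)
{
    size_t inpos = 0, outpos = 0, i = 0;

    for (; i + 2 <= nsamples; i += 2) {
        outbuf[outpos]     = inbuf[inpos];
        outbuf[outpos + 1] = inbuf[inpos + 1];
        outbuf[outpos + 2] = inbuf[inpos + 2];
        outbuf[outpos + 3] = inbuf[inpos + 4];
        outbuf[outpos + 4] = inbuf[inpos + 5];
        outbuf[outpos + 5] = inbuf[inpos + 6];
        inpos  += 8;
        outpos += 6;
    }

    if (i < nsamples) {            /* odd count: one sample left over */
        outbuf[outpos]     = inbuf[inpos];
        outbuf[outpos + 1] = inbuf[inpos + 1];
        outbuf[outpos + 2] = inbuf[inpos + 2];
    }
}
```

Saved as, say, pack24.c (hypothetical file name), something like
"gcc -O2 -S pack24.c" leaves the generated assembly in pack24.s for
inspection.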
> (hoping that the cache of modern CPUs speeds up this kind of operations)
P-II/Celeron will, to some extent. Pentium won't help you much.
//David
.- M u C o S --------------------------------. .- David Olofson ------.
| A Free/Open Multimedia | | Audio Hacker |
| Plugin and Integration Standard | | Linux Advocate |
`------------> http://www.linuxdj.com/mucos -' | Open Source Advocate |
.- A u d i a l i t y ------------------------. | Singer |
| Rock Solid Low Latency Signal Processing | | Songwriter |
`---> http://www.angelfire.com/or/audiality -' `-> david_AT_linuxdj.com -'