[linux-audio-dev] Lock-free LIFO used in MidiShare


Subject: [linux-audio-dev] Lock-free LIFO used in MidiShare
From: Yann Orlarey (orlarey_AT_grame.fr)
Date: Fri Jul 21 2000 - 17:06:15 EEST


Introduction
============

Lock-free techniques make it possible to synchronize concurrent access to
shared objects without requiring mutual exclusion. Lock-free objects have
several advantages over critical sections guarded by locks. If a thread
is preempted while updating a lock-free object, it does not prevent the
other threads from operating on it. Lock-free techniques also avoid
problems like priority inversion, convoying and deadlock. Moreover, for
simple data structures, very efficient implementations can be written
that are an order of magnitude faster than lock-based ones.

This document gives a short description of the lock-free LIFO stack (LFLF)
used in MidiShare on Linux (Intel version). These data structures are used
by MidiShare :
- in the memory manager, to maintain the free list of the fixed-size cells
  used for MidiShare events,
- to communicate with the event scheduler,
- and to implement the clients' FIFOs (this is an interim solution; we
  will probably switch to the algorithm of Michael & Scott 96).

1) Data structures used
========================

An LFLF is made of a list of cells linked together. A cell can be
anything, provided it starts with a pointer field used to link the cells
of the stack together.

/*****************************************************************
                           DATA STRUCTURES
 *****************************************************************/

typedef struct cell {
       struct cell* link; /* next cell in the lifo */
       /*...*/ /* any data here */
} cell;

typedef struct lifo {
        volatile cell* top; /* top of the stack */
        volatile unsigned long cnt; /* used to avoid ABA problem */
} lifo;
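For instance, a MIDI-event-like cell could look like the following (the
event fields are purely illustrative, not MidiShare's actual layout; only
the position of the link pointer matters):

```c
#include <stddef.h>

typedef struct cell {
        struct cell* link;    /* next cell in the lifo */
} cell;

/* Illustrative event type: any struct whose FIRST field is the
   link pointer can be pushed on a lifo by casting it to cell*. */
typedef struct event {
        struct event* link;   /* must come first */
        unsigned char status; /* e.g. a MIDI status byte */
        unsigned char data1;
        unsigned char data2;
} event;
```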

2) Trivial, non synchronized, implementation
============================================

We have to define three operations : init(), push() and pop(). The cnt
field is not used here.

void init(lifo* lf)
{
        lf->top = 0;
}

void push (lifo * lf, cell * cl)
{
        cl->link = lf->top;
        lf->top = cl;
}

cell* pop (lifo * lf)
{
        cell* v;

        v = lf->top;
        if (v) lf->top = v->link;

        return v;
}
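A quick single-threaded check of the LIFO discipline (a sketch that just
restates the code above, with an illustrative value field added to the
cell):

```c
#include <stddef.h>

typedef struct cell {
        struct cell* link;  /* next cell in the lifo */
        int value;          /* illustrative payload */
} cell;

typedef struct lifo {
        cell* top;          /* top of the stack */
} lifo;

void init(lifo* lf) { lf->top = 0; }

void push(lifo* lf, cell* cl)
{
        cl->link = lf->top;       /* new cell points to the old top */
        lf->top = cl;             /* and becomes the new top */
}

cell* pop(lifo* lf)
{
        cell* v = lf->top;
        if (v) lf->top = v->link; /* unlink the top cell */
        return v;
}
```

Pushing a then b yields b first on pop, as expected of a stack.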

3) Mutual exclusion implementation
===================================
We guard the push and pop critical sections with a lock :

pthread_mutex_t gSemaphore = PTHREAD_MUTEX_INITIALIZER;

void push (lifo * lf, cell * cl)
{
        pthread_mutex_lock(&gSemaphore);

        cl->link = lf->top;
        lf->top = cl;

        pthread_mutex_unlock(&gSemaphore);
}

cell* pop (lifo * lf)
{
        cell* v;

        pthread_mutex_lock(&gSemaphore);

        v = lf->top;
        if (v) lf->top = v->link;

        pthread_mutex_unlock(&gSemaphore);

        return v;
}

4) CMPXCHG and CMPXCHG8B
========================

The lock-free implementation uses the CMPXCHG and CMPXCHG8B instructions
in the Intel version (and the LWARX and STWCX instructions in the PowerPC
version).

The semantics of CMPXCHG (compare and exchange) is to update a destination
with a new value, provided the destination still contains the old value
that was used to compute the new value.

From the Intel SDM, volume 2, the exact operations performed by CMPXCHG
on 32-bit values are :

IF (EAX = DEST) THEN
        ZF <- 1
        DEST <- SRC
ELSE
        ZF <- 0
        EAX <- DEST
END;

where :
        DEST is the destination to update
        SRC is the source of the new value for DEST
        EAX is the old value of DEST

On an SMP machine, CMPXCHG should be prefixed with "lock;" to lock the
bus, so that the read-modify-write operation on system memory is atomic.
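The behavior described above can be modeled in plain C (a single-threaded
model of the semantics only; the real instruction performs the compare
and the exchange as one atomic step):

```c
/* Models CMPXCHG DEST, SRC for 32-bit values, with EAX passed by
   pointer. Returns 1 (ZF set) if DEST was updated, 0 otherwise;
   on failure the current value of DEST is copied back into *eax. */
int model_cmpxchg(unsigned long* dest, unsigned long src,
                  unsigned long* eax)
{
        if (*eax == *dest) {
                *dest = src;    /* DEST <- SRC */
                return 1;       /* ZF <- 1 */
        }
        *eax = *dest;           /* EAX <- DEST */
        return 0;               /* ZF <- 0 */
}
```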

CMPXCHG8B performs the same operation as CMPXCHG, but on 64-bit values :

IF (EDX:EAX = DEST) THEN
        ZF <- 1
        DEST <- ECX:EBX
ELSE
        ZF <- 0
        EDX:EAX <- DEST
END;

where :
        DEST is the 64-bit destination to update
        ECX:EBX is the source of the new 64-bit value for DEST
        EDX:EAX is the old 64-bit value of DEST

This instruction is used in the pop() operation to avoid the ABA problem
(see below).

5) LFLF implementation
======================

#ifdef __SMP__
#define LOCK "lock ; "
#else
#define LOCK ""
#endif

void init(lifo* lf)
{
        lf->top = 0;
        lf->cnt = 0;
}

void push (lifo * lf, cell * cl)
{
        cell* old; /* dummy output: cmpxchg rewrites eax on failure */
        __asm__ __volatile__ (
                "# LFPUSH \n\t"
                "1:\t"
                "movl %1, (%2) \n\t"        /* cl->link = current top */
                LOCK "cmpxchg %2, %0 \n\t"  /* top = cl iff top == eax */
                "jnz 1b \n\t"               /* failed: eax = new top, retry */
                :"=m" (*lf), "=a" (old)
                :"r" (cl), "1" (lf->top)
                :"memory");
}

cell* pop (lifo * lf)
{
        cell* v = 0;
        unsigned long tmp; /* dummy output: cmpxchg8b rewrites edx on failure */
        __asm__ __volatile__ (
                "# LFPOP \n\t"
                "testl %%eax, %%eax \n\t"   /* empty stack? */
                "jz 20f \n"
                "10:\t"
                "movl (%%eax), %%ebx \n\t"  /* ebx = top->link */
                "movl %%edx, %%ecx \n\t"
                "incl %%ecx \n\t"           /* ecx = cnt + 1 */
                LOCK "cmpxchg8b %2 \n\t"    /* top:cnt = ebx:ecx iff unchanged */
                "jz 20f \n\t"               /* success: eax = popped cell */
                "testl %%eax, %%eax \n\t"   /* edx:eax reloaded: empty now? */
                "jnz 10b \n"
                "20:\t"
                :"=a" (v), "=d" (tmp)
                :"m" (*lf), "0" (lf->top), "1" (lf->cnt)
                :"ecx", "ebx", "memory");
        return v;
}

The push() function is straightforward. The link field of the new cell is
set to the current top of the stack. Then, using CMPXCHG, the top of the
stack is updated with the address of the new cell, provided it still
points to the old top cell. The process is repeated until it succeeds.

The pop() function is a little more complex in order to avoid the
so-called ABA problem. Let us represent the LIFO stack S containing the
cells a, b, c, ... as S:[Top:a, a:b, b:c, ...]. Pop(S) returns 'a' and
updates S to [Top:b, b:c, ...]. A naive implementation would first read
the address of the top cell 'a', then, using CMPXCHG, update Top with 'b'
provided Top still contains 'a'. Unfortunately, the fact that Top still
points to 'a' does not mean that 'a' still points to 'b'. The solution is
to add a counter field that is incremented on every pop operation, and to
update the top pointer and the counter atomically with CMPXCHG8B.
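The failure of the naive implementation can be replayed deterministically
in a single thread (a model of the interleaving, not the real concurrent
code): thread 1 reads Top = a and a->link = b, is preempted while another
thread pops a and b and pushes a back, and thread 1's compare-and-swap
then succeeds even though it installs the removed cell b as the new top.

```c
typedef struct cell { struct cell* link; } cell;

/* naive compare-and-swap on the top pointer (no counter) */
static int naive_cas(cell** top, cell* old, cell* new_top)
{
        if (*top == old) { *top = new_top; return 1; }
        return 0;
}

/* Replays the ABA interleaving; returns 1 if the naive CAS
   succeeded AND left the stack corrupted (top == removed cell b). */
int aba_demo(void)
{
        cell a, b, c;
        cell* top;
        cell* seen_top;
        cell* seen_next;
        int ok;

        a.link = &b; b.link = &c; c.link = 0; /* stack: a, b, c */
        top = &a;

        seen_top  = top;            /* thread 1 reads Top = a ...     */
        seen_next = seen_top->link; /* ... and a->link = b, then is
                                       "preempted"                    */

        top = top->link;            /* other thread pops a: top = b   */
        top = top->link;            /* pops b: top = c                */
        a.link = top; top = &a;     /* pushes a back: stack is a, c   */

        /* thread 1 resumes: Top == a again, so the CAS succeeds,
           but b is no longer part of the stack */
        ok = naive_cas(&top, seen_top, seen_next);
        return ok && top == &b;
}
```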

6) Performance test
====================

The performance of the lock-based and lock-free implementations is
compared by measuring the time required for 1 to 7 parallel threads to
perform 1 000 000 x 6 concurrent push and pop operations on a shared LIFO
stack. The tests were made on a Celeron 500 MHz UP machine and a
bi-Celeron 500 MHz SMP machine running a 2.2.14 kernel.

The code executed by each thread is the following :

long stacktest (long n)
{
        cell* temp[6];
        long i;
        clock_t t0, t1;

        t0 = clock();

        while (n--) {
                for (i=0; i<6; i++) {
                        temp[i] = POP(&gstack);
                        if (temp[i] == 0) printf ("error\n");
                }
                for (i=0; i<6; i++) PUSH(&gstack, temp[i]);
        }

        t1 = clock();

        return t1-t0;
}
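The surrounding driver is not shown in the post. A scaled-down sketch of
how it could look (the thread count, iteration count, and the
mutex-guarded push/pop used here are assumptions, not the original
benchmark code) also checks the integrity of the stack after the workers
finish:

```c
#include <pthread.h>
#include <stddef.h>

typedef struct cell { struct cell* link; } cell;
typedef struct lifo { cell* top; } lifo;

static lifo gstack;
static pthread_mutex_t gSemaphore = PTHREAD_MUTEX_INITIALIZER;

static void push_locked(lifo* lf, cell* cl)
{
        pthread_mutex_lock(&gSemaphore);
        cl->link = lf->top;
        lf->top = cl;
        pthread_mutex_unlock(&gSemaphore);
}

static cell* pop_locked(lifo* lf)
{
        cell* v;
        pthread_mutex_lock(&gSemaphore);
        v = lf->top;
        if (v) lf->top = v->link;
        pthread_mutex_unlock(&gSemaphore);
        return v;
}

/* each worker repeatedly pops one cell and pushes it back */
static void* worker(void* arg)
{
        long n = (long) arg;
        while (n--) {
                cell* c = pop_locked(&gstack);
                if (c) push_locked(&gstack, c);
        }
        return 0;
}

/* Runs nthreads workers (nthreads <= 8) and returns the number of
   cells left on the stack; it must equal the 6 initially pushed. */
int run_test(int nthreads, long iters)
{
        enum { NCELLS = 6 };
        static cell cells[NCELLS];
        pthread_t th[8];
        cell* c;
        int i, count = 0;

        gstack.top = 0;
        for (i = 0; i < NCELLS; i++)
                push_locked(&gstack, &cells[i]);

        for (i = 0; i < nthreads; i++)
                pthread_create(&th[i], 0, worker, (void*) iters);
        for (i = 0; i < nthreads; i++)
                pthread_join(th[i], 0);

        for (c = gstack.top; c; c = c->link)
                count++;
        return count;
}
```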

The integrity of the stack was checked after the threads had completed
their operations.

7) Results
==========
The results are expressed in microseconds and represent the time required
by one thread to perform 6 000 000 pops + 6 000 000 pushes on a shared
stack accessed concurrently by 1 to 7 threads.

T1) Celeron 500 Mhz, uni-processor, lock-based implementation
-------------------------------------------------------------
# time required (us)
1 4210000
2 4715000
3 5410000
4 6167500
5 7198000
6 8285000
7 9301428

It is interesting to note that the cost increases with the number of
threads involved, up to a factor of 2 between 1 thread and 7 threads.

T2) Celeron 500 Mhz, BI-processor, lock-based implementation
-------------------------------------------------------------
# time required (us)
1 4220000
2 5765000
3 14176666
4 27390000
5 50278000
6 56215000
7 56367142

The SMP version scales very badly as the number of threads increases, up to
a factor of 13.

T3) Celeron 500 Mhz, uni-processor, lock-free implementation, no lock prefix
-------------------------------------------------------------
# time required (us)
1 390000
2 395000
3 393333
4 392500
5 392000
6 391666
7 388571

The lock-free implementation is 10 times faster than in T1 and, as
expected, the cost is independent of the number of threads.

T4) Celeron 500 Mhz, uni-processor, lock-free implementation, with lock prefix
-------------------------------------------------------------
# time required (us)
1 950000
2 960000
3 950000
4 945000
5 948000
6 950000
7 951428

The lock prefix is very expensive and should be avoided on UP machines,
where it is not required. We also tried removing the lock prefix on the
SMP machine; as expected, the LIFO stack was quickly corrupted.

T5) Celeron 500 Mhz, BI-processor, lock-free implementation, with lock prefix
-------------------------------------------------------------
# time required (us)
1 950000
2 1755000
3 1986666
4 2720000
5 3010000
6 3706666
7 3921428

Performance degrades noticeably between 1 and 2 threads because of the
contention between them, but then the system scales quite well (a ratio
of 4) compared to T2. With 7 threads it performs 14 times better.

8) Some Papers
==============

Impossibility and universality results for wait-free synchronization;
Maurice P. Herlihy;
Proceedings of the Seventh Annual ACM Symposium on Principles of
distributed computing, 1988, Pages 276 - 290

A methodology for implementing highly concurrent data structures;
Maurice P. Herlihy;
Second ACM SIGPLAN symposium on Principles & practice of parallel
programming, 1990, Pages 197 - 206

Simple, fast, and practical non-blocking and blocking concurrent queue
algorithms;
Maged M. Michael and Michael L. Scott;
Proceedings of the fifteenth annual ACM symposium on Principles of
distributed computing, 1996, Pages 267 - 275

Real-time computing with lock-free shared objects;
James H. Anderson, Srikanth Ramamurthy and Kevin Jeffay;
ACM Trans. Comput. Syst. 15, 2 (May. 1997), Pages 134 - 165

-Yo-

---------------------------------------------------------
Yann Orlarey, (orlarey_AT_grame.fr)
GRAME, 9 rue du Garet, BP 1185, 69202 Lyon, France
http://www.grame.fr
telephone : +33 (0)4 72 07 37 00, fax : +33 (0)4 72 07 37 01
---------------------------------------------------------



This archive was generated by hypermail 2b28 : Fri Jul 21 2000 - 18:02:22 EEST