\chapter{The L4 Microkernel}


\section{Why L4?}

There are other good reasons for seeking an alternative to
the Mach microkernel.  Mach has three main faults.

\begin{enumerate}
\item It is very big

The precompiled kernel distributed with the 0.2 release of
the GNU system is 1162k.  Admittedly, this contains a large
number of device drivers and it is possible to build a
kernel specific to a particular machine which will contain
only the appropriate device drivers and so will be smaller. 
However, the Linux (version 2.0.33) kernel I have on my PC
is only 1055k, and that includes most of the functionality
which, on Mach, must be provided by additional servers
running on top of the kernel.

\item It is too slow

Recent measurements \cite{muperf} show the performance of
Mach to be dramatically lower than previously reported.  In
particular, comparing IPC performance between monolithic and
Mach-based systems shows the Mach-based systems achieving
closer to 50\% of monolithic performance than the 90\% often
claimed.

\item It is too complicated

Mach contains in excess of 200 system calls.  The semantics
of each call are complex and several of them would be better
implemented without (direct) kernel intervention.  For
example, many of the vm\_* interfaces could be directly
implemented as an IPC from the task to its memory manager. 
The OSF Mach Kernel Interfaces document is in excess of 450
pages.
\end{enumerate}

The L4 microkernel seems to suffer from none of these
faults.  L4 running on the x86 family is under 32k and
outperforms Mach by a significant factor.  It contains just
7 system calls and the reference manual is only 50 pages. 
It does not contain a default memory pager as Mach does, but
I do not see this as a disadvantage since different operating
systems have such different requirements for a pager that
they normally provide their own in any case.

L4 is available for the Intel 486 (and compatible CPUs)
and MIPS processors.  A version for the DEC Alpha is in
development.  Some interest has been expressed in a version
for the ARM.

\section{Features of L4}

The primary purpose of microkernels that are designed as
bases for multi-server style operating systems is efficient
and secure message-passing.

The IPC methods of Mach and L4 differ in two significant
ways.  First, Mach uses asynchronous message passing,
which means that the kernel must buffer data (potentially
large quantities of it).  L4 uses synchronous message
passing, which involves much less work for the kernel.

Second, Mach has a centralised structure for security,
where the kernel enforces the `send rights' through a
mechanism known as ports.  L4 distributes security to
external tasks through a mechanism known as clans.  This was
first discussed in \cite{clans}.

Tasks are organised into Clans with Chiefs.  Within a Clan,
the only protection is that imposed by the individual task
based on the sender's ID (which is enforced by the microkernel)
but between Clans, each Clan boundary that the message crosses
incurs inspection and possible rejection by the Clan Chief.
Clans are nestable, so a hierarchy of protection can be built.
An example of message transmission is shown in Figure~\ref{clan}.
Here, the rectangles represent Clans and the circles represent
tasks.  The thick arrow represents the message that is sent and
the thin arrows represent the messages which are actually passed.

\begin{figure}
\includegraphics*[0mm,0mm][10cm,6cm]{clan.ps}
\caption{Message transmission between tasks in different Clans.}
\label{clan}
\end{figure}
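
The rule that each Clan boundary costs one inspection can be made
concrete with a small model.  The following C sketch is illustrative
only: the representation of the Clan hierarchy as an array of parent
indices, and the names \texttt{NO\_CLAN}, \texttt{clan\_depth} and
\texttt{clan\_boundaries\_crossed}, are assumptions introduced here and
are not part of the L4 interface.  It counts how many Clan boundaries,
and hence how many Chief inspections, a message incurs on its way from a
task in one Clan to a task in another.

\begin{verbatim}
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: parent[c] is the clan that encloses clan c.
 * NO_CLAN marks the outermost clan, which has no enclosing clan. */
#define NO_CLAN (-1)

/* Nesting depth of a clan (the outermost clan has depth 0). */
int clan_depth(const int *parent, int clan)
{
    int depth = 0;
    assert(parent != NULL);
    while (parent[clan] != NO_CLAN) {
        clan = parent[clan];
        depth++;
    }
    return depth;
}

/* Number of clan boundaries crossed by a message sent from a task
 * in clan src to a task in clan dst.  Both clans must belong to the
 * same hierarchy.  Each boundary implies one inspection by a chief. */
int clan_boundaries_crossed(const int *parent, int src, int dst)
{
    int ds = clan_depth(parent, src);
    int dd = clan_depth(parent, dst);
    int crossed = 0;

    /* Climb out of the deeper clan until both are at the same depth. */
    while (ds > dd) { src = parent[src]; ds--; crossed++; }
    while (dd > ds) { dst = parent[dst]; dd--; crossed++; }

    /* Climb both sides together until the common enclosing clan. */
    while (src != dst) {
        src = parent[src];
        dst = parent[dst];
        crossed += 2;
    }
    return crossed;
}
```
\end{verbatim}

With Clans 1 and 2 nested directly inside Clan 0, and Clan 3 nested
inside Clan 1, a message from Clan 3 to Clan 2 crosses three boundaries,
while a message between two tasks in Clan 3 crosses none.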

In a centralised system, the kernel is responsible for
administering port rights, which adds significant overhead to
IPC calls.  This conflicts directly with the requirement
that IPC be fast.  The clan mechanism avoids that overhead
and is, additionally, philosophically superior, since the
point of a microkernel is to remove as many features as
possible from kernel space.  It does not
harm speed when communication is intra-clan and simply
multiplies the time taken by the number of clans traversed
when communication is inter-clan.  It should also scale
better than an in-kernel regulated protection scheme since
the protection mechanisms may be chosen on an arbitrary
basis and changed arbitrarily frequently without requiring
communication with the kernel.

Transparent multiple-node communication may also be achieved
using the clan mechanism, since
a task sees no difference between communicating with a
task on a different machine and a task on the same machine
in a different clan.  In either case, the message is
intercepted and potentially modified by the clan's chief.
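
The cost claim made above for inter-clan communication can be written as
a rough model.  The symbols are introduced here for illustration only,
and the model ignores the time each chief spends inspecting the message.
If $t_{\mathrm{ipc}}$ is the time for a single IPC and $n$ is the number
of clans traversed, then
\[
T_{\mathrm{intra}} = t_{\mathrm{ipc}}, \qquad
T_{\mathrm{inter}} \approx n \, t_{\mathrm{ipc}} .
\]
For instance, a message that traverses three clans costs roughly three
times as much as an intra-clan message.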


\section{Memory management in L4}

Mach has an intricate memory management system which allows
the task to communicate with its pager in great detail.  L4
has no such interface.  It has an extremely simple handler
for the physical memory called $\sigma_0$. This provides no
additional paging facility.  It is intended to grant all of
the available physical pages to a more sophisticated
higher-level pager which is referred to as $\sigma_1$.
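
The role of $\sigma_0$ described above can be sketched as follows.  This
is a toy model in C, not the $\sigma_0$ protocol itself: the page count,
the structure and the function names are hypothetical, and it assumes
that each physical page is handed out at most once, to the first task
that asks for it.

\begin{verbatim}
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIGMA0_PAGES 8  /* illustrative number of physical pages */

/* Hypothetical state of sigma0: which pages it has given away. */
struct sigma0 {
    bool granted[SIGMA0_PAGES];
};

/* Grant one physical page.  Returns false if the page does not
 * exist or has already been granted to an earlier requester. */
bool sigma0_grant(struct sigma0 *s, size_t page)
{
    assert(s != NULL);
    if (page >= SIGMA0_PAGES || s->granted[page])
        return false;
    s->granted[page] = true;
    return true;
}

/* What a higher-level pager might do at bootstrap: request every
 * page that is still available.  Returns the number obtained. */
size_t sigma0_grant_all(struct sigma0 *s)
{
    size_t obtained = 0;
    for (size_t page = 0; page < SIGMA0_PAGES; page++) {
        if (sigma0_grant(s, page))
            obtained++;
    }
    return obtained;
}
```
\end{verbatim}

Once the higher-level pager has called \texttt{sigma0\_grant\_all}, every
later request to this model is refused, which is all that remains for it
to do.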

I do not see the advantage in placing $\sigma_0$ outside the
kernel.  It requires that the kernel pass considerable
information about the physical state of the machine to
$\sigma_0$ (though this is achieved in an efficient manner). 
It is necessary to define an IPC protocol to access $\sigma_0$
as part of the kernel definition as otherwise the task of
writing $\sigma_1$ would be impossible.  The $\sigma_0$ protocol
definition notes that `Special $\sigma_0$ implementations may
extend this protocol' which is unwise in my opinion since it
could lead to incompatible implementations.

Conceptually then, $\sigma_0$ may be considered to be part of
the kernel.  The only advantage to having $\sigma_0$ separate
from the kernel is that it allows for separate compilation of
$\sigma_0$, which may be convenient in certain situations.  The
design of $\sigma_0$ is such that it will not be required
after the initial OS bootstrap, except to refuse requests
for any further memory allocation.  It might lead to a more
efficient implementation to put $\sigma_0$ inside the kernel
and have a system for removing initialisation code from the
kernel, as recent Linux kernels do.


\section{Device drivers in L4/ARM}

If L4 is ported to the ARM then device drivers present an
interesting problem.  On the ARM, there are only two types of
interrupt, normal and fast.  It is necessary to interrogate
the I/O controller to determine which device caused the
interrupt.  This is not a problem as such: it is reasonable
for the kernel to de-multiplex the interrupt and expose an
interface to the drivers that masks this.  The real
problem is that all the expansion card interrupts are
multiplexed onto one of the I/O controller's lines.  When that
interrupt is triggered, each expansion card must be
interrogated in turn to see if it caused the interrupt.  There
is a standard way for expansion cards to tell the kernel how
to find out if their interrupt has been triggered, but not
all cards support this method.  For details, see
\cite[page 4-126]{prm}.

This is soluble in Mach: since the device drivers are
in-kernel, an interrupt can be passed around the built-in
drivers until one claims it.  In L4, however, this is somewhat
more difficult.  Since the device drivers are outside the
kernel, it is not possible for the kernel to tell \emph{a priori}
which expansion card has caused the interrupt.  Unfortunately,
L4 allows only one thread to be the recipient of any given
interrupt.

In my opinion, L4 should be modified to allow an interrupt
to be shared, i.e.\ the interrupt should be delivered to all
of the threads which have requested it.  It is then
necessary to have a further protocol which permits each
thread to tell the kernel whether it has dealt with
the interrupt or wishes it to be passed on to other
claimants.
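
A minimal sketch of such a protocol is given below, in C.  It is not
taken from L4: the reply values, the callback type and the function
names are all hypothetical, and each claimant thread is modelled as a
plain function rather than as the recipient of an IPC.  The interrupt is
offered to each claimant in turn until one reports that it has dealt
with it.

\begin{verbatim}
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reply from a claimant that was offered an interrupt. */
enum irq_reply { IRQ_HANDLED, IRQ_PASS_ON };

/* A claimant is modelled as a callback.  In a real system it would
 * be a driver thread that interrogates its own expansion card. */
typedef enum irq_reply (*irq_claimant)(int irq);

/* Example claimant whose card did not raise the interrupt. */
enum irq_reply example_idle_card(int irq)
{
    (void)irq;
    return IRQ_PASS_ON;
}

/* Example claimant whose card did raise the interrupt. */
enum irq_reply example_busy_card(int irq)
{
    (void)irq;
    return IRQ_HANDLED;
}

/* Offer the interrupt to each claimant in registration order.
 * Returns the index of the claimant that handled it, or -1 if
 * every claimant passed it on. */
int deliver_shared_irq(int irq, const irq_claimant *claimants,
                       size_t count)
{
    assert(claimants != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (claimants[i](irq) == IRQ_HANDLED)
            return (int)i;
    }
    return -1;
}
```
\end{verbatim}

Offering the interrupt to the claimants one after another, rather than
to all of them at once, mirrors the way the expansion cards themselves
must be interrogated in turn.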

However, there is a security problem with this.  A thread
needs no particular right to associate with an interrupt. 
Since it is already determined that security shall lie
outside the kernel, it makes no sense to make an exception
to this rule for device drivers.  Any solution ought to be
formulated in terms of clans and chiefs.  Unfortunately,
the kernel is an exception to the clans mechanism.  Messages
that are sent directly to the kernel bypass all chiefs.  I
consider this to be a flaw in the implementation of L4.

Another potential solution to this problem is for L4 to treat
subsequent claimants of the interrupt specially, and pass
them to the first claimant for checking, as if it were the
chief for this particular thread.  However, this idea is
also flawed since the protection it provides can be
circumvented by the following sequence:

\begin{enumerate}
\item A second (malicious) thread claims the vector and is
approved.
\item The first thread is killed in order to be replaced by
an improved version.  The second thread then becomes the
primary thread.
\item The second thread may now deny service to the
replacement for the first thread.
\end{enumerate}

The only viable solution to this problem in terms of the
current operation of L4 is to have a task external to the kernel
which device drivers register themselves with.


\section{HURD on L4}

The HURD distribution contains a large number of libraries.
This was a design decision taken early on: since a large number
of similar services were likely to be wanted, it made sense to
abstract as much as possible into libraries.  One of the
libraries in the HURD distribution is
\hurd{LibMOM} (Microkernel Object Module), which provides an
abstraction layer between HURD processes and the underlying
microkernel.  The intention of this library is that, to port
HURD from one microkernel to another, it should only be
necessary to rewrite \hurd{LibMOM}.

However, careful examination of the sources shows that none of the
components of HURD currently use \hurd{LibMOM}.  The first step
in porting HURD to any other microkernel must therefore be to alter
the various servers to use the \hurd{LibMOM} indirection layer.


\subsection{Memory Management}

The \func{vm\_allocate} Mach system call is replaced by the
\hurd{LibMOM} functions \func{mom\_allocate\_memory} and
\func{mom\_allocate\_address}.  However, these functions only
allow for allocating memory in the current task, whereas
Mach's \func{vm\_allocate} allows tasks to allocate memory
into the address space of another task.  Unfortunately, HURD
does use this feature of Mach and there is no defined
\hurd{LibMOM} function to transfer memory from one task to
another.  The feature is not used frequently: of the 98 calls to
\func{vm\_allocate}, only 15 do not refer to the invoking
task.  For example, the \hurd{exec} server allocates memory
to the task that it is starting using \func{vm\_allocate}.

If a microkernel has external memory managers, then it must be
possible for one task to give memory to another task.  However,
the precise procedure for this is likely to vary from kernel
to kernel, so I propose that a new call be added to
\hurd{LibMOM}.
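
To make the shape of such a call concrete, here is a sketch in C.
Everything in it is hypothetical: \func{mom\_allocate\_memory\_in\_task}
is not an existing \hurd{LibMOM} function, its signature is invented for
illustration, and the target task is a mock whose address space is
reduced to a simple counter.  The point is only that the caller names
the task which is to receive the memory.

\begin{verbatim}
```c
#include <assert.h>
#include <stddef.h>

#define TASK_SPACE 4096  /* size of each mock address space, in bytes */

/* Mock task: a bump allocator standing in for an address space. */
struct mock_task {
    size_t next_free;  /* offset of the first unallocated byte */
};

/* HYPOTHETICAL call, not part of LibMOM.  Allocate size bytes in
 * the address space of target.  On success the offset of the new
 * region is stored in *addr and 0 is returned; -1 is returned if
 * the request cannot be satisfied. */
int mom_allocate_memory_in_task(struct mock_task *target,
                                size_t size, size_t *addr)
{
    assert(target != NULL && addr != NULL);
    if (size == 0 || size > TASK_SPACE - target->next_free)
        return -1;
    *addr = target->next_free;
    target->next_free += size;
    return 0;
}
```
\end{verbatim}

A server such as \hurd{exec} would pass the task it is starting as
\texttt{target}; a kernel-specific implementation would then carry out
whatever transfer the underlying microkernel requires.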

Some of the current \hurd{LibMOM} calls are actually common
combinations of other calls.  Whether a combined call is required
that allocates memory to a different task is a question that
could only be answered by profiling a system that did not have
it and comparing it to one that did.


\subsection{Interprocess Communication}

The other main microkernel-provided service is interprocess
communication, normally abbreviated to IPC.  Much of the IPC
in Mach-based operating systems is already abstracted away
from the raw \func{mach\_msg} interface by MIG, the Mach Interface
Generator.  It is similar in action to Sun's rpcgen program
in that it translates a high-level description of the services
provided into client and server stubs which can be linked
against by ordinary programs.  The operating systems group
at Utah have written Flick \cite{flick}, which is intended to
provide a replacement for many different generators of this
sort, including MIG and rpcgen.
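
To illustrate what client and server stubs do, the following C sketch
writes a pair by hand for a single operation.  It is not the output of
MIG, rpcgen or Flick: the message layout and all of the names are
hypothetical, and the message buffer is handed directly from one stub to
the other instead of being sent through a kernel IPC primitive.

\begin{verbatim}
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wire format for one operation, add(a, b). */
#define OP_ADD 1u

struct add_msg {
    uint32_t op;  /* which operation is requested */
    int32_t  a;   /* first marshalled argument */
    int32_t  b;   /* second marshalled argument */
};

/* Client stub: pack the arguments into a flat buffer.
 * Returns the number of bytes written. */
size_t add_client_marshal(unsigned char *buf, int32_t a, int32_t b)
{
    struct add_msg m = { OP_ADD, a, b };
    assert(buf != NULL);
    memcpy(buf, &m, sizeof m);
    return sizeof m;
}

/* The ordinary function that the server programmer writes. */
int32_t add_impl(int32_t a, int32_t b)
{
    return a + b;
}

/* Server stub: unpack the buffer, check it, call the implementation.
 * Returns 0 on success and -1 if the message is malformed. */
int add_server_dispatch(const unsigned char *buf, size_t len,
                        int32_t *result)
{
    struct add_msg m;
    if (len != sizeof m)
        return -1;
    memcpy(&m, buf, sizeof m);
    if (m.op != OP_ADD)
        return -1;
    *result = add_impl(m.a, m.b);
    return 0;
}
```
\end{verbatim}

A stub generator produces code of this kind automatically from the
interface description, so that neither the client nor the server
programmer has to write the packing and unpacking by hand.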

I do not think it is worth investigating porting MIG to
generate L4 calls, since Flick would provide a much better
basis for emulating Mach-style IPC.  Flick generates code
that is `between 2 and 17 times faster' \cite{flick} than
other generators.  Flick already supports interface
descriptions written in CORBA, ONC RPC and MIG, and will
generate stubs for IIOP, ONC/TCP, Mach ports or Fluke
IPC.  The authors claim that it is extremely flexible and
extensible so it should not be hard to provide a back end
that generates L4 calls.


\subsection{Emulating Mach}

An alternative approach, taken in the project described in
\cite{unixl3}, is to provide an emulation of LibMach, which
provides a veneer over the Mach kernel.  The conclusion of
that report is that providing a Mach emulation on top of
another microkernel is unnecessarily complicated and it is
probable that altering the overlying operating system to
work with L4 directly would be significantly faster.

This does not necessarily mean that a common microkernel
abstraction layer such as that which \hurd{LibMOM} attempts
to provide is going to be inefficient.  Much of the overhead
associated with the LibMach approach was incurred in
emulating the exact semantics of Mach.  This would not
apply to \hurd{LibMOM} since it implements very simple
primitives.


\section{Experimental Evaluation}

I was not able to perform any experiments of my own, but the
results for running Linux on the L4 kernel mentioned in
\cite{muperf} look promising for the performance problem
associated with Mach.  Additionally, they state that it took
`14 engineer-months' to port the monolithic Linux kernel to
run on L4.

Unfortunately, I have not been able to persuade GNUMach to run on
my computer, due to a bug in the device driver for my Adaptec
SCSI card.  I was therefore not able to test my modifications.
I hope to do so at some stage in the future when a replacement
driver appears.

There are many shortcomings in the \hurd{LibMOM} API compared
to the Mach API.  \hurd{LibMOM} evidently requires more work
before HURD can be fully abstracted from the Mach microkernel.
In order to achieve its goal of abstraction from any particular
microkernel, it must abstract all the services which HURD
requires.  It does not currently even attempt to deal with
threads or clocks and, most importantly, it has no interface
which deals with access control or any other security mechanism.

It could be argued that this type of interface should not be
added to \hurd{LibMOM}; instead, in order to port HURD to a new
microkernel, \hurd{libthreads} and \hurd{libports} should be
ported.  I would disagree, because this would
leave many libraries to be rewritten instead of
one, which would make the task of porting HURD less clear.
