Who are we?
Linux Kernel Programming
Lehr- und Forschungsgebiet Betriebssysteme
Operating Systems Research and Teaching Unit
Faculty
Administration
Researchers
Research assistants - Tutors (HiWi)
Our offices are in the UMIC building, 2nd floor
Website: https://os.rwth-aachen.de
Past/current theses examples
As the name of the group suggests, operating systems!
In short: design, implementation and optimisation of the lower software stack, between the hardware and users.
The main goals are:
In a nutshell, our topics revolve around:
“Classical” operating systems
Virtualisation
Emerging hardware
Binary translation
Lecturer:
Prof. Redha Gouicem
Teaching assistant:
Contact emails
In this course, you will learn how to program in the Linux kernel.
This is a very practical course, where you will mostly write code.
Lectures
Time: Tuesdays @ 16:30 - 18:00
Location: Lecture hall AH VI
Lecturer: Me
Content:
Labs
Time: Mondays & Thursdays @ 10:30 - 12:00
Location: UMIC 025
Teaching assistant: Jérôme Coquisart, M.Sc.
Content:
Important information
Unfortunately, we cannot provide hardware, so you need to come with your laptop.
Hardware that runs Linux is best, but Windows with WSL should also work.
Apple devices with ARM-based processors (M1/M2) should also work, but not easily…
Lectures
Labs
Info
There might be a couple more lectures and labs added this year.
Written exam (45% of the final grade)
Project (45% of the final grade)
Weekly labs (10% of the final grade)
Contact e-mail
If you want to contact us, please use the following e-mail address: lkp@os.rwth-aachen.de
If you contact us directly, you might wait longer or get no answer.
Matrix server
We will set up a Matrix chat room with all students (and us).
If you already have an account on another server, you can use it.
Otherwise, you will be allowed to create one on ours.
Lectures and Labs
Lecture slides will be uploaded just before the lecture here: https://teaching.os.rwth-aachen.de/LKP/lecture
Labs will be available here: https://teaching.os.rwth-aachen.de/LKP
During lectures, you can ask questions directly by raising your hand, or through an online Q&A tool:
Link: Claper room
Books
Online material
In this course, we will use the following definitions:
Definition
The operating system is the set of software components that enables applications to use the underlying hardware and provides APIs to ease development.
Definition
The kernel is the set of components of the operating system that are executed in a privileged mode, usually in supervisor mode.
Kernels are usually classified in various types:
Let’s have a quick recap of these kernel architectures!
A monolithic kernel embeds all the system functionalities in a single binary. It contains all the core features of an operating system (scheduling, memory management, etc.) as well as drivers for devices or less essential components.
Characteristics
Examples
Why ‘monolithic’?
Monolithic means it is built as a single binary and runs in the same address space. The source code can still be organised in a modular way (e.g., using libraries).
Modularity
Some monolithic kernels allow dynamic code loading as modules, e.g., for drivers. These are usually called modular monolithic kernels.
A microkernel contains only the minimal set of features needed in kernel space:
address-space management, basic scheduling and basic inter-process communication.
All other services are pushed to user space as servers:
file systems, device drivers, high level interfaces, etc.
Characteristics
Examples
The hybrid kernel architecture sits between monolithic kernels and microkernels.
It is a monolithic kernel where some components have been moved out of kernel space as servers running in user space.
While the structure is similar to microkernels, i.e., using user servers, hybrid kernels do not provide the same safety guarantees as most components still run in the kernel.
Controversial architecture
This architecture’s existence is controversial, as some just define it as a stripped down monolithic kernel.
Examples
A unikernel, or library operating system, embeds all the software in supervisor mode.
The kernel as well as all user applications run in the same privileged mode.
It is used to build single application operating systems, embedding only the necessary set of applications in a minimal image.
Characteristics
Examples
The choice of architecture has various impacts on performance, safety and interfaces:
In this course, we will focus on a modular monolithic kernel: Linux.
In the 1960s, MIT, AT&T Bell Labs and General Electric built Multics (Multiplexed Information and Computing Service).
Multics is a time-sharing operating system for mainframes that introduced new concepts:
In 1969, AT&T Bell Labs left the project and started Unix, led by Ken Thompson, with Dennis Ritchie, Brian Kernighan, Douglas McIlroy, and Joe Ossanna.
Unix kept the hierarchical file system but dropped the single-level store, going for an “everything is a file” philosophy.
Unix was originally a single-tasking OS.
Why ‘Unix’?
The name Unix is a pun on Multics/Unics. Kernighan came up with the name, but states that “no one can remember” who came up with the spelling.
Source: https://en.wikipedia.org/wiki/History_of_Unix
From: Linus Benedict Torvalds
To: comp.os.minix
Subject: What would you like to see most in minix?
Date: 25 August 1991, 22:57:08
Hello everybody out there using minix -
I'm doing a (free) operating system (just a hobby, won't be
big and professional like gnu) for 386(486) AT clones. This
has been brewing since april, and is starting to get ready.
I'd like any feedback on things people like/dislike in minix,
as my OS resembles it somewhat (same physical layout of the
file-system (due to practical reasons) among other things).
I've currently ported bash(1.08) and gcc(1.40), and things
seem to work. This implies that I'll get something practical
within a few months, and I'd like to know what features most
people would want. Any suggestions are welcome, but I won't
promise I'll implement them :-)
Linus (torv...@kruuna.helsinki.fi)
PS. Yes - it's free of any minix code, and it has a
multi-threaded fs. It is NOT protable (uses 386 task switching
etc), and it probably never will support anything other than
AT-harddisks, as that's all I have :-(.
From: Andrew S. Tanenbaum
To: comp.os.minix
Subject: Re: LINUX is obsolete
Date: 30 January 1992, 09:04
/* blablabla */
I still maintain the point that designing a monolithic kernel
in 1991 is a fundamental error. Be thankful you are not my
student. You would not get a high grade for such a design :-)
/* blablabla */
Prof. Andrew S. Tanenbaum (a...@cs.vu.nl)
Year | Version | Features |
---|---|---|
1994 | 1.0 | stable kernel with basic UNIX functionalities |
1995 | 1.2–1.3 | round-robin scheduler, loadable modules, /dev/random |
1996 | 2.0 | PowerPC support, multicore, improved networking, Tux |
1999 | 2.2 | frame buffer, NTFS, FAT32, IPv6, USB, SLAB allocator |
2001 | 2.4 | new file systems (ext3, XFS, tmpfs), netfilter |
2003 | 2.6 | preemptible kernel, O(1) scheduler, ALSA |
2004 | 2.6.4–2.6.10 | EFI support, x86-64, ARMv6, CFQ IO scheduler |
2005 | 2.6.14 | FUSE support |
2007 | 2.6.20–2.6.23 | KVM, tickless kernel, SLUB allocator, CFS scheduler |
2008 | 2.6.24–2.6.28 | cgroups, ext4 |
2011 | 2.6.39 | removal of the Big Kernel Lock (BKL) |
2014 | 3.14–3.18 | OverlayFS, eBPF, kernel address space layout randomization (KASLR) |
2015 | 4.0 | live patching |
2018 | 4.15 | kernel page table isolation (security mitigations) |
2019 | 5.1 | io_uring |
2020 | 5.6 | wireguard |
2022 | 6.1 | multi-gen LRU eviction algorithm, initial Rust support |
2023 | 6.6 | new EEVDF scheduler |
2024 | 6.12 | PREEMPT_RT, sched_ext |
Linux offers six main functions:
through five abstraction layers:
User space interfaces
System calls, procfs, sysfs, device files, …
Virtual subsystems
Virtual memory, virtual filesystem, network protocols, …
Functional subsystems
Filesystems, memory allocators, scheduler, …
Devices control
Interrupts, generic drivers, block devices, …
Hardware interfaces
Device drivers, architecture-specific code, …
1. Tools and environment
2. Core components
3. Specific subsystems
4. Drivers and architecture-specific code
arch/
block/
COPYING
CREDITS
crypto/
Documentation/
drivers/
fs/
include/
init/
ipc/
Kbuild
Kconfig
kernel/
lib/
MAINTAINERS
Makefile
mm/
net/
README
REPORTING-BUGS
samples/
scripts/
security/
sound/
tools/
usr/
virt/
Tools and environment:
Documentation/: text documentation, in addition to comments
scripts/: scripts used for configuration, formatting, etc.
usr/: utilities to generate the Linux image
tools/: user space tools to interact with the kernel
samples/: code samples (a good place to start)
Core components:
init/: kernel start-up code (including main.c)
kernel/: main kernel components code
lib/: library code (“libc”) used to build the kernel
include/: headers
Specific subsystems:
block/: drivers for block devices
crypto/: cryptographic algorithms, hashes, …
fs/: file systems
ipc/: inter-process communication
mm/: memory management
net/: network support
security/: kernel security mechanisms
sound/: sound drivers, audio support
virt/: virtualisation support (kvm)
Drivers and architecture-specific code:
arch/: architecture-specific code for each processor family
drivers/: drivers for various hardware
The inline keyword allows the compiler to replace a function call by the body of the called function.
Pros
Cons
Inlined function definition
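As a minimal sketch of an inlined function definition, the helper below uses the static inline form that is common for small functions defined in headers. The name clamp_int() and its behaviour are hypothetical, chosen only for illustration.

```c
#include <assert.h>

/*
 * "static" avoids duplicate symbols when several files include the header,
 * "inline" hints that calls may be replaced by the body of the function.
 * clamp_int() is a hypothetical helper, not a kernel function.
 */
static inline int clamp_int(int value, int low, int high)
{
	assert(low <= high);	/* sanity check on the bounds */

	if (value < low)
		return low;
	if (value > high)
		return high;
	return value;
}
```

The compiler remains free to ignore the hint, so inline should be kept for short functions where the call overhead matters.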
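gcc (and most compilers) allows programmers to hint at the expected outcome of a branch. In the kernel, this is done with the likely() and unlikely() macros, which wrap the __builtin_expect() builtin.
These annotations are not part of standard C, but the underlying builtin is supported by gcc and clang (at least).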
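The annotations can be tried in user space with a short sketch. The two macros below have the same shape as the kernel ones; safe_div() is a hypothetical function where a zero divisor is assumed to be the rare case.

```c
#include <assert.h>
#include <stddef.h>

/* Built on a gcc/clang builtin; !! normalises any value to 0 or 1 */
#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

/* Hypothetical helper: returns 0 on success, -1 on division by zero */
static int safe_div(int num, int den, int *result)
{
	assert(result != NULL);

	if (unlikely(den == 0))	/* error path, expected to be cold */
		return -1;

	*result = num / den;
	return 0;
}
```

The hint only affects how the compiler lays out the code, the result of the condition is unchanged.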
The asmlinkage annotation tells the compiler to always place the arguments of a function on the stack.
Without it, gcc may try to optimise function calls by placing arguments in registers instead.
Using asmlinkage prevents this optimisation, simplifying calling this function from assembly code.
It is mainly used in system calls in order to enforce the calling convention.
In practice, asmlinkage is a macro defined in asm/linkage.h.
A union is a special type that allows storing different types of data at the same memory location.
Each member of a union is a typed alias of the same memory location.
The allocated size is equal to the size of the largest member of the union.
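A small user space sketch can make these properties visible. The union word type and the sum_bytes() helper are hypothetical names used for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical union: the same 8 bytes seen as one integer or as 8 bytes */
union word {
	uint64_t value;
	uint8_t bytes[8];
};

/* The union is as large as its largest member, not the sum of its members */
static_assert(sizeof(union word) == sizeof(uint64_t),
	      "union size is the size of its largest member");

/* Writes through one member, then reads the same storage through the other */
static uint64_t sum_bytes(uint64_t value)
{
	union word w = { .value = value };
	uint64_t sum = 0;

	for (int i = 0; i < 8; i++)
		sum += w.bytes[i];

	return sum;
}
```

Summing the bytes gives the same result whatever the endianness, since only the order of the bytes in memory changes.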
Examples in the kernel
union thread_union {
struct thread_info thread_info;
unsigned long stack[THREAD_SIZE/sizeof(long)];
};
The struct page is one of the most extreme examples of union usage in the kernel:
struct page {
unsigned long flags; /* Atomic flags, some possibly
* updated asynchronously */
/*
* Five words (20/40 bytes) are available in this union.
* WARNING: bit 0 of the first word is used for PageTail(). That
* means the other users of this union MUST NOT use the bit to
* avoid collision and false-positive PageTail().
*/
union {
struct { /* Page cache and anonymous pages */
/**
* @lru: Pageout list, eg. active_list protected by
* lruvec->lru_lock. Sometimes used as a generic list
* by the page owner.
*/
union {
struct list_head lru;
/* Or, for the Unevictable "LRU list" slot */
struct {
/* Always even, to negate PageTail */
void *__filler;
/* Count page's or folio's mlocks */
unsigned int mlock_count;
};
/* Or, free page */
struct list_head buddy_list;
struct list_head pcp_list;
};
/* See page-flags.h for PAGE_MAPPING_FLAGS */
struct address_space *mapping;
union {
pgoff_t index; /* Our offset within mapping. */
unsigned long share; /* share count for fsdax */
};
/**
* @private: Mapping-private opaque data.
* Usually used for buffer_heads if PagePrivate.
* Used for swp_entry_t if PageSwapCache.
* Indicates order in the buddy system if PageBuddy.
*/
unsigned long private;
};
struct { /* page_pool used by netstack */
/**
* @pp_magic: magic value to avoid recycling non
* page_pool allocated pages.
*/
unsigned long pp_magic;
struct page_pool *pp;
unsigned long _pp_mapping_pad;
unsigned long dma_addr;
atomic_long_t pp_ref_count;
};
struct { /* Tail pages of compound page */
unsigned long compound_head; /* Bit zero is set */
};
struct { /* ZONE_DEVICE pages */
/** @pgmap: Points to the hosting device page map. */
struct dev_pagemap *pgmap;
void *zone_device_data;
/*
* ZONE_DEVICE private pages are counted as being
* mapped so the next 3 words hold the mapping, index,
* and private fields from the source anonymous or
* page cache page while the page is migrated to device
* private memory.
* ZONE_DEVICE MEMORY_DEVICE_FS_DAX pages also
* use the mapping, index, and private fields when
* pmem backed DAX files are mapped.
*/
};
/** @rcu_head: You can use this to free a page by RCU. */
struct rcu_head rcu_head;
};
union { /* This union is 4 bytes in size. */
/*
* For head pages of typed folios, the value stored here
* allows for determining what this page is used for. The
* tail pages of typed folios will not store a type
* (page_type == _mapcount == -1).
*
* See page-flags.h for a list of page types which are currently
* stored here.
*
* Owners of typed folios may reuse the lower 16 bit of the
* head page page_type field after setting the page type,
* but must reset these 16 bit to -1 before clearing the
* page type.
*/
unsigned int page_type;
/*
* For pages that are part of non-typed folios for which mappings
* are tracked via the RMAP, encodes the number of times this page
* is directly referenced by a page table.
*
* Note that the mapcount is always initialized to -1, so that
* transitions both from it and to it can be tracked, using
* atomic_inc_and_test() and atomic_add_negative(-1).
*/
atomic_t _mapcount;
};
/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
atomic_t _refcount;
#ifdef CONFIG_MEMCG
unsigned long memcg_data;
#elif defined(CONFIG_SLAB_OBJ_EXT)
unsigned long _unused_slab_obj_exts;
#endif
/*
* On machines where all RAM is mapped into kernel address space,
* we can simply calculate the virtual address. On machines with
* highmem some memory is mapped into kernel virtual memory
* dynamically, so we need a place to store that address.
* Note that this field could be 16 bits on x86 ... ;)
*
* Architectures with slow multiplication can define
* WANT_PAGE_VIRTUAL in asm/page.h
*/
#if defined(WANT_PAGE_VIRTUAL)
void *virtual; /* Kernel virtual address (NULL if
not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */
#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
int _last_cpupid;
#endif
#ifdef CONFIG_KMSAN
/*
* KMSAN metadata for this page:
* - shadow page: every bit indicates whether the corresponding
* bit of the original page is initialized (0) or not (1);
* - origin page: every 4 bytes contain an id of the stack trace
* where the uninitialized value was created.
*/
struct page *kmsan_shadow;
struct page *kmsan_origin;
#endif
} _struct_page_alignment;
A structure is a collection of one or more variables.
struct version {
unsigned short major; // usually 2 bytes
unsigned long minor; // usually 8 bytes
char flags; // 1 byte
};
Memory alignment
A memory access is aligned if the accessed address is a multiple of the size of the access.
Example: an access to an unsigned long is aligned if the address is a multiple of 8 bytes.
Ordering and Padding
The only guarantee in the C language is the order of the members! The compiler can add padding between members for performance reasons.
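The effect of ordering and padding can be measured with sizeof and offsetof. The sketch below reuses the struct version layout from above; struct version_reordered is a hypothetical variant with the largest member first. On a typical 64-bit Linux build, the first one occupies 24 bytes and the second one 16, but the exact numbers depend on the ABI.

```c
#include <assert.h>
#include <stddef.h>

struct version {		/* same member order as the example above */
	unsigned short major;
	unsigned long minor;
	char flags;
};

struct version_reordered {	/* hypothetical variant, largest member first */
	unsigned long minor;
	unsigned short major;
	char flags;
};

/* Members always appear in declaration order, whatever padding is added */
static_assert(offsetof(struct version, major) < offsetof(struct version, minor),
	      "major comes before minor");
static_assert(offsetof(struct version, minor) < offsetof(struct version, flags),
	      "minor comes before flags");

/* Sum of the sizes of the members, identical for both structures */
#define MEMBERS_SIZE \
	(sizeof(unsigned short) + sizeof(unsigned long) + sizeof(char))

/* Padding = total size of the structure minus the size of its members */
static size_t version_padding(void)
{
	return sizeof(struct version) - MEMBERS_SIZE;
}

static size_t version_reordered_padding(void)
{
	return sizeof(struct version_reordered) - MEMBERS_SIZE;
}
```

Sorting members from largest to smallest is a simple way to reduce padding when the order has no other meaning.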
In C, an array must have a size. It is common to use a struct to keep the size close to the array:
This has several drawbacks:
Allocation is done in two steps (allocate the struct, then allocate the array)
One way to overcome this is called tail-padded structures: placing an undefined size array as the last member of a structure.
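#include <stdlib.h>

struct buf {
	size_t length;
	char buffer[];		/* flexible array member, must be last */
};

struct buf *alloc_buffer(size_t length)
{
	/* One allocation for both the header and the array */
	struct buf *b = malloc(sizeof(struct buf) + length);

	if (!b)
		return NULL;

	b->length = length;
	return b;
}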
Allocation, free, and copy can be done in one go.
Multiple implementations are possible:
int buffer[]: flexible array member, standard since C99, preferred form
int buffer[1]: pre-C99 idiom, not guaranteed by the standard, but supported by compilers
int buffer[0]: non-standard (GNU extension), but supported by compilers
If you try to compile code that assigns a pointer to an array (here, ja = yes;), you get this error:
foo.c: In function ‘main’:
foo.c:6:12: error: assignment to expression with array type
6 | ja = yes;
| ^
yes is a pointer to a char (here, the first character of the string "da").
ja is an array identifier, a symbolic constant.
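#include <stdio.h>

int main(void)
{
	char *yes = "da";
	char ja[3];

	/* value of the identifier, then address of the identifier */
	printf("yes: %p - %p\n", (void *)yes, (void *)&yes);
	printf("ja: %p - %p\n", (void *)ja, (void *)&ja);
	return 0;
}
A symbolic constant’s address doesn’t really make sense, so the compiler gives it the value of the constant (hence ja and &ja have the same value).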
Declaration
A function pointer is declared with the following syntax:
Calling a function pointer
#include <stdio.h>

void say_hello(char *name)
{
	printf("Hello %s\n", name);
}

int main(void)
{
	void (*func_ptr)(char *);	// declaration
	func_ptr = say_hello;		// assignment
	(*func_ptr)("zero");		// call
	return 0;
}
As a function argument
Function pointers are frequently used in the kernel to set up callbacks.
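#include <stdlib.h>

struct elem {
	int refcount;
};

void free_elem(struct elem *e)
{
	free(e);
}

void put_elem(struct elem *e, void (*release)(struct elem *))
{
	e->refcount--;
	if (!e->refcount)
		release(e);
}

int main(void)
{
	struct elem *e = malloc(sizeof(struct elem));

	if (!e)
		return 1;
	e->refcount = 1;	/* we hold the only reference */
	put_elem(e, free_elem);	/* drops it, so free_elem() is called */
	return 0;
}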
Macros for constants
Warning!
What happens with this code?
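Without parentheses in the macro definition, sqr(a + 1) with a = 3 is expanded as 3 + 1 * 3 + 1, which evaluates to 7.
With parentheses around the arguments and the result, sqr(a + 1) is expanded as ((3 + 1) * (3 + 1)), which evaluates to 16.
Always put parentheses around the arguments, as the macro’s expansion might otherwise cause unwanted behaviour!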
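The two expansions can be reproduced with a short sketch. SQR_UNSAFE and SQR are hypothetical names for the unparenthesised and parenthesised variants of the macro.

```c
#include <assert.h>

/* Unsafe: neither the argument nor the result is parenthesised */
#define SQR_UNSAFE(x)	x * x

/* Safe: parentheses around each use of the argument and around the result */
#define SQR(x)		((x) * (x))

/* Both checks are done by the compiler, on the expanded expressions */
static_assert((SQR_UNSAFE(3 + 1)) == 7, "expands to 3 + 1 * 3 + 1");
static_assert(SQR(3 + 1) == 16, "expands to ((3 + 1) * (3 + 1))");

static int sqr_unsafe_plus_one(int a)
{
	return SQR_UNSAFE(a + 1);	/* a + 1 * a + 1 */
}

static int sqr_plus_one(int a)
{
	return SQR(a + 1);		/* ((a + 1) * (a + 1)) */
}
```

Parentheses do not solve everything: SQR(i++) still evaluates its argument twice, so arguments with side effects should be avoided.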
Macros as code blocks
#define kthread_init_delayed_work(dwork, fn) \
do { \
kthread_init_work(&(dwork)->work, (fn)); \
timer_setup(&(dwork)->timer, \
kthread_delayed_work_timer_fn, \
TIMER_IRQSAFE); \
} while (0)
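The do ... while (0) construct makes the macro expand to a single statement, which allows the use of this macro:
with a ; after the call, like a function: kthread_init_delayed_work(dwork, fn);
as the body of an if or while without braces: if (x) kthread_init_delayed_work(dwork, fn); else foo(x);
Note that a do ... while (0) block is a statement, not an expression: it has no value. Using it as a condition or as an argument, as in if (kthread_init_delayed_work(dwork, fn)) foo(x); or bar(kthread_init_delayed_work(dwork, fn));, does not compile. Macros that must produce a value use gcc statement expressions, ({ ... }), instead.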
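A multi-statement macro wrapped in do ... while (0) can be used like a function call, including in an if/else without braces. The sketch below illustrates this with a hypothetical struct counter and counter_reset() macro.

```c
#include <assert.h>
#include <stddef.h>

struct counter {		/* hypothetical structure for the example */
	int hits;
	int resets;
};

/* Two statements wrapped so that the macro expands to ONE statement */
#define counter_reset(c)		\
	do {				\
		(c)->hits = 0;		\
		(c)->resets++;		\
	} while (0)

static void counter_update(struct counter *c, int overflow)
{
	assert(c != NULL);

	if (overflow)
		counter_reset(c);	/* no braces needed, ';' ends the statement */
	else
		c->hits++;
}
```

With a plain { ... } block instead, the ; after counter_reset(c) would end the if statement and the else would no longer compile.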
Kernel stack is small compared to user stack!
Stack size is statically defined at kernel compile time, cannot grow dynamically.
Usually fits on a few pages:
What to avoid?
Avoid floating point operations at all cost!
Why?
Extremely costly!
Not very useful!
gcc
Making changes to your kernel can render it unstable and lead to a kernel panic, i.e., a full system crash.
Keep a backup kernel
Never replace your running kernel with a new one!
Always keep a fully working backup kernel installed in your bootloader!
Work in modules
Always implement your changes as modules if possible.
Important
Keep in mind that a bug can corrupt persistent data, e.g. on your hard drive. You could lose data for good if you work directly on your system!
Tip
Working in a virtual machine alleviates most of these issues!
In the kernel, you won’t have access to the usual libraries like the libc.
Thankfully, the kernel provides its own internal “library” with basic functionalities.
These are described in Documentation/core-api/index.rst.
Let’s make a quick tour of some of these functionalities!
To ensure portability across architectures, the kernel offers generic types defined in include/linux/types.h
Functions in the kernel follow the same convention as system calls by returning an integer:
If the function returns a pointer:
Return NULL if there is only one reason to fail
Otherwise, return the error code encoded with the ERR_PTR() macro.
The calling function can check if there was an error with IS_ERR() and get the error code with PTR_ERR().
int do_shash(unsigned char *name, unsigned char *result, const u8 *data1, unsigned int data1_len,
const u8 *data2, unsigned int data2_len, const u8 *key, unsigned int key_len)
{
int rc;
unsigned int size;
struct crypto_shash *hash;
struct sdesc *sdesc;
hash = crypto_alloc_shash(name, 0, 0);
if (IS_ERR(hash)) {
rc = PTR_ERR(hash);
pr_err("%s: Crypto %s allocation error %d\n", __func__, name, rc);
return rc;
}
/* ... */
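The pointer-or-error convention can be sketched in user space. The three helpers below are simplified re-implementations written for this example (the real ones live in include/linux/err.h), and alloc_counter() is a hypothetical function. The idea is that the last 4095 addresses are never valid pointers, so they can carry a negative error code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO	4095

/* Encode a negative error code as a pointer value */
static inline void *ERR_PTR(long error)
{
	assert(error < 0 && error >= -MAX_ERRNO);
	return (void *)error;
}

/* Decode the error code from a pointer for which IS_ERR() is true */
static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

/* True if the pointer lies in the last MAX_ERRNO addresses */
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical allocator with two distinct reasons to fail */
static int *alloc_counter(long initial)
{
	int *c;

	if (initial < 0)
		return ERR_PTR(-EINVAL);

	c = malloc(sizeof(*c));
	if (!c)
		return ERR_PTR(-ENOMEM);

	*c = (int)initial;
	return c;
}
```

The caller checks the result with IS_ERR() before dereferencing it, exactly like the crypto_alloc_shash() call above.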
If you need to print information to be available from user space, e.g., tracing or debugging, you can use the printk() function.
It works similarly to printf(), with a couple of differences:
Messages have a log level, defined in include/linux/kern_levels.h, from KERN_EMERG to KERN_DEBUG.
There are also predefined macros for each level:
Tip
Formats are available at Documentation/core-api/printk-formats.rst.
Memory allocation is done with the kmalloc() function, similar to malloc().
Some specific characteristics:
Fast (except if blocked waiting for pages)
Allocated memory is not initialised
Allocated memory is contiguous in physical memory
Memory is allocated by areas of \(2^n - k\) bytes (\(k\): a few metadata bytes).
Do not allocate 1024 B if you need 1000 B, you will end up with 2048 B!
Example:
kmalloc GFP flags
The second parameter of kmalloc() is a Get Free Pages (GFP) flag:
More combinations are available in include/linux/gfp_types.h.
If you need large chunks of memory, you should not use kmalloc(), but request pages directly with one of these functions:
returns a pointer to a free page after filling it with zeros
returns a pointer to a free page
returns a pointer to a memory area with \(2^{order}\) contiguous pages
Virtual allocation
If you don’t need the memory to be contiguous, you can allocate in the virtual address space instead of physical:
If you need to wait for a resource (e.g., network packet, message), the interface should implement a wait queue to allow your thread to sleep and be woken up when the resource is available.
thread sleeps and will be woken up if wake_up() is called on the wait queue and the condition is true
same as wait_event(), but the thread can also be woken up by a signal
The resource handler calls wake_up() on the queue to wake up waiting threads.
Workqueues allow you to execute code asynchronously.
At creation time, a pool of threads is initialised.
Jobs can then be submitted in the form of a function pointer and a pointer to an argument.
A thread from the workqueue will, asynchronously, check the queue, pop a job and execute it.
Tip
Documentation available in Documentation/core-api/workqueue.rst.
The kernel also offers generic data structures to work with:
Important
Generic data structures in C are not obvious to build…
Instead of having objects in a list, we have the list in the objects!
The “naive” version:
Not generic! You need one list type per object type.
When you iterate over the list, how do you get the containing object?
From the address of any member in a structure, how can we get the address of the structure?
e.g., from the address of a list_head element in a structure
Linux implements the container_of macro!
/**
* container_of - cast a member of a structure out to the containing structure
* @ptr: the pointer to the member.
* @type: the type of the container struct this is embedded in.
* @member: the name of the member within the struct.
*
* WARNING: any const qualifier of @ptr is lost.
*/
#define container_of(ptr, type, member) ({ \
void *__mptr = (void *)(ptr); \
static_assert(__same_type(*(ptr), ((type *)0)->member) || \
__same_type(*(ptr), void), \
"pointer type mismatch in container_of()"); \
((type *)(__mptr - offsetof(type, member))); })
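The macro can be exercised in user space with a simplified version that drops the type check. In the sketch below, struct list_node, struct task and task_of() are hypothetical names used for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified version: no type checking, unlike the kernel macro */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct list_node {		/* hypothetical embedded member type */
	struct list_node *next;
};

struct task {			/* hypothetical containing structure */
	int pid;
	struct list_node node;
};

/* From the address of the embedded node, get back the enclosing task */
static struct task *task_of(struct list_node *n)
{
	assert(n != NULL);
	return container_of(n, struct task, node);
}
```

The standard offsetof() macro from stddef.h gives the offset of the member, which is subtracted from the address of the member to get the address of the structure.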
After expanding all macros and dropping the type check, this looks like this:
#define offset_of(type, member) \
	((size_t)&((type *)0)->member)
#define container_of(ptr, type, member) \
	((type *)((void *)(ptr) - offset_of(type, member)))
offset_of: cast the address 0 to type * and take the address of the member, which gives its offset in the structure
container_of: subtract this offset from the address of the member
For each generic data structure, the kernel provides helpers to use them.
Let’s see examples for circular doubly linked lists (list_head from include/linux/list.h):
Allocators:
Insert/delete:
And a lot more!
Tip
You can find similar helpers for all generic data structures.
Go check them out in the kernel sources!
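To see how such helpers fit together, here is a user space sketch of a circular doubly linked list. The helper names mirror the ones in include/linux/list.h, but the bodies are simplified re-implementations written for this example, and struct item is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Get the object that embeds the list_head (same idea as container_of) */
#define list_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Iterate over the list, stopping when we are back at the head */
#define list_for_each(pos, head) \
	for (pos = (head)->next; pos != (head); pos = pos->next)

/* An empty list is a head that points to itself */
static inline void INIT_LIST_HEAD(struct list_head *list)
{
	list->next = list;
	list->prev = list;
}

/* Insert a new element just before the head, i.e., at the tail */
static inline void list_add_tail(struct list_head *new, struct list_head *head)
{
	new->prev = head->prev;
	new->next = head;
	head->prev->next = new;
	head->prev = new;
}

/* Unlink an element (this sketch resets its pointers to NULL) */
static inline void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = NULL;
	entry->prev = NULL;
}

struct item {			/* hypothetical object: the list is in the object */
	int value;
	struct list_head link;
};

/* Walk the list and sum the values of the containing objects */
static int sum_items(struct list_head *head)
{
	struct list_head *pos;
	int sum = 0;

	assert(head != NULL);
	list_for_each(pos, head)
		sum += list_entry(pos, struct item, link)->value;

	return sum;
}
```

The list code never needs to know about struct item: any structure that embeds a struct list_head can be linked with the same helpers.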
Resources/objects can be accessed concurrently in the kernel.
There are two reasons this can happen:
Possible solutions:
Concurrency problems can arise due to preemption both on single- and multi-core systems.
In the case of single-core CPUs, it can be solved solely by disabling interrupts, making the kernel code non-preemptible.
In Linux, you can use the following macros:
local_irq_disable(): this uses the proper assembly instruction to disable interrupts on the current core, e.g., cli on x86.
local_irq_enable(): this uses the proper assembly instruction to enable interrupts on the current core, e.g., sti on x86.
On multi-core systems, disabling interrupts is not sufficient, as other cores might also access data concurrently.
One potential solution is to serialise all kernel code, allowing only one thread at a time to execute code in supervisor mode.
This was the initial solution used in Linux when support for multi-core CPUs was added.
The Big Kernel Lock (BKL) was taken when entering the kernel and released when exiting.
Only one thread at a time was running kernel code.
Pro: Extremely simple to implement and safe
Con: Large performance degradation due to the loss of parallelism for kernel code
Linux and the BKL
Linux had a Big Kernel Lock from the introduction of Symmetric Multi-Processor (SMP) in version 2.0 in 1996 until its removal in 2.6.39 in 2011.
Since the BKL removal, fine-grained synchronisation mechanisms are used in the kernel.
A non-exhaustive list of synchronisation mechanisms and their (partial) API:
Note
Most of these have variations that also disable interrupts when taking a lock.
Concurrent access problems can also be solved by using atomic operations in some cases.
Atomic operations are architecture-specific, and are defined in include/linux/atomic/atomic-instrumented.h.
These operations should be used on a specific type, atomic_t, to represent the atomic variable.
You can find the usual atomic operations, for example:
void atomic_add(int i, atomic_t *v);
void atomic_dec(atomic_t *v);
void atomic_or(int i, atomic_t *v);
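The shape of this API can be imitated in user space. The sketch below is an assumption-laden stand-in, not the kernel implementation: it keeps the same function names and the same atomic_t layout, but builds them on the gcc/clang __atomic builtins instead of architecture-specific code.

```c
#include <assert.h>
#include <stddef.h>

/* Wrapping the int in a struct prevents accidental non-atomic accesses */
typedef struct {
	int counter;
} atomic_t;

/* User space stand-ins built on compiler builtins (assumption of this sketch) */
static inline void atomic_add(int i, atomic_t *v)
{
	__atomic_fetch_add(&v->counter, i, __ATOMIC_RELAXED);
}

static inline void atomic_dec(atomic_t *v)
{
	__atomic_fetch_sub(&v->counter, 1, __ATOMIC_RELAXED);
}

static inline void atomic_or(int i, atomic_t *v)
{
	__atomic_fetch_or(&v->counter, i, __ATOMIC_RELAXED);
}

static inline int atomic_read(const atomic_t *v)
{
	assert(v != NULL);
	return __atomic_load_n(&v->counter, __ATOMIC_RELAXED);
}
```

Each call is a single indivisible read-modify-write, so two threads calling atomic_add() on the same variable cannot lose an update, which a plain v->counter += i could.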
The kernel coding style is defined in Documentation/process/coding-style.rst.
It defines a set of rules that will be enforced when a patch is submitted:
Tip
You can check if your patches are valid with regard to the coding style with the scripts/checkpatch.pl script!
Indentation is done with tabs, not spaces.
Tabs are 8 characters long.
For better readability, the preferred limit on the length of a line is 80 characters.
However, never break user-visible strings, as it also breaks the ability to grep them.
Since 2020, checkpatch.pl only complains about lines longer than 100 characters.
From the coding style documentation:
Now, some people will claim that having 8-character indentations makes the code
move too far to the right, and makes it hard to read on a 80-character terminal
screen. The answer to that is that if you need more than 3 levels of
indentation, you're screwed anyway, and should fix your program.
Philosophy of function-versus-keyword usage.
No spaces after functions
Spaces after keywords (except if they are used like functions, e.g., sizeof): if, switch, case, for, do, while
For operators:
Spaces on both sides of binary and ternary operators
= + - < > * / % | & ^ <= >= == != ? :
No space after unary operators or before/after postfix/prefix increment and decrement
& * + - ~ ! ++ --
No space around structure member operators
. ->
No trailing spaces at the end of lines
From the official kernel website, kernel.org, download the tarball archive.
Or from the command line, for example:
You can (and should) also check the integrity of the tarball with the PGP signature:
$ wget https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-6.5.7.tar.sign
$ unxz linux-6.5.7.tar.xz # the signature is done on the decompressed tarball
$ gpg --verify linux-6.5.7.tar.sign linux-6.5.7.tar
This will probably fail because you don’t have the public keys of the maintainers that generated the tarball.
Get them from the kernel’s key server (documentation):
The kernel configuration describes the features that will be enabled in the built binary, and can also change their behaviour.
It also describes whether features should be built into the binary or compiled as modules.
By default, the Makefile-based build system uses the file .config located at the root of the kernel sources.
You can generate initial configurations with the following commands (non-exhaustive list):
$ make allnoconfig # minimal, everything that can be disabled is disabled
$ make defconfig # default configuration for the local architecture
$ make localmodconfig # configuration based on the current state of the machine (plugged devices, etc.) and builds them as modules
$ make localyesconfig # same but everything is built-in
$ make oldconfig # keeps the values of the current .config and asks for the new options
Or you can copy the configuration of your running kernel:
$ cp /boot/config-$(uname -r)* .config # available on some distros
$ zcat /proc/config.gz > .config # available if CONFIG_IKCONFIG_PROC is enabled
If you need to know more about your hardware to generate your config, check out these commands:
lshwd, lscpu, lspci, lsusb, …
dmidecode
hdparm
cat /proc/cpuinfo, cat /proc/meminfo, …
dmesg
The kernel build system is based on Makefiles.
Just run make to compile it.
The compilation produces the following important files:
vmlinux: the raw Linux kernel image. This ELF is used for debugging and profiling;
System.map: symbol table of the kernel. Not necessary to run the kernel, used for debugging;
arch/<arch>/boot/bzImage: compressed image of the kernel. This is the one that will be loaded and used.
Two main steps:
This will copy the image and symbol map in /boot, and generate the initramfs.
The symbol map (System.map) provides the list of the symbols available in this kernel, their address and type.
0000000000000000 D __per_cpu_start
0000000000000000 D fixed_percpu_data
0000000000001000 D cpu_debug_store
0000000000002000 D irq_stack_backing_store
0000000000006000 D cpu_tss_rw
000000000000b000 D gdt_page
000000000000c000 d exception_stacks
0000000000014000 d entry_stack_storage
0000000000015000 D espfix_waddr
0000000000015008 D espfix_stack
Check the manpage of the nm program for an explanation of the types.
As a rule of thumb (mostly true), lowercase means local scope while uppercase means global scope (i.e., exported symbol).
In Lab 2, task 3, you were asked to replace the init binary with a hello_world program, which led to a kernel panic. Why?
Roles of init
Characteristics of init
Demo time!
Multiple development methods:
In this course, we will use the last method with QEMU as a hypervisor.
A module is a library dynamically loaded into the kernel. It triggers a call to a registered function when loaded and when unloaded.
The kernel provides two macros to register these functions: module_init() and module_exit().
For these to work, you will need some header files included:
You should also add some information about your module with some pre-defined macros, usually at the beginning of the file:
MODULE_DESCRIPTION("Hello world module");
MODULE_AUTHOR("Redha Gouicem, RWTH");
MODULE_LICENSE("GPL");
These can be checked on any module:
Warning
The license is not only informative. It is also used to check if you are allowed to use some symbols in the kernel.
#include <linux/module.h>
#include <linux/init.h>
#include <linux/kernel.h>
MODULE_DESCRIPTION("Hello world module");
MODULE_AUTHOR("Redha Gouicem, RWTH");
MODULE_LICENSE("GPL");
static int __init hello_init(void)
{
pr_info("Hello World!\n");
return 0;
}
module_init(hello_init);
static void __exit hello_exit(void)
{
pr_info("Goodbye World...\n");
}
module_exit(hello_exit);
Annotations
The __init and __exit annotations are used to help the compiler optimize the memory usage.
When a module is statically built into the kernel binary, functions tagged with these annotations are placed in specific sections:
- .init.text, which is freed after the kernel has booted
- .exit.text, which is never loaded in memory
The running kernel is deployed with a generic Makefile located in /lib/modules/$(uname -r)/build.
You can use it from anywhere like this:
This will generate your module as a .ko file (kernel object).
Loading a module can be done with insmod:
#include <linux/init.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
static char *month = "January";
module_param(month, charp, 0660);
static int day = 1;
module_param(day, int, 0000);
static int __init hello_init(void)
{
pr_info("Hello ! We are on %d %s\n", day, month);
return 0;
}
module_init(hello_init);
static void __exit hello_exit(void)
{
pr_info("Goodbye, cruel world\n");
}
module_exit(hello_exit);
Like shared libraries, modules are dynamically loaded: they only have access to symbols explicitly exported to them!
By default, they have access to absolutely no variable or function from the kernel, even if they are not static!
Two macros let you explicitly export symbols to modules:
- EXPORT_SYMBOL(s) makes the symbol s visible to all loaded modules
- EXPORT_SYMBOL_GPL(s) makes the symbol s visible to all modules with a license compatible with GPL (according to their MODULE_LICENSE)
Example: using the pm_power_off() function exported in arch/x86/kernel/reboot.c and available on my system:
If a module X uses at least one symbol from module Y, then X depends on Y.
Dependencies are not explicitly defined: they are automatically inferred during the kernel/module compilation.
You can find the list of dependencies in the file /lib/modules/<version>/modules.dep.
This file is generated by the depmod program, which checks which symbols are used by a module, and which modules provide these symbols.
You can also check the dependencies of a module with modinfo.
Automated dependency solving
Obviously, modules must be inserted in the proper order: if X depends on Y, Y needs to be inserted before X.
If you are using modprobe, it will automatically insert dependencies first.
This is also true for unloading modules (in the reverse order).
When developing something in the kernel, the first design choice is “how?”
You have two choices:
Whenever possible, modules are the best choice, as they have more chances to be merged in the mainline.
While using modules should be your first choice, it also has some drawbacks depending on what you are doing.
Pros:
Cons:
If you need to modify the kernel and distribute your changes, you most likely will use patches.
A patch is the result of the diff command applied on the original files and your modified version.
It contains all the data the patch command needs to automatically apply the changes.
- diff: compares files line by line
- patch: applies changes to existing files
Creating a patch for the kernel tree:
- -r enables recursive comparison of directories
- -u uses the unified diff format (more compact and easier to read)
unxz linux-6.5.7.tar.xz
tar xf linux-6.5.7.tar
cp -r linux-6.5.7 linux-6.5.7-orig
cd linux-6.5.7
emacs kernel/sched/fair.c
emacs kernel/sched/sched.h
cd ..
diff -r -u linux-6.5.7-orig linux-6.5.7 > new_sched.patch
xz new_sched.patch
When you think your code is ready for review by the kernel maintainer, you need to send it to them!
Note: This is just an overview of the process!
1. Ready your code for public eyes
2. Prepare your patch(es): use diff, or use git format-patch to automatically generate patches from your commits (assuming you used git in the first place)
3. Format your patch series for emailing with git format-patch
4. Send your patch series to the mailing list: run the scripts/get_maintainer.pl script on your patch to find the recipients. You should also CC anyone who might need to see this, e.g., people working on something similar. Send the series with git send-email
Don’t rely on these slides only!
Go check the full version in the kernel documentation, starting with the kernel development process and patch submission process.
You can check the mailing list online at https://lore.kernel.org.
In monolithic architectures, kernel and user programs are run in different permission levels:
supervisor mode and user mode.
There needs to be communication between the user and the kernel.
Multiple mechanisms are available in the Linux kernel, and choosing the right one is a common discussion on the mailing list.
Communication mechanisms:
An in-memory file system, or ramfs, is a file system not backed by a storage device.
The data stored on it is completely in memory or computed when accessed.
The Linux kernel provides a set of pseudo file systems that are ramfs representing kernel data or configuration.
They usually have a semantic where one file represents one value.
Each pseudo file system has slightly different semantics and answers different needs: procfs, sysfs, configfs, debugfs, …
Advantage: User space programs can access these files with the standard POSIX file API: read and write.
You can thus use regular shell programs such as cat and echo to read/write from/to them.
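As a minimal sketch of what cat does under the hood, the following user-space helper reads a pseudo file with open() and read(). The function name read_pseudo_file() is hypothetical, and the sketch assumes the value fits in a single read, which is typical for one-value-per-file pseudo files.

```c
#include <fcntl.h>
#include <unistd.h>

/* Read up to size - 1 bytes from a pseudo file and NUL-terminate the result.
 * Returns the number of bytes read, or -1 on error. */
static ssize_t read_pseudo_file(const char *path, char *buf, size_t size)
{
    int fd;
    ssize_t n;

    if (size == 0)
        return -1;
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    n = read(fd, buf, size - 1); /* the kernel fills buf when we read */
    close(fd);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return n;
}
```

For example, calling read_pseudo_file("/proc/version", buf, sizeof(buf)) on a Linux system with procfs mounted should fill buf with the kernel version string.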
Drawback: These mechanisms are synchronous from user to kernel, but asynchronous in the other direction,
i.e., user space applications cannot be notified when the value represented by a file changes in memory.
A pseudo file system is a component of the kernel that has to be enabled and mounted before being used.
The oldest pseudo file system, mounted in /proc.
In lab 2, we saw that the procfs was mounted by the init script.
Goal: Export information about processes.
Since its creation, it has been used for more than that, e.g., exporting kernel data from various subsystems.
Using it is now discouraged for anything unrelated to processes.
Advantage: most widely documented
Drawback: no real structure enforced
The procfs provides two APIs:
- A simple API, limited to small amounts of data (PAGE_SIZE, usually 4 KB)
- The seq_file API: more complex, but allows larger data to be exported with a list of buffers
Let’s see an example procfs file that exports a variable from the kernel in human-readable form in /proc/my_state.
We want to export the value of the system_enabled global variable.
Steps:
- Write the read() function that will be called when our file is read from
- Declare the struct file_operations that will be used for our file
- Create the file with the struct file_operations we just created in the init function of your module
static int system_enabled;
static ssize_t system_state_read(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
{
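const char *tmp = system_enabled ? "The system is enabled\n"
: "The system is disabled\n";
return simple_read_from_buffer(buf, count, ppos, tmp, strlen(tmp));
}
/* Note: since Linux 5.6, proc_create() expects a struct proc_ops
 * (.proc_open, .proc_read, .proc_lseek) instead of a struct file_operations. */
static const struct file_operations system_state_fops = {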
.open = simple_open,
.read = system_state_read,
.llseek = noop_llseek,
};
static struct proc_dir_entry *system_state_proc_dir;
static int system_state_init(void)
{
system_state_proc_dir = proc_create("my_state", 0, NULL,
&system_state_fops);
return 0;
}
module_init(system_state_init);
static void system_state_exit(void)
{
remove_proc_entry("my_state", NULL);
}
module_exit(system_state_exit);
Successor to the procfs, mounted in /sys.
Goal: Store information about subsystems, hardware devices, drivers, …
This should be the default choice!
Advantages:
Cons:
- The data exported by one file is limited in size (PAGE_SIZE)
The struct kobject is at the heart of the sysfs:
Most important fields:
struct kobject {
const char *name; // name of the directory
struct kobject *parent; // kobject of the parent directory
struct kset *kset; // the collection of kobjects this object belongs to
struct kref kref; // reference counter, used to free the memory properly
const struct kobj_type *ktype; // type of the object, with functions pointers to manipulate it
/* ... */
};
To create a kobject and add it to the sysfs, you can use this function:
Caution
This works for simple cases (most likely what you want to do). For more complex scenarios, there are other functions to initialize, create and register kobjects. Check out the documentation for more information!
Kobject attributes
Each file in the sysfs corresponds to one single value and is associated with an instance of a struct kobj_attribute.
struct kobj_attribute {
struct attribute attr; // file information (name, permissions)
ssize_t (*show)(struct kobject *kobj, struct kobj_attribute *attr, char *buf);
ssize_t (*store)(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count);
};
Kobject attributes can be created with the following macro:
You can also create groups of attributes to have multiple files in the same directory:
We want to export the state of our system represented by an int system_enabled global variable in a human-readable form in the file located at /sys/kernel/my_state/system_enabled.
static int system_enabled;
static ssize_t system_state_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
{
return snprintf(buf, PAGE_SIZE, "The system is %srunning\n", system_enabled ? "" : "not ");
}
static struct kobj_attribute system_state_attribute = __ATTR(system_enabled, 0400, system_state_show, NULL);
static struct kobject *my_state_kobj;
Now, we need to instantiate the sysfs file when loading our module and destroy it when unloading.
static int __init my_state_init(void)
{
int retval;
my_state_kobj = kobject_create_and_add("my_state", kernel_kobj);
if (!my_state_kobj)
goto error_init_1;
retval = sysfs_create_file(my_state_kobj, &system_state_attribute.attr);
if (retval)
goto error_init_2;
return 0;
error_init_2:
kobject_put(my_state_kobj);
error_init_1:
return -ENOMEM;
}
module_init(my_state_init);
static void __exit my_state_exit(void)
{
kobject_put(my_state_kobj);
}
module_exit(my_state_exit);
static int system_enabled;
static u64 clock;
static ssize_t system_state_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
{
return snprintf(buf, PAGE_SIZE, "The system is %srunning\n", system_enabled ? "" : "not ");
}
static ssize_t clock_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
{
return snprintf(buf, PAGE_SIZE, "%llu\n", clock++);
}
static ssize_t clock_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count)
{
u64 val;
int rc = kstrtou64(buf, 10, &val); // stricter than sscanf(): rejects trailing garbage and overflows
if (rc)
return rc;
clock = val;
return count;
}
static struct kobj_attribute system_state_attribute = __ATTR(system_enabled, 0400, system_state_show, NULL);
static struct kobj_attribute clock_attribute = __ATTR(clock, 0600, clock_show, clock_store);
static struct attribute *attrs[] = {
&system_state_attribute.attr,
&clock_attribute.attr,
NULL,
};
static struct attribute_group attr_grp = { .attrs = attrs };
static struct kobject *my_state_kobj;
static int __init my_state_init(void)
{
int retval;
my_state_kobj = kobject_create_and_add("my_state", kernel_kobj);
if (!my_state_kobj)
goto error_init_1;
retval = sysfs_create_group(my_state_kobj, &attr_grp);
if (retval)
goto error_init_2;
return 0;
error_init_2:
kobject_put(my_state_kobj);
error_init_1:
return -ENOMEM;
}
module_init(my_state_init);
static void __exit my_state_exit(void)
{
kobject_put(my_state_kobj);
}
module_exit(my_state_exit);
configfs is another ram-based file system offering the converse functionality to sysfs, mounted in /sys/kernel/config.
While the sysfs is a view of kernel objects, configfs is a manager of kernel objects,
i.e., it allows creating/destroying kernel objects from user space.
Example: Allowing user programs to create and configure virtual network devices by doing a mkdir in the configfs directory of a driver.
Advantage:
Drawbacks:
The debugfs offers a very flexible and simple API targeted at simplifying kernel development and debugging.
It is mounted in /sys/kernel/debug.
Advantages:
Drawback:
static u8 system_state; // u8 to match debugfs_create_u8()
static struct dentry *my_state_dir;
static int my_state_init(void)
{
my_state_dir = debugfs_create_dir("my_state", NULL);
// in recent kernels, debugfs_create_dir() returns an ERR_PTR() on failure, not NULL
if (IS_ERR(my_state_dir))
return PTR_ERR(my_state_dir);
// debugfs_create_u8() returns void since Linux 5.5, so there is nothing to check
debugfs_create_u8("system_state", 0444, my_state_dir, &system_state);
return 0;
}
module_init(my_state_init);
static void my_state_exit(void)
{
debugfs_remove_recursive(my_state_dir);
}
module_exit(my_state_exit);
Pseudo file system-based mechanisms are:
The kernel provides various synchronous communication mechanisms:
System calls are the most classic user-kernel communication mechanism.
They allow user space applications to execute privileged kernel code:
They are the core API of the kernel, with some limitations:
There are two ways of making system calls:
- The software interrupt way
- The syscall instruction way (x86, amd64, ARM64)
System calls are a software interrupt like the others:
1. Place the system call number in a register
2. Place the arguments in the proper registers and/or on the stack
3. Trigger the “system call” interrupt (switching to supervisor mode)
4. Jump to the “system call” interrupt handler
5. Load the system call table and jump to the index given by the syscall number
6. Execute the system call handler
7. Return the result to user space (switching back to user mode)
Some architectures provide a specific instruction (syscall, svc, sysenter, …):
1. Place the system call number in a register
2. Place the arguments in the proper registers and/or on the stack
3. Use the system call instruction (switching to supervisor mode)
4. Jump to the index given by the syscall number in the syscall table
5. Execute the system call handler
6. Return the result to user space (switching back to user mode)
One level of indirection is bypassed!
Tip
You can find a very detailed explanation (with fewer omissions) of how system calls work in Linux here!
API: Application Programming Interface
High-level interface for programmers (function prototypes, data types, …)
ABI: Application Binary Interface
Low-level interface for compilers/OS (calling conventions, architecture-specific)
Architecture | syscall# | retval | arg1 | arg2 | arg3 | arg4 | arg5 | arg6 | arg7 |
---|---|---|---|---|---|---|---|---|---|
Arm EABI | r7 | r0 | r0 | r1 | r2 | r3 | r4 | r5 | r6 |
arm64 | w8 | x0 | x0 | x1 | x2 | x3 | x4 | x5 | |
mips | v0 | v0 | a0 | a1 | a2 | a3 | a4 | a5 | |
riscv | a7 | a0 | a0 | a1 | a2 | a3 | a4 | a5 | |
x86-64 | rax | rax | rdi | rsi | rdx | r10 | r8 | r9 | |
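From user space, you rarely write these registers by hand. As a small sketch, the libc syscall() wrapper takes the system call number and arguments and places them according to the ABI of the current architecture. The helper name raw_getpid() is hypothetical; SYS_getpid comes from <sys/syscall.h>.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Invoke the getpid system call by number, bypassing the getpid() wrapper.
 * syscall() loads the number and arguments into the ABI registers,
 * triggers the switch to supervisor mode, and returns the result. */
static pid_t raw_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}
```

The value returned by raw_getpid() should match the one returned by the regular getpid() wrapper.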
How do we add a system call to Linux?
We’ll see that a bit later, when you’ll need to do it for an exercise.
ioctls are a way to provide custom system calls, mainly for device drivers.
They are called from user space by the ioctl system call and provide an additional level of indirection, with each ioctl having a number.
Advantages:
Drawback:
An ioctl is tied to a file. It is registered as a file operation similar to read or write in the kernel.
For a given device, each ioctl has a number generated through a set of macros.
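As a rough illustration of these macros, the sketch below builds two ioctl numbers for an imaginary driver and decodes them again. The magic letter 'k', the command names MY_STATE_GET and MY_STATE_SET and the helper function are all hypothetical; the _IOR, _IOW and _IOC_* macros are the standard ones, available in user space through <sys/ioctl.h>.

```c
#include <sys/ioctl.h>

/* Hypothetical "magic" type identifying our imaginary driver. */
#define MY_STATE_MAGIC 'k'

/* Command 1 reads an int from the kernel, command 2 writes an int to it.
 * The macros pack direction, argument size, type and number in one integer. */
#define MY_STATE_GET _IOR(MY_STATE_MAGIC, 1, int)
#define MY_STATE_SET _IOW(MY_STATE_MAGIC, 2, int)

/* A driver can use the type field to reject commands meant for another device. */
static int ioctl_belongs_to_my_driver(unsigned int cmd)
{
    return _IOC_TYPE(cmd) == MY_STATE_MAGIC;
}
```

Encoding the direction and size in the number lets both sides sanity-check a request before touching the argument.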
How can we use ioctls?
We won’t see the API in detail here, you will have a task solely on that in the next lab.
The kernel also provides a socket-based interface with user space called netlink.
It is similar to regular sockets in user space, but with the AF_NETLINK socket family.
Advantage:
Drawback:
Pages are the basic unit of memory management.
Even if memory is byte/word addressable, the smallest management unit is the page, to accommodate fast lookups and address translations.
Some definitions:
A page (or virtual page) is a fixed-size block of contiguous virtual memory.
A page frame (or physical page) is a fixed-size block of contiguous physical memory.
Page size depends on the architecture, some of them even support multiple sizes.
Architecture | 4 KiB | 16 KiB | 64 KiB | 2 MiB | 4 MiB | 32 MiB | 512 MiB | 1 GiB |
---|---|---|---|---|---|---|---|---|
x86_64 | X | | | X | | | | X |
armv7 | X | | X | | | | | |
aarch64 | X | X | X | X | | X | X | X |
riscv32 | X | | | | X | | | |
riscv64 | X | | | X | | | | X |
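To check which base page size your own system uses, a user-space program can ask through sysconf(). This is a small sketch; the helper names base_page_size() and is_power_of_two() are hypothetical.

```c
#include <unistd.h>

/* Base page size of the running system, in bytes (4096 on most x86_64 setups). */
static long base_page_size(void)
{
    return sysconf(_SC_PAGESIZE);
}

/* Page sizes are powers of two, which keeps address translation cheap. */
static int is_power_of_two(long value)
{
    return value > 0 && (value & (value - 1)) == 0;
}
```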
The kernel maintains a struct page for each page frame available on the system.
The structure is defined in include/linux/mm_types.h.
Here is a simplified definition of the structure:
struct page {
unsigned long flags; // page status, flags available in include/linux/page-flags.h
atomic_t _refcount; // number of references to this frame
/* page cache and anonymous pages */
atomic_t _mapcount; // number of page tables this frame is mapped in
struct address_space *mapping; // if used in page cache, object associated to this frame
struct list_head lru; // least-recently used list for eviction
void *virtual; // kernel virtual address when not kmapped, used when using high memory
};
Since there is one struct page per physical page, isn’t this a lot of memory for metadata?
In practice, a struct page is only around 40 bytes (lots of unions in there).
So, in a system with 16 GiB of memory and 4 KiB pages, you will have: \(\frac{17,179,869,184}{4,096} = 4,194,304\) pages.
Which means only \(4,194,304~\text{pages} \times 40~\text{B} = 160~\text{MiB}\), or roughly 1% of the total memory.
You can reduce this metadata footprint by increasing the page size, i.e., using huge pages.
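The arithmetic above can be replayed with a short helper, which also makes it easy to try other page sizes. This is only a sketch: the function names are hypothetical, and the 40-byte struct page size is the figure assumed in the text, not a value read from a real kernel.

```c
#include <stdint.h>

/* Number of page frames needed to cover mem_bytes of physical memory. */
static uint64_t page_frame_count(uint64_t mem_bytes, uint64_t page_size)
{
    return mem_bytes / page_size;
}

/* Memory consumed by the struct page array, one entry per page frame. */
static uint64_t page_metadata_bytes(uint64_t mem_bytes, uint64_t page_size,
                                    uint64_t struct_page_size)
{
    return page_frame_count(mem_bytes, page_size) * struct_page_size;
}
```

With 16 GiB of memory, 4 KiB pages give 160 MiB of metadata, while 2 MiB pages would only need 8,192 entries, i.e., 320 KiB under the same assumption.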
Not all addresses are equal in hardware, so not all frames are treated identically.
The kernel separates pages in multiple zones with different properties.
The two main hardware limitations that require zones are:
Zone | Description | Physical Memory |
---|---|---|
ZONE_DMA | Contains frames that can be accessed through DMA | \(\lt\) 16 MiB |
ZONE_NORMAL | Permanently mapped pages (DMA also possible) | 16–896 MiB |
ZONE_HIGHMEM | Memory not permanently mapped into the kernel’s address space | \(\gt\) 896 MiB |
64-bit architectures
On architectures that can address large address spaces, e.g., 64-bit, ZONE_HIGHMEM usually does not exist, and all memory is split between ZONE_DMA and ZONE_NORMAL.
Quick recap of the memory management API in the kernel:
struct page *alloc_pages(gfp_t gfp_mask, unsigned int order); // allocate 2^order contiguous physical pages, return the first page
void *page_address(struct page *page); // get the kernel virtual address of the page from its struct page
unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order); // same as alloc_pages(), but returns the address directly
struct page *alloc_page(gfp_t gfp_mask); // wrapper to allocate a single page
unsigned long __get_free_page(gfp_t gfp_mask); // wrapper to allocate a single page and get the actual address
unsigned long get_zeroed_page(gfp_t gfp_mask); // same as __get_free_page(), but the page is filled with zeros
You might have noticed that most of the memory API works at the page granularity.
There are multiple levels of allocators:
The page frame allocator is responsible for managing the physical memory, giving page frames to allocators upon request, and reclaiming them when freed.
Page order
You might have also noticed that the page granularity API uses orders (powers of two) for allocations:
struct page *alloc_pages(gfp_t gfp_mask, unsigned int order);
Orders come from the inner working of the page frame allocator of Linux: the buddy allocator.
Linux uses the buddy allocator to manage physical page frame allocation.
Initially, physical memory is seen as a single large chunk containing all page frames, e.g., \(2^8 = 256\) pages.
Allocation:
If a chunk of the correct size is available, it is directly chosen
Free:
If two neighbors are free but not buddies, they cannot be merged!
Linux implements the buddy allocator with an array of free lists, indexed by order.
Examples:
alloc_pages(GFP_KERNEL, 3);
alloc_pages(GFP_KERNEL, 4);
When a chunk of a given order is freed, it is added back to the free list of that order.
The kernel also maintains bitmaps to identify buddies: if both buddies are free, they are merged and stored in the order + 1 list.
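The buddy relationship boils down to simple bit arithmetic on page frame numbers. The sketch below is a user-space illustration, not the kernel's code: all function names are hypothetical, and it assumes that a chunk of a given order starts at a frame number aligned on 2^order.

```c
/* Frame number of the buddy of the chunk starting at pfn:
 * both chunks differ only in bit number `order`. */
static unsigned long buddy_of(unsigned long pfn, unsigned int order)
{
    return pfn ^ (1UL << order);
}

/* Two chunks of the same order can only be merged if they are buddies. */
static int are_buddies(unsigned long pfn_a, unsigned long pfn_b,
                       unsigned int order)
{
    return buddy_of(pfn_a, order) == pfn_b;
}

/* First frame of the order + 1 chunk obtained by merging pfn with its buddy. */
static unsigned long merged_chunk(unsigned long pfn, unsigned int order)
{
    return pfn & ~(1UL << order);
}

/* Smallest order whose chunk holds at least nr_pages pages. */
static unsigned int order_for_pages(unsigned long nr_pages)
{
    unsigned int order = 0;

    while ((1UL << order) < nr_pages)
        order++;
    return order;
}
```

For instance, the order-2 chunks starting at frames 4 and 8 are neighbors, but are_buddies(4, 8, 2) is false: the buddy of 4 is 0 and the buddy of 8 is 12, so they cannot be merged.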
User interface via procfs
You can check the state of your buddy allocator by reading the /proc/buddyinfo file.
The page frame allocator works at the page granularity.
This is not convenient for most allocations that request memory for smaller objects!
Object allocators act as a layer between the page allocator and subsystems to allocate smaller chunks of memory. They allocate pages, and “redistribute” smaller chunks to subsystems that allocate memory through them.
For example, when using kmalloc(), you are using an object allocator, not directly allocating pages.
Linux has multiple object allocation layers (you can even implement your own), but we will see the main one: the slab layer.
Allocating and freeing objects is extremely frequent, so it’s a good idea to have some sort of caching mechanism.
In Linux, this caching mechanism is called the slab layer.
The slab layer allows you to create caches, each of which contains a certain type of objects, e.g., struct task_struct or struct inode.
Each cache is then divided into slabs, blocks of contiguous memory that contain a certain number of instances of the object stored by this cache.
A slab contains the actual data and maintains their status (used or free).
When a slab is full, the slab layer will allocate a new one for this cache.
When the system wants to reclaim memory, empty slabs will be freed.
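To make the idea of a slab more concrete, here is a deliberately tiny user-space model of one slab: a block of contiguous memory cut into equally sized objects, plus a used/free status per object. It is a sketch under strong assumptions (a fixed number of objects, caller-provided backing memory, no locking); the toy_slab names are hypothetical and unrelated to the kernel's data structures.

```c
#include <stddef.h>

#define OBJS_PER_SLAB 4 /* hypothetical, tiny on purpose */

struct toy_slab {
    unsigned char *mem;                /* contiguous memory backing the objects */
    size_t obj_size;                   /* size of one object */
    unsigned int in_use;               /* number of allocated objects */
    unsigned char used[OBJS_PER_SLAB]; /* status of each object slot */
};

static void toy_slab_init(struct toy_slab *slab, void *mem, size_t obj_size)
{
    unsigned int i;

    slab->mem = mem;
    slab->obj_size = obj_size;
    slab->in_use = 0;
    for (i = 0; i < OBJS_PER_SLAB; i++)
        slab->used[i] = 0;
}

/* Return a free object, or NULL when the slab is full
 * (a real cache would then allocate a new slab). */
static void *toy_slab_alloc(struct toy_slab *slab)
{
    unsigned int i;

    for (i = 0; i < OBJS_PER_SLAB; i++) {
        if (!slab->used[i]) {
            slab->used[i] = 1;
            slab->in_use++;
            return slab->mem + i * slab->obj_size;
        }
    }
    return NULL;
}

/* Mark the object as free again; the memory stays in the slab for reuse. */
static void toy_slab_free(struct toy_slab *slab, void *obj)
{
    size_t idx = (size_t)((unsigned char *)obj - slab->mem) / slab->obj_size;

    if (idx < OBJS_PER_SLAB && slab->used[idx]) {
        slab->used[idx] = 0;
        slab->in_use--;
    }
}
```

A slab whose in_use counter drops back to 0 is empty and is the kind of slab that could be handed back to the page allocator under memory pressure.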
Note
Additionally, allocations are done at the page granularity, so for smaller objects, you would need manual management to not waste memory.
Added in Linux in 1996, implements work from Sun Microsystems in SunOS 5.4
Cache friendly
NUMA aware
Design: Page frame layout
Metadata of each slab can be embedded in the slab itself:
Note
Multiple allocations can be done by only touching one cache line (the one with the freelist). No need to touch the actual objects.
Design: Data structures
Partial description, see their real definitions for more details
Frame layout:
Object layout:
Introduced in 2007. The idea is to simplify the implementation, with fewer queues.
Locality by having per-cpu slabs, still NUMA aware
Frame layout:
Object layout:
SLUB is the default allocator
SLOB has been deprecated in 6.2
SLAB has been deprecated in 6.5
As seen previously, the “main” object of the slab layer is a struct kmem_cache, as it represents an instance of a cache.
You can query which caches have been created in your system from user space:
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
ext4_groupinfo_4k 7656 7656 184 44 2 : tunables 0 0 0 : slabdata 174 174 0
ext4_fc_dentry_update 0 0 96 42 1 : tunables 0 0 0 : slabdata 0 0 0
ext4_inode_cache 257751 258039 1192 27 8 : tunables 0 0 0 : slabdata 9557 9557 0
ext4_allocation_context 520 520 152 26 1 : tunables 0 0 0 : slabdata 20 20 0
ext4_prealloc_space 720 720 112 36 1 : tunables 0 0 0 : slabdata 20 20 0
ext4_io_end 1408 1408 64 64 1 : tunables 0 0 0 : slabdata 22 22 0
filp 17434 19008 256 32 2 : tunables 0 0 0 : slabdata 594 594 0
inode_cache 15325 15325 648 25 4 : tunables 0 0 0 : slabdata 613 613 0
dentry 357252 357252 192 42 2 : tunables 0 0 0 : slabdata 8506 8506 0
pid 4129 4160 128 32 1 : tunables 0 0 0 : slabdata 130 130 0
kmalloc-8k 496 496 8192 4 8 : tunables 0 0 0 : slabdata 124 124 0
kmalloc-4k 1765 1768 4096 8 8 : tunables 0 0 0 : slabdata 221 221 0
kmalloc-2k 2452 2496 2048 16 8 : tunables 0 0 0 : slabdata 156 156 0
kmalloc-1k 4670 4704 1024 32 8 : tunables 0 0 0 : slabdata 147 147 0
kmalloc-512 49888 49888 512 32 4 : tunables 0 0 0 : slabdata 1559 1559 0
kmalloc-256 20977 21024 256 32 2 : tunables 0 0 0 : slabdata 657 657 0
kmalloc-192 53424 53424 192 42 2 : tunables 0 0 0 : slabdata 1272 1272 0
kmalloc-128 64768 64768 128 32 1 : tunables 0 0 0 : slabdata 2024 2024 0
kmalloc-96 7856 8820 96 42 1 : tunables 0 0 0 : slabdata 210 210 0
kmalloc-64 69111 69120 64 64 1 : tunables 0 0 0 : slabdata 1080 1080 0
kmalloc-32 26610 27008 32 128 1 : tunables 0 0 0 : slabdata 211 211 0
kmalloc-16 40386 41472 16 256 1 : tunables 0 0 0 : slabdata 162 162 0
kmalloc-8 32767 32768 8 512 1 : tunables 0 0 0 : slabdata 64 64 0
kmem_cache_node 640 640 64 64 1 : tunables 0 0 0 : slabdata 10 10 0
kmem_cache 384 384 256 32 2 : tunables 0 0 0 : slabdata 12 12 0
If your code performs allocations and needs a guarantee that memory will be available, you can use memory pools.
This should be used only if your code will fail if memory is not available.
For example, some drivers performing DMA might need to allocate objects during an operation with hardware, where failure would break the hardware.
A memory pool is a chunk of pre-allocated memory that is guaranteed to be able to store at least a minimal number of objects.
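The guarantee can be modelled in user space as follows. This is a simplified sketch, not the kernel implementation: the toy_mempool names and the budget-based backing allocator are hypothetical. Allocation tries the backing allocator first and only dips into the pre-allocated reserve when that fails; freed elements refill the reserve before being given back.

```c
#include <stddef.h>
#include <stdlib.h>

#define TOY_POOL_MAX 8 /* hypothetical upper bound for this sketch */

struct toy_mempool {
    int min_nr;                   /* guaranteed number of reserved elements */
    int curr_nr;                  /* elements currently held in reserve */
    void *reserve[TOY_POOL_MAX];
    void *(*alloc_fn)(void *pool_data);
    void (*free_fn)(void *element, void *pool_data);
    void *pool_data;
};

/* Pre-allocate min_nr elements. Returns 0 on success, -1 on failure. */
static int toy_mempool_init(struct toy_mempool *pool, int min_nr,
                            void *(*alloc_fn)(void *),
                            void (*free_fn)(void *, void *), void *pool_data)
{
    if (min_nr < 0 || min_nr > TOY_POOL_MAX)
        return -1;
    pool->min_nr = min_nr;
    pool->curr_nr = 0;
    pool->alloc_fn = alloc_fn;
    pool->free_fn = free_fn;
    pool->pool_data = pool_data;
    while (pool->curr_nr < min_nr) {
        void *element = alloc_fn(pool_data);

        if (!element) {
            while (pool->curr_nr > 0)
                free_fn(pool->reserve[--pool->curr_nr], pool_data);
            return -1;
        }
        pool->reserve[pool->curr_nr++] = element;
    }
    return 0;
}

/* Try the backing allocator first, fall back to the reserve. */
static void *toy_mempool_alloc(struct toy_mempool *pool)
{
    void *element = pool->alloc_fn(pool->pool_data);

    if (element)
        return element;
    return pool->curr_nr > 0 ? pool->reserve[--pool->curr_nr] : NULL;
}

/* Refill the reserve first, otherwise give the element back. */
static void toy_mempool_free(struct toy_mempool *pool, void *element)
{
    if (pool->curr_nr < pool->min_nr)
        pool->reserve[pool->curr_nr++] = element;
    else
        pool->free_fn(element, pool->pool_data);
}

/* Example backing allocator: succeeds only while *budget > 0. */
static void *budget_alloc(void *pool_data)
{
    int *budget = pool_data;

    if (*budget <= 0)
        return NULL;
    (*budget)--;
    return malloc(64);
}

static void budget_free(void *element, void *pool_data)
{
    (void)pool_data;
    free(element);
}
```

Where this sketch returns NULL once the reserve is exhausted, the real mempool can also wait for an element to be returned to the pool, depending on the allocation flags.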
Creation/Destruction
/**
* mempool_create - create a memory pool
* @min_nr: the minimum number of elements guaranteed to be
* allocated for this pool.
* @alloc_fn: user-defined element-allocation function.
* @free_fn: user-defined element-freeing function.
* @pool_data: optional private data available to the user-defined functions.
*/
mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn, mempool_free_t *free_fn, void *pool_data);
void mempool_destroy(mempool_t *pool);
typedef void * (mempool_alloc_t)(gfp_t gfp_mask, void *pool_data);
typedef void (mempool_free_t)(void *element, void *pool_data);
You can also build a memory pool on top of a slab cache with the following wrapper function:
If you want to use the kmalloc slab cache, you can use this wrapper function:
If you allocate memory, you need to free it at some point to avoid memory leaks.
There are multiple ways of reclaiming memory for the kernel:
- Freeing objects explicitly with kfree() or kmem_cache_free().
- Reclaiming page frames, done by the kswapd daemon and under heavy memory pressure.
Reference counters keep track of the number of users of an object. Whenever the counter reaches 0, the object is not in use anymore and can be freed.
To use a reference counter, you need to embed a struct kref into your structure. It needs to be embedded, not a pointer!
You can check the full API in include/linux/kref.h, but the main methods are:
/**
* kref_get - increment refcount for object.
* @kref: object.
*/
void kref_get(struct kref *kref);
/**
* kref_put - decrement refcount for object.
* @kref: object.
* @release: pointer to the function that will clean up the object when the
* last reference to the object is released.
* This pointer is required, and it is not acceptable to pass kfree
* in as this function. If the caller does pass kfree to this
* function, you will be publicly mocked mercilessly by the kref
* maintainer, and anyone else who happens to notice it. You have
* been warned.
*/
int kref_put(struct kref *kref, void (*release)(struct kref *kref));
Putting a kref object
The release() function needs to free the object containing the kref.
You can achieve this by using the container_of macro.
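The pattern can be tried out in user space with a stand-in for struct kref. Everything below is a hypothetical sketch: toy_kref is a plain, non-atomic counter, struct session is an invented example structure, and the release function only sets a flag where real kernel code would free the containing object.

```c
#include <stddef.h>

/* User-space equivalent of the kernel macro: get the structure containing
 * `member` from a pointer to that member. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct toy_kref {
    int refcount; /* stand-in for struct kref, not atomic */
};

struct session {
    int id;
    int released;        /* set by the release function in this sketch */
    struct toy_kref ref; /* embedded, not a pointer */
};

static void toy_kref_init(struct toy_kref *kref)
{
    kref->refcount = 1;
}

static void toy_kref_get(struct toy_kref *kref)
{
    kref->refcount++;
}

/* Returns 1 if this was the last reference and release() was called. */
static int toy_kref_put(struct toy_kref *kref,
                        void (*release)(struct toy_kref *kref))
{
    if (--kref->refcount == 0) {
        release(kref);
        return 1;
    }
    return 0;
}

/* release() only receives the counter: container_of recovers the object. */
static void session_release(struct toy_kref *kref)
{
    struct session *s = container_of(kref, struct session, ref);

    s->released = 1; /* kernel code would free s here */
}
```

Embedding the counter is what makes container_of work: the offset between the counter and the start of the object is known at compile time.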
If you allocate a lot of objects that are useful but not necessary, you can play nice and let the kernel reclaim your memory if needed.
When the kernel is under memory pressure, it runs the shrinker to reclaim memory from registered components.
To register a shrinker for your code, you first need to declare a struct shrinker and define count() and scan() functions.
struct shrinker {
unsigned long (*count_objects)(struct shrinker *, struct shrink_control *sc);
unsigned long (*scan_objects)(struct shrinker *, struct shrink_control *sc);
int seeks; /* seeks to recreate an obj */
long batch; /* reclaim batch size, 0 = default */
unsigned long flags;
/* These are for internal use */
struct list_head list;
/* objs pending delete, per node */
atomic_long_t *nr_deferred;
};
When the kernel wants to reclaim memory, it will call the count() method of all registered shrinkers to assess how many objects can be freed.
It will then call the scan() method if the count is positive in order to actually free the memory.
With your shrinker declared, you still need to register it so that the kernel will call it when memory reclaiming is performed.
You also need to unregister your shrinker when it is not usable anymore, e.g., when you unload your module.
Linux tries to keep a pool of available free pages to ensure future allocations won’t fail (most likely).
Pages can be of one of four types:
Page reclamation is performed on two occasions:
- The kswapd daemon is asynchronously woken up to reclaim pages
The page frame reclamation algorithm (PFRA) implements a form of Least-Recently Used (LRU) algorithm.
The rationale is the following:
In practice, it is very costly to maintain a sorted LRU list.
Thus, Linux approximates it using a clock algorithm, using the referenced bit from the page table.
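The textbook version of the clock algorithm fits in a few lines. The sketch below is the classic second-chance scheme, not the kernel's implementation: struct frame, clock_pick_victim() and the circular array are assumptions made for illustration, and the caller must pass at least one frame.

```c
#include <stddef.h>

struct frame {
    int referenced; /* models the referenced bit from the page table */
};

/* Advance the clock hand until a frame with a cleared referenced bit is found.
 * Referenced frames get a second chance: their bit is cleared and they are
 * skipped for this round. Returns the index of the frame to evict. */
static size_t clock_pick_victim(struct frame *frames, size_t nr_frames,
                                size_t *hand)
{
    for (;;) {
        size_t current = *hand;

        *hand = (*hand + 1) % nr_frames;
        if (!frames[current].referenced)
            return current;
        frames[current].referenced = 0;
    }
}
```

No list is ever sorted: a frame that keeps being accessed has its bit set again before the hand comes back, so it survives, while a frame left untouched for a full rotation gets evicted.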
There are two versions of the PFRA:
Let’s have a look at both implementations!
The PFRA maintains two lists: the active and the inactive lists.
The active list contains recently accessed pages, while the inactive list contains the pages that were not accessed recently.
In other words, the active list contains the current working set of the system.
When a page is accessed for the first time, it is added in the inactive list, with a referenced bit set to 0.
Periodically, the PFRA scans the lists:
Lengths of lists
The PFRA tries to balance the length of the active and inactive list through a ratio. By default, Linux is usually tuned to have an active list at most two thirds of the page cache’s size.
Unfortunately, the active/inactive lists have a few shortcomings:
In Linux 6.1, the Multi-Gen LRU (MGLRU) was introduced to fix these issues.
It generalises the previous concept with more lists than just active/inactive, called generations.
The general idea and algorithm are the same: pages are moved between generations depending on their use recency.
The old algorithm could be described as an MGLRU with two generations.
Having more generations gives a finer-grained estimation of the age of a page:
pages in the same generation have been last accessed roughly at the same period of time.
Each generation is also smaller than with the old version, making the scan faster.
Generations are also now split into tiers that regroup pages that were accessed the same number of times in the generation.
Transparent Huge Pages (THP)
Transparently use huge pages (2 MiB or 1 GiB on x86-64) when allocating large memory areas. Reduces the pressure on the TLB (fewer entries for the same amount of data) and shortens page table operations (fewer page table levels to walk through).
Compaction
Reduce fragmentation of the physical memory by compacting pages close to each other. This reduces the number of holes due to the buddy allocator, and enables merging free buddies, and thus allocating larger memory areas.
Kernel Samepage Merging (KSM)
Deduplication mechanism that can be enabled to detect page frames with the same content and merge them. The resulting page frame is then mapped at each virtual address the original page frames belonged to.
Out-of-Memory Killer (OOM)
When the system critically runs out of memory, the OOM killer chooses a process to sacrifice in order to free enough memory for the system to keep operating without freezing or crashing.
The Virtual File System (VFS) defines a set of abstractions that are then made concrete by file system implementations.
Objects
Interfaces
File descriptors represent an instance of an open file.
It contains information about the file on storage as well as the current state of the open file, e.g., cursor position.
There can be multiple file descriptors for the same file on storage when opened multiple times.
A file descriptor is defined as a struct file in include/linux/fs.h:
struct file {
fmode_t f_mode; // mode in which the file is opened (FMODE_READ, FMODE_WRITE, ...)
atomic_long_t f_count; // number of threads sharing this file descriptor
loff_t f_pos; // position in the file, the "cursor"
struct inode *f_inode; // the inode representing the concrete file
struct file_operations *f_op;
struct address_space *f_mapping; // mapping in memory for the page cache
/* ... */
};
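You can observe the per-open-file cursor from user space. In the sketch below (the helper `probe_cursors` and the temporary file name are made up for this example), the same file is opened twice, which should give two independent cursors, and one descriptor is duplicated with `dup()`, which should share the cursor of the original since both numbers refer to the same open file.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

struct cursor_report {
    off_t reopened;   /* cursor seen through a second open() */
    off_t duplicated; /* cursor seen through dup() */
};

/* Writes 10 bytes through one descriptor, then reads the cursor position
 * of the two other descriptors. Returns 0 on success, -1 on error. */
int probe_cursors(struct cursor_report *out)
{
    char path[] = "/tmp/lkp_cursor_XXXXXX";
    int ret = -1;
    int fd1 = mkstemp(path);
    if (fd1 < 0)
        return -1;

    int fd2 = open(path, O_RDONLY); /* new open file: its own cursor */
    int fd3 = dup(fd1);             /* same open file: shared cursor */

    if (fd2 >= 0 && fd3 >= 0 && write(fd1, "0123456789", 10) == 10) {
        out->reopened = lseek(fd2, 0, SEEK_CUR);
        out->duplicated = lseek(fd3, 0, SEEK_CUR);
        ret = 0;
    }

    if (fd3 >= 0)
        close(fd3);
    if (fd2 >= 0)
        close(fd2);
    close(fd1);
    unlink(path);
    return ret;
}
```

After the write, the descriptor obtained with `open()` still reports position 0, while the one obtained with `dup()` reports position 10.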
Inodes describe files or directories on the storage device.
Each inode corresponds to one and only one file/directory.
Conversely, each file/directory corresponds to one and only one inode.
An inode is defined as a struct inode in include/linux/fs.h:
struct inode {
umode_t i_mode; // mode of the file on disk
kuid_t i_uid; // user id of the owner of the file
kgid_t i_gid; // group id of the owner of the file
unsigned long i_ino; // inode number
unsigned int i_nlink; // number of links to this file (hard links)
loff_t i_size; // size of the file
struct timespec64 i_atime; // date of the last access
struct timespec64 i_mtime; // date of the last modification
struct timespec64 i_ctime; // date of the last change of the inode itself (metadata)
struct inode_operations *i_op;
struct super_block *i_sb; // super block (partition) that contains this inode
struct address_space *i_mapping; // mapping in memory for the page cache
};
The inode is also partially stored on disk in order to preserve information across mounts/reboots, e.g., file size, timestamps, mode.
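Most of these fields are visible from user space through `stat()`. As a small experiment, the sketch below creates a file and a hard link to it, then checks that both names lead to the same inode number and that the link count is 2. The helper `probe_hard_link` and the temporary paths are made up for this example, and it assumes that `/tmp` is writable and supports hard links.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Creates a file and a hard link to it, then compares what stat() reports
 * for both names. Returns 0 on success, -1 on error. */
int probe_hard_link(int *same_inode, unsigned long *nlink)
{
    char path[] = "/tmp/lkp_inode_XXXXXX";
    char link_path[sizeof(path) + 8];
    struct stat a, b;
    int ret = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    close(fd);

    snprintf(link_path, sizeof(link_path), "%s.link", path);
    if (link(path, link_path) == 0) {
        if (stat(path, &a) == 0 && stat(link_path, &b) == 0) {
            /* st_ino mirrors i_ino, st_nlink mirrors i_nlink */
            *same_inode = (a.st_ino == b.st_ino && a.st_dev == b.st_dev);
            *nlink = (unsigned long)a.st_nlink;
            ret = 0;
        }
        unlink(link_path);
    }
    unlink(path);
    return ret;
}
```

Two names, one inode: this is what the `i_nlink` counter keeps track of.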
From a user perspective, a file/directory is identified by its path.
The VFS needs to translate this path into an inode to actually interact with the file.
This resolution operation is costly as it requires numerous string operations and walking the directory hierarchy on disk.
To avoid repeating this operation too many times, the VFS builds directory entries, or dentries.
A dentry maintains a relationship between a path and its corresponding inode.
Example: If you open the file located at /home/lkp/foo/bar, it will create/query the following five dentries: /, home, lkp, foo and bar.
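The count in this example can be reproduced with a few lines of string handling. This is only a sketch: the function `count_dentries` is made up for this illustration, and real path resolution also has to deal with symbolic links, `.` and `..`, and mount points.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the dentries involved in resolving an absolute path:
 * one for the root, plus one per non-empty component.
 * Returns 0 if the path is not absolute. */
size_t count_dentries(const char *path)
{
    assert(path != NULL);
    if (path[0] != '/')
        return 0;

    size_t count = 1; /* the root dentry "/" */
    const char *p = path;
    while (*p) {
        while (*p == '/')
            p++; /* skip separators */
        if (*p == '\0')
            break;
        count++; /* a new component starts here */
        while (*p && *p != '/')
            p++; /* skip over the component */
    }
    return count;
}
```

Calling `count_dentries("/home/lkp/foo/bar")` returns 5, matching the list above.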
Dentries are defined as struct dentry in include/linux/dcache.h.
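Dentries can be in one of three states:
Used: the dentry corresponds to a valid inode (d_inode points to an inode) and is currently in use
Unused: the dentry corresponds to a valid inode (d_inode points to an inode) and is in the dentry cache, but is not currently in use
Negative: the dentry does not correspond to a valid inode (d_inode is NULL) (you will see why this is useful in the lab)
Dentries are cached into a hash table for fast lookup.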
This cache is declared as static struct hlist_bl_head *dentry_hashtable in fs/dcache.c.
Spoiler alert!
No more details here, you will have to work with that cache in the next lab.
In particular, you will have to find out:
Now, let’s go back to our layers, and more specifically to how the VFS glues all of this together.
As shown in the previous figure, implementing a file system means implementing a set of file and inode operations.
File systems may also add their own flavor of inode definition.
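Example: The ext4 file system has its own on-disk inode definition, which stores the persistent counterpart of the VFS inode fields (in memory, ext4 wraps the VFS inode in a struct ext4_inode_info):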
struct ext4_inode {
__le16 i_mode; /* File mode */
__le16 i_uid; /* Low 16 bits of Owner Uid */
__le32 i_size_lo; /* Size in bytes */
__le32 i_atime; /* Access time */
__le32 i_ctime; /* Inode Change time */
__le32 i_mtime; /* Modification time */
__le32 i_dtime; /* Deletion Time */
__le16 i_gid; /* Low 16 bits of Group Id */
__le16 i_links_count; /* Links count */
__le32 i_blocks_lo; /* Blocks count */
__le32 i_flags; /* File flags */
__le16 i_checksum_hi; /* crc32c(uuid+inum+inode) BE */
__le32 i_generation; /* File version (for NFS) */
__le32 i_file_acl_lo; /* File ACL */
};
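In memory, the usual pattern is to embed the generic inode inside a larger, file system-specific structure, and to recover the outer structure with `container_of()` when the VFS calls back into the file system (ext4 does this with its `EXT4_I()` helper). The user-space sketch below mimics this pattern together with the dispatch through an operations table. Everything prefixed with `toy_` or `myfs_` is hypothetical and only loosely mirrors the kernel structures.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-in for the kernel's container_of() macro. */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct toy_inode;

/* Operations table, as seen by the generic layer. */
struct toy_inode_operations {
    long (*get_blocks)(struct toy_inode *inode);
};

/* Generic inode, shared by all file systems. */
struct toy_inode {
    unsigned long i_ino;
    const struct toy_inode_operations *i_op;
};

/* File system-specific inode that embeds the generic one. */
struct myfs_inode_info {
    long i_blocks;              /* private, file system-specific field */
    struct toy_inode vfs_inode; /* generic part handed to the generic layer */
};

/* File system side: go from the generic inode back to the private one. */
static long myfs_get_blocks(struct toy_inode *inode)
{
    struct myfs_inode_info *info =
        toy_container_of(inode, struct myfs_inode_info, vfs_inode);
    return info->i_blocks;
}

static const struct toy_inode_operations myfs_iops = {
    .get_blocks = myfs_get_blocks,
};

void myfs_init_inode(struct myfs_inode_info *info, unsigned long ino, long blocks)
{
    info->i_blocks = blocks;
    info->vfs_inode.i_ino = ino;
    info->vfs_inode.i_op = &myfs_iops;
}

/* Generic side: dispatches through the table, knows nothing about myfs. */
long toy_vfs_get_blocks(struct toy_inode *inode)
{
    assert(inode->i_op && inode->i_op->get_blocks);
    return inode->i_op->get_blocks(inode);
}
```

The generic layer only ever manipulates `struct toy_inode` pointers, so supporting a new file system means providing a new operations table, without touching the generic code.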
Quick reminder about computer system latencies:
Device | Time (in ns) |
---|---|
Memory | 10,000 |
SSD | 1,000,000 |
HDD | 5,000,000 |
Accessing a storage device takes at least two orders of magnitude longer than accessing memory!
To alleviate this, Linux has a page cache sitting between file systems and storage devices.
The xarray in the address_space may contain folios instead of pages depending on the file system. We won’t go into more details on this, but if you want to have a look, you can check the kernel documentation/online resources.
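The principle of the page cache can be sketched in a few lines: file offsets are turned into page indices, and the device is only accessed when the page is not cached yet. This is a toy model with made-up names (`toy_page_cache`, `toy_read_byte`) and a fake device. It uses a plain array instead of an xarray and never evicts or writes back anything.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096UL
#define TOY_NR_PAGES 8UL /* assumption: size of the toy file, in pages */

struct toy_page_cache {
    unsigned char data[TOY_NR_PAGES][TOY_PAGE_SIZE];
    int present[TOY_NR_PAGES];  /* is the page at this index cached? */
    unsigned long device_reads; /* number of (slow) device accesses */
};

/* Fake device: the byte at a given offset is the offset modulo 256. */
static void toy_device_read_page(unsigned long index, unsigned char *buf)
{
    for (unsigned long i = 0; i < TOY_PAGE_SIZE; i++)
        buf[i] = (unsigned char)((index * TOY_PAGE_SIZE + i) & 0xff);
}

/* Reads one byte at `offset`, going to the device only on a cache miss.
 * Returns the byte, or -1 if the offset is out of range. */
int toy_read_byte(struct toy_page_cache *cache, unsigned long offset)
{
    unsigned long index = offset / TOY_PAGE_SIZE;

    assert(cache != NULL);
    if (index >= TOY_NR_PAGES)
        return -1;
    if (!cache->present[index]) {
        toy_device_read_page(index, cache->data[index]);
        cache->present[index] = 1;
        cache->device_reads++;
    }
    return cache->data[index][offset % TOY_PAGE_SIZE];
}
```

Two reads that fall in the same page cost a single device access. With the latencies above, this is the difference between paying the storage latency once or on every read.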
A super block describes a file system partition.
When you mount a file system partition, a struct super_block object is created and populated with information read from storage.
This structure is defined in include/linux/fs.h:
struct super_block {
struct block_device *s_bdev; // the block device containing this partition
unsigned long s_blocksize; // size of a block
loff_t s_maxbytes; // max file size
struct file_system_type *s_type; // file system descriptor
struct super_operations *s_op; // fs ops (alloc_inode, sync, umount)
struct dentry *s_root; // dentry of the root of the mount point (the / of this partition)
unsigned long s_magic; // magic number of this file system
void *s_fs_info; // file system-specific private info
char s_id[32]; // short name
uuid_t s_uuid; // UUID
/* ... */
};
The operations on super blocks are also defined in include/linux/fs.h:
struct super_operations {
/* inode handling */
struct inode *(*alloc_inode)(struct super_block *sb);
void (*destroy_inode)(struct inode *);
void (*free_inode)(struct inode *);
int (*write_inode) (struct inode *, struct writeback_control *wbc);
/* partition handling */
void (*put_super) (struct super_block *);
int (*sync_fs)(struct super_block *sb, int wait);
int (*statfs) (struct dentry *, struct kstatfs *);
void (*umount_begin) (struct super_block *);
/* ... */
};
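Some of this information can be queried from user space. The `statfs()` system call ends up in the `statfs` super operation of the file system that contains the given path, and typically reports the magic number and the block size of that file system. The sketch below assumes Linux with glibc; the helper `query_fs` and the structure `fs_summary` are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/vfs.h>

struct fs_summary {
    long magic;      /* file system type, usually the same value as s_magic */
    long block_size; /* optimal transfer block size */
};

/* Queries the file system that contains `path`.
 * Returns 0 on success, -1 on error. */
int query_fs(const char *path, struct fs_summary *out)
{
    struct statfs st;

    assert(path != NULL && out != NULL);
    if (statfs(path, &st) != 0)
        return -1;
    out->magic = (long)st.f_type;
    out->block_size = (long)st.f_bsize;
    return 0;
}
```

For instance, on an ext4 partition, the reported magic number should be 0xEF53. Other values are listed in include/uapi/linux/magic.h.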
To implement a file system, you need to:
For 1., this mostly means implementing the mkfs user space utility for your new file system.
For 2., this is your kernel implementation as a module.
Another way of implementing a file system in Linux is through the Filesystem in Userspace (FUSE) API.
With FUSE, you can implement everything in user space and register your FUSE file system with the kernel.
The VFS will then redirect system calls targeting your file system to your code in user space, which in turn may use the underlying “real” file system.
Performance
The number of round trips between user and kernel modes greatly impacts the performance of FUSE-based file systems.
FUSE Passthrough
Since Linux 6.9, the FUSE subsystem supports a passthrough feature to avoid mode switches.
It allows the FUSE driver to directly communicate with other file systems for read/write operations, without going back to the user space FUSE daemon.