Linux Kernel Programming

Redha Gouicem

Course Organisation

Overview

Who are we?

LuFG Betriebssysteme/Operating Systems
Teaching
Research

Linux Kernel Programming

Organisation
Evaluation
Links

Who Are We?

Lehr- und Forschungsgebiet Betriebssysteme
Operating Systems Research and Teaching Unit

Faculty

Prof. Redha Gouicem

Administration

Claudia Graf

Researchers

Jérôme Coquisart, M.Sc.
Mostafa Hadizadeh, M.Sc.

Research assistants - Tutors (HiWi)

16 people, both bachelor and master students

Where Can You Find Us?

Our offices are in the UMIC building, 2nd floor

Website: https://os.rwth-aachen.de

Teaching Curriculum

Past/current theses examples

Multi-Processing Support for Unikernels (M.Sc.)
Understanding the Performance Impact of Security Mitigations in the Linux Kernel (B.Sc.)
Automatic Caching for Cloud Object Storage Systems (B.Sc.)
Automatic Detection and Mitigation of Page Reclamation Overhead (M.Sc.)
Design and Implementation of an Evaluation Framework for a NUMA-Aware Page Cache (B.Sc.)
Evaluating Memory Prefetching Policies in Linux Systems (B.Sc.)

Implement Support for Instructions in the Front-End of a Hybrid Binary Translator (B.Sc.)
Implement a Shared Page Cache in a Virtualised Environment (B.Sc./M.Sc.)
Impact of State-of-the-Art Page Table Replication Schemes on the Efficiency of NUMA Systems (B.Sc./M.Sc.)
Evaluation and Improvement for NUMA Kernel Text Replication (B.Sc./M.Sc.)

Research Activities

As the name of the group suggests, operating systems!

In short: design, implementation and optimisation of the lower software stack, between the hardware and users.

The main goals are:

Enable users to have an efficient and reliable execution environment for their applications,
Allow developers to have an easy-to-use abstraction of the hardware and transparent resource management and isolation

In a nutshell, our topics revolve around:

“Classical” operating systems

Scheduling
Memory management
Storage systems
Unikernels

Virtualisation

Hypervisors
Containers
Deployment models

Emerging hardware

Non-volatile memory
FPGAs
CXL

Binary translation

Cross architecture emulators
Memory models
Correctness

Linux Kernel Programming: Team

Lecturer:

Prof. Redha Gouicem

Teaching assistant:

Jérôme Coquisart, M.Sc.

Contact emails

General Information

In this course, you will learn how to program in the Linux kernel.

This is a very practical course, where you will mostly write code.

Lectures

Time: Tuesdays @ 16:30 - 18:00

Location: Lecture hall AH VI

Lecturer: Me

Content:

Fundamental OS concepts
Overview of kernel APIs
Linux kernel subsystems/algorithms

Labs

Time: Mondays & Thursdays @ 10:30 - 12:00

Location: UMIC 025

Teaching assistant: Jérôme Coquisart, M.Sc.

Content:

Explore the Linux kernel code base
Use Linux kernel APIs to implement modules
Learn/use kernel debugging tools

Important information

Unfortunately, we cannot provide hardware, so you need to come with your laptop.
Hardware that runs Linux is best, but Windows with WSL should also work.
Apple devices with ARM-based processors (M1/M2) should also work, but not easily…

Course Content

Lectures

History and Architecture of the Linux Kernel
C Bootcamp and Kernel Programming
Implementing Kernel Modules & Contributing to the Kernel
User-Kernel Communication
Memory Management
The Virtual File System
Tracing Facilities in the Kernel

Labs

Dusting Off Your C Skills
First Steps with the Kernel
My First Modules
Debugging in the Linux Kernel
User/Kernel Communication Mechanisms
Memory Management
Virtual File System
System Calls

Info

There might a couple more lectures and labs added this year.

Examination

Written exam (45% of the final grade)

General questions about the lecture, e.g., explain a mechanism in the kernel
General questions about the exercises, e.g., explain how some API works
No full coding exercise, but you need to understand code, and explain how to modify it

Project (45% of the final grade)

Write a set of features in the kernel
In groups of 2-3 students
Assignment will be given a few weeks before the end of the lecture period
Your submitted code will be evaluated, and you will need to make a very short presentation/demo

Weekly labs (10% of the final grade)

Every 1-2 weeks, you will get a new lab
You will have to submit some of them for evaluation
You will need to get enough points in the labs to register for the exam and project

Contact & Online Material

Contact e-mail

If you want to contact us, please use the following e-mail address: lkp@os.rwth-aachen.de

If you contact us directly, you might wait longer or get no answer.

Matrix server

We will set up a Matrix chat room with all students (and us).

If you already have an account on another server, you can use it.
Otherwise, you will be allowed to create one on ours.

Lectures and Labs

Lecture slides will be uploaded just before the lecture here: https://teaching.os.rwth-aachen.de/LKP/lecture

Labs will be available here: https://teaching.os.rwth-aachen.de/LKP

Lecture Live Q&A

During lectures, you can ask questions directly by raising your hand, or through an online Q&A tool:

Link: Claper room

Reading Material

Books

Linux Kernel Development (3rd Edition), Robert Love
Linux Device Drivers, Third Edition, Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman

Online material

Chapter 1: History and Architecture of the Linux Kernel

Operating System and Kernel

In this course, we will use the following definitions:

Definition

The operating system is the set of software components that enables applications to use the underlying hardware and provides APIs to ease development.

Definition

The kernel is the set of components of the operating system that are executed in a privileged mode, usually in supervisor mode.

Kernel Taxonomy

Kernels are usually classified in various types:

Monolithic kernels
Microkernels
Hybrid microkernels
Unikernels

Let’s have a quick recap of these kernel architectures!

Monolithic Kernels

A monolithic kernel embeds all the system functionalities in a single binary. It contains all the core features of an operating system (scheduling, memory management, etc…) as well as drivers for devices or less essential components.

Characteristics

Defines a high level interface through system calls
Good performance when kernel components communicate (regular function calls in kernel space)
Limited safety: if one kernel component crashes, the whole system crashes

Examples

Unix family: BSD, Solaris
Unix-like: Linux
DOS: MS-DOS
Critical embedded systems: Cisco IOS

Why ‘monolithic’?

Monolithic means it is built as a single binary and runs in the same address space. The source code can still be organised in a modular way (e.g., using libraries).

Modularity

Some monolithic kernels allow dynamic code loading as modules, e.g., for drivers. These are usually called modular monolithic kernels.

Microkernels

A microkernel contains only the minimal set of features needed in kernel space:
address-space management, basic scheduling and basic inter-process communication.
All other services are pushed in user space as servers:
file systems, device drivers, high level interfaces, etc.

Characteristics

Small memory footprint, making it a good choice for embedded systems
Enhanced safety: when a user space server crashes, it does not crash the whole system
Adaptability: servers can be replaced/updated easily, without rebooting
Limited performance: IPCs are costly and numerous in such an architecture

Examples

Minix
L4 family: seL4, OKL4, sepOS
Mach
Zircon

Hybrid Microkernels

The hybrid kernel architecture sits between monolithic kernels and microkernels.
It is a monolithic kernel where some components have been moved out of kernel space as servers running in user space.

While the structure is similar microkernels, i.e., using user servers, hybrid kernels do not provide the same safety guarantees as most components still run in the kernel.

Controversial architecture

This architecture’s existence is controversial, as some just define it as a stripped down monolithic kernel.

Examples

Windows NT
XNU (Mach + BSD)

Unikernels

A unikernel, or library operating system, embeds all the software in supervisor mode.
The kernel as well as all user applications run in the same privileged mode.

It is used to build single application operating systems, embedding only the necessary set of applications in a minimal image.

Characteristics

High peformance: system calls become regular function calls and no copies between user and kernel spaces
Security: attack surface is minimized, easier to harden
Usability: hard to build unikernels due to lack of features supported

Examples

Unikraft
clickOS
IncludeOS

Comparison Between Kernel Architectures

The choice of architecture has various impacts on performance, safety and interfaces:

Switching modes is costly: minimizing mode switches improves performance
Supervisor mode is not safe: minimizing code in supervisor mode improves safety and reliability
High level interfaces for programmers are in different locations depending on the architecture
i.e., in the kernel for monolithic, but in libraries or servers for microkernels

In this course, we will focus on a monolithic modular kernel: Linux.

A Brief History of the Linux Kernel

Unix Systems

In the 1960s, MIT, AT&T Bell Labs and General Electric built Multics (Multiplexed Information and Computing Service).

Multics is a time-sharing operating system for mainframes that introduced new concepts:

multitasking: multiple users can use the system simultaneously
hierarchical file system: files are organised as a tree with directories
single-level store: files on storage are all mapped in memory, thus not accessed with read/write primitives, but through regular memory accesses

In 1970, AT&T Bell Labs left the project and started Unix, led by Ken Thompson, with Dennis Ritchie, Brian Kernighan, Douglas McIlroy, and Joe Ossanna.

Unix kept the hierarchical file system but dropped the single-level store, going for an “everything is a file” philosophy.

Unix was originally a single-tasking OS.

Why ‘Unix’?

The name Unix is a pun on Multics/Unics. Kernighan came up with the name, but states that “no one can remember” who came up with the spelling.

Timeline of Unix Systems

Source: https://en.wikipedia.org/wiki/History_of_Unix

Linux: Origins

First public appearance on the Minix newsgroup

From: Linus Benedict Torvalds
To: comp.os.minix
Subject: What would you like to see most in minix?
Date: 25 August 1991,  22:57:08

Hello everybody out there using minix -

I'm doing a (free) operating system (just a hobby, won't be
big and professional like gnu) for 386(486) AT clones. This
has been brewing since april, and is starting to get ready.
I'd like any feedback on things people like/dislike in minix,
as my OS resembles it somewhat (same physical layout of the
file-system (due to practical reasons) among other things).

I've currently ported bash(1.08) and gcc(1.40), and things
seem to work. This implies that I'll get something practical
within a few months, and I'd like to know what features most
people would want. Any suggestions are welcome, but I won't
promise I'll implement them :-)

Linus (torv...@kruuna.helsinki.fi)

PS. Yes - it's free of any minix code, and it has a
multi-threaded fs. It is NOT protable (uses 386 task switching
etc), and it probably never will support anything other than
AT-harddisks, as that's all I have :-(.

Reply from Andrew Tanenbaum (creator of Minix)

From: Andrew S. Tanenbaum
To: comp.os.minix
Subject: What would you like to see most in minix?
Date: 30 January 1992, 09:04

/* blablabla */

I still maintain the point that designing a monolithic kernel
in 1991 is a fundamental error. Be thankful you are not my
student. You would not get a high grade for such a design :-)

/* blablabla */

Prof. Andrew S. Tanenbaum (a...@cs.vu.nl)

Chronology of the Linux Kernel

Year	Version	Features
1994	1.0	stable kernel with basic UNIX functionalities
1995	1.2–1.3	round-robin scheduler, loadable modules, /dev/random
1996	2.0	PowerPC support, multicore, improved networking, Tux
1999	2.2	frame buffer, NTFS, FAT32, IPv6, USB, SLAB allocator
2001	2.4	new file systems (ext3, XFS, tmpfs), netfilter
2003	2.6	preemptible kernel, O(1) scheduler, ALSA
2004	2.6.4–2.6.10	EFI support, x86-64, ARMv6, CFQ IO scheduler
2005	2.6.14	FUSE support
2007	2.6.20–2.6.23	KVM, tickless kernel, SLUB allocator, CFS scheduler
2008	2.6.24–2.6.28	cgroups, ext4
2011	2.6.39	removal of the Big Kernel Lock (BKL)
2014	3.14–3.18	OverlayFS, eBPF, kernel address space layout randomization (KASLR)
2015	4.0	live patching
2018	4.15	kernel page table isolation (security mitigations)
2019	5.1	io_uring
2020	5.6	wireguard
2022	6.1	multi-gen LRU eviction algorithm, initial Rust support
2023	6.6	new EEVDF scheduler
2024	6.12	PREEMPT_RT, sched_ext

Linux Kernel Architecture

Linux offers six main functions:

Process management
Memory management
Network management
Storage management
System interface
Human interface

through five abstraction layers:

User space interfaces

System calls, procfs, sysfs, device files, …
Virtual subsystems

Virtual memory, virtual filesystem, network protocols, …
Functional subsystems

Filesystems, memory allocators, scheduler, …
Devices control

Interrupts, generic drivers, block devices, …
Hardware interfaces

Device drivers, architecture-specific code, …

Linux Kernel Map

Source: https://makelinux.github.io/kernel/map/

Linux Kernel Source Tree Structure

1. Tools and environment

2. Core components

3. Specific subsystems

4. Drivers and architecture-specific code

arch/

block/

COPYING

CREDITS

crypto/

Documentation/

drivers/

fs/

include/

init/

ipc/

Kbuild

Kconfig

kernel/

lib/

MAINTAINERS

Makefile

mm/

net/

README

REPORTING-BUGS

samples/

scripts/

security/

sound/

tools/

usr/

virt/

Linux Kernel Source Tree Structure

1. Tools and environment

2. Core components

3. Specific subsystems

4. Drivers and architecture-specific code

arch/

block/

COPYING

CREDITS

crypto/

Documentation/

drivers/

fs/

include/

init/

ipc/

Kbuild

Kconfig

kernel/

lib/

MAINTAINERS

Makefile

mm/

net/

README

REPORTING-BUGS

samples/

scripts/

security/

sound/

tools/

usr/

virt/

Linux Kernel Source Tree Structure

1. Tools and environment

2. Core components

3. Specific subsystems

4. Drivers and architecture-specific code

arch/

block/

COPYING

CREDITS

crypto/

Documentation/

drivers/

fs/

include/

init/

ipc/

Kbuild

Kconfig

kernel/

lib/

MAINTAINERS

Makefile

mm/

net/

README

REPORTING-BUGS

samples/

scripts/

security/

sound/

tools/

usr/

virt/

Linux Kernel Source Tree Structure

1. Tools and environment

2. Core components

3. Specific subsystems

4. Drivers and architecture-specific code

arch/

block/

COPYING

CREDITS

crypto/

Documentation/

drivers/

fs/

include/

init/

ipc/

Kbuild

Kconfig

kernel/

lib/

MAINTAINERS

Makefile

mm/

net/

README

REPORTING-BUGS

samples/

scripts/

security/

sound/

tools/

usr/

virt/

Linux Kernel Source Tree Structure

1. Tools and environment

2. Core components

3. Specific subsystems

4. Drivers and architecture-specific code

arch/

block/

COPYING

CREDITS

crypto/

Documentation/

drivers/

fs/

include/

init/

ipc/

Kbuild

Kconfig

kernel/

lib/

MAINTAINERS

Makefile

mm/

net/

README

REPORTING-BUGS

samples/

scripts/

security/

sound/

tools/

usr/

virt/

Linux Kernel Source Tree Structure (2)

Tools and environment:

Documentation/

scripts/

usr/

tools/

samples/

text documentation, in addition to comments

scripts used for configuration, formatting, etc…

utilities to generate the Linux image

user space tools to interact with the kernel

code samples (a good place to start)

Core components:

init/

kernel/

lib/

include/

kernel start up code (including main.c)

main kernel components code

libc used to build the kernel

headers

Specific subsystems

block/

crypto/

fs/

ipc/

mm/

net/

security/

sound/

virt/

drivers for block devices

cryptographic algorithms, hashes, …

file systems

inter-process communication

memory management

network support

kernel security mechanisms

sound drivers, audio support

virtualisation support (kvm)

Drivers

arch/

drivers/

architecture-specific code for each processor family

drivers for various hardware

Chapter 2: C Bootcamp and Kernel Programming

C Bootcamp

Function Inlining

The inline keyword allows the compiler to replace a function call by the body of the called function.

Pros

Save the cost of a function call
Allow more optimisations

Cons

Increase the code size, thus more cache misses
More pressure on registers

Inlined function definition

inline int max(unsigned int a, unsigned int b)
{
    return (a > b) ? a : b;
}

Initial call location

int f(unsigned int y)
{
    return max(y, 2 * y);
}

After inlining

int f(unsigned int y)
{
    return (y > 2 * y) ? y : 2 * y;
}

After optimisations

int f(unsigned int y)
{
    return 2 * y;
}

Branch Prediction Annotations

gcc (and most compilers) allow programmers to hint at a branch prediction with the likely() and unlikely() annotations.

These annotations are not POSIX-compliant, but supported by gcc and clang (at least).

static void next_reap_node(void)
{
    int node = __this_cpu_read(slab_reap_node);

    node = next_node(node, node_online_map);

    if (unlikely(node >= MAX_NUMNODES))
        node = first_node(node_online_map);

    __this_cpu_write(slab_reap_node, node);
}

Enforcing a Calling Convention

The asmlinkage annotation tells the compiler to always place the arguments of a function on the stack.

Without it, gcc may try to optimise function calls by placing arguments in registers instead.

Using asmlinkage prevents this optimisation, simplifying calling this function from assembly code.

It is mainly used in system calls in order to enforce the calling convention.

asmlinkage long sys_close(unsigned int fd);

In practice, asmlinkage is a macro defined in asm/linkage.h:

#define asmlinkage CPP_ASMLINKAGE __attribute__((syscall_linkage))

Unions

A union is a special type that allows storing different types of data at the same memory location.
Each member of a union is a typed alias of the same memory location.
The allocated size is equal to the size of the largest member of the union.

union {
    short x;
    long y;
    float z;
} my_union_t;

Examples in the kernel

union thread_union {
    struct thread_info thread_info;
    unsigned long stack[THREAD_SIZE/sizeof(long)];
};

The struct page is one of the worst union example $\rightarrow$

struct page {
    unsigned long flags;        /* Atomic flags, some possibly
                     * updated asynchronously */
    /*
     * Five words (20/40 bytes) are available in this union.
     * WARNING: bit 0 of the first word is used for PageTail(). That
     * means the other users of this union MUST NOT use the bit to
     * avoid collision and false-positive PageTail().
     */
    union {
        struct {    /* Page cache and anonymous pages */
            /**
             * @lru: Pageout list, eg. active_list protected by
             * lruvec->lru_lock.  Sometimes used as a generic list
             * by the page owner.
             */
            union {
                struct list_head lru;

                /* Or, for the Unevictable "LRU list" slot */
                struct {
                    /* Always even, to negate PageTail */
                    void *__filler;
                    /* Count page's or folio's mlocks */
                    unsigned int mlock_count;
                };

                /* Or, free page */
                struct list_head buddy_list;
                struct list_head pcp_list;
            };
            /* See page-flags.h for PAGE_MAPPING_FLAGS */
            struct address_space *mapping;
            union {
                pgoff_t index;      /* Our offset within mapping. */
                unsigned long share;    /* share count for fsdax */
            };
            /**
             * @private: Mapping-private opaque data.
             * Usually used for buffer_heads if PagePrivate.
             * Used for swp_entry_t if PageSwapCache.
             * Indicates order in the buddy system if PageBuddy.
             */
            unsigned long private;
        };
        struct {    /* page_pool used by netstack */
            /**
             * @pp_magic: magic value to avoid recycling non
             * page_pool allocated pages.
             */
            unsigned long pp_magic;
            struct page_pool *pp;
            unsigned long _pp_mapping_pad;
            unsigned long dma_addr;
            atomic_long_t pp_ref_count;
        };
        struct {    /* Tail pages of compound page */
            unsigned long compound_head;    /* Bit zero is set */
        };
        struct {    /* ZONE_DEVICE pages */
            /** @pgmap: Points to the hosting device page map. */
            struct dev_pagemap *pgmap;
            void *zone_device_data;
            /*
             * ZONE_DEVICE private pages are counted as being
             * mapped so the next 3 words hold the mapping, index,
             * and private fields from the source anonymous or
             * page cache page while the page is migrated to device
             * private memory.
             * ZONE_DEVICE MEMORY_DEVICE_FS_DAX pages also
             * use the mapping, index, and private fields when
             * pmem backed DAX files are mapped.
             */
        };

        /** @rcu_head: You can use this to free a page by RCU. */
        struct rcu_head rcu_head;
    };

    union {     /* This union is 4 bytes in size. */
        /*
         * For head pages of typed folios, the value stored here
         * allows for determining what this page is used for. The
         * tail pages of typed folios will not store a type
         * (page_type == _mapcount == -1).
         *
         * See page-flags.h for a list of page types which are currently
         * stored here.
         *
         * Owners of typed folios may reuse the lower 16 bit of the
         * head page page_type field after setting the page type,
         * but must reset these 16 bit to -1 before clearing the
         * page type.
         */
        unsigned int page_type;

        /*
         * For pages that are part of non-typed folios for which mappings
         * are tracked via the RMAP, encodes the number of times this page
         * is directly referenced by a page table.
         *
         * Note that the mapcount is always initialized to -1, so that
         * transitions both from it and to it can be tracked, using
         * atomic_inc_and_test() and atomic_add_negative(-1).
         */
        atomic_t _mapcount;
    };

    /* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
    atomic_t _refcount;

#ifdef CONFIG_MEMCG
    unsigned long memcg_data;
#elif defined(CONFIG_SLAB_OBJ_EXT)
    unsigned long _unused_slab_obj_exts;
#endif

    /*
     * On machines where all RAM is mapped into kernel address space,
     * we can simply calculate the virtual address. On machines with
     * highmem some memory is mapped into kernel virtual memory
     * dynamically, so we need a place to store that address.
     * Note that this field could be 16 bits on x86 ... ;)
     *
     * Architectures with slow multiplication can define
     * WANT_PAGE_VIRTUAL in asm/page.h
     */
#if defined(WANT_PAGE_VIRTUAL)
    void *virtual;          /* Kernel virtual address (NULL if
                       not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */

#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
    int _last_cpupid;
#endif

#ifdef CONFIG_KMSAN
    /*
     * KMSAN metadata for this page:
     *  - shadow page: every bit indicates whether the corresponding
     *    bit of the original page is initialized (0) or not (1);
     *  - origin page: every 4 bytes contain an id of the stack trace
     *    where the uninitialized value was created.
     */
    struct page *kmsan_shadow;
    struct page *kmsan_origin;
#endif
} _struct_page_alignment;

Structures in Memory

A structure is a collection of one or more variables.

struct version {
    unsigned short major; // usually 2 bytes
    unsigned long minor;  // usually 8 bytes
    char flags;           // 1 byte
};

Memory alignment

A memory access is aligned if the accessed address is a multiple of the size of the access.

Example: an access to an unsigned long is aligned if the address is a multiple of 8 bytes.

Ordering and Padding

The only guarantee in the C language is the order of the members! The compiler can add padding between members for performance reasons.

An optimised version of the structure:

struct version {
    unsigned long minor;  // usually 8 bytes
    unsigned short major; // usually 2 bytes
    char flags;           // 1 byte
};

Variable-length Arrays

In C, an array must have a size. It is common to use a struct to keep it close to the array:

struct buf {
    char *buffer;
    size_t length;
};

This has several drawbacks:

Allocation is done in two steps (allocate the struct, then allocate the array)

struct buf *alloc_buffer(size_t length)
{
    struct buf *b = malloc(sizeof(struct buf));
    b->length = length;
    b->buffer = malloc(length);

    return b;
}

Freeing also requires two calls to free

void free_buffer(struct buf *b)
{
    free(b->buffer);
    free(b);
}

Copying requires a manual deep copy

struct buf *copy_buf(struct buf *b)
{
    struct buf *copy = alloc_buffer(b->length);
    memcpy(copy->buffer, b->buffer, b->length);
    
    return copy;
}

Tail-padded Structures

One way to overcome this is called tail-padded structures: placing an undefined size array as the last member of a structure.

struct buf {
    size_t length;
    char buffer[];
};

struct buf *alloc_buffer(size_t length)
{
    struct buf *b = malloc(sizeof(struct buf) + length);
    b->length = length;

    return b;
}

Allocation, free, and copy can be done in one go.

Multiple implementations are possible:

int buffer[]: in the C99 standard (flexible array member), preferred form
int buffer[1]: non-standard, but supported by compilers
int buffer[0]: non-standard, but supported by compilers

Array vs Pointer

void main(void)
{
    char *yes = "da";
    char ja[3];

    yes = ja;
    ja = yes;
}

If you run this code, you get this error:

foo.c: In function ‘main’:
foo.c:6:12: error: assignment to expression with array type
    6 |         ja = yes;
      |            ^

yes is a pointer to a char (here, the first character of the string "da").

ja is an array identifier, a symbolic constant.

Array Identifiers Ambiguity

void main(void)
{
    char *yes = "da";
    char ja[3];

    printf("yes: %p - %p\n", yes, &yes);
    printf("ja:  %p - %p\n", ja, &ja);
}

results in:

yes: 0x55b70c1d7004 - 0x7ffcf5fe4268
ja:  0x7ffcf5fe4275 - 0x7ffcf5fe4275

A symbolic constants’s address doesn’t really make sense, so the compiler gives it the value of the constant (hence ja == &ja).

Function Pointers (1)

Declaration

A function pointer is declared with the following syntax:

return_type(*function_name)(parameter_list);

Example 1: a function taking no parameters and returning nothing

void (*func_p)(void);

Example 2: a function taking an int and a char, and returning an int

int (*func_p)(int, char);

Addressing

You can get a function’s address with the & operator.

void my_func(int foo)
{
    // body
}

void (*func_ptr)(void);     // declaration
func_ptr = &my_func;        // assignment

Function pointers are also symbolic constants, which means you can use the naming ambiguity:

func_ptr = my_func;            // assignment

Function Pointers (2)

Calling a function pointer

void say_hello(char *name)
{
    printf("Hello %s\n", name;
}

int main(void)
{
    void (*func_ptr)(char *);   // declaration
    func_ptr = say_hello;       // assignment
    (*func_ptr)("zero");        // call

    return 0;
}

Since function pointers are symbolic constants, you can write:

func_ptr("zero");

Function Pointers (3)

As a function argument

Function pointers are frequently used in the kernel to set up callbacks.

void free_elem(struct elem *e)
{
    free(e);
}

void put_elem(struct elem *e, void (*release(struct elem *)))
{
    e->refcount--;
    if (!e->refcount)
        release(e);
}

int main(void)
{
    struct elem *e = malloc(sizeof(struct elem));
    put_elem(e, free_elem);
}

As a return value

int atoi(const char *nptr) { /* body */ }

int (*func_ptr(void)) (const char *)
{
    return atoi;
}

Macros

Macros for constants

#define MAX_CONNECTIONS 256

Macros as functions

#define max(a, b) ((a) > (b) ? (a) : (b))

Warning!

What happens with this code?

#define sqr(a) a * a
int a = 3;
int a_sqr = sqr(a + 1);  // expected: a_sqr = 4^2 = 16

#define sqr(a) ((a) * (a))
int a = 3;
int a_sqr = sqr(a + 1);  // expected: a_sqr = 4^2 = 16

sqr(a + 1) is expanded as 3 + 1 * 3 + 1, which returns 7.

sqr(a + 1) is expanded as ((3 + 1) * (3 + 1)), which returns 16.

Always put parenthesis around the arguments, as the macro’s expansion might provoke unwanted behaviour!

Macros (2)

Macros as code blocks

#define kthread_init_delayed_work(dwork, fn)        \
    do {                                            \
        kthread_init_work(&(dwork)->work, (fn));    \
        timer_setup(&(dwork)->timer,                \
                 kthread_delayed_work_timer_fn,     \
                 TIMER_IRQSAFE);                    \
    } while (0)

The do ... while(0) construct allows the use of this macro:

as a function: just add a ; after
e.g., kthread_init_delayed_work(dwork, fn);
as a condition: in an if or while, this will evaluate to the result of the last instruction of the loop
e.g., if (kthread_init_delayed_work(dwork, fn)) foo(x);
as an argument/operand: same as for conditions
e.g., bar(kthread_init_delayed_work(dwork, fn));

Good Practices in the Kernel

Wise Use of the Stack

Kernel stack is small compared to user stack!

Stack size is statically defined at kernel compile time, cannot grow dynamically.

Usually fits on a few pages:

8 KB for 32-bit architectures
16 KB for 64-bit architectures

What to avoid?

Large allocations on the stack
Deep recursive call chains

Floating-Point Operations

Avoid floating point operations at all cost!

Why?

Extremely costly!

Enable the FPU (Floating-Point Unit)
Save all user space state related to the FPU (i.e., registers)
Disable the FPU

Not very useful!

No access to the libc, so no existing complex functions
You can only use inline functions from gcc
Most of the time, you can work around this with integer approximation

On the Dangers of Kernel Programming

Making changes to your kernel can render it unstable and lead to a kernel panic, i.e., a full system crash.

Keep a backup kernel

Never replace your running kernel with a new one!
Always keep a fully working backup kernel installed in your bootloader!

Work in modules

Always implement your changes as modules if possible.

That will limit the impact of some crashes in your code on the rest of the kernel.
Easier to test since you can load modules dynamically, test, unload, make changes and repeat

Important

Keep in mind that a bug can corrupt persistent data, e.g. on your hard drive. You could lose data for good if you work directly on your system!

Tip

Working in a virtual machine alleviates most of these issues!

Kernel APIs

Linux Kernel API

In the kernel, you won’t have access to the usual libraries like the libc.

Thankfully, the kernel provides its own internal “library” with basic functionalities.

They are described in Documentation/core-api/index.rst.

Let’s make a quick tour of some of these functionalities!

Generic base data types
Returning errors
Printing
Memory allocation
Waiting for resources
Task queues

Use Generic Types!

To ensure portability across architectures, the kernel offers generic types defined in include/linux/types.h

u8: unsigned byte (8 bits)
u16: unsigned word (16 bits)
u32: unsigned doubleword (32 bits)
u64: unsigned quadword (64 bits)

s8: signed byte (8 bits)
s16: signed word (16 bits)
s32: signed doubleword (32 bits)
s64: signed quadword (64 bits)

If a variable is visible from user space (e.g., ioctl), you must use types prefixed with __ (double underscore)

__u8        __s8
__u16       __s16
__u32       __s32
__u64       __s64

Returning Errors

Functions in the kernel follow the same convention as system calls by returning an integer:

Success: a value $\ge 0$
Error: the negative value of the error code (i.e., -errno)

If the function returns a pointer:

Success: the pointer to return
Failure: two possibilities
- Return NULL if there is only one reason to fail
- Return the error code encoded with the ERR_PTR() macro.
  
  The calling function can check if there was an error with IS_ERR() and get the error code with PTR_ERR()

int do_shash(unsigned char *name, unsigned char *result, const u8 *data1, unsigned int data1_len,
          const u8 *data2, unsigned int data2_len, const u8 *key, unsigned int key_len)
{
    int rc;
    unsigned int size;
    struct crypto_shash *hash;
    struct sdesc *sdesc;

    hash = crypto_alloc_shash(name, 0, 0);
    if (IS_ERR(hash)) {
        rc = PTR_ERR(hash);
        pr_err("%s: Crypto %s allocation error %d\n", __func__, name, rc);
        return rc;
    }
    /* ... */

Printing

If you need to print information to be available from user space, e.g., tracing or debugging, you can use the printk() function.
It works similarly to printf(), with a couple of differences:

You should prefix your format string with a priority level defined by macros in include/linux/kern_levels.h, from KERN_EMERG to KERN_DEBUG.
The output doesn’t go to stdout, but in the kernel ring buffer that you can read from user space with the dmesg command, or with journalctl (and other commands)

Example:

printk(KERN_ERR "%s:%d: this shouldn't be reached...\n", __FILE__, __LINE__);

There are also predefined macros for each level:

pr_debug("debug message\n");
pr_info("info message\n");
pr_err("error message\n");

Tip

Formats are available at Documentation/printk-formats.txt.

Filtering your prints

You can define, at the top of your module, the following macro to add a prefix to all your prints:

#define pr_fmt(fmt) "%s:%s: " fmt, KBUILD_MODNAME, __func__

This will add your module name and the name of the function as a prefix to all you prints.

Memory Management

Memory allocation is done with the kmalloc() function, similar to malloc().
Some specific characteristics:

Fast (except if blocked waiting for pages)
Allocated memory is not initialised
Allocated memory is contiguous in physical memory
Memory is allocated by areas of $2^n - k$ bytes ($k$: a few metadata bytes).

Do not allocate 1024 B if you need 1000 B, you will end up with 2048 B!

Example:

data = kmalloc(sizeof(*data), GFP_KERNEL);

kmalloc GFP flags

The second parameter of kmalloc() is a Get Free Pages (GFP) flag:

GFP_KERNEL: Regular kernel allocation.
Can be blocking. Best choice for most cases.
GFP_ATOMIC: Non blocking allocation.
Use only in non-interruptible code.
GFP_USER: Allocate memory for a user space process.
Can block. Lowest priority.

GFP_NOIO: Can block, but no I/O can be executed.
GFP_NOFS: Can block, but no file system operation can be executed.
GFP_HIGHUSER: Allocate memory in user space high memory ($\gt 4$ GB). Can block. Low priority.

More combinations available in include/linux/gfp_types.h.

Memory Management (2)

If you need large chunks of memory, you should not use kmalloc(), and request pages directly with one of these functions:

unsigned long get_zeroed_page(int flags);

unsigned long __get_free_page(int flags);

unsigned long __get_free_pages(int flags, unsigned long order);

returns a pointer to a free page after filling it with zeros

returns a pointer to a free page

returns a pointer to a memory area with $2^{order}$ contiguous pages

Virtual allocation

If you don’t need the memory to be contiguous, you can allocate in the virtual address space instead of physical:

void *vmalloc(unsigned long size);
void vfree(void *addr);

Mapping physical to virtual addresses

You can also map a physical memory location into the virtual address space:

void *ioremap(unsigned long phys_addr, unsigned long size);
void iounmap(void *addr);

Waiting for Resources

If you need to wait for a resource (e.g., network packet, message), the interface should implement a wait queue to allow your thread to sleep and be woken up when the resource is available.

wait_event(wait_queue, condition);

wait_event_interruptible(wait_queue, condition);

thread sleeps and will be woken up if wake_up() is called on the wait queue and the condition is true

same as wait_event(), but the thread can also be woken up by a signal

The resource handler calls wake_up() on the queue to wake up waiting threads.

Workqueues

Workqueues allow you to execute code asynchronously.

At creation time, a pool of thread is initialised.

Jobs can then be submitted in the form of a function pointer and a pointer to an argument.

A thread from the workqueue will, asynchronously, check the queue, pop a job and execute it.

Tip

Documentation available in Documentation/core-api/workqueue.rst.

Generic Data Structures

The kernel also offers generic data structures to work with:

Linked lists
Maps
Circular buffers
Red-black trees
…

Important

Generic data structures in C are not obvious to build…

Generic Data Structures in C

Instead of having objects in a list, we have the list in the objects!

The “naive” version:

struct elem {
    struct object {
        int v0, v1;
    } obj;
    struct elem *next, *prev;
};

Not generic! You need one list type of list per object type.

The “good” version:

struct object {
    int v0, v1;
    struct list_head {
        struct list_head *next, *prev;
    } list;
};

You only need one list_head type to be defined, and you can reuse it for any object type!

When you iterate over the list, how do you get the containing object?

Container of

From the address of any member in a structure, how can we get the address of the structure?
e.g., from the address of a list_head element in a structure

Linux implements the container_of macro!

/**
 * container_of - cast a member of a structure out to the containing structure
 * @ptr:    the pointer to the member.
 * @type:   the type of the container struct this is embedded in.
 * @member: the name of the member within the struct.
 *
 * WARNING: any const qualifier of @ptr is lost.
 */
#define container_of(ptr, type, member) ({              \
    void *__mptr = (void *)(ptr);                   \
    static_assert(__same_type(*(ptr), ((type *)0)->member) ||   \
              __same_type(*(ptr), void),            \
              "pointer type mismatch in container_of()");   \
    ((type *)(__mptr - offsetof(type, member))); })

After expanding all macros, this looks like this:

#define offset_of(type, member) \
    (&((type *)0)->member)

#define container_of(ptr, type, member) \
    ((type *)(((void *)ptr - offset_of(type, member))))

offset_of: cast the address 0 to type * and access the member
container_of: substract this offset from the address of the member

Generic Data Structure Helpers

For each generic data structure, the kernel provides helpers to use them.

Let’s see examples for circular doubly linked lists (list_head from include/linux/list.h):

Allocators:

LIST_HEAD(name)

Insert/delete:

static inline void list_add(struct list_head *new, struct list_head *head);
static inline void list_del(struct list_head *entry);

Iterators:

list_for_each(pos, head)
list_for_each_entry(pos, head, member)

And a lot more!

Tip

You can find similar helpers for all generic data structures.
Go check them out in the kernel sources!

Concurrency in the Kernel

Resources/objects can be accessed concurrently in the kernel.

There are two reasons this can happen:

Preemption: Since version 2.6, the Linux kernel is preemptible.
This means that kernel code can be interrupted by higher priority code, e.g., device interrupt.
Multi-core processors: With multi-core CPUs, two threads can execute kernel code in parallel, and thus access the same kernel data concurrently.

Possible solutions:

Mask interrupts
Big Kernel Lock
Synchronisation primitives: semaphores, spinlocks
Atomic operations

Masking Interrupts

Concurrency problems can arise due to preemption both on single- and multi-core systems.

In the case of single-core CPUs, it can be solved solely by disabling interrupts, making the kernel code non-preemptible.

In Linux, you can use the following macros:

local_irq_disable(): this uses the proper assembly instruction to disable interrupts on the current core, e.g., cli on x86.
local_irq_enable(): this uses the proper assembly instruction to enable interrupts on the current core, e.g., sti on x86.

Example: A driver for a joystick in drivers/input/joystick/analog.c

static int analog_cooked_read(struct analog_port *port)
{
    /* some code */
    local_irq_disable();
    this = gameport_read(gameport) & port->mask;
    now = ktime_get();
    local_irq_restore(flags);
    /* some code */
}

Big Kernel Lock

On multi-core systems, disabling interrupts is not sufficient, as other cores might also access data concurrently.

One potential solution is to serialise all kernel code, allowing only one thread at a time to execute code in supervisor mode.

This was the initial solution used in Linux when support for multi-core CPUs was added.
The Big Kernel Lock (BKL) was taken when entering the kernel and released when exiting.
Only one thread at a time was running kernel code.

Pro: Extremely simple to implement and safe

Con: Large performance degradation due to the loss of parallelism for kernel code

Linux and the BKL

Linux had a Big Kernel Lock from the introduction of Symmetric Multi-Processor (SMP) in version 2.0 in 1999 until its removal in 2.6.39 in 2011.

Synchronisation Primitives

Since the BKL removal, fine-grained synchronisation mechanisms are used in the kernel.

A non-exhaustive list of synchronisation mechanisms and their (partial) API:

Mutexes

void mutex_lock(struct mutex *lock);
int mutex_trylock(struct mutex *lock);
void mutex_unlock(struct mutex *lock);

Semaphores

void down(struct semaphore *sem);
void up(struct semaphore *sem);

Spinlocks

void spin_lock(spinlock_t *lock);
void spin_unlock(spinlock_t *lock);

Readers/writer locks

void read_lock(rwlock_t *lock);
void write_lock(rwlock_t *lock);
void read_unlock(rwlock_t *lock);
void write_unlock(rwlock_t *lock);

Note

Most of these have variations that also disable interrupts when taking a lock.

Atomic Operations

Concurrent access problems can also be solved by using atomic operations in some cases.

Atomic operations are architecture-specific, and are defined in include/linux/atomic/atomic-instrumented.h

These operations should be used on a specific type, atomic_t, to represent the atomic variable.

You can find the usual atomic operations, for example:

void atomic_add(int i, atomic_t *v);
void atomic_dec(atomic_t *v);
void atomic_or(int i, atomic_t *v);
And many others…

Coding Style

The Linux Kernel Coding Style

Defined in Documentation/process/coding-style.rst.

It defines a set of rules that will be enforced when a patch is submitted:

Indentation
Line length
Spaces and braces
Error management
etc.

Tip

You can check if your patches are valid with regard to the coding style with the scripts/checkpatch.pl script!

Some Coding Style Rules

Indentation

Indentation is done with tabs, not spaces.
Tabs are 8 characters long.

Line length

For better readability, the preferred limit on the length of a line is 80 characters.
However, never break user-visible strings, as it also breaks the ability to grep them.

Since 2020, checkpatch.pl only complains about lines longer than 100 characters.

Too restrictive?

From the coding style documentation:

Now, some people will claim that having 8-character indentations makes the code
move too far to the right, and makes it hard to read on a 80-character terminal
screen. The answer to that is that if you need more than 3 levels of
indentation, you're screwed anyway, and should fix your program.

Some Coding Style Rules (2)

Braces

Opening braces are on the same line as the block they open, except for functions where they are on the next line
Closing braces are alone on an empty line
Don’t use braces to surround single-line statements…
… except if another branch of a conditional has multiple statements

for (int i = 0; i < 10; i++) {
    printk("%d\n", i);
}

int inc(int x)
{
    return ++x;
}

if (!x)
    x++;

if (x > y)
    y++;
else
    x++;

if (x > y) {
    y++;
} else {
    x++;
    y--;
}

Some Coding Style Rules (3)

Spaces

Philosphy of function-versus-keyword usage.

No spaces after functions
Spaces after keywords (except if they are used like function, e.g., sizeof)

if, switch, case, for, do, while

For operators:

Spaces on both sides of binary and ternary operators

= + - < > * / % | & ^ <= >= == != ? :
No space after unary operators or before/after postfix/prefix increment and decrement

& * + - ~ ! ++ –
No space around structure member operators

. ->
No trailing spaces at the end of lines

Chapter 3: Implementing Kernel Modules

A Quick Tour of the Kernel Configuration and Build System

Getting the Sources

From the official kernel website, kernel.org, download the tarball archive.

Or from the command line, for example:

$ wget https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-6.5.7.tar.xz

You can (and should) also check out the integrity of the tarball with the pgp signature:

$ wget https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-6.5.7.tar.sign
$ unxz linux-6.5.7.tar.xz    # the signature is done on the decompressed tarball
$ gpg --verify linux-6.5.7.tar.sign linux-6.5.7.tar

This will probably fail because you don’t have the public keys of the maintainers that generated the tarball.

Get them from the kernel’s key server (documentation):

$ gpg2 --locate-keys torvalds@kernel.org gregkh@kernel.org

You can also clone Linus Torvalds’ git tree:

$ git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/

Configuring the Kernel

The kernel configuration describes the features that will be enabled in the built binary, as well as change their behavior.
It also describes if features should be built-in the binary or compiled as modules.

By default, the Makefile-based build system uses the file .config located at the root of the kernel sources.

You can generate initial configurations with the following commands (non-exhaustive list):

$ make allnoconfig     # minimal, everything that can be disabled is disabled
$ make defconfig       # default configuration for the local architecture
$ make localmodconfig  # configuration based on the current state of the machine (plugged devices, etc.) and builds them as modules
$ make localyesconfig  # same but everything is built-in
$ make oldconfig       # keeps the values of the current .config and asks for the new options

Or you can copy the configuration of your running kernel:

$ cp /boot/config-$(uname -r)* .config  # available on some distros
$ zcat /proc/config.gz > .config        # available if CONFIG_IKCONFIG_PROC is enabled

If you need to know more about your hardware to generate your config, check out these commands:

lshwd, lscpu, lspci, lsusb, …
dmidecode
hdparm
cat /proc/cpuinfo, cat /proc/meminfo, …
dmesg

Building the Kernel

The kernel build system is based on Makefiles.

Just run make to compile it.

$ time make

real 80m15.486s
user 74m54.606s
sys 5m32.300s

Compilation can take a long time, so do it in parallel!

$ make -j $(nproc)

The compilation produces the following important files:

vmlinux: the raw Linux kernel image. This ELF is used for debugging and profiling;
System.map: symbol table of the kernel. Not necessary to run the kernel, used for debugging;
arch/<arch>/boot/bzImage: compressed image of the kernel. This is the one that will be loaded and used.

$ du -sh vmlinux arch/x86/boot/bzImage

49M vmlinux
13M arch/x86/boot/bzImage

You can get some info on the image with the file command:

$ file arch/x86/boot/bzImage

arch/x86/boot/bzImage: Linux kernel x86 boot executable bzImage, version 6.5.7-lkp (redha@wano) \
#2 SMP PREEMPT_DYNAMIC Thu Oct 19 16:05:37 CEST 2023, RO-rootFS, swap_dev 0XC, Normal VGA

Installing a Kernel

Two main steps:

Install the kernel image, the symbol map and the initrd

$ make install

This will copy the image and symbol map in /boot, and generate the initramfs.

Install the modules

$ make modules_install

This will copy the modules (.ko files) into /lib/modules/<version>.

About the Symbol Map

The symbol map (System.map) provides the list of the symbols available in this kernel, their address and type.

$ head System.map

0000000000000000 D __per_cpu_start
0000000000000000 D fixed_percpu_data
0000000000001000 D cpu_debug_store
0000000000002000 D irq_stack_backing_store
0000000000006000 D cpu_tss_rw
000000000000b000 D gdt_page
000000000000c000 d exception_stacks
0000000000014000 d entry_stack_storage
0000000000015000 D espfix_waddr
0000000000015008 D espfix_stack

Check the manpage of the nm program for an explanation of the types.

As a rule of thumb (mostly true), lowercase means local scope while uppercase means global scope (i.e., exported symbol).

Linux Init Process

Back to the Lab

In Lab 2, task 3, you were asked to replace the init binary by a hello_world program, which led to a kernel panic. Why?

Roles of init

Initialise the system: start daemons/services, manage user sessions, mount partitions, etc.
Ancestor of all the processes on the system
Adopt all orphaned processes

Characteristics of init

Has PID 1
Cannot die

Demo time!

Linux Kernel Modules

Development Infrastructure

Multiple development methods:

Local setup
Use your usual development software, compile and run your new modules/kernel on your machine.
Pros: easy, quick
Cons: if a crash occurs, you can do nothing
Remote machine
If you have access to a separate testing machine, you can do your development on your machine and test remotely to avoid the crash issues. This machine is usually hooked through network/serial to the development machine to allow remote debugging and monitoring.
Pros: good development setup, robust to crashes
Cons: not always possible to have a second machine
Virtual machine
Develop on your local setup and deploy on a virtual machine. This replaces the previous method well, while being faster to use, and doesn’t require a second machine.
Pros: good development setup, robust to crashes, single machine
Cons: doesn’t always perfectly capture real hardware, might be slow depending on the host and guest machines

In this course, we will use the last method with QEMU as a hypervisor.

Kernel Modules Interface

A module is a library dynamically loaded into the kernel. It triggers a call to a registered function when loaded and when unloaded.

The kernel provides two macros to register these functions: module_init() and module_exit().

static int my_init(void)
{
      /* ... */
      return 0;
}
module_init(my_init);

static void my_exit(void)
{
      /* ... */
}
module_exit(my_exit);

For these to work, you will need some header files included:

// contains the module API
#include <linux/module.h>

/// contains the init and exit macros
#include <linux/init.h>

/// if needed: base types, functions, macros...
#include <linux/kernel.h>

Module Information

You should also add some information about your module with some pre-defined macros, usually at the beginning of the file:

MODULE_DESCRIPTION("Hello world module");
MODULE_AUTHOR("Redha Gouicem, RWTH");
MODULE_LICENSE("GPL");

These can be checked on any module:

$ modinfo hello.ko

filename: hello.ko
description: Hello World module
author: Redha Gouicem, RWTH
license: GPL
vermagic: 6.5.7-ARCH 686 gcc-13.2.1
depends:

Warning

The license is not only informative. It is also used to check if you are allowed to use some symbols in the kernel.

Example: Hello World

#include <linux/module.h>
#include <linux/init.h>
#include <linux/kernel.h>

MODULE_DESCRIPTION("Hello world module");
MODULE_AUTHOR("Redha Gouicem, RWTH");
MODULE_LICENSE("GPL");

static int __init hello_init(void)
{
      pr_info("Hello World!\n");

      return 0;
}
module_init(hello_init);

static void __exit hello_exit(void)
{
      pr_info("Goodbye World...\n");
}
module_exit(hello_exit);

Annotations

The __init and __exit annotations are used to help the compiler optimize the memory usage.
When some module is statically built-in the kernel binary, functions tagged with these annotations are placed in specific segments:

.init.text that is freed after the boot of the kernel
.exit.text that is never loaded in memory

Building a Module

The running kernel is deployed with a generic Makefile located in /lib/modules/$(uname -r)/build.

You can use it from anywhere like this:

$ make -C /lib/modules/$(uname -r)/build M=$PWD

This will generate your module as a .ko file (kernel object).

You can also use a custom Makefile like this one as a wrapper:

ifneq ($(KERNELRELEASE),)

  obj-m += hello.o

else

  KERNELDIR_LKP ?= /lib/modules/$(shell uname -r)/build
  PWD := $(shell pwd)

all:
        make -C $(KERNELDIR_LKP) M=$(PWD) modules

clean:
        make -C $(KERNELDIR_LKP) M=$(PWD) clean

endif

Loading/Unloading a Module

Loading a module can be done with insmod:

$ insmod hello.ko
$ dmesg

[177814.017370] Hello World!

Unloading a module can be done with rmmod:

$ rmmod hello
$ dmesg

[177919.956567] Goodbye World...

Module Parameters

#include <linux/init.h>
#include <linux/module.h>
#include <linux/moduleparam.h>

static char *month = "January";
module_param(month, charp, 0660);

static int day = 1;
module_param(day, int, 0000);

static int __init hello_init(void)
{
      pr_info("Hello ! We are on %d %s\n", day, month);
      return 0;
}
module_init(hello_init);

static void __exit hello_exit(void)
{
      pr_info("Goodbye, cruel world\n");
}
module_exit(hello_exit);

With default values:

$ insmod hello.ko
$ dmesg

[180525.067016] Hello ! We are on 1 January

With parameters:

$ insmod hello.ko month=December day=31
$ dmesg

[181086.216097] Hello ! We are on 31 December

Kernel Dynamic Linker

Like shared libraries, modules are dynamically loaded: they only have access to symbols explicitly exported to them!
By default, they have access to absolutely no variable or function from the kernel, even if they are not static!

Two macros allow to explicitly export symbols to modules:

EXPORT_SYMBOL(s) makes the symbol s visible to all loaded modules
EXPORT_SYMBOL_GPL(s) makes the symbol s visible to all modules with a license compatible with GPL (according to their MODULE_LICENSE)

Example: using the pm_power_off() function exported in arch/x86/kernel/reboot.c and available on my system:

$ grep pm_power_off /lib/modules/$(uname -r)/build/System.map

ffffffff810ed2f0 t legacy_pm_power_off
ffffffff8274d7d8 r __ksymtab_pm_power_off
ffffffff838a47f8 B pm_power_off

#include <linux/module.h>
#include <linux/kernel.h>

MODULE_DESCRIPTION("Power off module");
MODULE_LICENSE("GPL");

static int __init devil_init(void)
{
      pr_info("The end is nigh...\n");
      if (pm_power_off)
            pm_power_off();

      return 0;
}
module_init(devil_init);

Module Dependencies

If a module X uses at least one symbol from module Y, then X depends on Y.

Dependencies are not explicitly defined: they are automatically inferred during the kernel/module compilation.
You can find the list of dependencies in the file /lib/modules/<version>/modules.dep.

This file is generated by the depmod program, who checks which symbols are used by a module, and which module provide these symbols.

You can also check the dependencies of a module with modinfo.

Automated dependency solving

Obviously, modules must be inserted in the proper order: if X depends on Y, Y needs to be inserted before X.

If you are using modprobe, it will automatically insert dependencies first.

This is also true for unloading modules (in the reverse order).

Contributing to the Kernel

Patch or Module?

When developing something in the kernel, the first design choice is “how?”

You have two choices:

Implement your code in the kernel through a patch
Your code is then statically built-in the kernel binary
Implement your code in a module
Your code can be dynamically loaded by the kernel at run time

Whenever possible, modules are the best choice, as they have more chances to be merged in the mainline.

Modules: Pros and Cons

While using modules should be your first choice, it also has some drawbacks depending on what you are doing.

Pros:

Easier to develop
Easier to distribute
Avoids overloading the kernel
Lower chances of conflicts

Cons:

Internal kernel structures cannot be modified
e.g., adding a field to the file descriptor structure
Replacing/changing the behaviour of an existing kernel function
e.g., change the page frame allocator code

Patching the Kernel

If you need to modify the kernel and distribute your changes, you most likely will use patches.

A patch is the result of the diff command applied on the original files and your modified version.
It contains all the data for the patch to automatically apply the changes.

diff: Compares files line by line

Shows the added/modified/removed lines
Can ignore tabs/spaces
Can compare whole directory trees (-r)

patch: Apply changes to existing files

Can apply a patch on a file passed as an argument or stdin
Can apply a patch on a directory tree

Creating and Applying a Patch

Creating a patch for the kernel tree:

-r to enable recursive patch
-u to use the unified diff format (more compact and easier to read)

unxz linux-6.5.7.tar.xz
cp -r linux-6.5.7 linux-6.5.7-orig
cd linux-6.5.7
emacs kernel/sched/fair.c
emacs kernel/sched/sched.h
cd ..
diff -r -u linux-6.5.7-orig linux-6.5.7 > new_sched.patch
xz new_sched.patch

Applying a patch on the kernel tree:

-p 1 to omit the first level in all paths
--dry-run to only simulate the patch (for testing purposes)

unxz linux-6.5.7.tar.xz
cd linux-6.5.7
zcat new_sched.patch | patch -p 1 --dry-run

Submitting Your Work to the Kernel Community

When you think your code is ready for review by the kernel maintainer, you need to send it to them!

Note: This is just an overview of the process!

Ready your code for public eyes

Test your code first to avoid silly bugs (and being fun of publicly)
Make sure that your code is compliant with the kernel coding style and is understandable (comments?)
Use tools to helps you, e.g., checkpatch, clang-format

Prepare your patch(es)

Choose against which version of the kernel to generate your patches, usually the current mainline from Linus’ git tree (a -stable or -rc release)
Split your submission into a set of patches/commits, each being logically independent, and able to be built and run
You can manually create the patches with diff or use git format-patch to automatically generate patches from your commits (assuming you used git in the first place)

Submitting Your Work to the Kernel Community (2)

Format your patch series for emailing

Each patch email should have a one-line short description, followed by blank line and a multi-line long description, then a list of tag lines specifying who co-authored, reviewed the patch, etc. Finally, the patch should be appended.
All of this can be fairly well automated with git format-patch

Send your patch series to the mailing list

Find out to which mailing list and maintainers the email should be sent to, using the scripts/get_maintainer.pl script on your patch. You should also CC anyone who might need to see this, e.g., they work on something similar
You might need to write a first summary email (a cover letter) for your patch series
This can be fairly well automated with git send-email
The emails need to be written in plain text, no HTML!

Don’t rely on these slides only!

Go check the full version in the kernel documentation, starting with the kernel development process and patch submission process.

You can check the mailing list online at https://lore.kernel.org.

Chapter 4: User-Kernel Communication

Overview

In monolithic architectures, kernel and user programs are run in different permission levels:
supervisor mode and user mode.

There needs to be communication between the user and the kernel.

Multiple mechanisms are available in the Linux kernel, and choosing the right one is a common discussion on the mailing list.

Communication mechanisms:

Module parameters
Pseudo file systems
Sockets
System Calls
ioctl

Pseudo File Systems

In-Memory File Systems

An in-memory file system, or ramfs, is a file system not backed by a storage device.

The data stored on it is completely in memory or computed when accessed.

The Linux kernel provides a set of pseudo file systems that are ramfs representing kernel data or configuration.

They usually have a semantic where one file represents one value.

Each pseudo file system has slightly different semantics and answer to different needs: procfs, sysfs, configfs, debugfs, …

Advantage: User space programs can access these files with the standard POSIX file API: read and write.

You can thus use regular shell programs such as cat and echo to read/write from/to them.

Drawback: These mechanisms are synchronous from user to kernel, but asynchronous in the other direction,

i.e., user space applications cannot be notified when the value represented by a file changes in memory.

Using Pseudo File Systems

A pseudo file system is a component of the kernel that has to be enabled and mounted before being used.

Check that your kernel has been built with the needed pseudo file system

zcat /proc/config.gz | grep CONFIG_DEBUG_FS  # you can also grep in your .config directly

CONFIG_DEBUG_FS=y

Mount the pseudo file system

mount -t debugfs none /sys/kernel/debug

Read a value by reading from a file

cat /sys/kernel/debug/sched/wakeup_granularity_ns

Modify a value by writing to a file

echo 3000000 > /sys/kernel/debug/sched/wakeup_granularity_ns

Tip

You might need to have root privileges to read/write to/from these special files.

Reading: sudo cat <file>
Writing: echo <value> | sudo tee <file>

procfs

The oldest pseudo file system, mounted in /proc.

In lab 2, we saw that the procfs was mounted by the init script.

Goal: Export information about processes.

Since its creation, it has been used for more than that, e.g., exporting kernel data from various subsystems.

Using it is now discouraged for anything unrelated to processes.

Advantage: most widely documented

Drawback: no real structure enforced

The procfs provides two APIs:

Legacy procfs API: simple, but each file is limited to one page (PAGE_SIZE, usually 4 KB)
seq_file API: more complex, but allows larger data to be exported with a list of buffers

procfs: Example

Let’s see an example procfs file that exports a variable from the kernel in human-readable form in /proc/my_state.

We want to export the value of the system_enabled global variable.

Define a read() function that will be called when our file is read from:
- Declare a struct file_operations that will be used for our file
- Implement the read() function following the prototype

Create the file in the procfs and attach it to the struct file_operations we just created in the init function of your module
Delete the file when the module is removed, since the variable will not be available any more.

static int system_enabled;

static ssize_t system_state_read(struct file *file, char __user *buf,
                                 size_t count, loff_t *ppos)
{
    const char *tmp = system_enabled ? "The system is enabled\n"
                        : "The system is disabled\n";;
    return simple_read_from_buffer(buf, count, ppos, tmp, strlen(tmp));
}

static const struct file_operations system_state_fops = {
    .open = simple_open,
    .read = system_state_read,
    .llseek = noop_llseek,
};
static struct proc_dir_entry *system_state_proc_dir;

static int system_state_init(void)
{
    system_state_proc_dir = proc_create("my_state", 0, NULL,
                                        &system_state_fops);
    return 0;
}
module_init(system_state_init);

static void system_state_exit(void)
{
    remove_proc_entry("my_state", NULL);
}
module_exit(system_state_exit);

sysfs

Successor to the procfs, mounted in /sys.

Goal: Store information about subsystems, hardware devices, drivers, …

This should be the default choice!

Advantages:

Hierarchical topology, information is arranged logically, by subsystem, component, etc…
Provides a set of mechanisms to free memory and recursively destroy directories

Cons:

More complex than procfs
Each file represents one single piece of data
A piece of data cannot be larger than one page (PAGE_SIZE)

sysfs: Directories

The struct kobject is at the heart of the sysfs:

Each directory corresponds to a kobject
A file in the sysfs is not a kobject

Most important fields:

struct kobject {
    const char              *name;      // name of the directory
    struct kobject          *parent;    // kobject of the parent directory
    struct kset             *kset;      // the collection of kobjects this object belongs to
    struct kref              kref;      // reference counter, used to free the memory properly
    const struct kobj_type  *ktype;     // type of the object, with functions pointers to manipulate it
    /* ... */
};

To create a kobject and add it to the sysfs, you can use this functions:

struct kobject *kobject_create_and_add(const char *name, struct kobject *parent);

Caution

This works for simple cases (most likely what you want to do). For more complex scenarios, there are other functions to initialize, create and register kobjects. Check out the documentation for more information!

Don’t forget to cleanup your kobjects and their files when they are not needed anymore with:

void kobject_put(struct kobject *kobj);

sysfs: Files

Kobject attributes

Each file in the sysfs corresponds to one single value and is associated with an instance of a struct kobj_attribute.

struct kobj_attribute {
    struct attribute attr;  // file information (name, permissions)
    ssize_t (*show)(struct kobject *kobj, struct kobj_attribute *attr, char *buf);
    ssize_t (*store)(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count);
};

Kobject attributes can be created with the following macro:

#define __ATTR(_name, _mode, _show, _store)

You can also create group of attributes to have multiple files in the same directory:

static struct attribute *attrs[] = {
    &foo_attribute.attr,
    &bar_attribute.attr,
    NULL,
};

static struct attribute_group attr_grp = {
    .attrs = attrs,
};

Creating files in the sysfs

Files can be created with the following functions:

int sysfs_create_file(struct kobject *kobj, const struct attribute *attr);
int sysfs_create_group(struct kobject *kobj, const struct attribute_group *grp);

sysfs: Example 1, a Read-only Variable

We want to export the state of our system represented by an int system_enabled global variable in a human-readable form in the file located at
/sys/kernel/my_state/system_enabled.

static int system_enabled;

static ssize_t system_state_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
{
    return snprintf(buf, PAGE_SIZE, "The system is %srunning\n", system_enabled ? "" : "not ");
}

static struct kobj_attribute system_state_attribute = __ATTR(system_enabled, 0400, system_state_show, NULL);
static struct kobject *my_state_kobj;

sysfs: Example 1, a Read-only Variable (2)

Now, we need to instantiate the sysfs file when loading our module and destroy it when unloading.

static int __init my_state_init(void)
{
    int retval;
    
    my_state_kobj = kobject_create_and_add("my_state", kernel_kobj);
    if (!my_state_kobj)
        goto error_init_1;

    retval = sysfs_create_file(my_state_kobj, &system_state_attribute.attr);
    if (retval)
        goto error_init_2;

    return 0;

error_init_2:
    kobject_put(my_state_kobj);
error_init_1:
    return -ENOMEM;
}
module_init(my_state_init);

static void __exit my_state_exit(void)
{
    kobject_put(my_state_kobj);
}
module_exit(my_state_exit);

sysfs: Example 2, Let’s Add an RW File

static int system_enabled;
static u64 clock;

static ssize_t system_state_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
{
    return snprintf(buf, PAGE_SIZE, "The system is %srunning\n", system_enabled ? "" : "not ");
}

static ssize_t clock_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
{
    return snprintf(buf, PAGE_SIZE, "%llu\n", clock++);
}

static ssize_t clock_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count)
{
    u64 val;
    int rc = sscanf(buf, "%llu", &val);
    if (rc != 1 || rc < 0)
        return -EINVAL;

    clock = val;
    return count;
}

static struct kobj_attribute system_state_attribute = __ATTR(system_enabled, 0400, system_state_show, NULL);
static struct kobj_attribute clock_attribute = __ATTR(clock, 0600, clock_show, clock_store);

static struct attribute *attrs[] = {
    &system_state_attribute.attr,
    &clock_attribute.attr,
    NULL,
};
static struct attribute_group attr_grp = { .attrs = attrs };

sysfs: Example 2, Let’s Add an RW File (2)

static struct kobject *my_state_kobj;

static int __init my_state_init(void)
{
    int retval;
    
    my_state_kobj = kobject_create_and_add("my_state", kernel_kobj);
    if (!my_state_kobj)
        goto error_init_1;

    retval = sysfs_create_group(my_state_kobj, &attr_grp);
    if (retval)
        goto error_init_2;

    return 0;

error_init_2:
    kobject_put(my_state_kobj);
error_init_1:
    return -ENOMEM;
}
module_init(my_state_init);

static void __exit my_state_exit(void)
{
    kobject_put(my_state_kobj);
}
module_exit(my_state_exit);

configfs

configfs is another ram-based file system offering the converse functionality to sysfs, mounted in /sys/kernel/config.

While the sysfs is a view of kernel objects, configfs is a manager of kernel objects,
i.e., it allows to create/destroy kernel objects from user space.

Example: Allowing user programs to create and configure a virtual network devices by doing an mkdir in the configfs directory of a driver.

Advantage:

Allows user space programs to manage kernel objects

Drawbacks:

Complex to set up
One file equals one value

debugfs

The debugfs offers a very flexible and simple API targeted at simplifying kernel development and debugging.

It is mounted in /sys/kernel/debug.

Advantages:

Very flexible, no size limit for files
Very high level and simple API

Drawback:

Only for debugging purposes!

Quick non-exhaustive API tour:

struct dentry *debugfs_create_dir(const char *name, struct dentry *parent);
void debugfs_create_u32(const char *name, umode_t mode, struct dentry *parent, u32 *value);
void debugfs_create_str(const char *name, umode_t mode, struct dentry *parent, char **value);

debugfs: Example

static int system_state;

static struct dentry *my_state_dir;

static int my_state_init(void)
{
    struct dentry *new_file;

    my_state_dir = debugfs_create_dir("my_state", NULL);
    if (my_state_dir == 0)
        return -ENOTDIR;

    new_file = debugfs_create_u8("system_state", 0444, my_state_dir, (u8 *) &system_state);
    if (new_file == 0) {
        debugfs_remove_recursive(my_state_dir);
        return -EINVAL;
    }

    return 0;
}
module_init(my_state_init);

static void my_state_exit(void)
{
    debugfs_remove_recursive(my_state_dir);
}
module_exit(my_state_exit);

Synchronous Communication Mechanisms

Overview

Pseudo file system-based mechanisms are:

Synchronous from user to kernel:
When a user writes a value, it is changed in the kernel memory when the user program returns to user space;
Asynchronous from kernel to user:
When a value in kernel memory changes, the user space is not notified of this change until the next read.

The kernel provides various synchronous communication mechanisms:

System calls
ioctls
Sockets

System Calls

System calls are the most classic user-kernel communication mechanism.

They allow user space applications to execute privileged kernel code:

I/Os (disk, network)
Resource management (threads, memory)
Communication (signals, IPCs)
Access to specific hardware

They are the core API of the kernel, with some limitations:

Each system call is identified by a number defined at kernel compile time. You cannot add new ones dynamically.
They are a “universal” API, so they need to be the same on all systems. ¹

There are two ways of making system calls:

The “old” interrupt way (x86)
The “new” syscall instruction way (x86, amd64, ARM64)

System Calls: The Interrupt Way

System calls are a software interrupt like the others:

1. Place the system call number in a register

2. Place the arguments in the proper registers and/or on the stack

3. Trigger the “system call” interrupt (switching to supervisor mode)

4. Jump to the “system call” interrupt handler

5. Load the system call table and jump to the index given by the syscall number

6. Execute the system call handler

7. Return the result to user space (switching back to user mode)

System Calls: The Instruction Way

Some architectures provide a specific instruction (syscall, svc, sysenter, …):

1. Place the system call number in a register

2. Place the arguments in the proper registers and/or on the stack

3. Use the system call instruction (switching to supervisor mode)

4. Jump to the index given by the syscall number in the syscall table

5. Execute the system call handler

6. Return the result to user space (switching back to user mode)

One level of indirection is bypassed!

Tip

You can find a very detailed (with less omissions) explanation of how system calls work in Linux here!

System Calls: The Linux Application Binary Interface

API: Application Programming Interface
High-level interface for programmers (function prototypes, data types, …)

ABI: Application Binary Interface
Low-level interface for compilers/OS (calling conventions, architecture-specific)

Architecture	syscall#	retval	arg1	arg2	arg3	arg4	arg5	arg6	arg7
Arm EABI	r7	r0	r0	r1	r2	r3	r4	r5	r6
arm64	w8	x0	x0	x1	x2	x3	x4	x5
mips	v0	v0	a0	a1	a2	a3	a4	a5
riscv	a7	a0	a0	a1	a2	a3	a4	a5
x86-64	rax	rax	rdi	rsi	rdx	r10	r8	r9

How do we add a system call to Linux?

We’ll see that a bit later, when you’ll need to do it for an exercise.

ioctl

ioctls are a way to provide custom system calls, mainly for device drivers.

They are called from user space by the ioctl system call and provide an additional level of indirection, with each ioctl having a number.

Advantages:

Easy to extend
Provide a syscall-like interface, i.e., an arbitrary function call

Drawback:

Numbers have to stay “forever”, as changing them might break existing user space applications

An ioctl is tied to a file. It is registered as a file operation similar to read or write in the kernel.

For a given device, each ioctl has a number generated through a set of macros.

How can we use ioctls?

We won’t see the API in details here, you will have a task solely on that in the next lab.

Sockets

The kernel also provide a socket-based interface with user space called netlink.

It is similar to regular sockets in user space, but with the AF_NETLINK socket family.

Advantage:

Symmetrical communication

Drawback:

Implementation is a bit more complex than simple read/write semantics

Chapter 5: Memory Management

Memory Organization

Pages are the basic unit of memory management.

Even if memory is byte/word addressable, the smallest management unit is the page, to accommodate fast lookups and address translations.

Some definitions:

A page (or virtual page) is a fixed-size block of contiguous virtual memory.

A page frame (or physical page) is a fixed-size block of contiguous physical memory.

Page size depends on the architecture, some of them even support multiple sizes.

Supported page sizes (non-exhaustive)
Architecture	4 KiB	16 KiB	64 KiB	2 MiB	4 MiB	32 MiB	512 MiB	1 GiB
x86_64	X			X				X
armv7	X		X
aarch64	X	X	X	X		X	X	X
riscv32	X				X
riscv64	X			X				X

Kernel Representation of Pages Frames

The kernel maintains a struct page for each page frame available on the system.
The structure is defined in include/linux/mm_types.h.

Here is a simplified definition of the structure:

struct page {
    unsigned long           flags;      // page status, flags available in include/linux/page-flags.h
    atomic_t                _refcount;  // number of references to this frame

    /* page cache and anonymous pages */
    atomic_t                _mapcount;  // number of page tables this frame is mapped in
    struct address_space    *mapping;   // if used in page cache, object associated to this frame
    struct list_head        lru;        // least-recently used list for eviction

    void                    *virtual;   // kernel virtual address when not kmapped, used when using high memory
};

Since there is one struct page per physical page, isn’t this a lot of memory for metadata?

In practice, a struct page is only around 40 bytes (lots of unions in there).

So, in a system with 16 GiB of memory and 4KiB pages, you will have: $\frac{17,179,869,184}{4,096} = 4,194,304$ pages.

Which means only $4,194,304~\text{pages} \times 40~\text{B} = 160~\text{MiB}$, or roughly 1% of the total memory.

You can reduce this metadata footprint by increasing the page size, i.e., using huge pages.

Zones

Not all addresses are equal in hardware, so all frames are not treated identically.
The kernel separates pages in multiple zones with different properties.

The two main hardware limitations that require zones are:

Some hardware can only do Direct Memory Accesses (DMA) to certain addresses
Some architectures have a physical address space larger than their virtual address space,
which means that some frames are not permanently mapped into the kernel address space

Zones on x86-32 (from *Linux Kernel Development, 3rd Edition, Robert Love*)
Zone	Description	Physical Memory
`ZONE_DMA`	Contains frames that can be accessed through DMA	$\lt$ 16 MiB
`ZONE_NORMAL`	Permanently mapped pages (DMA also possible)	16–896 MiB
`ZONE_HIGHMEM`	Memory not permanently mapped into the kernel’s address space	$\gt$ 896 MiB

64-bit architectures

On architectures that can address large address spaces, e.g., 64-bit, ZONE_HIGHMEM usually does not exist, and all memory is split between ZONE_DMA and ZONE_NORMAL.

Memory API

Quick recap of the memory management API in the kernel:

Allocate pages

struct page *alloc_pages(gfp_t gfp_mask, unsigned int order); // allocate 2^order contiguous physical pages, return the first page
void *page_address(struct page *page); // get the actual address of the page from it's struct page
unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order); // same as alloc_pages(), but returns the address directly
struct page *alloc_page(gfp_t gfp_mask);   // wrapper to allocate a single page
unsigned long __get_free_page(gfp_t gfp_mask); // wrapper to allocate a single page and get the actual address
unsigned long get_zeroed_page(gfp_t gfp_mask); //same as __get_free_page(), but the page is filled with zeros

Free pages

void __free_pages(struct page *page, unsigned int order);
void free_pages(unsigned long addr, unsigned int order);
void free_page(unsigned long addr);

Important

Careful to not free pages you did not allocate. That would most likely break the system.

Allocate arbitrary sizes

void *kmalloc(size_t size, gfp_t flags);
void kfree(const void *ptr);
void *vmalloc(unsigned long size);
void vfree(const void *addr);

Page Frame Allocation

You might have noticed that most of the memory API works at the page granularity.

There are multiple levels of allocators:

User space allocators that allocate objects of arbitrary sizes, request memory from the kernel
e.g., glibc malloc, jemalloc, more implementations here
Kernel space object allocators that allocate objects of arbitrary sizes, request page frames
Page frame allocator that allocates page frames from the physical memory

The page frame allocator is responsible for managing the physical memory, giving page frames to allocators upon request, and reclaiming them when freed.

Page order

You might have also noticed that the page granularity API uses orders (powers of two) for allocations:

struct page *alloc_pages(gfp_t gfp_mask, unsigned int order);

Orders come from the inner working of the page frame allocator of Linux: the buddy allocator.

Buddy Allocator

Linux uses the buddy allocator to manage physical page frame allocation.

Initially, physical memory is seen as a single large chunk containing all page frames, e.g., $2^8 = 256$ pages.

Allocation:

The allocation size is rounded to the next power of 2, i.e., the order
A chunk of the closest size is repeatedly split in half until it reaches the order size
Two chunks split from the same parent are called buddies
The resulting chunk is allocated

If a chunk of the correct size is available, it is directly chosen

Free:

The freed chunk is tagged as free
If two buddies are free, they are merged back into a larger chunk

If two neighbors are free but not buddies, they cannot be merged!

Buddy Allocator: Linux Implementation

Linux implements the buddy allocator with an array of free lists, indexed by order.

Examples:

alloc_pages(GFP_KERNEL, 3);
- Look up the list of order 3 for a free chunk
- A chunk is available, remove it from the free list and return it to the caller
alloc_pages(GFP_KERNEL, 4);
- Look up the list of order 4 for a free chunk
- No chunk of order 4 available, look up for a higher order free chunk
- A chunk of order 5 is available: split it, choose one chunk for allocation and put the other in the order 4 free list

When freeing a chunk of size order, it is added back its free list.

The kernel also maintains bitmaps to identify buddies: if both buddies are free, they are merged and stored in the order + 1 list.

User interface via procfs

You can check the state of your buddy allocator by reading the /proc/buddyinfo file.

Object Allocators

The page frame allocator works at the page granularity.

This is not convenient for most allocations that request memory for smaller objects!

Object allocators act as a layer between the page allocator and subsystems to allocate smaller chunks of memory. They allocate pages, and “redistribute” smaller chunks to subsystems that allocate memory through them.

For example, when using kmalloc(), you are using an object allocator, not directly allocating pages.

Linux has multiple object allocation layers (you can even implement your own), but we will see the main one: the slab layer.

The Slab Layer

Allocating and freeing objects is extremely frequent, so it’s a good idea to have some sort of caching mechanism.

In Linux, this caching mechanism is called the slab layer.

The slab layer allows you to create caches, each of which contains a certain type of objects, e.g., struct task_struct or struct inode.

Each cache is then divided into slabs, blocks of contiguous memory that contain a certain number of instances of the object stored by this cache.

A slab contains the actual data and maintains their status (used or free).

When a slab is full, the slab layer will allocate a new one for this cache.

When the system wants to reclaim memory, empty slabs will be freed.

Note

Additionally, allocations are done at the page granularity, so for smaller objects, you would need manual management to not waste memory.

SLAB Allocator

Added in Linux in 1996, implements work from Sun Microsystems in SunOS 5.4¹

Cache friendly

Queues to track per-cpu and per-node data
Page coloring to enforce cache locations

NUMA aware

Per-node slabs, allocation are done on the local node by default
Other policies can be used depending on the objects stored

Design: Page frame layout

Metadata of each slab can be embedded in the slab itself:

freelist contains the indexes of the free objects in this slab. It has as many entries as the number of objects that can be stored in the page frame
padding aligns the objects properly

Note

Multiple allocations can be done by only touching one cache line (the one with the freelist). No need to touch the actual objects.

SLAB Allocator (2)

Design: Data structures

Partial description, see their real definitions for more details

struct kmem_cache {
    struct array_cache __percpu *cpu_cache;
    unsigned int size;
    int object_size;
    struct kmem_cache_node *node[MAX_NUMNODES];
};

struct kmem_cache_node {
    struct list_head slabs_partial; // slabs with free objects
    struct list_head slabs_full;    // full slabs
    struct list_head slabs_free;    // empty slabs
    /* some counters, locks, and pointers to array_caches */
};

struct array_cache {
    unsigned int avail; // number of active entries
    unsigned int limit; // max number of entries
    void *entry[];      // LIFO array of free elements
};

struct slab {
    /* metadata */
    struct kmem_cache *slab_cache;
    struct list_head slab_list;
    void *freelist;
    void *s_mem;
    /* ... */
    /* actual data, containing the page frame layout shown here*/
};

Frame layout:

Object layout:

SLUB Allocator

Introduced in 2007. The idea is to simplify the implementation, with less queues.

Locality by having per-cpu slabs, still NUMA aware

struct kmem_cache {
    struct kmem_cache_cpu __percpu *cpu_slab;
    unsigned int size;
    unsigned int object size;
    struct kmem_cache_node *node[MAX_NUMNODES];
};

struct kmem_cache_node {
    unsigned long nr_partial;
    struct list_head partial;
};

struct kmem_cache_cpu {
    void **freelist;
    struct slab *slab;
};

struct slab {
    struct kmem_cache *slab_cache;
    struct list_head slab_list;
    void *freelist;
    unsigned long counters; // nr_objects, nr_inuse
};

Frame layout:

Object layout:

Current State of Slab Allocators

SLUB is the default allocator

SLOB has been deprecated in 6.2

SLAB has been deprecated in 6.5

Manipulating Slabs

As seen previously, the “main” object of the slab layer is a struct kmem_cache, as it represents an instance of a cache.

Creation

#define KMEM_CACHE (struct, flags)
// struct is the type of object that wil be stored in the cache
// flags are SLAB_POISON, SLAB_RED_ZONE and SLAB_HWCACHE_ALIGN

static struct kmem_cache *cache;

int fn(void)
{
    /* ... */
    cache = KMEM_CACHE(bio_crypt_ctx, 0);
    /* ... */
}

Destruction

void kmem_cache_destroy(struct kmem_cache *s);

Allocation/Free

Use these methods as a replacement for kmalloc() and kfree().

void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags);
void kmem_cache_free(struct kmem_cache *s, void *objp);

Slab Information

You can query which caches have been created in your system from user space:

cat /proc/slabinfo

slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
ext4_groupinfo_4k   7656   7656    184   44    2 : tunables    0    0    0 : slabdata    174    174      0
ext4_fc_dentry_update      0      0     96   42    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_inode_cache  257751 258039   1192   27    8 : tunables    0    0    0 : slabdata   9557   9557      0
ext4_allocation_context    520    520    152   26    1 : tunables    0    0    0 : slabdata     20     20      0
ext4_prealloc_space    720    720    112   36    1 : tunables    0    0    0 : slabdata     20     20      0
ext4_io_end         1408   1408     64   64    1 : tunables    0    0    0 : slabdata     22     22      0
filp               17434  19008    256   32    2 : tunables    0    0    0 : slabdata    594    594      0
inode_cache        15325  15325    648   25    4 : tunables    0    0    0 : slabdata    613    613      0
dentry            357252 357252    192   42    2 : tunables    0    0    0 : slabdata   8506   8506      0
pid                 4129   4160    128   32    1 : tunables    0    0    0 : slabdata    130    130      0
kmalloc-8k           496    496   8192    4    8 : tunables    0    0    0 : slabdata    124    124      0
kmalloc-4k          1765   1768   4096    8    8 : tunables    0    0    0 : slabdata    221    221      0
kmalloc-2k          2452   2496   2048   16    8 : tunables    0    0    0 : slabdata    156    156      0
kmalloc-1k          4670   4704   1024   32    8 : tunables    0    0    0 : slabdata    147    147      0
kmalloc-512        49888  49888    512   32    4 : tunables    0    0    0 : slabdata   1559   1559      0
kmalloc-256        20977  21024    256   32    2 : tunables    0    0    0 : slabdata    657    657      0
kmalloc-192        53424  53424    192   42    2 : tunables    0    0    0 : slabdata   1272   1272      0
kmalloc-128        64768  64768    128   32    1 : tunables    0    0    0 : slabdata   2024   2024      0
kmalloc-96          7856   8820     96   42    1 : tunables    0    0    0 : slabdata    210    210      0
kmalloc-64         69111  69120     64   64    1 : tunables    0    0    0 : slabdata   1080   1080      0
kmalloc-32         26610  27008     32  128    1 : tunables    0    0    0 : slabdata    211    211      0
kmalloc-16         40386  41472     16  256    1 : tunables    0    0    0 : slabdata    162    162      0
kmalloc-8          32767  32768      8  512    1 : tunables    0    0    0 : slabdata     64     64      0
kmem_cache_node      640    640     64   64    1 : tunables    0    0    0 : slabdata     10     10      0
kmem_cache           384    384    256   32    2 : tunables    0    0    0 : slabdata     12     12      0

Memory Pools

If your code performs allocations and needs a guarantee that memory will be available, you can use memory pools.
This should be used only if your code will fail if memory is not available.
For example, some drivers performing DMA might need to allocate objects during an operation with hardware, where failure would break the hardware.

A memory pool is a chunk of pre-allocated memory that is guaranteed to be able to store at least a minimal number of objects.

Creation/Destruction

/**
 * mempool_create - create a memory pool
 * @min_nr:    the minimum number of elements guaranteed to be
 *             allocated for this pool.
 * @alloc_fn:  user-defined element-allocation function.
 * @free_fn:   user-defined element-freeing function.
 * @pool_data: optional private data available to the user-defined functions.
 */
mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn, mempool_free_t *free_fn, void *pool_data);
void mempool_destroy(mempool_t *pool)

typedef void * (mempool_alloc_t)(gfp_t gfp_mask, void *pool_data);
typedef void (mempool_free_t)(void *element, void *pool_data);

Allocation/Free

void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask);
void mempool_free(void *element, mempool_t *pool);

Memory Pool on Top of a Slab Cache

You can also build a memory pool on top of a slab cache with the following wrapper function:

static inline mempool_t *mempool_create_slab_pool(int min_nr, struct kmem_cache *kc);

If you want to use the kmalloc slab cache, you can use these wrapper function:

static inline mempool_t *mempool_create_kmalloc_pool(int min_nr, size_t size);

Freeing Memory

If you allocate memory, you need to free it at some point to avoid memory leaks.

There are multiple ways of reclaiming memory for the kernel:

Manually free the memory you allocated when you don’t need it anymore, e.g., with kfree() or kmem_cache_free().
This works well for data that is allocated and freed in a single code path and easy to manage.
Using reference counters to free objects when there are no more references to it.
Works well when objects can be used in different subsystems or by different actors in the kernel.
Memory reclamation by the shrinker under memory pressure.
Should be set up for objects that may take a large portion of memory but are not necessary for your code to work,
e.g., you store statistics in memory without limiting the total amount of data you keep, but you are ok with freeing the old data if the system needs memory.
Memory reclamation by the page frame reclamation algorithm (PFRA).
Automatically executed by the kswapd daemon and under heavy memory pressure.
Evicts to storage the least recently used pages.

Reference Counters

Reference counters keep track of the number of users of an object. Whenever the counter reaches 0, the object is not in use anymore and can be freed.

To use a reference counter, you need to embed a struct kref into your structure. It needs to be embedded, not a pointer!

struct kref {
    // refcount_t -> struct refcount_struct -> atomic_t -> int
    refcount_t refcount;
};

struct wacom_hdev_data {    // Example from a Wacom driver
    struct list_head list;
    struct kref kref;
    struct hid_device *dev;
    struct wacom_shared shared;
};

You can check the full API in include/linux/kref.h, but the main methods are:

/**
 * kref_get - increment refcount for object.
 * @kref: object.
 */
void kref_get(struct kref *kref);

/**
 * kref_put - decrement refcount for object.
 * @kref: object.
 * @release: pointer to the function that will clean up the object when the
 *       last reference to the object is released.
 *       This pointer is required, and it is not acceptable to pass kfree
 *       in as this function.  If the caller does pass kfree to this
 *       function, you will be publicly mocked mercilessly by the kref
 *       maintainer, and anyone else who happens to notice it.  You have
 *       been warned.
 */
int kref_put(struct kref *kref, void (*release)(struct kref *kref));

Putting a kref object

The release() function needs to free the object containing the kref.

You can achieve this by using the container_of macro.

Shrinker

If you allocate a lot of objects that are useful but not necessary, you can play nice and let the kernel reclaim your memory if needed.
When the kernel is under memory pressure, it runs the shrinker to reclaim memory from registered components.

To register a shrinker for your code, you first need to declare a struct shrinker and define a count() and a scan() functions.

struct shrinker {
    unsigned long (*count_objects)(struct shrinker *, struct shrink_control *sc);
    unsigned long (*scan_objects)(struct shrinker *, struct shrink_control *sc);

    int seeks;  /* seeks to recreate an obj */
    long batch; /* reclaim batch size, 0 = default */
    unsigned long flags;

    /* These are for internal use */
    struct list_head list;
    /* objs pending delete, per node */
    atomic_long_t *nr_deferred;
};

When the kernel wants to reclaim memory, it will call the count() method of all registered shrinkers to assess how many objects can be freed.
It will then call the scan() method if the count is positive in order to actually free the memory.

With your shrinker declared, you still need to register it so that the kernel will call it when memory reclaiming is performed.
You also need to unregister your shrinker when it is not usable anymore, e.g., when you unload your module.

int register_shrinker(struct shrinker *);
void unregister_shrinker(struct shrinker *);

Page Frame Reclamation

Linux tries to keep a pool of available free pages to ensure future allocations won’t fail (most likely).

Pages can be of one of four types:

Unreclaimable: Cannot be swapped out and reclaimed, e.g., kernel stacks, locked pages
Swappable: Can be swapped out and reclaimed, e.g., anonymous pages, shared memory
Syncable: Cached disk data that might need to be synced before reclamation if dirty, i.e., page cache
Discardable: Unused pages that can be immediately reclaimed, e.g., unused pages in kernel slab allocators

Page reclamation is performed on two occasions:

During an allocation, when the amount of free pages drops below the low watermark, the kswapd daemon is asynchrounously woken up to reclaim pages
During an allocation, when the amount of free pages drops below the min watermark, a direct reclamation is synchronously performed

Page Frame Reclamation Algorithm (PFRA)

The page frame reclamation algorithm (PFRA) implements a form of Least-Recently Used (LRU) algorithm.

The rationale is the following:

Pages are stored into an LRU list, sorted by access time (this is not really true, we’ll see why later)
When a page is accessed, it is moved to the head of the list
When pages need to be reclaimed, the PFRA starts at the tail of the LRU list

In practice, it is very costly to maintain a sorted LRU list.

Thus, Linux approximates it using a clock algorithm, using the referenced bit from the page table.

There are two versions of the PFRA:

Until 6.0: the two-list strategy
From 6.1: the Multi-Gen LRU (MGLRU)

Let’s have a look at both implementations!

PFRA: Two-List Strategy

The PFRA maintains two lists: the active and the inactive lists.

The active list contains recently accessed pages, while the inactive list contains the pages that were not accessed recently.
In other words, the active list contains the current working set of the system.

When a page is accessed for the first time, it is added in the inactive list, with a referenced bit set to 0.

Periodically, the PFRA scans the lists:

When a page is accessed, its referenced bit is set to 1 by the MMU (used)
If an inactive page’s referenced bit is 1 during the scan, the page is moved to the active list (used)
Periodically (after multiple scans), referenced bits are cleared (timeout)
If an active page’s referenced bit is 0, the page is moved to the inactive list (refill)
If the active list is too long, pages can be moved to the inactive list (refill)

Lengths of lists

The PFRA tries to balance the length of the active and inactive list through a ratio. By default, Linux is usually tuned to have an active list at most two thirds of the page cache’s size.

PFRA: Multi-Gen LRU

Unfortunately, the active/inactive lists have a few shortcomings:

Too coarse-grained: Hard to differentiate the real “age” of a page based on active/inactive status only
Costly: Scanning the lists takes a noticeable amount of time, i.e., kswapd daemon
Biased to evict file-backed pages: Due to the reference counting method, file-backed pages end up evicted much more than idle anonymous pages

In Linux 6.1, the Multi-Gen LRU (MGLRU) was introduced to fix these issues.
It generalises the previous concept with more lists than just active/inactive, called generations.

The general idea and algorithm are the same: pages are moved between generations depending on their use recency.
The old algorithm could be described as an MGLRU with two generations.

Having more generations gives a finer-grained estimation of the age of a page:
pages in the same generation have been last accessed roughly at the same period of time.

Each generation is also smaller than with the old version, making the scan faster.

Generations are also now split into tiers that regroup pages that were accessed the same amount of time in the generation.

Other Memory Features in the Kernel

Transparent Huge Pages (THP)

Transparently use huge pages (2 MiB or 1 GiB on x86-64) when allocating large memory areas. Reduces the pressure on the TLB (less entries for the same amount of data) and shortens page table operations (less page table levels to walk though).

Compaction

Reduce fragmentation of the physical memory by compacting pages close to each other. This reduces the amount of holes due to the buddy allocator, and enables merging free buddies, and thus allocating larger memory areas.

Kernel Samepage Merging (KSM)

Deduplication mechanism that can be enabled to detect page frames with the same content and merge them. The resulting page frame is then mapped at each virtual address the original page frames belonged to.

Out-of-Memory Killer (OOM)

When the system critically runs out of memory, the OOM killer chooses a process to sacrifice in order to enough free memory for the system to keep operating without freezing or crashing.

Chapter 6: Virtual File System

The Virtual File System Abstraction

Overview of the VFS

The Virtual File System (VFS) defines a set of abstractions that are then made concrete by file system implementations.

Objects

A file descriptor represents an instance of an open file
An inode represents a file on the storage device
The directory entry caches the resolution of a file path to its corresponding inode
A superblock represents an instance of a mounted file system

Interfaces

File operations operate on file content (open, read, write, …)
Inode operations operate on file metadata (create, mkdir, unlink, …)
Superblock operations operate on a partition (mount, sync, unmount, …)

File Descriptors

File descriptors represent an instance of an open file.
It contains information about the file on storage as well as the current state of the open file, e.g., cursor position.
There can be multiple file descriptors for the same file on storage when opened multiple times.

A file descriptor is defined as a struct file in include/linux/fs.h

struct file {
    fmode_t         f_mode;     // mode in which the file is opened (rwxrwxrwx)
    atomic_long_t   f_count;    // number of threads sharing this file descriptor
    loff_t          f_pos;      // position in the file, the "cursor"
    struct inode    *f_inode;   // the inode representing the concrete file
    struct file_operations  *f_op;
    struct address_space    *f_mapping; // mapping in memory for the page cache
    /* ... */
};

Inodes

Inodes describe files or directories on the storage device.
Each inode corresponds to one and only one file/directory.
Conversely, each file/directory corresponds to one and only one inode.

An inode is defined as a struct inode in include/linux/fs.h

struct inode {
    umode_t             i_mode;     // mode of the file on disk
    kuid_t              i_uid;      // user id of the owner of the file
    kgid_t              i_gid;      // group id of the owner of the file
    unsigned long       i_ino;      // inode number

    unsigned int        i_nlink;    // number of links to this file (hard links)
    loff_t              i_size;     // size of the file
    struct timespec64   i_atime;    // date of the last access
    struct timespec64   i_mtime;    // date of the last modification
    struct timespec64   i_ctime;    // date of creation

    struct inode_operations     *i_op;
    struct super_block          *i_sb;      // super block (partition) that contains this inode
    struct address_space        *i_mapping; // mapping in memory for the page cache
};

The inode is partially stored on the disk too, in order to preserve information across mounts/reboots, e.g., file size, timestamps, mode, etc…

Resolving Paths to Inodes

From a user perspective, a file/directory is identified by its path.
The VFS needs to translate this path into an inode to actually interact with the file.
This resolution operation is costly as it requires numerous string operations and walking the directory hierarchy on disk.

To avoid repeating this operation too many times, the VFS builds directory entries, or dentries.
A dentry maintains a relationship between a path and its corresponding inode.

Example: If you open the file located at /home/lkp/foo/bar, it will create/query the following five dentries: /, home, lkp, foo and bar.

Dentries are defined as struct dentry in include/linux/dcache.h

struct dentry {
    struct qstr         d_name;     // file name
    struct inode        *d_inode;   // inode of this path
    struct dentry       *d_parent;  // parent in the fs hierarchy
    struct super_block  *d_sb;      // mounted file system owning this dentry

    struct hlist_bl_node d_hash;    // list for the hash table
};

Dentries can be in one of three states:

Used: dentry points to a valid inode (d_inode points to an inode) and is in the dentry cache
Unused: dentry is allocated but not in use currently, kept in cache for future reuse
Negative: dentry does not point to a valid inode (d_inode is NULL) (you will see why this is useful for in the lab)

Dentry Cache

Dentries are cached into a hash table for fast lookup.
This cache is declared as static struct hlist_bl_head *dentry_hashtable in fs/dcache.c.

Spoiler alert!

No more details here, you will have to work with that cache in the next lab.

In particular, you will have to find out:

the dentry cache size (number of buckets)
how the hash table is implemented
what are negative dentries used for

Back to the VFS

Now, let’s go back to our layers, and more specifically how the VFS glues all this together

Relations Between VFS Objects

File System

As shown in the previous figure, implementing a file system means implementing a set of file and inode operations.

File systems may also add their own flavor of inode definition.

Example: The ext4 file system has this inode definition that extends the VFS inode:

struct ext4_inode {
    __le16  i_mode;         /* File mode */
    __le16  i_uid;          /* Low 16 bits of Owner Uid */
    __le32  i_size_lo;      /* Size in bytes */
    __le32  i_atime;        /* Access time */
    __le32  i_ctime;        /* Inode Change time */
    __le32  i_mtime;        /* Modification time */
    __le32  i_dtime;        /* Deletion Time */
    __le16  i_gid;          /* Low 16 bits of Group Id */
    __le16  i_links_count;  /* Links count */
    __le32  i_blocks_lo;    /* Blocks count */
    __le32  i_flags;        /* File flags */
    __le16  i_checksum_hi;  /* crc32c(uuid+inum+inode) BE */
    __le32  i_generation;   /* File version (for NFS) */
    __le32  i_file_acl_lo;  /* File ACL */
};

Page Cache

Quick reminder about computer system latencies:

Latency for a 1 MB sequential read
(Latency Numbers Everyone Should Know, Google SRE)
Device	Time (in ns)
Memory	10.000
SSD	1.000.000
HDD	5.000.000

Accessing storage device is two orders of magnitude longer than accessing memory!

To alleviate this, Linux has a page cache sitting between file systems and storage devices.

Pages read from block devices are cached in memory for fast access
The page cache employs a write-back policy: writes are asynchronously propagated to the storage device
The page cache does not have a size, it uses all the free memory available. When free memory is needed, the PFRA evicts pages

Relations Between VFS Objects and the Page Cache

The xarray in the address_space may contain folios instead of pages depending on the file system. We won’t go into more details on this, but if you want to have a look, you can check the kernel documentation/online resources.

Super Block

A super block describes a file system partition.
When you mount a file system partition, a struct super_block object is created and populated with information read from storage.

This structure is defined in include/linux/fs.h:

struct super_block {
    struct block_device     *s_bdev;        // the block device containing this partition
    unsigned long           s_blocksize;    // size of a block
    loff_t                  s_maxbytes;     // max file size
    struct file_system_type *s_type;        // file system descriptor
    struct super_operations *s_op;          // fs ops (alloc_inode, sync, umount)
    struct dentry           *s_root;        // dentry of the root of the mount point (the / of this partition)
    unsigned long           s_magic;        // magic number of this file system
    void                    *s_fs_info;     // file system-specific private info
    char                    s_id[32];       // short name
    uuid_t                  s_uuid;         // UUID
    /* ... */
};

The operations on super blocks are also defined in include/linux/fs.h:

struct super_operations {
    /* inode handling */
    struct inode *(*alloc_inode)(struct super_block *sb);
    void (*destroy_inode)(struct inode *);
    void (*free_inode)(struct inode *);
    int (*write_inode) (struct inode *, struct writeback_control *wbc);
    /* partition handling */
    void (*put_super) (struct super_block *);
    int (*sync_fs)(struct super_block *sb, int wait);
    int (*statfs) (struct dentry *, struct kstatfs *);
    void (*umount_begin) (struct super_block *);
    /* ... */
};

Relations Between VFS Objects (Super Block Edition)

Implementing a File System

To implement a file system, you need to:

Design the layout on your physical storage device
- Design your super block and how inodes and blocks are managed
- Design your inode content (what information must be stored)
Design the operations on your file system
- File operations (open, read, write, …)
- Inode operations (create, unlink, mkdir, …)
- Super block operations (alloc_inode, sync, …)
- Page cache operations if you want to use it

For 1., this mostly means implementing the mkfs user space utility for your new file system.
For 2., this is your kernel implementation as a module.

File system in User SpacE

Another way of implementing a file system in Linux is through the File system in User SpacE (FUSE) API.

With FUSE, you can implement everything in user space and register your FUSE file system with the kernel.

The VFS will then redirect system calls targeting your file system to your code in user space, which in turn may use the underlying “real” file system.

Performance

The amount of round trips between user and kernel modes greatly impacts the performance of FUSE-based file systems.

FUSE Passthrough

Since 6.9, the FUSE subsystem supports a passthrough feature to avoid mode switches.

It allows the FUSE driver to directly communicate with other file systems for read/write operations, without going back to the user space FUSE daemon.

Project Teaser Trailer

Spoiler Alert!!!

What we have seen today will be useful for the project!

You will need to… implement new features in a simple file system provided to you: ouichefs!

The list of features to implement is a surprise. We will announce it early next week.

You can find the source code on Github at:
https://github.com/rgouicem/ouichefs

Chapter 7: Tracing Facilities in the Kernel

Program Tracing

Definition: Recording information about the execution of a program.

Useful for debugging, fault detection or program optimization.

Logging: Record information about events happening on a system
Probing: Insert code into an executable to run when some specific address is reached/event is triggered

Kernel Tracing Facilities

The Linux kernel offers a wide variety of tracing facilities at multiple levels:

User space tracers:
- strace for system calls
- ptrace for processes
Kernel tracers:
- ftrace to trace functions
- fprobe to execute code at certain points in the code
- tracepoints to log predefined events
eBPF virtual machine

Strace – System call tracer

Strace is a user space tool that tracks system calls, their parameters and returns values.
It uses a facility provided by the kernel, ptrace, to track the system calls performed by a given process and signals raised.

You can filter which system calls or signals are tracked with -e.
Filters can work on classes of events, names, return value, etc.
Check the manual page for more info (man strace).

strace cat foo.txt

execve("/usr/bin/cat", ["cat", "foo.txt"], 0x7fff59f2c8e8 /* 49 vars */) = 0
brk(NULL)                               = 0x563ba2f52000
arch_prctl(0x3001 /* ARCH_??? */, 0x7fff77ceafd0) = -1 EINVAL (Invalid argument)
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=179255, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 179255, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f024dc19000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\220~\2\0\0\0\0\0"..., 832) = 832
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
newfstatat(3, "", {st_mode=S_IFREG|0755, st_size=1948832, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f024dc17000
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
mmap(NULL, 1973104, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f024da35000
mmap(0x7f024da5b000, 1417216, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x26000) = 0x7f024da5b000
mmap(0x7f024dbb5000, 344064, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x180000) = 0x7f024dbb5000
mmap(0x7f024dc09000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1d3000) = 0x7f024dc09000
mmap(0x7f024dc0f000, 31600, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f024dc0f000
close(3)                                = 0
mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f024da32000
arch_prctl(ARCH_SET_FS, 0x7f024da32740) = 0
set_tid_address(0x7f024da32a10)         = 294998
set_robust_list(0x7f024da32a20, 24)     = 0
rseq(0x7f024da33060, 0x20, 0, 0x53053053) = 0
mprotect(0x7f024dc09000, 16384, PROT_READ) = 0
mprotect(0x563ba14e1000, 4096, PROT_READ) = 0
mprotect(0x7f024dc76000, 8192, PROT_READ) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
munmap(0x7f024dc19000, 179255)          = 0
getrandom("\xae\xe8\x88\xce\x4d\xad\xad\x78", 8, GRND_NONBLOCK) = 8
brk(NULL)                               = 0x563ba2f52000
brk(0x563ba2f73000)                     = 0x563ba2f73000
openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=3057440, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 3057440, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f024d600000
close(3)                                = 0
newfstatat(1, "", {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x2), ...}, AT_EMPTY_PATH) = 0
openat(AT_FDCWD, "foo.txt", O_RDONLY)   = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=11, ...}, AT_EMPTY_PATH) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
mmap(NULL, 139264, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f024dc23000
read(3, "asljdhcnas\n", 131072)         = 11
write(1, "asljdhcnas\n", 11asljdhcnas
)            = 11
read(3, "", 131072)                     = 0
munmap(0x7f024dc23000, 139264)          = 0
close(3)                                = 0
close(1)                                = 0
close(2)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++

Ptrace – Process tracer

The system call ptrace allows a thread (the tracer) to observe and control the execution of another thread (the tracee).

long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data);

The tracer sends commands (requests) to observe or modify the behavior of the tracee, and the tracee is stopped on events specified by the tracer.

Some commands:

PTRACE_ATTACH/PTRACE_SEIZE: trace the specified thread.
PTRACE_TRACEME: used by a tracee to ask its parent to trace it.
PTRACE_GETREGS/PTRACE_SETREGS: get/set the registers of the tracee.
PTRACE_GETSIGINFO/PTRACE_SETSIGINFO: get/set the information about the signal that caused the tracee to stop.
PTRACE_CONT: make a stopped tracee resume its execution.
PTRACE_SYSCALL: same as PTRACE_CONT, but also makes the tracee stop at the next syscall entry/exit.
PTRACE_SINGLESTEP: same as PTRACE_CONT, but also makes the tracee stop after executing one instruction.

Ftrace – Function tracer

Ftrace is a kernel facility to monitor the behavior of the kernel.
Unlike its name suggests, it is not only a function tracer.
It provides the following features:

Function tracer:
- When/how many times a function is called
- Function execution time
- Function call graph
Latency tracer:
- Interrupt handling latency
- Interrupts disabled latency
- Task wake up latency
- OS noise
Function prober:
- Hook code at the entry/exit of any function of the kernel
- Useful for tracing, monitoring, patching

Ftrace – User perspective

Ftrace exposes a pseudo file system to user space, mounted in /sys/kernel/tracing/ (and also in /sys/kernel/debug/tracing/ for backward compatibility). If it is not, you can mount it with:

mount -t tracefs nodev /sys/kernel/tracing
ls /sys/kernel/tracing

available_events                    events                      README                  set_graph_notrace   trace_options
available_filter_functions          free_buffer                 saved_cmdlines          snapshot            trace_pipe
available_filter_functions_addrs    function_profile_enabled    saved_cmdlines_size     stack_max_size      trace_stat
available_tracers                   hwlat_detector              saved_tgids             stack_trace         tracing_cpumask
buffer_percent                      instances                   set_event               stack_trace_filter  tracing_max_latency
buffer_size_kb                      kprobe_events               set_event_notrace_pid   synthetic_events    tracing_on
buffer_total_size_kb                kprobe_profile              set_event_pid           timestamp_mode      tracing_thresh
current_tracer                      max_graph_depth             set_ftrace_filter       touched_functions   uprobe_events
dynamic_events                      options                     set_ftrace_notrace      trace               uprobe_profile
dyn_ftrace_total_info               osnoise                     set_ftrace_notrace_pid  trace_clock         user_events_data
enabled_functions                   per_cpu                     set_ftrace_pid          trace_marker        user_events_status
error_log                           printk_formats              set_graph_function      trace_marker_raw

You can check the available tracers:

cat /sys/kernel/tracing/available_tracers

timerlat osnoise hwlat blk mmiotrace function_graph wakeup_dl wakeup_rt wakeup function nop

Set/get the tracers you want to use by writing to/reading from the current_tracer file:

echo function > /sys/kernel/tracing/current_tracer

Add filters (functions, PIDs, CPUs, etc.), set up the ring buffer containing the trace, etc.
Full documentation at: https://www.kernel.org/doc/html/latest/trace/ftrace.html

Start/stop tracing by writing 1/0 to tracing_on, and read the trace by reading trace.

Ftrace – User perspective (2)

Trace all traceable functions in the kernel:

echo function > /sys/kernel/tracing/current_tracer
echo 1 > /sys/kernel/tracing/tracing_on
head -n25 /sys/kernel/tracing/trace

# tracer: function
#
# entries-in-buffer/entries-written: 1025451/1369300398   #P:20
#
#                                _-----=> irqs-off/BH-disabled
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
          <idle>-0       [018] ..s2. 851622.721272: call_timer_fn <-__run_timers
          <idle>-0       [018] ..s2. 851622.721273: mix_interrupt_randomness <-call_timer_fn
          <idle>-0       [018] ..s2. 851622.721273: _raw_spin_lock_irqsave <-mix_interrupt_randomness
          <idle>-0       [018] d.s3. 851622.721273: blake2s_update <-mix_interrupt_randomness
          <idle>-0       [018] d.s3. 851622.721273: _raw_spin_unlock_irqrestore <-mix_interrupt_randomness
          <idle>-0       [018] ..s2. 851622.721274: _raw_spin_lock_irq <-__run_timers
          <idle>-0       [018] d.s3. 851622.721274: _raw_spin_unlock_irq <-__run_timers
          <idle>-0       [018] ..s2. 851622.721274: run_rebalance_domains <-__do_softirq
          <idle>-0       [018] ..s2. 851622.721274: update_blocked_averages <-run_rebalance_domains
          <idle>-0       [018] d.s2. 851622.721274: raw_spin_rq_lock_nested <-_raw_spin_rq_lock_irqsave
          <idle>-0       [018] d.s3. 851622.721274: _raw_spin_lock <-raw_spin_rq_lock_nested
          <idle>-0       [018] d.s3. 851622.721275: update_rq_clock <-update_blocked_averages
          <idle>-0       [018] d.s3. 851622.721275: psi_account_irqtime <-update_rq_clock

Ftrace – User perspective (3)

Trace a function graph of all calls to the read() system call:

echo function_graph > /sys/kernel/tracing/current_tracer
echo __x64_sys_read > set_graph_function

# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 ------------------------------------------
 14) QuotaMa-56957  =>   cat-519885  
 ------------------------------------------

 14)               |  __x64_sys_read() {
 14)               |    ksys_read() {
 14)               |      __fdget_pos() {
 14)   0.563 us    |        __fget_light();
 14)   1.705 us    |      }
 14)               |      vfs_read() {
 14)               |        rw_verify_area() {
 14)               |          security_file_permission() {
 14)   0.422 us    |            bpf_lsm_file_permission();
 14)   0.488 us    |            __fsnotify_parent();
 14)   2.216 us    |          }
 14)   3.035 us    |        }
 14)   0.601 us    |        __get_task_ioprio();
 14)               |        ext4_file_read_iter [ext4]() {
 14)               |          generic_file_read_iter() {
 14)               |            filemap_read() {
 14)               |              filemap_get_pages() {
 14)               |                filemap_get_read_batch() {
 14)   0.424 us    |                  __rcu_read_lock();
 14)   0.440 us    |                  __rcu_read_unlock();
 14)   2.537 us    |                }
 14)   3.419 us    |              }
 14)   0.430 us    |              folio_mark_accessed();
 14)               |              touch_atime() {
 14)               |                atime_needs_update() {
 14)   0.476 us    |                  make_vfsuid();
 14)   0.423 us    |                  make_vfsgid();
 14)               |                  current_time() {
 14)   0.480 us    |                    ktime_get_coarse_real_ts64();
 14)   1.312 us    |                  }
 14)   3.838 us    |                }
 14)   4.625 us    |              }
 14) + 11.279 us   |            }
 14) + 12.038 us   |          }
 14) + 12.860 us   |        }
 14)   0.492 us    |        __fsnotify_parent();
 14) + 19.284 us   |      }
 14) + 22.507 us   |    }
 14) + 25.423 us   |  }

Ftrace – User front ends

In practice, you most likely will use a front-end tool instead of directly configuring ftrace.

perf can use ftrace for some of its features
trace-cmd is a CLI tool from the ftrace developers (https://www.trace-cmd.org/)
- Control the ftrace file system interface
- Display the trace, statistics, etc.
KernelShark is a GUI tool from the ftrace developers to visualize traces

Ftrace – Kernel perspective

Most features of ftrace boil down to hooking callbacks before/after a function is called.

When ftrace is enabled in a kernel, the compiler adds noop instructions before and after every function of the kernel.

These instructions can then be patched at run time as calls to callbacks.

These callbacks can trigger the tracers previously presented (function, function graph, etc.) or call arbitrary functions in kernel space.

This is an overly simplified explanation of what is happening!

Check out this amazing (and very very technical) talk from the main ftrace developer, Steven Rostedt:
ftrace: Where modifying a running kernel all started – Kernel Recipes 2019

Fprobe – Function entry/exit probing

You can use the callback patching mechanism to call your own functions before/after any function in the kernel. This mechanism is called fprobe ¹.
This can be used to add measurements, filtering function parameters, etc.
The callbacks need to be in kernel space, so they should be part of a module or in the kernel binary.

To define a new probe, you must define a struct fprobe:

struct fprobe fp = {
    .entry_handler  = my_entry_callback,
    .exit_handler   = my_exit_callback,
};

/* Prototype of entry/exit handlers */
int (*entry_handler)(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct pt_regs *regs, void *entry_data);
void (*exit_handler)(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct pt_regs *regs, void *entry_data);

and then register/unregister it:

/* `filter` and `notfilter` are wildcard patterns to match symbol names to probe or not, respectively. */
int register_fprobe(struct fprobe *fp, const char *filter, const char *notfilter);
int unregister_fprobe(struct fprobe *fp);

/* Example: fp handlers will be called for functions names matching "func*()" (e.g. func0(), func1()) but not "func2()" */
register_fprobe(&fp, "func*", "func2");

You can also temporarily enable/disable registered probes with:

void enable_fprobe(struct fprobe *fp);
void disable_fprobe(struct fprobe *fp);

Tracepoints – Event tracer

Kernel developers can also add predefined events that can be dynamically traced: tracepoints ¹.

These events are directly in the kernel code as function calls, e.g., when a kthread stops:

int kthread_stop(struct task_struct *k)
{
    /* some code */
    trace_sched_kthread_stop(k);
    /* some code */
}

They are defined through macros in header files ²:

TRACE_EVENT(sched_kthread_stop,         // event name, the tracepoint will be called trace_<name>()

    TP_PROTO(struct task_struct *t),    // prototype of the function

    TP_ARGS(t),                         // names of the argument (same as in TP_PROTO, needed to avoid ugly macros)

    TP_STRUCT__entry(          // "local variable" called `__entry` that defines the layout of the data logged in this event
        __array(    char,   comm,   TASK_COMM_LEN   )   // equivalent to `char comm[TASK_COMM_LEN];`
        __field(    pid_t,  pid         )               // equivalent to `pid_t pid;`
    ),

    TP_fast_assign(                     // C statements to initialize the `__entry` variable
        memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
        __entry->pid    = t->pid;
    ),

    TP_printk("comm=%s pid=%d", __entry->comm, __entry->pid)    // formatted output of the event as exposed to user space
);

Tracepoints – User interface

List available tracepoints:

grep "sched:" /sys/kernel/tracing/available_events | head

sched:sched_wake_idle_without_ipi
sched:sched_swap_numa
sched:sched_stick_numa
sched:sched_move_numa
sched:sched_process_hang
sched:sched_pi_setprio
sched:sched_stat_runtime
sched:sched_stat_blocked
sched:sched_stat_iowait
sched:sched_stat_sleep

Enable/disable an event with the set_event interface:

echo "sched_wakeup" >> /sys/kernel/tracing/set_event      # enable
echo "!sched_wakeup" >> /sys/kernel/tracing/set_event     # disable

Enable/disable an event with the enable toggle:

echo 1 > /sys/kernel/tracing/events/sched/sched_wakeup/enable      # enable
echo 0 > /sys/kernel/tracing/events/sched/sched_wakeup/enable     # disable

Other informations are available in each event’s directory, e.g., its format.

Tracepoints – User interface (2)

Trace task wake up events:

echo sched_wakeup >> /sys/kernel/tracing/set_event
head -n25 /sys/kernel/tracing/trace

# tracer: nop
#
# entries-in-buffer/entries-written: 2854/33330   #P:20
#
#                                _-----=> irqs-off/BH-disabled
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
          <idle>-0       [003] dNh4. 880955.413151: sched_wakeup: comm=irq/195-iwlwifi pid=545 prio=49 target_cpu=003
          <idle>-0       [003] dNh4. 880955.413833: sched_wakeup: comm=irq/195-iwlwifi pid=545 prio=49 target_cpu=003
          <idle>-0       [003] dNh4. 880955.530027: sched_wakeup: comm=irq/195-iwlwifi pid=545 prio=49 target_cpu=003
          <idle>-0       [003] dNh4. 880955.530614: sched_wakeup: comm=irq/195-iwlwifi pid=545 prio=49 target_cpu=003
##### CPU 9 buffer started ####
          <idle>-0       [009] dNs5. 880955.631160: sched_wakeup: comm=kworker/9:0 pid=529392 prio=120 target_cpu=009
          <idle>-0       [009] dNs5. 880956.444597: sched_wakeup: comm=kworker/9:0 pid=529392 prio=120 target_cpu=009
          <idle>-0       [003] dNh5. 880956.637659: sched_wakeup: comm=migration/3 pid=110 prio=0 target_cpu=003
          <idle>-0       [009] dNh5. 880956.637712: sched_wakeup: comm=migration/9 pid=128 prio=0 target_cpu=009
          <idle>-0       [009] dN.2. 880956.708432: sched_wakeup: comm=code pid=128079 prio=120 target_cpu=009
          <idle>-0       [009] dN.2. 880956.709104: sched_wakeup: comm=code pid=128079 prio=120 target_cpu=009
          <idle>-0       [009] dN.2. 880957.111469: sched_wakeup: comm=kworker/9:0 pid=529392 prio=120 target_cpu=009
          <idle>-0       [009] dN.2. 880957.613408: sched_wakeup: comm=MemoryPoller pid=57121 prio=120 target_cpu=009

or with trace-cmd…

eBPF – A virtual machine in the kernel

eBPF (extended Berkeley Packet Filter) is a managed runtime environment embedded in the kernel.
It defines an instruction set (bytecode), data structures (maps), primitives, as well as a run time infrastructure to execute eBPF programs.

eBPF – Writing eBPF programs

There are multiple ways of writing eBPF programs:

Directly write eBPF bytecode
Cumbersome, don’t do that
bcc (BPF Compiler Collection)
Toolkit to generate eBPF programs for the kernel in C, with Python and lua front-ends.

eBPF – Writing eBPF programs with bcc

examples/tracing/vfsreadlat.c (Source: GitHub, iovisor/bcc)

#include <uapi/linux/ptrace.h>

BPF_HASH(start, u32);
BPF_HISTOGRAM(dist);

int do_entry(struct pt_regs *ctx)
{
    u32 pid;
    u64 ts;

    pid = bpf_get_current_pid_tgid();
    ts = bpf_ktime_get_ns();
    start.update(&pid, &ts);
    return 0;
}

int do_return(struct pt_regs *ctx)
{
    u32 pid;
    u64 *tsp, delta;

    pid = bpf_get_current_pid_tgid();
    tsp = start.lookup(&pid);

    if (tsp != 0) {
        delta = bpf_ktime_get_ns() - *tsp;
        dist.increment(bpf_log2l(delta / 1000));
        start.delete(&pid);
    }

    return 0;
}

examples/tracing/vfsreadlat.py (Source: GitHub, iovisor/bcc)

from bcc import BPF

# load BPF program
b = BPF(src_file = "vfsreadlat.c")
b.attach_kprobe(event="vfs_read", fn_name="do_entry")
b.attach_kretprobe(event="vfs_read", fn_name="do_return")

eBPF – Other applications

While eBPF originated from the Linux kernel, it is also used in other projects:

Microsoft provides an eBPF runtime for Windows
Cilium, Pixie, Calico, Falco use eBPF to provide, secure and observe network interactions on top of Kubernetes
bpftrace uses an awk inspired language to write tracing utilities through bcc
Katran uses eBPF to write high-performance network load balancers
KubeArmor uses eBPF for container runtime security enforcement in Kubernetes
…

Zone	Description	Physical Memory
`ZONE_DMA`	Contains frames that can be accessed through DMA	\(\lt\) 16 MiB
`ZONE_NORMAL`	Permanently mapped pages (DMA also possible)	16–896 MiB
`ZONE_HIGHMEM`	Memory not permanently mapped into the kernel’s address space	\(\gt\) 896 MiB