Linux Kernel Programming

Redha Gouicem & Julien Sopena




   

Chapter 1: History and Architecture of the Linux Kernel

Operating System and Kernel

In this course, we will use the following definitions:

Definition

The operating system is the set of software components that enables applications to use the underlying hardware and provides APIs to ease development.

Definition

The kernel is the set of components of the operating system that are executed in a privileged mode, usually in supervisor mode.

Kernel Taxonomy

Kernels are usually classified in various types:

  • Monolithic kernels
  • Microkernels
  • Hybrid microkernels
  • Unikernels


Let’s have a quick recap of these kernel architectures!

Monolithic Kernels

A monolithic kernel embeds all the system functionalities in a single binary. It contains all the core features of an operating system (scheduling, memory management, etc…) as well as drivers for devices or less essential components.


Characteristics

  • Defines a high level interface through system calls
  • Good performance when kernel components communicate (regular function calls in kernel space)
  • Limited safety: if one kernel component crashes, the whole system crashes

Examples

  • Unix family: BSD, Solaris
  • Unix-like: Linux
  • DOS: MS-DOS
  • Critical embedded systems: Cisco IOS

Why ‘monolithic’?

Monolithic means it is built as a single binary and runs in the same address space. The source code can still be organised in a modular way (e.g., using libraries).

Source: Wikimedia

Modularity

Some monolithic kernels allow dynamic code loading as modules, e.g., for drivers. These are usually called modular monolithic kernels.

Microkernels

A microkernel contains only the minimal set of features needed in kernel space:
address-space management, basic scheduling and basic inter-process communication.
All other services are pushed to user space as servers:
file systems, device drivers, high level interfaces, etc.


Characteristics

  • Small memory footprint, making it a good choice for embedded systems
  • Enhanced safety: when a user space server crashes, it does not crash the whole system
  • Adaptability: servers can be replaced/updated easily, without rebooting
  • Limited performance: IPCs are costly and numerous in such an architecture

Examples

  • Minix
  • L4 family: seL4, OKL4, sepOS
  • Mach
  • Zircon

Source: Wikimedia

Hybrid Microkernels

The hybrid kernel architecture sits between monolithic kernels and microkernels.
It is a monolithic kernel where some components have been moved out of kernel space as servers running in user space.

While the structure is similar to that of microkernels, i.e., using user space servers, hybrid kernels do not provide the same safety guarantees, as most components still run in the kernel.

Controversial architecture

This architecture’s existence is controversial, as some just define it as a stripped down monolithic kernel.


Examples

  • Windows NT
  • XNU (Mach + BSD)

Source: Wikimedia

Unikernels

A unikernel, or library operating system, embeds all the software in supervisor mode.
The kernel as well as all user applications run in the same privileged mode.

It is used to build single application operating systems, embedding only the necessary set of applications in a minimal image.


Characteristics

  • High performance: system calls become regular function calls, and no copies are needed between user and kernel spaces
  • Security: the attack surface is minimized, making the system easier to harden
  • Limited usability: unikernels are hard to build because few features are supported


Examples

  • Unikraft
  • clickOS
  • IncludeOS

Comparison Between Kernel Architectures



The choice of architecture has various impacts on performance, safety and interfaces:

  • Switching modes is costly: minimizing mode switches improves performance
  • Supervisor mode is not safe: minimizing code in supervisor mode improves safety and reliability
  • High level interfaces for programmers are in different locations depending on the architecture
    i.e., in the kernel for monolithic, but in libraries or servers for microkernels

In this course, we will focus on a monolithic modular kernel: Linux.

A Brief History of the Linux Kernel

Unix Systems

In the 1960s, MIT, AT&T Bell Labs and General Electric built Multics (Multiplexed Information and Computing Service).

Multics is a time-sharing operating system for mainframes that introduced new concepts:

  • multitasking: multiple users can use the system simultaneously
  • hierarchical file system: files are organised as a tree with directories
  • single-level store: files on storage are all mapped in memory, thus not accessed with read/write primitives, but through regular memory accesses


In 1969, AT&T Bell Labs left the project and started Unix, led by Ken Thompson, with Dennis Ritchie, Brian Kernighan, Douglas McIlroy, and Joe Ossanna.

Unix kept the hierarchical file system but dropped the single-level store, going for an “everything is a file” philosophy.

Unix was originally a single-tasking OS.


Why ‘Unix’?

The name Unix is a pun on Multics/Unics. Kernighan came up with the name, but states that “no one can remember” who came up with the spelling.

Timeline of Unix Systems

Source: https://en.wikipedia.org/wiki/History_of_Unix

Linux: Origins

First public appearance on the Minix newsgroup

From: Linus Benedict Torvalds
To: comp.os.minix
Subject: What would you like to see most in minix?
Date: 25 August 1991,  22:57:08

Hello everybody out there using minix -

I'm doing a (free) operating system (just a hobby, won't be
big and professional like gnu) for 386(486) AT clones. This
has been brewing since april, and is starting to get ready.
I'd like any feedback on things people like/dislike in minix,
as my OS resembles it somewhat (same physical layout of the
file-system (due to practical reasons) among other things).

I've currently ported bash(1.08) and gcc(1.40), and things
seem to work. This implies that I'll get something practical
within a few months, and I'd like to know what features most
people would want. Any suggestions are welcome, but I won't
promise I'll implement them :-)

Linus (torv...@kruuna.helsinki.fi)

PS. Yes - it's free of any minix code, and it has a
multi-threaded fs. It is NOT protable (uses 386 task switching
etc), and it probably never will support anything other than
AT-harddisks, as that's all I have :-(.

Reply from Andrew Tanenbaum (creator of Minix)

From: Andrew S. Tanenbaum
To: comp.os.minix
Subject: What would you like to see most in minix?
Date: 30 January 1992, 09:04

/* blablabla */

I still maintain the point that designing a monolithic kernel
in 1991 is a fundamental error. Be thankful you are not my
student. You would not get a high grade for such a design :-)

/* blablabla */

Prof. Andrew S. Tanenbaum (a...@cs.vu.nl)

Chronology of the Linux Kernel

Year | Version | Features
1994 | 1.0 | stable kernel with basic UNIX functionalities
1995 | 1.2–1.3 | round-robin scheduler, loadable modules, /dev/random
1996 | 2.0 | PowerPC support, multicore, improved networking, Tux
1999 | 2.2 | frame buffer, NTFS, FAT32, IPv6, USB, SLAB allocator
2001 | 2.4 | new file systems (ext3, XFS, tmpfs), netfilter
2003 | 2.6 | preemptible kernel, O(1) scheduler, ALSA
2004 | 2.6.4–2.6.10 | EFI support, x86-64, ARMv6, CFQ IO scheduler
2005 | 2.6.14 | FUSE support
2007 | 2.6.20–2.6.23 | KVM, tickless kernel, SLUB allocator, CFS scheduler
2008 | 2.6.24–2.6.28 | cgroups, ext4
2011 | 2.6.39 | removal of the Big Kernel Lock (BKL)
2014 | 3.14–3.18 | OverlayFS, eBPF, kernel address space layout randomization (KASLR)
2015 | 4.0 | live patching
2018 | 4.15 | kernel page table isolation (security mitigations)
2019 | 5.1 | io_uring
2020 | 5.6 | WireGuard
2022 | 6.1 | multi-gen LRU eviction algorithm, initial Rust support
2023 | 6.6 | new EEVDF scheduler
2024 | 6.12 | PREEMPT_RT, sched_ext

Linux Kernel Architecture

Linux offers six main functions:

  1. Process management
  2. Memory management
  3. Network management
  4. Storage management
  5. System interface
  6. Human interface

through five abstraction layers:

  1. User space interfaces

    System calls, procfs, sysfs, device files, …

  2. Virtual subsystems

    Virtual memory, virtual filesystem, network protocols, …

  3. Functional subsystems

    Filesystems, memory allocators, scheduler, …

  4. Device control

    Interrupts, generic drivers, block devices, …

  5. Hardware interfaces

    Device drivers, architecture-specific code, …

Linux Kernel Map

Source: https://makelinux.github.io/kernel/map/

Linux Kernel Source Tree Structure

1. Tools and environment

2. Core components

3. Specific subsystems

4. Drivers and architecture-specific code


arch/

block/

COPYING

CREDITS

crypto/

Documentation/

drivers/

fs/

include/

init/

ipc/

Kbuild

Kconfig

kernel/

lib/

MAINTAINERS

Makefile

mm/

net/

README

REPORTING-BUGS

samples/

scripts/

security/

sound/

tools/

usr/

virt/

Linux Kernel Source Tree Structure (2)

Tools and environment:

  • Documentation/: text documentation, in addition to comments
  • scripts/: scripts used for configuration, formatting, etc.
  • usr/: utilities to generate the Linux image
  • tools/: user space tools to interact with the kernel
  • samples/: code samples (a good place to start)

Core components:

  • init/: kernel start up code (including main.c)
  • kernel/: main kernel components code
  • lib/: libc used to build the kernel
  • include/: headers

Specific subsystems:

  • block/: drivers for block devices
  • crypto/: cryptographic algorithms, hashes, …
  • fs/: file systems
  • ipc/: inter-process communication
  • mm/: memory management
  • net/: network support
  • security/: kernel security mechanisms
  • sound/: sound drivers, audio support
  • virt/: virtualisation support (kvm)

Drivers:

  • arch/: architecture-specific code for each processor family
  • drivers/: drivers for various hardware

Chapter 2: C Bootcamp and Kernel Programming

C Bootcamp

The Linux kernel is a C monster

As of 2025, the Linux kernel exceeds 40 million lines of code.

Despite its size, it remains accessible because:
  • The code strictly adheres to the Linux kernel coding style.
  • The code is highly structured, resembling an object-oriented programming style.
  • The kernel is modular, allowing developers to focus on specific subsystems.
  • Extensive documentation is available, both in the source code and online.
  • A large and active community provides support and guidance.
  • The code review process is rigorous and ensures high-quality contributions.
  • Tools like git, checkpatch.pl, and sparse help maintain code quality and consistency.


Predominantly written in C

C Standard: C11 since version 5.18 (March 2022).

Assume GCC (version 5.1 or later, since September 2021).


Keyword | Description
#pragma GCC | Provides compiler-specific directives.
__attribute__ | Modifies the behavior of functions, variables, or types.
__builtin_* | Provides built-in functions for specific operations, often optimized.
__asm__ | Allows embedding assembly code directly in C/C++.
__typeof__ | Retrieves the type of an expression or variable.
__restrict__ | Indicates exclusive use of a pointer for optimizations.
__volatile__ | Prevents the compiler from optimizing an instruction or variable.
__thread | Declares a thread-local variable (Thread Local Storage).
asmlinkage | Enforces specific calling conventions for system calls.
__builtin_expect | Hints to the compiler about branch prediction.


Note

These keywords are not part of the C standard and are specific to GCC. Some of them are also available in other compilers, but their syntax may vary. Besides GCC, the kernel can currently be compiled with Clang.

Prerequisites for this course

Before diving into the core of this vast system, in this lecture we will review or learn some advanced aspects of the C language.

However, to fully benefit from this course, you should already have a solid understanding of the basic system concepts.


C language
  • The basics of C: control structures, pointers, memory allocations, …
  • The different segments and their functioning: .text, .rodata, .data, .bss, .stack, and .heap
  • The basics of compilation: preprocessor, compilation, and linking
System Programming
  • The system functions provided by Glibc: file handling, process management, synchronization, and more.
  • The concepts of address spaces, including virtual and physical addresses
  • The principles of interrupts and system calls
Assembly Language
  • The basics of assembly language: registers, instructions, …
  • ABI of function calls: calling conventions, stack management, …

Macro

Macros in C are preprocessor directives defined with #define, enabling text substitution before compilation.

They are invaluable for centralizing constants, simplifying code, and enhancing maintainability.


Macro substitution in strings

#define BITS_PER_LONG 64

printf("size : BITS_PER_LONG\n");

$ ./a.out
size : BITS_PER_LONG

#define BITS_PER_LONG 64

printf("size : %d\n", BITS_PER_LONG);

$ ./a.out
size : 64

Using a macro to prefix strings

#define MODULE_NAME "wip"

printf("%s: NULL pointer dereference\n", MODULE_NAME);

$ ./a.out
wip: NULL pointer dereference

#define MODULE_NAME "wip: "

printf(MODULE_NAME "NULL pointer dereference\n");

$ ./a.out
wip: NULL pointer dereference
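
The two behaviours above (no expansion inside a string literal, and concatenation of adjacent string literals) can be checked with a small sketch. The helper names format_message and unexpanded are illustrative assumptions, not kernel code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define BITS_PER_LONG 64
#define MODULE_NAME "wip: "

/*
 * Hypothetical helper: writes the prefixed message into buf.
 * Adjacent string literals are merged by the compiler, so
 * MODULE_NAME "size %d\n" becomes the single literal "wip: size %d\n".
 */
static int format_message(char *buf, size_t len)
{
    assert(buf != NULL);
    /* BITS_PER_LONG is expanded here because it appears outside the quotes. */
    return snprintf(buf, len, MODULE_NAME "size %d\n", BITS_PER_LONG);
}

/* Inside a string literal, the macro name is plain text and is not expanded. */
static const char *unexpanded(void)
{
    return "size : BITS_PER_LONG";
}
```

Calling format_message() on a 64-byte buffer should leave the text "wip: size 64" followed by a newline in it, while unexpanded() still contains the literal characters BITS_PER_LONG.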

If there is an issue, you can observe the result of macro substitution using:

$ gcc -E your_file.c -o your_file_after_preprocessor.c

Predefined Macros

Standard Predefined Macros

The C standard defines several predefined macros that provide useful information about the compilation environment:

  • __FILE__: Expands to the name of the current file as a string literal.
  • __LINE__: Expands to the current line number in the source file as an integer constant.
  • __func__: A predefined identifier (not a macro) that holds the name of the current function as a character array (C99).
  • __DATE__: Expands to the date of compilation as a string literal in the format “Mmm dd yyyy”.
  • __TIME__: Expands to the time of compilation as a string literal in the format “hh:mm:ss”.

GCC-Specific Predefined Macros

In addition to the standard predefined macros, GCC provides several useful predefined macros:

  • __VERSION__: Expands to a string literal containing the version of GCC.
  • __OPTIMIZE__: Defined if optimizations are enabled.
  • __BASE_FILE__: Expands to the name of the main source file being compiled.
  • __FILE_NAME__: Expands to the name of the current file being compiled, without the directory path.
  • __COUNTER__: Expands to an integer that starts at 0 and is incremented each time the macro is expanded in the current translation unit.

Example Usage

printf("Application compiled on %s at %s with GCC %s\n", __DATE__, __TIME__, __VERSION__);
printf("ASSERT %d: NULL pointer dereference in %s:%d\n", __COUNTER__, __FILE__, __LINE__);
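
As a minimal sketch of how these predefined names behave, the following helpers can be compiled with GCC or Clang. The names LOG_HERE, current_function and counter_step are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical logging macro: prefixes a message with its source location. */
#define LOG_HERE(buf, len, msg) \
    snprintf((buf), (len), "%s:%d (%s): %s", __FILE__, __LINE__, __func__, (msg))

/* __func__ behaves like a static array holding the enclosing function's name. */
static const char *current_function(void)
{
    return __func__;
}

/* Each expansion of __COUNTER__ yields the next integer (GCC/Clang extension). */
static int counter_step(void)
{
    int first = __COUNTER__;
    int second = __COUNTER__;

    assert(second > first);
    return second - first;  /* two consecutive expansions differ by one */
}
```

Because LOG_HERE is a macro, __FILE__, __LINE__ and __func__ describe the place where the macro is used, not the place where it is defined. This is the reason logging helpers are usually written as macros rather than functions.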

Macro Arguments

The C standard allows adding arguments to macros to make them more generic, but they require special attention to avoid errors.


Error 1: Space after the macro name

#define half (a) a / 2

int x = half(4);  // -> x = (a) a / 2(4)

#define half(a) a / 2

int x = half(4);  // -> x = 4 / 2

Error 2: Using arguments without parentheses

#define half(a) a / 2

int x = half(4 + 4);  // -> x = 4 + 4 / 2

#define half(a) (a) / 2

int x = half(4 + 4);  // -> x = (4 + 4) / 2

Error 3: Not encapsulating the macro body

#define half(a) (a) / 2

int x = half(4) * 2;  // -> x = (4) / 2 * 2

#define half(a) ( (a) / 2 )

int x = half(4) * 2;  // -> x = ( (4) / 2 ) * 2

Limitation: Side-effect operators

#define twice(a) ( (a) + (a) )

int x = twice(a++);  // -> x = (a++) + (a++): a is modified twice, which is undefined behaviour

Solution with GCC’s statement expressions

#define twice(a) ({ int _a = (a); _a + _a; })

int x = twice(a++);  // a is incremented exactly once
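
The difference between the two forms of the doubling macro can be observed by counting evaluations. The sketch below is an illustration under assumptions: the names twice_naive, twice, next_value and the two count_* helpers are hypothetical, and statement expressions require GCC or Clang. A function call is used as the argument instead of a++ so that the naive version stays well defined.

```c
#include <assert.h>

/* Naive version: the argument expression is evaluated twice. */
#define twice_naive(a) ((a) + (a))

/* Statement expression (GNU extension): the argument is evaluated exactly once. */
#define twice(a) ({ int _a = (a); _a + _a; })

/* Hypothetical helper that records how many times it has been called. */
static int calls;

static int next_value(void)
{
    calls++;
    return 21;
}

/* Returns the number of times the argument was evaluated by each macro. */
static int count_naive(void)
{
    calls = 0;
    (void)twice_naive(next_value());
    return calls;
}

static int count_safe(void)
{
    calls = 0;
    (void)twice(next_value());
    return calls;
}
```

count_naive() reports two evaluations and count_safe() reports one, which is why the statement-expression form is safe to use with arguments that have side effects.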

Macro Arguments: Example of Use in the Kernel

The kernel uses and abuses parameterized macros to avoid function calls and their associated overhead.

Since the kernel does not follow a naming convention that distinguishes macros from functions, it is often necessary to look at the declaration to know whether a call is a macro or a function.


Exercise

Provide the definition of the macro phy_div found in the file drivers/net/wireless/realtek/rtw89/phy.h of the Linux kernel.

This macro divides two numbers but returns 0 if the denominator is zero.

#define phy_div(a, b) ({int _b = (b); (_b) ? ((a) / (_b)) : 0; })


Solution independent of parameter types

GCC provides a way to refer to the type of an expression: typeof.

The syntax of using this keyword looks like sizeof, but the construct acts semantically like a type name defined with typedef.

#define phy_div(a, b) ({typeof(b) _b = (b); (_b) ? ((a) / (_b)) : 0; })
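
A minimal sketch of the type-independent version in use, assuming GCC or Clang. The spelling __typeof__ is used here because it is accepted even in strict ISO modes; the wrapper functions div_int and div_double are hypothetical.

```c
#include <assert.h>

/* Same shape as the macro above, spelled with __typeof__ for portability. */
#define phy_div(a, b) ({ __typeof__(b) _b = (b); (_b) ? ((a) / (_b)) : 0; })

/* The same macro body works for integers ... */
static int div_int(int a, int b)
{
    return phy_div(a, b);
}

/* ... and for floating-point values, because _b takes the type of b. */
static double div_double(double a, double b)
{
    return phy_div(a, b);
}
```

One pitfall of this pattern: an argument that is itself named _b would be shadowed by the macro's local variable, which is why such temporaries usually get unlikely names.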

Dealing with semicolons

The use of macros should be completely transparent to the user. In the kernel, it is often discovered late that a function being used is actually a macro.

This poses a problem when a function-like macro is used as the body of an if/else branch, as with the macro virtio_cread used in virtio_bt.c.

/* Config space accessors. */
#define virtio_cread(vdev, structname, member, ptr)         \
    {                                                       \
        typeof(((structname*)0)->member) virtio_cread_v;    \
        might_sleep();                                      \
        ...
        *(ptr) = virtio_to_cpu(vdev, virtio_cread_v);       \
    }

        if (virtio_has_feature(vdev, VIRTIO_BT_F_CONFIG_V2))
            virtio_cread(vdev, struct virtio_bt_config_v2,
                         vendor, &vendor);
        else
            virtio_cread(vdev, struct virtio_bt_config,
                         vendor, &vendor);

$ make
virtio_bt.c: In function ‘main’:
virtio_bt.c:313:9: error: ‘else’ without a previous ‘if’
  313 |         else
      |         ^~

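The braces of the macro already form a complete statement, so the semicolon written after the macro call is an extra empty statement that ends the if before the else is reached.

do { ... } while(0) is a generic solution commonly used in the kernel: the construct requires a trailing semicolon, so the macro call behaves like a single statement.
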
/* Config space accessors. */
#define virtio_cread(vdev, structname, member, ptr)         \
    do {                                                    \
        typeof(((structname*)0)->member) virtio_cread_v;    \
        might_sleep();                                      \
        ...
        *(ptr) = virtio_to_cpu(vdev, virtio_cread_v);       \
    } while(0)


GCC’s statement expressions also provide a specific solution to this problem.
/* Config space accessors. */
#define virtio_cread(vdev, structname, member, ptr)         \
    ({                                                      \
        typeof(((structname*)0)->member) virtio_cread_v;    \
        might_sleep();                                      \
        ...
        *(ptr) = virtio_to_cpu(vdev, virtio_cread_v);       \
    })
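
The effect of the do { ... } while(0) wrapper can be reproduced in user space with a much smaller macro. This is a sketch only: the macro record and the function classify are hypothetical names chosen for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical two-statement macro wrapped in do { } while (0). */
#define record(dst, val)        \
    do {                        \
        int _v = (val);         \
        *(dst) = _v;            \
    } while (0)

/* Uses the macro as the unbraced body of both branches of an if/else. */
static int classify(int x)
{
    int out;

    if (x > 0)
        record(&out, 1);    /* the ';' completes the do/while statement */
    else
        record(&out, -1);

    assert(out == 1 || out == -1);
    return out;
}
```

If the macro body were wrapped in plain braces instead, the same function would fail to compile with the "else without a previous if" error shown earlier.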

Note

In practice, the do {} while(0) construct is also often used to define macros that stand in for empty functions. This ensures that the macro behaves like a regular function call, even when it does nothing. An example can be seen in the file include/linux/interrupt.h, which provides a default no-op definition of hard_irq_disable:

#ifndef hard_irq_disable
#define hard_irq_disable()  do { } while(0)
#endif

From Macros to Inline Functions

While macros are powerful, they can lead to unexpected behavior due to their lack of type checking and potential for side effects.
Inline functions provide a safer and more maintainable alternative in many cases.

The inline keyword allows the compiler to replace a function call by the body of the called function.


Inlined function definition

static inline unsigned int max(unsigned int a, unsigned int b)
{
    return (a > b) ? a : b;
}

Initial call location

unsigned int f(unsigned int y)
{
    return max(y, y / 2);
}

After inlining

unsigned int f(unsigned int y)
{
    return (y > y / 2) ? y : y / 2;
}

After optimisations

unsigned int f(unsigned int y)
{
    return y;   /* y >= y / 2 always holds for an unsigned y */
}

How to choose between function, macro, and inline function

Simplicity and functionality

Criterion | Classic Function | Macro with Parameters | Inline Function
Readability and debugging | Easy to read and debug | Makes debugging harder | Easy to read and debug
Portability | Standard C | Depends on preprocessor | Depends on standard version
Types as parameters | Impossible | A parameter can be a type | Impossible

Safety and correctness

Criterion | Classic Function | Macro with Parameters | Inline Function
Execution control | Always a function call | Always substituted | May or may not be inlined
Type checking | Compiler checks types | No type checking | Compiler checks types
Side-effect handling | Managed by compiler | Side effects not controlled | Managed by compiler

Optimizing Performance

Criterion | Classic Function | Macro with Parameters | Inline Function
Function call cost | Function call overhead | No call, direct substitution | No call, inlined in code
Code size | Code factorization | Multiple substitutions | Multiple inlining
Optimization after substitution | Generic code | Contextual optimization | Contextual optimization
Register usage | Optimized by compiler | Increased register pressure | Increased register pressure
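
The "types as parameters" criterion is the one case where only a macro will do. As a minimal sketch, the hypothetical SWAP macro below takes a type as its first parameter, while the hypothetical inline function swap_int is type-checked but tied to a single type.

```c
#include <assert.h>

/* Hypothetical macro: the first parameter is a type, which no function can accept. */
#define SWAP(type, a, b)    \
    do {                    \
        type _tmp = (a);    \
        (a) = (b);          \
        (b) = _tmp;         \
    } while (0)

/* Inline function: arguments are type-checked, but only int is supported. */
static inline void swap_int(int *a, int *b)
{
    assert(a != 0 && b != 0);

    int tmp = *a;
    *a = *b;
    *b = tmp;
}
```

SWAP(double, x, y) and SWAP(int, i, j) both work with the same definition. In exchange, passing mismatched types to the macro is only caught indirectly, if at all, whereas swap_int rejects a double pointer at compile time.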

GCC-Specific Pragmas

GCC provides several pragmas to control the behavior of the compiler for specific sections of code.
Pragmas are introduced with #pragma and are often used for optimizations, warnings, or memory alignment.

Pragma | Description
#pragma once | Ensures a header file is included only once during compilation.
#pragma GCC optimize | Specifies optimization levels or flags for specific sections of code.
#pragma GCC diagnostic | Controls warnings or errors (e.g., ignore or treat as errors).
#pragma pack | Adjusts the alignment of structure members to reduce memory usage.
#pragma GCC poison | Prevents the use of specific identifiers in the code.
#pragma GCC target | Enables specific CPU instructions (e.g., SSE, AVX) for a section of code.

Examples

  • Disable a warning:

    #pragma GCC diagnostic ignored "-Wunused-variable"
    int unused_var;
  • Optimize a function:

    #pragma GCC optimize("O3")
    void my_function(void) {
        // Optimized code
    }
  • Pack a structure (remove padding between members):

    #pragma pack(push, 1)
    struct MyStruct {
        char a;
        int b;
    };
    #pragma pack(pop)

Warning: Pragmas are compiler-specific and may not be portable across different compilers.
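
The effect of #pragma pack on a structure can be measured with sizeof and offsetof. The sketch below assumes GCC or Clang; the structure names padded and packed and the two helper functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Default layout: padding is usually inserted so that b is aligned. */
struct padded {
    char a;
    int b;
};

/* Packed layout: b is placed immediately after a, possibly unaligned. */
#pragma pack(push, 1)
struct packed {
    char a;
    int b;
};
#pragma pack(pop)

/* With 1-byte packing, no padding precedes b. */
static_assert(offsetof(struct packed, b) == 1, "b directly follows a");

static size_t padded_size(void) { return sizeof(struct padded); }
static size_t packed_size(void) { return sizeof(struct packed); }
```

With a 4-byte int, packed_size() is 5, while padded_size() is typically 8. The saving comes at a cost: accesses to an unaligned member may be slower, or may require special handling on architectures that do not support unaligned accesses.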

GCC-Specific Attributes

To avoid name conflicts, GCC introduces a generic keyword __attribute__((...)).

Each attribute is a GCC-specific extension that allows modifying the behavior of functions, variables, and types.


Example: The noreturn attribute specifies that a function does not return, allowing the compiler to optimize accordingly.

void my_exit(void) __attribute__((noreturn));


Simplified usage in the kernel

The syntax of GCC attributes can be verbose, so the Linux kernel simplifies their usage with macros.
Each declaration includes links to external attribute documentation (e.g., gcc, clang).

The kernel defines, for example, an alias for the attribute noreturn as __noreturn:

/*
 *   gcc: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-noreturn-function-attribute
 * clang: https://clang.llvm.org/docs/AttributeReference.html#noreturn
 * clang: https://clang.llvm.org/docs/AttributeReference.html#id1
 */
#define __noreturn     __attribute__((__noreturn__))


This alias is used in the scheduler include/linux/sched/task.h:

void __noreturn do_task_dead(void);
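
The same aliasing pattern can be tried in user space. In this sketch, my_noreturn, die and checked_div are hypothetical names; only the attribute itself comes from GCC.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical user-space alias, mirroring the kernel's __noreturn shorthand. */
#define my_noreturn __attribute__((__noreturn__))

/* Never returns: prints a message and terminates the process. */
static void my_noreturn die(const char *msg)
{
    fprintf(stderr, "fatal: %s\n", msg);
    exit(EXIT_FAILURE);
}

/*
 * No return statement is needed after die(): thanks to the attribute,
 * the compiler knows that control never reaches the end of the function.
 */
static int checked_div(int a, int b)
{
    if (b != 0)
        return a / b;
    die("division by zero");
}
```

Without the attribute, a compiler with warnings enabled would typically report that control reaches the end of a non-void function in checked_div.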

Branch Prediction Attributes

The kernel introduces the macros likely() and unlikely() to tell the compiler which outcome of a condition is expected, so that it can lay out the code in favour of the common path.

Declaration

/*
 * Using __builtin_constant_p(x) to ignore cases where the return
 * value is always the same.  This idea is taken from a similar patch
 * written by Daniel Walker.
 */
# ifndef likely
#  define likely(x) (__branch_check__(x, 1, __builtin_constant_p(x)))
# endif
# ifndef unlikely
#  define unlikely(x)   (__branch_check__(x, 0, __builtin_constant_p(x)))
# endif

Example of use

static void next_reap_node(void)
{
    int node = __this_cpu_read(slab_reap_node);

    node = next_node(node, node_online_map);

    if (unlikely(node >= MAX_NUMNODES))
        node = first_node(node_online_map);

    __this_cpu_write(slab_reap_node, node);
}
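
The declaration shown earlier goes through __branch_check__, which adds branch profiling. The sketch below uses a simpler form built directly on GCC's __builtin_expect; treat these definitions as an assumption for illustration rather than a copy of the kernel headers. The function sum is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified user-space definitions built on GCC's __builtin_expect. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical example: sums an array, treating NULL as the rare error path. */
static long sum(const int *values, size_t n)
{
    long total = 0;

    if (unlikely(values == NULL))
        return -1;          /* expected to be rare */

    for (size_t i = 0; i < n; i++)
        total += values[i];

    assert(n > 0 || total == 0);
    return total;
}
```

The hints never change the result of the condition; they only influence code layout. The !!(x) idiom normalises any non-zero value to 1 so that it can be compared with the expected value.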

Enforcing a Calling Convention

The asmlinkage annotation tells the compiler to use a fixed calling convention for a function. On 32-bit x86, for example, it forces all arguments to be passed on the stack.


Without it, gcc may try to optimise function calls by placing arguments in registers instead.

Using asmlinkage prevents this optimisation, simplifying calling this function from assembly code.


It is mainly used in system calls in order to enforce the calling convention.

asmlinkage long sys_close(unsigned int fd);


In practice, asmlinkage is a macro whose definition is architecture-specific (see asm/linkage.h); one such definition is:

#define asmlinkage CPP_ASMLINKAGE __attribute__((syscall_linkage))

Unions

A union is a special type that allows storing different types of data at the same memory location.
Each member of a union is a typed alias of the same memory location.
The allocated size is equal to the size of the largest member of the union.

typedef union {
    short x;
    long y;
    float z;
} my_union_t;
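
The two properties stated above (shared location, size of the largest member) can be checked directly. This is a minimal sketch; the tag number and the helper union_size are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

union number {
    short x;
    long y;
    float z;
};

/* Every member starts at offset 0: they all alias the same storage. */
static_assert(offsetof(union number, x) == 0, "x at offset 0");
static_assert(offsetof(union number, y) == 0, "y at offset 0");
static_assert(offsetof(union number, z) == 0, "z at offset 0");

/* The union is at least as large as its largest member. */
static size_t union_size(void)
{
    return sizeof(union number);
}
```

Writing one member and then reading another reinterprets the same bytes, so the value obtained depends on the representation of the types involved and on the byte order of the machine.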


Examples in the kernel

Each kernel thread has its own stack.

Historically, the thread_info structure was placed at the bottom of the kernel stack.

The union overlays both in the same memory, allowing the kernel to find thread_info by masking the stack pointer.


union thread_union {
    struct thread_info thread_info;
    unsigned long stack[THREAD_SIZE/sizeof(long)];
};
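
The masking trick can be simulated in user space. In this sketch, the value of THREAD_SIZE, the one-field thread_info and the helpers thread_info_from_sp and alloc_thread_union are all assumptions made for the illustration; the real kernel code differs per architecture.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define THREAD_SIZE 8192    /* assumed stack size; must be a power of two */

struct thread_info { int cpu; };    /* stand-in for the real structure */

union thread_union {
    struct thread_info thread_info;
    unsigned long stack[THREAD_SIZE / sizeof(long)];
};

/* Rounds any address inside the stack down to the start of the union. */
static struct thread_info *thread_info_from_sp(void *sp)
{
    uintptr_t mask = ~((uintptr_t)THREAD_SIZE - 1);

    return (struct thread_info *)((uintptr_t)sp & mask);
}

/* The trick only works if the union is aligned on THREAD_SIZE. */
static union thread_union *alloc_thread_union(void)
{
    union thread_union *tu = aligned_alloc(THREAD_SIZE, sizeof(*tu));

    assert(tu == NULL || ((uintptr_t)tu % THREAD_SIZE) == 0);
    return tu;
}
```

Given any pointer into stack[], thread_info_from_sp() returns the address of the union, which is also the address of thread_info. No extra pointer has to be stored anywhere, at the price of a strict alignment requirement on every stack.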


Note

Today, major architectures (x86, ARM, ARM64, RISC-V, PowerPC, S390) have moved thread_info into task_struct, for security reasons: a stack overflow could corrupt thread_info and escalate privileges.
This union remains in use only on older or less maintained architectures (MIPS, SPARC, M68K, etc.).

struct page

struct page is one of the most intricate examples of union usage in the kernel:

struct page {
    unsigned long flags;        /* Atomic flags, some possibly
                     * updated asynchronously */
    /*
     * Five words (20/40 bytes) are available in this union.
     * WARNING: bit 0 of the first word is used for PageTail(). That
     * means the other users of this union MUST NOT use the bit to
     * avoid collision and false-positive PageTail().
     */
    union {
        struct {    /* Page cache and anonymous pages */
            /**
             * @lru: Pageout list, eg. active_list protected by
             * lruvec->lru_lock.  Sometimes used as a generic list
             * by the page owner.
             */
            union {
                struct list_head lru;

                /* Or, for the Unevictable "LRU list" slot */
                struct {
                    /* Always even, to negate PageTail */
                    void *__filler;
                    /* Count page's or folio's mlocks */
                    unsigned int mlock_count;
                };

                /* Or, free page */
                struct list_head buddy_list;
                struct list_head pcp_list;
            };
            /* See page-flags.h for PAGE_MAPPING_FLAGS */
            struct address_space *mapping;
            union {
                pgoff_t index;      /* Our offset within mapping. */
                unsigned long share;    /* share count for fsdax */
            };
            /**
             * @private: Mapping-private opaque data.
             * Usually used for buffer_heads if PagePrivate.
             * Used for swp_entry_t if PageSwapCache.
             * Indicates order in the buddy system if PageBuddy.
             */
            unsigned long private;
        };
        struct {    /* page_pool used by netstack */
            /**
             * @pp_magic: magic value to avoid recycling non
             * page_pool allocated pages.
             */
            unsigned long pp_magic;
            struct page_pool *pp;
            unsigned long _pp_mapping_pad;
            unsigned long dma_addr;
            atomic_long_t pp_ref_count;
        };
        struct {    /* Tail pages of compound page */
            unsigned long compound_head;    /* Bit zero is set */
        };
        struct {    /* ZONE_DEVICE pages */
            /** @pgmap: Points to the hosting device page map. */
            struct dev_pagemap *pgmap;
            void *zone_device_data;
            /*
             * ZONE_DEVICE private pages are counted as being
             * mapped so the next 3 words hold the mapping, index,
             * and private fields from the source anonymous or
             * page cache page while the page is migrated to device
             * private memory.
             * ZONE_DEVICE MEMORY_DEVICE_FS_DAX pages also
             * use the mapping, index, and private fields when
             * pmem backed DAX files are mapped.
             */
        };

        /** @rcu_head: You can use this to free a page by RCU. */
        struct rcu_head rcu_head;
    };

    union {     /* This union is 4 bytes in size. */
        /*
         * For head pages of typed folios, the value stored here
         * allows for determining what this page is used for. The
         * tail pages of typed folios will not store a type
         * (page_type == _mapcount == -1).
         *
         * See page-flags.h for a list of page types which are currently
         * stored here.
         *
         * Owners of typed folios may reuse the lower 16 bit of the
         * head page page_type field after setting the page type,
         * but must reset these 16 bit to -1 before clearing the
         * page type.
         */
        unsigned int page_type;

        /*
         * For pages that are part of non-typed folios for which mappings
         * are tracked via the RMAP, encodes the number of times this page
         * is directly referenced by a page table.
         *
         * Note that the mapcount is always initialized to -1, so that
         * transitions both from it and to it can be tracked, using
         * atomic_inc_and_test() and atomic_add_negative(-1).
         */
        atomic_t _mapcount;
    };

    /* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
    atomic_t _refcount;

#ifdef CONFIG_MEMCG
    unsigned long memcg_data;
#elif defined(CONFIG_SLAB_OBJ_EXT)
    unsigned long _unused_slab_obj_exts;
#endif

    /*
     * On machines where all RAM is mapped into kernel address space,
     * we can simply calculate the virtual address. On machines with
     * highmem some memory is mapped into kernel virtual memory
     * dynamically, so we need a place to store that address.
     * Note that this field could be 16 bits on x86 ... ;)
     *
     * Architectures with slow multiplication can define
     * WANT_PAGE_VIRTUAL in asm/page.h
     */
#if defined(WANT_PAGE_VIRTUAL)
    void *virtual;          /* Kernel virtual address (NULL if
                       not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */

#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
    int _last_cpupid;
#endif

#ifdef CONFIG_KMSAN
    /*
     * KMSAN metadata for this page:
     *  - shadow page: every bit indicates whether the corresponding
     *    bit of the original page is initialized (0) or not (1);
     *  - origin page: every 4 bytes contain an id of the stack trace
     *    where the uninitialized value was created.
     */
    struct page *kmsan_shadow;
    struct page *kmsan_origin;
#endif
} _struct_page_alignment;

Structures in Memory

A structure is a collection of one or more variables.

struct version {
    unsigned short major; // usually 2 bytes
    unsigned long minor;  // usually 8 bytes
    char flags;           // 1 byte
};

Memory alignment

A memory access is aligned if the accessed address is a multiple of the size of the access.

Example: an access to an unsigned long is aligned if the address is a multiple of 8 bytes.

Ordering and Padding - The compiler may insert padding between members and at the end of a structure, but:

  • The memory order of members always follows their declaration order
  • By default, each member is aligned to its natural alignment (for primitive types, usually its size)
  • For a given configuration (architecture, ABI, compiler options and attributes), the layout is always the same

The programmer can reorder fields to minimise padding:

struct version {
    unsigned long minor;  // usually 8 bytes
    unsigned short major; // usually 2 bytes
    char flags;           // 1 byte
};
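
The effect of reordering can be checked with sizeof and offsetof. The sketch below is a user-space illustration only: the names version_padded, version_reordered and the two helper functions are hypothetical, and the exact amount of padding depends on the ABI.

```c
#include <stddef.h>

/* Declaration order of the first example: padding after major and after flags. */
struct version_padded {
    unsigned short major;
    unsigned long minor;
    char flags;
};

/* Largest member first: only tail padding remains. */
struct version_reordered {
    unsigned long minor;
    unsigned short major;
    char flags;
};

/* Padding bytes = total size minus the sum of the member sizes. */
static size_t padding_padded(void)
{
    return sizeof(struct version_padded)
        - sizeof(unsigned short) - sizeof(unsigned long) - sizeof(char);
}

static size_t padding_reordered(void)
{
    return sizeof(struct version_reordered)
        - sizeof(unsigned short) - sizeof(unsigned long) - sizeof(char);
}
```

Comparing the two helpers on a given machine shows how many bytes the reordering saves; the reordered layout should never need more padding than the original one.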

Type Sizes Depend on Architecture & OS


The size of primitive types, and therefore structures, varies across architectures and operating systems.


struct version {
    unsigned short major; // usually 2 bytes
    unsigned long minor;  // 4 bytes or 8 bytes
    char flags;           // 1 byte
};




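Size in bytes of unsigned long

Architecture   Linux (LP64)   Windows (LLP64)   macOS (LP64)
x86-32         4              4                 4
x86-64         8              4                 8
ARM32          4              4                 -
ARM64          8              4                 8

LP64 and LLP64 are the 64-bit data models; on 32-bit targets all three systems use ILP32.

Size in bytes of struct version

Architecture   Data (no padding)   Linux   Windows   macOS
x86-32         7                   12      12        12
x86-64         11 / 7              24      12        24
ARM32          7                   12      12        -
ARM64          11 / 7              24      12        24

The Data column is the sum of the member sizes: 11 bytes with an 8-byte unsigned long, 7 bytes with a 4-byte one.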

Packing Structures

Padding wastes memory, especially in arrays of structures or frequently allocated objects.

GCC provides two ways to compact a structure by removing padding: a type attribute or a pragma directive.

The Linux kernel also defines its own macro for this in linux/compiler_attributes.h.


GCC attribute :   __attribute__((packed))
struct version {
    unsigned short major; // 2 bytes
    unsigned long  minor; // 8 bytes
    char           flags; // 1 byte
} __attribute__((packed));

// => sizeof(struct version) == 11

GCC Pragma :   #pragma pack
#pragma pack(push, 1)  
struct version {
    unsigned short major; // 2 bytes
    unsigned long minor;  // 8 bytes
    char flags;           // 1 byte
};
#pragma pack(pop) // Restore packing

// => sizeof(struct version) == 11

Linux Kernel Macros :   __packed
#define __packed __attribute__((__packed__))
struct version {
    unsigned short major; // 2 bytes
    unsigned long  minor; // 8 bytes
    char           flags; // 1 byte
} __packed;

/* => sizeof(struct version) == 11 */

Caution

Packing removes padding but can degrade performance:

  • Unaligned accesses may require multiple memory reads instead of one
  • On some architectures (e.g., some ARM cores), unaligned accesses can cause hardware faults
  • The compiler may generate extra instructions to handle misaligned members

Only use packing when memory savings outweigh the performance cost (e.g., network protocols, on-disk formats, or memory-constrained environments).
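
When a packed field has to be read through a pointer, copying the bytes avoids a misaligned load. The sketch below is a user-space illustration: struct wire_msg and read_u32_unaligned() are hypothetical names. Kernel code would normally rely on the get_unaligned() helper for the same purpose.

```c
#include <stdint.h>
#include <string.h>

/* Wire format without padding: a 1-byte tag followed by a 32-bit value. */
struct wire_msg {
    uint8_t tag;
    uint32_t value;     /* at offset 1, hence misaligned */
} __attribute__((packed));

/* Copy the bytes out instead of dereferencing a misaligned uint32_t pointer. */
static uint32_t read_u32_unaligned(const void *p)
{
    uint32_t v;

    memcpy(&v, p, sizeof(v));
    return v;
}
```

Compilers usually turn the memcpy() into a single load on architectures that tolerate unaligned accesses, so the safe version costs little where it is not needed.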

Controlling Alignment and Packing (cont.)

__attribute__((aligned(n)))

Forces a minimum alignment for a variable or structure member.

struct swsusp_info {
    struct new_utsname  uts;
    u32         version_code;
    unsigned long       num_physpages;
    int         cpus;
    unsigned long       image_pages;
    unsigned long       pages;
    unsigned long       size;
} __aligned(PAGE_SIZE);

static wait_queue_head_t folio_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;

Note

False sharing occurs when two CPUs access different variables that share the same cache line, causing unnecessary cache invalidations and performance degradation.

Tip

In the kernel, use the macros __aligned(n) and ____cacheline_aligned:

struct double_buffer {
    char buf_a[64] ____cacheline_aligned;
    char buf_b[64] ____cacheline_aligned;
};
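
The same idea can be tried outside the kernel with the C11 _Alignas specifier. This is a sketch under one stated assumption: a 64-byte cache line (CACHE_LINE is a hypothetical constant, the kernel uses SMP_CACHE_BYTES). struct counters is also a made-up example.

```c
#include <stddef.h>

#define CACHE_LINE 64   /* assumption: 64-byte cache lines */

/* Each counter starts its own cache line, so two CPUs never share one. */
struct counters {
    _Alignas(CACHE_LINE) unsigned long rx;  /* updated by one CPU */
    _Alignas(CACHE_LINE) unsigned long tx;  /* updated by another CPU */
};
```

The price of avoiding false sharing is size: the structure grows from two words to two full cache lines.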

Summary of Alignment and Packing Attributes

Macro                          Definition                                         File
__packed                       __attribute__((__packed__))                        linux/compiler_attributes.h
__aligned(n)                   __attribute__((__aligned__(n)))                    linux/compiler_attributes.h
____cacheline_aligned          __attribute__((__aligned__(SMP_CACHE_BYTES)))      linux/cache.h
____cacheline_aligned_in_smp   Same as above but only in SMP configuration        linux/cache.h
__page_aligned_data            Aligned to PAGE_SIZE; placed in the .data section  linux/linkage.h
__page_aligned_bss             Aligned to PAGE_SIZE; placed in the .bss section   linux/linkage.h

Variable-length Arrays

In C, an array does not carry its own size at run time, so the size must be stored somewhere. It is common to use a struct to keep the size close to the array:

struct buf {
    char *buffer;
    size_t length;
};


This has several drawbacks:


Allocation is done in two steps (allocate the struct, then allocate the array)

struct buf *alloc_buffer(size_t length)
{
    struct buf *b = malloc(sizeof(struct buf));
    b->length = length;
    b->buffer = malloc(length);

    return b;
}

Freeing also requires two calls to free

void free_buffer(struct buf *b)
{
    free(b->buffer);
    free(b);
}

Copying requires a manual deep copy

struct buf *copy_buf(struct buf *b)
{
    struct buf *copy = alloc_buffer(b->length);
    memcpy(copy->buffer, b->buffer, b->length);
    
    return copy;
}

Tail-padded Structures

One way to overcome these drawbacks is a tail-padded structure: an array of unspecified size is placed as the last member of the structure, so that the header and the data live in a single allocation.


struct buf {
    size_t length;
    char buffer[];
};

struct buf *alloc_buffer(size_t length)
{
    struct buf *b = malloc(sizeof(struct buf) + length);
    b->length = length;

    return b;
}

struct buf *b1 = alloc_buffer(128);
struct buf *b2 = alloc_buffer(128);
memcpy(b1, b2, sizeof(struct buf) + b2->length); // one call copies header and data
free(b1);                                        // one call frees header and data
free(b2);

Multiple implementations are possible:

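  • int buffer[]: flexible array member, standard since C99, preferred form
  • int buffer[1]: pre-C99 idiom; it compiles everywhere, but sizeof counts one element and indexing past it is formally undefined behaviour
  • int buffer[0]: non-standard (GNU zero-length array extension), but supported by compilers

Example in the kernel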
struct xyarray {
    size_t row_size;
    size_t entry_size;
    size_t entries;
    size_t max_x;
    size_t max_y;
    char contents[] __aligned(8);
};
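
To make the benefit concrete, here is a minimal user-space sketch of allocation and copy for the struct buf shown earlier. The NULL checks and the dup_buffer() name are additions for illustration, not part of the original example.

```c
#include <stdlib.h>
#include <string.h>

struct buf {
    size_t length;
    char buffer[];      /* C99 flexible array member */
};

/* One allocation covers the header and the payload. */
static struct buf *alloc_buffer(size_t length)
{
    struct buf *b = malloc(sizeof(*b) + length);

    if (!b)
        return NULL;
    b->length = length;
    return b;
}

/* Header and payload are contiguous, so one memcpy duplicates both. */
static struct buf *dup_buffer(const struct buf *src)
{
    struct buf *copy = alloc_buffer(src->length);

    if (copy)
        memcpy(copy, src, sizeof(*src) + src->length);
    return copy;
}
```

In kernel code, the size expression is usually written with the struct_size() helper, which also guards against arithmetic overflow.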

Array vs Pointer

int main(void)
{
    char *yes = "da";
    char ja[3];

    yes = ja;
    ja = yes;
}


If you compile this code, you get this error:

foo.c: In function ‘main’:
foo.c:6:12: error: assignment to expression with array type
    6 |         ja = yes;
      |            ^


yes is a pointer to a char (here, the first character of the string "da").

ja is an array identifier: it names the storage of the array and behaves like a symbolic constant, so it cannot be assigned to.

Array Identifiers Ambiguity

#include <stdio.h>

int main(void)
{
    char *yes = "da";
    char ja[3];

    printf("yes: %p - %p\n", (void *)yes, (void *)&yes);
    printf("ja:  %p - %p\n", (void *)ja, (void *)&ja);

    return 0;
}


results in:

yes: 0x55b70c1d7004 - 0x7ffcf5fe4268
ja:  0x7ffcf5fe4275 - 0x7ffcf5fe4275


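An array identifier is not a variable that stores an address: ja evaluates to the address of the first element and &ja to the address of the whole array, which is the same location (hence ja == &ja numerically, although the two expressions have different types).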
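
The difference between ja and &ja shows up in sizeof and in pointer arithmetic. The helpers below are a hypothetical sketch (the function names are made up) that isolates each case.

```c
#include <stddef.h>

static char ja[3];

/* sizeof sees the whole array, not a pointer. */
static size_t array_size(void)
{
    return sizeof(ja);
}

/* ja decays to char *, so ja + 1 advances by one element. */
static ptrdiff_t step_of_decayed_pointer(void)
{
    return (ja + 1) - ja;
}

/* &ja has type char (*)[3], so &ja + 1 advances by the whole array. */
static ptrdiff_t step_of_array_pointer(void)
{
    return (char *)(&ja + 1) - (char *)&ja;
}
```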

Function Pointers (1)

Declaration

A function pointer is declared with the following syntax:

return_type (*pointer_name)(parameter_list);


Example 1: a function taking no parameters and returning nothing

void (*func_p)(void);

Example 2: a function taking an int and a char, and returning an int

int (*func_p)(int, char);


Addressing

You can get a function’s address with the & operator.

void my_func(int foo)
{
    // body
}

void (*func_ptr)(int);      // declaration (the signature must match my_func)
func_ptr = &my_func;        // assignment


Function names are also symbolic constants, which means you can use the same naming ambiguity as with arrays:

func_ptr = my_func;            // assignment

Function Pointers (2)

Calling a function pointer

void say_hello(char *name)
{
    printf("Hello %s\n", name);
}

int main(void)
{
    void (*func_ptr)(char *);   // declaration
    func_ptr = say_hello;       // assignment
    (*func_ptr)("zero");        // call

    return 0;
}


The explicit dereference is optional: calling through the pointer directly is equivalent, so you can write:

func_ptr("zero");

Function Pointers (3)

As a function argument

Function pointers are frequently used in the kernel to set up callbacks.

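struct elem {
    int refcount;
    /* payload */
};

void free_elem(struct elem *e)
{
    free(e);
}

/* Drop one reference; call the release callback on the last one. */
void put_elem(struct elem *e, void (*release)(struct elem *))
{
    e->refcount--;
    if (!e->refcount)
        release(e);
}

int main(void)
{
    struct elem *e = malloc(sizeof(struct elem));

    if (!e)
        return 1;
    e->refcount = 1;            // one reference, held by main
    put_elem(e, free_elem);     // last reference dropped: free_elem(e) runs

    return 0;
}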


As a return value

int atoi(const char *nptr) { /* body */ }

int (*func_ptr(void)) (const char *)
{
    return atoi;
}
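
The declaration of a function returning a function pointer is hard to read. A typedef usually helps, as in the sketch below; parse_fn, parse_dec, parse_hex and pick_parser are hypothetical names chosen for the illustration.

```c
#include <stdlib.h>

/* Alias for "pointer to a function taking a string and returning an int". */
typedef int (*parse_fn)(const char *);

static int parse_dec(const char *s)
{
    return (int)strtol(s, NULL, 10);
}

static int parse_hex(const char *s)
{
    return (int)strtol(s, NULL, 16);
}

/* Same shape as func_ptr() above, but the return type is now readable. */
static parse_fn pick_parser(int base)
{
    return base == 16 ? parse_hex : parse_dec;
}
```

A caller can then write pick_parser(16)("2a") to select and call the parser in one expression.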

Good Practices in the Kernel

Wise Use of the Stack

Kernel stack is small compared to user stack!


The stack size is statically defined at kernel compile time and cannot grow dynamically.


Usually fits on a few pages:

  • 8 KB for 32-bit architectures
  • 16 KB for 64-bit architectures


What to avoid?

  • Large allocations on the stack
  • Deep recursive call chains

Floating-Point Operations

Avoid floating point operations at all cost!


Why?


Extremely costly!

  • Enable the FPU (Floating-Point Unit)
  • Save all user space state related to the FPU (i.e., registers)
  • Disable the FPU


Not very useful!

  • No access to the libc, so no existing complex functions
  • You can only use inline functions from gcc
  • Most of the time, you can work around this with integer approximation
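
As an illustration of the integer workaround, a ratio can be computed in fixed point by scaling before dividing. This is a minimal sketch: percent_x100() is a hypothetical helper, and it assumes part <= total so that the result fits in 32 bits.

```c
#include <stdint.h>

/*
 * Ratio part/total as a percentage with two decimals, scaled by 100:
 * 3333 means 33.33 %. Returns 0 when total is 0.
 * Assumes part <= total, so the result is at most 10000.
 */
static uint32_t percent_x100(uint32_t part, uint32_t total)
{
    if (total == 0)
        return 0;
    /* Widen before multiplying so that part * 10000 cannot overflow. */
    return (uint32_t)(((uint64_t)part * 10000) / total);
}
```

In kernel code on 32-bit architectures, a 64-bit division like this one typically goes through helpers such as div_u64() rather than the / operator.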

On the Dangers of Kernel Programming

Making changes to your kernel can render it unstable and lead to a kernel panic, i.e., a full system crash.


Keep a backup kernel

Never replace your running kernel with a new one!
Always keep a fully working backup kernel installed in your bootloader!


Work in modules

Always implement your changes as modules if possible.

  • That will limit the impact of some crashes in your code on the rest of the kernel.
  • Easier to test since you can load modules dynamically, test, unload, make changes and repeat

Important

Keep in mind that a bug can corrupt persistent data, e.g. on your hard drive. You could lose data for good if you work directly on your system!

Tip

Working in a virtual machine alleviates most of these issues!

Kernel APIs

Linux Kernel API

In the kernel, you won’t have access to the usual libraries like the libc.


Thankfully, the kernel provides its own internal “library” with basic functionalities.

They are described in Documentation/core-api/index.rst.


Let’s make a quick tour of some of these functionalities!

  • Generic base data types
  • Returning errors
  • Printing
  • Memory allocation
  • Waiting for resources
  • Task queues

Use Generic Types!

To ensure portability across architectures, the kernel offers generic types defined in include/linux/types.h


u8: unsigned byte (8 bits)
u16: unsigned word (16 bits)
u32: unsigned doubleword (32 bits)
u64: unsigned quadword (64 bits)
s8: signed byte (8 bits)
s16: signed word (16 bits)
s32: signed doubleword (32 bits)
s64: signed quadword (64 bits)


If a variable is visible from user space (e.g., ioctl), you must use types prefixed with __ (double underscore)

__u8        __s8
__u16       __s16
__u32       __s32
__u64       __s64
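
The benefit of fixed-width types is that a structure has the same layout whatever the size of int or long. The sketch below reproduces the idea in user space: the typedefs are stand-ins built on stdint.h (an assumption, not the kernel definitions) and struct sample_record is a hypothetical example.

```c
#include <stddef.h>
#include <stdint.h>

/* User-space stand-ins for the kernel's u8/u16/u32/u64. */
typedef uint8_t  u8;
typedef uint16_t u16;
typedef uint32_t u32;
typedef uint64_t u64;

/* Hypothetical record exchanged between machines or with user space. */
struct sample_record {
    u64 timestamp;
    u32 id;
    u16 kind;
    u8 flags;
    u8 reserved;    /* explicit padding keeps the layout obvious */
};
```

Members are ordered from largest to smallest and the padding is spelled out, so no hidden padding bytes are left for the compiler to choose.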

Returning Errors

Functions in the kernel follow the same convention as system calls by returning an integer:

  • Success: a value \(\ge 0\)
  • Error: the negative value of the error code (i.e., -errno)

If the function returns a pointer:

  • Success: the pointer to return
  • Failure: two possibilities
    • Return NULL if there is only one reason to fail

    • Return the error code encoded with the ERR_PTR() macro.

      The calling function can check if there was an error with IS_ERR() and get the error code with PTR_ERR()

int do_shash(unsigned char *name, unsigned char *result, const u8 *data1, unsigned int data1_len,
          const u8 *data2, unsigned int data2_len, const u8 *key, unsigned int key_len)
{
    int rc;
    unsigned int size;
    struct crypto_shash *hash;
    struct sdesc *sdesc;

    hash = crypto_alloc_shash(name, 0, 0);
    if (IS_ERR(hash)) {
        rc = PTR_ERR(hash);
        pr_err("%s: Crypto %s allocation error %d\n", __func__, name, rc);
        return rc;
    }
    /* ... */
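
The encoding behind ERR_PTR() relies on the fact that the last few thousand addresses of the address space never hold a valid object. The following user-space sketch mimics the mechanism with lowercase, hypothetical names (err_ptr, ptr_err, is_err, make_widget); it is not the kernel implementation.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095  /* error codes map to the last 4095 addresses */

/* Encode a negative errno value as a pointer. */
static inline void *err_ptr(intptr_t error)
{
    return (void *)error;
}

/* Decode the errno value stored in an error pointer. */
static inline intptr_t ptr_err(const void *ptr)
{
    return (intptr_t)ptr;
}

/* True if ptr lies in the range reserved for error codes. */
static inline int is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical constructor: returns a buffer or an encoded -errno. */
static void *make_widget(size_t size)
{
    void *w;

    if (size == 0)
        return err_ptr(-EINVAL);
    w = malloc(size);
    return w ? w : err_ptr(-ENOMEM);
}
```

The caller distinguishes several failure reasons through a single return value, which a plain NULL cannot express.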

Printing

If you need to print information to be available from user space, e.g., tracing or debugging, you can use the printk() function.
It works similarly to printf(), with a couple of differences:

  • You should prefix your format string with a priority level defined by macros in include/linux/kern_levels.h, from KERN_EMERG to KERN_DEBUG.
  • The output doesn’t go to stdout, but in the kernel ring buffer that you can read from user space with the dmesg command, or with journalctl (and other commands)

Example:

printk(KERN_ERR "%s:%d: this shouldn't be reached...\n", __FILE__, __LINE__);

There are also predefined macros for each level:

pr_debug("debug message\n");
pr_info("info message\n");
pr_err("error message\n");

Tip

Formats are available at Documentation/core-api/printk-formats.rst.

Filtering your prints

You can define the following macro at the top of your module, before any #include, to add a prefix to all your prints:

#define pr_fmt(fmt) "%s:%s: " fmt, KBUILD_MODNAME, __func__

This will add your module name and the name of the function as a prefix to all your prints.
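
The macro works because pr_fmt() expands to a longer format string followed by extra arguments. The user-space sketch below imitates this; MODNAME, log_to() and report_ready() are hypothetical stand-ins, and the output goes to a buffer instead of the kernel ring buffer.

```c
#include <stdio.h>

#define MODNAME "demo"  /* stand-in for KBUILD_MODNAME */

/* Same shape as the kernel macro: prefix first, then the caller's format. */
#define pr_fmt(fmt) "%s:%s: " fmt, MODNAME, __func__

/* Stand-in for pr_info(): formats into buf (uses the GNU ## extension). */
#define log_to(buf, len, fmt, ...) \
    snprintf(buf, len, pr_fmt(fmt), ##__VA_ARGS__)

static int report_ready(char *buf, size_t len, int cpus)
{
    return log_to(buf, len, "ready on %d CPUs\n", cpus);
}
```

Calling report_ready() with cpus set to 4 produces the line "demo:report_ready: ready on 4 CPUs", which is easy to filter with grep in the dmesg output.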

Memory Management

Memory allocation is done with the kmalloc() function, similar to malloc().
Some specific characteristics:

  • Fast (except if blocked waiting for pages)

  • Allocated memory is not initialised

  • Allocated memory is contiguous in physical memory

  • Memory is allocated by areas of \(2^n - k\) bytes (\(k\): a few metadata bytes).

    Do not allocate 1024 B if you need 1000 B, you will end up with 2048 B!

Example:

data = kmalloc(sizeof(*data), GFP_KERNEL);


kmalloc GFP flags

The second parameter of kmalloc() is a Get Free Pages (GFP) flag:

  • GFP_KERNEL: Regular kernel allocation.
    Can be blocking. Best choice for most cases.
  • GFP_ATOMIC: Non blocking allocation.
    Use it in contexts that cannot sleep (e.g., interrupt handlers or code holding a spinlock).
  • GFP_USER: Allocate memory for a user space process.
    Can block. Lowest priority.
  • GFP_NOIO: Can block, but no I/O can be executed.
  • GFP_NOFS: Can block, but no file system operation can be executed.
  • GFP_HIGHUSER: Allocate memory in user space high memory (\(\gt 4\) GB). Can block. Low priority.

More combinations available in include/linux/gfp_types.h.

Memory Management (2)

If you need large chunks of memory, you should not use kmalloc(), but request pages directly with one of these functions:


unsigned long get_zeroed_page(gfp_t gfp_mask);

returns the address of a free page after filling it with zeros


unsigned long __get_free_page(gfp_t gfp_mask);

returns the address of a free page


unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);

returns the address of a memory area with \(2^{order}\) contiguous pages
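
Since the order is a power-of-two exponent, a size in bytes has to be rounded up to a number of pages first. The sketch below shows the arithmetic in user space, assuming 4 KiB pages; DEMO_PAGE_SIZE and order_for() are hypothetical names. The kernel provides get_order() for this computation.

```c
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096UL   /* assumption: 4 KiB pages */

/* Smallest order such that 2^order pages hold at least size bytes. */
static unsigned int order_for(size_t size)
{
    unsigned int order = 0;
    size_t pages = (size + DEMO_PAGE_SIZE - 1) / DEMO_PAGE_SIZE;

    while ((1UL << order) < pages)
        order++;
    return order;
}
```

Note the rounding cost: a request for 5 pages needs order 3, i.e., 8 pages, so 3 pages are wasted.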


Virtual allocation

If you don’t need the memory to be contiguous, you can allocate in the virtual address space instead of physical:

void *vmalloc(unsigned long size);
void vfree(void *addr);


Mapping physical to virtual addresses

You can also map a physical memory location into the virtual address space:

void *ioremap(unsigned long phys_addr, unsigned long size);
void iounmap(void *addr);

Waiting for Resources

If you need to wait for a resource (e.g., network packet, message), the interface should implement a wait queue to allow your thread to sleep and be woken up when the resource is available.


wait_event(wait_queue, condition);

the thread sleeps and will be woken up if wake_up() is called on the wait queue and the condition is true


wait_event_interruptible(wait_queue, condition);

same as wait_event(), but the thread can also be woken up by a signal


The resource handler calls wake_up() on the queue to wake up waiting threads.

Workqueues

Workqueues allow you to execute code asynchronously.


At creation time, a pool of threads is initialised.

Jobs can then be submitted in the form of a function pointer and a pointer to an argument.

A thread from the workqueue will, asynchronously, check the queue, pop a job and execute it.


Tip

Documentation available in Documentation/core-api/workqueue.rst.

Generic Data Structures

The kernel also offers generic data structures to work with:

  • Linked lists
  • Maps
  • Circular buffers
  • Red-black trees

Important

Generic data structures in C are not obvious to build…

Generic Data Structures in C

Instead of having objects in a list, we have the list in the objects!


The “naive” version:

struct elem {
    struct object {
        int v0, v1;
    } obj;
    struct elem *next, *prev;
};



Not generic! You need one list type per object type.

The “good” version:

struct object {
    int v0, v1;
    struct list_head {
        struct list_head *next, *prev;
    } list;
};



You only need one list_head type to be defined, and you can reuse it for any object type!

When you iterate over the list, how do you get the containing object?

Container of

From the address of any member in a structure, how can we get the address of the structure?
e.g., from the address of a list_head element in a structure

Linux implements the container_of macro!

/**
 * container_of - cast a member of a structure out to the containing structure
 * @ptr:    the pointer to the member.
 * @type:   the type of the container struct this is embedded in.
 * @member: the name of the member within the struct.
 *
 * WARNING: any const qualifier of @ptr is lost.
 */
#define container_of(ptr, type, member) ({              \
    void *__mptr = (void *)(ptr);                   \
    static_assert(__same_type(*(ptr), ((type *)0)->member) ||   \
              __same_type(*(ptr), void),            \
              "pointer type mismatch in container_of()");   \
    ((type *)(__mptr - offsetof(type, member))); })

After expanding all macros and dropping the type check, this boils down to:

#define offset_of(type, member) \
    ((size_t)&((type *)0)->member)

#define container_of(ptr, type, member) \
    ((type *)((void *)(ptr) - offset_of(type, member)))

  • offset_of: cast the address 0 to type * and take the address of the member, which yields the offset of the member in bytes
  • container_of: subtract this offset from the address of the member
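
The mechanism can be tried in user space with the standard offsetof macro. This is a simplified sketch without the kernel's type check; object_from_list() is a hypothetical helper added for the illustration.

```c
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

struct object {
    int v0, v1;
    struct list_head list;
};

/* Simplified container_of: no type check, char * arithmetic for portability. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* From a pointer to the embedded list node, recover the enclosing object. */
static struct object *object_from_list(struct list_head *node)
{
    return container_of(node, struct object, list);
}
```

Given struct object o, the call object_from_list(&o.list) returns &o.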

Generic Data Structure Helpers

For each generic data structure, the kernel provides helpers to use them.

Let’s see examples for circular doubly linked lists (list_head from include/linux/list.h):

Allocators:

LIST_HEAD(name)


Insert/delete:

static inline void list_add(struct list_head *new, struct list_head *head);
static inline void list_del(struct list_head *entry);


Iterators:

list_for_each(pos, head)
list_for_each_entry(pos, head, member)


And a lot more!


Tip

You can find similar helpers for all generic data structures.
Go check them out in the kernel sources!
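
To see how these helpers fit together, here is a reduced user-space imitation of the list API. It is a sketch, not the kernel code: init_list_head(), struct item and sum_items() are hypothetical, and the loop is written by hand instead of using list_for_each_entry().

```c
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* An empty list is a head that points to itself. */
static void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert new right after head, as the kernel's list_add() does. */
static void list_add(struct list_head *new, struct list_head *head)
{
    new->next = head->next;
    new->prev = head;
    head->next->prev = new;
    head->next = new;
}

static void list_del(struct list_head *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
}

struct item {
    int value;
    struct list_head node;
};

/* Walk the ring until we are back at the head, summing the values. */
static int sum_items(struct list_head *head)
{
    int sum = 0;

    for (struct list_head *pos = head->next; pos != head; pos = pos->next)
        sum += container_of(pos, struct item, node)->value;
    return sum;
}
```

The list code never mentions struct item; only the caller knows, through container_of, which object type embeds the node.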

Concurrency in the Kernel

Resources/objects can be accessed concurrently in the kernel.


There are two reasons this can happen:

  • Preemption: Since version 2.6, the Linux kernel is preemptible.
    This means that kernel code can be interrupted by higher priority code, e.g., device interrupt.
  • Multi-core processors: With multi-core CPUs, two threads can execute kernel code in parallel, and thus access the same kernel data concurrently.


Possible solutions:

  • Mask interrupts
  • Big Kernel Lock
  • Synchronisation primitives: semaphores, spinlocks
  • Atomic operations

Masking Interrupts

Concurrency problems can arise due to preemption both on single- and multi-core systems.

In the case of single-core CPUs, it can be solved solely by disabling interrupts, making the kernel code non-preemptible.


In Linux, you can use the following macros:

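  • local_irq_disable(): this uses the proper assembly instruction to disable interrupts on the current core, e.g., cli on x86.
  • local_irq_enable(): this uses the proper assembly instruction to enable interrupts on the current core, e.g., sti on x86.
  • local_irq_save(flags) and local_irq_restore(flags): save the current interrupt state before disabling interrupts, and restore that state afterwards.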


Example: A driver for a joystick in drivers/input/joystick/analog.c

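static int analog_cooked_read(struct analog_port *port)
{
    unsigned long flags;

    /* some code, including an earlier local_irq_save(flags) */
    local_irq_disable();
    this = gameport_read(gameport) & port->mask;
    now = ktime_get();
    local_irq_restore(flags);   /* back to the interrupt state saved in flags */
    /* some code */
}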

Big Kernel Lock

On multi-core systems, disabling interrupts is not sufficient, as other cores might also access data concurrently.

One potential solution is to serialise all kernel code, allowing only one thread at a time to execute code in supervisor mode.


This was the initial solution used in Linux when support for multi-core CPUs was added.
The Big Kernel Lock (BKL) was taken when entering the kernel and released when exiting.
Only one thread at a time was running kernel code.


Pro: Extremely simple to implement and safe

Con: Large performance degradation due to the loss of parallelism for kernel code


Linux and the BKL

Linux had a Big Kernel Lock from the introduction of Symmetric Multi-Processor (SMP) support in version 2.0 in 1996 until its removal in 2.6.39 in 2011.

Synchronisation Primitives

Since the BKL removal, fine-grained synchronisation mechanisms are used in the kernel.
A non-exhaustive list of synchronisation mechanisms and their (partial) API:

  • Mutexes
void mutex_lock(struct mutex *lock);
int mutex_trylock(struct mutex *lock);
void mutex_unlock(struct mutex *lock);
  • Semaphores
void down(struct semaphore *sem);
void up(struct semaphore *sem);
  • Spinlocks
void spin_lock(spinlock_t *lock);
void spin_unlock(spinlock_t *lock);
  • Readers/writer locks
void read_lock(rwlock_t *lock);
void write_lock(rwlock_t *lock);
void read_unlock(rwlock_t *lock);
void write_unlock(rwlock_t *lock);

Note

Most of these have variations that also disable interrupts when taking a lock.

Atomic Operations

Concurrent access problems can also be solved by using atomic operations in some cases.

Atomic operations are architecture-specific, and are defined in include/linux/atomic/atomic-instrumented.h


These operations should be used on a specific type, atomic_t, to represent the atomic variable.

You can find the usual atomic operations, for example:

  • void atomic_add(int i, atomic_t *v);
  • void atomic_dec(atomic_t *v);
  • void atomic_or(int i, atomic_t *v);
  • And many others…

Coding Style

The Linux Kernel Coding Style

Defined in Documentation/process/coding-style.rst.


It defines a set of rules that will be enforced when a patch is submitted:

  • Indentation
  • Line length
  • Spaces and braces
  • Error management
  • etc.


Tip

You can check if your patches are valid with regard to the coding style with the scripts/checkpatch.pl script!

Some Coding Style Rules

Indentation

Indentation is done with tabs, not spaces.
Tabs are 8 characters long.


Line length

For better readability, the preferred limit on the length of a line is 80 characters.
However, never break user-visible strings, as it also breaks the ability to grep them.

Since 2020, checkpatch.pl only complains about lines longer than 100 characters.


Too restrictive?

From the coding style documentation:

Now, some people will claim that having 8-character indentations makes the code
move too far to the right, and makes it hard to read on a 80-character terminal
screen. The answer to that is that if you need more than 3 levels of
indentation, you're screwed anyway, and should fix your program.

Some Coding Style Rules (2)

Braces

  • Opening braces are on the same line as the block they open, except for functions where they are on the next line
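  • Closing braces are alone on their own line, unless they are followed by the continuation of the same statement (e.g., else, or the while of a do-while loop)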
  • Don’t use braces to surround single-line statements…
  • … except if another branch of a conditional has multiple statements

for (int i = 0; i < 10; i++) {
    printk("%d\n", i);
}

int inc(int x)
{
    return ++x;
}

if (!x)
    x++;

if (x > y)
    y++;
else
    x++;

if (x > y) {
    y++;
} else {
    x++;
    y--;
}

Some Coding Style Rules (3)

Spaces

Philosophy of function-versus-keyword usage:

  • No spaces after functions

  • Spaces after keywords (except if they are used like functions, e.g., sizeof)

    if, switch, case, for, do, while

For operators:

  • Spaces on both sides of binary and ternary operators

    = + - < > * / % | & ^ <= >= == != ? :

  • No space after unary operators or before/after postfix/prefix increment and decrement

    & * + - ~ ! ++ --

  • No space around structure member operators

    . ->

  • No trailing spaces at the end of lines

Chapter 3: Implementing Kernel Modules

A Quick Tour of the Kernel Configuration and Build System

Getting the Sources

From the official kernel website, kernel.org, download the tarball archive.

Or from the command line, for example:

$ wget https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-6.5.7.tar.xz

You can (and should) also check the integrity of the tarball with the PGP signature:

$ wget https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-6.5.7.tar.sign
$ unxz linux-6.5.7.tar.xz # the signature is done on the decompressed tarball
$ gpg --verify linux-6.5.7.tar.sign linux-6.5.7.tar

This will probably fail because you don’t have the public keys of the maintainers that generated the tarball.

Get them from the kernel’s key server (documentation):

$ gpg2 --locate-keys torvalds@kernel.org gregkh@kernel.org

You can also clone Linus Torvalds’ git tree:

$ git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/

Configuring the Kernel

The kernel configuration describes which features are enabled in the built binary and how they behave.
It also describes whether each feature is built into the binary or compiled as a module.

By default, the Makefile-based build system uses the file .config located at the root of the kernel sources.


You can generate initial configurations with the following commands (non-exhaustive list):

  • make allnoconfig : minimal, everything that can be disabled is disabled
  • make defconfig : default configuration for the local architecture
  • make localmodconfig : configuration based on the current state of the machine (plugged devices, etc.) and builds them as modules
  • make localyesconfig : same but everything is built-in
  • make oldconfig : keeps the values of the current .config and asks for the new options

Or you can copy the configuration of your running kernel:

$ cp /boot/config-$(uname -r)* .config # available on some distros
$ zcat /proc/config.gz > .config # available if CONFIG_IKCONFIG_PROC is enabled

If you need to know more about your hardware to generate your config, check out these commands:

  • lshwd, lscpu, lspci, lsusb, …
  • cat /proc/cpuinfo, cat /proc/meminfo, …
  • dmidecode, hdparm
  • dmesg

Building the Kernel

The kernel build system is based on Makefiles.

Just run make to compile it.

$ time make
real    80m15.486s
user    74m54.606s
sys     5m32.300s

Compilation can take a long time, so do it in parallel!

$ make -j $(nproc)

The compilation produces the following important files:

  • vmlinux: the raw Linux kernel image. This ELF is used for debugging and profiling;
  • System.map: symbol table of the kernel. Not necessary at run time, used for debugging;
  • arch/<arch>/boot/bzImage: compressed image of the kernel. This is the one that will be loaded and used.

$ du -sh vmlinux arch/x86/boot/bzImage
49M     vmlinux
13M     arch/x86/boot/bzImage

You can get some info on the image with the file command:

$ file arch/x86/boot/bzImage
arch/x86/boot/bzImage: Linux kernel x86 boot executable bzImage, version 6.5.7-lkp (redha@wano) #2 SMP PREEMPT_DYNAMIC Thu Oct 19 16:05:37 CEST 2023, RO-rootFS, swap_dev 0XC, Normal VGA

Installing a Kernel

Two main steps:

1. Install the kernel image, the symbol map and the initrd

$ make install

This will copy the image and symbol map in /boot, and generate the initramfs.


2. Install the modules

$ make modules_install

This will copy the modules (.ko files) into /lib/modules/<version>.

About the Symbol Map

The symbol map (System.map) provides the list of the symbols available in this kernel, their address and type.

$ head System.map
0000000000000000 D __per_cpu_start
0000000000000000 D fixed_percpu_data
0000000000001000 D cpu_debug_store
0000000000002000 D irq_stack_backing_store
0000000000006000 D cpu_tss_rw
000000000000b000 D gdt_page
000000000000c000 d exception_stacks
0000000000014000 d entry_stack_storage
0000000000015000 D espfix_waddr
0000000000015008 D espfix_stack


Check the manpage of the nm program for an explanation of the types.

As a rule of thumb (mostly true), lowercase means local scope while uppercase means global scope (i.e., exported symbol).

Linux Init Process

Back to the Lab

In Lab 2, task 3, you were asked to replace the init binary by a hello_world program, which led to a kernel panic. Why?


Roles of init

  • Initialise the system: start daemons/services, manage user sessions, mount partitions, etc.
  • Ancestor of all the processes on the system
  • Adopt all orphaned processes


Characteristics of init

  • Has PID 1
  • Cannot die


Demo time!

Linux Kernel Modules

Development Infrastructure

Multiple development methods:

  • Local setup
    Use your usual development software, compile and run your new modules/kernel on your machine.
    Pros: easy, quick
    Cons: if a crash occurs, you can do nothing
  • Remote machine
    If you have access to a separate testing machine, you can do your development on your machine and test remotely to avoid the crash issues.
    This machine is usually hooked through network/serial to the development machine to allow remote debugging and monitoring.
    Pros: good development setup, robust to crashes
    Cons: not always possible to have a second machine
  • Virtual machine
    Develop on your local setup and deploy on a virtual machine.
    This replaces the previous method well, while being faster to use, and doesn’t require a second machine.
    Pros: good development setup, robust to crashes, single machine
    Cons: doesn’t always perfectly capture real hardware, might be slow depending on the host and guest machines


In this course, we will use the last method with QEMU as a hypervisor.

Kernel Modules Interface

A module is a library dynamically loaded into the kernel. It triggers a call to a registered function when loaded and when unloaded.

The kernel provides two macros to register these functions: module_init() and module_exit().


static int my_init(void)
{
      /* ... */
      return 0;
}
module_init(my_init);

static void my_exit(void)
{
      /* ... */
}
module_exit(my_exit);


For these to work, you will need some header files included:

// contains the module API
#include <linux/module.h>

// contains the init and exit macros
#include <linux/init.h>

// if needed: base types, functions, macros...
#include <linux/kernel.h>

Module Information

You should also add some information about your module with some pre-defined macros, usually at the beginning of the file:

MODULE_DESCRIPTION("Hello world module");
MODULE_AUTHOR("Redha Gouicem, RWTH");
MODULE_LICENSE("GPL");


These can be checked on any module:

$ modinfo hello.ko
filename:       hello.ko
description:    Hello World module
author:         Redha Gouicem, RWTH
license:        GPL
vermagic:       6.5.7-ARCH 686 gcc-13.2.1
depends:


Warning

The license is not only informative. It is also used to check if you are allowed to use some symbols in the kernel.

Example: Hello World

#include <linux/module.h>
#include <linux/init.h>
#include <linux/kernel.h>

MODULE_DESCRIPTION("Hello world module");
MODULE_AUTHOR("Redha Gouicem, RWTH");
MODULE_LICENSE("GPL");

static int __init hello_init(void)
{
      pr_info("Hello World!\n");

      return 0;
}
module_init(hello_init);

static void __exit hello_exit(void)
{
      pr_info("Goodbye World...\n");
}
module_exit(hello_exit);

Annotations

The __init and __exit annotations help the kernel reduce its memory usage.
When a module is statically built into the kernel binary, functions tagged with these annotations are placed in specific sections:

  • .init.text that is freed after the boot of the kernel
  • .exit.text that is never loaded in memory

Building a Module

The running kernel is deployed with a generic Makefile located in /lib/modules/$(uname -r)/build.

You can use it from anywhere like this:

$ make -C /lib/modules/$(uname -r)/build M=$PWD

This will generate your module as a .ko file (kernel object).


You can also use a custom Makefile like this one as a wrapper:

ifneq ($(KERNELRELEASE),)

  obj-m += hello.o

else

  KERNELDIR_LKP ?= /lib/modules/$(shell uname -r)/build
  PWD := $(shell pwd)

all:
	$(MAKE) -C $(KERNELDIR_LKP) M=$(PWD) modules

clean:
	$(MAKE) -C $(KERNELDIR_LKP) M=$(PWD) clean

endif

Loading/Unloading a Module

Loading a module can be done with insmod:

$ insmod hello.ko
$ dmesg
[177814.017370] Hello World!


Unloading a module can be done with rmmod:

$ rmmod hello
$ dmesg
[177919.956567] Goodbye World...

Module Parameters

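#include <linux/init.h>
#include <linux/module.h>
#include <linux/moduleparam.h>

MODULE_LICENSE("GPL");

/* Permissions 0660: also exposed in /sys/module/hello/parameters/month */
static char *month = "January";
module_param(month, charp, 0660);

/* Permissions 0: can only be set at load time, not exposed in the sysfs */
static int day = 1;
module_param(day, int, 0000);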

static int __init hello_init(void)
{
      pr_info("Hello ! We are on %d %s\n", day, month);
      return 0;
}
module_init(hello_init);

static void __exit hello_exit(void)
{
      pr_info("Goodbye, cruel world\n");
}
module_exit(hello_exit);


With default values:

$ insmod hello.ko
$ dmesg
[180525.067016] Hello ! We are on 1 January

With parameters:

$ insmod hello.ko month=December day=31
$ dmesg
[181086.216097] Hello ! We are on 31 December

Kernel Dynamic Linker

Like shared libraries, modules are dynamically loaded: they only have access to symbols explicitly exported to them!
By default, they have access to absolutely no variable or function from the kernel, even if they are not static!

Two macros explicitly export symbols to modules:

  • EXPORT_SYMBOL(s) makes the symbol s visible to all loaded modules
  • EXPORT_SYMBOL_GPL(s) makes the symbol s visible to all modules with a license compatible with GPL (according to their MODULE_LICENSE)

Example: using the pm_power_off() function exported in arch/x86/kernel/reboot.c and available on my system:

$ grep pm_power_off /lib/modules/$(uname -r)/build/System.map
ffffffff810ed2f0 t legacy_pm_power_off
ffffffff8274d7d8 r __ksymtab_pm_power_off
ffffffff838a47f8 B pm_power_off

#include <linux/module.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/pm.h>      /* declares the pm_power_off function pointer */

MODULE_DESCRIPTION("Power off module");
MODULE_LICENSE("GPL");

static int __init devil_init(void)
{
      pr_info("The end is nigh...\n");
      if (pm_power_off)
            pm_power_off();

      return 0;
}
module_init(devil_init);

Module Dependencies

If a module X uses at least one symbol from module Y, then X depends on Y.

Dependencies are not explicitly defined: they are automatically inferred during the kernel/module compilation.
You can find the list of dependencies in the file /lib/modules/<version>/modules.dep.

This file is generated by the depmod program, which checks which symbols are used by each module, and which modules provide these symbols.

You can also check the dependencies of a module with modinfo.


Automated dependency solving

Obviously, modules must be inserted in the proper order: if X depends on Y, Y needs to be inserted before X.

If you are using modprobe, it will automatically insert dependencies first.

This is also true for unloading modules (in the reverse order).

Contributing to the Kernel

Patch or Module?

When developing something in the kernel, the first design choice is “how?”


You have two choices:

  • Implement your code in the kernel through a patch
    Your code is then statically built into the kernel binary
  • Implement your code in a module
    Your code can be dynamically loaded by the kernel at run time


Whenever possible, modules are the best choice, as they have a better chance of being merged into the mainline.

Modules: Pros and Cons

While using modules should be your first choice, it also has some drawbacks depending on what you are doing.


Pros:

  • Easier to develop
  • Easier to distribute
  • Avoids overloading the kernel
  • Lower chances of conflicts


Cons:

  • Internal kernel structures cannot be modified
    e.g., adding a field to the file descriptor structure
  • Existing kernel functions cannot be replaced or changed
    e.g., changing the page frame allocator code

Patching the Kernel

If you need to modify the kernel and distribute your changes, you most likely will use patches.

A patch is the output of the diff command applied to the original files and your modified versions.
It contains all the data the patch command needs to apply the changes automatically.


diff: Compares files line by line

  • Shows the added/modified/removed lines
  • Can ignore tabs/spaces
  • Can compare whole directory trees (-r)


patch: Apply changes to existing files

  • Can apply a patch on a file passed as an argument or stdin
  • Can apply a patch on a directory tree

Creating and Applying a Patch

Creating a patch for the kernel tree:

  • -r to recursively compare the two directory trees
  • -u to use the unified diff format (more compact and easier to read)

$ tar xf linux-6.5.7.tar.xz
$ cp -r linux-6.5.7 linux-6.5.7-orig
$ cd linux-6.5.7
$ emacs kernel/sched/fair.c
$ emacs kernel/sched/sched.h
$ cd ..
$ diff -r -u linux-6.5.7-orig linux-6.5.7 > new_sched.patch
$ xz new_sched.patch


Applying a patch on the kernel tree:

  • -p 1 to strip the first directory level from all paths
  • --dry-run to only simulate the patch (for testing purposes)

$ tar xf linux-6.5.7.tar.xz
$ cd linux-6.5.7
$ xzcat ../new_sched.patch.xz | patch -p 1 --dry-run

Submitting Your Work to the Kernel Community

When you think your code is ready for review by the kernel maintainers, you need to send it to them!

Note: This is just an overview of the process!


Ready your code for public eyes

  • Test your code first to avoid silly bugs (and being made fun of publicly)
  • Make sure that your code is compliant with the kernel coding style and is understandable (comments?)
  • Use tools to help you, e.g., scripts/checkpatch.pl, clang-format


Prepare your patch(es)

  • Choose against which version of the kernel to generate your patches, usually the current mainline from Linus’ git tree (a -stable or -rc release)
  • Split your submission into a set of patches/commits, each being logically independent, and able to be built and run
  • You can manually create the patches with diff or use git format-patch to automatically generate patches from your commits (assuming you used git in the first place)

Submitting Your Work to the Kernel Community (2)

Format your patch series for emailing

  • Each patch email should have a one-line short description, followed by a blank line and a multi-line long description, then a list of tag lines specifying who co-authored or reviewed the patch, etc. Finally, the patch itself should be appended.
  • All of this can be fairly well automated with git format-patch


Send your patch series to the mailing list

  • Find out which mailing lists and maintainers the emails should be sent to, using the scripts/get_maintainer.pl script on your patch. You should also CC anyone who might need to see this, e.g., people working on something similar
  • You might need to write a first summary email (a cover letter) for your patch series
  • This can be fairly well automated with git send-email
  • The emails need to be written in plain text, no HTML!


Don’t rely on these slides only!

Go check the full version in the kernel documentation, starting with the kernel development process and patch submission process.

You can check the mailing list online at https://lore.kernel.org.

Chapter 4: User-Kernel Communication

Overview

In monolithic architectures, kernel and user programs run at different privilege levels:
supervisor mode and user mode.

There needs to be communication between the user and the kernel.


Multiple mechanisms are available in the Linux kernel, and choosing the right one is a common discussion on the mailing list.


Communication mechanisms:

  • Module parameters
  • Pseudo file systems
  • Sockets
  • System Calls
  • ioctl

Pseudo File Systems

In-Memory File Systems

An in-memory file system, or ramfs, is a file system not backed by a storage device.

The data stored on it is completely in memory or computed when accessed.


The Linux kernel provides a set of pseudo file systems that are ramfs representing kernel data or configuration.

They usually have a semantic where one file represents one value.

Each pseudo file system has slightly different semantics and answers different needs: procfs, sysfs, configfs, debugfs, …


Advantage: User space programs can access these files with the standard POSIX file API: read and write.

You can thus use regular shell programs such as cat and echo to read/write from/to them.


Drawback: These mechanisms are synchronous from user to kernel, but asynchronous in the other direction,

i.e., user space applications cannot be notified when the value represented by a file changes in memory.

Using Pseudo File Systems

A pseudo file system is a component of the kernel that has to be enabled and mounted before being used.

1. Check that your kernel has been built with the needed pseudo file system

$ zcat /proc/config.gz | grep CONFIG_DEBUG_FS   # you can also grep in your .config directly
CONFIG_DEBUG_FS=y

2. Mount the pseudo file system

$ mount -t debugfs none /sys/kernel/debug

3. Read a value by reading from a file

$ cat /sys/kernel/debug/sched/wakeup_granularity_ns
4000000

4. Modify a value by writing to a file

$ echo 3000000 > /sys/kernel/debug/sched/wakeup_granularity_ns

Tip

You might need root privileges to read from or write to these special files.

  • Reading: sudo cat <file>
  • Writing: echo <value> | sudo tee <file>

procfs

The oldest pseudo file system, mounted in /proc.

In lab 2, we saw that the procfs was mounted by the init script.


Goal: Export information about processes.

Since its creation, it has been used for more than that, e.g., exporting kernel data from various subsystems.

Using it is now discouraged for anything unrelated to processes.


Advantage: most widely documented

Drawback: no real structure enforced


The procfs provides two APIs:

  • Legacy procfs API: simple, but each file is limited to one page (PAGE_SIZE, usually 4 KB)
  • seq_file API: more complex, but allows larger data to be exported with a list of buffers

procfs: Example

Let’s see an example procfs file that exports a variable from the kernel in human-readable form in /proc/my_state.


We want to export the value of the system_enabled global variable.

  1. Define a read() function that will be called when our file is read from:
    • Declare a struct proc_ops that will be used for our file (since Linux 5.6, procfs files use struct proc_ops instead of struct file_operations)
    • Implement the read() function following the prototype
  2. Create the file in the procfs and attach it to the struct proc_ops we just created in the init function of your module
  3. Delete the file when the module is removed, since the variable will not be available any more.

#include <linux/fs.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/string.h>

MODULE_LICENSE("GPL");

static int system_enabled;

static ssize_t system_state_read(struct file *file, char __user *buf,
                                 size_t count, loff_t *ppos)
{
    const char *tmp = system_enabled ? "The system is enabled\n"
                                     : "The system is disabled\n";

    /* Copies at most count bytes to user space and advances *ppos */
    return simple_read_from_buffer(buf, count, ppos, tmp, strlen(tmp));
}

static const struct proc_ops system_state_ops = {
    .proc_open  = simple_open,
    .proc_read  = system_state_read,
    .proc_lseek = noop_llseek,
};
static struct proc_dir_entry *system_state_proc_dir;

static int __init system_state_init(void)
{
    /* Mode 0 selects the default permissions, NULL parent means /proc itself */
    system_state_proc_dir = proc_create("my_state", 0, NULL,
                                        &system_state_ops);
    if (!system_state_proc_dir)
        return -ENOMEM;

    return 0;
}
module_init(system_state_init);

static void __exit system_state_exit(void)
{
    remove_proc_entry("my_state", NULL);
}
module_exit(system_state_exit);

sysfs

Successor to the procfs, mounted in /sys.


Goal: Store information about subsystems, hardware devices, drivers, …

This should be the default choice!


Advantages:

  • Hierarchical topology, information is arranged logically, by subsystem, component, etc…
  • Provides a set of mechanisms to free memory and recursively destroy directories

Drawbacks:

  • More complex than procfs
  • Each file represents one single piece of data
  • A piece of data cannot be larger than one page (PAGE_SIZE)

sysfs: Directories

The struct kobject is at the heart of the sysfs:

  • Each directory corresponds to a kobject
  • A file in the sysfs is not a kobject

Most important fields:

struct kobject {
    const char              *name;      // name of the directory
    struct kobject          *parent;    // kobject of the parent directory
    struct kset             *kset;      // the collection of kobjects this object belongs to
    struct kref              kref;      // reference counter, used to free the memory properly
    const struct kobj_type  *ktype;     // type of the object, with functions pointers to manipulate it
    /* ... */
};


To create a kobject and add it to the sysfs, you can use this function:

struct kobject *kobject_create_and_add(const char *name, struct kobject *parent);

Caution

This works for simple cases (most likely what you want to do). For more complex scenarios, there are other functions to initialize, create and register kobjects. Check out the documentation for more information!

Don’t forget to clean up your kobjects and their files when they are not needed anymore with:

void kobject_put(struct kobject *kobj);

sysfs: Files

Kobject attributes

Each file in the sysfs corresponds to one single value and is associated with an instance of a struct kobj_attribute.

struct kobj_attribute {
    struct attribute attr;  // file information (name, permissions)
    ssize_t (*show)(struct kobject *kobj, struct kobj_attribute *attr, char *buf);
    ssize_t (*store)(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count);
};

Kobject attributes can be created with the following macro:

#define __ATTR(_name, _mode, _show, _store)

You can also create a group of attributes to have multiple files in the same directory:

static struct attribute *attrs[] = {
    &foo_attribute.attr,
    &bar_attribute.attr,
    NULL,
};

static struct attribute_group attr_grp = {
    .attrs = attrs,
};


Creating files in the sysfs

Files can be created with the following functions:

int sysfs_create_file(struct kobject *kobj, const struct attribute *attr);
int sysfs_create_group(struct kobject *kobj, const struct attribute_group *grp);

sysfs: Example 1, a Read-only Variable

We want to export the state of our system represented by an int system_enabled global variable in a human-readable form in the file located at
/sys/kernel/my_state/system_enabled.


static int system_enabled;

static ssize_t system_state_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
{
    /* sysfs_emit() knows that buf is one page long and never overflows it */
    return sysfs_emit(buf, "The system is %srunning\n", system_enabled ? "" : "not ");
}

static struct kobj_attribute system_state_attribute = __ATTR(system_enabled, 0400, system_state_show, NULL);
static struct kobject *my_state_kobj;

sysfs: Example 1, a Read-only Variable (2)

Now, we need to instantiate the sysfs file when loading our module and destroy it when unloading.


static int __init my_state_init(void)
{
    int retval;

    my_state_kobj = kobject_create_and_add("my_state", kernel_kobj);
    if (!my_state_kobj)
        return -ENOMEM;

    retval = sysfs_create_file(my_state_kobj, &system_state_attribute.attr);
    if (retval)
        goto error_put;

    return 0;

error_put:
    kobject_put(my_state_kobj);     /* drops the reference and removes the directory */
    return retval;                  /* propagate the real error code */
}
module_init(my_state_init);

static void __exit my_state_exit(void)
{
    kobject_put(my_state_kobj);
}
module_exit(my_state_exit);

sysfs: Example 2, Let’s Add an RW File

static int system_enabled;
static u64 clock;

static ssize_t system_state_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
{
    return sysfs_emit(buf, "The system is %srunning\n", system_enabled ? "" : "not ");
}

static ssize_t clock_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
{
    /* Each read returns the current value, then increments it */
    return sysfs_emit(buf, "%llu\n", clock++);
}

static ssize_t clock_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count)
{
    u64 val;
    int rc = kstrtou64(buf, 10, &val);  /* strict parsing, a trailing newline is accepted */

    if (rc)
        return rc;

    clock = val;
    return count;                       /* the whole input was consumed */
}

static struct kobj_attribute system_state_attribute = __ATTR(system_enabled, 0400, system_state_show, NULL);
static struct kobj_attribute clock_attribute = __ATTR(clock, 0600, clock_show, clock_store);

static struct attribute *attrs[] = {
    &system_state_attribute.attr,
    &clock_attribute.attr,
    NULL,
};
static struct attribute_group attr_grp = { .attrs = attrs };

sysfs: Example 2, Let’s Add an RW File (2)

static struct kobject *my_state_kobj;

static int __init my_state_init(void)
{
    int retval;

    my_state_kobj = kobject_create_and_add("my_state", kernel_kobj);
    if (!my_state_kobj)
        return -ENOMEM;

    retval = sysfs_create_group(my_state_kobj, &attr_grp);
    if (retval)
        goto error_put;

    return 0;

error_put:
    kobject_put(my_state_kobj);     /* drops the reference and removes the directory */
    return retval;                  /* propagate the real error code */
}
module_init(my_state_init);

static void __exit my_state_exit(void)
{
    kobject_put(my_state_kobj);
}
module_exit(my_state_exit);

configfs

configfs is another ram-based file system offering the converse functionality to sysfs, mounted in /sys/kernel/config.

While the sysfs is a view of kernel objects, configfs is a manager of kernel objects,
i.e., it allows user space to create and destroy kernel objects.


Example: Allowing user programs to create and configure virtual network devices by doing an mkdir in the configfs directory of a driver.


Advantage:

  • Allows user space programs to manage kernel objects

Drawbacks:

  • Complex to set up
  • One file equals one value

debugfs

The debugfs offers a very flexible and simple API targeted at simplifying kernel development and debugging.

It is mounted in /sys/kernel/debug.


Advantages:

  • Very flexible, no size limit for files
  • Very high level and simple API

Drawback:

  • Only for debugging purposes!


Quick non-exhaustive API tour:

struct dentry *debugfs_create_dir(const char *name, struct dentry *parent);
void debugfs_create_u32(const char *name, umode_t mode, struct dentry *parent, u32 *value);
void debugfs_create_str(const char *name, umode_t mode, struct dentry *parent, char **value);

debugfs: Example

static u8 system_state;

static struct dentry *my_state_dir;

static int __init my_state_init(void)
{
    /* NULL parent: the directory is created at the root of the debugfs */
    my_state_dir = debugfs_create_dir("my_state", NULL);
    if (IS_ERR(my_state_dir))
        return PTR_ERR(my_state_dir);

    /* debugfs_create_u8() returns void: there is no error to check */
    debugfs_create_u8("system_state", 0444, my_state_dir, &system_state);

    return 0;
}
module_init(my_state_init);

static void __exit my_state_exit(void)
{
    debugfs_remove_recursive(my_state_dir);
}
module_exit(my_state_exit);

Synchronous Communication Mechanisms

Overview

Pseudo file system-based mechanisms are:

  • Synchronous from user to kernel:
    When a user writes a value, it is changed in the kernel memory when the user program returns to user space;
  • Asynchronous from kernel to user:
    When a value in kernel memory changes, the user space is not notified of this change until the next read.


The kernel provides various synchronous communication mechanisms:

  • System calls
  • ioctls
  • Sockets

System Calls

System calls are the most classic user-kernel communication mechanism.

They allow user space applications to execute privileged kernel code:

  • I/Os (disk, network)
  • Resource management (threads, memory)
  • Communication (signals, IPCs)
  • Access to specific hardware


They are the core API of the kernel, with some limitations:

  • Each system call is identified by a number defined at kernel compile time. You cannot add new ones dynamically.
  • They are a “universal” API, so they need to be the same on all systems.


There are two ways of making system calls:

  • The “old” interrupt way (x86)
  • The “new” syscall instruction way (x86, amd64, ARM64)


Source: Wikimedia

System Calls: The Interrupt Way

With this method, a system call is a software interrupt like any other:


1. Place the system call number in a register

2. Place the arguments in the proper registers and/or on the stack

3. Trigger the “system call” interrupt (switching to supervisor mode)

4. Jump to the “system call” interrupt handler

5. Load the system call table and jump to the index given by the syscall number

6. Execute the system call handler

7. Return the result to user space (switching back to user mode)

System Calls: The Instruction Way

Some architectures provide a specific instruction (syscall, svc, sysenter, …):


1. Place the system call number in a register

2. Place the arguments in the proper registers and/or on the stack

3. Use the system call instruction (switching to supervisor mode)

4. Jump to the index given by the syscall number in the syscall table

5. Execute the system call handler

6. Return the result to user space (switching back to user mode)


One level of indirection is bypassed!




Tip

You can find a very detailed explanation (with fewer omissions) of how system calls work in Linux here!

System Calls: The Linux Application Binary Interface

API: Application Programming Interface
High-level interface for programmers (function prototypes, data types, …)

ABI: Application Binary Interface
Low-level interface for compilers/OS (calling conventions, architecture-specific)


Architecture  syscall#  retval  arg1  arg2  arg3  arg4  arg5  arg6  arg7
Arm EABI      r7        r0      r0    r1    r2    r3    r4    r5    r6
arm64         w8        x0      x0    x1    x2    x3    x4    x5    -
mips          v0        v0      a0    a1    a2    a3    a4    a5    -
riscv         a7        a0      a0    a1    a2    a3    a4    a5    -
x86-64        rax       rax     rdi   rsi   rdx   r10   r8    r9    -


How do we add a system call to Linux?

We’ll see that a bit later, when you need to do it for an exercise.

ioctl

ioctls are a way to provide custom system calls, mainly for device drivers.

They are called from user space by the ioctl system call and provide an additional level of indirection, with each ioctl having a number.


Advantages:

  • Easy to extend
  • Provide a syscall-like interface, i.e., an arbitrary function call

Drawback:

  • Numbers have to stay “forever”, as changing them might break existing user space applications


An ioctl is tied to a file. It is registered as a file operation similar to read or write in the kernel.

For a given device, each ioctl has a number generated through a set of macros.

How can we use ioctls?

We won’t see the API in detail here; you will have a task solely on that in the next lab.

Sockets

The kernel also provides a socket-based interface with user space called netlink.

It is similar to regular sockets in user space, but with the AF_NETLINK socket family.


Advantage:

  • Symmetrical communication

Drawback:

  • Implementation is a bit more complex than simple read/write semantics