2024秋冬开源操作系统训练营第三阶段总结-杨学聪

After the 3rd stage of the 2024 autumn winter open source operating system training camp, I would like to share my experience and summary with you.

There are three system structures to compare:

Unikernel
- Single-Privilege Physical-Address-Space Combination-of-Kernel-and-UserApps
Macrokernel Mode
- Multi-Privilege Paging-Address-Space Isolation-between-Kernel-and-UserApps
Hypervisor (Virtual) Mode
- Isolation-between-Host-and-Guest

The critical key to the Component-based design is the Feature. This can be configured in the Cargo.toml file and in the Makefile environment variable, which can decide which content to compile and link. It feels like a advanced #if - #endif switch option.

Unikernel

We use 3 stages to express the system advancing from bare-metal program to a component-based system.

Bare-metal program
- Hardfirm + Bootloader
- Initialize the special registers
- Initialize MMU (for Paging)
- Initialize Stack
- Initialize Interrupt Vector
- Initialize Peripherals / Devices
- Transfer to main program
Layer / Hierarchy
- Hardfirm + Bootloader (Layer)
- Hardware Initialization(Layer): the special registers 、MMU(for Paging)、STACK、Interrupt Vector
- Peripherals (Layer)
- Transfer to main program (Layer)
Component-based
- Hardfirm + Bootloader
- Hardware Initialization (HAL)
- Runtime Environment (RTE)
- Transfer to main program

void Reset_Handler(void) {
  __asm__ volatile(
  ".code 32                                         \n"
  "CPSID   if                                      \n"// Mask interrupts
  /* Put any cores other than 0 to sleep */
  "MRC     p15, 0, R0, c0, c0, 5                   \n"  /* Read MPIDR */
  "ANDS    R0, R0, #3                              \n"
  "goToSleep:                                      \n"
  "ITT  NE                                         \n"  /* Needed when in Thumb mode for following WFINE instruction */
  "WFINE                                           \n"
  "BNE     goToSleep                               \n"
  /* Reset SCTLR Settings */
  "MRC     p15, 0, R0, c1, c0, 0                   \n"  /* Read CP15 System Control register */
  "BIC     R0, R0, #(0x1 << 12)                    \n"  /* Clear I bit 12 to disable I Cache */
  "BIC     R0, R0, #(0x1 <<  2)                    \n"  /* Clear C bit  2 to disable D Cache */
  "BIC     R0, R0, #0x1                            \n"  /* Clear M bit  0 to disable MMU */
  "BIC     R0, R0, #(0x1 << 11)                    \n"  /* Clear Z bit 11 to disable branch prediction */
  "BIC     R0, R0, #(0x1 << 13)                    \n"  /* Clear V bit 13 to disable hivecs */
  "BIC     R0, R0, #(0x1 << 29)                    \n"  /* Clear AFE bit 29 to enable the full range of access permissions */
  "ORR     R0, R0, #(0x1 << 30)                    \n"  /* Set TE bit to take exceptions in Thumb mode */
  "MCR     p15, 0, R0, c1, c0, 0                   \n"  /* Write value back to CP15 System Control register */
  "ISB                                             \n"
  /* Configure ACTLR */
  "MRC     p15, 0, r0, c1, c0, 1                   \n"  /* Read CP15 Auxiliary Control Register */
  "ORR     r0, r0, #(1 <<  1)                      \n"  /* Enable L2 prefetch hint (UNK/WI since r4p1) */
  "MCR     p15, 0, r0, c1, c0, 1                   \n"  /* Write CP15 Auxiliary Control Register */
  /* Set Vector Base Address Register (VBAR) to point to this application's vector table */
  "LDR    R0, =Vectors                             \n"
  "MCR    p15, 0, R0, c12, c0, 0                   \n"
  "ISB                                             \n"
  ...
  "CPSIE  if                                       \n"// Unmask interrupts
  "BL __libc_init_array \n"
  "BL     main                                     \n"
  ...)
}

The above code selected from the HAL Code of STM32MP13, as the initialization and reset handler code, which will help initialize STACK, Interrupt-Vector and critical registers, and end up transferring to main program.
The code is followed similar logic as the rCore, and can help us have a better understanding of the MCU, MPU and CPU.

Like rCore, We need to provide implementation for the memory interfaces about heap operations, to avoid memory leaks and memory fragmentation.
This can help us manage memory more efficiently, and provide convenience for future expansion.

There are 2 kinds of Memory Allocation Functions: One is based on Page (palloc), and the other is based on Byte (balloc), where the “ByteAlloc” is based on “PageAlloc”.

If we take “PageAlloc” based on “ByteAlloc”, it will be difficult to align.

Algorithm for Memory Allocation

TLSF, Two-Level Segregated Fit
Buddy
Slab
Bump

How to enable paging mechanism:

Early in the kernel startup, use the ruled identity mapping part of memory
- 0xffff_ffc0_8000_0000 ~ 0xffff_ffc0_C000_FFFF $\rightarrow$ 0x8000_0000~0xC000_0000
- Note that some address-related registers, such as SP, need to be changed to linear addresses
Then if paging feature specified, rebuild the complete paging reflect.

Task switching:
Swaps the task currently being executed with a task in the ready queue.
For Single-core CPU, the form of multitasking can only be concurrent, but not parallel.

State of Task

Running
- The number is equal to the number of processor cores
- SWITCH-TO: Ready or Blocked or Exited
Ready
- Is ready to be scheduled at any time
- SWITCH-TO: Running
Blocked
- Waiting for an event or resource satisfying a condition
- SWITCH-TO: Ready
Exited
- The task is finished and waiting to be recycled

Task switching: Firsty, save the context of the current task. Then, restore the context of the new task. Finally, trandfer to switch tasks.
Note that the interrupt enable state of the processor switching the task, should be off during the switching process (CLI).
If neccessary, the Spinlock and Mutex are required to be used to avoid race conditions (SMP?).

There are usual preemption conditions: One is the time slice of the task is exhausted, and the other is the interrupt source, such as the clock (Timer). The privilege level may be used to determine the nested or re-entry of the traps.

Algorithm of Scheduling

Collaborative scheduling algorithm (FIFO, fair)
Preemptive scheduling algorithm (Privileged)
- ROUND_ROBIN
- CFS (Completely Fair Scheduler)

The DEVICEs are usually in the kinds of net, block, display and so on.

How to discover and initialize devices

(axruntime at startup) discovers the device and initializes it with the appropriate driver
axdriver Indicates the process of discovering and initializing devices
- 2-stage cyclic detection discovers the device
  - Level 1: traverses all virtio_mmio address ranges, determined by the platform physical memory layout, and performs transition page mapping
  - Level 2: Enumerate devices with the for_each_drivers macro, and then probe each virtio device probe_mmio
probe discovers devices based on the bus, matches drivers one by one, and initializes them
- Bus connecting with devices
  - PCI
  - MMIO

A File System is a mechanism used in an operating system to manage files and data on computer storage devices such as hard disks, SSDS, flash memory, etc.
(In Linux, every device will also exist as one or more files)

File System

RAMFS: A memory-based virtual file system
- For temporary data storage that is fast but easy to lose.
DEVFS: Device file system
- For managing and accessing hardware devices, simplifying the development and access of device drivers.
ProcFS: process file system
- Provides system process and status information for system monitoring and management.
SysFS: System file system
- Exports kernel objects and properties for viewing and managing hardware devices and drivers.

Macro kernel

More than before:

Privilege Level
Address space

So we need

Map user/kernel space (address space)
- Usual method
  - The high end of the page table is used as kernel space
  - The low end is used as user application space, because user programs mostly start from low addresses
- Kernel space is shared and user space is used independently
Add system call (cross-privilege + address space)

The user applications toggle between two privilege levels:

Task Context : User Level, execute application logic
Trap Context : Kernel Level, handle system calls and exceptions

Address space Area mapping Back-end Backend maps specific areas in a space

Linear
- case The target physical address space area already exists, and the mapping relationship is directly established
- It can be used for MMIO area mapping and special shared address area mapping
- Corresponding physical page frames must be consecutive
Alloc (Lazy)
- case Use missing page exception (亡羊补牢)
- Mapped by page, the corresponding physical page frames are usually discontinuous

To compatible with Linux applications, we should implement the compatible system calls, file systems and other system resources. This asks us to follow the POSIX standard.

POSIX allows developers to write applications using a standard set of apis without worrying about differences in the underlying operating system.

Except Windows, many modern operating systems, such as Linux, macOS (based on Unix), and many embedded systems (RtOS ?) , are partially or fully POSIX compliant. This allows applications developed on these systems to run seamlessly on different platforms.

When loading the ELF-formatted application, the VirtAddr and MemSiz are used to place the segment in the target virtual memory location. Beware that some segments like .bss are all zeros, so the actual data is not stored in the ELF file, but rather dynamically allocated space is requested.

The important support to do is to implemwnt hosted-standing environment main function.
We should provide:

Parameter information (argc and argv): parameter number, parameter string pointer array
Environment information (envv): environment variable string pointer array

Hypervisor

Each virtual machine has its own independent virtual machine hardware, based on the physical computer. Each virtual machine runs its own operating system, and the operating system believes that it is alone in the execution environment, and cannot distinguish whether the environment is a physical machine or a virtual machine. (ideal state)

This is similar to virtualization software, like VMware ?

Is the virtual 8086 mode on x86 the same or similar principle?

The difference between a Hypervisor and an Emulator is whether the architecture of the virtual running environment is the same as that of the physical running environment that supports it.

It Seems to have something to do with emulation KVM acceleration.

Layers of resource objects supported by the Hypervisor:

VM: manages the address space.
vCPU: indicates the virtualization of computing resources and the flow of execution on VMS
vMem: Memory virtualization based on the physical space layout of VMS
vDevice: indicates device virtualization, including direct mapping and emulation
vUtilities: interrupts virtualization and bus device operation

Usually use below items to implement a Hypervisor:

run_guest is responsible for entering Guest environment
guest_exit is responsible for exiting Guest environment