
Chapter 6-2

Introduction

We are going to load our easy-fs into the kernel. First, we need to know the device interface. Second, the filesystem Inode is wrapped by the OS as OSInode for extended functionality. Then we implement sys_read/sys_write on top of it.

MMIO (Memory-Mapped I/O)

The device registers of peripherals are accessed through specific physical memory addresses. The VirtIO MMIO physical address range for the peripheral bus is 4KiB starting at 0x10001000.

// os/src/config.rs

#[cfg(feature = "board_qemu")]
pub const MMIO: &[(usize, usize)] = &[
(0x10001000, 0x1000),
];
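
These MMIO ranges then have to be mapped in the kernel address space so drivers can reach the device registers. A minimal sketch of how MemorySet::new_kernel can identity-map them, reusing the MapArea/MapType machinery from the memory-management chapters:

// os/src/mm/memory_set.rs (sketch)

// identity-map every MMIO range listed in config.rs
for pair in MMIO {
    memory_set.push(
        MapArea::new(
            (*pair).0.into(),               // start physical address
            ((*pair).0 + (*pair).1).into(), // start + length
            MapType::Identical,
            MapPermission::R | MapPermission::W,
        ),
        None,
    );
}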

OS Inode

We only restrict operations to readable and writable, derived from OpenFlags. The per-descriptor offset and the Arc wrapper are a simple way to handle access from multiple processes.

// os/src/fs/inode.rs

bitflags! {
pub struct OpenFlags: u32 {
const RDONLY = 0;
const WRONLY = 1 << 0;
const RDWR = 1 << 1;
const CREATE = 1 << 9;
const TRUNC = 1 << 10;
}
}

impl OpenFlags {
/// Do not check validity for simplicity
/// Return (readable, writable)
pub fn read_write(&self) -> (bool, bool) {
if self.is_empty() {
(true, false)
} else if self.contains(Self::WRONLY) {
(false, true)
} else {
(true, true)
}
}
}

pub struct OSInode {
readable: bool,
writable: bool,
inner: Mutex<OSInodeInner>,
}

pub struct OSInodeInner {
offset: usize,
inode: Arc<Inode>,
}

impl OSInode {
pub fn new(
readable: bool,
writable: bool,
inode: Arc<Inode>,
) -> Self {
Self {
readable,
writable,
inner: Mutex::new(OSInodeInner {
offset: 0,
inode,
}),
}
}
}
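
To show how the offset field is used, here is a sketch of wiring OSInode into the File trait, in the spirit of the tutorial: read and write go through read_at/write_at on the underlying inode and advance the saved offset.

// os/src/fs/inode.rs (sketch)

impl File for OSInode {
    fn readable(&self) -> bool { self.readable }
    fn writable(&self) -> bool { self.writable }
    fn read(&self, mut buf: UserBuffer) -> usize {
        let mut inner = self.inner.lock();
        let mut total_read_size = 0usize;
        for slice in buf.buffers.iter_mut() {
            // read_at starts at the saved offset, so sequential reads continue
            // where the previous one stopped
            let read_size = inner.inode.read_at(inner.offset, *slice);
            if read_size == 0 {
                break;
            }
            inner.offset += read_size;
            total_read_size += read_size;
        }
        total_read_size
    }
    fn write(&self, buf: UserBuffer) -> usize {
        let mut inner = self.inner.lock();
        let mut total_write_size = 0usize;
        for slice in buf.buffers.iter() {
            let write_size = inner.inode.write_at(inner.offset, *slice);
            assert_eq!(write_size, slice.len());
            inner.offset += write_size;
            total_write_size += write_size;
        }
        total_write_size
    }
}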

File Descriptor Table

Now we need to connect file operations with processes: each process needs a descriptor table (which manages many files!) recording the files it has opened.

// os/src/task/task.rs

pub struct TaskControlBlockInner {
// ...
pub fd_table: Vec<Option<Arc<dyn File + Send + Sync>>>,
}
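
A file descriptor is just an index into this table. A small helper that allocates the lowest free slot (a sketch following the tutorial's alloc_fd convention) might look like this:

// os/src/task/task.rs (sketch)

impl TaskControlBlockInner {
    pub fn alloc_fd(&mut self) -> usize {
        // reuse the first empty slot, otherwise grow the table
        if let Some(fd) = (0..self.fd_table.len()).find(|fd| self.fd_table[*fd].is_none()) {
            fd
        } else {
            self.fd_table.push(None);
            self.fd_table.len() - 1
        }
    }
}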

Remember the root inode from the previous section? We load it directly for easy manipulation. The workflow is then: for the current process, open the inode, allocate a file descriptor pointing at it, and return that descriptor.

// os/src/fs/inode.rs

lazy_static! {
pub static ref ROOT_INODE: Arc<Inode> = {
let efs = EasyFileSystem::open(BLOCK_DEVICE.clone());
Arc::new(EasyFileSystem::root_inode(&efs))
};
}

pub fn open_file(name: &str, flags: OpenFlags) -> Option<Arc<OSInode>> {
let (readable, writable) = flags.read_write();
if flags.contains(OpenFlags::CREATE) {
if let Some(inode) = ROOT_INODE.find(name) {
// clear size
inode.clear();
Some(Arc::new(OSInode::new(
readable,
writable,
inode,
)))
} else {
// create file
ROOT_INODE.create(name)
.map(|inode| {
Arc::new(OSInode::new(
readable,
writable,
inode,
))
})
}
} else {
ROOT_INODE.find(name)
.map(|inode| {
if flags.contains(OpenFlags::TRUNC) {
inode.clear();
}
Arc::new(OSInode::new(
readable,
writable,
inode
))
})
}
}
// os/src/syscall/fs.rs

pub fn sys_open(path: *const u8, flags: u32) -> isize {
let task = current_task().unwrap();
let token = current_user_token();
let path = translated_str(token, path);
if let Some(inode) = open_file(
path.as_str(),
OpenFlags::from_bits(flags).unwrap()
) {
let mut inner = task.acquire_inner_lock();
let fd = inner.alloc_fd();
inner.fd_table[fd] = Some(inode);
fd as isize
} else {
-1
}
}

// os/src/syscall/fs.rs

pub fn sys_close(fd: usize) -> isize {
let task = current_task().unwrap();
let mut inner = task.inner_exclusive_access();
if fd >= inner.fd_table.len() {
return -1;
}
if inner.fd_table[fd].is_none() {
return -1;
}
inner.fd_table[fd].take();
0
}

The implementation is similar for sys_read and sys_write.
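
For reference, a sketch of sys_read in the spirit of the tutorial: look the file up in the fd table, check readability, then hand the user buffer to the File::read implementation (sys_write mirrors this with writable/write):

// os/src/syscall/fs.rs (sketch)

pub fn sys_read(fd: usize, buf: *const u8, len: usize) -> isize {
    let token = current_user_token();
    let task = current_task().unwrap();
    let inner = task.inner_exclusive_access();
    if fd >= inner.fd_table.len() {
        return -1;
    }
    if let Some(file) = &inner.fd_table[fd] {
        if !file.readable() {
            return -1;
        }
        let file = file.clone();
        // release the TCB lock before the (possibly blocking) read
        drop(inner);
        file.read(UserBuffer::new(translated_byte_buffer(token, buf, len))) as isize
    } else {
        -1
    }
}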

Chapter 7-1

Introduction

We are going to abstract Stdin and Stdout as files and insert them into the file descriptor table. This lets us support pipes and I/O redirection across processes.

Everything Is a File

The design philosophy of "everything is a file" generalizes everything to a file based on its I/O operations, while omitting the concrete semantics of its contents.

Abstraction of I/O hardware:

  • read-only: e.g. keyboard
  • write-only: e.g. screen
  • read-write: e.g. serial device

Abstraction of I/O operations (based on file descriptors):

  • open: a process opens a file and takes possession of it.
  • close: a process closes a file and releases it.
  • read: read from the file into memory.
  • write: write from memory into the file.

When a process is created, it owns three files as I/O abstractions:

  • 0: Stdin
  • 1: Stdout
  • 2: Stderr(which we will merge with Stdout)
impl TaskControlBlock {
pub fn new(elf_data: &[u8]) -> Self {
...
let task_control_block = Self {
pid: pid_handle,
kernel_stack,
inner: Mutex::new(TaskControlBlockInner {
// ...
fd_table: vec![
// 0 -> stdin
Some(Arc::new(Stdin)),
// 1 -> stdout
Some(Arc::new(Stdout)),
// 2 -> stderr
Some(Arc::new(Stdout)),
],
}),
};
...
}
}
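
For completeness, here is a sketch of how Stdin and Stdout can implement the File trait, along the lines of the tutorial: Stdin pulls one byte at a time from the SBI console and yields while nothing is available, Stdout prints the user buffer.

// os/src/fs/stdio.rs (sketch)

pub struct Stdin;
pub struct Stdout;

impl File for Stdin {
    fn readable(&self) -> bool { true }
    fn writable(&self) -> bool { false }
    fn read(&self, mut user_buf: UserBuffer) -> usize {
        assert_eq!(user_buf.len(), 1);
        // poll the SBI console, yielding while no character is available
        let mut c: usize;
        loop {
            c = console_getchar();
            if c == 0 {
                suspend_current_and_run_next();
            } else {
                break;
            }
        }
        unsafe { user_buf.buffers[0].as_mut_ptr().write_volatile(c as u8); }
        1
    }
    fn write(&self, _user_buf: UserBuffer) -> usize { panic!("Cannot write to stdin!"); }
}

impl File for Stdout {
    fn readable(&self) -> bool { false }
    fn writable(&self) -> bool { true }
    fn read(&self, _user_buf: UserBuffer) -> usize { panic!("Cannot read from stdout!"); }
    fn write(&self, user_buf: UserBuffer) -> usize {
        for buffer in user_buf.buffers.iter() {
            print!("{}", core::str::from_utf8(*buffer).unwrap());
        }
        user_buf.len()
    }
}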

Pipe

In a typical shell, | is the pipe symbol: it takes input from the command on the left and feeds output to the command on the right. Since we abstract everything as a file, just like Stdin and Stdout, a pipe is a file too. It has a read end and a write end: one side (often a child process) writes bytes into the write end, and the other side reads them back out of the read end.

We already use file descriptors to refer to files, so we implement the same operations for pipes.

sys_pipe receives a pointer to an array of length 2 and writes the read-end and write-end descriptors of the pipe into it.

// user/src/syscall.rs

const SYSCALL_PIPE: usize = 59;

pub fn sys_pipe(pipe: &mut [usize]) -> isize {
syscall(SYSCALL_PIPE, [pipe.as_mut_ptr() as usize, 0, 0])
}
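
On the kernel side, a sketch of sys_pipe following the tutorial: build the two ends with make_pipe, allocate two descriptors, and write them back through the user pointer.

// os/src/syscall/fs.rs (sketch)

pub fn sys_pipe(pipe: *mut usize) -> isize {
    let task = current_task().unwrap();
    let token = current_user_token();
    let mut inner = task.inner_exclusive_access();
    let (pipe_read, pipe_write) = make_pipe();
    let read_fd = inner.alloc_fd();
    inner.fd_table[read_fd] = Some(pipe_read);
    let write_fd = inner.alloc_fd();
    inner.fd_table[write_fd] = Some(pipe_write);
    // write the two descriptors back into the user array
    *translated_refmut(token, pipe) = read_fd;
    *translated_refmut(token, unsafe { pipe.add(1) }) = write_fd;
    0
}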

So what is the basic design of a pipe?

It should have a write end and a read end that share the same underlying data, together with bookkeeping for how much has been read and written. We build a ring buffer for this: the pipe owns a buffer that controls reads and writes, recording data between a head index and a tail index. Why can't we just use two separate copies of the data, or a Queue?

Because the ring buffer needs no copying and fits our constraints: we read from the head and move it forward, and push data at the tail, all inside a fixed array instead of allocating for a growing queue.

// os/src/fs/pipe.rs

pub struct Pipe {
readable: bool,
writable: bool,
buffer: Arc<Mutex<PipeRingBuffer>>,
}

const RING_BUFFER_SIZE: usize = 32;

#[derive(Copy, Clone, PartialEq)]
enum RingBufferStatus {
FULL,
EMPTY,
NORMAL,
}

pub struct PipeRingBuffer {
arr: [u8; RING_BUFFER_SIZE],
head: usize, // head index of ring buffer
tail: usize, // tail index of ring buffer
status: RingBufferStatus,
write_end: Option<Weak<Pipe>>,
}

impl PipeRingBuffer {
pub fn set_write_end(&mut self, write_end: &Arc<Pipe>) {
self.write_end = Some(Arc::downgrade(write_end));
}
}

/// Return (read_end, write_end)
pub fn make_pipe() -> (Arc<Pipe>, Arc<Pipe>) {
let buffer = Arc::new(Mutex::new(PipeRingBuffer::new()));
let read_end = Arc::new(
Pipe::read_end_with_buffer(buffer.clone())
);
let write_end = Arc::new(
Pipe::write_end_with_buffer(buffer.clone())
);
buffer.lock().set_write_end(&write_end);
(read_end, write_end)
}

impl PipeRingBuffer {
pub fn read_byte(&mut self) -> u8 {
self.status = RingBufferStatus::NORMAL;
let c = self.arr[self.head];
// move forward
self.head = (self.head + 1) % RING_BUFFER_SIZE;
if self.head == self.tail {
self.status = RingBufferStatus::EMPTY;
}
c
}
pub fn available_read(&self) -> usize {
if self.status == RingBufferStatus::EMPTY {
0
} else {
// data from head to tail!
if self.tail > self.head {
self.tail - self.head
} else {
self.tail + RING_BUFFER_SIZE - self.head
}
}
}
pub fn all_write_ends_closed(&self) -> bool {
self.write_end.as_ref().unwrap().upgrade().is_none()
}
}

A single read may not be able to get everything at once; if nothing is available, the reader suspends and lets other tasks run, retrying until data arrives or all write ends have been closed.

// os/src/fs/pipe.rs

impl File for Pipe {
fn read(&self, buf: UserBuffer) -> usize {
assert!(self.readable());
let want_to_read = buf.len();
let mut buf_iter = buf.into_iter();
let mut already_read = 0usize;
loop {
let mut ring_buffer = self.buffer.lock();
let loop_read = ring_buffer.available_read();
if loop_read == 0 {
if ring_buffer.all_write_ends_closed() {
return already_read;
}
drop(ring_buffer);
suspend_current_and_run_next();
continue;
}
for _ in 0..loop_read {
if let Some(byte_ref) = buf_iter.next() {
unsafe {
*byte_ref = ring_buffer.read_byte();
}
already_read += 1;
if already_read == want_to_read {
return want_to_read;
}
} else {
return already_read;
}
}
}
}
}
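
The write side is symmetric. A sketch along the same lines, assuming available_write and write_byte helpers that mirror the read-side ones:

// os/src/fs/pipe.rs (sketch)
// This method lives in the same `impl File for Pipe` block as `read` above.
fn write(&self, buf: UserBuffer) -> usize {
    assert!(self.writable());
    let want_to_write = buf.len();
    let mut buf_iter = buf.into_iter();
    let mut already_write = 0usize;
    loop {
        let mut ring_buffer = self.buffer.lock();
        let loop_write = ring_buffer.available_write();
        if loop_write == 0 {
            // buffer is full: let the reader run, then retry
            drop(ring_buffer);
            suspend_current_and_run_next();
            continue;
        }
        for _ in 0..loop_write {
            if let Some(byte_ref) = buf_iter.next() {
                ring_buffer.write_byte(unsafe { *byte_ref });
                already_write += 1;
                if already_write == want_to_write {
                    return want_to_write;
                }
            } else {
                return already_write;
            }
        }
    }
}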

Arguments

We will combine our pipe with our shell.

First, parse the arguments and push a null pointer at the end to mark termination.

// user/src/bin/user_shell.rs

let args: Vec<_> = line.as_str().split(' ').collect();
// keep owned, null-terminated copies alive so the raw pointers stay valid
let args_copy: Vec<String> = args
    .iter()
    .map(|&arg| {
        let mut s = arg.to_string();
        s.push('\0');
        s
    })
    .collect();
let mut args_addr: Vec<*const u8> = args_copy
    .iter()
    .map(|arg| arg.as_ptr())
    .collect();
args_addr.push(core::ptr::null());

Now a task accepts a list of arguments rather than a single string, so sys_exec becomes:

// os/src/syscall/process.rs

pub fn sys_exec(path: *const u8, mut args: *const usize) -> isize {
let token = current_user_token();
let path = translated_str(token, path);
let mut args_vec: Vec<String> = Vec::new();
// args would be a ptr of array contains ptr of string.
loop {
let arg_str_ptr = *translated_ref(token, args);
if arg_str_ptr == 0 {
break;
}
args_vec.push(translated_str(token, arg_str_ptr as *const u8));
unsafe { args = args.add(1); }
}
if let Some(app_inode) = open_file(path.as_str(), OpenFlags::RDONLY) {
let all_data = app_inode.read_all();
let task = current_task().unwrap();
let argc = args_vec.len();
task.exec(all_data.as_slice(), args_vec);
// return argc because cx.x[10] will be covered with it later
argc as isize
} else {
-1
}
}

Now we actually use the user stack to store these arguments!

// os/src/task/task.rs

impl TaskControlBlock {
// notice exec will allocate a new memory set!
pub fn exec(&self, elf_data: &[u8], args: Vec<String>) {
// ...
// first allocate memory for ptr of strings.
user_sp -= (args.len() + 1) * core::mem::size_of::<usize>();
let argv_base = user_sp;
// allocate new memory in user stack addr as a vector of strings
let mut argv: Vec<_> = (0..=args.len())
.map(|arg| {
translated_refmut(
memory_set.token(),
(argv_base + arg * core::mem::size_of::<usize>()) as *mut usize
)
})
.collect();
*argv[args.len()] = 0;
for i in 0..args.len() {
// allocate for strings themselves.
user_sp -= args[i].len() + 1;
*argv[i] = user_sp;
let mut p = user_sp;
for c in args[i].as_bytes() {
*translated_refmut(memory_set.token(), p as *mut u8) = *c;
p += 1;
}
*translated_refmut(memory_set.token(), p as *mut u8) = 0;
}
// make the user_sp aligned to 8B for k210 platform
user_sp -= user_sp % core::mem::size_of::<usize>();

// **** hold current PCB lock
let mut inner = self.acquire_inner_lock();
// substitute memory_set
inner.memory_set = memory_set;
// update trap_cx ppn
inner.trap_cx_ppn = trap_cx_ppn;
// initialize trap_cx
let mut trap_cx = TrapContext::app_init_context(
entry_point,
user_sp,
KERNEL_SPACE.lock().token(),
self.kernel_stack.get_top(),
trap_handler as usize,
);
// a[0] be args len
trap_cx.x[10] = args.len();
// a[1] be args base addr
trap_cx.x[11] = argv_base;
*inner.get_trap_cx() = trap_cx;
// **** release current PCB lock
}
}

Now we receive the arguments in _start: the kernel (S-level) has laid the data out on the user stack, and here in U-level we rebuild a &[&str] from it and pass it to main:

// user/src/lib.rs

#[no_mangle]
#[link_section = ".text.entry"]
pub extern "C" fn _start(argc: usize, argv: usize) -> ! {
unsafe {
HEAP.lock()
.init(HEAP_SPACE.as_ptr() as usize, USER_HEAP_SIZE);
}
let mut v: Vec<&'static str> = Vec::new();
for i in 0..argc {
let str_start = unsafe {
((argv + i * core::mem::size_of::<usize>()) as *const usize).read_volatile()
};
let len = (0usize..).find(|i| unsafe {
((str_start + *i) as *const u8).read_volatile() == 0
}).unwrap();
v.push(
core::str::from_utf8(unsafe {
core::slice::from_raw_parts(str_start as *const u8, len)
}).unwrap()
);
}
exit(main(argc, v.as_slice()));
}

Redirection

Redirection is usually written with > for output and < for input.

To actually redirect I/O, we combine user_shell with sys_dup.

First, sys_dup duplicates a file descriptor that is already open in the current process, returning a new descriptor that refers to the same file.
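
A sketch of sys_dup in the spirit of the tutorial:

// os/src/syscall/fs.rs (sketch)

pub fn sys_dup(fd: usize) -> isize {
    let task = current_task().unwrap();
    let mut inner = task.inner_exclusive_access();
    if fd >= inner.fd_table.len() {
        return -1;
    }
    if inner.fd_table[fd].is_none() {
        return -1;
    }
    let new_fd = inner.alloc_fd();
    // the new descriptor shares the same underlying file
    inner.fd_table[new_fd] = Some(Arc::clone(inner.fd_table[fd].as_ref().unwrap()));
    new_fd as isize
}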

Then we parse the user's command line. If it contains > or <, we fork a child process; in the child we open the target file, close the corresponding Stdin or Stdout descriptor, and use dup so that the file takes its place. The child then execs the command with the remaining arguments, and the parent collects the result.
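
A condensed sketch of the '<' case in the forked child (input is assumed to hold the filename parsed after '<', and open/close/dup are the user-library wrappers):

// user/src/bin/user_shell.rs (sketch, child branch after fork())

if !input.is_empty() {
    let input_fd = open(input.as_str(), OpenFlags::RDONLY);
    if input_fd == -1 {
        println!("Error when opening file {}", input);
        return -4;
    }
    let input_fd = input_fd as usize;
    close(0);                      // free descriptor 0 (Stdin)
    assert_eq!(dup(input_fd), 0);  // the file now sits at descriptor 0
    close(input_fd);
}
// output redirection with '>' works the same way on descriptor 1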

Chapter 7-2

Introduction

If a process wants to notify another process with event semantics, this one-sided mechanism is called a Signal: a process that receives a specific event pauses and runs the corresponding operation to handle the notification.

For example, a program could receive the stop event sent by Ctrl+C and stop itself.

The abstract ways of handling a signal:

  • ignore: keep doing its own work and ignore the signal
  • trap: invoke the operation registered for the received signal
  • stop: stop itself

Now, beyond this raw idea, we want to back the abstraction with concrete data.


Signal Data

First, we define raw info for each possible event.

// user/src/lib.rs

pub const SIGDEF: i32 = 0; // Default signal handling
pub const SIGHUP: i32 = 1;
pub const SIGINT: i32 = 2;
pub const SIGQUIT: i32 = 3;
pub const SIGILL: i32 = 4;
pub const SIGTRAP: i32 = 5;
pub const SIGABRT: i32 = 6;
pub const SIGBUS: i32 = 7;
pub const SIGFPE: i32 = 8;
pub const SIGKILL: i32 = 9;
...

So what if a process wants to ignore certain signals? We introduce a Mask built from bits: each signal occupies one bit, so a whole set of signals can be recorded, combined, and tested in a single integer.

// user/src/lib.rs

bitflags! {
pub struct SignalFlags: i32 {
const SIGDEF = 1; // Default signal handling
const SIGHUP = 1 << 1;
const SIGINT = 1 << 2;
const SIGQUIT = 1 << 3;
const SIGILL = 1 << 4;
const SIGTRAP = 1 << 5;
...
const SIGSYS = 1 << 31;
}
}

In the task control block we record the current mask, the set of pending signals, and the action corresponding to each signal, so we need a fixed array of actions (each holding a handler pointer and its own mask). We also record which signal is currently being handled.

// user/src/lib.rs

/// Action for a signal
#[repr(C, align(16))]
#[derive(Debug, Clone, Copy)]
pub struct SignalAction {
pub handler: usize,
pub mask: SignalFlags,
}

// os/src/task/signal.rs

pub const MAX_SIG: usize = 31;

// os/src/task/action.rs

#[derive(Clone)]
pub struct SignalActions {
pub table: [SignalAction; MAX_SIG + 1],
}

// os/src/task/task.rs

pub struct TaskControlBlockInner {
...
pub handling_sig: isize,
// priority allowed
pub signals: SignalFlags,
// priority forbidden
pub signal_mask: SignalFlags,
pub signal_actions: SignalActions,
...
}

Then the task knows which signals should be handled and which should be ignored.


Signal Handling

Recall that signals are handled at different levels: some in S-level by the kernel, others in U-level by a user handler. Some signals are severe (the task must be killed) or ask the task to freeze and wait. Before a user handler runs we must back up trap_ctx, because the handler executes in a different environment.

// os/src/task/task.rs

pub struct TaskControlBlockInner {
...
pub killed: bool,
pub frozen: bool,
pub handling_sig: isize,
pub trap_ctx_backup: Option<TrapContext>,
...
}

// os/src/task/mod.rs

// Some signals are severe and handled by kernel.
fn call_kernel_signal_handler(signal: SignalFlags) {
let task = current_task().unwrap();
let mut task_inner = task.inner_exclusive_access();
match signal {
SignalFlags::SIGSTOP => {
task_inner.frozen = true;
task_inner.signals ^= SignalFlags::SIGSTOP;
}
SignalFlags::SIGCONT => {
if task_inner.signals.contains(SignalFlags::SIGCONT) {
task_inner.signals ^= SignalFlags::SIGCONT;
task_inner.frozen = false;
}
}
_ => {
// println!(
// "[K] call_kernel_signal_handler:: current task sigflag {:?}",
// task_inner.signals
// );
task_inner.killed = true;
}
}
}

// Some signals are normal and handled by user.
fn call_user_signal_handler(sig: usize, signal: SignalFlags) {
let task = current_task().unwrap();
let mut task_inner = task.inner_exclusive_access();

let handler = task_inner.signal_actions.table[sig].handler;
if handler != 0 {
// register signal into task
task_inner.handling_sig = sig as isize;
task_inner.signals ^= signal;

// backup
let mut trap_ctx = task_inner.get_trap_cx();
task_inner.trap_ctx_backup = Some(*trap_ctx);

// modify current trap for our event handler
trap_ctx.sepc = handler;
trap_ctx.x[10] = sig;
} else {
// default action
println!("[K] task/call_user_signal_handler: default action: ignore it or kill process");
}
}

Based on this, we can check pending signals one by one: against the set of pending signals, against the task's signal_mask, and against the mask of the action whose signal is currently being handled.

// os/src/task/mod.rs

fn check_pending_signals() {
for sig in 0..(MAX_SIG + 1) {
let task = current_task().unwrap();
let task_inner = task.inner_exclusive_access();
let signal = SignalFlags::from_bits(1 << sig).unwrap();
if task_inner.signals.contains(signal) && (!task_inner.signal_mask.contains(signal)) {
let mut masked = true;
let handling_sig = task_inner.handling_sig;
if handling_sig == -1 {
masked = false;
} else {
let handling_sig = handling_sig as usize;
if !task_inner.signal_actions.table[handling_sig]
.mask
.contains(signal)
{
masked = false;
}
}
if !masked {
drop(task_inner);
drop(task);
if signal == SignalFlags::SIGKILL
|| signal == SignalFlags::SIGSTOP
|| signal == SignalFlags::SIGCONT
|| signal == SignalFlags::SIGDEF
{
// signal is a kernel signal
call_kernel_signal_handler(signal);
} else {
// signal is a user signal
call_user_signal_handler(sig, signal);
return;
}
}
}
}
}

Then a loop keeps handling signals, yielding in between, until the task is no longer frozen or has been killed.

// os/src/task/mod.rs

pub fn handle_signals() {
loop {
check_pending_signals();
let (frozen, killed) = {
let task = current_task().unwrap();
let task_inner = task.inner_exclusive_access();
(task_inner.frozen, task_inner.killed)
};
if !frozen || killed {
break;
}
suspend_current_and_run_next();
}
}

System Operation

Finally, we design the syscalls that form the interface.

  • procmask: set the signal mask of the current process
  • sigaction: set the handler of a signal for the current process and store the original handler through the old_action pointer
  • kill: the current process sends a signal to another process
  • sigreturn: clear the current signal and return to the original trap context

We will construct it one by one.

procmask is simple: we set the new mask directly and return the original one.

// os/src/process.rs

pub fn sys_sigprocmask(mask: u32) -> isize {
if let Some(task) = current_task() {
let mut inner = task.inner_exclusive_access();
let old_mask = inner.signal_mask;
if let Some(flag) = SignalFlags::from_bits(mask) {
inner.signal_mask = flag;
old_mask.bits() as isize
} else {
-1
}
} else {
-1
}
}

sigaction is a bit more involved but still easy; note, however, that the pointers may be null.

// os/src/syscall/process.rs

fn check_sigaction_error(signal: SignalFlags, action: usize, old_action: usize) -> bool {
if action == 0
|| old_action == 0
|| signal == SignalFlags::SIGKILL
|| signal == SignalFlags::SIGSTOP
{
true
} else {
false
}
}

pub fn sys_sigaction(
signum: i32,
action: *const SignalAction,
old_action: *mut SignalAction,
) -> isize {
let token = current_user_token();
let task = current_task().unwrap();
let mut inner = task.inner_exclusive_access();
if signum as usize > MAX_SIG {
return -1;
}
if let Some(flag) = SignalFlags::from_bits(1 << signum) {
if check_sigaction_error(flag, action as usize, old_action as usize) {
return -1;
}
let prev_action = inner.signal_actions.table[signum as usize];
*translated_refmut(token, old_action) = prev_action;
inner.signal_actions.table[signum as usize] = *translated_ref(token, action);
0
} else {
-1
}
}

kill is simple: we look up the task by pid and insert the signal flag, provided that flag is not already pending.
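
A sketch of sys_kill along the lines of the tutorial (pid2task is assumed to be the pid-to-task lookup):

// os/src/syscall/process.rs (sketch)

pub fn sys_kill(pid: usize, signum: i32) -> isize {
    if let Some(task) = pid2task(pid) {
        if let Some(flag) = SignalFlags::from_bits(1 << signum) {
            let mut task_ref = task.inner_exclusive_access();
            if task_ref.signals.contains(flag) {
                // the signal is already pending
                return -1;
            }
            task_ref.signals.insert(flag);
            0
        } else {
            -1
        }
    } else {
        -1
    }
}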


sigreturn is simple, because we only need to restore the backed-up trap context.

// os/src/syscall/process.rs

pub fn sys_sigreturn() -> isize {
if let Some(task) = current_task() {
let mut inner = task.inner_exclusive_access();
inner.handling_sig = -1;
// restore the trap context
let trap_ctx = inner.get_trap_cx();
*trap_ctx = inner.trap_ctx_backup.unwrap();
0
} else {
-1
}
}

Phew! We finish our Signal mechanism!

Chapter 8-1

Introduction

As the OS grows, the resources it schedules can be divided into smaller pieces for more efficient operation. The process no longer satisfies our demands: we want parts of a program to run in parallel. So we introduce the Thread.

The process therefore becomes a container of threads. Each thread has its own id, state, current instruction pointer, registers, and stack, but shares the data (the same memory and addresses) of its process. We will also develop an accompanying mutual-exclusion mechanism for the threads' parallel operations.

Design Data

Now, clarify our resource dispatch for one thread:

Immutable:

  • kernel stack

Mutable:

  • thread id
  • user stack
  • trap context
  • trap status
  • exit code

Every task is now a thread contained in a process; the process is finally a real process rather than a single task, and it can own many tasks.

// os/src/task/process.rs

pub struct ProcessControlBlock {
// immutable
pub pid: PidHandle,
// mutable
inner: UPSafeCell<ProcessControlBlockInner>,
}

pub struct ProcessControlBlockInner {
// ...
pub task_res_allocator: RecycleAllocator,
pub tasks: Vec<Option<Arc<TaskControlBlock>>>,
}

Note that the user stack and the kernel stack are separated and should not be allocated by the same logic. The kernel stack is immutable: we only need its top address, which is recorded in the trap context.

Because every thread uses the same memory set, each thread's user stack and trap context page are placed at positions derived from its thread id. We encapsulate these in the TaskUserRes structure.
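
For orientation, a sketch of TaskUserRes as used in the tutorial: it carries the thread id, the base of the user-stack area, and a weak reference back to the owning process.

// os/src/task/id.rs (sketch)

pub struct TaskUserRes {
    pub tid: usize,
    pub ustack_base: usize,
    pub process: Weak<ProcessControlBlock>,
}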

Many structures need id allocation, so we design a general id allocator.

// os/src/task/id.rs

pub struct RecycleAllocator {
current: usize,
recycled: Vec<usize>,
}

impl RecycleAllocator {
pub fn new() -> Self {
RecycleAllocator {
current: 0,
recycled: Vec::new(),
}
}
pub fn alloc(&mut self) -> usize {
if let Some(id) = self.recycled.pop() {
id
} else {
self.current += 1;
self.current - 1
}
}
pub fn dealloc(&mut self, id: usize) {
assert!(id < self.current);
assert!(
!self.recycled.iter().any(|i| *i == id),
"id {} has been deallocated!",
id
);
self.recycled.push(id);
}
}

Kernel Stack Allocation

// os/src/task/id.rs

lazy_static! {
static ref KSTACK_ALLOCATOR: UPSafeCell<RecycleAllocator> =
unsafe { UPSafeCell::new(RecycleAllocator::new()) };
}

pub struct KernelStack(pub usize);

/// Return (bottom, top) of a kernel stack in kernel space.
pub fn kernel_stack_position(kstack_id: usize) -> (usize, usize) {
let top = TRAMPOLINE - kstack_id * (KERNEL_STACK_SIZE + PAGE_SIZE);
let bottom = top - KERNEL_STACK_SIZE;
(bottom, top)
}

pub fn kstack_alloc() -> KernelStack {
let kstack_id = KSTACK_ALLOCATOR.exclusive_access().alloc();
let (kstack_bottom, kstack_top) = kernel_stack_position(kstack_id);
KERNEL_SPACE.exclusive_access().insert_framed_area(
kstack_bottom.into(),
kstack_top.into(),
MapPermission::R | MapPermission::W,
);
KernelStack(kstack_id)
}

impl Drop for KernelStack {
fn drop(&mut self) {
let (kernel_stack_bottom, _) = kernel_stack_position(self.0);
let kernel_stack_bottom_va: VirtAddr = kernel_stack_bottom.into();
KERNEL_SPACE
.exclusive_access()
.remove_area_with_start_vpn(kernel_stack_bottom_va.into());
}
}

We do likewise for the user stack and trap context:

// os/src/config.rs

pub const TRAMPOLINE: usize = usize::MAX - PAGE_SIZE + 1;
pub const TRAP_CONTEXT_BASE: usize = TRAMPOLINE - PAGE_SIZE;

// os/src/task/id.rs

fn trap_cx_bottom_from_tid(tid: usize) -> usize {
TRAP_CONTEXT_BASE - tid * PAGE_SIZE
}

fn ustack_bottom_from_tid(ustack_base: usize, tid: usize) -> usize {
ustack_base + tid * (PAGE_SIZE + USER_STACK_SIZE)
}

Then TaskUserRes can allocate the trap context and user stack for its thread.

// impl TaskUserRes
pub fn alloc_user_res(&self) {
let process = self.process.upgrade().unwrap();
let mut process_inner = process.inner_exclusive_access();
// alloc user stack
let ustack_bottom = ustack_bottom_from_tid(self.ustack_base, self.tid);
let ustack_top = ustack_bottom + USER_STACK_SIZE;
process_inner.memory_set.insert_framed_area(
ustack_bottom.into(),
ustack_top.into(),
MapPermission::R | MapPermission::W | MapPermission::U,
);
// alloc trap_cx
let trap_cx_bottom = trap_cx_bottom_from_tid(self.tid);
let trap_cx_top = trap_cx_bottom + PAGE_SIZE;
process_inner.memory_set.insert_framed_area(
trap_cx_bottom.into(),
trap_cx_top.into(),
MapPermission::R | MapPermission::W,
);
}

Now, combine all things together:

// os/src/task/task.rs

pub struct TaskControlBlock {
// immutable
pub process: Weak<ProcessControlBlock>,
pub kstack: KernelStack,
// mutable
inner: UPSafeCell<TaskControlBlockInner>,
}

pub struct TaskControlBlockInner {
pub res: Option<TaskUserRes>,
pub trap_cx_ppn: PhysPageNum,
pub task_cx: TaskContext,
pub task_status: TaskStatus,
pub exit_code: Option<i32>,
}

Design Data Operation

The scheduler still operates on tasks rather than processes, because the task is the smallest unit of execution. However, we need some interfaces keyed by process id.

pub fn add_task(task: Arc<TaskControlBlock>);
pub fn remove_task(task: Arc<TaskControlBlock>);
pub fn fetch_task() -> Option<Arc<TaskControlBlock>>;

pub fn pid2process(pid: usize) -> Option<Arc<ProcessControlBlock>>;
pub fn insert_into_pid2process(pid: usize, process: Arc<ProcessControlBlock>);
pub fn remove_from_pid2process(pid: usize);

Actually, much of this is the same as before. For example:

// os/src/task/process.rs

impl ProcessControlBlock {
pub fn new(elf_data: &[u8]) -> Arc<Self> {
// ...
let pid_handle = pid_alloc();
let process = ...;
let task = Arc::new(TaskControlBlock::new(
Arc::clone(&process),
ustack_base,
true,
));
// initiation of task...

let mut process_inner = process.inner_exclusive_access();
process_inner.tasks.push(Some(Arc::clone(&task)));
drop(process_inner);
insert_into_pid2process(process.getpid(), Arc::clone(&process));
// add main thread to scheduler
add_task(task);
process
}
}

When we fork a process, we copy only the first task, the main thread itself; no other threads are copied.

pub fn fork(self: &Arc<Self>) -> Arc<Self> {
let child = ...;
parent.children.push(Arc::clone(&child));
let task = Arc::new(TaskControlBlock::new(
Arc::clone(&child),
parent
.get_task(0)
.inner_exclusive_access()
.res
.as_ref()
.unwrap()
.ustack_base(),
// here we do not allocate trap_cx or ustack again
// but mention that we allocate a new kstack here
false,
));
let mut child_inner = child.inner_exclusive_access();
child_inner.tasks.push(Some(Arc::clone(&task)));
drop(child_inner);
...
}

Design System Operation

To create a thread, a naive designer only needs the entry address of the function and its argument. Yes, that's it!

// os/src/syscall/thread.rs

pub fn sys_thread_create(entry: usize, arg: usize) -> isize {
let task = current_task().unwrap();
let process = task.process.upgrade().unwrap();
// create a new thread
let new_task = Arc::new(TaskControlBlock::new(
Arc::clone(&process),
task.inner_exclusive_access().res.as_ref().unwrap().ustack_base,
true,
));
// add new task to scheduler
add_task(Arc::clone(&new_task));
let new_task_inner = new_task.inner_exclusive_access();
let new_task_res = new_task_inner.res.as_ref().unwrap();
let new_task_tid = new_task_res.tid;
let mut process_inner = process.inner_exclusive_access();
// add new thread to current process
let tasks = &mut process_inner.tasks;
while tasks.len() < new_task_tid + 1 {
tasks.push(None);
}
tasks[new_task_tid] = Some(Arc::clone(&new_task));
let new_task_trap_cx = new_task_inner.get_trap_cx();
*new_task_trap_cx = TrapContext::app_init_context(
entry,
new_task_res.ustack_top(),
kernel_token(),
new_task.kstack.get_top(),
trap_handler as usize,
);
(*new_task_trap_cx).x[10] = arg;
new_task_tid as isize
}

Now sys_exit receives an exit_code and recycles the thread's resources. Note that when tid == 0 (the main thread of the process exits), the whole process exits: its children are moved to the init process and the user resources of all its threads are recycled.

// pub fn exit_current_and_run_next(exit_code: i32) {
// ...
{
let mut initproc_inner = INITPROC.inner_exclusive_access();
for child in process_inner.children.iter() {
child.inner_exclusive_access().parent = Some(Arc::downgrade(&INITPROC));
initproc_inner.children.push(child.clone());
}
}

let mut recycle_res = Vec::<TaskUserRes>::new();
for task in process_inner.tasks.iter().filter(|t| t.is_some()) {
let task = task.as_ref().unwrap();
remove_inactive_task(Arc::clone(&task));
let mut task_inner = task.inner_exclusive_access();
if let Some(res) = task_inner.res.take() {
recycle_res.push(res);
}
}

sys_waittid checks a thread's state and recycles it if possible, returning -2 if it has not exited yet. Why is it needed? Because a thread cannot recycle all of its own resources in sys_exit (unless it is the main thread); another thread calls waittid to remove it from the tasks vector, after which the control block is dropped by Rust!

// os/src/syscall/thread.rs

/// thread does not exist, return -1
/// thread has not exited yet, return -2
/// otherwise, return thread's exit code
pub fn sys_waittid(tid: usize) -> i32 {
let task = current_task().unwrap();
let process = task.process.upgrade().unwrap();
let task_inner = task.inner_exclusive_access();
let mut process_inner = process.inner_exclusive_access();
// a thread cannot wait for itself
if task_inner.res.as_ref().unwrap().tid == tid {
return -1;
}
let mut exit_code: Option<i32> = None;
let waited_task = process_inner.tasks[tid].as_ref();
if let Some(waited_task) = waited_task {
if let Some(waited_exit_code) = waited_task.inner_exclusive_access().exit_code {
exit_code = Some(waited_exit_code);
}
} else {
// waited thread does not exist
return -1;
}
if let Some(exit_code) = exit_code {
// dealloc the exited thread
process_inner.tasks[tid] = None;
exit_code
} else {
// waited thread has not exited
-2
}
}

Chapter 8-2

Introduction

We will now develop the mutual-exclusion mechanism mentioned previously.

Besides the construction itself, we need to abstract the situations that arise from data sharing. A common naive picture: a thread wants to modify something, but because of a thread switch the data has already been modified, and we get a wrong result. Based on this, we want an operation to be Atomic, meaning the operation excludes all others. We can then relax this restriction and generalize it.

Generalization:

  • Allow a bounded number of threads to enter one atomic operation.
  • Allow an atomic operation to be guarded by a condition.

Before generalizing, we need a way to talk about atomic operations. We call the body of such an operation a Critical Section, and the situation where multiple threads operate in an indeterminate order a Race Condition. The basic problem of data sharing thus pushes us to coordinate the different operations of different threads; we cannot restrict the data itself, because the problem lies in the modifications made by threads, so we need to Lock operations!

So if there is a lock shared by threads, a thread can declare "Lock it!", and no other thread can enter the locked section until it is released.

Now, back to our generalization. If the lock has a bound on the number of holders, many threads may enter until the bound is reached; that is also a reasonable design, and we call it a Semaphore. If the lock carries a signal that one thread can send to allow others to proceed, that too is a reasonable design, and we call it a Condition Variable.

If the minimal shared thing is really the Lock rather than the data, we can set aside the so-called data problem and focus on the lock itself: each thread can do anything while holding the lock, excluding all others.

Design

Whatever kind of lock it is, it is shared among the threads of a process.

pub struct ProcessControlBlock {
// immutable
pub pid: PidHandle,
// mutable
inner: UPSafeCell<ProcessControlBlockInner>,
}
pub struct ProcessControlBlockInner {
...
pub lock_list: Vec<Option<Arc<Lock>>>,
}

pub struct Lock {
pub inner: UPSafeCell<LockInner>,
}
pub struct LockInner {
pub data: ...
pub wait_queue: VecDeque<Arc<TaskControlBlock>>,
}

In this design, a lock can push a thread onto wait_queue to stop it, and pop from the front to resume it. The data field is a placeholder that varies between the different kinds of locks.

A process then owns many locks used under various conditions; you can think of them as stand-ins for the many pieces of data we want to share (although the lock itself carries no real data).

Basic Lock

Now we construct a basic lock supporting plain lock and unlock operations.

pub trait Mutex: Sync + Send {
fn lock(&self);
fn unlock(&self);
}

There are typically U-level, M-level, and S-level implementations. We first try the U-level one, then look at the heuristic design at M-level, and finally extend the basic idea to S-level.


U-level

A naive approach is to declare a global boolean indicating the locked state: lock spins while the boolean is true and then tries to set it to true, and unlock sets it back to false to release.

static mut mutex :i32 = 0;

fn lock(mutex: i32) {
while (mutex);
mutex = 1;
}

fn unlock(mutex: i32){
mutex = 0;
}

However, that's wrong! We cannot build a lock out of the very thing we are trying to protect: a thread switch can happen between any two instructions and break it. Does that mean it cannot be done at U-level? Think harder about the real situation: imagine two threads modifying the same thing at nearly the same time. If we can set two global states whose updates exclude each other (for example, one state set to 1 while the other is set to 0), then only one of the two operations can really take effect, and by checking this condition we can decide which thread gets the lock.

static mut flag : [i32;2] = [0,0]; // which thread wants the lock?
static mut turn : i32 = 0; // whose turn is it? (thread 0 or 1?)

fn lock() {
flag[self] = 1; // declare that we (thread id `self`) want the lock
turn = 1 - self; // let the other thread go first
while ((flag[1-self] == 1) && (turn == 1 - self)); // busy wait
}

fn unlock() {
flag[self] = 0; // give up the lock
}

Analyzing the code: whichever flag is 1 (or both), indicating that some thread wants the lock, turn acts as the tie-breaker. Even if both threads write turn at nearly the same time, turn can end up in only one of the two states, so only one thread passes the busy-wait and gets the lock.


M-level

Is there an operation that the instruction set already guarantees to be atomic? If so, we can build a lock from it. The answer is yes; in RISC-V there are:

  • AMO: Atomic memory operation
  • LR/SC: Load Reserved/Store Conditional

AMO: reads the value in memory, writes a new value, and stores the old value into the destination register (e.g. amoadd.w rd, rs2, (rs1)).

LR/SC: LR reads memory into the destination register and leaves a reservation on that address; SC then checks the reservation and, if it still holds, writes its data to the address, putting a success/failure code (0/1) into the destination register (e.g. lr.w rd, (rs1) and sc.w rd, rs2, (rs1)).

We can use them to implement an atomic test-and-set:

# RISC-V sequence for implementing a TAS  at (s1)
li t2, 1 # t2 <-- 1
Try: lr t1, s1 # t1 <-- mem[s1] (load reserved)
bne t1, x0, Try # if t1 != 0, goto Try:
sc t0, s1, t2 # mem[s1] <-- t2 (store conditional)
bne t0, x0, Try # if t0 !=0 ('sc' Instr failed), goto Try:
Locked:
... # critical section
Unlock:
sw x0,0(s1) # mem[s1] <-- 0

The logic of Try is that mem[s1] is zero when unlocked and non-zero when locked. Try compares t1 with x0, i.e. mem[s1] with 0; if it is zero, it attempts to store t2 into (s1) with sc, then compares the result code t0 with x0, i.e. the result with 0. If the store succeeded, execution falls through to the critical section; if the store failed, t0 is non-zero and we loop back to Try.

To unlock, we store x0 to (s1), setting mem[s1] back to zero, the unlocked state.

S-level

Then we can carry this idea into Rust and package it. A simple refinement: instead of spinning in the retry loop, yield and give the CPU to other tasks.
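
A sketch of such a yielding spin lock, in the spirit of the tutorial's MutexSpin (UPSafeCell and suspend_current_and_run_next as in earlier chapters):

// os/src/sync/mutex.rs (sketch)

pub struct MutexSpin {
    locked: UPSafeCell<bool>,
}

impl MutexSpin {
    pub fn new() -> Self {
        Self { locked: unsafe { UPSafeCell::new(false) } }
    }
}

impl Mutex for MutexSpin {
    fn lock(&self) {
        loop {
            let mut locked = self.locked.exclusive_access();
            if *locked {
                // held by someone else: yield instead of burning the CPU
                drop(locked);
                suspend_current_and_run_next();
                continue;
            } else {
                *locked = true;
                return;
            }
        }
    }
    fn unlock(&self) {
        let mut locked = self.locked.exclusive_access();
        *locked = false;
    }
}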


Now, whatever kind of lock we pick, we can plug it into our structure.

First, when we create a lock, we either store it in an empty slot of the list or push it onto the end.

// os/src/syscall/sync.rs
pub fn sys_mutex_create(blocking: bool) -> isize {
let process = current_process();
let mutex: Option<Arc<dyn Mutex>> = if !blocking {
Some(Arc::new(MutexSpin::new()))
} else {
Some(Arc::new(MutexBlocking::new()))
};
let mut process_inner = process.inner_exclusive_access();
if let Some(id) = process_inner
.mutex_list
.iter()
.enumerate()
.find(|(_, item)| item.is_none())
.map(|(id, _)| id)
{
process_inner.mutex_list[id] = mutex;
id as isize
} else {
process_inner.mutex_list.push(mutex);
process_inner.mutex_list.len() as isize - 1
}
}

When we call lock, we pass the id of the lock. If it is already held, the current task is pushed onto wait_queue; otherwise we take the lock and continue.

// os/src/syscall/sync.rs
pub fn sys_mutex_lock(mutex_id: usize) -> isize {
let process = current_process();
let process_inner = process.inner_exclusive_access();
let mutex = Arc::clone(process_inner.mutex_list[mutex_id].as_ref().unwrap());
drop(process_inner);
drop(process);
mutex.lock();
0
}
// os/src/sync/mutex.rs
impl Lock for MutexBlocking {
fn lock(&self) {
let mut mutex_inner = self.inner.exclusive_access();
if ... {
mutex_inner.wait_queue.push_back(current_task().unwrap());
// ... other operations
drop(mutex_inner);
block_current_and_run_next();
} else {
// ... other operations
}
}
}

Unlocking is the reverse operation:

// os/src/syscall/sync.rs
pub fn sys_mutex_unlock(mutex_id: usize) -> isize {
let process = current_process();
let process_inner = process.inner_exclusive_access();
let mutex = Arc::clone(process_inner.mutex_list[mutex_id].as_ref().unwrap());
drop(process_inner);
drop(process);
mutex.unlock();
0
}
// os/src/sync/mutex.rs
impl Mutex for MutexBlocking {
fn unlock(&self) {
let mut mutex_inner = self.inner.exclusive_access();
// ... other operation
if ... {
if let Some(waking_task) = mutex_inner.wait_queue.pop_front() {
add_task(waking_task);
}
}
}
}

Semaphore

It is simple: we only need to replace the boolean with a counter and check it against the bound. The initial count is the bound; when a thread enters it decrements the count, and when it releases it increments it again. We only need to test whether the count is positive or negative.
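
A sketch of the data this needs, in the spirit of the tutorial's Semaphore: a signed count plus a wait queue.

// os/src/sync/semaphore.rs (sketch)

pub struct Semaphore {
    pub inner: UPSafeCell<SemaphoreInner>,
}

pub struct SemaphoreInner {
    pub count: isize,
    pub wait_queue: VecDeque<Arc<TaskControlBlock>>,
}

impl Semaphore {
    pub fn new(res_count: usize) -> Self {
        Self {
            inner: unsafe {
                UPSafeCell::new(SemaphoreInner {
                    count: res_count as isize,
                    wait_queue: VecDeque::new(),
                })
            },
        }
    }
}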

Apply our structure:

pub fn up(&self) {
let mut inner = self.inner.exclusive_access();
inner.count += 1;
if inner.count <= 0 {
if let Some(task) = inner.wait_queue.pop_front() {
add_task(task);
}
}
}

pub fn down(&self) {
let mut inner = self.inner.exclusive_access();
inner.count -= 1;
if inner.count < 0 {
inner.wait_queue.push_back(current_task().unwrap());
drop(inner);
block_current_and_run_next();
}
}

If the initial count equals 1, we are back to a mutex, which admits exactly one thread at a time!

Actually, we can also use a semaphore for synchronization (ordering). Set the initial count to 0: a thread that calls down is blocked immediately, and only after another thread calls up can the blocked thread proceed. The up-calling thread is therefore always ordered ahead of the down-calling one.

In the example below, first always completes its work before second can continue.

const SEM_SYNC: usize = 0; // semaphore ID
unsafe fn first() -> ! {
sleep(10);
println!("First work and wakeup Second");
semaphore_up(SEM_SYNC); // V operation on the semaphore
exit(0)
}
unsafe fn second() -> ! {
println!("Second want to continue,but need to wait first");
semaphore_down(SEM_SYNC); // P operation on the semaphore
println!("Second can work now");
exit(0)
}

Condition Variable

If we want one thread to be able to wake up others blocked on a condition, we need the CondVar. It dispatches through a wait_queue: when a thread signals, it pops one waiting thread, in effect telling it "you are free"; when a thread waits, it pushes itself onto the queue. The unlock/lock pair inside wait matters: unlocking lets another thread modify the condition, but it must come after the enqueue, otherwise the signal could arrive before we are on the queue and be lost forever. We do not encapsulate the condition check inside the CondVar, because that is left to the user; we only expose the interface.

pub fn signal(&self) {
let mut inner = self.inner.exclusive_access();
if let Some(task) = inner.wait_queue.pop_front() {
add_task(task);
}
}
pub fn wait(&self, mutex: Arc<dyn Mutex>) {
let mut inner = self.inner.exclusive_access();
inner.wait_queue.push_back(current_task().unwrap());
drop(inner);
mutex.unlock();
block_current_and_run_next();
mutex.lock();
}

However, since the condition check is left to the user, the condition itself is shared data that could be modified concurrently, so this section usually has to be protected by a mutex as well.

static mut A: usize = 0;   // global variable

const CONDVAR_ID: usize = 0;
const MUTEX_ID: usize = 0;

unsafe fn first() -> ! {
sleep(10);
println!("First work, Change A --> 1 and wakeup Second");
mutex_lock(MUTEX_ID);
A=1;
condvar_signal(CONDVAR_ID);
mutex_unlock(MUTEX_ID);
...
}
unsafe fn second() -> ! {
println!("Second want to continue,but need to wait A=1");
mutex_lock(MUTEX_ID);
while A==0 {
condvar_wait(CONDVAR_ID, MUTEX_ID);
}
mutex_unlock(MUTEX_ID);
...
}

We can see that once A == 1, second no longer loops in wait and proceeds.

Day-1

Component Kernel

Through the experiments, we construct the kernel incrementally, adding components on demand.

  • Unikernel: a single S-level; the app lives inside the kernel.

Each of the more capable kernel forms can be seen as a construction layered on top of the unikernel.

  • Monolithic kernel (macrokernel): adds U-level management, supporting multiple apps, process management, etc.
  • Hypervisor: adds virtualization, with restricted communication between guest and host levels.

ArceOS Design

graph TD
App <--> Runtime
Runtime <--> HAL

The design of ArceOS is simple. First, the HAL (axhal) abstracts the hardware, initializing traps, the boot stack, the MMU, and registers for the various architectures. Then the Runtime (the ax* crates) is split into many components supporting different environments, such as net, task, fs, and so on.

Each arrow works in both directions: during boot, control flows bottom-up to bring up the App; when the App calls something, the call flows top-down to reach the underlying functionality.

In practice, components are selected via Cargo features.

graph TD
App --> axstd
axstd --> |axfeat| aceros_api
aceros_api --> axruntime
axruntime -->|alloc| axalloc
axruntime --> axhal
axruntime -->|irq| irq
axruntime -->|multitask| axtask

Day-2

Paging

We delve into paging.

Example

/// Physical address for pflash#1
const PFLASH_START: usize = 0x2200_0000;

#[cfg_attr(feature = "axstd", no_mangle)]
fn main() {
// Makesure that we can access pflash region.
let va = phys_to_virt(PFLASH_START.into()).as_usize();
let ptr = va as *const u32;
unsafe {
println!("Try to access dev region [{:#X}], got {:#X}", va, *ptr);
let magic = mem::transmute::<u32, [u8; 4]>(*ptr);
println!("Got pflash magic: {}", str::from_utf8(&magic).unwrap());
}
}

PFlash is qemu's emulation of flash memory. When qemu boots, it loads the given file at a fixed MMIO address, where it can be accessed directly.

Paging: feature = ["paging"] enables the virtual memory management needed to map this MMIO region; it lives in axruntime.

The workflow would be:

  • qemu fdt: from 0x0c00_0000 to 0x3000_0000, describing the device space.
  • SBI: from 0x8000_0000 to 0x8020_0000. The RISC-V Supervisor Binary Interface provides an interface through which the kernel asks the machine level to manipulate devices and other low-level facilities.
  • Kernel image: from 0x8020_0000. _skernel marks the start of the S-level contents (code, static data, etc.); _ekernel marks the end, beyond which memory is available for other use.
#[link_section = ".data.boot_page_table"]
static mut BOOT_PT_SV39: [u64; 512] = [0; 512];

unsafe fn init_boot_page_table() {
// 0x8000_0000..0xc000_0000, VRWX_GAD, 1G block
BOOT_PT_SV39[2] = (0x80000 << 10) | 0xef;
// 0xffff_ffc0_8000_0000..0xffff_ffc0_c000_0000, VRWX_GAD, 1G block
// shift 10 bits to store flags
BOOT_PT_SV39[0x102] = (0x80000 << 10) | 0xef;
}

unsafe fn init_mmu() {
let page_table_root = BOOT_PT_SV39.as_ptr() as usize;
satp::set(satp::Mode::Sv39, 0, page_table_root >> 12);
riscv::asm::sfence_vma_all();
}

Each top-level page table entry maps a 1 GiB (0x4000_0000) block. The physical range 0x8000_0000..0xc000_0000 is mapped twice: identically at pgd_idx = 2, and at pgd_idx = 0x102 to the virtual range 0xffff_ffc0_8000_0000..0xffff_ffc0_c000_0000, so the same memory is also visible in the high half of the address space.

Task

Example

let worker = thread::spawn(move || {
println!("Spawned-thread ...");

// Makesure that we can access pflash region.
let va = phys_to_virt(PFLASH_START.into()).as_usize();
let ptr = va as *const u32;
let magic = unsafe {
mem::transmute::<u32, [u8; 4]>(*ptr)
};
if let Ok(s) = str::from_utf8(&magic) {
println!("Got pflash magic: {s}");
0
} else {
-1
}
});

Tasks run concurrently and are dispatched according to a scheduling strategy. A blocked task is moved to a wait_queue to wait; a ready task is moved to the run_queue, i.e. the scheduler, to be dispatched.

Message Communication

Example

let q1 = Arc::new(SpinNoIrq::new(VecDeque::new()));
let q2 = q1.clone();

let worker1 = thread::spawn(move || {
println!("worker1 ...");
for i in 0..=LOOP_NUM {
println!("worker1 [{i}]");
q1.lock().push_back(i);
// NOTE: If worker1 doesn't yield, others have
// no chance to run until it exits!
thread::yield_now();
}
println!("worker1 ok!");
});

let worker2 = thread::spawn(move || {
println!("worker2 ...");
loop {
if let Some(num) = q2.lock().pop_front() {
println!("worker2 [{num}]");
if num == LOOP_NUM {
break;
}
} else {
println!("worker2: nothing to do!");
// TODO: it should sleep and wait for notify!
thread::yield_now();
}
}
println!("worker2 ok!");
});

Cooperative scheduling: each task must voluntarily yield or exit; otherwise it blocks everyone, because control of the CPU belongs to whichever task is running.

Preemptive scheduling: a task is suspended automatically once the external conditions allow it (no lock held, no device access in progress) and the internal condition is met (its current time slice is used up). A disable_count can record the preemption-disable state; with several overlapping restrictions, we simply sum them up.

axhal::irq::register_handler(TIMER_IRQ_NUM, || {
update_timer();
#[cfg(feature = "multitask")]
axtask::on_timer_tick();
});

// Enable IRQs before starting app
axhal::arch::enable_irqs()

on_timer_tick is triggered on every timer tick; the run_queue then checks the current task and preempts it if its time slice is exhausted.

We can make this more dynamic: each task gets a priority, and while it runs on the CPU it accumulates a vruntime computed as init_vruntime + delta / weight(nice), where delta grows with every timer tick and weight(nice) reflects the task's priority. The task with the lowest vruntime is kept at the front of the queue.
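
A tiny illustrative sketch of that bookkeeping (the names here are illustrative, not the actual ArceOS scheduler API):

// CFS-style virtual runtime, as described above (illustrative sketch).
struct TaskMeta {
    init_vruntime: u64,
    delta: u64,       // incremented by the timer while the task runs
    nice_weight: u64, // derived from the task's priority (nice value)
}

impl TaskMeta {
    fn vruntime(&self) -> u64 {
        // lower vruntime = scheduled sooner; a heavier weight grows more slowly
        self.init_vruntime + self.delta / self.nice_weight
    }
}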

Day-3

Device

Commonly, devices fall into three categories: block (FS), network, and display.

/// A structure that contains all device drivers, organized by their category.
#[derive(Default)]
pub struct AllDevices {
/// All network device drivers.
#[cfg(feature = "net")]
pub net: AxDeviceContainer<AxNetDevice>,
/// All block device drivers.
#[cfg(feature = "block")]
pub block: AxDeviceContainer<AxBlockDevice>,
/// All graphics device drivers.
#[cfg(feature = "display")]
pub display: AxDeviceContainer<AxDisplayDevice>,
}

Devices are initialized in axruntime, where the axdriver module is loaded to probe each device and attach its driver.

In qemu, each virtio-mmio slot is probed for a driver response; a slot with no device behind it answers with 0.

Block Driver

A block driver provides interfaces to read and write blocks, supplying I/O operations and persistent storage.

ArceOS uses the axfs module, with the vfs interface definition and concrete implementations such as ramfs and devfs.

Monolith

For U-level support, we separate kernel memory from user memory, giving each process its own user context.

The basic flow is: construct a new user address space, load the program file into it, set up the user stack, then spawn a user task starting at app_entry.

The upper part of the root page table is shared as kernel space, and the part below it is private to each user space.

With user spaces separated, many resources can no longer be shared globally; instead, TaskExt holds references to the independent resources owned by each user app.

In TaskInner we store a raw pointer to the TaskExt; its concrete type is declared through a macro.

struct AxTask {
...
task_ext_ptr: *mut u8
}
/// Task extended data for the monolithic kernel.
pub struct TaskExt {
/// The process ID.
pub proc_id: usize,
/// The user space context.
pub uctx: UspaceContext,
/// The virtual memory address space.
pub aspace: Arc<Mutex<AddrSpace>>,
}

// This expands into a trait implementation that interprets the raw pointer as the `TaskExt` type.
def_task_ext!(TaskExt);

pub fn spawn_user_task(aspace: Arc<Mutex<AddrSpace>>, uctx: UspaceContext) -> AxTaskRef {
let mut task = TaskInner::new(
|| {
let curr = axtask::current();
let kstack_top = curr.kernel_stack_top().unwrap();
ax_println!(
"Enter user space: entry={:#x}, ustack={:#x}, kstack={:#x}",
curr.task_ext().uctx.get_ip(),
curr.task_ext().uctx.get_sp(),
kstack_top,
);
unsafe { curr.task_ext().uctx.enter_uspace(kstack_top) };
},
"userboot".into(),
crate::KERNEL_STACK_SIZE,
);
task.ctx_mut()
.set_page_table_root(aspace.lock().page_table_root());
task.init_task_ext(TaskExt::new(uctx, aspace));
axtask::spawn_task(task)
}

NameSpace

To reuse resources, an axns_resource section is created at compile time to form a global namespace; entries in it are shared via Arc.

If a task needs its own unique copy, we allocate separate space and copy the resource.

Page Fault

We can implement lazy allocation of user-space memory: register a handler for PAGE_FAULT that calls handle_page_fault on the AddrSpace.

impl AddrSpace
pub fn handle_page_fault(...) -> ...
if let Some(area) = self.areas.find(vaddr) {
let orig_flags = area.flags();
if orig_flags.contains(access_flags) {
return area
.backend()
.handle_page_fault(vaddr, orig_flags, &mut self.pt);
}
}

A MemoryArea can be backed in two ways:

  • Linear: directly maps to physically contiguous memory.
  • Alloc: initially creates an empty mapping and relies on handle_page_fault to allocate real memory on demand.

User App

ELF is the default format for most apps. The kernel is responsible for loading the app into the correct regions.

Note that a segment's offset in the file and its position in the virtual address space may differ, due to ELF layout optimizations.

In order to load apps built for Linux, we construct a POSIX API layer whose interface mimics Linux.

Day-4

Hypervisor

A physical computer system can host multiple virtual computer systems, each with its own virtual resources. Just like apps at U-level, each virtual system believes it has exclusive use of those resources.

An emulator works like an interpreter, simulating a virtual machine with only loose efficiency requirements.

A hypervisor executes most instructions directly, relying on the guest being architecturally the same as the simulated machine, and thereby gains much higher efficiency.

  • Type I: the hypervisor runs directly on the hardware, and every virtual OS sits on equal footing above it.
  • Type II: the virtual OS runs on top of a host OS.

Each guest instance (an OS image) is loaded and run on our host kernel.

Design

We focus only on the Type-I hypervisor.

The privilege levels are extended because host and guest must be separated: U-level splits into U and VU, and likewise the kernel level splits to separate the host (the hypervisor) from the guest (the virtual OS), so S-level becomes HS and VS.

Communication

Privileged operations are carried out through communication between HS and VS: when the guest issues an sbi-call, VS traps into HS, which performs the work.

In the RISC-V hstatus register, the virtualization mode is described by:

SPV: records whether the trap came from HS or VS, which determines whether sret returns to U or VU.
SPVP: controls the permission with which HS accesses guest (V) memory.

We need to store both the guest context and the host context, switching between them on VM-Exit and on sret. This is implemented by run_guest and guest_exit, each the reverse of the other.


Timers are virtualized through the sbi-call by keeping a virtual clock in VS: when the guest sets a timer, we clear the guest timer and program the host timer; when the host timer interrupt fires, we clear the host timer and inject the timer into the guest, waiting for its next set-timer request.


Memory is also separated between guest and host: GVA is mapped to GPA as the guest's own memory, while the host is responsible for mapping GPA to HPA, completing the virtualization.


Devices: each virtual device records its starting address; on a page-fault VM-Exit, the hypervisor looks the faulting address up (vmdevs.find(addr)) and calls handle_mmio to serve the corresponding request.

The Starting Point

One afternoon more than a decade ago, I sat at my computer having finished all the weekly quests. The last member of my game guild logged off to go home for dinner. After more than ten hours online, I quit the game and powered off, and a huge sense of emptiness suddenly washed over me.

I graduated from a vocational college in Qingdao with a computer science major. After finishing a Tarena training course, I joined an outsourcing company serving Japanese clients, maintaining a VB 6.0 program written in 1995. In Qingdao, a monthly salary of three thousand yuan barely covered living expenses even with careful scrimping. I always felt I was capable of more, and that life should not look like this.

I wanted to change but had no direction, lacking both resources and opportunities.

Later, browsing forums online, I heard that The Art of Computer Programming was the bible of computing. This directionless young man decided to chew through that monumental work. Its author is Donald E. Knuth, Turing Award laureate, pioneer of the field of algorithms, hailed as the "father of algorithms". To accompany the book, the author even wrote a companion textbook, Concrete Mathematics. I bought it, but gave up after only one day of reading.

I could not find an open course on Concrete Mathematics, but noticed by chance that its table of contents resembled discrete mathematics. Learning that Tsinghua University offered a Combinatorics course on XuetangX, I signed up immediately and completed it. Meanwhile I learned on Zhihu that SICP could fill in compiler fundamentals, so I bought that book and studied the first three chapters along with the ancient lecture recordings.

Those two courses earned me an offer at six times my salary, and I moved from Qingdao to Shanghai to join a startup. Now, more than ten years into my career, I have gone from outsourced programmer to front-end/full-stack engineer, to system architect, and have even served as a CTO. Compared with where I started, this might count as a comeback.

But when exactly did the gears of fate begin to turn?

Was it that afternoon of emptiness after closing the game? Or the day I bought Concrete Mathematics and opened its title page?

Neither. I have always believed the gears of fate truly began to turn the moment I enrolled in the Combinatorics course.

Now let me ask the students in this operating systems training camp: have you felt the gears of fate begin to turn?

A Walkthrough

Society today is wrapped in information cocoons built by capital: we are bombarded daily with ads for Python programming courses, AI courses costing tens of thousands of yuan, paid knowledge communities costing thousands... So-called senior lecturers from big tech enjoy the dividends of the era and exploit information asymmetry to harvest the gullible, while their teaching materials are patchworks of boilerplate riddled with errors and misinformation.

Meanwhile, first-rate universities offer excellent courses for free, yet few people know about them. Congratulations on breaking out of the information cocoon: this training camp is built on Tsinghua University's operating systems course and is supported by a professional teaching team.

The official website lists the prerequisites:

  • Basic Rust
  • Computer Organization and Design
  • RISC-V assembly and the privileged ISA

But as a practice-oriented training camp, the final assessment is to complete an operating system implementation and verify it in QEMU. There are also hidden prerequisites the website does not mention:

  • Git: are you fluent with commit/checkout/branch/merge/rebase and the like?
  • Linux: do you have hands-on experience and know the common commands?
  • English: can you read English documentation (even with the help of translation tools)?
  • Coding and debugging: have you written on the order of ten thousand lines of code, and can you split code across multiple functions and files?
  • Network: can you reach the global internet without obstruction and configure proxies or mirror sources for your development tools?
  • Asking questions: can you describe a problem precisely and state exactly what help you need?

Each of these skills takes at least 10 hours of study. If any one of them is missing, you will have to spend extra time and effort making it up during the labs.

Summary

Thanks to professors Chen Yu, Xiang Yong, Shi Lei and others for offering the course, to teacher Li Ming for promoting it, and to the teaching assistant team for their work. This course:

  • For the country: it cultivates operating-system talent and supports technological self-reliance.
  • For the people: it promotes educational fairness, giving students in resource-poor regions access to first-rate courses.
  • For me: this vocational-college graduate can finally stand at the same starting line as students from 985/211 universities. I used to complain that fate was unfair; now the chance to prove myself is right in front of me, and all I have to do is give it everything.