```
.altmacro
.macro SAVE_SN n
    sd s\n, (\n+2)*8(a0)
.endm
.macro LOAD_SN n
    ld s\n, (\n+2)*8(a1)
.endm
    .section .text
    .globl __switch
__switch:
    ... # store the previous task's context, then load the next one into the current registers
```
Store an array of per-app states plus the index of the currently running app.
Each state holds a TaskContext and a TaskState (running, exited, and so on).
Initialize each task as ready by storing a context that returns into __restore in its TaskContext.
At run time, __switch contexts when needed.
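For reference, the context that SAVE_SN/LOAD_SN operate on can be sketched as below; the `(\n+2)*8` offsets imply ra and sp sit before the s registers (a sketch in the rCore style, not necessarily the exact definition):

```rust
// Sketch of the task context saved/restored by __switch; the (\n+2)*8
// offsets above imply this layout: ra at offset 0, sp at 8, s0..s11 after.
#[repr(C)]
pub struct TaskContext {
    ra: usize,      // where __switch returns to
    sp: usize,      // kernel stack pointer of the task
    s: [usize; 12], // callee-saved registers s0..s11
}
```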
```rust
let current_task_cx_ptr = &mut inner.tasks[current].task_cx as *mut TaskContext;
let next_task_cx_ptr = &inner.tasks[next].task_cx as *const TaskContext;
drop(inner);
// before this, we should drop local variables that must be dropped manually
unsafe {
    __switch(current_task_cx_ptr, next_task_cx_ptr);
}
```
```rust
extern "C" {
    fn __restore(cx_addr: usize);
}
unsafe {
    __restore(KERNEL_STACK.push_context(
        // this context stores the pointer to the user stack for restoration
        TrapContext::app_init_context(APP_BASE_ADDRESS, USER_STACK.get_sp()),
    ) as *const _ as usize);
}
```
Construct User App
Link each app binary into the kernel with a specified memory layout.
Read that layout, and use an AppManager to maintain and store it.
Load each app from the layout, copying it consecutively to APP_BASE_ADDRESS (currently we have no ability to load at a dynamically chosen address).
```rust
// part of the read in the static init of AppManager
let num_app_ptr = _num_app as usize as *const usize;
let num_app = num_app_ptr.read_volatile();
let mut app_start: [usize; MAX_APP_NUM + 1] = [0; MAX_APP_NUM + 1];
let app_start_raw: &[usize] = core::slice::from_raw_parts(num_app_ptr.add(1), num_app + 1);
app_start[..=num_app].copy_from_slice(app_start_raw);
```
Load App:
```rust
// part of the code copying an app into the kernel
asm!("fence.i");
// clear app area
core::slice::from_raw_parts_mut(APP_BASE_ADDRESS as *mut u8, APP_SIZE_LIMIT).fill(0);
let app_src = core::slice::from_raw_parts(
    self.app_start[app_id] as *const u8,
    self.app_start[app_id + 1] - self.app_start[app_id],
);
let app_dst = core::slice::from_raw_parts_mut(APP_BASE_ADDRESS as *mut u8, app_src.len());
app_dst.copy_from_slice(app_src);
```
```rust
pub fn run_next_app() -> ! {
    let mut app_manager = APP_MANAGER.exclusive_access();
    let current_app = app_manager.get_current_app();
    unsafe {
        app_manager.load_app(current_app);
    }
    app_manager.move_to_next_app();
    drop(app_manager);
    // before this we have to drop local variables related to resources manually
    // and release the resources
    extern "C" {
        fn __restore(cx_addr: usize);
    }
    unsafe {
        __restore(KERNEL_STACK.push_context(TrapContext::app_init_context(
            APP_BASE_ADDRESS,
            USER_STACK.get_sp(),
        )) as *const _ as usize);
    }
    panic!("Unreachable in batch::run_current_app!");
}
```
After running code one by one on demand, we want to introduce the Process, which is a full space-time description of the execution of a binary file in the OS. It means that one process must hold independent resources for its own execution.
After the Process, the Thread and the Coroutine were also developed as OSes grew. They differ in the resources they take up: usually a thread lives inside one process and holds its own independent stack and flow of control, while a coroutine lives inside one thread and holds only its own independent flow of control.
Design
Every process needs an independent memory layout and can be dispatched by the CPU. This functionality is built on Task; beyond that, each process can fork its own child processes, so workflows diverge over time. Because of its children, a finished process's resources can't always be recycled immediately, so we mark it as a Zombie Process.
To clarify which process is the child, which is the parent, and to keep each process distinct, we mark each with a PID (Process Identifier). Notice that if we fork a process, the child is identical to the parent except for a0, the register used for return values: the parent's call returns the new PID of the child, while the child's returns 0, since it forked nothing.
fork: copy a process's state (sp and so on) into a new child process.
waitpid: wait for a child to become a zombie and recycle all its resources.
exec: clear a process's state and load an executable file.
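As a quick illustration of how the three calls combine, here is a hedged sketch of the classic user-level pattern (fork, exec, and the wait wrapper follow the user library quoted later in these notes; the program name is hypothetical):

```rust
// Hypothetical user-program sketch: spawn a child, replace its image, reap it.
if fork() == 0 {
    // child: fork returned 0, replace this process with another program
    exec("hello_world\0");
} else {
    // parent: fork returned the child's PID; wait until it becomes a zombie
    let mut exit_code: i32 = 0;
    let pid = wait(&mut exit_code); // recycle the zombie and get its pid
    println!("child {} exited with {}", pid, exit_code);
}
```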
Data Construction
We recycle every freed pid through a PidAllocator, so there is no need to worry about previously used pids.
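A minimal sketch of such an allocator, assuming the usual current-plus-recycled design (names are illustrative):

```rust
// Sketch: pids grow monotonically; freed pids go to `recycled` for reuse.
pub struct PidAllocator {
    current: usize,
    recycled: Vec<usize>,
}
impl PidAllocator {
    pub fn alloc(&mut self) -> usize {
        if let Some(pid) = self.recycled.pop() {
            pid // reuse a previously freed pid first
        } else {
            self.current += 1;
            self.current - 1
        }
    }
    pub fn dealloc(&mut self, pid: usize) {
        assert!(!self.recycled.contains(&pid)); // guard against double free
        self.recycled.push(pid);
    }
}
```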
CPU dispatch is handled by the newly introduced Processor. We introduce an idle control flow that is used to call into other processes.
Why not have the previous process call the next one directly, rather than going through an idle control flow?
By separating the idle control flow from each process's own, scheduling data never lives inside another process, and the dispatch process stays invisible to each process's Trap handling.
```rust
// run_tasks()
// loop to fetch a task and switch to it if possible
if let Some(task) = fetch_task() {
    let idle_task_cx_ptr = processor.get_idle_task_cx_ptr();
    let mut task_inner = task.inner_exclusive_access();
    let next_task_cx_ptr = &task_inner.task_cx as *const TaskContext;
    task_inner.task_status = TaskStatus::Running;
    drop(task_inner);
    processor.current = Some(task);
    drop(processor);
    unsafe {
        __switch(idle_task_cx_ptr, next_task_cx_ptr);
    }
}
```
```rust
// switch to the idle control flow when one task runs out of its time
pub fn schedule(switched_task_cx_ptr: *mut TaskContext) {
    let mut processor = PROCESSOR.exclusive_access();
    let idle_task_cx_ptr = processor.get_idle_task_cx_ptr();
    drop(processor);
    unsafe {
        __switch(switched_task_cx_ptr, idle_task_cx_ptr);
    }
}
```
Dispatch Construction
Previously we used suspend_current_and_run_next to pause a task and switch to the next one; now we need to adapt it to the process design.
```rust
// os/src/task/mod.rs
pub fn suspend_current_and_run_next() {
    let task = take_current_task().unwrap();

    // ---- access current TCB exclusively
    let mut task_inner = task.inner_exclusive_access();
    let task_cx_ptr = &mut task_inner.task_cx as *mut TaskContext;
    task_inner.task_status = TaskStatus::Ready;
    drop(task_inner);
    // ---- stop exclusively accessing current TCB

    // add to the task deque
    add_task(task);
    // switch to the idle control flow
    schedule(task_cx_ptr);
}
```
Previously a task was never created by a parent task, but a process is. So its TrapContext may have been replaced (by a syscall such as fork or exec), and we need to refactor our trap_handler for that case.
```rust
// fn trap_handler() -> !
Trap::Exception(Exception::UserEnvCall) => {
    // jump to the next instruction anyway
    let mut cx = current_trap_cx();
    cx.sepc += 4;
    // the syscall may create a new process and change the trap context
    let result = syscall(cx.x[17], [cx.x[10], cx.x[11], cx.x[12]]);
    // whether cx has changed or not, we refetch it
    cx = current_trap_cx();
    cx.x[10] = result as usize;
}
```
Now we will construct fork, exec, waitpid, and exit.
Fork
We need to copy the whole memory layout and the task state, then allocate a new kernel stack for the child.
```rust
// impl MemorySet
pub fn from_existed_user(user_space: &MemorySet) -> MemorySet {
    let mut memory_set = Self::new_bare();
    // map trampoline
    memory_set.map_trampoline();
    // copy data sections / trap_context / user_stack
    for area in user_space.areas.iter() {
        let new_area = MapArea::from_another(area);
        memory_set.push(new_area, None);
        // copy data from the parent space
        for vpn in area.vpn_range {
            let src_ppn = user_space.translate(vpn).unwrap().ppn();
            let dst_ppn = memory_set.translate(vpn).unwrap().ppn();
            dst_ppn.get_bytes_array().copy_from_slice(src_ppn.get_bytes_array());
        }
    }
    memory_set
}
```
```rust
// impl TaskControlBlock
// fn fork
let trap_cx_ppn = memory_set
    .translate(VirtAddr::from(TRAP_CONTEXT).into())
    .unwrap()
    .ppn();
// alloc a pid and a kernel stack in kernel space
let pid_handle = pid_alloc();
let kernel_stack = KernelStack::new(&pid_handle);
let kernel_stack_top = kernel_stack.get_top();
let task_control_block = Arc::new(TaskControlBlock {
    pid: pid_handle,
    kernel_stack,
    inner: unsafe {
        UPSafeCell::new(TaskControlBlockInner {
            trap_cx_ppn,
            base_size: parent_inner.base_size,
            task_cx: TaskContext::goto_trap_return(kernel_stack_top),
            task_status: TaskStatus::Ready,
            memory_set,
            parent: Some(Arc::downgrade(self)),
            children: Vec::new(),
            exit_code: 0,
        })
    },
});
// add child
parent_inner.children.push(task_control_block.clone());
// modify kernel_sp in trap_cx
// **** access child PCB exclusively
let trap_cx = task_control_block.inner_exclusive_access().get_trap_cx();
trap_cx.kernel_sp = kernel_stack_top;
```
Finally, implement sys_fork
```rust
pub fn sys_fork() -> isize {
    let current_task = current_task().unwrap();
    let new_task = current_task.fork();
    let new_pid = new_task.pid.0;
    let trap_cx = new_task.inner_exclusive_access().get_trap_cx();

    // for the child process, fork returns 0
    trap_cx.x[10] = 0; // x[10] is the a0 reg

    add_task(new_task);
    new_pid as isize
}
```
We can see that when trap_handler calls sys_fork, the parent's x[10] receives new_pid as the return value, while the child's was set to 0 above.
Exec
If we want to execute a task by its name, we first need to make the app names available as strings at app-load time.
```rust
// os/build.rs
writeln!(
    f,
    r#"
    .global _app_names
_app_names:"#
)?;
for app in apps.iter() {
    writeln!(f, r#"    .string "{}""#, app)?;
}
```
Exit

When a process exits, exit_current_and_run_next first hands all of its children to the initial process:

```rust
// move all its children to the initial process
// ++++++ access initproc TCB exclusively
{
    let mut initproc_inner = INITPROC.inner_exclusive_access();
    for child in inner.children.iter() {
        child.inner_exclusive_access().parent = Some(Arc::downgrade(&INITPROC));
        initproc_inner.children.push(child.clone());
    }
}
// ++++++ stop exclusively accessing initproc TCB
```
```rust
// clear all memory
inner.children.clear();
inner.memory_set.recycle_data_pages();
drop(inner);
// **** stop exclusively accessing current PCB
// drop task manually to maintain the rc correctly
drop(task);
// use _unused to replace the original context, which will be recycled by rust
let mut _unused = TaskContext::zero_init();
schedule(&mut _unused as *mut _);
```
```rust
// os/src/syscall/process.rs
pub fn sys_exit(exit_code: i32) -> ! {
    exit_current_and_run_next(exit_code);
    panic!("Unreachable in sys_exit!");
}
```
WaitPid
waitpid returns -1 if no child process with the specified pid exists; if the child is still running, it returns -2; finally, if the child has finished, it recycles the child and returns its pid.
```rust
// search the children and find (idx, task_block) matching the predicate
let pair = inner.children.iter().enumerate().find(|(_, p)| {
    p.inner_exclusive_access().is_zombie() && (pid == -1 || pid as usize == p.getpid())
});
if let Some((idx, _)) = pair {
    let child = inner.children.remove(idx);
    // confirm that the child will be deallocated after being removed from the children list
    assert_eq!(Arc::strong_count(&child), 1);
    let found_pid = child.getpid();
    // ++++ temporarily access child TCB exclusively
    let exit_code = child.inner_exclusive_access().exit_code;
    // ++++ stop exclusively accessing child TCB
    *translated_refmut(inner.memory_set.token(), exit_code_ptr) = exit_code;
    found_pid as isize
} else {
    // the pid process is still running
    -2
}
```
```rust
// user/src/lib.rs
pub fn wait(exit_code: &mut i32) -> isize {
    loop {
        match sys_waitpid(-1, exit_code as *mut _) {
            -2 => {
                yield_();
            }
            // -1 or a real pid
            exit_pid => return exit_pid,
        }
    }
}
```
The demand for persistent storage has grown with computers. So far we can only store data in mapped memory, which exists only while the system runs; this is Internal Storage. Now we want to save data to External Storage.
A file system plays the role of mapping a given directory-tree structure onto persistent storage, for example FAT/NTFS on Windows and Ext3/Ext4/Btrfs on Linux. Therefore constructing a VFS (Virtual File System) is necessary to enforce a unified interface.
Design
Simplification
flatten: only root dir /
permission: only user and no restriction on file access
no timestamp
no soft/hard link
many…
Cache
A persistent external storage device divides data into a basic storage unit called a sector (usually 512 bytes or 4 KiB). A file system, in turn, defines its own storage unit called a block, often different from a sector; in this implementation we set it to 512 bytes, the same as a sector.
Files are seldom read and written directly on the device, which would slow things down; instead we construct a Block Cache to hold data that has been read. Then we unify all block caches under a manager with a limited size, used for allocation and deallocation.
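A hedged usage sketch of that cache interface, in the same closure style as the read_at snippet further below (get_block_cache and the write-back-on-evict behavior are assumptions from that style):

```rust
// Zero the first 16 bytes of block 1 through the cache; the cache marks
// itself dirty and writes back when it is evicted or synced.
get_block_cache(1, Arc::clone(&block_device))
    .lock()
    .modify(0, |data: &mut [u8; 16]| {
        data.fill(0);
    });
```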
A bitmap is a nice structure for allocation bookkeeping: each bitmap block is 512 bytes (4096 bits), and each bit represents the allocation state of one block (1/0 for allocated/free).
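A sketch of the allocation scan under that design, viewing one bitmap block as [u64; 64] (4096 bits); the helper name is illustrative:

```rust
/// Find and set the first 0 bit in one bitmap block, returning its index.
fn alloc_in_block(bitmap_block: &mut [u64; 64]) -> Option<usize> {
    for (i, bits64) in bitmap_block.iter_mut().enumerate() {
        if *bits64 != u64::MAX {
            let j = bits64.trailing_ones() as usize; // lowest 0 bit
            *bits64 |= 1u64 << j;                    // mark it allocated
            return Some(i * 64 + j);
        }
    }
    None // every bit in this block is already allocated
}
```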
Based on this structure we can explain what an Inode and a Data Block are: not every block stores real data, because some must serve as guidance (metadata). But then we also need to know where and how those routing blocks are allocated; that is exactly what the bitmaps are for! Now we delve into the Inode.
To let one inode control many data blocks, we design layers of routing for it. Besides direct indices, the inode also stores a level-1 and a level-2 index that route through other index blocks (stored just like data blocks) to the real data blocks. Notice one block contains 512 bytes, i.e. 512 u8, so it holds 512/4 = 128 u32; hence one index block routes 128 * 512 bytes = 64 KiB at level 1. At level 2 it routes as much as 128 * 64 KiB = 8 MiB.
Such an inode takes 128 bytes, so one block holds exactly 4 inodes; we deliberately design the data structure to fit the block size exactly. Now we design the routing method.
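Concretely, a 128-byte on-disk inode under this design can be sketched as follows (field counts chosen so the struct is exactly 128 bytes, matching the easy-fs style; the type enum is forced to u32 here for the arithmetic to hold):

```rust
#[repr(u32)]
pub enum DiskInodeType {
    File,
    Directory,
}

// 4 + 28*4 + 4 + 4 + 4 = 128 bytes: exactly four inodes per 512-byte block.
#[repr(C)]
pub struct DiskInode {
    pub size: u32,         // file size in bytes
    pub direct: [u32; 28], // direct data-block ids
    pub indirect1: u32,    // level-1 index block id
    pub indirect2: u32,    // level-2 index block id
    type_: DiskInodeType,  // file or directory
}
```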
```rust
// impl DiskInode
// pub fn read_at(
//     &self,
//     offset: usize,
//     buf: &mut [u8],
//     block_device: &Arc<dyn BlockDevice>,
loop {
    // calculate the end of the current block
    let mut end_current_block = (start / BLOCK_SZ + 1) * BLOCK_SZ;
    end_current_block = end_current_block.min(end);
    // read and update the read size
    let block_read_size = end_current_block - start;
    let dst = &mut buf[read_size..read_size + block_read_size];
    get_block_cache(
        self.get_block_id(start_block as u32, block_device) as usize,
        Arc::clone(block_device),
    )
    .lock()
    .read(0, |data_block: &DataBlock| {
        let src = &data_block[start % BLOCK_SZ..start % BLOCK_SZ + block_read_size];
        dst.copy_from_slice(src);
    });
    read_size += block_read_size;
    // move to the next block
    if end_current_block == end {
        break;
    }
    start_block += 1;
    start = end_current_block;
}
```
File System
Because of our consecutive layout, we store the bitmaps and the start block of each area, then initialize a unified system to control allocation and routing. We call it the File System.
Our DiskInode is aimed at the underlying system, not the user, so we need a real Inode as an interface that routes to a DiskInode; it stores the DiskInode's block id and offset.
However, we still need one special kind of data block, the DirEntry: a directory stores entries mapping names to inode_numbers so they can route to inodes. A DirEntry takes 32 bytes, so each block stores 16 DirEntries. Thus we can route to an inode by a &str name.
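The 32-byte entry itself is simple; a sketch matching the description (a NUL-terminated name field plus the inode number):

```rust
#[repr(C)]
pub struct DirEntry {
    name: [u8; 28],    // NUL-terminated file name (at most 27 chars)
    inode_number: u32, // which inode this name routes to
}
// 28 + 4 = 32 bytes, so 512 / 32 = 16 entries fit in one block.
```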
Notice a disk_inode carries a type: dir or file. Some inodes store directory data and some store file data, so we can reach the inode of a file through the inode of its directory.
```rust
fn find_inode_id(
    &self,
    name: &str,
    disk_inode: &DiskInode,
) -> Option<u32> {
    // assert it is a directory
    assert!(disk_inode.is_dir());
    let file_count = (disk_inode.size as usize) / DIRENT_SZ;
    let mut dirent = DirEntry::empty();
    for i in 0..file_count {
        // note assert_eq! has a side effect: it reads data into dirent
        assert_eq!(
            disk_inode.read_at(DIRENT_SZ * i, dirent.as_bytes_mut(), &self.block_device),
            DIRENT_SZ,
        );
        if dirent.name() == name {
            return Some(dirent.inode_number() as u32);
        }
    }
    None
}
```
Usually, the workflow of create or delete, read or write would be:
read/write
get the root inode, which is of dir type
run a read/write closure over the disk inode, reached through the root inode
resize the specified inode
create/clear
allocation/deallocation: alloc/dealloc the inode in the bitmap and get its index
initialize or clear it by getting its block cache via that index
We don't want a fixed physical address for each allocation; rather, we want a unified abstract interface over a dynamic memory layout for app storage. We call it the Virtual Address.
Safety: every app is allocated its own virtual memory space, so apps can't interfere with each other.
Efficiency: apps can coexist in memory at the same time, without having to read external peripherals to swap apps in and out (made practical by growing memory sizes).
We need the MMU (Memory Management Unit) to achieve Address Translation from virtual to physical addresses.
Different designs occur.
Design
Segment Design
Each app lives in one fixed slot, a segment $[0, bound)$, with a linear map done by the MMU.
Problem: Wasteful and inflexible
We may want a different linear map for each app region, for example separate allocations for heap, data, code, etc., so we can dispatch memory at a finer granularity. But this can't fully resolve the problem: even though slots are now dynamically allocated, free chunks may remain that are too small to reuse, causing External Fragmentation, as opposed to the Internal Fragmentation caused by fixed-size slots.
Paging Design
We can also set protection bits: r for read, w for write, x for execute.
Another way is to invert our thinking: rather than taking a slot in memory, we take a slot in the MMU, which maps that slot (now called a Page) onto the real physical memory layout. On the physical side, memory is sliced into Frames, units sized to match pages; each app holds many page numbers in a Page Table that records the mapping.
Page Design
SV39 uses only the lower 39 bits of an address rather than the whole 64 bits, even though the register width is 64 bits (39 bits is already a fairly large amount of space!).
In an address, $[11:0]$ is called the Page Offset, and $[38:12]$ is the VPN (virtual page number), which locates the page; the offset then points within that page (4 KiB per page).
We modify satp to enable paging for S- and U-level memory accesses.
Usually a simple design for a page manager would be one flat linear map from a base address onward. But that takes a huge amount of memory, since every app would need a table entry for every possible page offset from its base address.
A finer design comes from the Trie. We take an index slice every 9 bits (one level's table of 512 entries of 8 bytes fits into a 4 KiB page just right!), so each slice has 512 states; linking the levels up forms a tree, usually with 3 levels of page index, 27 bits in total.
A virtual address thus reserves 27 bits for the index slices and 12 bits for the offset within the page: 39 bits in total.
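Splitting a virtual page number into the three 9-bit slices is just shifting and masking; a small sketch (the helper name is illustrative):

```rust
// Decompose a 27-bit VPN into [level-2, level-1, level-0] indices, 9 bits each.
fn indexes(mut vpn: usize) -> [usize; 3] {
    let mut idx = [0usize; 3];
    for i in (0..3).rev() {
        idx[i] = vpn & 0x1ff; // lower 9 bits: a value in 0..512
        vpn >>= 9;
    }
    idx
}
```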
To translate a virtual address into a physical one, we apply this construction in reverse: follow the three index slices down from the root page table, then add the in-page offset.
Page Management Design
A simple allocation scheme for pages (allocation, rather than full management!) is stack style.
It means we record each allocated physical page with a FrameTracker, kept in a vector, as the representation of a real Frame owned beneath the root page table.
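A sketch of the stack-style allocator behind those FrameTrackers, assuming frames are handed out by physical page number (names are illustrative):

```rust
// Frames are bump-allocated from `current`; freed frames are reused LIFO.
pub struct StackFrameAllocator {
    current: usize,       // first never-allocated frame
    end: usize,           // one past the last usable frame
    recycled: Vec<usize>, // frames given back, popped first
}
impl StackFrameAllocator {
    pub fn alloc(&mut self) -> Option<usize> {
        if let Some(ppn) = self.recycled.pop() {
            Some(ppn)
        } else if self.current < self.end {
            self.current += 1;
            Some(self.current - 1)
        } else {
            None // physical memory exhausted
        }
    }
    pub fn dealloc(&mut self, ppn: usize) {
        self.recycled.push(ppn); // typically called from FrameTracker's Drop
    }
}
```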
We should design transformation from virtual addr to physical addr.
```rust
// idxs holds the three index slices of the virtual address
for i in 0..3 {
    let pte = &mut ppn.get_pte_array()[idxs[i]];
    if i == 2 {
        result = Some(pte);
        break;
    }
    if !pte.is_valid() {
        let frame = frame_alloc().unwrap(); // create it, or return None
        *pte = PageTableEntry::new(frame.ppn, PTEFlags::V);
        self.frames.push(frame);
    }
    ppn = pte.ppn();
}
```
Page Map Design
Based on our abstraction, we need a MapArea design that gives a map from a contiguous range of virtual addresses (no matter which pages they fall on!) to physical addresses.
```rust
impl MemorySet {
    pub fn activate(&self) {
        let satp = self.page_table.token();
        unsafe {
            satp::write(satp);
            // sfence.vma clears the TLB (Translation Lookaside Buffer), the
            // cache used for fast address lookups, so no stale entries survive
            asm!("sfence.vma" :::: "volatile");
        }
    }
}
```
Therefore activation fills satp with the physical page number of the current root page table.
Notice that the instructions around enabling SV39 should stay contiguous in physical address terms, to bridge the gap between how addresses translate before and after the switch.
One problem is that separating the kernel and apps also isolates the Trap: trapping needs to carry info from App level to Kernel level, but the app can't see kernel data. Therefore we need a transfer mechanism, which we achieve by storing the related info in the TrapContext.
(Because there is no spare register to hold this info without clobbering state; sscratch is already dedicated to the kernel stack.)
```
# load kernel_satp into t0
ld t0, 34*8(sp)
# load trap_handler into t1
ld t1, 36*8(sp)
# move to kernel_sp
ld sp, 35*8(sp)
# switch to kernel space
csrw satp, t0
sfence.vma
# jump to trap_handler
jr t1
```
```
# __restore:
# a0: *TrapContext in user space (constant); a1: user space token
# switch to user space
csrw satp, a1
sfence.vma
csrw sscratch, a0
```
To deal with the contiguous-instruction problem across the switch, we adjust the memory layout to place a trampoline at the same location in U-level and S-level space (unified across all apps for trapping!). It is placed at the highest virtual page.
In a usual shell, | is the symbol for a pipe: it takes input from the left and sends output to the right. If we abstract everything as a file, like Stdin and Stdout, then so is a Pipe: it has a read end and a write end; a user can read from one end while another party (often a child process) writes to the other, transferring the data underneath.
We already have the file descriptor as an indication of a file; we will implement the same operations for a pipe.
sys_pipe receives the pointer to an array of length 2 and outputs the descriptors of the pipe's read end and write end through that pointer.
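A sketch of sys_pipe under that contract, using the kernel helpers that appear elsewhere in these notes (make_pipe, translated_refmut, and alloc_fd are assumptions here):

```rust
pub fn sys_pipe(pipe: *mut usize) -> isize {
    let task = current_task().unwrap();
    let token = current_user_token();
    let mut inner = task.inner_exclusive_access();
    let (pipe_read, pipe_write) = make_pipe(); // both ends share one ring buffer
    let read_fd = inner.alloc_fd();
    inner.fd_table[read_fd] = Some(pipe_read);
    let write_fd = inner.alloc_fd();
    inner.fd_table[write_fd] = Some(pipe_write);
    // write the two descriptors back into the user array of length 2
    *translated_refmut(token, pipe) = read_fd;
    *translated_refmut(token, unsafe { pipe.add(1) }) = write_fd;
    0
}
```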
A pipe should have a write end and a read end, which means both ends share the same data and record read/write positions in it. We construct a RingBuffer to achieve this: the pipe owns a buffer controlling reads and writes, and the buffer records data from a head index to a tail index. Why can't we just use two pieces of data, or a Queue?
Because the ring buffer involves no copying and suits our restrictions! We read data from the head moving forward and push data at the tail, within a fixed array, rather than allocating as a Queue would.
```rust
pub struct PipeRingBuffer {
    arr: [u8; RING_BUFFER_SIZE],
    head: usize, // head index of the ring buffer
    tail: usize, // tail index of the ring buffer
    status: RingBufferStatus,
    write_end: Option<Weak<Pipe>>,
}
```
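Reading one byte then just advances head modulo the array length; a sketch of the idea on top of that struct:

```rust
impl PipeRingBuffer {
    // Pop one byte at head; if head catches up with tail, the buffer is empty.
    pub fn read_byte(&mut self) -> u8 {
        self.status = RingBufferStatus::NORMAL;
        let c = self.arr[self.head];
        self.head = (self.head + 1) % RING_BUFFER_SIZE;
        if self.head == self.tail {
            self.status = RingBufferStatus::EMPTY;
        }
        c
    }
}
```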
```rust
impl TaskControlBlock {
    // notice exec will allocate a new memory set!
    pub fn exec(&self, elf_data: &[u8], args: Vec<String>) {
        // ...
        // first allocate memory for the pointers to the strings
        user_sp -= (args.len() + 1) * core::mem::size_of::<usize>();
        let argv_base = user_sp;
        // view that new user-stack memory as a vector of string pointers
        let mut argv: Vec<_> = (0..=args.len())
            .map(|arg| {
                translated_refmut(
                    memory_set.token(),
                    (argv_base + arg * core::mem::size_of::<usize>()) as *mut usize,
                )
            })
            .collect();
        *argv[args.len()] = 0;
        for i in 0..args.len() {
            // allocate for the strings themselves
            user_sp -= args[i].len() + 1;
            *argv[i] = user_sp;
            let mut p = user_sp;
            for c in args[i].as_bytes() {
                *translated_refmut(memory_set.token(), p as *mut u8) = *c;
                p += 1;
            }
            *translated_refmut(memory_set.token(), p as *mut u8) = 0;
        }
        // make user_sp aligned to 8B for the k210 platform
        user_sp -= user_sp % core::mem::size_of::<usize>();

        // **** hold current PCB lock
        let mut inner = self.acquire_inner_lock();
        // substitute memory_set
        inner.memory_set = memory_set;
        // update trap_cx ppn
        inner.trap_cx_ppn = trap_cx_ppn;
        // initialize trap_cx
        let mut trap_cx = TrapContext::app_init_context(
            entry_point,
            user_sp,
            KERNEL_SPACE.lock().token(),
            self.kernel_stack.get_top(),
            trap_handler as usize,
        );
        // a0 holds the args len
        trap_cx.x[10] = args.len();
        // a1 holds the args base addr
        trap_cx.x[11] = argv_base;
        *inner.get_trap_cx() = trap_cx;
        // **** release current PCB lock
    }
}
```
Now we provide the receiving end in _start, so that main can use the arguments from the very first moment: S-level reads the data and passes it to U-level:
```rust
#[no_mangle]
#[link_section = ".text.entry"]
pub extern "C" fn _start(argc: usize, argv: usize) -> ! {
    unsafe {
        HEAP.lock().init(HEAP_SPACE.as_ptr() as usize, USER_HEAP_SIZE);
    }
    let mut v: Vec<&'static str> = Vec::new();
    for i in 0..argc {
        let str_start = unsafe {
            ((argv + i * core::mem::size_of::<usize>()) as *const usize).read_volatile()
        };
        let len = (0usize..)
            .find(|i| unsafe { ((str_start + *i) as *const u8).read_volatile() == 0 })
            .unwrap();
        v.push(
            core::str::from_utf8(unsafe {
                core::slice::from_raw_parts(str_start as *const u8, len)
            })
            .unwrap(),
        );
    }
    exit(main(argc, v.as_slice()));
}
```
Redirection
Redirection is usually expressed with > and < for output and input.
If we really want to redirect IO, we combine the user_shell with sys_dup.
First, sys_dup duplicates a file descriptor already opened in this process.
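A sketch of sys_dup consistent with that description (alloc_fd returning the lowest free slot is an assumption carried over from the pipe sketch):

```rust
pub fn sys_dup(fd: usize) -> isize {
    let task = current_task().unwrap();
    let mut inner = task.inner_exclusive_access();
    if fd >= inner.fd_table.len() || inner.fd_table[fd].is_none() {
        return -1; // not an open descriptor
    }
    let new_fd = inner.alloc_fd(); // lowest free slot
    inner.fd_table[new_fd] = Some(Arc::clone(inner.fd_table[fd].as_ref().unwrap()));
    new_fd as isize
}
```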
Then we parse the user's arguments; if > or < appears, we fork a child process, open the file, close the corresponding Stdin or Stdout descriptor, and use dup to put the file itself in its place! Then we exec the originally parsed arguments and collect the result in the parent process.
Now we load our easy-fs into the kernel. First, we need to know the device interface. Second, our Inode should be wrapped by the OS as an OSInode for extended functionality. Then we implement sys_read/sys_write on top of it.
MMIO-Memory-Mapped I/O
The device registers of peripherals can be accessed through specific physical memory addresses; the VirtIO MMIO physical address range for the peripheral bus is 4 KiB starting at 0x10001000.
We only restrict our operations to readable and writable as expressed by OpenFlags. An offset plus an Arc is a way to tackle the simple multi-process situation.
Remember our previous root inode? We load it directly for easy manipulation. So the workflow for the current process is: push a newly allocated descriptor into the fd table and return that descriptor.
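Put together, sys_open can be sketched like this (open_file resolving names through the root inode, and translated_str, are assumptions in the style of the earlier snippets):

```rust
pub fn sys_open(path: *const u8, flags: u32) -> isize {
    let task = current_task().unwrap();
    let token = current_user_token();
    let path = translated_str(token, path); // copy the name out of user space
    if let Some(inode) = open_file(path.as_str(), OpenFlags::from_bits(flags).unwrap()) {
        let mut inner = task.inner_exclusive_access();
        let fd = inner.alloc_fd();
        inner.fd_table[fd] = Some(inode); // the OSInode wrapper
        fd as isize
    } else {
        -1
    }
}
```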
If a process wants to notify another process with event semantics, such a one-sided mechanism is called a Signal: a process that receives a specific event pauses and performs the corresponding operation to handle the notification.
For example, a program can receive the stop event sent by Ctrl+C, and stop itself.
The abstraction of signal handling:
ignore: keep doing its own thing and ignore the signal
trap: call the operation corresponding to the received signal
stop: stop itself
Now, beside this raw idea, we want to flesh out the abstraction with concrete data.
Signal Data
First, we define raw info for each possible event.
So what if a process wants to ignore a signal; what should it do? We introduce a Mask with a bit design: each bit stands for one signal, so a single integer can describe a whole set of signals to block, and combining or checking masks is plain bit arithmetic.
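With the bitflags crate, such a signal set is one u32; a hedged sketch with a few representative signals (the bit positions follow conventional signal numbers and are assumptions here):

```rust
bitflags! {
    // One bit per signal, so a mask is just another SignalFlags value.
    pub struct SignalFlags: u32 {
        const SIGINT  = 1 << 2;
        const SIGILL  = 1 << 4;
        const SIGKILL = 1 << 9;
        const SIGSEGV = 1 << 11;
        const SIGCONT = 1 << 18;
        const SIGSTOP = 1 << 19;
    }
}
```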
In its task block, a task should record its current mask, its currently pending signals, and an action for each flag, so we need a fixed array of handler pointers, each with its own mask. Beyond that, we record which signal is currently being handled.
Then our task knows which signals should be handled and which should be ignored.
Signal Handle
Recall that each process should receive signals and trap into the appropriate level: some are handled at S-level, some at U-level. Some signals are illegal or atrocious enough that we should stop the task or freeze it to wait. Before running a user handler we should also back up our trap_ctx, because the handler runs in a different environment.
```rust
// Some signals are severe and handled by the kernel.
fn call_kernel_signal_handler(signal: SignalFlags) {
    let task = current_task().unwrap();
    let mut task_inner = task.inner_exclusive_access();
    match signal {
        SignalFlags::SIGSTOP => {
            task_inner.frozen = true;
            task_inner.signals ^= SignalFlags::SIGSTOP;
        }
        SignalFlags::SIGCONT => {
            if task_inner.signals.contains(SignalFlags::SIGCONT) {
                task_inner.signals ^= SignalFlags::SIGCONT;
                task_inner.frozen = false;
            }
        }
        _ => {
            task_inner.killed = true;
        }
    }
}
```
```rust
// Some signals are normal and handled by the user.
fn call_user_signal_handler(sig: usize, signal: SignalFlags) {
    let task = current_task().unwrap();
    let mut task_inner = task.inner_exclusive_access();

    let handler = task_inner.signal_actions.table[sig].handler;
    if handler != 0 {
        // register the signal as being handled by this task
        task_inner.handling_sig = sig as isize;
        task_inner.signals ^= signal;
        // ...
```
```rust
fn check_pending_signals() {
    for sig in 0..(MAX_SIG + 1) {
        let task = current_task().unwrap();
        let task_inner = task.inner_exclusive_access();
        let signal = SignalFlags::from_bits(1 << sig).unwrap();
        if task_inner.signals.contains(signal) && (!task_inner.signal_mask.contains(signal)) {
            let mut masked = true;
            let handling_sig = task_inner.handling_sig;
            if handling_sig == -1 {
                masked = false;
            } else {
                let handling_sig = handling_sig as usize;
                if !task_inner.signal_actions.table[handling_sig]
                    .mask
                    .contains(signal)
                {
                    masked = false;
                }
            }
            if !masked {
                drop(task_inner);
                drop(task);
                if signal == SignalFlags::SIGKILL
                    || signal == SignalFlags::SIGSTOP
                    || signal == SignalFlags::SIGCONT
                    || signal == SignalFlags::SIGDEF
                {
                    // signal is a kernel signal
                    call_kernel_signal_handler(signal);
                } else {
                    // signal is a user signal
                    call_user_signal_handler(sig, signal);
                    return;
                }
            }
        }
    }
}
```
Then we write a loop that handles signals repeatedly while the task's state changes.
```rust
// os/src/task/mod.rs
pub fn handle_signals() {
    loop {
        check_pending_signals();
        let (frozen, killed) = {
            let task = current_task().unwrap();
            let task_inner = task.inner_exclusive_access();
            (task_inner.frozen, task_inner.killed)
        };
        if !frozen || killed {
            break;
        }
        suspend_current_and_run_next();
    }
}
```
System Operation
Finally, we design the sys operations that form the interface.
procmask: set the signal mask of the current process
sigaction: set the handler for a signal of the current process, moving the original handler into the provided old_action pointer
kill: send a signal from the current process to another one
sigreturn: clear the current signal and return to the original trap state
We construct them one by one.
procmask is simple: we just set the mask directly and return the original one.
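A sketch of that syscall (field names follow the task-block description above):

```rust
pub fn sys_sigprocmask(mask: u32) -> isize {
    if let Some(task) = current_task() {
        let mut inner = task.inner_exclusive_access();
        let old_mask = inner.signal_mask;
        if let Some(flag) = SignalFlags::from_bits(mask) {
            inner.signal_mask = flag; // set the new mask directly...
            old_mask.bits() as isize  // ...and return the original one
        } else {
            -1 // invalid bits in the requested mask
        }
    } else {
        -1
    }
}
```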
As the OS grows, dispatched resources can be divided into smaller pieces for more efficient operation. A process can no longer satisfy our demands: we want parts of a program to run in parallel. So we introduce the Thread.
Therefore a process becomes the container of threads: each thread has its own id, state, current instruction pointer, registers, and stack, but shares the data (the same memory and addresses) of its process. So we will also develop an accompanying exclusion mechanism for parallel operations by the threads.
Design Data
Now, clarify our resource dispatch for one thread:
Immutable:
kernel stack
Mutable:
thread id
user stack
trap context
trap status
exit code
Every task is a thread unit contained in one process; now a process really is a process rather than a task, and it can own many tasks.
Notice that we should separate the user stack and the kernel stack, and not allocate them by the same logic: the kernel stack is immutable, and we only need its top for the trap context.
Because every thread uses the same memory set, each thread's user stack and trap context are positioned by its thread id. We encapsulate these into the TaskUserRes data.
We can see that many structures need id allocation, so we can design a general id allocator, generalizing the PidAllocator shown earlier.
```rust
pub fn fork(self: &Arc<Self>) -> Arc<Self> {
    let child = ...;
    parent.children.push(Arc::clone(&child));
    let task = Arc::new(TaskControlBlock::new(
        Arc::clone(&child),
        parent
            .get_task(0)
            .inner_exclusive_access()
            .res
            .as_ref()
            .unwrap()
            .ustack_base(),
        // here we do not allocate a trap_cx or ustack again,
        // but note that we do allocate a new kstack here
        false,
    ));
    let mut child_inner = child.inner_exclusive_access();
    child_inner.tasks.push(Some(Arc::clone(&task)));
    drop(child_inner);
    ...
}
```
Design System Operation
If we want to create a thread, as naive designers we only need the function's entry address and its argument. Yes, that's it!
```rust
pub fn sys_thread_create(entry: usize, arg: usize) -> isize {
    let task = current_task().unwrap();
    let process = task.process.upgrade().unwrap();
    // create a new thread
    let new_task = Arc::new(TaskControlBlock::new(
        Arc::clone(&process),
        task.inner_exclusive_access().res.as_ref().unwrap().ustack_base,
        true,
    ));
    // add the new task to the scheduler
    add_task(Arc::clone(&new_task));
    let new_task_inner = new_task.inner_exclusive_access();
    let new_task_res = new_task_inner.res.as_ref().unwrap();
    let new_task_tid = new_task_res.tid;
    let mut process_inner = process.inner_exclusive_access();
    // add the new thread to the current process
    let tasks = &mut process_inner.tasks;
    while tasks.len() < new_task_tid + 1 {
        tasks.push(None);
    }
    tasks[new_task_tid] = Some(Arc::clone(&new_task));
    let new_task_trap_cx = new_task_inner.get_trap_cx();
    *new_task_trap_cx = TrapContext::app_init_context(
        entry,
        new_task_res.ustack_top(),
        kernel_token(),
        new_task.kstack.get_top(),
        trap_handler as usize,
    );
    (*new_task_trap_cx).x[10] = arg;
    new_task_tid as isize
}
```
Now sys_exit receives an exit_code and recycles the thread's resources. Notice that if tid == 0, the main thread of the process is exiting, so the whole process exits: its children move to the init process and every sibling thread's resources are recycled.
```rust
let mut recycle_res = Vec::<TaskUserRes>::new();
for task in process_inner.tasks.iter().filter(|t| t.is_some()) {
    let task = task.as_ref().unwrap();
    remove_inactive_task(Arc::clone(&task));
    let mut task_inner = task.inner_exclusive_access();
    if let Some(res) = task_inner.res.take() {
        recycle_res.push(res);
    }
}
```
sys_waittid checks a thread's state and recycles it if possible, returning -2 if it has not exited. Why is it needed? Because sys_exit can't recycle the exiting thread's own control block (except when the process's main thread exits); another thread calls waittid to remove it from the tasks list, after which it is cleared by Rust!
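A sketch of sys_waittid under those rules (-1 for a missing tid or the caller itself, -2 while still running, otherwise the exit code after dropping the TCB; an Option-typed exit_code field is assumed):

```rust
pub fn sys_waittid(tid: usize) -> i32 {
    let task = current_task().unwrap();
    let process = task.process.upgrade().unwrap();
    let task_inner = task.inner_exclusive_access();
    let mut process_inner = process.inner_exclusive_access();
    // a thread cannot wait for itself
    if task_inner.res.as_ref().unwrap().tid == tid {
        return -1;
    }
    let mut exit_code: Option<i32> = None;
    if let Some(waited_task) = process_inner.tasks[tid].as_ref() {
        exit_code = waited_task.inner_exclusive_access().exit_code;
    } else {
        return -1; // the waited thread does not exist
    }
    drop(task_inner);
    if let Some(exit_code) = exit_code {
        process_inner.tasks[tid] = None; // drop the TCB; Rust frees the rest
        exit_code
    } else {
        -2 // the waited thread has not exited yet
    }
}
```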
We now develop the exclusion mechanism mentioned previously.
Besides the construction, we need to abstract the possible situations of data sharing. The usual naive picture: a thread wants to modify one thing, but due to a thread switch the data has already been modified, and we get a wrong result. Based on this, we want an operation to be Atomic, meaning the operation excludes all others. Later we can relax this restriction and generalize it.
Generalization:
Allow a finite number of threads, possibly more than one, to enter one atomic operation.
Allow conditions on an atomic operation.
Before generalizing, we need a way to talk about atomic operations. We call the content of such an operation a Critical Section, and multiple threads operating in an indeterminate time order a Race Condition. The basic problem of data sharing thus pushes us to separate the different operations of different threads: we can't restrict the data itself, because the problem lies in its modification by threads, so we need to Lock operations!
So if there's a lock shared by threads, one thread can declare "Lock it!", and no other thread can enter that section until it is unlocked.
Now, back to our generalization. If the lock has a bound on the number of accessors, many threads may enter until the bound is reached; that's a reasonable design, and we call it a Semaphore. If the lock has a signal that one thread can send to allow others to proceed, that's also a reasonable design, and we call it a Condition Variable.
If the minimal shared thing is really the Lock rather than the data, we can discard the so-called data problem and focus on the lock itself: each thread can do anything while holding the lock, excluding all the others.
Design
No matter which kind of lock, it is shared among threads.
In such a design, one lock can push a thread onto its wait_queue to stop it, and pop from the front to start one again; the inner data is the point of generalization across the various kinds of locks.
Then one process owns many locks used under various conditions; one can easily view them as a generalization of the many pieces of data (though nothing here touches real data) that we want to share.
Basic Lock
Now we want to construct a basic lock allowing simple lock and unlock operations.
Usually there are U-level, M-level, and S-level implementations. We first try the U-level one, then learn the heuristic design at M-level, and extend the basic idea to S-level.
U-level
A naive approach declares a global boolean indicating the locked state: lock waits while the boolean is true and then tries to set it to true, and unlock sets it back to false to release.
```rust
static mut mutex: i32 = 0;

fn lock(mutex: i32) {
    while (mutex) {}
    mutex = 1;
}

fn unlock(mutex: i32) {
    mutex = 0;
}
```
However, that's wrong! We can't build a lock out of the very thing we want to lock: threads can interleave at any instruction and break it! Does that mean we can't do it at U-level? Let's ponder the real situation further: imagine two threads modifying one thing at nearly the same time. If we can set two global states with operations that exclude each other (for example, one state set to 1 while another is set to 0), then only one of the operations can really take effect, and we can check that condition to decide who gets the lock.
```rust
static mut flag: [i32; 2] = [0, 0]; // which thread wants to take the lock?
static mut turn: i32 = 0;           // queue number: whose turn is it? (thread 0 or 1?)
```
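The lock/unlock bodies built on these two states can be sketched as Peterson's algorithm for two threads (id is 0 or 1; illustrative pseudocode in the same style as above, ignoring compiler and CPU reordering):

```rust
unsafe fn lock(id: usize) {
    flag[id] = 1;               // I want the lock
    turn = (1 - id) as i32;     // politely let the other thread go first
    // spin while the other thread wants the lock AND it is its turn
    while flag[1 - id] == 1 && turn == (1 - id) as i32 {}
}

unsafe fn unlock(id: usize) {
    flag[id] = 0;               // I no longer want the lock
}
```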
Analyzing the code: no matter which flag is 1, or whether both are (indicating that a thread wants the lock), turn acts as the excluding state against the flags. Even if the other thread modifies turn at the same time, turn can only end up in one of its states, so only one thread gets the lock.
M-level
Is there any predefined atomic operation among the instructions, which we could then use as a lock? The answer is yes; in RISC-V there are:
AMO: Atomic memory operation
LR/SC: Load Reserved/Store Conditional
AMO: reads the value in memory and writes a new value, then stores the old value into the target register (e.g. amoadd.w rd, rs2, (rs1)).
LR/SC: LR reads memory into the target register and leaves a reservation on that address; SC then checks the reservation and writes data to the address, outputting a status (0/1) to the target register (e.g. lr.w rd, (rs1); sc.w rd, rs2, (rs1)).
We can use these to implement an atomic function:
```
# RISC-V sequence for implementing a TAS at (s1)
        li t2, 1            # t2 <-- 1
Try:    lr t1, s1           # t1 <-- mem[s1] (load reserved)
        bne t1, x0, Try     # if t1 != 0, goto Try:
        sc t0, s1, t2       # mem[s1] <-- t2 (store conditional)
        bne t0, x0, Try     # if t0 != 0 ('sc' instr failed), goto Try:
Locked:
        ...                 # critical section
Unlock:
        sw x0, 0(s1)        # mem[s1] <-- 0
```
The logic of Try: mem[s1] is zero if unlocked, non-zero if locked. Try compares t1 with x0, effectively mem[s1] with 0; if it equals zero, it tries to store t2 into (s1). It then compares the output flag t0 with x0, effectively the store's status with 0; on success it falls through to the critical section, otherwise it repeats. Throughout this process, if the conditional write fails, t0 is non-zero and we loop back to Try.
To Unlock, we write x0 to (s1), setting mem[s1] to zero, the unlocked state.
S-level
Then we can port this function to Rust and package it. A simple refactor: while spinning in the retry loop, we yield and give the CPU to others.
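A sketch of such a yield-based lock in the kernel style used elsewhere in these notes (UPSafeCell and suspend_current_and_run_next are the assumed helpers):

```rust
pub struct MutexSpin {
    locked: UPSafeCell<bool>,
}
impl MutexSpin {
    pub fn lock(&self) {
        loop {
            let mut locked = self.locked.exclusive_access();
            if *locked {
                drop(locked);
                suspend_current_and_run_next(); // yield the CPU instead of spinning hot
            } else {
                *locked = true;
                return;
            }
        }
    }
    pub fn unlock(&self) {
        *self.locked.exclusive_access() = false;
    }
}
```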
Now, whatever kind of lock it is, we can apply it to our structure.
First, when we create a lock, we create it and push it into the process's list, or fill an empty slot.
```rust
// os/src/syscall/sync.rs
pub fn sys_mutex_unlock(mutex_id: usize) -> isize {
    let process = current_process();
    let process_inner = process.inner_exclusive_access();
    let mutex = Arc::clone(process_inner.mutex_list[mutex_id].as_ref().unwrap());
    drop(process_inner);
    drop(process);
    mutex.unlock();
    0
}

// os/src/sync/mutex.rs
impl Mutex for MutexBlocking {
    fn unlock(&self) {
        let mut mutex_inner = self.inner.exclusive_access();
        // ... other operations
        if ... {
            if let Some(waking_task) = mutex_inner.wait_queue.pop_front() {
                add_task(waking_task);
            }
        }
    }
}
```
Semaphore
It's simple: we only need to switch from a boolean to a number and check the bound. The initial count is the bound: when a thread enters, the count decreases by one; when it releases, the count increases by one. We only need to check whether it is positive or negative.
If the initial count equals 1, we are back to a mutex, which means sole-thread access!
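A sketch of the counting design, in the same style as the MutexBlocking snippet above (block_current_and_run_next is the assumed blocking helper):

```rust
pub struct Semaphore {
    pub inner: UPSafeCell<SemaphoreInner>,
}
pub struct SemaphoreInner {
    pub count: isize, // the bound; when negative, |count| threads are waiting
    pub wait_queue: VecDeque<Arc<TaskControlBlock>>,
}
impl Semaphore {
    pub fn down(&self) {
        let mut inner = self.inner.exclusive_access();
        inner.count -= 1;
        if inner.count < 0 {
            inner.wait_queue.push_back(current_task().unwrap());
            drop(inner);
            block_current_and_run_next(); // over the bound: go to sleep
        }
    }
    pub fn up(&self) {
        let mut inner = self.inner.exclusive_access();
        inner.count += 1;
        if inner.count <= 0 {
            if let Some(task) = inner.wait_queue.pop_front() {
                add_task(task); // someone was waiting: wake one thread
            }
        }
    }
}
```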
Actually, we can also use a semaphore for synchronization: set the count to 0. If one thread does down first, it blocks; another thread's up adds one to the count and releases it, and only then can the original thread proceed. The signalling thread's work is thus always ordered before the blocked thread's.
In the example below, first always does its work before second can continue.
```rust
const SEM_SYNC: usize = 0; // semaphore ID

unsafe fn first() -> ! {
    sleep(10);
    println!("First work and wakeup Second");
    semaphore_up(SEM_SYNC); // semaphore V operation
    exit(0)
}

unsafe fn second() -> ! {
    println!("Second want to continue,but need to wait first");
    semaphore_down(SEM_SYNC); // semaphore P operation
    println!("Second can work now");
    exit(0)
}
```
Condition Variable
If we want one thread to have the ability to release the lock for others, we need the CondVar. We dispatch operations through a wait_queue: if one thread signals, it pops out a waiting thread, in effect telling it "You are free!"; if one thread waits, it pushes itself onto the queue. The unlock and lock inside wait matter: unlocking lets other threads modify the condition, but it must come after the push, because if a signal arrived before the push we could never receive that signal again! We don't encapsulate the condition check inside the CondVar, because that should be left to the user to design; we only expose the interface.
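A sketch of the CondVar with the ordering argued above: wait enqueues itself before releasing the mutex, so a signal can't slip in between (names follow the kernel style of the previous snippets):

```rust
impl Condvar {
    pub fn signal(&self) {
        let mut inner = self.inner.exclusive_access();
        if let Some(task) = inner.wait_queue.pop_front() {
            add_task(task); // "You are free!"
        }
    }
    pub fn wait(&self, mutex: Arc<dyn Mutex>) {
        {
            let mut inner = self.inner.exclusive_access();
            inner.wait_queue.push_back(current_task().unwrap()); // enqueue first...
        }
        mutex.unlock(); // ...then let others modify the condition
        block_current_and_run_next();
        mutex.lock();   // re-acquire before returning, so the caller can re-check
    }
}
```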
However, since the condition check is left to the user, we can't guarantee the condition isn't invalidated by concurrent data sharing, so we usually also need a mutex lock around this section.