I am interested in an efficient implementation of usermode scheduling using threads. Plain stack switching doesn't play well with thread-local storage, and since the kernel's per-thread state isn't swapped, things like nice don't work either. Debugging stack-switching code is also painful.
I am confident this can be done efficiently (roughly 200ns per switch) with specialized system calls, as claimed in this video:
https://www.youtube.com/watch?v=KXuZi9aeGTw
However, the presenter apparently hasn't shared the source code, so I've been trying to implement it myself. I am a noob though, so I am not quite sure how to get this right. So far I have the basic working code below, but it is nowhere near as efficient as it could be: it takes roughly 10us per switch according to a microbenchmark I did, although that is on multicore. Disabling all cores but #0 speeds it up to 1.6us per switch. As I understand it, one possible optimization is to simply replace the current node in the runqueue with the next one, instead of removing the current node and then inserting the next. I don't know what that would look like or what you'd have to take into consideration to implement it.
So, how to improve this?
Code:
SYSCALL_DEFINE2(ums, int, mode, pid_t, pid)
{
	struct task_struct *next;

	switch (mode) {
	case 0: /* suspend current */
		set_current_state(TASK_INTERRUPTIBLE);
		schedule();
		return 0;
	case 1: /* suspend current, run @pid */
		/* pid -> task lookup must happen under RCU, and we take a
		 * reference so the task can't go away under us */
		rcu_read_lock();
		next = find_task_by_vpid(pid);
		if (next)
			get_task_struct(next);
		rcu_read_unlock();
		if (!next)
			return -ESRCH;
		set_current_state(TASK_INTERRUPTIBLE);
		/* busy-wait until @pid is actually asleep and we wake it;
		 * wake_up_process() returns 0 if it was already runnable */
		while (!wake_up_process(next))
			cpu_relax();
		put_task_struct(next);
		schedule();
		return 0;
	default:
		return -EINVAL;
	}
}