#14 - "Improving IPC by Kernel Design", by J. Liedtke, Proceedings of the fourteenth ACM symposium on Operating systems principles, 1993.

 

Questions Posted

 

1. Would the modularity promised by IPC in a microkernel actually be used, or would programmers be tempted to unify components for the sake of performance or a more generalized interface?


2. With time-slice sharing and direct process switching, does this scheduler give processes that use IPC an advantage over more isolated tasks?

 

3. Are the techniques used to improve the speed of IPC minor tweaks or significantly novel ideas? Or is it the method behind the detective work that is novel? Evaluating the changes and their possible interactions is clearly not a trivial task. Did most of the techniques require going down to assembly (i.e., are they processor-dependent)?


4. http://www.usenix.org/events/hotos05/final_papers/full_papers/hand/hand_html/

- Is this the new arena for future flame wars concerning microkernels?

 

5. What is the main contribution of this article? As I read it, I felt that it introduces a series of optimizations (hacks) to improve one aspect of the system. (I am not sure whether these optimizations degrade other components.)


6. What is lazy scheduling? How does it optimize IPC in uniprocessor and multiprocessor environments?

 

7. Just to make things clear: in direct transfer by temporary mapping, the kernel actually maps a part of its address space (the communication window) onto the receiver's address space. Is the mapping like having the kernel address space and the receiver's address space (where the message data needs to go) point to the same physical memory location? And is that why the TLB needs to be flushed?
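The aliasing that question 7 describes can be pictured with a toy model. This is only a sketch under stated assumptions, not the paper's code: `AddressSpace`, `WINDOW_SLOT`, `REGION_SIZE` and `transfer` are hypothetical names, a Python `bytearray` stands in for the physical memory behind one page-directory entry, and the TLB is not modeled at all. On real hardware, a stale cached translation for the window is what would make a flush necessary.

```python
WINDOW_SLOT = 1023  # hypothetical page-directory index reserved for the window
REGION_SIZE = 64    # toy size; a real 486 page-directory entry covers 4 MB


class AddressSpace:
    """Toy address space: page-directory index -> region of 'physical' memory."""

    def __init__(self, name):
        self.name = name
        self.page_directory = {}

    def map_fresh(self, slot):
        """Back the given slot with new zeroed memory and return that memory."""
        self.page_directory[slot] = bytearray(REGION_SIZE)
        return self.page_directory[slot]


def transfer(sender, src_slot, receiver, dst_slot, length):
    """Copy `length` bytes from sender to receiver with a single copy."""
    if not 0 <= length <= REGION_SIZE:
        raise ValueError("length exceeds region size")
    # 1. Alias the receiver's target region into the sender's window slot.
    #    Both directory entries now refer to the same underlying memory.
    sender.page_directory[WINDOW_SLOT] = receiver.page_directory[dst_slot]
    # 2. One copy, performed entirely through the sender's address space.
    window = sender.page_directory[WINDOW_SLOT]
    source = sender.page_directory[src_slot]
    window[:length] = source[:length]
    # 3. Drop the temporary mapping (done eagerly here for simplicity).
    del sender.page_directory[WINDOW_SLOT]
```

Because the window and the receiver's region are the same memory seen through two directory entries, the write in step 2 is immediately visible to the receiver without a second copy. The sketch unmaps eagerly; it takes no position on when the real kernel does so.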


8. In lazy scheduling, queue updates are done lazily to save queue manipulation instructions, and the scheduler checks each TCB's state to figure out whether it belongs in a particular queue. But won't the scheduler have to check each TCB every time it parses the queue? Doesn't that defeat the purpose of having queues for quicker access?
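One way to reason about question 8 is with a small model of a lazily maintained ready queue. This is a sketch under assumptions, not L3's implementation: the names `TCB`, `LazyReadyQueue`, `queued` and `queue_ops` are hypothetical, and the model only counts queue manipulations. Blocking changes the TCB state and leaves the queue alone; the scheduler removes a stale entry the first time it meets one, so each stale entry is paid for once rather than on every scan.

```python
from collections import deque


class TCB:
    def __init__(self, name):
        self.name = name
        self.state = "ready"   # "ready" or "waiting"
        self.queued = False    # is this TCB currently linked into the ready queue?


class LazyReadyQueue:
    def __init__(self):
        self.queue = deque()
        self.queue_ops = 0     # insertions plus removals actually performed

    def make_ready(self, tcb):
        tcb.state = "ready"
        if not tcb.queued:     # only touch the queue if the TCB fell out of it
            self.queue.append(tcb)
            tcb.queued = True
            self.queue_ops += 1

    def block(self, tcb):
        tcb.state = "waiting"  # lazy: leave the stale entry in the queue

    def pick_next(self):
        while self.queue:
            tcb = self.queue[0]
            if tcb.state == "ready":
                self.queue.rotate(-1)   # round-robin: move it to the back
                return tcb
            self.queue.popleft()        # stale entry: remove it now, once
            tcb.queued = False
            self.queue_ops += 1
        return None
```

In this model, a thread that blocks and becomes ready again a thousand times between scheduler scans causes no queue manipulation at all, which is the case the optimization targets. The scan cost grows only with the number of stale entries left behind since the previous scan.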

 

9. I am somewhat concerned about the paper's use of the 'Direct Transfer by Temporary Mapping' technique. It seems to me that it goes through the following steps:

 

1) Measure size of message in A

2) Locate/allocate memory of that size

3) Transfer message to that location.


But what happens if the size of the message is changed, either unintentionally or maliciously, between steps 1 and 3? Would an overflow occur?
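The race in question 9 is a check-then-use problem, and a common general defense is to read the size once and use only that snapshot afterwards. The sketch below illustrates that pattern; it is not a claim about what L3 actually does. `Message`, `copy_message` and the `between_check_and_copy` hook are hypothetical names, and the hook exists only to simulate the sender changing the message after the size check.

```python
class Message:
    """Sender-side message descriptor in memory the sender can still modify."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.length = len(data)


def copy_message(msg, receiver_buf, between_check_and_copy=None):
    """Copy msg into receiver_buf using a single snapshot of the length."""
    n = msg.length                      # step 1: read the length exactly once
    if n < 0 or n > len(receiver_buf):  # step 2: validate against the target
        raise ValueError("message does not fit in receive buffer")
    if between_check_and_copy is not None:
        between_check_and_copy(msg)     # models the sender racing with the copy
    chunk = bytes(msg.data[:n])         # step 3: copy exactly n bytes, no more
    if len(chunk) != n:
        raise ValueError("sender buffer shorter than declared length")
    receiver_buf[:n] = chunk
    return n
```

If the sender inflates `msg.length` after the check, the copy still moves only the `n` bytes that were validated, so the receive buffer cannot overflow. The content of those bytes may still change underneath the copy; the snapshot protects the bounds, not the data.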

 

10. Exactly what unit of measurement corresponds to "not seriously impacted"? (This is referring to the 'IPC is Master' section)

 

11. What is a µ-kernel? Is it just another name for a microkernel? In the performance calculations, why does this paper use both microseconds and cycles as metrics? Which one is more appropriate?


12. The minimal cost for transferring control is 127 cycles. Do modern processors perform these operations more efficiently?
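Questions 11 and 12 both depend on how cycle counts relate to wall-clock time. The conversion is simple arithmetic, sketched below. The 50 MHz clock is an assumption taken from the 486 machine used for the paper's measurements, and the function names are hypothetical.

```python
def cycles_to_microseconds(cycles, clock_mhz):
    """A clock of N MHz executes N cycles per microsecond."""
    return cycles / clock_mhz


def microseconds_to_cycles(microseconds, clock_mhz):
    """Inverse conversion: microseconds times cycles-per-microsecond."""
    return microseconds * clock_mhz


CLOCK_MHZ = 50  # assumption: the 50 MHz 486 used for the paper's measurements

# The 127-cycle minimum from question 12, expressed in microseconds.
minimum_us = cycles_to_microseconds(127, CLOCK_MHZ)
print(minimum_us)
```

Cycle counts stay comparable across clock speeds within one processor design, while microseconds describe what a user of a particular machine observes. The same 127 cycles would take a quarter of the time at 200 MHz, assuming the cycle count itself did not change.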


13. Is there any kernel implementation that uses ideas from this paper to improve performance? If so, how does its performance compare to L3's?

 

14. What approach did the IPC implementation follow to achieve higher performance than previous implementations?

 

15. It seems that many techniques need to be applied at all levels, from architecture to coding. Is it easy to apply all of these to other microkernels and different hardware?

 

16. Regarding temporary mapping, how does the kernel know when to un-map the sender's communication window from the receiver's address space?


17. Also, what is the advantage of a per-thread communication window over a single kernel-managed IPC memory region? Wouldn't the latter make more efficient use of memory, in particular when the amount of data transferred varies widely from one thread to another?

 

18. Do you think similar performance gains will be achieved by mapping the communication window onto the receiver’s memory space via the page directory when the system is used on multiprocessors or in distributed systems?


19. The paper on L3 does not discuss portability or the ability to communicate with different types of servers in a system, unlike the Mach system whose performance it is compared against. Do you think these issues are less important than the performance gain the system achieves?

 

20. The author focused his discussion on microkernels. What is the difference between a microkernel and a regular kernel? Why can't we just use a regular kernel?


21. What is involved in entering/exiting OS kernels? Why is it so expensive?

 

22. How can we pass short messages via registers between different kernels? Don't registers get flushed during a context switch?

 

23. One of the strengths of the Mach kernel's design is that it handles only communication, while all other features are implemented at user level, and the kernel is optimized to manage IPC quickly. Would you say that extensibility of the kernel is needed, given that it is optimized to work solely with IPC?


24. It seems to me that all objects have a port associated with them. Is this correct?


25. What do you think of the idea of having both global and local run queues with respect to fairness for tasks that are I/O-bound and do not perform some chore for the kernel?


26. How would you comment on the fixed quantum size for both I/O-bound and processor bound tasks?

 

27. When do you unmap the temporary 'communication window' that was mapped during a message transfer?


28. How do you avoid a page fault while parsing through the virtual queues?


29. In the clans and chiefs model, do the chiefs have unique UIDs throughout the global space?

 

30. Of the design decisions in Sections 5.2.1 to 5.5.6, which are excellent and which are not?

 

31. In what respect does L3 improve performance?

 

32. In Direct Message Copy from a message sender A to a message receiver B, the window that is created can be accessed only by the kernel and B for as long as the window exists. This access restriction is probably there to prevent A from tampering with the message while B is reading/checking it. What happens if A has some vital information that needs to be accessed on the same page as the message that was sent to B?


33. If it is more efficient to use only a flat segment covering the entire address space, why were segment registers introduced in the hardware in the first place?

 

34. How much faster (if at all) is IPC on an OS that doesn't use message passing?


35. Would using system calls rather than messages for system services halve the number of boundary crossings (from 2 to 1)? (That is, in operating systems that don't offload parts of themselves into a user-mode "server".)


36. For those who have way too much time, here are some numbers I measured on Windows:


CPU clocks for a PostMessage call (Win32 send asynchronous message)

 

Pentium 3, Windows 2000 UP: 673 clocks

Athlon, Windows 2000 SMP: 760 clocks

Athlon, Windows 2000 UP: 512 clocks

Athlon64, Windows XP UP: 485 clocks

Pentium M, Windows XP UP: 612 clocks

Core 2 Duo, Windows XP SMP: 1060 clocks

Pentium 4, Windows XP UP: 1055 clocks

 

CPU clocks for a Windows system call:


Pentium 3, Windows 2000 UP: 380 clocks

Athlon, Windows 2000 SMP: 330 clocks

Athlon, Windows 2000 UP: 290 clocks

Athlon64, Windows XP UP: 215 clocks

Pentium M, Windows XP UP: 305 clocks

Core 2 Duo, Windows XP SMP: 400 clocks

Pentium 4, Windows XP UP: 650 clocks


- UP = Uniprocessor, SMP = Multiprocessor

- Windows XP supports the SYSENTER/SYSEXIT and SYSCALL/SYSRET instructions, which AMD claims are ~4x faster than a software interrupt. Windows 2000 does not. (The above data doesn't show this effect though, since nobody other than me seems to run Windows 2000...)


- There seems to be a factor of at least 2x between different implementations of the ~same ISA. I would be cautious about the reliability of the numbers found in Table 2 (Null-RPC performance), since they span different architectures and implementations.
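The "factor of at least 2x" remark can be checked directly against the numbers listed above. The sketch below copies those measurements and computes the ratio of the slowest to the fastest machine in each group; the dictionary and function names are hypothetical.

```python
# Clock counts copied from the two lists in question 36.
post_message = {
    "Pentium 3, Windows 2000 UP": 673,
    "Athlon, Windows 2000 SMP": 760,
    "Athlon, Windows 2000 UP": 512,
    "Athlon64, Windows XP UP": 485,
    "Pentium M, Windows XP UP": 612,
    "Core 2 Duo, Windows XP SMP": 1060,
    "Pentium 4, Windows XP UP": 1055,
}
system_call = {
    "Pentium 3, Windows 2000 UP": 380,
    "Athlon, Windows 2000 SMP": 330,
    "Athlon, Windows 2000 UP": 290,
    "Athlon64, Windows XP UP": 215,
    "Pentium M, Windows XP UP": 305,
    "Core 2 Duo, Windows XP SMP": 400,
    "Pentium 4, Windows XP UP": 650,
}


def spread(samples):
    """Ratio of the slowest to the fastest measurement in a group."""
    return max(samples.values()) / min(samples.values())


print(round(spread(post_message), 2))  # PostMessage: 1060 / 485
print(round(spread(system_call), 2))   # system call: 650 / 215
```

The PostMessage spread comes out at about 2.19 and the system call spread at about 3.02, which is consistent with the caution about comparing cycle counts across different implementations.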

 

37. What are the problems with LRPC and RPC sharing user level memory of client and server to transfer messages?


38. Do the L3 optimizations address all overheads of microkernel OSes (compared to monolithic ones)? If not, which ones are not addressed?

 

39. How well do the optimization techniques discussed in this paper apply to modern processors, which now have longer pipelines and larger caches, but also much higher memory latency?

40. How well does L3 IPC scale with the number of processes in the system?

 

41. I find it interesting that for large messages, the performance increase isn't that great (Table 1). Is there a reason for this?


42. A lot of the performance gain is due to the decrease of security in the system.  Do you think that's a good trade-off?  Are we favoring performance a little too much?

43. It's nice to have efficient implementations but if you're tuning your software to certain platforms you should be careful!  Although the author says most ideas are portable, do you believe that and would anyone really want to work at porting them to each new hardware interface that becomes popular?

 

44. Does the speed-up of L3 over Mach rely on the author's assertion (page 4) that most IPC/RPC is synchronous, and that most of the copying between address spaces can therefore be avoided?


45. Is the comparison of L3 vs. Mach on a single machine really valid? Mach supports diverse heterogeneous environments, yet there don't seem to be tests for this with L3. Would the greater speeds for L3 on a single machine be lost when transferring through networks where copying actually has to occur?

 

46. Could you provide some information on which OSes are using this technique?

 

47. This technique depends on the processor. From the paper, I know it does not work well on the 486 processor. I want to know whether all x86 processors have such problems.

 

48. The paper says IPC performance is vital for modern operating systems, especially µ-kernel based ones. What is a µ-kernel? And how does it differ from a normal kernel?

49. According to the experiments, Mach has greater performance than L3 on IPC times. Conversely, is there anything that L3 does better than Mach?

 

Post - Presentation Discussion

 

While discussing the paper, two observations were made:

 

1) There is no single trick to obtaining high IPC performance in microkernels; rather, a synergetic approach to design and implementation is needed at all levels, from architectural design down to coding. The paper discusses the implementation aspects leading to better IPC performance.

 

2) Are Virtual Machine Monitors Microkernels done right? Reading references:

 

(a) Steven Hand, Andrew Warfield, Keir Fraser, Evangelos Kotsovinos, and Dan Magenheimer. "Are virtual machine monitors microkernels done right?" In Proceedings of the 10th Workshop on Hot Topics in Operating Systems, Santa Fe, USA, June 2005.

http://berkeley.intel-research.net/troscoe/cs262a/vmm-ukernel-hotos-final.pdf

 

(b) Heiser, G., Uhlig, V., and LeVasseur, J. "Are virtual-machine monitors microkernels done right?" SIGOPS Operating Systems Review 40, 1 (Jan. 2006)

http://www.ertos.nicta.com.au/publications/papers/Heiser_UL_06.pdf

 

The Presentation Slides can be viewed here