Paper#9

 

"Capriccio: scalable threads for internet services", by R. Behren, J. Condit, F. Zhou, G. Necula, E. Brewer, Proceedings of the nineteenth ACM symposium on Operating systems principles, 2003, 268--281.[Full Paper]

 

 

Questions

 

  1. True concurrency will break cooperative threading, which relies on thread atomicity to avoid locking resource access. Would it be possible to extend call graph analysis to include data analysis and schedule non-competing threads across multiple CPUs?

 

  1. Capriccio is obviously aimed at high performance, highly concurrent systems. It also seems to require significant program analysis both of the specific application and any additional libraries (in order to manage the linked stack). With the amount of intervention required, can the performance gains be accomplished in another way that is on the same order of complexity?

 

  1. How much of an overhead does the checkpoint code add?

 

  1. How does one decide on the amount of stack chunk to allocate when calling external functions?

 

  1. How does the resource aware scheduling define a generic notion for application productivity?

 

  1. On the second page, the authors state that they believe that user-level threading will be possible on multi-cpu system without cooperative scheduling. Have they done any work since to substantiate these claims?

 

  1. Does the check pointing scheme not increase the cost of recursion relative to a sequential algorithm? Does this cost disparity exist at a similar magnitude in existing implementations because of paging, or is there a greater performance separation in Capriccio?

 

  1. Would a mechanism of directly supplying resource allocation advice to the runtime system really be such a good idea? This seems like it would be a hotbed for weird bugs and hard-to-diagnose performance problems.

 

 

  1. In terms of user-level threads performance, the paper says "Do not require kernel crossing for uncontended locks." Could you explain more about this? Why do contended locks require kernel crossings?

 

  1. In terms of user-level threads disadvantages, why non-blocking I/O requires an additional system call?

 

  1. In terms of Weighted Call Graph, each node is weighed by the maximum stack space consumed by a function, but why is it maximum size, not the exact size?

 

  1. For linked stack management, under some conditions, the checkpoints may not be able to provide correct information, because you may not be able to get what you want during compilation stage. If such things occur, what will happen? Also, I know the paper has provided two ways to solve it. I want to know, if you design the stack, what modification you want to do for it ?

 

  1. The scheme provided by this paper need a lot of help of compiler. Some of those need static and dynamic information. I think the process will cost some time. I am not sure whether that time is less than what the scheme has saved. What is your opinion, without thinking about the benchmark provided by this paper?

 

 

  1. "Capriccio's main scheduling loop looks very much like an event-driven application..."  Is it just me or are we blurring the lines between event-driven models and thread-driven models?  Can't we just conclude that events are better for some things (I/O) and threads are better for others (concurrency)?  Why are we making an effort to apply on technique to an area where it may not be suited?  All we're really doing is providing a layer of abstraction on top of what it seems we don't want to admit is a better way of doing things.

 

  1. I like the idea of automatic resource management tools but I'm not sure why these techniques can't be applied to event-driven code.  It seems that a compiler could generate a call graph for events in the system if it could be made aware of what events it will have to handle.  Why can't we apply this to event-drive models?  What am I missing here?

 

  1. It seems that Capriccio doesn't have a good performance in presence of contended Mutex locks due to numerous kernel crossings. What is the average ratio of contended and un-contended Mutex locking in Internet service applications? What are the major factors that this ratio depends on?

 

  1. One of the key assumptions that they have made in designing the resource-aware scheduler  is that is that resource usage is similar at a blocking point. and it seems to be OK with Apache. Do you know any cases in which this assumption doesn't hold?

 

  1. Could we regard Apache with Capriccio as a better web server than Knot? In Figure 9, while Apache with Capriccio maintains its bandwidth constantly over 10,000 clients, Knot decreases its bandwidth over 10,000 clients. Is the trend due to the resource-aware scheduler of Capriccio or are there other reasons?

 

  1. How can Capriccio support the convenient programming environment, except less modification of existing threaded applications? The authors insist that their thread package requires little modification of existing threaded applications. However, the descriptions of how to modify applications are scattered over the paper, so it is difficult to capture overall modifications. Specially, the authors mention, “asynchronous I/O significantly increases programming complexity and compromises portability,” in Section 2.6, while Capriccio uses asynchronous I/O primitives.

 

  1. The authors say that Capriccio employs spin locks or optimistic concurrency control primitives, depending on which mechanism best fits the situation. What do you think they would measure in order to decide which one of the two to use?

 

  1. I am unsure of what a bottom of a graph is, and what is the top of a graph when speaking of the checkpoints placement in Section 3.2. Since I don't understand it then I can't see why in their discussion on finding the length of the longest path, when adding a checkpoint to the edge between n and s they reduce the longest path of s to zero. Could you, please, explain this?

 

  1. This question pertains to Section 3.3. They say that they categorize function pointers by number and type of arguments. Well, if one of the arguments is a dynamically allocated structure (like an array), then this complicates the dynamic stack allocation. How do you think they could solve this special case?

 

  1. The authors don't specify the data structure for the graphs. Which data structure would you say they use? Considering that authors go for scalability, the number of call graphs could get pretty large.

 

  1. I guess this is more a question for the entire class: Is it true that all of the threads in cooperative scheduling are mutually trusting since they are a part of the same application? The authors state this in the first paragraph of Section 4.3

 

  1. Does the design decision in Capriccio suggest some degree of endorsement for event-driven programming?  (E.g. cooperative scheduling, "scheduling" loop)

 

  1. Are the compiler modifications suggested by the authors reasonable?  Is it safe enough for other architectures or a production compiler?

 

  1. How would one complete a stack unwind with these non-contiguous stacks?

 

  1. It seems that the "tracing" feature has quite a heavy impact on thread context switch latency (more than double with tracing turned on).  How does the resource-aware scheduling with blocking graph be beneficial with so much overhead?

 

 

  1. Capriccio attempted to build a threading package that is transparent to the programmers.  Is this the best way to achieve high performance?  With some adjustment of existing source code, application programmers can provide the threading package information that is expensive to collect at run time, thus achieving the performance without the overhead penalty.

 

  1. I'm not sure if it was mentioned in the paper, but does the resource-aware scheduling still allow for priorities to be set to specific threads?

 

  1. The paper's reliance on the use of compilers that are engineered for their specific needs in order to achieve performance gain seems somewhat...odd.  While it is of course a viable option (as proven by the fact that they have an implementation), doesn't it overextend the traditional concept of building applications to serve a need, instead they're basically molding their surroundings to get the application to work?

 

  1. Each node in the blocking graph is supposed to be the call chain by which a blocking point was reached, but is this chain implicit in the graph (that is, is the chain also a path through the graph) or is it stored explicitly at a node? In other words, is it a chain of blocking system calls (nodes) or of nested function calls? Edges represent consecutive blocking points, but are they consecutive from the perspective of the thread that executes them or are they consecutive in time?

 

  1. How well would resource-aware scheduling handle threads whose resource usage can change dramatically? For example, a thread might first do a large amount of I/O, and then might switch to a more compute intensive mode where it also builds large data structures in memory, and might also intermittently write partial results to disk. Can resource-aware scheduling adapt to this sort of changing resource usage?

 

 

  1. Are there options to add some kind of directive to the code so that it can ensure that checkpoints are not inserted into certain functions? (It would be disastrous if a checkpoint was inserted into a frequently called function.)

 

  1. The creators of Capriccio seems to distrust the programmer's ability to tune the system, is this justified? Wouldn't human judgement and wisdom be better when it comes to seeing thrashing in the system, and adjusting parameters to compensate?

 

  1. How does the info gathering by Capriccio differ from traditional profiling tools and techniques?

 

  1. The NPTL project for Linux has made great strides toward improving the efficiency of Linux kernel threads. These advances include a number of kernel-level improvements such as better data structures, lower memory overhead, and the use of O(1) thread management operations.  What kinds of data structures are used?

 

  1. If extending Capriccio to work with multi-CPU machines, what challenge parts should be taken?

 

  1. (Section 3) Capriccio adopts a dynamic stack allocation policy wherein stack space is allocated to threads on demand in relatively small increments and is deallocated when the thread requires less stack space. The authors mention in comparison that LinuxThreads allocates 2MB of stack space per-thread by default. Do you think they purposely omitted to mention that the stack's size and position could be changed with calls such as setstackattr, or pthread_attr_setstack ?

http://pauillac.inria.fr/~xleroy/linuxthreads/faq.html (question D 10)

http://www.die.net/doc/linux/man/man3/pthread_attr_getstackaddr.3.html

 

  1. The proposed linked stack management idea relies heavily on program analysis both to determine the amount of stack space that must be allocated and the position of the checkpoints in the program, tasks that could possibly be done by a compiler. Could we really hope to see a linked-stack-management compiler emerge that would place the checkpoints in our programs for us automatically ?

 

  1. Concerning the block graph: When calculating the statistics on edges dynamically, how fair do you think including the start-up block times? In other words shouldn’t the system neglect the blocking times that would mostly occur due to non-existent pages in the warming-up period?

 

  1. Again on the blocking graph: In a system that’s relatively dynamic (e.g. web servers) how fair do you think to have each run on the graph having equal vote on the present weights? Would you agree on a proposal where the statistics belonging to past should have less effect on the present than some statistic from a relatively close run? (Say introduce a diminishing factor 1/(t) on the weights when calculating the statistics?)

 

 

Discussion Recap

 

The discussion on this paper started with the question on how to identify which thread is blocking at which blocking point. It was made clear that this is done by studying the stack trace of the particular thread. A question was on what data structures are actually used to store the call/blocking graphs. The paper doesn’t specifically state what Capriccio’s implementation for these data structures is, but it was assumed that whatever implementation that is fast in updating the graphs would probably be implemented. Then the discussion moved on to the case of generating a call graph for external libraries and this was explained as being done by profiling the external library or maybe pre-allocating a fixed conservative stack size for external function calls. Another query raised was whether practically it was possible to have compiler support for inserting check points and this issue is still an open one. There were also concerns whether non-contiguous linked stacks could actually be a security problem (stack overflow exploits), but it was concluded that the design probably does not increase or decrease this security risk. There was also a concern that the system is pretty much automated (the check points are inserted automatically by the compiler and not hand coded) and this does not give the user enough flexibility and so the user might not trust the application doing what it was supposed to do. This was accepted to be a concern but the point was also made that Capriccio does give the user two tuning parameters (MaxPath and MinChunk) to give some option over where the checkpoints are placed. There was also discussion about the New Posix Threads for Linux package and how it compares with LinuxThread package.

 

Slides