Paper#9
"Capriccio: scalable threads
for internet services", by R. Behren, J. Condit, F. Zhou, G.
Necula, E. Brewer, Proceedings of the nineteenth ACM symposium on Operating
systems principles, 2003, 268--281.[Full
Paper]
Questions
- True
concurrency will break cooperative threading, which relies on thread
atomicity to avoid locking resource access. Would it be possible to extend
call graph analysis to include data analysis and schedule non-competing
threads across multiple CPUs?
- Capriccio
is obviously aimed at high performance, highly concurrent systems. It also
seems to require significant program analysis both of the specific
application and any additional libraries (in order to manage the linked
stack). With the amount of intervention required, can the performance
gains be accomplished in another way that is on the same order of
complexity?
- How
much of an overhead does the checkpoint code add?
- How
does one decide on the amount of stack chunk to allocate when calling
external functions?
- How
does the resource aware scheduling define a generic notion for application
productivity?
- On the
second page, the authors state that they believe that user-level threading
will be possible on multi-cpu system without cooperative scheduling. Have
they done any work since to substantiate these claims?
- Does
the check pointing scheme not increase the cost of recursion relative to a
sequential algorithm? Does this cost disparity exist at a similar
magnitude in existing implementations because of paging, or is there a
greater performance separation in Capriccio?
- Would
a mechanism of directly supplying resource allocation advice to the runtime
system really be such a good idea? This seems like it would be a hotbed
for weird bugs and hard-to-diagnose performance problems.
- In
terms of user-level threads performance, the paper says "Do not
require kernel crossing for uncontended locks." Could you explain
more about this? Why do contended locks require kernel crossings?
- In
terms of user-level threads disadvantages, why non-blocking I/O requires
an additional system call?
- In
terms of Weighted Call Graph, each node is weighed by the maximum stack
space consumed by a function, but why is it maximum size, not the exact
size?
- For
linked stack management, under some conditions, the checkpoints may not be
able to provide correct information, because you may not be able to get what
you want during compilation stage. If such things occur, what will happen?
Also, I know the paper has provided two ways to solve it. I want to know,
if you design the stack, what modification you want to do for it ?
- The
scheme provided by this paper need a lot of help of compiler. Some of
those need static and dynamic information. I think the process will cost
some time. I am not sure whether that time is less than what the scheme
has saved. What is your opinion, without thinking about the benchmark provided
by this paper?
- "Capriccio's
main scheduling loop looks very much like an event-driven
application..." Is it just me
or are we blurring the lines between event-driven models and thread-driven
models? Can't we just conclude that
events are better for some things (I/O) and threads are better for others
(concurrency)? Why are we making an
effort to apply on technique to an area where it may not be suited? All we're really doing is providing a
layer of abstraction on top of what it seems we don't want to admit is a
better way of doing things.
- I like
the idea of automatic resource management tools but I'm not sure why these
techniques can't be applied to event-driven code. It seems that a compiler could generate
a call graph for events in the system if it could be made aware of what
events it will have to handle. Why
can't we apply this to event-drive models?
What am I missing here?
- It
seems that Capriccio doesn't have a good performance in presence of
contended Mutex locks due to numerous kernel crossings. What is the
average ratio of contended and un-contended Mutex locking in Internet
service applications? What are the major factors that this ratio depends
on?
- One of
the key assumptions that they have made in designing the resource-aware
scheduler is that is that resource
usage is similar at a blocking point. and it seems to be OK with Apache.
Do you know any cases in which this assumption doesn't hold?
- Could
we regard Apache with Capriccio as a better web server than Knot? In
Figure 9, while Apache with Capriccio maintains its bandwidth constantly
over 10,000 clients, Knot decreases its bandwidth over 10,000 clients. Is
the trend due to the resource-aware scheduler of Capriccio or are there
other reasons?
- How
can Capriccio support the convenient programming environment, except less
modification of existing threaded applications? The authors insist that
their thread package requires little modification of existing threaded
applications. However, the descriptions of how to modify applications are
scattered over the paper, so it is difficult to capture overall
modifications. Specially, the authors mention, “asynchronous I/O
significantly increases programming complexity and compromises
portability,” in Section 2.6, while Capriccio uses asynchronous I/O
primitives.
- The
authors say that Capriccio employs spin locks or optimistic concurrency
control primitives, depending on which mechanism best fits the situation.
What do you think they would measure in order to decide which one of the
two to use?
- I am
unsure of what a bottom of a graph is, and what is the top of a graph when
speaking of the checkpoints placement in Section 3.2. Since I don't
understand it then I can't see why in their discussion on finding the
length of the longest path, when adding a checkpoint to the edge between n
and s they reduce the longest path of s to zero. Could you, please,
explain this?
- This
question pertains to Section 3.3. They say that they categorize function
pointers by number and type of arguments. Well, if one of the arguments is
a dynamically allocated structure (like an array), then this complicates
the dynamic stack allocation. How do you think they could solve this
special case?
- The
authors don't specify the data structure for the graphs. Which data
structure would you say they use? Considering that authors go for
scalability, the number of call graphs could get pretty large.
- I
guess this is more a question for the entire class: Is it true that all of
the threads in cooperative scheduling are mutually trusting since they are
a part of the same application? The authors state this in the first
paragraph of Section 4.3
- Does
the design decision in Capriccio suggest some degree of endorsement for
event-driven programming? (E.g.
cooperative scheduling, "scheduling" loop)
- Are
the compiler modifications suggested by the authors reasonable? Is it safe enough for other
architectures or a production compiler?
- How
would one complete a stack unwind with these non-contiguous stacks?
- It
seems that the "tracing" feature has quite a heavy impact on
thread context switch latency (more than double with tracing turned
on). How does the resource-aware
scheduling with blocking graph be beneficial with so much overhead?
- Capriccio
attempted to build a threading package that is transparent to the
programmers. Is this the best way
to achieve high performance? With
some adjustment of existing source code, application programmers can
provide the threading package information that is expensive to collect at
run time, thus achieving the performance without the overhead penalty.
- I'm
not sure if it was mentioned in the paper, but does the resource-aware
scheduling still allow for priorities to be set to specific threads?
- The
paper's reliance on the use of compilers that are engineered for their
specific needs in order to achieve performance gain seems
somewhat...odd. While it is of
course a viable option (as proven by the fact that they have an
implementation), doesn't it overextend the traditional concept of building
applications to serve a need, instead they're basically molding their
surroundings to get the application to work?
- Each
node in the blocking graph is supposed to be the call chain by which a
blocking point was reached, but is this chain implicit in the graph (that
is, is the chain also a path through the graph) or is it stored explicitly
at a node? In other words, is it a chain of blocking system calls (nodes)
or of nested function calls? Edges represent consecutive blocking points,
but are they consecutive from the perspective of the thread that executes
them or are they consecutive in time?
- How
well would resource-aware scheduling handle threads whose resource usage
can change dramatically? For example, a thread might first do a large
amount of I/O, and then might switch to a more compute intensive mode
where it also builds large data structures in memory, and might also
intermittently write partial results to disk. Can resource-aware
scheduling adapt to this sort of changing resource usage?
- Are
there options to add some kind of directive to the code so that it can
ensure that checkpoints are not inserted into certain functions? (It would
be disastrous if a checkpoint was inserted into a frequently called
function.)
- The
creators of Capriccio seems to distrust the programmer's ability to tune
the system, is this justified? Wouldn't human judgement and wisdom be
better when it comes to seeing thrashing in the system, and adjusting
parameters to compensate?
- How
does the info gathering by Capriccio differ from traditional profiling
tools and techniques?
- The
NPTL project for Linux has made great strides toward improving the
efficiency of Linux kernel threads. These advances include a number of
kernel-level improvements such as better data structures, lower memory
overhead, and the use of O(1) thread management operations. What kinds of data structures are used?
- If
extending Capriccio to work with multi-CPU machines, what challenge parts
should be taken?
- (Section
3) Capriccio adopts a dynamic stack allocation policy wherein stack space
is allocated to threads on demand in relatively small increments and is
deallocated when the thread requires less stack space. The authors mention
in comparison that LinuxThreads allocates 2MB of stack space per-thread by
default. Do you think they purposely omitted to mention that the stack's
size and position could be changed with calls such as setstackattr, or
pthread_attr_setstack ?
http://pauillac.inria.fr/~xleroy/linuxthreads/faq.html
(question D 10)
http://www.die.net/doc/linux/man/man3/pthread_attr_getstackaddr.3.html
- The
proposed linked stack management idea relies heavily on program analysis
both to determine the amount of stack space that must be allocated and the
position of the checkpoints in the program, tasks that could possibly be
done by a compiler. Could we really hope to see a linked-stack-management
compiler emerge that would place the checkpoints in our programs for us
automatically ?
- Concerning
the block graph: When calculating the statistics on edges dynamically, how
fair do you think including the start-up block times? In other words
shouldn’t the system neglect the blocking times that would mostly occur
due to non-existent pages in the warming-up period?
- Again
on the blocking graph: In a system that’s relatively dynamic (e.g. web
servers) how fair do you think to have each run on the graph having equal
vote on the present weights? Would you agree on a proposal where the
statistics belonging to past should have less effect on the present than
some statistic from a relatively close run? (Say introduce a diminishing
factor 1/(t) on the weights when calculating the statistics?)
Discussion Recap
The discussion on this paper started with the question on
how to identify which thread is blocking at which blocking point. It was made
clear that this is done by studying the stack trace of the particular thread. A
question was on what data structures are actually used to store the
call/blocking graphs. The paper doesn’t specifically state what Capriccio’s
implementation for these data structures is, but it was assumed that whatever
implementation that is fast in updating the graphs would probably be
implemented. Then the discussion moved on to the case of generating a call
graph for external libraries and this was explained as being done by profiling
the external library or maybe pre-allocating a fixed conservative stack size
for external function calls. Another query raised was whether practically it
was possible to have compiler support for inserting check points and this issue
is still an open one. There were also concerns whether non-contiguous linked
stacks could actually be a security problem (stack overflow exploits), but it
was concluded that the design probably does not increase or decrease this
security risk. There was also a concern that the system is pretty much
automated (the check points are inserted automatically by the compiler and not
hand coded) and this does not give the user enough flexibility and so the user
might not trust the application doing what it was supposed to do. This was
accepted to be a concern but the point was also made that Capriccio does give
the user two tuning parameters (MaxPath and MinChunk) to give some option over
where the checkpoints are placed. There was also discussion about the New Posix
Threads for Linux package and how it compares with LinuxThread package.
Slides