Recap:
There wasn't much discussion due to time constraints. Some of the discussion
concerned the frequency of interrupts and the fact that we cannot monitor
only some parts of the system in order to reproduce a valid history of an operating
system's execution. Also, there was some discussion on how to implement time
traveling virtual machines.
|
Is it likely (or possible) that OS-on-OS could obscure some bugs?
Particularly in those areas related to making the guest OS run in user mode?
Is it possible to implement (some) of this logging without resorting to virtual machines / OS-on-OS? |
|
Are you convinced that this particular implementation can reproduce the
playback reliably enough for debugging purposes? (No, I've seen bugs that
only appear outside a debugger, but behave fine when run in a debugger. Nasty
timing-related bugs.)
Wouldn't this be a nifty way of backing up machine state? (No, because a snapshot would be sufficient and you don't need the debugging capability.) Is it possible to somehow extend this into a self-repairing OS of some kind? (That would be so cool.) |
|
I think the practical use of this paper is limited by the lack of
multiprocessor support. It would have been better for the authors to indicate
how multiprocessors could be supported. Could you add some of your personal
opinions with respect to this aspect?
Could you give more explanation of what ska does? I would like to know how things might work on other architectures besides x86. The authors claimed that 98% of the code in the guest OS is debuggable. I found it hard to confirm this quantitatively. Where does the remaining, undebuggable 2% come from? |
|
What are the limitations of the time-traveling virtual machine (in
the sense of "using a para-virtualized VMM such as UML raises the
issue of fidelity")? Is UML's VMM suitable for debugging any kind
of operating system?
Is the performance, along with sample debugging results of the four kernel bugs, convincing enough to conclude that we should use TTVM? How do you measure the applicability of the method? |
| When I was reading through this paper, I was baffled as to why they were emphasizing performance so much. After all, it's not like debugging is a time-sensitive process, really. However, when I saw that the reverse step operation took *12 seconds* to run on average, I understood. I've used debuggers that allow you to walk backwards -- specifically, the ocaml debugger -- and my usage pattern with the reverse step really wasn't any different than my usage pattern with the forward step -- that is, reverse step was a frequently used operation. At 12 seconds a pop, that would be fairly agonizing. How does this affect the real utility of the rewinding-debugger paradigm? Is it really the case that most users will only step backwards a few times per session, and that I am somewhat of an exception? |
|
The granularity of reverse time-travelling in TTVM seems to be from
checkpoint to checkpoint. Per-instruction reverse stepping is
achieved by re-running from the closest checkpoint, which could be
time-consuming if checkpoints are not taken often enough (to reduce the
total checkpoint size). Is there a better way to implement reverse
time-travelling (e.g. by having a coarse-grained logging mechanism)?
Would the guest kernel need to be modified if TTVM were implemented on hardware-based virtualization technology (e.g. AMD-V, Intel VT)? |
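The re-run-from-the-closest-checkpoint approach raised in this question can be sketched as a toy model. This is only an illustration under simplifying assumptions, not TTVM's implementation: the `Machine` class, its two-field state, and the `checkpoint_interval` parameter are hypothetical, and a real system would also have to replay logged nondeterministic events such as interrupts.

```python
import copy


class Machine:
    """Hypothetical deterministic machine: a step counter and an accumulator."""

    def __init__(self):
        self.step_count = 0
        self.acc = 0

    def step(self):
        # One deterministic "instruction".
        self.acc = (self.acc * 31 + self.step_count) % 1000003
        self.step_count += 1


class ReplayDebugger:
    """Reverse stepping by restoring a checkpoint and replaying forward."""

    def __init__(self, machine, checkpoint_interval):
        self.machine = machine
        self.interval = checkpoint_interval
        # Always keep a checkpoint of the initial state.
        self.checkpoints = {0: copy.deepcopy(machine)}
        self.replayed = 0  # instructions re-executed by reverse steps

    def forward(self, n=1):
        for _ in range(n):
            self.machine.step()
            count = self.machine.step_count
            if count % self.interval == 0 and count not in self.checkpoints:
                self.checkpoints[count] = copy.deepcopy(self.machine)

    def reverse_step(self):
        target = self.machine.step_count - 1
        if target < 0:
            raise ValueError("already at the start of execution")
        # Restore the closest checkpoint at or before the target ...
        base = max(c for c in self.checkpoints if c <= target)
        self.machine = copy.deepcopy(self.checkpoints[base])
        # ... then replay forward until just before the current instruction.
        while self.machine.step_count < target:
            self.machine.step()
            self.replayed += 1
```

In this sketch the cost of one reverse step is bounded by the checkpoint interval: a larger interval stores fewer checkpoints but replays more instructions per reverse step, which is the trade-off the question points at.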
|
Doesn't running an OS on a virtual machine, with the overheads (though
shrinking) involved, seem like overkill just to debug it? Maybe debugging a
development kernel is what the authors intended?
Isn't it possible to create a time-traveling debugger for an application without using a virtual machine implementation? Can't something be done at the kernel level to keep track of asynchronous events (interrupts, exceptions) that interrupt the application and thus replay its execution? |
|
A common problem with traditional debuggers is that using the debugger
changes the timing of events in the application. For example, threads
reach critical sections in a different order, and I/O completes at a
different point in the program's execution. This can make
timing-dependent bugs impossible to reproduce when running with
the debugger, even if they happen consistently without it. Does TTVM
solve this problem, or does it actually make it worse?
This approach would be useful not only for operating systems but also for any multi-threaded application, but the performance impact of the VM limits its applicability to real-time applications (e.g. systems for robot control). Is there a way around that? |
|
Some bugs are extremely sensitive to the timing of execution. The
introduction of logging overhead (even if it's only once every 20 seconds)
might hide/introduce some bugs that are not apparent without the overhead.
Are there any ways to prevent this from occurring?
On a related note to the first question, can we always debug low level code like the kernel in a virtual machine? |
|
I found another paper that has implemented reverse execution at the
instruction level. They use dynamic translation of the debuggee program
to generate code that performs compensating actions. They actually
generate a "reverse program" and use it during the time travel. Do you
think this approach is applicable to OS debugging too? I think the
major problem is again those nondeterministic events that happen in an
OS environment. What do you think?
It seems that this debugging environment is designed to catch bugs after the development phase of an OS. But in this case the overheads are large if you want to run the OS in this environment for a long time in order to wait for a random bug to appear. Why didn't they use a virtual machine like Xen in order to improve the performance? What are the problems of implementing this method in a complete virtual machine compared to a para-virtualized machine? The link to the paper I mentioned: http://www.eecs.harvard.edu/cs261/final-project/extended-abstracts/prashanth_brett.pdf |
|
It looks great. Is there any weakness of the proposed idea? For
example, what happens if the host operating system has bugs? What happens if
hardware evolves? Is there any better way to implement this mechanism
without a VMM? What portion of operating system errors does the mechanism
cover?
If the mechanism is implemented in Xen without ReVirt (and GDB), how can the architecture be depicted? |
|
As with ReVirt, which this paper relies on heavily, don't things
like VM rootkits that could corrupt the actual log underneath your guest OS
make things less secure than the authors would like to believe?
Is it just me, or does the entire section 6 seem kinda...hazy? The authors' use of the 'traditional' debugging methods first had already allowed them to suss out enough information about the nature of the bug that to then apply their 'reverse debugging' and claim how much easier it was to use reeks of pollution of the results (in this case, that their reverse gdbing was better) due to learning gained in the previous trial. |
|
USB and sound are relatively low-speed devices.
Would it need to log a copy of all network data sent/received?
Would it be feasible to debug a race condition that occurs only on a
saturated 10/1 Gbps router?
Which TTVM will help debug the TTVM? =) |
|
This is a very cool idea but what do we do in cases where static information
is altered by the OS? For instance, if a file is removed, do we have to
backup that information so that the replay can be complete?
We're talking about a system to help debug our operating systems, right? They're not really expecting the everyday user to use this tool, right? Why then is the performance such a huge consideration? It's nice to see a good tool to help build better systems. |
|
Why did they choose 25-second intervals between checkpoints? Why is 4%
acceptable (not more or less)?
"Time-travel operations change virtual-machine state but should preserve debugging state." Can you explain this statement further? |
|
Here are my questions:
Do you think that this technique can be adapted for debugging parallel applications, which generally involve high replay cost and complexity (multiprocessors)? Do you think that this idea can be easily applied to non-x86 architectures, given that both ReVirt and the idea of reverse debugging rely on specific counters and registers of x86? |
|
I have only limited kernel debugging experience: some with Linux intrusion
detection through VMware and some with Solaris core dumps using kdb. I have
not used tools like TTVM, so I cannot give a thorough evaluation of this
scheme. However, my first impression after reading this paper is that this
scheme cannot guarantee an environment identical to that of the original OS.
What do you think?
How will TTVM handle bugs that occur only after the system has been running for a long time? Do you think the approach provided in this paper is enough to handle such bugs? As far as I know, some bugs may not occur again even if you record all state and replay it. How should this kind of bug be handled? I am very interested in this issue. |
|
In this table, why didn't the authors explore non-traditional debugging
operations that are enabled by time travel and deterministic replay?
ReVirt also uses a virtual machine. Compared with this paper, how does ReVirt's approach to debugging differ? |
|
How can TTVM be extended to identify intrusions [as in paper #20]
that tend to affect/hamper the functioning of the Operating System?
In your view, how can TTVM be enhanced to identify OS bugs that it has not yet encountered or might not encounter in the near future? How can the entire range of OS bugs be identified? In general, which OS processes experience the most bugs and the most significant bugs? Would it be sufficient to monitor those sections of the OS alone with TTVM? |
|
During checkpointing, how would you capture network state and replay it
later?
Why wasn't the approach implemented on a more generic VMM (i.e. some virtual machine software, say QEMU, which can run the guest OS unmodified)? Since the goal of this work is to debug the OS code, wouldn't it have wider applicability if it could work with any guest OS? |
|
The reverse execution technique described can be seen as loading a
previous checkpoint in time and forwarding to the point of interest
(replay). Except for restoring a previous state, the execution model
is always going forward. Could it be possible to define an undo action
for each possible operation and use those "inverse" operations to
truly execute a program backwards ?
Are there programming language interpreters that support such time travel (e.g. is there a time-traveling Java VM)? Should we expect the space overhead for logging in web servers to be at least half of the data that is being transferred to and from the HTTP clients? |
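The "undo action per operation" idea in this question can be illustrated with a small sketch. It is a hypothetical toy, not something from the paper: the `UndoMachine` class and its register dictionary are assumptions made for illustration. Each write records the value it destroys, so that stepping backwards applies the inverse operation instead of replaying from a checkpoint.

```python
class UndoMachine:
    """Hypothetical machine whose every write logs the value it overwrote."""

    def __init__(self):
        self.regs = {}
        self.undo_log = []

    def write(self, reg, value):
        # Record (register, existed-before, old value) before overwriting,
        # since an assignment alone destroys the information needed to undo it.
        self.undo_log.append((reg, reg in self.regs, self.regs.get(reg)))
        self.regs[reg] = value

    def step_back(self):
        # Apply the inverse of the most recent write.
        reg, existed, old = self.undo_log.pop()
        if existed:
            self.regs[reg] = old
        else:
            del self.regs[reg]
```

The cost in this sketch moves from time to space: a backward step is constant-time, but the undo log grows with every state-changing operation rather than once per checkpoint interval.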