ReVirt: Enabling Intrusion Analysis through Virtual-Machine Logging and Replay
G. Dunlap, S. King, S. Cinar, M. Basrai, P. Chen
Department of Electrical Engineering and Computer Science
University of Michigan
Presented by Wilson Fung
Brief Summary for the Paper
This paper introduces a new technique for logging a system by using a virtual machine. Normally, logging is done by the system's kernel. This approach has two major problems. The integrity of the log is not guaranteed, as attackers may forge logs on a compromised system. The completeness of the log is also not ensured, as the kernel cannot log its own non-deterministic events, such as hardware interrupts and external inputs.
The proposed technique, ReVirt, attempts to solve these problems by running a guest OS on top of a Virtual Machine Monitor (VMM) kernel module. The VMM kernel module in turn runs on a host OS. By providing a narrow interface to the guest OS, the VMM module is able to keep a complete log of the guest OS, including the non-deterministic events that kernel logging misses. The exact behaviour of a guest system over time may be replayed, just as video from CCTV may be replayed. This allows analysis of a race-condition bug or of the damage done by attackers to a system. Even if the guest OS is compromised, the log cannot be forged, as logging is done by the VMM module running outside the virtual machine.
With logging turned on, the performance of CPU-bound applications running on the guest OS is comparable to that on the host OS. For kernel-intensive workloads, execution can be up to 70% slower on the guest OS. Even at this performance cost, ReVirt can still fail if the attacker manages to compromise the host OS, which the paper claims is a much harder task. This issue, together with possible uses of ReVirt in debugging and hacking, is addressed in future papers.
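The log-and-replay idea in the summary can be illustrated with a toy sketch. This is not ReVirt's implementation: the "guest" here is a small deterministic function, and the names `run_guest`, `record`, and `replay` are hypothetical. The only point it shows is that if every non-deterministic input is logged together with the point in execution at which it arrived, rerunning the same deterministic code against the log reproduces the same final state.

```python
import random

MOD = 1_000_003  # arbitrary modulus for the toy state (an assumption)

def run_guest(steps, next_input):
    """Toy 'guest': deterministic except for what next_input(step) returns."""
    state = 0
    for step in range(steps):
        event = next_input(step)      # None means no external event at this step
        if event is not None:
            state = (state * 31 + event) % MOD
        else:
            state = (state + 1) % MOD
    return state

def record(steps, seed):
    """Run the guest live, logging each external event as (step, value)."""
    rng = random.Random(seed)         # stands in for unpredictable external input
    log = []

    def live_input(step):
        if rng.random() < 0.1:        # an event arrives at some steps only
            value = rng.randrange(256)
            log.append((step, value))
            return value
        return None

    return run_guest(steps, live_input), log

def replay(steps, log):
    """Rerun the guest, delivering logged events at the logged steps."""
    pending = dict(log)               # step -> value
    return run_guest(steps, pending.get)

final_state, event_log = record(1000, seed=42)
print(replay(1000, event_log) == final_state)
```

Only the events are stored, not the state after every step, which is why such a log can stay small relative to the amount of execution it covers.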
Discussion Recap
The discussion started with Dr. Krasic sharing his personal experience of an intrusion at a previous workplace, which resulted in a month's delay in development and emotional distress among the development team. This experience gives an incentive to use a logging system such as ReVirt, despite the performance penalty. The discussion then branched into several topics.
Presentation Slides
Link for the PowerPoint file < ppt | pdf >
Submitted Questions
Is it possible to reduce the hosting kernel size to be limited to just those functions / functionality that are required by the guest OS (i.e. the 7 or so system calls + device drivers)?
In light of the recent micro-kernel papers (especially L4), would it be possible to optimize aspects of the hosting OS to reduce the 13-58% slowdown for kernel-intensive applications? That is, is there a way to improve the trap / signal handling for the system calls to optimize the (few) calls made by UMLinux?
I think that an interesting application of the principles demonstrated in this paper would be in the *development* of operating systems or device drivers. Often this sort of development is difficult because mistakes in your code can cause serious damage to the rest of the system: At the very least, one spends a lot of time rebooting one's machine. The logging layer presented here would also provide a sort of 'implicit' debugger during development. This isn't so much a question as it is a point of discussion -- but I am curious if VMs are used in this way.
(Note: Crud, I just noticed the title of the next paper we're supposed to be reading: 'debugging OSs with time-traveling virtual machines'. Looks like that answers my question...)
What type of intrusion is ReVirt trying to detect? If your system is truly critical, shouldn't you actually go out and spend some good money on a separate system that performs the logging functionality? (Okay, ReVirt is trying to do the same thing on the cheap, but with some performance cost.)
Is it possible to "denial of service" the logging feature so that the tracks of the intrusion are better covered? (i.e. have a lot of unrelated activity running at the same time)
Can this technique be used to help in detecting/testing for bugs in hardware?
ReVirt starts only from a disk checkpoint of a powered-off virtual machine. So it might take a long time (maybe several days) to reach the desired point of execution during a replay. Is it practical to use such a system? I think it might be reasonable to take periodic checkpoints of the system state to avoid this problem. Also, having to power off your server to take a checkpoint does not seem logical. So why didn't they implement periodic checkpointing?
They log the number of branches executed since last interrupt in addition to the program counter in order to identify the interrupted instruction. Why do they need the number of executed branches? Is this related to the loops?
Why are there two phases needed to play back asynchronous virtual interrupts? I don't understand what the first phase does.
Wouldn't it have been simpler (and safer) to run the X server inside the virtual machine and give it direct access to the video card? Would that open up other vulnerabilities?
The paper mentions the possibility of replay over a period of months, but is that really practical? When would someone do that?
After introducing a virtual machine here, the performance should be worse. I do not know if the environment, especially the network environment, can be taken as equivalent to what it was before introducing the virtual machine.
Will ReVirt analyze the recorded log information, or does it just replay everything it has recorded? If there is analysis, how does it work, and which factors are more important than others?
Do they have a memory bound on ReVirt? It seems that they introduced the problem of compromising the system by sending it a large number of huge packets (until the disk is full)?
Is it a better design to have the virtual machine monitor as a kernel module (UMLinux) or as a separate entity (Xen)? Is this related to running on hardware or on a host OS?
I am curious about the procedure for appending log records to the disk. The paper says it is done in a manner similar to that used by the Linux syslogd daemon. I wonder how it works?
ReVirt reduces overhead during normal operation but requires more time to recreate a previous system state. Is there any better way to reduce such overheads?
Can ReVirt, in any form, be used with Xen?
It seems like ReVirt proposed a pretty comprehensive scheme of logging, but the tradeoffs (time and space overhead) are not cheap. Do you think it is worth applying this approach?
Why does ReVirt use the virtual machine UMLinux but not the other ones?
Of course, once one has contended with problems higher up (i.e., in this case, external apps attacking the OS the logger lives under), one will have to deal with things such as virtual rootkits that slip under the VM. What recommendations do you have for ReVirt then? Continue the cat-and-mouse game of who lives closer to the hardware?
The paper mentions the idea that one system's sent message is another's received message. Isn't that a kind of naive way of looking at the world, since it assumes that message doesn't get fiddled with en route, or is this one of those end-to-end debate problems again?
Cooperative logging works well in the presence of many machines connected to the same network, but it also decreases the reliability of replay if one system fails. Would you think that this is of concern to a programmer (debugger) of a system?
ReVirt assumes that the SIGUSR1 signal is always available for use. Would you say that this is usually the case?
The authors planned to use ReVirt as a building block for new security services. What is the challenge for the new security services? Why didn't the authors consider the new security services in this paper?
Besides a difference in goals, Hypervisor and ReVirt also differ in several design choices. From the comparison, it seems Hypervisor is better than ReVirt. However, in what respects does ReVirt perform better than Hypervisor?
Couldn’t we make a safer direct-on-host structure than an OS-on-OS structure? The authors argue that an OS-on-OS structure is safer than a direct-on-host structure. However, if we eliminate the host operating system from Figure 2, we may improve performance further, though the improvement would be small. As I remember, Microsoft is trying a secure direct-on-host structure in Xbox development.
The authors used the same operating system for both the host and the guest, and they just instrumented the host kernel. Can the OS-on-OS structure itself guarantee the security of the host operating system perfectly?
Is UMLinux the only guest OS that can be used without major overhaul? Due to its use of the Athlon's model-specific performance counters, is the Athlon the only processor supported? (i.e. not even Intel x86)
This all looks fragile:
- If there were new non-deterministic instructions
- If there were more instructions that could be interrupted mid-instruction (similar to string operations, this isn't likely though)
- If the branch counter were to become imprecise (quite possible, perhaps due to hardware optimization of future out-of-order implementations)
Then it seems to me that precise logging will no longer work.
To reduce the disk space needed to store log events, only the messages sent are saved, by the sending computer. However, can't the messages sent depend on the characteristics of the network at a given time (e.g. congestion)? Wouldn't this create differences in the run-time? Another question on this issue: during replay, are these messages resent by the same computers to the faulted computer while those computers are doing other jobs, are they halted too, or is some other mechanism used?
My question is about the experimental results: looking at the first row of Table 3, we can infer that the run without logging takes more time than the run with logging. Why do you think this might have happened?
Would running a different guest OS from the host OS provide some additional security for the logs?
Would we trust this logging mechanism enough to not perform a full restore from known good backups in the case of a significant security breach?
Does this help in any way with intrusion detection? It would seem that this would be important if the logs only go back a few months.
The paper states that since the trusted code base is small, a host OS is less vulnerable to an attack. But suppose a 'villain' does get access to the host OS, isn't it worse than if he had just got access to a single direct on host OS? I mean doesn't the whole VMM concept, if the VMM is not secure, increase the risk of multiple compromises?
The paper states that a 58% overhead is a moderate price to pay for security. But this statistic is just for running the guest OS on a virtual machine. Logging adds another 8% overhead. Isn't this a bit too much?
I like that this system can playback information to the OS but I'm not sure why this couldn't be done before. Or could it have been done before and no one has done it? Again for me it asks the question, why do I need a VM to do this when my OS could already do it?
Don't you think it's a fair assumption to trust the kernel? If we can't trust the kernel, who can we trust? Why is a VM more trusted than a kernel? Dunlap et al. even admit that this system isn't 100% secure, it just makes hacking more difficult.
Introducing extra code always introduces extra bugs. Since the VMM is loaded as a kernel module, it can potentially do a lot of bad things if it is compromised. Are there any alternatives to loading the VMM as a kernel module to improve security?
With the OS-on-OS structure, the authors claim that even if attackers gain control of the guest OS, they are severely restricted in the actions available against the host OS. How can that be done? Are there a lot of functionalities that are provided to the host OS but are not necessary to implement the guest OS? If so, why did we need those functionalities in the first place?
While replaying asynchronous virtual interrupts, when does the second phase get initiated?
How would ReVirt handle virtualization platforms wherein the entire virtual machine is not encapsulated in one process? (There would be far more non-determinism in terms of the process-process interaction, which would now need to be captured.)
How does ReVirt identify intrusions? Instantly or over a period of time? Is it a periodic review of processes?
Wouldn't it be more efficient to implement ReVirt for certain critical sections of an application and just employ checkpoints for the rest of the application?
It is mostly servers that are targets of attacks, and servers are typically run by multiple processors. Why is it that they did not intend to provide support for multi-processor computers? It would certainly be harder, but would it be impossible?
Wouldn't the abundance of data that is logged pose a problem to recover from an attack quickly?
We are facing a trade off between performance and security: on the host, programs run faster, but can be more easily compromised. Do you think the industry will use this idea if they can't make it run faster?
How can ReVirt replay external events coming from external devices that are not checkpointed, such as USB keys and CD-ROMs?
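One of the questions above asks why the number of branches executed is logged alongside the program counter. The toy sketch below is a way to see the intuition, under stated assumptions: it models a hypothetical three-instruction loop body (the instruction layout and the name `trace_loop` are made up for illustration, and this is not how ReVirt reads the hardware counter). Inside a loop the same program counter is reached once per iteration, so the program counter alone cannot say which visit was interrupted, whereas the pair (program counter, branch count) is distinct at every point of the trace.

```python
def trace_loop(iterations):
    """Return (pc, branches_so_far) for each instruction of a toy loop.

    Hypothetical layout: pc 0 = add, pc 1 = compare,
    pc 2 = conditional branch back to pc 0.
    """
    branches = 0
    points = []
    for _ in range(iterations):
        for pc in (0, 1, 2):
            points.append((pc, branches))
            if pc == 2:
                branches += 1   # the backward branch has now executed
    return points

points = trace_loop(4)
pcs_only = [pc for pc, _ in points]

# The same pc shows up once per iteration, so it is ambiguous on its own.
print(pcs_only.count(1))                 # 4
# Adding the branch count makes every execution point distinct.
print(len(set(points)) == len(points))   # True
```

Instructions that can be interrupted partway through, such as the string operations mentioned in another question, would need extra state beyond this pair; the sketch does not model them.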