CPSC 508 - Operating System

Deciding when to forget in the Elephant file system

Douglas S. Santry, Michael J. Feeley, Norman C. Hutchinson, Alistair C. Veitch, Ross W. Carton, and Jacob Ofir (1999)

Presented by Gary Huang, Nguyet Nguyen



Presentation Slides [PDF][PPT]

In-class Discussion

(1) Motivation:

Do we need such strong protection from the file system to protect users from their own mistakes?

The answer could be Yes or No.  Yes, if files don't cost much space.  No, if files are large media files.

(2) Related Work:

Is it worth using the Elephant File System when we already have (i) version control systems, e.g. CVS; (ii) Time Machine in Mac OS X; (iii) backup systems (daily, weekly, and monthly), e.g. a checkpointing file system; or even (iv) the Windows Recycle Bin?

- If we focus on one specific function, EFS is not worth it.

- From a research point of view, EFS is worthwhile.

(3) Applicability:

Why isn’t the Elephant File System widely used today? Are people reluctant to use unfamiliar file systems?

It is not widely used today because of its policy problem.  Also, EFS was designed only for academic research purposes, without much consideration for wide use.

(4) Security:

Should the file system have some support for protecting against attacks such as trojans or other malicious programs?

Yes, it should.  Security could be addressed through the web cache, retention policies, system-adjusting policies, etc.

Submitted questions (By subjects)

IMPLEMENTATION

What prevents the user from changing the policy of a file? Wouldn't the fact that the owner of the file can change its policy turn the
whole file system into something similar to a trash can in Windows? (i.e., human mistakes will still occur, just at a different level)

What happens to a file's inode-log when the file's policy is changed from Keep-(All,Safe,Landmarks) to Keep-One? Would it be erased?

How exactly are 'undo-able' changes avoided? Isn't the information overwritten when there are rewrites to the same page in the buffer?

How is the optimal checkpointing frequency attained in Elephant? Is the suggested granularity a bottleneck in achieving it, or
would you argue that this granularity provides optimal checkpointing?

Can users change the retention policy of a file after it has been created? Should they be able to? If a file's retention policy changes to
a less conservative policy (e.g. from Keep Landmarks to Keep Safe), at what point (if any) should the versions associated with the old policy
be deleted?

Elephant is supposed to protect users from their own mistakes by not giving them control over storage reclamation, but doesn't the lack of
control create a new problem, that users may not be able to free space on a full disk? In a sense, users become vulnerable to a new kind of
mistake in that the disk space used by the creation of a large file or the writing of a large amount of data cannot be explicitly reclaimed.

How do we deal with users who save files constantly (most of us...)?  It seems there's a policy that a file is only versioned if it's opened 10
seconds after its last close, but for people who save files 2 or 3 times a minute, you generate 2 or 3 versions of the file.  Isn't this a problem?
Maybe if it's a landmark file those 2 or 3 versions are discarded quickly anyway?
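
As a rough illustration of the concern above, the rule as the question states it can be sketched as follows. This is a minimal sketch, not Elephant's implementation: the function name count_versions, the (open, close) session format, and the assumption that the first open always creates a version are all hypothetical, and the 10-second threshold is taken from the question rather than checked against the paper.

```python
from typing import List, Optional, Tuple

# Threshold quoted in the question above; an assumption, not verified against the paper.
GRACE_SECONDS = 10.0


def count_versions(sessions: List[Tuple[float, float]]) -> int:
    """Count versions created by a list of (open_time, close_time) sessions.

    Assumption: the first session always creates a version. A later session
    creates a new version only if it opens more than GRACE_SECONDS after the
    previous close; otherwise it is folded into the current version.
    """
    versions = 0
    last_close: Optional[float] = None
    for open_time, close_time in sorted(sessions):
        if last_close is None or open_time - last_close > GRACE_SECONDS:
            versions += 1
        last_close = close_time
    return versions


# A user who saves three times a minute (2-second sessions, 20 seconds apart).
frequent_saver = [(0.0, 2.0), (20.0, 22.0), (40.0, 42.0)]
# A process that rewrites the file in a quick burst.
bursty_writer = [(0.0, 1.0), (3.0, 4.0), (6.0, 7.0)]

print(count_versions(frequent_saver))
print(count_versions(bursty_writer))
```

Under these assumptions the user who saves every 20 seconds still creates three versions per minute, while the burst of rewrites is folded into a single version, so the threshold alone does not address the frequent-saver case.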

Do you think it's a good idea to allow applications to specify their own file retention policies?  They mentioned moving files to different hosts
will have problems but do the advantages really make it worth it?  I think it's better to eliminate some complexity and even features for the
sake of simplicity and I think allowing this was a bad idea.

Versioning on files is done at the block level. Why not implement versioning at the device level? Wouldn't the system benefit from using a
device interface at higher level?

How can disk blocks be shared among versions of the same file?

The cleaner policy of the Elephant File System doesn't coalesce free space. As versions get older and need to be deleted, it appears to me that fragmentation might become a problem.  Would you agree with me? Or does the system have a strategy for dealing with fragmentation?

The versioning policy sounds fine for files that are not updated multiple times in short intervals (e.g. CPSC 508 assignments). However, if a file is updated frequently in a certain interval, it occurs to me that the versioning policy might hold too much redundant information. Can we have a policy that stores the differences between versions? Additionally, in such a case the landmark heuristic (if a file was updated twice in a two-minute interval a month ago, then keep the newest) might not hold either, since frequent updates might be the result of fast transactions done by a process, as opposed to a human user fixing mistakes he has made.  What do you think?
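
The landmark heuristic mentioned in this question can be sketched roughly as below. This is only a sketch under stated assumptions: the function pick_landmarks, its parameters, and the 7-day and 2-minute values in the example are hypothetical choices for illustration, not the paper's actual algorithm or constants.

```python
from typing import List


def pick_landmarks(timestamps: List[float], now: float,
                   recent_window: float, group_gap: float) -> List[float]:
    """Return the version timestamps that would be kept.

    Assumptions: every version newer than now - recent_window is kept.
    Older versions are grouped, where consecutive versions less than
    group_gap apart belong to the same group, and only the newest
    version of each group is kept.
    """
    ordered = sorted(timestamps)
    kept = []
    for i, ts in enumerate(ordered):
        if now - ts <= recent_window:
            kept.append(ts)  # recent versions are always kept
            continue
        is_last = i == len(ordered) - 1
        # An old version survives only if it is the newest in its group.
        if is_last or ordered[i + 1] - ts >= group_gap:
            kept.append(ts)
    return kept


DAY = 86400.0
versions = [5 * DAY, 5 * DAY + 90, 20 * DAY, 39 * DAY, 39 * DAY + 60]
print(pick_landmarks(versions, now=40 * DAY, recent_window=7 * DAY, group_gap=120.0))
```

Note that the sketch only looks at timestamps, so it cannot tell a human fixing a mistake from a process issuing rapid updates, which is exactly the weakness the question raises.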

Would you say that storing versions of a file consecutively on the disk could impact the performance of the Elephant filesystem?

Doesn't file size play a big role in deciding what logs should be deleted? (i.e., it is relatively inexpensive to store every .doc edit I've ever made,
but very expensive to store all edits to my video collection.)

In finding landmarks, it seems like edits which involve deletion are much poorer candidates than those that involve adding new content, because
deletes are easier to replicate.  Should we consider creating landmarks by attempting to merge together nearby edits into some kind of
"summarizing" landmark?

Is there a common format which can be used to interoperate between different versioning systems?

The authors mentioned that deleted files can be recovered by rolling back a directory.  How would that affect files that have been created since that specific version of the directory?  Does it restore the files back to the same point in time as well?

EXTENSION FEATURES

In source-control software, while the difference between each version is recorded, an important emphasis is put onto the comment for each
landmark. It is effective in helping the user recall the rationale behind the landmark, and subsequently in preventing the user from removing the landmark.
Is there any way to incorporate this into a file system?

Given the file retention policies, I think that it would be an interesting exercise to construct a set of user profiles to get a sense of how the
Elephant features would be used. For example: what kind of policy settings would a professional programmer want? What about a typical high school student?

SECURITY

It could also be an interesting idea to design a system recovery tool to go along with Elephant. For example, if your machine crashed due to some trojan
messing up your Windows directory, which directories/files should be rolled back automatically? While on the subject of malicious programs, it could be
really bad if you are assuming that your file is on "Keep Landmarks" when something purges all your old versions and turns it to "Keep One". I think that
it would be useful to be able to easily back up all versions of a file or directory.

What about security (especially secure file deletion)?

APPLICABILITY

Disk space is getting cheaper, but the sizes of application programs and memory (swap) requirements are growing too. Do you think we can afford to set aside space for multiple versions of our files just to take care of an accidental deletion? (How many of us really have a lot of free disk space on our 80GB hard disks?)

Doesn't Elephant make file handling more involved? Elephant is meant to protect users from their own mistakes. What if the user unknowingly sets a keep-all retention policy for a frequently modified file, thus eventually running out of disk space?

I think the Elephant file system would waste a lot of resources storing versions of files. Do you think this function justifies that kind of waste?

The main function of the Elephant file system is to store versions of files. This makes me think about the tools used for source code
management, like CVS. Which of the two approaches do you think is better?

I think it is good that Elephant focuses on goals other than achieving the highest performance, namely allowing users to recover
from their errors. Do you think however that it is still beneficial enough to use Elephant in systems that are already backed up regularly
(daily, weekly, and monthly)?

Why isn't Elephant FS widely used today? Are people reluctant to use unfamiliar file systems? Is hard disk space still too *precious*?

Elephants have a good memory, but they also take a lot of *space* :) http://news.bbc.co.uk/2/hi/science/nature/1285532.stm

Why haven't versioning file systems become popular (cf. 4.1)?

Apple Macintosh OS X 10.5 "Leopard" is set to include "Time Machine" which is a user space application which manages version history of
specified files. What are the merits of OS file system vs. user application for maintaining history?

Several applications (for example, many text editors) might rewrite the same data to disk without actually changing anything in the file. Because Elephant uses copy-on-write for creating new versions, it creates a new version in such cases. I think in order to solve this problem we can use a comparator to compare the new and old data when creating new versions. (Of course this approach has its own trade-offs.) Do you think it makes sense to implement such a thing?
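
The comparator idea in this question can be sketched with a toy in-memory model. Everything here is an assumption for illustration: the class VersionedFile, its write_block method, and the choice to create one version per changed block write are hypothetical and do not reflect how Elephant actually triggers versions.

```python
from typing import Dict, List


class VersionedFile:
    """Toy model: each version maps a block number to that block's bytes."""

    def __init__(self) -> None:
        # Start with a single empty version.
        self.versions: List[Dict[int, bytes]] = [{}]

    def write_block(self, block_no: int, data: bytes) -> bool:
        """Write one block; return True only if a new version was created."""
        current = self.versions[-1]
        # Comparator: skip copy-on-write when the content is unchanged.
        if current.get(block_no) == data:
            return False
        # Shallow copy, so blocks that did not change are shared between versions.
        new_version = dict(current)
        new_version[block_no] = data
        self.versions.append(new_version)
        return True


f = VersionedFile()
print(f.write_block(0, b"hello"))   # first write creates a version
print(f.write_block(0, b"hello"))   # identical rewrite is skipped
print(f.write_block(0, b"hello!"))  # changed content creates a version
print(len(f.versions))
```

The trade-off the question alludes to shows up even in this toy: the comparison needs the old block's contents on every write, which in a real file system could mean an extra read from disk.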
 
As predicted in the paper, hard disk capacities have increased in recent years. But usage patterns have also changed, and the demand for space has increased much more than capacity (just think about the number of mp3 files, digital pictures and movies that you have on your hard drive). Still, I think the Elephant file system makes sense in the presence of these changes, because a lot of these files do not need versioning to protect them, and so there won't be a capacity problem. So why don't we see such file systems in widespread use?

I think most of the multimedia files (that I mentioned in the last question) do not need versioning and so fall into one of the "keep one" or "keep safe" protection policies, but I think none of these policies is suitable for these files. Keep one is not good because you don't want to lose your family pictures after accidentally deleting them, and keep safe is not good for the same reason, because the files are lost after the second-chance interval without asking the user for confirmation. I think having something like the Windows Recycle Bin is the best choice in such cases. What do you think?

The paper assumes that disks have enough spare space to save versions of files. However, we have to consider that file sizes
have also become larger than before (e.g. multimedia: movies, broadcast programs, etc.) and that we also want more spacious disk storage in order
to save bulky files. As another issue, the Elephant file system shows poorer performance than most UNIX systems, as described in Section
6.3.1. So could the Elephant file system continue to develop and evolve despite these two disadvantages?

The paper implements the versioning function in the kernel. However, I cannot understand why this implementation is necessary. As the
paper points out, there are files that do not need version control. Regarding user-modified files, isn’t it enough to provide a library for
applications to manage versions of the files? Besides, we can think about a version control system acting like a file manager (e.g.
Windows Explorer) on top of a file system in the kernel, instead of integrating those two functions; in fact, Rational ClearCase, one of the
version control systems, takes the approach I describe here. Users can access versions of files using Windows Explorer, and the versioning
is implemented in user space.

I agree that short-term undo is definitely something that would be very useful, and that it is not well-supported by current
checkpointing schemes (like .snapshot). However, some of the longer-term schemes described here seem like they are adequately addressed
by source-control systems such as Subversion. Do you think that the implementation of Elephant could be simplified by forcing the user to
save landmarks through a source-control system, in which case it would be used only to provide short-term undo?

In 2006 we are seeing a lot of journalled filesystems in the mainstream, but those generally do not preserve user data (as far as 
I know). Why hasn't this sort of approach entered the mainstream yet,  in your opinion?

There seem to be quite a number of experimental versioning file systems out there (ext3cow, wayback, copyFS). In this context, what is
the primary drawback of the Elephant file system that prevented its widespread adoption?

With disk space virtually free and unlimited, does it still make sense to have fancy file systems that try to save space?  Would it be more efficient to store everything in a data-warehouse-based file system?

How does Time Machine in OS X work?

OTHERS

After having read all these papers about file systems, what other issues should concern file system designers besides performance?

The authors did not go into detail on how the Temperature of an imap is set. How would you "weigh" the value vs. the expiration date when
determining Temperature?

I did not understand what "current epoch" is. I was wondering if you could explain it a bit more?


Submitted questions (By persons)

Wilson Fung

In source-control software, while the difference between each version is recorded, an important emphasis is put onto the comment for each
landmark. It is effective in helping the user recall the rationale behind the landmark, and subsequently in preventing the user from removing the landmark.
Is there any way to incorporate this into a file system?

What prevents the user from changing the policy of a file? Wouldn't the fact that the owner of the file can change its policy turn the
whole file system into something similar to a trash can in Windows? (i.e., human mistakes will still occur, just at a different level)

What happens to a file's inode-log when the file's policy is changed from Keep-(All,Safe,Landmarks) to Keep-One? Would it be erased?

Alfred Yu-Han Pang

Given the file retention policies, I think that it would be an interesting exercise to construct a set of user profiles to get a sense of how the
Elephant features would be used. For example: what kind of policy settings would a professional programmer want? What about a typical high school student?

It could also be an interesting idea to design a system recovery tool to go along with Elephant. For example, if your machine crashed due to some trojan
messing up your Windows directory, which directories/files should be rolled back automatically? While on the subject of malicious programs, it could be
really bad if you are assuming that your file is on "Keep Landmarks" when something purges all your old versions and turns it to "Keep One". I think that
it would be useful to be able to easily back up all versions of a file or directory.

Karthik Chandrasekar

How exactly are 'undo-able' changes avoided? Isn't the information overwritten when there are rewrites to the same page in the buffer?

How is the optimal checkpointing frequency attained in Elephant? Is the suggested granularity a bottleneck in achieving it, or
would you argue that this granularity provides optimal checkpointing?

Anoop Karollil

Disk space is getting cheaper, but the sizes of application programs and memory (swap) requirements are growing too. Do you think we can afford to set aside space for multiple versions of our files just to take care of an accidental deletion? (How many of us really have a lot of free disk space on our 80GB hard disks?)

Doesn't Elephant make file handling more involved? Elephant is meant to protect users from their own mistakes. What if the user unknowingly sets a keep-all retention policy for a frequently modified file, thus eventually running out of disk space?

Sam Davis

Can users change the retention policy of a file after it has been created? Should they be able to? If a file's retention policy changes to
a less conservative policy (e.g. from Keep Landmarks to Keep Safe), at what point (if any) should the versions associated with the old policy
be deleted?

Elephant is supposed to protect users from their own mistakes by not giving them control over storage reclamation, but doesn't the lack of
control create a new problem, that users may not be able to free space on a full disk? In a sense, users become vulnerable to a new kind of
mistake in that the disk space used by the creation of a large file or the writing of a large amount of data cannot be explicitly reclaimed.

Lloyd Markle

How do we deal with users who save files constantly (most of us...)?  It seems there's a policy that a file is only versioned if it's opened 10
seconds after its last close, but for people who save files 2 or 3 times a minute, you generate 2 or 3 versions of the file.  Isn't this a problem?
Maybe if it's a landmark file those 2 or 3 versions are discarded quickly anyway?

Do you think it's a good idea to allow applications to specify their own file retention policies?  They mentioned moving files to different hosts
will have problems but do the advantages really make it worth it?  I think it's better to eliminate some complexity and even features for the
sake of simplicity and I think allowing this was a bad idea.

Erica Zhang

I think the Elephant file system would waste a lot of resources storing versions of files. Do you think this function justifies that kind of waste?

The main function of the Elephant file system is to store versions of files. This makes me think about the tools used for source code
management, like CVS. Which of the two approaches do you think is better?

Jean-Sebastien Legare

I think it is good that Elephant focuses on goals other than achieving the highest performance, namely allowing users to recover
from their errors. Do you think however that it is still beneficial enough to use Elephant in systems that are already backed up regularly
(daily, weekly, and monthly)?

Why isn't Elephant FS widely used today? Are people reluctant to use unfamiliar file systems? Is hard disk space still too *precious*?

Elephants have a good memory, but they also take a lot of *space* :) http://news.bbc.co.uk/2/hi/science/nature/1285532.stm

Haoran Song

Versioning on files is done at the block level. Why not implement versioning at the device level? Wouldn't the system benefit from using a
device interface at higher level?

How can disk blocks be shared among versions of the same file?

After having read all these papers about file systems, what other issues should concern file system designers besides performance?

Mehmet Argun Alparslan

The cleaner policy of the Elephant File System doesn't coalesce free space. As versions get older and need to be deleted, it appears to me that fragmentation might become a problem.  Would you agree with me? Or does the system have a strategy for dealing with fragmentation?

The versioning policy sounds fine for files that are not updated multiple times in short intervals (e.g. CPSC 508 assignments). However, if a file is updated frequently in a certain interval, it occurs to me that the versioning policy might hold too much redundant information. Can we have a policy that stores the differences between versions? Additionally, in such a case the landmark heuristic (if a file was updated twice in a two-minute interval a month ago, then keep the newest) might not hold either, since frequent updates might be the result of fast transactions done by a process, as opposed to a human user fixing mistakes he has made.  What do you think?

Mirna Limic

The authors did not go into detail on how the Temperature of an imap is set. How would you "weigh" the value vs. the expiration date when
determining Temperature?

I did not understand what "current epoch" is. I was wondering if you could explain it a bit more?

Would you say that storing versions of a file consecutively on the disk could impact the performance of the Elephant filesystem?

Kevin Loken

Why haven't versioning file systems become popular (cf. 4.1)?

Apple Macintosh OS X 10.5 "Leopard" is set to include "Time Machine" which is a user space application which manages version history of
specified files. What are the merits of OS file system vs. user application for maintaining history?

What about security (especially secure file deletion)?

Ali Bakhoda

Several applications (for example, many text editors) might rewrite the same data to disk without actually changing anything in the file. Because Elephant uses copy-on-write for creating new versions, it creates a new version in such cases. I think in order to solve this problem we can use a comparator to compare the new and old data when creating new versions. (Of course this approach has its own trade-offs.) Do you think it makes sense to implement such a thing?
 
As predicted in the paper, hard disk capacities have increased in recent years. But usage patterns have also changed, and the demand for space has increased much more than capacity (just think about the number of mp3 files, digital pictures and movies that you have on your hard drive). Still, I think the Elephant file system makes sense in the presence of these changes, because a lot of these files do not need versioning to protect them, and so there won't be a capacity problem. So why don't we see such file systems in widespread use?

I think most of the multimedia files (that I mentioned in the last question) do not need versioning and so fall into one of the "keep one" or "keep safe" protection policies, but I think none of these policies is suitable for these files. Keep one is not good because you don't want to lose your family pictures after accidentally deleting them, and keep safe is not good for the same reason, because the files are lost after the second-chance interval without asking the user for confirmation. I think having something like the Windows Recycle Bin is the best choice in such cases. What do you think?

Dutch Meyer

Doesn't file size play a big role in deciding what logs should be deleted? (i.e., it is relatively inexpensive to store every .doc edit I've ever made,
but very expensive to store all edits to my video collection.)

In finding landmarks, it seems like edits which involve deletion are much poorer candidates than those that involve adding new content, because
deletes are easier to replicate.  Should we consider creating landmarks by attempting to merge together nearby edits into some kind of
"summarizing" landmark?

Seonah Lee

The paper assumes that disks have enough spare space to save versions of files. However, we have to consider that file sizes
have also become larger than before (e.g. multimedia: movies, broadcast programs, etc.) and that we also want more spacious disk storage in order
to save bulky files. As another issue, the Elephant file system shows poorer performance than most UNIX systems, as described in Section
6.3.1. So could the Elephant file system continue to develop and evolve despite these two disadvantages?

The paper implements the versioning function in the kernel. However, I cannot understand why this implementation is necessary. As the
paper points out, there are files that do not need version control. Regarding user-modified files, isn’t it enough to provide a library for
applications to manage versions of the files? Besides, we can think about a version control system acting like a file manager (e.g.
Windows Explorer) on top of a file system in the kernel, instead of integrating those two functions; in fact, Rational ClearCase, one of the
version control systems, takes the approach I describe here. Users can access versions of files using Windows Explorer, and the versioning
is implemented in user space.

Michael DiBernardo

I agree that short-term undo is definitely something that would be very useful, and that it is not well-supported by current
checkpointing schemes (like .snapshot). However, some of the longer-term schemes described here seem like they are adequately addressed
by source-control systems such as Subversion. Do you think that the implementation of Elephant could be simplified by forcing the user to
save landmarks through a source-control system, in which case it would be used only to provide short-term undo?

In 2006 we are seeing a lot of journalled filesystems in the mainstream, but those generally do not preserve user data (as far as 
I know). Why hasn't this sort of approach entered the mainstream yet,  in your opinion?

Mayukh Saubhasik

There seem to be quite a number of experimental versioning file systems out there (ext3cow, wayback, copyFS). In this context, what is
the primary drawback of the Elephant file system that prevented its widespread adoption?

Is there a common format which can be used to interoperate between different versioning systems?

Ivan Sham

The authors mentioned that deleted files can be recovered by rolling back a directory.  How would that affect files that have been created since that specific version of the directory?  Does it restore the files back to the same point in time as well?

With disk space virtually free and unlimited, does it still make sense to have fancy file systems that try to save space?  Would it be more efficient to store everything in a data-warehouse-based file system?

How does Time Machine in OS X work?