CPSC 508 - Operating System
Deciding when to forget in the Elephant file system
Douglas S. Santry, Michael J. Feeley, Norman C. Hutchinson, Alistair C. Veitch, Ross W. Carton, and Jacob Ofir (1999)
Presented by Gary Huang, Nguyet Nguyen
Presentation Slides [PDF][PPT]
In-class Discussion
(1) Motivation:
Do we need such strong protection from the file system to protect users from their own mistakes?
The answer could be yes or no: yes if the files take little space, no if they are large media files.
(2) Related Work:
Is the Elephant File System worth using when we already have (i) version control systems, e.g. CVS; (ii) Time Machine in Mac OS X; (iii) backup systems (daily, weekly, and monthly), e.g. a checkpointing FS; or even (iv) the Windows Recycle Bin?
- If the focus is on one specific function, EFS is not worth it.
- From a research point of view, EFS is worthwhile.
(3) Applicability:
Why isn’t the Elephant File System widely used today? Are people reluctant to use unfamiliar file systems?
It is not widely used today because of problems with its policies. Also, EFS was designed for academic research, without much consideration for wide deployment.
(4) Security:
Should the file system provide some support to protect against attacks such as trojans or other malicious programs?
Yes, it should. Security can be addressed through the web cache, the retention policy, the system adjusting policy, etc.
Submitted questions (By subjects)
IMPLEMENTATION
What prevents the user from changing the policy of a file?
Wouldn't the fact that the owner of the file can change its policy render the
whole file system into something similar to a trash can in Windows? (i.e. human
mistakes will still occur, just at a different level)
What happens to a file's inode-log when the file's policy is changed from Keep-(All, Safe, Landmarks) to Keep-One? Would it be erased?
How exactly are 'undo-able' changes avoided? Isn't the
information overwritten when there are rewrites to the same page in the buffer?
How is the optimal checkpointing frequency attained in Elephant? Is the suggested
granularity a bottleneck in achieving it, or
would you argue that this granularity provides optimal checkpointing?
Can users change the retention policy of a file after it has
been created? Should they be able to? If a file's retention policy changes to
a less conservative policy (e.g. from Keep Landmarks to Keep Safe), at what
point (if any) should the versions associated with the old policy
be deleted?
Elephant is supposed to protect users from their own mistakes by not giving them
control over storage reclamation, but doesn't the lack of
control create a new problem, that users may not be able to free space on a full
disk? In a sense, users become vulnerable to a new kind of
mistake in that the disk space used by the creation of a large file or the
writing of a large amount of data cannot be explicitly reclaimed.
How do we deal with users that save files constantly (most of
us...)? It seems there's a policy that a file is only versioned if it's opened 10
seconds after its last close, but for people who save files 2 or 3 times a
minute, you generate 2 or 3 versions of the file. Isn't this a problem?
Maybe if it's a landmark file those 2 or 3 versions are discarded quickly anyway?
Do you think it's a good idea to allow applications to specify their own file
retention policies? They mentioned moving files to different hosts
will have problems but do the advantages really make it worth it? I think it's
better to eliminate some complexity and even features for the
sake of simplicity and I think allowing this was a bad idea.
Versioning on files is done at the block level. Why not
implement versioning at the device level? Wouldn't the system benefit from using
a device interface at a higher level?
How can disk blocks be shared among versions of the same file?
The cleaner policy of the Elephant File System doesn't coalesce free space. As versions get older and need to be deleted, it appears to me that fragmentation might become a problem. Would you agree with me? Or does the system have a strategy for dealing with fragmentation?
The versioning policy sounds fine for files that are not updated multiple times in short intervals (e.g. CPSC 508 assignments). However, if a file is updated frequently in a certain interval, it occurs to me that the versioning policy might hold too much redundant information. Can we have a policy that stores the differences between versions? Additionally, in such a case, assigning landmarks (if a file was updated twice in a two-minute interval a month ago, then keep the newest) might not hold either, since frequent updates might be the result of fast transactions done by a process, as opposed to a human user making changes because of some mistake. What do you think?
Would you say that storing versions of a file consecutively on the disk could impact the performance of the Elephant filesystem?
Doesn't file size play a big role in deciding what logs should
be deleted? (i.e. it is relatively inexpensive to store every .doc edit I've ever
made, but very expensive to store all edits to my video collection.)
In finding landmarks, it seems like edits which involve deletion are much poorer
candidates than those that involve adding new content, because
deletes are easier to replicate. Should we consider creating landmarks by
attempting to merge nearby edits into some kind of
"summarizing" landmark?
Is there a common format that can be used to interoperate between different versioning systems?
The authors mentioned that deleted files can be recovered by
rolling back a directory. How would that affect files that have been created
since that specific version of the directory? Does it restore the files back to
the same point in time as well?
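One of the questions above asks how disk blocks can be shared among versions of the same file. A minimal sketch of the general copy-on-write idea follows. It is not Elephant's actual on-disk layout: `BlockStore`, `Version`, and `write_block` are hypothetical names invented for illustration, and blocks live in an in-memory list rather than on disk.

```python
from dataclasses import dataclass, field


@dataclass
class BlockStore:
    """Hypothetical append-only store mapping a block id to its bytes."""
    blocks: list = field(default_factory=list)

    def put(self, data: bytes) -> int:
        # Allocate a fresh block and return its id.
        self.blocks.append(data)
        return len(self.blocks) - 1


@dataclass(frozen=True)
class Version:
    """One file version: an immutable tuple of block ids."""
    block_ids: tuple


def write_block(store: BlockStore, old: Version, index: int, data: bytes) -> Version:
    """Return a new version that differs from `old` in a single block."""
    new_ids = list(old.block_ids)
    new_ids[index] = store.put(data)  # only the modified block is allocated
    return Version(tuple(new_ids))


store = BlockStore()
v1 = Version(tuple(store.put(b) for b in (b"aaaa", b"bbbb", b"cccc")))
v2 = write_block(store, v1, 1, b"BBBB")

# Block ids that both versions point at without any copying.
shared = set(v1.block_ids) & set(v2.block_ids)
```

In this sketch the old version stays readable because its block list is never modified, and only the rewritten block consumes new space. Reclaiming a version would additionally require knowing which blocks are no longer referenced by any remaining version, which the sketch does not attempt.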
EXTENSION FEATURES
In source-control software, while the difference between each
version is recorded, an important emphasis is put onto the comment for each
landmark. The comment helps the user recall the rationale behind the landmark
and subsequently discourages the user from removing the landmark.
Is there any way to incorporate this into a file system?
Given the file retention policies, I think that it would be an
interesting exercise to construct a set of user profiles to get a sense of how
the Elephant features would be used. For example: what kind of policy settings
would a professional programmer want? What about a typical high school student?
SECURITY
It could also be an interesting idea to design a system recovery
tool to go along with Elephant. For example, if your machine crashed due to some
trojan messing up your Windows directory, which directories/files should be rolled back
automatically? While on the subject of malicious programs, it could be
really bad if you are assuming that your file is on "Keep Landmarks" when
something purges all your old versions and turns it to "Keep One". I think that
it would be useful to be able to easily back up all versions of a file or
directory.
What about security (especially secure file deletion)?
APPLICABILITY
Disk space is getting cheaper, but the size of application programs and memory (swap) requirements are growing as well. Do you think we can afford to set aside space for multiple versions of our files just to take care of an accidental deletion? (How many of us really have a lot of free disk space on our 80GB hard disks?)
Doesn't Elephant make file handling more involved? Elephant is meant to provide protection from users' own mistakes. What if the user unknowingly sets a keep-all retention policy for a frequently modified file, thus eventually running out of disk space?
I think the Elephant file system would waste a lot of resources storing versions of files. Do you think this function justifies that kind of waste?
The main function of the Elephant file system is to store versions
of files. This makes me think of the tools used for source code
management, like CVS. Which of the two approaches do you think is better?
I think it is good that Elephant focuses on goals other
than achieving the highest performance, namely allowing users to recover
from their errors. Do you think, however, that it is still beneficial enough to
use Elephant in systems that are already backed up regularly
(daily, weekly, and monthly)?
Why isn't Elephant FS widely used today? Are people reluctant to use
unfamiliar file systems? Is hard disk space still too *precious*?
Elephants have a good memory, but they also take a lot of *space* :)
http://news.bbc.co.uk/2/hi/science/nature/1285532.stm
Why haven't versioning file systems become popular (cf. Section 4.1)?
Apple Macintosh OS X 10.5 "Leopard" is set to include "Time
Machine" which is a user space application which manages version history of
specified files. What are the merits of OS file system vs. user application for
maintaining history?
Several applications (for example many text editors) might
rewrite the same data to disk without actually changing anything in the file.
Because Elephant uses copy-on-write for creating new versions, it creates a new
version in such cases. I think in order to solve this problem we can use a
comparator to compare the new and old data when creating new versions. (Of
course this approach has its own trade-offs). Do you think it makes sense to
implement such a thing?
As predicted in the paper, hard disk capacities have increased in recent years.
But usage patterns have also changed, and the demand for space has increased
much more than the capacity (just think about the amount of mp3 files,
digital pictures and movies that you have on your hard drive), but I think the
Elephant file system still makes sense in presence of these changes because a
lot of these files do not need versioning to protect them and so there won't be
a capacity problem. So why don't we see such file systems in widespread use?
I think most of the multimedia files (that I mentioned in the last question) do
not need versioning and so fall into one of the "keep one" or "keep safe"
protection policies, but I think none of these policies are suitable for these
files. Keep one is not good because you don't want to lose your family pictures
after accidentally deleting them, and keep safe is not good for the same reason,
because the files are lost after the second-chance interval without asking the
user for confirmation. I think having something like the Windows Recycle Bin is the
best choice in such cases. What do you think?
The paper assumes that disk storage has enough spare space to save
versions of files. However, we have to consider that files
have also become larger than before (e.g. multimedia: movies, broadcast programs,
etc.), and we also want more spacious disk storage
to hold bulky files. As another issue, the Elephant file system shows poorer
performance than most UNIX systems, as described in Section
6.3.1. So can the Elephant file system continue to develop and evolve with
these two disadvantages?
The paper implements the versioning function in the kernel. However, I cannot
understand why a kernel implementation is necessary. As the
paper points out, there are files that do not need version control. Regarding
user-modified files, isn’t it enough to provide a library for
applications to manage versions of the files? Besides, we can think of a
version control system acting like a file manager (e.g.
Windows Explorer) on top of a file system in the kernel, instead of integrating those
two functions; in fact, Rational ClearCase, one of the
versioning systems, takes the approach I describe here. Users can access
versions of files using Windows Explorer, and the versioning
is implemented in user space.
I agree that short-term undo is definitely something that would
be very useful, and that it is not well-supported by current
checkpointing schemes (like .snapshot). However, some of the longer-term schemes
described here seem like they are adequately addressed
by source-control systems such as Subversion. Do you think that
the implementation of Elephant could be simplified by forcing the user to
save landmarks through a source-control system, in which case it would be used
only to provide short-term undo?
In 2006 we are seeing a lot of journalled filesystems in the
mainstream, but those generally do not preserve user data (as far as
I know). Why hasn't this sort of approach entered the mainstream yet, in your
opinion?
There seem to be quite a number of experimental versioning file
systems out there (ext3cow, wayback, copyFS). In this context, what is
the primary drawback in the Elephant file system that prevented its widespread
adoption?
With the amount of disk space virtually free and unlimited,
does it still make sense to have fancy file systems that try to save space?
Would it be more efficient to store everything in a data-warehouse-based file
system?
How does Time Machine in OS X work?
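One of the questions above proposes a comparator that checks new data against old data so that an editor rewriting identical contents does not produce a new version. The following is a rough sketch of that idea only, assuming whole-file comparison by SHA-256 digest. `VersionedFile` and `save` are hypothetical names, versions are kept in memory, and nothing here reflects how Elephant itself is implemented.

```python
import hashlib


class VersionedFile:
    """Hypothetical in-memory stand-in for a file with a version history."""

    def __init__(self):
        self.versions = []        # stored contents, oldest first
        self._last_digest = None  # digest of the most recent version

    def save(self, data: bytes) -> bool:
        """Store `data` as a new version unless it matches the latest one.

        Returns True if a new version was created, False if the write
        was recognised as a rewrite of identical contents.
        """
        digest = hashlib.sha256(data).digest()
        if digest == self._last_digest:
            return False
        self.versions.append(data)
        self._last_digest = digest
        return True


f = VersionedFile()
results = [f.save(b"draft"), f.save(b"draft"), f.save(b"draft v2")]
```

The trade-off mentioned in the question shows up directly: every save pays for hashing the full contents, even when the data did change. A block-level variant could compare only the blocks being written, at the cost of keeping a digest per block.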
OTHERS
After having read all these papers about file systems, what other issues should concern file system designers besides performance?
The authors did not go into detail on how the Temperature of an
imap is set. How would you "weigh" the value vs. the expiration date when
determining Temperature?
I did not understand what "current epoch" is. I was wondering if you could
explain it a bit more?
Submitted questions (By persons)
Wilson Fung
In source-control software, while the difference between each
version is recorded, an important emphasis is put onto the comment for each
landmark. The comment helps the user recall the rationale behind the landmark
and subsequently discourages the user from removing the landmark.
Is there any way to incorporate this into a file system?
What prevents the user from changing the policy of a file? Wouldn't the fact
that the owner of the file can change its policy render the
whole file system into something similar to a trash can in Windows? (i.e. human
mistakes will still occur, just at a different level)
What happens to a file's inode-log when the file's policy is changed from Keep-(All, Safe, Landmarks)
to Keep-One? Would it be erased?
Alfred Yu-Han Pang
Given the file retention policies, I think that it would be an
interesting exercise to construct a set of user profiles to get a sense of how
the Elephant features would be used. For example: what kind of policy settings
would a professional programmer want? What about a typical high school student?
It could also be an interesting idea to design a system recovery tool to go along
with Elephant. For example, if your machine crashed due to some trojan
messing up your Windows directory, which directories/files should be rolled back
automatically? While on the subject of malicious programs, it could be
really bad if you are assuming that your file is on "Keep Landmarks" when
something purges all your old versions and turns it to "Keep One". I think that
it would be useful to be able to easily back up all versions of a file or
directory.
Karthik Chandrasekar
How exactly are 'undo-able' changes avoided? Isn't the
information overwritten when there are rewrites to the same page in the buffer?
How is the optimal checkpointing frequency attained in Elephant? Is the suggested
granularity a bottleneck in achieving it, or
would you argue that this granularity provides optimal checkpointing?
Anoop Karollil
Disk space is getting cheaper, but the size of application
programs and memory (swap) requirements are growing as well. Do you think we can afford to set aside
space for multiple versions of our files just to take care of an accidental
deletion? (How many of us really have a lot of free disk space on our 80GB hard
disks?)
Doesn't Elephant make file handling more involved? Elephant is meant to provide
protection from users' own mistakes. What if the user unknowingly sets a
keep-all retention policy for a frequently modified file, thus eventually running out
of disk space?
Sam Davis
Can users change the retention policy of a file after it has
been created? Should they be able to? If a file's retention policy changes to
a less conservative policy (e.g. from Keep Landmarks to Keep Safe), at what
point (if any) should the versions associated with the old policy
be deleted?
Elephant is supposed to protect users from their own mistakes by not giving them
control over storage reclamation, but doesn't the lack of
control create a new problem, that users may not be able to free space on a full
disk? In a sense, users become vulnerable to a new kind of
mistake in that the disk space used by the creation of a large file or the
writing of a large amount of data cannot be explicitly reclaimed.
Lloyd Markle
How do we deal with users that save files constantly (most of
us...)? It seems there's a policy that a file is only versioned if it's opened 10
seconds after its last close, but for people who save files 2 or 3 times a
minute, you generate 2 or 3 versions of the file. Isn't this a problem?
Maybe if it's a landmark file those 2 or 3 versions are discarded quickly anyway?
Do you think it's a good idea to allow applications to specify their own file
retention policies? They mentioned moving files to different hosts
will have problems but do the advantages really make it worth it? I think it's
better to eliminate some complexity and even features for the
sake of simplicity and I think allowing this was a bad idea.
Erica Zhang
I think the Elephant file system would waste a lot of resources storing versions of files. Do you think this function justifies that kind of waste?
The main function of the Elephant file system is to store versions
of files. This makes me think of the tools used for source code
management, like CVS. Which of the two approaches do you think is better?
Jean-Sebastien Legare
I think it is good that Elephant focuses on goals other
than achieving the highest performance, namely allowing users to recover
from their errors. Do you think, however, that it is still beneficial enough to
use Elephant in systems that are already backed up regularly
(daily, weekly, and monthly)?
Why isn't Elephant FS widely used today? Are people reluctant to use
unfamiliar file systems? Is hard disk space still too *precious*?
Elephants have a good memory, but they also take a lot of *space* :)
http://news.bbc.co.uk/2/hi/science/nature/1285532.stm
Haoran Song
Versioning on files is done at the block level. Why not
implement versioning at the device level? Wouldn't the system benefit from using
a device interface at a higher level?
How can disk blocks be shared among versions of the same file?
After having read all these papers about file systems, what other issues should
concern file system designers besides performance?
Mehmet Argun Alparslan
The cleaner policy of the Elephant File System doesn't coalesce free space. As versions get older and need to be deleted, it appears to me that fragmentation might become a problem. Would you agree with me? Or does the system have a strategy for dealing with fragmentation?
The versioning policy sounds fine for files that are not updated multiple times in short intervals (e.g. CPSC 508 assignments). However, if a file is updated frequently in a certain interval, it occurs to me that the versioning policy might hold too much redundant information. Can we have a policy that stores the differences between versions? Additionally, in such a case, assigning landmarks (if a file was updated twice in a two-minute interval a month ago, then keep the newest) might not hold either, since frequent updates might be the result of fast transactions done by a process, as opposed to a human user making changes because of some mistake. What do you think?
Mirna Limic
The authors did not go into detail on how the Temperature of an
imap is set. How would you "weigh" the value vs. the expiration date when
determining Temperature?
I did not understand what "current epoch" is. I was wondering if you could
explain it a bit more?
Would you say that storing versions of a file consecutively on the disk could
impact the performance of the Elephant filesystem?
Kevin Loken
Why haven't versioning file systems become popular (cf. Section 4.1)?
Apple Macintosh OS X 10.5 "Leopard" is set to include "Time
Machine" which is a user space application which manages version history of
specified files. What are the merits of OS file system vs. user application for
maintaining history?
What about security (especially secure file deletion)?
Ali Bakhoda
Several applications (for example many text editors) might
rewrite the same data to disk without actually changing anything in the file.
Because Elephant uses copy-on-write for creating new versions, it creates a new
version in such cases. I think in order to solve this problem we can use a
comparator to compare the new and old data when creating new versions. (Of
course this approach has its own trade-offs). Do you think it makes sense to
implement such a thing?
As predicted in the paper, hard disk capacities have increased in recent years.
But usage patterns have also changed, and the demand for space has increased
much more than the capacity (just think about the amount of mp3 files,
digital pictures and movies that you have on your hard drive), but I think the
Elephant file system still makes sense in presence of these changes because a
lot of these files do not need versioning to protect them and so there won't be
a capacity problem. So why don't we see such file systems in widespread use?
I think most of the multimedia files (that I mentioned in the last question) do
not need versioning and so fall into one of the "keep one" or "keep safe"
protection policies, but I think none of these policies are suitable for these
files. Keep one is not good because you don't want to lose your family pictures
after accidentally deleting them, and keep safe is not good for the same reason,
because the files are lost after the second-chance interval without asking the
user for confirmation. I think having something like the Windows Recycle Bin is the
best choice in such cases. What do you think?
Dutch Meyer
Doesn't file size play a big role in deciding what logs should
be deleted? (i.e. it is relatively inexpensive to store every .doc edit I've ever
made, but very expensive to store all edits to my video collection.)
In finding landmarks, it seems like edits which involve deletion are much poorer
candidates than those that involve adding new content, because
deletes are easier to replicate. Should we consider creating landmarks by
attempting to merge nearby edits into some kind of
"summarizing" landmark?
Seonah Lee
The paper assumes that disk storage has enough spare space to save
versions of files. However, we have to consider that files
have also become larger than before (e.g. multimedia: movies, broadcast programs,
etc.), and we also want more spacious disk storage
to hold bulky files. As another issue, the Elephant file system shows poorer
performance than most UNIX systems, as described in Section
6.3.1. So can the Elephant file system continue to develop and evolve with
these two disadvantages?
The paper implements the versioning function in the kernel. However, I cannot
understand why a kernel implementation is necessary. As the
paper points out, there are files that do not need version control. Regarding
user-modified files, isn’t it enough to provide a library for
applications to manage versions of the files? Besides, we can think of a
version control system acting like a file manager (e.g.
Windows Explorer) on top of a file system in the kernel, instead of integrating those
two functions; in fact, Rational ClearCase, one of the
versioning systems, takes the approach I describe here. Users can access
versions of files using Windows Explorer, and the versioning
is implemented in user space.
Michael DiBernardo
I agree that short-term undo is definitely something that would
be very useful, and that it is not well-supported by current
checkpointing schemes (like .snapshot). However, some of the longer-term schemes
described here seem like they are adequately addressed
by source-control systems such as Subversion. Do you think that
the implementation of Elephant could be simplified by forcing the user to
save landmarks through a source-control system, in which case it would be used
only to provide short-term undo?
In 2006 we are seeing a lot of journalled filesystems in the
mainstream, but those generally do not preserve user data (as far as
I know). Why hasn't this sort of approach entered the mainstream yet, in your
opinion?
Mayukh Saubhasik
There seem to be quite a number of experimental versioning file
systems out there (ext3cow, wayback, copyFS). In this context, what is
the primary drawback in the Elephant file system that prevented its widespread
adoption?
Is there a common format that can be used to interoperate between different
versioning systems?
Ivan Sham
The authors mentioned that deleted files can be recovered by
rolling back a directory. How would that affect files that have been created
since that specific version of the directory? Does it restore the files back to
the same point in time as well?
With the amount of disk space virtually free and unlimited, does it still make
sense to have fancy file systems that try to save space? Would it be more
efficient to store everything in a data-warehouse-based file system?
How does Time Machine in OS X work?