The Design and Implementation of a Log-Structured File System
Mendel Rosenblum and John K. Ousterhout
Presented by Ali Bakhoda and Ivan Sham
Presentation Slides
Here is the link for the PowerPoint file. The slides are augmented with comments in the Notes section. A categorized and summarized list of submitted questions is also included at the end of the slides, in the discussion part.
Discussion Recap
- Is write traffic really dominating?
- Ivan thought that write traffic isn't really dominating for media content.
- Buck believed that the opposite is true once all caching and buffering are taken into account
- Temporal vs. logical locality
- Temporal and logical locality are usually the same on a personal computer
- Not the same on a file server; it's difficult to take advantage of locality on a server
- Buck's view: the server can partition the write traffic for each user into a different segment, so there are possible solutions to the multitasking issue
- This paper was one of the first papers in the field of log structured file systems and their issues
- There have been several papers following this work in various directions
- The analysis part of the paper is well done and really creative: they came up with the write cost metric
- Can we apply some of these techniques to improve Unix FFS?
- Optimisation of the write buffer is already implemented
- Reading ahead into cache and improving the logical locality by defragmentation are applicable to Unix file systems
- Why isn't this file system in mainstream use?
- It seems to be due to political issues that it is not integrated into the kernel yet
- It is used on write-once media
- Their reference point was Berkeley FFS, which uses synchronous metadata writes
- Synchronous writes are the main reason for FFS's poor performance
- Linux never did that from the beginning and that's why it has a super fast file system (although it is less safe)
- Later they migrated to somewhere in the middle: the journaling strategy
- Journaling has some of the nice properties of log file systems, such as a fast recovery process after a crash
- A short recovery process used to be a nice feature, but it has become an essential property with the large hard drives that we have today
- Hardware problems: hard disks currently hide a lot of details from the file system. This issue might defeat the whole idea of a log-structured file system.
- Disks are so cheap now that they fail much more often
- This is a hot topic currently, and there are lots of papers discussing the robustness of current file systems in the presence of these failures
- Buck believed that it's not a good idea to make the file system very complex to deal with these problems in every line of code
- Instead, lower-level solutions are simpler and more elegant
- For example, you can keep a reserve pool of blocks at the block level and replace bad blocks with good ones from the pool
- You can do this for any block regardless of how the file system is using it
- Linux has implemented this in EVMS, which does software block replacement and a lot of other things
- Question: Why doesn't the hard disk manage these problems itself?
- Disks are so bad these days that after 6 months the hard disk runs out of reserved internal blocks, so the software has to take care of bad blocks afterwards
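The block-level reserve pool idea from the discussion can be illustrated with a small sketch. This is not how any particular Linux component works; the class `RemappingBlockDevice` and its methods are hypothetical names chosen for illustration, and an in-memory dictionary stands in for the disk. The point it tries to show is that remapping happens below the file system, which keeps using the same logical block numbers.

```python
class RemappingBlockDevice:
    """Hypothetical block layer that hides bad blocks from the file system."""

    def __init__(self, num_blocks, reserve_size):
        # Logical blocks 0..num_blocks-1 are visible to the file system.
        # Physical blocks num_blocks..num_blocks+reserve_size-1 are the spare pool.
        self.num_blocks = num_blocks
        self.spares = list(range(num_blocks, num_blocks + reserve_size))
        self.remap = {}   # logical block -> spare physical block
        self.store = {}   # physical block -> data (stands in for the disk)

    def _physical(self, logical):
        if not 0 <= logical < self.num_blocks:
            raise IndexError("block out of range")
        # Unmapped blocks live at their own address.
        return self.remap.get(logical, logical)

    def mark_bad(self, logical):
        """Redirect a failing block to a spare; fail if the pool is exhausted."""
        if not self.spares:
            raise RuntimeError("reserve pool exhausted")
        old = self._physical(logical)
        new = self.spares.pop(0)
        self.remap[logical] = new
        # Assumption: whatever was stored in the failing block is lost.
        self.store.pop(old, None)
        return new

    def write(self, logical, data):
        self.store[self._physical(logical)] = data

    def read(self, logical):
        return self.store.get(self._physical(logical))


dev = RemappingBlockDevice(num_blocks=4, reserve_size=2)
dev.write(1, b"old")
dev.mark_bad(1)          # block 1 now lives in the spare pool
dev.write(1, b"new")     # the file system still addresses it as block 1
```

The `RuntimeError` branch corresponds to the situation raised in the discussion: once the reserve pool is used up, some other layer has to deal with further bad blocks.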
List of Submitted Questions
- The segments described in the paper are a fairly large unit of
granularity (512K-1meg). What happens if some sectors in a segment
'go bad'? Are they merely flagged in the segment usage table and
skipped over? What is the largest unit of storage that will be lost
in this case -- the sectors themselves, a single block, or the entire
segment?
- I can't seem to find any mention of replication of metadata,
especially the segment summary block (SSB). I would imagine that an
SSB would be a particularly bad thing to corrupt, because that would
make it impossible to access a potentially large number of small
files in the segment. (Although I suppose this is no different than a
directory file being corrupted in a 'traditional' FS.) Do you think
that SSBs are replicated and it just was not mentioned here? Do you
think they should be replicated?
- Is it possible that constantly having a segment cleaner running in
the background could reduce the lifespan of the disk?
Would we at some point want to take a previously trusted module out of the
sandbox, once it has proven itself reliable?
- Regarding Figure 9, they insist that Sprite LFS performs
competitively for large files. Then why do they claim in the introduction
only that Sprite LFS is better at accessing small files?
- Regarding the second problem with existing file systems in
Section 2.3, is reducing the amount of synchronization the best way of
improving performance? If so, why does NFS introduce additional
synchronization? How does Sprite LFS solve these problems? Would you
connect the mechanism of Sprite LFS with the issue?
- The authors state that random access to small files is a major bottleneck
in some applications. How big, in total, does the set of small files tend
to get? Is it small enough that we could handle these files as a special
case (e.g. grouping them together on a drive or in nonvolatile storage)?
- How is it that synchronous writes defeat the potential use of a file
system write buffer?
- The authors built their own simulator. Is there (or could we build) a
standard, agreed-upon simulator system that could be used to reliably
cross-compare different methodologies and algorithms?
- We have read about two different file systems, the Fast File System for Unix
(FFS) and LFS. How do the two compare? What are the advantages and
disadvantages of LFS compared to FFS? In the context of a database application
with intensive transaction processing, which do you think has the
advantage?
- How do you evaluate the effect of cleaning and disk fragmentation on
the design of LFS? LFS needs large
extents of free space available for writing new data. What is the upper
bound on the available storage size for LFS to work efficiently?
- Are LFSs used in any current OSes?
- From what I understand, the segment cleaner relies a bit on having
sufficient empty segments to use. How could one avoid the performance
hit when a disk is near capacity?
- The tests were run on a 300 MB file system. How do you think the
results will scale to current 300 GB (i.e. 1000 times larger) file systems?
- Why do you think the performance of the system is insensitive to the
thresholds for cleaning segments?
- This is great for local file systems. Can any conclusions be drawn for
use in networked file systems (the authors note the synchronicity imposed by
NFS)? Can transactional or log-based systems improve this performance?
- It seems like a nice idea. Would you say that this is how the filesystems
will be designed in the future?
- The slowest part of reading a file block is the actual seek. How would you
compare the authors' proposed filesystem to the following idea:
Use the standard Unix system (say Sun), but read ahead into memory
additional pages of data that are not currently needed by the process but
could be needed in the future. I know that this sounds iffy, but why
couldn't the standard Unix systems use LFS's idea of using more memory
to store additional pages and thus reduce disk seeks?
- Would you say that adjusting the size of a segment would be a useful or
unimportant feature in LFS?
- It occurs to me that the weakest link of this file system is sequential
reads. The file system assumes that files written together are going to
be read together and that it is safe to write these files together. I can
somewhat agree with this assumption, but the thing that bugs me most is
whether this assumption is valid when different processes are trying to
write to disk simultaneously. Does the system still place such files
together? If so, how viable is this assumption?
- The system tries to avoid moving highly utilized
segments (while coalescing space) since moving these segments will
deteriorate the file system's performance for writing new data. Hence it
forces the system toward a bimodal distribution and moves the less
utilized segments. My question is whether such a distribution can be
achieved at the very beginning (i.e. when the system is recently
installed)? Can we expect the performance
to be low at the beginning? Is this actually the start-up cost that the
authors are trying to avoid by running the file system 2 or 3 months
before the tests?
- The authors claim that the system's performance advantage will become
more prominent as the gap between CPU speed and disk bandwidth
widens in the future. This gap has by now widened considerably; has the
popularity of the file system increased?
- The motivation for logging file systems is usually quicker recovery from
crashes. In the case of Flash-based file systems, there is the additional
hassle of ensuring even wear rotation, the "time" cost of erasing a block
(which typically takes a "long" time compared to just writing to a freshly
cleaned block), or just plainly reducing the number of block erases. What logging
FS operations would be particularly expensive on a flash-based file system?
(segment cleaning?) What kinds of trade offs would have to be made to better
align the operations to minimize block erases?
- Is logging the ultimate answer for file systems? Is this really a tradeoff
between reliability and performance? How does RAID compare with logging FSs
in improving the reliability of disk storage?
- Log-structured file systems are based on the assumption that read
operations will be taken care of by increasing disk caches. The write
speedup also seems to be achieved by sacrificing some read performance
(write sequentially with indexes for reading). Is disk traffic really
dominated by writes?
- Doesn't collecting large amounts of file data in a file cache in
volatile main memory (for increasing write performance) increase the
risk of data loss? What is not committed cannot be recovered, can it?
- The paper says LFS still uses FFS's inode structure, but the inodes are
not located at fixed positions. What is the purpose of doing this? Is it
just to get a more compact arrangement? The paper then says the file reading
performance of LFS is similar to FFS. Is that really true? And why does
the disk workload become write-dominated?
- The log-structured file system (LFS) can offer better performance for
many common workloads, such as those with frequent small writes,
sufficient idle time to clean the log, and so forth. However, from some
other research results I found on the Internet, it also has poor
performance for other workloads, such as random updates to a full disk
with little idle time to clean. How can LFS be made to provide high
performance across a wider range of workloads?
- Is obtaining performance just a matter of taking blocks that are
likely to be read sequentially, and regrouping them into one chunk?
Could we increase performance of the Unix filesystem if we just
heuristically reordered the blocks on disk?
(For instance, if you can figure out that block B is always read after
block A, you put B right after A.)
- That's kind of a technical question:
Last class we discussed that nowadays, hard disks lie about the
physical location at which they write the data. If you ask the hard
drive to write data sequentially, will it really write it sequentially
on the physical disks?
- Related to crash recovery:
Their algorithm to recover from a crash depends on comparing the
current date with those of the 2 checkpointing blocks (and picking the
checkpointing block that has the most recent date). I'm curious to
know if a computer can crash in the middle of writing the date on
disk. Can modern hard disks guarantee atomic writes?
- I just installed Linux and it defaults to a logged file system. Maybe
we've finally made CPUs fast enough that they're not the bottleneck. Do
you know when this occurred? How fast should a CPU be to be able to
benefit from the performance mentioned?
- It looks like the perfect solution but why are there still non-logged
file systems floating around? What advantages do they still have? Maybe
performance isn't as good as they predicted here?
- What is with the 65-75% usage of the HD bandwidth, why can't we use 100%?
Is it because we don't write the log or because we have to clean
segments? What accounts for that 25-35%?
- Would it be cost-effective to have a separate, faster storage medium
(e.g. flash) for file system metadata such as inodes? That should make
data access and layout more efficient, while also allowing parallel
access to the data and metadata. Or, when using multiple disks, would
simply storing the metadata for each file system on a different disk
from the one which stores the file system be worth doing?
- They suggested that increasing memory capacity would make caching more
effective, but they also mention the rapid increase in disk capacity.
Unless memory capacity increased at a much greater rate than disk
capacity (which it hasn't), why would caching become more effective?
Also, increasing memory capacity means increasing memory usage by
programs and operating systems, so it's not clear how much of the
increase the file cache would even see.
- What is the cost performance trade-off in this system?
- What is the difference between FFS and Sprite LFS in their locality
support (temporal vs. logical)? What is the effect of that on user
applications?
- Do we have ideas from this system adopted by current file systems?
- What has prevented LFS from entering mainstream OSes other than
Sprite, given that its performance is significantly superior to
the older file systems?
- As disk sizes become exponentially larger, the time needed to clean a
disk would increase accordingly. Would the cleaner eventually be
overwhelmed by the amount of cleaning needed, due to the existence of large
files in fragmented locations (caused by intensive use of P2P download
software)?
- Which file system would suit better for Flash memory?
- I think this system uses the full potential of a disk. Then will
the disk become the main bottleneck for system performance?
- Now Linux and other OSes have adopted the ideas of LFS. Could you
give some introduction and analysis on this topic?
- In Sprite LFS, how are large extents of free space kept always available
for writing contiguous segments?
- For the free space management, how should live blocks be grouped when
they are written out?
- LFS tries to conserve "temporal locality" while writing the files.
How does this idea work when there are multiple processes trying to do
writes at the same time?
- How does LFS handle a frequently modified file whose size is
continually increasing? (Wouldn't this lead to the file being fragmented
all over the log?)
- Won't the threshold in segment cleaning time affect the performance?
- I read about Journaling File System which stores only metadata
intent records in the log and improves performance by transforming
metadata update commits into sequential intent writes, allowing the
actual in-place update to be delayed. Won't that provide better
performance than LFS?
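Several of the questions above ask about cleaning overhead and what happens as the disk fills up. The paper's write cost metric gives a rough way to reason about this: if the cleaner reads segments whose live-data fraction is u and writes the live data back, the write cost is 2/(1-u), and it is 1 when a segment is completely empty and need not be read. The sketch below only evaluates that formula; the function names are made up for illustration, and the bandwidth helper assumes that the fraction of disk bandwidth left for new data is simply the reciprocal of the write cost.

```python
def write_cost(u):
    """Write cost when cleaning segments with live-data fraction u (0 <= u < 1)."""
    if not 0.0 <= u < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    if u == 0.0:
        # An empty segment does not have to be read before it is reused.
        return 1.0
    # Read the whole segment (1), write back live data (u), write new data (1 - u),
    # all divided by the new data written (1 - u).
    return 2.0 / (1.0 - u)


def new_data_bandwidth_fraction(u):
    """Assumed share of disk bandwidth spent on new data: 1 / write cost."""
    return 1.0 / write_cost(u)


for u in (0.0, 0.2, 0.5, 0.8, 0.9):
    print(f"u={u:.1f}  write cost={write_cost(u):.2f}  "
          f"new-data bandwidth={new_data_bandwidth_fraction(u):.0%}")
```

The cost grows quickly as u approaches 1, which is one way to see why performance suffers when the cleaner has to work on highly utilized segments on a nearly full disk.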