The Design and Implementation of a Log-Structured File System

Mendel Rosenblum and John K. Ousterhout

Presented by Ali Bakhoda and Ivan Sham


Presentation Slides

Here is the link for the Power Point file. The slides are augmented with comments in the Notes section. A categorized and summarized list of submitted questions is also included at the end of slides in the discussion part..


Discussion Recap


List of Submitted Questions

  1. The segments described in the paper are a fairly large unit of granularity (512K-1meg). What happens if some sectors in a segment 'go bad'? Are they merely flagged in the segment usage table and skipped over? What is the largest unit of storage that will be lost in this case -- the sectors themselves, a single block, or the entire segment?
  2. I can't seem to find any mention of replication of metadata, especially the segment summary block (SSB). I would imagine that an SSB would be a particularly bad thing to corrupt, because that would make it impossible to access a potentially large number of small files in the segment. (Although I suppose this is no different than a directory file being corrupted in a 'traditional' FS.) Do you think that SSBs are replicated and it just was not mentioned here? Do you think they should be replicated?
  3. Is it possible that constantly having a segment cleaner running in the background could reduce the lifespan of the disk? Would we at some point want to take a previously trusted module out of the sandbox, once it has proven itself reliable?
  4. Regarding Figure 9, they insist that Sprite LFS performs competitively for large files. Then, why they claim only that Sprite LFS is better in accessing small-sized files in the introduction?
  5. Regarding the second problem with existing file system in Section 2.3, reducing the amount of synchronization is the best way of improving performance?  Then, why NFS introduce additional synchronization? How does Sprite LFS solve the problems? Would you connect the mechanism of Sprite LFS with the issue?
  6. The authors state the random access to small files is a major bottleneck in some applications.  How big, in total, does the set of small files tend to get?  Is it small enough that we could handle these files as a special case, (eg grouping them together on a drive or in nonvolatile storage)?
  7. How is it that synchronous writes defeat the potential use of a file system write buffer?
  8. The authors built their own similator.  Is there (or could we build) a standard agreed upon simulator system that could be used to relaiably cross compare different methodologies and algorithms?
  9. We have read about two different file systems Fast file system for Unix (FFS) and LFS. How do you compare between two?. What are advantages or disadvantages of LFS compared to FFS?. In the context of a database application with intensive transaction processing, which do you think that has more advantage? 

  10. How do you evaluate the effect of cleaning and disk fragmentation with the design of LFS?.  LFS needs large extents of free space available for writing new data. What can be the upper bound of the available storage size for LFS working efficiently?.

  11. Are LFSs used in any current OSes?

  12. From what I understand, the segment cleaner relies a bit on having sufficient empty segments to use.  How could one avoid the performance hit when a disk is near capacity?

  13. The tests were run on a 300 MB file system. How do you think the results will scale to current 300 GB (i.e. 1000 larger) file systems?
  14. Why do you think the performance of the system is insensitive to the thresholds for cleaning segments?
  15. This is great for local file systems, can any conclusions be drawn for use in networked file systems (the authors note synchronicity imposed by NFS) ... can transactional or log based systems improve this performance?
  16. It seems like a nice idea. Would you say that this is how the filesystems will be designed in the future?
  17. The slowest part of reading a file block is the actual seek. How would you compare the authors' proposed filesystem to the following idea: Use the standard Unix system (say sun), but read ahead into the memory additional pages of data that are not currently needed by the process, but could be needed in the future. I know that this sounds ify, but why couldn't the standard unix systems use the LFS's idea of using more memory to store additional pages and thus reduce disk seeks?
  18. Would you say that adjusting the size of a segment would be a useful or non-important feature in LFS?
  19. It ocurs to me the weakest link of this file system is sequential reads. The file system assumes that files written together are going to be read together and it is safe to wirte these files together. I can somewhat agree with this assumption but the thing that bugs me most is whether this assumption is valid under different processes trying to write to disk simultaneously. Does the system still places such files together? If so how viable is this assumption?
  20. The system tries to avoid moving highly utilized segments (while coalescing space) since moving these segments will detoriate the file system`s performance for writing new data. Hence it forces the system to a bimodal distribution to and move the less utilized segments. My question is whether such a distributuin can be achieved at the very beginning ( i.e. the system is recently installed)? Can we expect the performance to be low at the beginning? Is this actually the start-up cost that the authors are trying to avoid by running the file-system 2 or 3 months before the tests?
  21. The authors are advertising that the system will have its performance more prominent as the CPU and the disk bandwidth gap widens in the future. Nowadays this gap has widened enough, has the popularity of the file system increased?
  22. The motivation for logging file systems is usually for quicker recover from crashes. In the case of Flash-based file systems, there is the additional hassle of: ensuring even wear rotation, the "time" cost of erasing a block (typically takes a "long" time when compared to just writing on a freshly clean block), or just plainly reducing the number of block erases. What logging FS operations would be particularly expensive on a flash-based file system? (segment cleaning?) What kinds of trade offs would have to be made to better align the operations to minimize block erases?
  23. Is logging the ultimate answer for file systems? Is this really a tradeoff  between reliability and performance? What does RAID compare to improving the  reliability of disk storage when compared to logging FSs?
  24. Log-structures file systems are based on the assumption that read operations will be taken care of by increasing disk caches. The writing speed up also seems to be achieved by sacrificing some read performance (write sequentially with indexes for reading). Is disk traffic really dominated by writes?
  25. Doesn't collecting large amounts of file data in a file cache in volatile main memory (for increasing write performance) increase the risk of data loss? What is not committed cannot be recovered can it?
  26. The paper says LFS still uses FFs' inode strcuture. But inodes are not located at fixed positions. What is the purpose by doing this? Is it just used to get more compact arrangement? Then it says file reading performance of LFS is similar to FFS. Is it really true? and why does disk workload become write-dominated?
  27. The log-structured file system (LFS) can offer better performance for many common workloads such as those with frequent small writes , sufficient idle time to clean the log and so fourth. However, from some other research result I googled on the Internet, it also has poor performance for other workloads, such as random updates to a full disk with little idle time to clean. How to enable LFS to provide high performance across a wider range of workloads
  28. Is obtaining performance just a matter of taking blocks that are likely to be read sequentially, and regrouping them into one chunk? Could we increase performance of the unix filesystem if we just heuristically reordered the blocks on disk ? (For instance if you can figure out that block B is always read after block A, you put B right after A)
  29. That's kind of a technical question: Last class we discussed that nowadays, hard disks lie about the physical location at which they write the data. If you ask the hard drive to write data sequentially, will it really write it sequentially on the physical disks?
  30. Related to crash recovery: Their algorithm to recover from a crash depends on comparing the current date with those of the 2 checkpointing blocks (and picking the checkpointing block that has the most recent date). I'm curious to know if a computer can crash in the middle of writing the date on disk. Can modern hard disks guarantee atomic writes ?
  31. I just installed Linux and it defaults to a logged file system.  Maybe we've finally made CPU's fast enough that they're not the bottleneck.  Do you know when this occured?  How fast should a CPU be to be able to benefit from the performance mentioned?
  32. It looks like the perfect solution but why are there still non-logged file systems floating around?  What advantages do they still have?  Maybe performance isn't as good as they predicted here?
  33. What is with the 65-75% usage of the HD bandwidth, why can't we use 100%? Is it because we don't write the log or because we have to clean segments?  What accounts for that 25-35%?
  34. Would it be cost-effective to have a separate, faster storage medium (e.g. flash) for file system metadata such as inodes? That should make data access and layout more efficient, while also allowing parallel access to the data and metadata. Or, when using multiple disks, would simply storing the metadata for each file system on a different disk from the one which stores the file system be worth doing?
  35. They suggested that increasing memory capacity would make caching more effective, but they also mention the rapid increase in disk capacity. Unless memory capacity increased at a much greater rate than disk capacity (which it hasn't), why would caching become more effective? Also, increasing memory capacity means increasing memory usage by programs and operating systems, so it's not clear how much of the increase the file cache would even see.
  36. What is the cost performance trade-off in this system?
  37. What is the difference between FFS and Sprite LFS in their locality support (temporal Vs logical)? What is the effect of that on user applications?
  38. Do we have ideas from this system adopted by current file systems?
  39. What have prevented LFS from entering other mainstream OS other than Sprite, provided that it's performance is significantly superior to the older file systems?
  40. As disk size become exponentially large, the time needed to clean a disk would increase accordingly. Would the amount of cleaning needed to clean the disk eventually be overwhelmed due to existance of large files in fragmented location (due to intensive use of P2P download software)?
  41. Which file system would suit better for Flash memory?
  42. I think this system uses the full potentiality of a disk. Then will the disk become the main bottleneck for system performance ?
  43. Now Linux and other OSes have adopted the ideas of LFS. Could you give some introduction and analysis on this topic?
  44. In Sprite LFS, how to make large extents of free space always available for writing contiguous segments?
  45. For the free space management, how should live blocks be grouped when they are written out?
  46. LFS tries to conserve "temporal locality" while writing the files. How does this idea work when there are multiple processes trying to do writes at the same time.
  47. How does LFS handle a frequently modified file whose size is continually increasing. (Wouldn't this lead to the file being fragmented all over the log)
  48. Wont the threshold in segment cleaning time affect the performance?
  49. I read about Journaling File System which stores only metadata intent records in the log and improves performance by transforming metadata update commits into sequential intent writes, allowing the actual in-place update to be delayed. Wont that provide a better performance than LFS?