A Fast File System

Presentation Slides

Here are the presentation
slides
Here is the Discussion Recap
Below are the questions submitted.

Questions Submitted

1. The authors try to strike a balance between efficiency and space utilization by splitting blocks into smaller "fragments". Is this distinction really necessary? Could they not just have made the block size smaller and allowed for batch operations on blocks?

2. Several tricks are used to exploit (or at least deal with) the the rotation of the disk platters. Yet, it was probably conceivable even in 1984 that other forms of secondary storage might be used where the hardware has no notion of rotating platters, but would nonetheless benefit from using the filesystem as described here. Is it a bad decision to tie the filesystem implementation so tightly to the underlying hardware? Would it be better to have another layer of abstraction to deal with things like, say, flash memory? (I'm not implying that flash "disks" were a consideration back then; but some other hardware could conceivably have replaced the magnetic disk, I think.)

3. The authors decided to use advisory locks instead of 'hard locks' to enforce locking semantics on files in the filesystem. This presents some problems, though, when you want to write a program that accesses or manipulates a file produced by some other program to which you do not have the source (or perhaps the patience to modify that source.) In this situation, you have no way of knowing if your program is reading from the file when it is in a logically inconsistent state. I know I have encountered this problem at least once before. Do you think that this is a significant enough of a problem to merit the implementation of hard locks? Can you think of other reasons why hard locks might be superior to advisory locks?

4. As my understanding, the factors contributing the speed of a new file system FFS are the increase of block size, the consideration of the underlying hardware, the clustering of related information, and the modification of disk drivers. Is it right? Then, could you show how much portion each contributes to the disk throughput?

5. What is the role of cylinder group and how does the cylinder group connected with other parts? How does the cylinder group contribute to the speed of FFS?

6. While back in the days when the paper is written, processor speed was the major bottleneck of the system, this has been resolved by the use of DMA (correct me if I am wrong). In this sense, other than the contiguous block layout, what limits the throughput of the harddrives? Are we hitting the theoretically limit yet?

7. Today's NTFS seems to perform resonably well even when the disk is beyond 95% full. (At least, no siginificant degrad in performance when new blocks are allocated.) What allows that to happen?

8. To create rotationally optimal blocks, the paper says that the time required for initiating another disk operation is considered so that the blocks can be laid out taking that delay into consideration. It also says that these parameters can be dynamically specified so that even if a slower/faster processor is installed, the throughput doesn't go down. I was wondering whether this needs to be done if using DMA? Maybe the file system's block allocation policy will then be based on the speed of the DMA controller?

9. The UNIX file system could have a maximum file size of 4GB? What is the current maximum file size for say FAT32/NTFS/EXT3

10. Many hard disks today have only a single platter. How does the cylinder group concept apply to single platter disks?

11. How much do the techniques in the paper still apply today? (The lines of thinking probably still does, but the actual numbers and specific design details are probably different.)

12. On page 191, second last paragraph - it mentions that the measurements in Table II were based on a file system with a 10 pcercent free space reserve. Is this a reasonable test? Is it fair to test the system at such an extreme condition?

13. In order for the layout policies to be effective, the file system cannot be kept completely full. They have a free space reserve parameter that gives the minimum acceptable percentage of file system blocks that should be free. They have used a 10% reserve for their measurement without providing any guidelines to set this parameter. What are the major factors that the system administrator should consider to set this parameter? I think that it has a direct relationship with the partition size (The bigger the partition the lower percentage of free space needed) and also it might be related to the average file sizes on the partition. Do you think it is possible to implement an algorithm that somehow finds the optimal amount of this parameter dynamically to increase the hard disk utilization? Is it reasonable to do such a thing?

14. The authors have listed several improvements that they didn't implement because their throughput was limited by the speed of processors. For example they didn't implement the preallocation technique that batches up block allocations for rapidly growing files and reduces the number of disk writes. As we know currently the processor speed is not the bottleneck. Do we have any bottlenecks external to the mechanical parts of the hard disk these days?

15. How do they determine the number of cylinders in a cylindergroup? >

17. Would it make sense to continue to increase block size today?

18. For this file system, parameterization is based on unchanging hardware attributes. Would it make sense to allow these parameters to by dynamically modified at run-time and make corresponding adjustments to disk layout on the fly?

19. The results show that the author's filesystem outperforms the old unix file system. Is there a workload that would show the opposite, and what would this workload be comprised of?

20. The authors allocates 2048 bytes of free space for each inode with the rationale that 'it will be more than ever needed'. Is this still the case?

21. A feature implemented by the authors was the ability for files to have arbitarily long file names. At the time of implementation, were any security converns(i.e. overflows via filenames) taken into consideration?

22. From my background it seems the fisk is always the bottleneck. Howvever, some parts of this paper suggests that processor were slower than disk. What is the reason or is it just my misconception?

23. What is the reason the throughput of the filesystem is strongly tied to the amount of free space?

24. Why do modern applications still use file locks instead of the locking mechanism provided by the file system ?

25. Was it wise on the author's part to take current processor speed as a bottleneck and thus make their design less scalable ?

26. How do new file systems, especially journaled systems like ext3 and jfs, deal with their journaling files, without halving the throughput of the disk?

27. Now that network appliances (i.e. large raid arrays of network storage) are becoming common, would it make sense to revisit some aspects of the file system interface. Are we now network bound instead of DMA/hard drive controller bound?

28. Comparing with the new file system, why does the old file system have low throughput and long seek time?

29. To avoid waste due to large-size block, the new file system uses fragment solution. As we know, Microsoft Windows have a disk tool Defragment. Is the fragment of the new file system a reversed way of Defragment in Windows?

30. How does this file system for Unix support for the architecture of multiprocessors or distributed systems?.

31. For the applications such as digital audio and video that require real-time storage and retrieval of continuous media data on disk, reading and writing files perform in sessions with a guaranteed minimum data rate. How to address this problem in this Unix file system?. if any change needed, what do you think to be modified?.

32. Compared with s5fs, the old file system in this paper, one of the FFS's advantages is that data is rotationally well-positioned on the disk. However, the rotational placement optimization could not work very well on SCSI disks. Could you give another example of FFS's idea not working now ?

33. As I know SVR4 does not adopt long name. I do not know if there is necessary to add this feature to file system. And why SVR4 does not use long name ?

34. The authors say on page 186, last paragraph: "The problem with expanding a file one fragment as a time is that data may be copied many times as a fragmented block expands to full block." and the solution is to assign a full block at a time. This is somewhat contradictory to their statement that most UNIX systems contain many small files. I wouldn't consider this a problem, rather a requirement. What would you say?

35 How can one change a filesystems' rotational separation for laying out blocks? Do modern systems adjust the rotational separation automatically based on the processor's speed requirements?

36. Why do authors say on page 189, last two lines on that page: "Ideally none of the cylinder groups should ever become completely full"? Why would one want to keep additional free space in each cylinder group?

37. Other than changing the block size, and changing the pattern in which the blocks are placed on a disk, are there other possible ways of improving filesystem performance?

38. The modified algorithms take into account many parameters from the underlying disk hardware. Would it be possible to design a filesystem that works well and makes abstraction of these parameters (for instance to make it perform well on any type of storage: hard disks, RAM, ATA over Ethernet, tapes) ?

39. About the end of section 3.2, why doesn't the software automatically adjust disk parameters so that it uses the best parameterization? Why do we have to "know" the eventual target machine?

40. Is the CPU utilization higher because of higher throughput or is it because the CPU actually has more work to do processing the I/O?

41. Modern files are much larger on average. Is it best that we still use these numbers (48KB before writing multiple cylinders and 1MB after that) or have these numbers been updated?

42. The performance of the FFS is limited by the system memory to user memory copy operations and the bandwidth usage. Would it not be better to group such operations and memory allocations? Where would the trade-off lie?

43. Are there better block allocation policies that may further enhance the performance?

44. Would it be better to perform periodic checks on memory allocations rather than checking upon every new allocation?

45. A cylinder group may have no more than 2048 inodes: does this limit the number of files per directory to 2048?

46 Is this FS still used in BSD? My guess is no: Disk sizes are up 2^11, access times by 2^-3, transfer rates up 2^7...

47. I would imagine the file fragmentation problem is getting more severe (i.e. disks are becoming less random-access). Have there been any significant improvements in the algorithms/heuristics used to prevent file fragmentation?

48. I don't understand the deadlock detection scheme. If they don't allow more than one lock on the same type on a file, in what sense do they have "shared" locks?

49. Why are read calls synchronous while write calls are asynchronous?

Discussion Recap

There wasn`t much time left to discuss however here are some topics we`ve covered:

There were questions about rotational optimality and how it is defined. CUrrently DMA is used to fetch blocks from the harddisk instead of CPU. CPU loads DMA wit correct input (i.e. beginning block, size etc.) and leaves the job to DMA. Hence rotational optimality is not much considered. Moreover hdds today are capable of reading the whole track at one shot to buffer which makes rotational optimality redundant.

Another question was on the block sizes, whether it would make sense to keep increasing the block size to increase performance. It depends on the underlying architecure. Tanenbaum presents results of an study in his book "Modern Operating Systems" which says that over 1 million files in a Unix file about half of them have sizes 1.68kb or less. Hence in a system like that choosing blocks sizes of 2kb would make sense in terms of disk space utilization and performance. On the other hand Google uses much larger files and chooses a block size of 64mb for performance purposes.

Last question was whether it would make sense to change the parameters on the fly to adjust the disk layout on the fly. However this might not be reasonable in such a situation: Some data has to be written to disk but the system performance is low so the system changes its policy of rotational optimality. But later on the performance might recover and reading this data would be slower than it needs to be, given that we can find a way to determine parameters for the time of write.