"A Low-Bandwidth Network File System", by A. Muthitacharoen, B. Chen, D. Mazieres

Questions

How feasible is it to use some of the ideas (low bandwidth) from this system to implement a file system on computationally limited devices such as cell phones? Which assumptions of this system would be violated (memory, processor)?

The use of hash functions to check for similar content is interesting. What are the current ways to check for similar content (e.g. rsync)? What are their advantages and disadvantages?

Under what circumstances would one use LBFS beyond just when there's not enough bandwidth? For example, would there be certain types of services that would benefit from having fewer packets sent?

What is the current state of LBFS? Is it still being developed?

One would have expected that with a smaller chunk size, and thus finer granularity in looking for commonalities, the percentage of shared chunks would increase. Why is this not the case?

It seems like a lot of "modern" file systems rely on the cache on the target machines being sufficiently large. Will/should we rely more and more on large caches? How does that affect the ability to use these file systems on mobile devices, where a large cache is not always available?

This idea somewhat reminds me of the BitTorrent protocol, where file transfers are based on "chunks". How can we extend LBFS to be used as a P2P file sharing protocol?

How well would LBFS's open-close consistency semantics scale when dealing with large multimedia files?

Would it be possible to compromise the system by a brute force attack (generating all possible hashes, and then sending them along on the READ call) to try to retrieve data which doesn't belong to you?
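
For a sense of scale on this question, here is a back-of-the-envelope estimate, not an analysis of the actual protocol. It assumes SHA-1 behaves like a random 160-bit function, so each guessed hash matches one of the server's stored chunks with probability (number of chunks) / 2^160. The chunk count and the attacker's guess rate are made-up illustrative values.

```python
HASH_BITS = 160               # SHA-1 output size in bits
stored_chunks = 10**9         # assumption: distinct chunks indexed by the server
guesses_per_second = 10**9    # assumption: attacker's guess rate

# Expected number of random guesses before one matches any stored chunk.
expected_guesses = 2**HASH_BITS // stored_chunks

seconds_per_year = 365 * 24 * 3600
expected_years = expected_guesses / guesses_per_second / seconds_per_year
print(f"about {expected_years:.1e} years per successful guess")
```

Under these assumptions blind guessing is not a realistic threat; whether other parts of the protocol leak information about chunks is a separate question.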

The evaluation largely compares the percentage of bandwidth that LBFS uses in comparison to other networked file systems. However, the authors do not discuss whether the absolute reduction in bandwidth would make applications much more usable. For example, an application that responds in 16 seconds instead of 101 seconds is still pretty unusable in my book. Do you think the results here would actually make the typical suite of user programs that one would want to run over the network much more usable?

Can you think of other hash functions and fingerprints that might be more appropriate to this problem? For instance, in computational biology a number of locality-sensitive hashes are used to identify regions of high similarity across two sequences that may have experienced significant mutation (although these are admittedly not all that great when it comes to insertions/deletions either.)

In Section 3.2.1, LBFS provides close-to-open consistency without locking files. Thus, when several people work on the same file, the last one to close the file will save only their own work, and the others will lose the work they have done. I also think a file-locking policy is better, because coworkers should know whether the file is being modified by another person. So I think LBFS is not practical. Is it?

The paper explains “LBFS also practices aggressive pipelining of RPC calls to tolerate network latency” over NFS. How, then, does LBFS manage the latency? Although Sections 3.2.2 and 3.2.3 show the mechanisms in detail, I could not connect those details with the stated purpose.

The authors claim that their approach is complementary to other approaches; should we accept this claim? I can see how we could implement an LBFS mechanism on top of the other file systems, but the degree to which it would be beneficial is unsupported.

The decision to use a data-dependent chunk size is not very well justified. What are the relative benefits of variable chunk size?
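
One way to explore this question is a small experiment, sketched below, that inserts a single byte into a file and counts how many chunks can still be reused under each scheme. This is only an illustration, not the paper's implementation: it substitutes a simple polynomial rolling hash for Rabin fingerprints, uses a tiny window and chunk size so the effect shows up on a small input, and omits minimum and maximum chunk sizes. All names and constants are hypothetical.

```python
import hashlib
import random

WINDOW = 16          # bytes of context that decide a boundary (illustrative value)
MASK = (1 << 6) - 1  # boundary when the low 6 hash bits are zero: ~64-byte chunks
BASE = 257
MOD = (1 << 61) - 1


def content_defined_chunks(data):
    """Cut after any position where the hash of the last WINDOW bytes matches."""
    chunks, start, h = [], 0, 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, b in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        h = (h * BASE + b) % MOD                    # add the newest byte
        if i + 1 >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fixed_chunks(data, size=64):
    """Cut at fixed offsets regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def shared_fraction(old, new):
    """Fraction of the new file's chunks whose SHA-1 already appears in the old file."""
    known = {hashlib.sha1(c).digest() for c in old}
    return sum(hashlib.sha1(c).digest() in known for c in new) / len(new)


rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(20_000))
edited = original[:100] + b"X" + original[100:]  # one byte inserted near the start

cdc_shared = shared_fraction(content_defined_chunks(original),
                             content_defined_chunks(edited))
fixed_shared = shared_fraction(fixed_chunks(original), fixed_chunks(edited))
print(f"content-defined: {cdc_shared:.0%} of chunks reusable")
print(f"fixed-size:      {fixed_shared:.0%} of chunks reusable")
```

With fixed offsets, the insertion shifts every later chunk by one byte, so almost nothing matches. With content-defined boundaries, only the chunks near the edit change, because a boundary depends solely on the bytes in its window.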

Could we add a similar chunk database to a local file system, so that identical chunks in different files would be stored only once?
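
A minimal sketch of what such a store might look like, assuming files have already been split into chunks by some other means. The `ChunkStore` class and its methods are hypothetical names for illustration, and the sketch ignores deletion, which would need reference counting or garbage collection.

```python
import hashlib


class ChunkStore:
    """Hypothetical content-addressed store: each distinct chunk is kept once."""

    def __init__(self):
        self.chunks = {}  # SHA-1 digest -> chunk bytes
        self.files = {}   # file name -> ordered list of digests

    def write(self, name, chunks):
        digests = []
        for chunk in chunks:
            digest = hashlib.sha1(chunk).digest()
            self.chunks.setdefault(digest, chunk)  # store only if not seen before
            digests.append(digest)
        self.files[name] = digests

    def read(self, name):
        return b"".join(self.chunks[d] for d in self.files[name])

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())


store = ChunkStore()
store.write("report_v1", [b"intro ", b"body ", b"conclusion"])
store.write("report_v2", [b"intro ", b"new body ", b"conclusion"])
print(store.stored_bytes(), "bytes stored for two files")
```

The two files share their first and last chunks, so only four distinct chunks are kept even though six were written.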

Could LBFS benefit from using different algorithms to extract similarities between files depending on the type of the files compared (e.g. ASCII, images, archives, ...)? (Unfortunately, my guess is that would make the chunk database useless...)

It seems like hashing each chunk adds negligible overhead to overall execution times. Does that mean that more complex algorithms could be used to extract even more similarities between files?

Does LBFS impose a significant CPU hit, since it has to do SHA-1 hashing of the files?

Whole-file caching implies a slow startup when accessing a new or long-untouched file. Does this utilize the local db to reduce download time, or is the user stuck waiting for the whole file to be transferred?

In Figure 6, why does AFS use more downstream bandwidth than the native file system?

The consistency model could be violated. Would it be better if the server blocked all but one client from writing? Would most users accept the consistency model, or is it too weak? Could schemes to merge writes be added?

Are there any better solutions to the static i-number problem?

The LBFS client currently performs whole-file caching, and the authors mention that they would like to cache only portions of large files in future implementations. Could the scheme be changed to allow block caching rather than entire files (or portions of big files)? What are the challenges of doing such a thing? Is it worth it?

If multiple clients are writing the same file, then the last one to close the file will win and overwrite changes from the others. Isn't this a consistency violation? Wouldn't it be better if the server allowed only one client to write?

While LBFS works well with a reliable low-bandwidth connection, much of remote access has moved to wireless connections, which are unreliable and have high round-trip times. Is it possible to further enhance LBFS to perform better in these situations?

According to the analysis in the paper, there are many redundant chunks in the data. If that's the case, why are we consuming the space in the first place?

One of the authors' assumptions about concurrent writes is that the last write wins, so the file might end up in an inconsistent state for the other updaters. Do you think adopting a locking scheme would have made their job more difficult? I'm not much concerned about the complexity of the scheme, but the system might have lost the bandwidth gains it has now, since these gains depend on the similarity of files, which in turn might be affected by these different updates.

About the inefficiency of renaming: How does truncating a file A to zero length and replacing the contents with the file B overcome the efficiency problem? In what ways is this more efficient than the temporary file replacement that has to be done due to lack of control over i-numbers?

The goal of LBFS is to reduce the bandwidth used for file transfers. What trade-offs are made to achieve this?

LBFS achieves its goals by taking advantage of similarities between files or their versions to limit file transfers. When it can’t exploit this, for example when files are encrypted and their similarity is hard to detect, do you think its performance will degrade, and how does it compare with other file systems in that worst case?

About the performance numbers, how is it possible that LBFS can have better execution times than other file systems, given all the hashing/encryption that is going on? Are they hiding details of CPU load? Maybe they're assuming CPU load isn't relevant because we're sending a minimal number of messages to update files?

Do you think we need special handling for when multiple users edit a file? This file system does the most intuitive thing and just keeps the most recent version, and that works well on my local machine because I know who's editing my files. Should we still hide the details of multiple users editing a file from the users?

Could you give some idea of when LBFS is best applied?

Are there any potential issues not covered by this paper?

Besides client caching, file leases, and exploiting cross-file similarities, what other techniques do you think could reduce bandwidth?

LBFS may raise some non-network security issues. Can you suggest how to deal with these non-network security issues?

Isn't there a major overhead involved in determining the chunk boundaries and hashing for each and every FS file?

How significantly do the operations on fingerprint databases affect the performance?


Major Discussion Points

Links

Presentation Slides