"Lazy Asynchronous I/O for Event-Driven Servers", by A. Chanda, A. Cox, K. Elmeleegy, W. Zwaenepoel

Questions

Does the mechanism of setting errno to EINPROGRESS not lead to code bloat, where you frequently find statements like if (errno == EINPROGRESS) { do something } else { do something else }? Is it even good style to call something being in progress an "error"? (That's admittedly a nitpick, but this seems like a hack to me.)

Another nitpick: Wouldn't it have been simpler to write "wrappers" around existing syscalls that take a parameter to identify whether they run in async mode or not, instead of *duplicating* all existing syscalls in a parallel API? What would have been the disadvantages of doing it this way? (Breaking old code is an obvious one, but something as simple as a default parameter in C could solve that I think.)

In 9.2.1, the authors use a LOC metric to demonstrate differences in code complexity across various implementations. Is the distribution of these lines not also important? For instance, I'd rather have 100 LOC inserted into one function call than 20 new lines of code scattered across multiple files.

The "lazy" part of their I/O operations refers to the strategy where an asynchronous operation is not used if the operation will not block. But why is being lazy necessary? I do not agree with this purely single-threaded approach. I think that, to balance programming complexity against exploiting available system resources, we should use both threads and events. What do you think?

The LAIO library was developed originally for web servers. I wonder what other services can also benefit from LAIO? Could you give some specific examples?

What is a continuation? Normally, how do AIO and LAIO create one?

I am not sure whether LAIO can avoid blocking entirely, as the paper says. What do you think?

How does the performance of an event-driven HTTP server with LAIO
compare with the Knot server with Capriccio? Or better, how does LAIO
compare to epoll in Linux?

In a high-concurrency system, the huge number of LAIO calls would
generate an overwhelming number of scheduler activations. Wouldn't it
be more efficient to provide non-blocking I/O and system calls from the OS
directly?

Is the primary benefit of LAIO performance? That it only creates a continuation if the underlying I/O blocks? It seems we have to create extra code to check for block / no-block and *still* provide the continuation anyway (in case we block).

There is an underlying assumption that this is a single threaded application and that no other thread can pre-empt and set errno (which is a system wide global!). Does this really hold? Is it not possible that the background I/O that is already queued may also set errno? This seems like a huge potential area for failure of the system by improperly accounting for continuation of a function.

It seems that the performance of Flash-LAIO-LAIO is better than
Flash-NB-AMPED due to better utilization of the disk. The paper doesn't
explain how this happens.

Why is it necessary to be "Lazy"?

Why do the evaluation results in Figures 7 and 8 differ from each other?
For example, while Flash-LAIO-LAIO-warm is higher than Flash-NB-B-warm in Figure 7 (a), Flash-LAIO-LAIO-warm is lower than Flash-NB-B-warm in Figure 8 (a).

Which parts of Figures 1 to 6 show that LAIO reduces coding complexity?
Section 9.2.1 describes the cases in which using LAIO reduces programming effort. However, I am still looking for the parts of the six figures that show those cases.

Do we consider completion notification better than partial completion notification, especially since some algorithms benefit from getting some data to process while waiting? (Ease of programming vs. efficiency)

Can we consider the number of lines to be a good indication of programming complexity? (What if the programmer is more experienced, or the shorter code is harder to read?)

The paper states that LAIO is general, in that it can be used to call all system calls, and that it does not create a continuation for operations that return immediately. But don't you think the latter feature is actually quite essential for generality? (I mean, an operation might generally not block, so issuing it as an LAIO operation is not strictly required and might even decrease performance because of the added overhead; hence the laziness requirement.)

Why does AIO support only read/write operations? Why couldn't they initially support other operations like open(), stat()? Or what has been done in LAIO to actually support these calls?

Is there a way to enforce the requirement placed on a background laio_syscall(), that any argument buffers are not modified until it returns?

Is the requirement for kernel support of scheduler activations reasonable?

Given the degree to which scheduler activations do the work here, how much of a contribution should we credit the authors with?

Regarding the comparison of complexity based on how many lines of code were changed, is that really a valid basis of comparison? I mean, sometimes 10 lines of code could be more painful to write than 100 lines you just copied and pasted from another section.

Regarding the lazy generation of continuations, would it not improve performance if some sort of heuristic could be used to predict which calls are most likely to require a continuation, and generate those beforehand instead of having to wait to see whether a call requires one?

The authors mention that LAIO requires support for scheduler activations from the kernel. Is this necessary? Can LAIO be implemented without scheduler activations?

Using LAIO does not allow progress of the blocking calls to be monitored or partial results to be shown (since only one event is sent back up when the operation is completed). Is this acceptable?

Facts: In Section 2 the authors say that whenever a background laio_syscall() occurs, laio_gethandle() returns a handle identifying it, and the handle always refers to the last background laio_syscall(). laio_poll() returns a set of LAIO completion objects, one per completed background laio_syscall().
Question: Does this mean that laio_gethandle() constantly has to be called to return the handle of the last laio_syscall() that has not yet received an LAIO completion object? However, in the event loop of Figure 1, there are no calls to laio_gethandle(). Why aren't there any?
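
One possible reading of the interface described above, as a toy Python simulation rather than the real library: the handler that issues the call asks for the handle only when the call goes to the background, and the event loop only ever calls laio_poll(). Everything here is an assumption made for illustration: the ToyLaio class, the would_block flag that stands in for the kernel deciding whether a call blocks, and the run() driver are hypothetical, and the sketch does not reproduce Figure 1 or any LAIO code.

```python
import errno
from collections import deque


class ToyLaio:
    """In-memory stand-in for the three LAIO calls; no kernel involved."""

    def __init__(self):
        self.errno = 0             # stand-in for the C errno variable
        self._next_handle = 1
        self._last_handle = None   # handle of the most recent background call
        self._pending = deque()    # (handle, result) of background calls

    def laio_syscall(self, result, would_block):
        """Return result at once, or -1 with errno set to EINPROGRESS."""
        if not would_block:
            self.errno = 0
            return result
        self._last_handle = self._next_handle
        self._next_handle += 1
        self._pending.append((self._last_handle, result))
        self.errno = errno.EINPROGRESS
        return -1

    def laio_gethandle(self):
        """Handle of the last call that went to the background."""
        return self._last_handle

    def laio_poll(self):
        """Return one (handle, result) pair per completed background call."""
        done = list(self._pending)
        self._pending.clear()
        return done


def run(requests):
    """Issue each (name, result, would_block) request, then poll once."""
    laio = ToyLaio()
    continuations = {}  # handle -> callback, created only when needed
    log = []

    def handler(name, result, would_block):
        ret = laio.laio_syscall(result, would_block)
        if ret == -1 and laio.errno == errno.EINPROGRESS:
            # Lazy part: the continuation exists only on this path.
            handle = laio.laio_gethandle()
            continuations[handle] = lambda r: log.append((name, r, "continued"))
            return
        log.append((name, ret, "immediate"))

    for name, result, would_block in requests:
        handler(name, result, would_block)

    # Event loop: match each completion to its stored continuation.
    for handle, result in laio.laio_poll():
        continuations.pop(handle)(result)
    return log
```

For example, run([("a", 10, False), ("b", 20, True)]) logs "a" as immediate and "b" as continued; only "b" ever needed a handle or a continuation.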

This question is about eventp which first appears in Figure 1. Is that a structure that holds all event objects? But why is it then disabled at the end of the event loop?

I am not sure what laio really does, but according to the authors it is better than other asynchronous interfaces. Would you happen to know if it is becoming common? Would you say that processor "grabbing" of scheduler activations could be a problem of their design or not?

It's interesting the way Khaled compares code complexity by measuring the number of lines of code. LOC counts are generally not a reliable method of comparing code complexity, so why does he do this? Does this add anything to the argument?

I like the simplicity of this approach, but why has no one else noticed or been able to address the problem of certain system calls always blocking (like opening files)? Seems odd that we're only attempting to fix this in 2004.

What are the improved features of LAIO compared to the other forms of I/O processing (non-blocking, AIO, AMPED,..)?

Do you think that counting the affected lines of code within one sample server like the Flash web server is enough to conclude that LAIO has programming advantages over the others (non-blocking I/O, AMPED, ...)?

How does the LAIO implementation of Flash have a lower disk usage than the AMPED implementation for the same traces? How does LAIO help in reducing the 'amount' of disk I/O?


How would the LAIO interface be implemented on an OS without scheduler activation support?

How many hours of your life have you lost debugging non-blocking I/O code? Are there applications where you would still prefer to use NB I/O even when LAIO can provide cleaner code? Multimedia applications?

I can understand LAIO matching (or almost matching) the performance of non-blocking I/O. But I still can't explain how it can beat out non-blocking I/O by such a large factor in some of the tests (Figure 10, Berkeley; Figure 11, Berkeley). Is it because of slightly better memory locality of the application because of the LAIO memory access patterns?

The Windows NT kernel uses an asynchronous notification mechanism called an asynchronous procedure call (APC) to notify the application's thread of an I/O operation's completion. Is LAIO a special case of APC?

LAIO overcomes the limitations of previous I/O mechanisms, both in terms of ease of programming and performance. Are there any side effects from LAIO?

While LAIO reduces the number of possible events one needs to handle (events only for fully-completed syscalls, not partial completion?), I don't see how it would simplify event-driven programming any further than that.

i.e., the fundamental problem of having to split up "tasks" into multiple functions remains?

Section 9.2.2 Berkeley workload: They claim that LAIO transferred less disk data than NB-AMPED. Why?

(I would expect a similar amount of data transferred to/from disk, perhaps at a higher transfer rate)

Lazy I/O on disk reads:
I will base my question on an assumption: I'd assume that disk reads would always block (the data will not be ready in the buffer when read is called, since a disk access is required). Why shouldn't we use AIO for disk reads only and use LAIO for the rest (e.g., network I/O and disk writes)? AIO performance is slightly better (by a factor of about 1.08) when data is not available, according to the microbenchmark presented in Section 7.

Is the advantage of lazy creation of threads used in the schedulers of LAIO? In other words, is there a scheduler that can somehow guess which calls will not block and give those threads a lower priority?

Does laio_syscall() save the context of the calling thread on each
invocation, or only when it knows it is going to block?

Obviously, there would be cases where the time taken to complete a
blocking call would be smaller than going through the LAIO process:
creating a new thread in the kernel and notifying the blocked
application (a scheduler activation). Is this why non-blocking I/O
performed better than LAIO in the microbenchmarks (reading a byte from
a pipe)?
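
For reference when discussing the pipe microbenchmark, here is a minimal sketch of what a non-blocking one-byte pipe read looks like from user code. It uses plain POSIX non-blocking I/O through Python's os module. It is not LAIO and not the paper's benchmark, and the helper names read_byte_nonblocking and demo are made up for this page. The point is only that, when no data is ready, the call returns an error straight away and the kernel creates no thread and makes no upcall.

```python
import errno
import os


def read_byte_nonblocking(fd):
    """Try to read one byte; return None instead of blocking."""
    try:
        return os.read(fd, 1)
    except BlockingIOError as exc:
        # The kernel reports "no data yet" as EAGAIN / EWOULDBLOCK.
        assert exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK)
        return None


def demo():
    """Read from an empty pipe, then again after one byte is written."""
    r, w = os.pipe()
    os.set_blocking(r, False)  # put the read end in non-blocking mode
    try:
        first = read_byte_nonblocking(r)   # pipe is empty: would block
        os.write(w, b"x")
        second = read_byte_nonblocking(r)  # one byte is now available
        return first, second
    finally:
        os.close(r)
        os.close(w)
```

Calling demo() returns a pair: the result of the read on the empty pipe, then the result of the read after the write.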


Major Discussion Points

Links

Presentation Slides

LAIO web page