Work Queue Deadlocks

Use of Work Queues

Most network drivers use a work queue to handle network events. This is done for two reasons: (1) most of the example code available to leverage does it that way, and (2) it is easier, and a more efficient use of memory resources, to use the work queue rather than creating a dedicated task/thread to service the network.

High and Low Priority Work Queues

There are two kinds of work queue: a single, high priority work queue that is intended only to service back end interrupt processing in a semi-normal, tasking context, and the low priority work queue(s), which are similar but, as the name implies, are lower in priority and not dedicated to time-critical back end interrupt processing.

Downsides of Work Queues

There are two important downsides to the use of work queues. First, work queues are inherently non-deterministic. The delay between the point at which you schedule work and the time at which the work is performed is highly variable. That delay is due not only to strict priority scheduling but also to whatever work has been queued ahead of you.

Why do you bother to use an RTOS if you rely on non-deterministic work queues to do most of the work?

A second problem is related: Only one work queue job can be performed at a time. That job should be brief so that it can make the work queue available again for the next work queue job as soon as possible. And that job should never block waiting for resources! If the job blocks, then it blocks the entire work queue and makes the whole work queue unavailable for the duration of the wait.
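Both downsides can be put into numbers with a small model. The following is only a sketch under stated assumptions: the function name start_delay() and the tick values are hypothetical, and the model assumes one worker thread that runs jobs strictly in FIFO order. It is not code from the OS.

```c
#include <stddef.h>

/* Hypothetical model: a single worker thread runs queued jobs strictly in
 * FIFO order, one at a time.  run_ticks[i] is how long job i occupies the
 * worker, including any time it spends blocked.
 */

/* Return the delay (in ticks) before job `idx` starts running. */

unsigned long start_delay(const unsigned long *run_ticks, size_t njobs,
                          size_t idx)
{
  unsigned long delay = 0;
  size_t i;

  for (i = 0; i < idx && i < njobs; i++)
    {
      delay += run_ticks[i];  /* Every job queued ahead must finish first */
    }

  return delay;
}
```

With three brief jobs of one tick each, the third job starts after 2 ticks. If the second job instead blocks for 500 ticks, the third job starts after 501 ticks even though its own work has not changed.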

Networking on Work Queues

As mentioned, most network drivers use a work queue to handle network events. (Some are even configurable to use the high priority work queue… YIKES!) Most network operations are not really suited for execution on a work queue: networking operations can be quite extended and can also block waiting for the availability of resources. So, at a minimum, networking should never use the high priority work queue.

Deadlocks

If there is only a single instance of a work queue, then it is easy to create a deadlock if a job blocks while running on that work queue. Here is the generic work queue deadlock scenario:

  • A job runs on a work queue and waits for the availability of a resource.

  • The operation that provides that resource also runs on the same work queue.

  • But since the work queue is blocked waiting for the resource, the job that provides the resource cannot run and a deadlock results.
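The scenario above can be replayed in a small, deterministic simulation. This is a sketch only: the names job_kind and queue_completes() are hypothetical, the resource is modeled as a simple flag, and no real threads are created. The number of worker threads is a parameter so that the single-thread case can be compared with a queue served by more than one thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical job types: a consumer waits for a resource, a producer
 * provides it.  The resource is a flag that stays set once provided.
 */

enum job_kind
{
  JOB_CONSUMER,
  JOB_PRODUCER
};

/* Replay a FIFO queue of jobs on `nworkers` worker threads.  Returns true
 * if every job runs to completion, false if the queue stalls with a job
 * still blocked (a deadlock).
 */

bool queue_completes(const enum job_kind *queue, size_t njobs,
                     unsigned int nworkers)
{
  size_t next = 0;               /* Index of the next job to dispatch */
  unsigned int idle = nworkers;  /* Workers not occupied by a job */
  unsigned int waiting = 0;      /* Workers blocked inside a consumer job */
  bool resource = false;

  while (next < njobs && idle > 0)
    {
      idle--;                    /* One worker picks up the next job */

      if (queue[next] == JOB_PRODUCER)
        {
          resource = true;
          idle++;                /* The producer finishes */
          idle += waiting;       /* Blocked consumers wake up and finish */
          waiting = 0;
        }
      else if (resource)
        {
          idle++;                /* Resource is available; finish at once */
        }
      else
        {
          waiting++;             /* The worker is stuck inside this job */
        }

      next++;
    }

  return next == njobs && waiting == 0;
}
```

With the consumer queued ahead of the producer, one worker never completes the queue, while two workers do. With the producer queued first, one worker is enough, which is why this kind of deadlock depends on ordering and may only show up under load.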

IOBs

IOBs (I/O Blocks) are small I/O buffers that can be linked together in chains to efficiently buffer variable sized network packet data. This is a much more efficient use of buffering space than full packet buffers since the packet content is often much smaller than the full packet size (the MSS).
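To make the chaining idea concrete, here is a simplified sketch. The structure sketch_iob, its field names, and the 196-byte buffer size are assumptions chosen for illustration; this is not the real IOB structure, and per-IOB header overhead is ignored.

```c
#include <stddef.h>

#define SKETCH_IOB_BUFSIZE 196  /* Assumed payload bytes per buffer */

/* Hypothetical, simplified buffer: a link to the next buffer in the chain
 * plus a fixed-size data area.
 */

struct sketch_iob
{
  struct sketch_iob *next;                 /* Next buffer in the chain */
  size_t len;                              /* Bytes used in data[] */
  unsigned char data[SKETCH_IOB_BUFSIZE];  /* Fixed-size payload area */
};

/* Number of fixed-size buffers needed to hold `len` bytes (rounds up). */

size_t iobs_needed(size_t len)
{
  return (len + SKETCH_IOB_BUFSIZE - 1) / SKETCH_IOB_BUFSIZE;
}

/* Total payload held by a chain, found by walking the links. */

size_t chain_payload(const struct sketch_iob *head)
{
  size_t total = 0;

  for (; head != NULL; head = head->next)
    {
      total += head->len;
    }

  return total;
}
```

Under these assumptions a 64-byte payload occupies a single 196-byte buffer, and a 200-byte payload occupies two, instead of each consuming a buffer sized for the largest possible packet.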

The network allocates IOBs to support TCP and UDP read-ahead buffering and write buffering. Read-ahead buffering is used when TCP/UDP data is received and there is no receiver in place waiting to accept the data. In this case, the received payload is buffered in the IOB-based, read-ahead buffers. When the application next calls recv() or recvfrom(), the data will be removed from the read-ahead buffer and returned to the caller immediately.

Write-buffering refers to the similar feature on the outgoing side. When an application calls send() or sendto() and the driver is not available to accept the new packet data, the data is buffered in IOBs in the write buffer chain. When the network driver is finally available to take more data, packet data is removed from the write buffer and provided to the driver.

The IOBs are allocated with a fixed size. A fixed number of IOBs are pre-allocated when the system starts. If the network runs out of IOBs, additional IOBs will not be allocated dynamically. Instead, the IOB allocator, iob_alloc(), will block until an IOB is finally returned to the pool of free IOBs. There is also a non-blocking IOB allocator, iob_tryalloc().
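The behavior of a fixed, pre-allocated pool can be sketched in a few lines. Everything here is hypothetical (the names pool_init(), pool_tryalloc(), pool_free(), the pool size of 4, and the 64-byte buffers) and the sketch is not thread safe. It models only the non-blocking path; a blocking allocator would wait, for example on a semaphore, at the point where this one returns NULL.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_NIOBS 4  /* Assumed number of pre-allocated buffers */

struct pool_iob
{
  struct pool_iob *next;   /* Free-list link */
  unsigned char data[64];  /* Assumed fixed buffer size */
};

static struct pool_iob g_pool[POOL_NIOBS];  /* All storage, fixed at build */
static struct pool_iob *g_free;             /* Head of the free list */

/* Put every pre-allocated buffer on the free list. */

void pool_init(void)
{
  size_t i;

  g_free = NULL;
  for (i = 0; i < POOL_NIOBS; i++)
    {
      g_pool[i].next = g_free;
      g_free = &g_pool[i];
    }
}

/* Non-blocking allocation: returns NULL when the pool is exhausted.
 * The pool never grows.
 */

struct pool_iob *pool_tryalloc(void)
{
  struct pool_iob *iob = g_free;

  if (iob != NULL)
    {
      g_free = iob->next;
      iob->next = NULL;
    }

  return iob;
}

/* Return a buffer to the free list so that it can be allocated again. */

void pool_free(struct pool_iob *iob)
{
  assert(iob != NULL);
  iob->next = g_free;
  g_free = iob;
}
```

After pool_init(), four calls to pool_tryalloc() succeed and the fifth returns NULL. Only a call to pool_free() makes another allocation possible, which is exactly the dependency that turns into a deadlock when the freeing logic cannot run.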

Under conditions of high utilization, such as sending large amounts of data at high rates or receiving large amounts of data at high rates, it is inevitable that the system will run out of pre-allocated IOBs. For read-ahead buffering, the packets are simply dropped in this case. For TCP this means that there will be a subsequent timeout on the remote peer because no ACK will be received and the remote peer will eventually re-transmit the packet. UDP is a lossy transfer and handling of lost or dropped datagrams must be included in any UDP design.

For write-buffering, there are three possible behaviors that can occur when the IOB pool has been exhausted. First, if there are no available IOBs at the beginning of a send() or sendto() transfer, then the operation will block until IOBs are again available if O_NONBLOCK is not selected. This delay can be a substantial amount of time.

Second, if O_NONBLOCK is selected, the send will, of course, return immediately, failing with errno set to EAGAIN if we cannot allocate the first IOB for the transfer.

The third behavior occurs if we run out of IOBs in the middle of the transfer. Then the send operation will not wait but will instead send the number of bytes that it has successfully buffered. Applications should always check the return value from send() or sendto(). If it is a byte count less than the requested transfer size, then the send function should be called again with the remaining data.
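A caller can handle the short-count case with a small loop around send(). The helper below is a sketch: the name send_all() and its return convention are assumptions made for this example, and it uses only the standard POSIX send() call.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: keep calling send() until all `len` bytes have been
 * accepted or an error stops the transfer.
 *
 * Returns the number of bytes accepted so far, or -1 if nothing could be
 * sent at all (errno is then left as set by send(), e.g. EAGAIN on an
 * O_NONBLOCK socket).
 */

ssize_t send_all(int sockfd, const void *buf, size_t len)
{
  const char *ptr = buf;
  size_t sent = 0;

  while (sent < len)
    {
      ssize_t ret = send(sockfd, ptr + sent, len - sent, 0);

      if (ret < 0)
        {
          if (errno == EINTR)
            {
              continue;  /* Interrupted by a signal; just retry */
            }

          /* EAGAIN or a real error: report whatever progress was made */

          return sent > 0 ? (ssize_t)sent : -1;
        }

      if (ret == 0)
        {
          break;  /* No progress; avoid spinning forever */
        }

      sent += (size_t)ret;  /* Short count: loop to send the remainder */
    }

  return (ssize_t)sent;
}
```

On a blocking socket the loop simply continues after a short count. On an O_NONBLOCK socket the caller would typically wait for the socket to become writable, for example with poll(), before calling send_all() again with the unsent remainder.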

The blocking iob_alloc() call is also a common cause of work queue deadlocks. The scenario again is:

  • Some logic in the OS runs on a work queue and blocks waiting for an IOB to become available.

  • The logic that releases the IOB also runs on the same work queue.

  • The logic that releases the IOB cannot execute, however, because the other job is blocked waiting for the IOB on the same work queue.

Alternatives to Work Queues

To avoid network deadlocks, here is the rule: Never run the network on a singleton work queue!

Yet most network implementations do exactly that! Here are a couple of alternatives:

  1. Use Multiple Low Priority Work Queues. Unlike the high priority work queue, the low priority work queues utilize a thread pool. The number of threads in the pool is controlled by CONFIG_SCHED_LPNTHREADS. If CONFIG_SCHED_LPNTHREADS is greater than one, then such deadlocks should not be possible: In that case, if a thread is busy with some other job (even if it is only waiting for a resource), then the job will be assigned to a different thread and the deadlock will be broken. The cost of the additional low priority work queue thread is primarily the memory set aside for the thread’s stack.

  2. Use a Dedicated Network Thread. The best solution would be to write a custom kernel thread to handle driver network operations. This would be the highest performing and the most manageable. It would also, however, be substantially more work.

  3. Interactions with Network Locks. The network lock is a re-entrant mutex that enforces mutually exclusive access to the network. The network lock can also cause deadlocks and can interact with the work queues to degrade performance. Consider this scenario:

    • Some network logic, perhaps running on the application thread, takes the network lock and then waits for an IOB to become available (on the application thread, not a work queue).

    • Some network related event runs on the work queue but is blocked waiting for the network lock.

    • Another job is queued behind that network job. This is the one that provides the IOB, but it cannot run because the work queue thread is still blocked in the earlier job, waiting for the network lock.

    But the network will not be unlocked because the application logic holds the network lock and is waiting for the IOB which can never be released.

    Within the network, this deadlock condition is avoided using a special function net_ioballoc(). net_ioballoc() is a wrapper around the blocking iob_alloc() that momentarily releases the network lock while waiting for the IOB to become available.

    Similarly, the network functions net_lockedwait() and net_timedwait() are wrappers around nxsem_wait() and nxsem_timedwait(), respectively, and also release the network lock for the duration of the wait.

    Caution should be used with any of these wrapper functions. Because the network lock is relinquished during the wait, there could be changes in the network state that occur before the lock is recovered. Your design should account for this possibility.
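The release-while-waiting pattern, and the caution that goes with it, can be sketched with a single-threaded model. All names here (relock, wait_unlocked(), other_party_runs(), quiet_wait()) are hypothetical and this is not how the network wrappers are implemented. The generation counter is just one assumed way for a caller to notice that state changed while the lock was released.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-threaded model of a re-entrant lock. */

struct relock
{
  unsigned int count;       /* Nesting depth held by the owner; 0 = free */
  unsigned int generation;  /* Bumped whenever protected state changes */
};

void relock_take(struct relock *lk)
{
  lk->count++;
}

void relock_give(struct relock *lk)
{
  assert(lk->count > 0);
  lk->count--;
}

/* Wait with the lock fully released, then take it back at the original
 * nesting depth.  wait_fn stands in for the blocking wait.  Returns true
 * if the protected state changed while the lock was released.
 */

bool wait_unlocked(struct relock *lk, void (*wait_fn)(struct relock *))
{
  unsigned int saved = lk->count;
  unsigned int before = lk->generation;

  lk->count = 0;      /* Release every nesting level so others can run */
  wait_fn(lk);        /* Stand-in for waiting on the resource */
  lk->count = saved;  /* Recover the lock at the original depth */

  return lk->generation != before;
}

/* Stand-in for another party that runs during the wait: it can take the
 * lock only because it was released, and it changes the protected state.
 */

void other_party_runs(struct relock *lk)
{
  assert(lk->count == 0);
  relock_take(lk);
  lk->generation++;
  relock_give(lk);
}

/* Stand-in for a wait during which nothing else touches the state. */

void quiet_wait(struct relock *lk)
{
  (void)lk;
}
```

A caller holding the lock at depth two gets it back at depth two after wait_unlocked(), and a true return value signals that anything it cached before the wait must be re-validated. In a real multi-threaded system, recovering the lock could itself block until the lock becomes free.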