Memory copies in hardware

[Posted December 7, 2005 by corbet]

Upcoming versions of Intel processors will include a feature called an "asynchronous DMA engine." Essentially, it is a hardware peripheral which can be used to quickly copy data from one memory location to another. The "I/OAT" ("I/O acceleration technology") is expected to improve performance by offloading copy operations, enabling quick in-memory scatter/gather operations, and keeping copy operations from pushing useful data out of the processor's cache.

Hardware with an I/OAT is not yet available, but a patch for I/OAT support has recently been posted. It lacks the hardware-level interface, but does demonstrate the API that the folks at Intel have come up with for this sort of device.

Code which wishes to make use of the I/OAT must first register itself as a "DMA client." The registration interface looks like:

    #include <linux/dmaengine.h>
    typedef void (*dma_event_callback)(struct dma_client *client,
                                       struct dma_chan *chan,
                                       enum dma_event_t event);

    struct dma_client *dma_async_client_register(dma_event_callback event_callback);
    void dma_async_client_unregister(struct dma_client *client);

The client must provide a callback function which will be invoked when DMA channels come and go. If all goes well, registration results in a dma_client structure which can be used with subsequent operations.

Before anything can be done, the client must request one or more "channels." Every channel on the I/OAT can be used for one copy operation at a time; all channels can be operating simultaneously. The function to request channels is:

    dma_async_client_chan_request(struct dma_client *client, 
                                  unsigned int number);

The client's callback function will be called once for each allocated channel. The number of channels actually allocated may be less than what has been requested. There is no real guidance on the optimal number of channels to ask for; the example patch for the networking subsystem requests one channel for each processor on the system. The number of channels can be changed later on if need be.
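The register-then-request flow can be sketched in user space. What follows is a mock, not the kernel code: every name prefixed with mock_ is hypothetical, the four-channel pool is an assumption, and the event names are invented for illustration. It shows only the two behaviors described above: the callback fires once per allocated channel, and a client may get fewer channels than it asked for.

```c
#include <assert.h>
#include <stddef.h>

/* User-space mock of the flow described in the article; NOT kernel code.
 * All mock_ names and the channel count are assumptions for illustration. */
enum mock_dma_event { MOCK_DMA_RESOURCE_ADDED, MOCK_DMA_RESOURCE_REMOVED };

struct mock_dma_chan { int id; };
struct mock_dma_client;

typedef void (*mock_dma_event_callback)(struct mock_dma_client *client,
                                        struct mock_dma_chan *chan,
                                        enum mock_dma_event event);

struct mock_dma_client {
    mock_dma_event_callback event_callback;
    unsigned int chans_held;    /* channels currently granted to this client */
    unsigned int events_seen;   /* bumped by example_callback() below */
};

#define MOCK_HW_CHANNELS 4      /* assumed size of the pretend device */

static struct mock_dma_chan mock_pool[MOCK_HW_CHANNELS] = { {0}, {1}, {2}, {3} };
static unsigned int mock_pool_used;

/* Record the callback; the client holds no channels until it asks for some. */
void mock_client_register(struct mock_dma_client *client,
                          mock_dma_event_callback cb)
{
    assert(client != NULL && cb != NULL);
    client->event_callback = cb;
    client->chans_held = 0;
    client->events_seen = 0;
}

/* Grant up to `number` channels, invoking the callback once per channel.
 * Returns how many were actually granted, which may be fewer than asked. */
unsigned int mock_client_chan_request(struct mock_dma_client *client,
                                      unsigned int number)
{
    unsigned int granted = 0;

    assert(client != NULL);
    while (granted < number && mock_pool_used < MOCK_HW_CHANNELS) {
        struct mock_dma_chan *chan = &mock_pool[mock_pool_used++];

        client->chans_held++;
        granted++;
        client->event_callback(client, chan, MOCK_DMA_RESOURCE_ADDED);
    }
    return granted;
}

/* Example client callback: just count the channels that show up. */
void example_callback(struct mock_dma_client *client,
                      struct mock_dma_chan *chan,
                      enum mock_dma_event event)
{
    (void)chan;
    if (event == MOCK_DMA_RESOURCE_ADDED)
        client->events_seen++;
}
```

A client that asks this mock for eight channels gets four, and its callback runs four times; real code would presumably stash each channel pointer in the callback rather than just counting.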

There are three functions for actually starting a copy operation:

    dma_cookie_t dma_async_memcpy_buf_to_buf(struct dma_chan *chan,
                                             void *dest, void *src,
                                             size_t len);
    dma_cookie_t dma_async_memcpy_buf_to_pg(struct dma_chan *chan,
                                            struct page *page,
                                            unsigned int offset,
                                            void *kdata, size_t len);
    dma_cookie_t dma_async_memcpy_pg_to_pg(struct dma_chan *chan,
                                           struct page *dest_pg,
                                           unsigned int dest_off,
                                           struct page *src_pg,
                                           unsigned int src_off,
                                           size_t len);

All three functions do the same thing: they request an asynchronous copy operation from one memory location to another. The only difference is whether kernel addresses or page structures are used to specify the locations. For some reason, it appears to be necessary to issue a call to:

    void dma_async_memcpy_issue_pending(struct dma_chan *chan);

before the operation will actually happen.
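One plausible reading of the separate "issue pending" step is that it lets a caller queue several copies and hand them to the device in one batch; the patch posting does not confirm this, so treat it as an assumption. The user-space mock below (hypothetical mock_ names, a made-up queue depth, and a plain memcpy() standing in for the hardware) shows the visible effect: the destination is untouched until the pending queue is issued.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* User-space mock; NOT the kernel implementation. Names are hypothetical. */
typedef int mock_cookie_t;

#define MOCK_QUEUE_DEPTH 8      /* assumed descriptor queue size */

struct mock_desc { void *dest; const void *src; size_t len; };

struct mock_chan {
    struct mock_desc pending[MOCK_QUEUE_DEPTH];
    unsigned int npending;
    mock_cookie_t last_issued;      /* most recent cookie handed out */
    mock_cookie_t last_completed;   /* most recent cookie whose copy is done */
};

/* Queue a copy without performing it. Returns a positive cookie,
 * or -1 if the queue is full. */
mock_cookie_t mock_memcpy_buf_to_buf(struct mock_chan *chan, void *dest,
                                     const void *src, size_t len)
{
    assert(chan != NULL);
    if (chan->npending == MOCK_QUEUE_DEPTH)
        return -1;
    chan->pending[chan->npending++] = (struct mock_desc){ dest, src, len };
    return ++chan->last_issued;
}

/* "Ring the doorbell": perform every queued copy. The mock does this
 * synchronously; real hardware would work in the background. */
void mock_memcpy_issue_pending(struct mock_chan *chan)
{
    assert(chan != NULL);
    for (unsigned int i = 0; i < chan->npending; i++)
        memcpy(chan->pending[i].dest, chan->pending[i].src,
               chan->pending[i].len);
    chan->last_completed = chan->last_issued;
    chan->npending = 0;
}

/* Returns 1 if the copy only became visible after issue_pending. */
int demo_copy_is_deferred(void)
{
    struct mock_chan chan = {0};
    char src[] = "payload";
    char dest[sizeof src] = {0};
    mock_cookie_t cookie = mock_memcpy_buf_to_buf(&chan, dest, src, sizeof src);
    int untouched = (dest[0] == '\0');

    mock_memcpy_issue_pending(&chan);
    return cookie == 1 && untouched && strcmp(dest, "payload") == 0;
}
```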

Since copy operations are asynchronous, they may not have completed when the request functions return, so the caller should not touch the affected buffers in the meantime. There are two functions for querying and waiting for completion:

    dma_async_memcpy_complete(struct dma_chan *chan, dma_cookie_t cookie,
                              dma_cookie_t *last, dma_cookie_t *used);
    dma_async_wait_for_completion(struct dma_chan *chan, 
                                  dma_cookie_t cookie);

dma_async_memcpy_complete() will return one of DMA_SUCCESS, DMA_IN_PROGRESS, or DMA_ERROR, depending on the status of the copy operation indicated by cookie (the last and used arguments can be passed as NULL; their purpose is not entirely clear to your slow editor). A call to dma_async_wait_for_completion() will wait until the given operation finishes. In the current implementation, that wait is accomplished via a busy loop calling schedule(). There is no function for canceling an outstanding operation.
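As a guess at how cookie-based completion might work, here is a user-space mock. It assumes that cookies increase monotonically on each channel, that "last" reports the most recently completed cookie, and that "used" reports the most recently issued one. None of that is confirmed by the article; all mock_ names are hypothetical, and cookie wraparound is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* User-space mock; NOT the kernel implementation. */
enum mock_status { MOCK_DMA_SUCCESS, MOCK_DMA_IN_PROGRESS, MOCK_DMA_ERROR };

typedef int mock_cookie_t;

struct mock_chan {
    mock_cookie_t completed;   /* assumed meaning of "last" */
    mock_cookie_t issued;      /* assumed meaning of "used" */
};

/* Pretend the hardware finishes one more queued copy. */
void mock_hw_tick(struct mock_chan *chan)
{
    assert(chan != NULL);
    if (chan->completed < chan->issued)
        chan->completed++;
}

/* Report the status of `cookie`; `last` and `used` may be NULL. */
enum mock_status mock_memcpy_complete(const struct mock_chan *chan,
                                      mock_cookie_t cookie,
                                      mock_cookie_t *last,
                                      mock_cookie_t *used)
{
    assert(chan != NULL);
    if (last)
        *last = chan->completed;
    if (used)
        *used = chan->issued;
    if (cookie <= 0 || cookie > chan->issued)
        return MOCK_DMA_ERROR;          /* never handed out by this channel */
    return cookie <= chan->completed ? MOCK_DMA_SUCCESS
                                     : MOCK_DMA_IN_PROGRESS;
}

/* Polling wait; mock_hw_tick() stands in for yielding the CPU. */
enum mock_status mock_wait_for_completion(struct mock_chan *chan,
                                          mock_cookie_t cookie)
{
    enum mock_status st;

    while ((st = mock_memcpy_complete(chan, cookie, NULL, NULL))
           == MOCK_DMA_IN_PROGRESS)
        mock_hw_tick(chan);
    return st;
}
```

Under these assumptions, a single pair of counters per channel is enough to answer "is my copy done?" for any cookie, which would explain why the query function can also hand those counters back to the caller.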

The initial reaction to the patch was cautiously positive. There is some concern that invoking an external device to perform copies may be sufficiently expensive that it will only be worthwhile for very large operations. There were also some requests to extend the interface to include a transformation to be performed on the data as it is copied. The current hardware does not look like it will support anything beyond a direct copy (though, since the hardware is not yet available, it is hard to be sure), but it would be nice to be able to make use of any such capabilities as they arrive. Transformations could be simple (simply zeroing a buffer, say), or complex (cryptographic operations). But they will only be available if the interface supports them.
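The worry about small copies suggests an obvious caller-side policy: offload only above some size, and use the CPU otherwise. A minimal user-space sketch of that idea follows; the 4096-byte threshold is a placeholder rather than a measured break-even point, and the offload_copy_fn hook and fake_offload() are hypothetical stand-ins for whatever actually drives the engine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical break-even size; a real value would have to be measured. */
#define COPY_OFFLOAD_THRESHOLD 4096

enum copy_path { COPY_PATH_CPU, COPY_PATH_OFFLOAD };

/* Hypothetical hook for the copy engine; returns 0 on success. */
typedef int (*offload_copy_fn)(void *dest, const void *src, size_t len);

/* Copy len bytes, using the engine only for large requests and falling
 * back to memcpy() when there is no engine or the offload fails. */
enum copy_path copy_with_threshold(void *dest, const void *src, size_t len,
                                   offload_copy_fn offload)
{
    assert(dest != NULL && src != NULL);
    if (offload != NULL && len >= COPY_OFFLOAD_THRESHOLD &&
        offload(dest, src, len) == 0)
        return COPY_PATH_OFFLOAD;

    memcpy(dest, src, len);
    return COPY_PATH_CPU;
}

/* Stand-in for the hardware: does the copy on the CPU and reports success. */
int fake_offload(void *dest, const void *src, size_t len)
{
    memcpy(dest, src, len);
    return 0;
}
```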

The hardware is due in "early 2006," so more information will become available then. Until that time, there probably will not be any serious discussion of merging the I/OAT interface.

Memory copies in hardware

Posted Dec 8, 2005 17:16 UTC (Thu) by galak (guest, #7473) [Link] (2 responses)

While the Intel hardware may not exist until 2006, a number of embedded SoC processors have had general-purpose DMA engines on them for some time, and these APIs may be useful for them.

Memory copies in hardware

Posted Dec 12, 2005 14:57 UTC (Mon) by alex (subscriber, #1355) [Link]

Indeed, in a previous life I had to implement a generic DMA api for the SH4 family of processors for use by things like IDE controllers. I even implemented a user space interface for controlling DMA for user-space apps.

Memory copies in hardware

Posted Dec 13, 2005 2:54 UTC (Tue) by jcm (guest, #18262) [Link]

ppc4xx is one example - ppc4xx_dma is in need of a reworking anyway. I've recently discovered this in implementing DMA support for MTD devices - having a generic DMA engine API would also help here.

Side note: I love it when people start talking about amazingly cool technology that Intel apparently will revolutionise the world with, when it's been done for years and years already :-)

Jon.

Memory copies in hardware

Posted Dec 8, 2005 18:49 UTC (Thu) by anamana (guest, #2787) [Link]

I'm curious about a potential hole in using these devices: the assumption that the destination copy area isn't represented in a processor cache. There are easily a couple of scenarios where this could bite you:

1) Copying incoming packet buffers to a user or other area. In general, multiple packets will come into the same memory area, so the user will have had a cache hit at one time, the DMA operation occurs, and since the cache isn't invalidated, the user gets the wrong data.

2) Peek and copy - an area is looked at to determine a value (such as an ARP cache or packet filter rule). Since the data can age, the timestamp is compared. When old, a DMA operation is used to transfer in new data, but the user hasn't invalidated the cache so therefore only gets old data.

In general, I think any copy operation has to manage the possible cache entries that cover a copy destination, and the general answer of flushing the caches determines a significant portion of the overhead of such a DMA operation - i.e. DMA is efficient when the cost of copying X-bytes <= cost of flushing all caches + cost of CPU copying of X-bytes.

Obviously, you could only allow this functionality for non-cacheable memory regions, but then the utility of this function is quite limited.

Memory copies in hardware

Posted Dec 8, 2005 23:27 UTC (Thu) by mightyduck (guest, #23760) [Link]

That whole thing reminds me of the 82258 ADMA controller I worked with 15
years ago. It was only 16 bit at that time but it did exactly what the
article describes. It could also do transformations and scatter-gather
and all that stuff. I just put the parameters into some of its registers
and let it go and it signalled the completion with an interrupt. Seems
like someone discovered that chip and thought that it would be a good
idea.

Memory copies in hardware

Posted Oct 11, 2010 16:41 UTC (Mon) by dalesmith (guest, #70573) [Link]

Is it possible to use this with user memory? Possibly by calling get_user_pages()? This doesn't seem to support scatter gather lists though. Must the memory be contiguous?