
The return of RWF_UNCACHED

By Jonathan Corbet
December 4, 2024
Linux offers two broad ways of performing I/O to files. Buffered I/O, which is the usual way of accessing a file, stores a copy of the transferred data in the kernel's page cache to speed future accesses. Direct I/O, instead, moves data directly between the storage device and a user-space buffer, avoiding the page cache. Both modes have their advantages and disadvantages. In 2019, Jens Axboe proposed an uncached buffered mode to get some of the advantages of both, but that effort stalled at the time. Now, uncached buffered I/O is back with some impressive performance results behind it.

By saving data in the page cache, buffered I/O can accelerate many I/O operations. Cached data need not be reread from the storage device, and multiple write operations can be combined in the cache, reducing the number of writes back to persistent storage. But that caching comes at a cost; the page cache is typically the largest user of memory on a Linux system, and the CPU must spend time copying data to and from the cache. Direct I/O avoids this memory use and copying, but it is inherently synchronous, adds complexity to a program, and provides a number of inconvenient pitfalls, especially in cases where a file is accessed concurrently. Developers normally only reach for direct I/O if they really need it.
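
As a rough illustration of what the direct path demands of the programmer, a minimal direct-I/O read looks something like the sketch below; the file name and the 4096-byte alignment are assumptions for the example, and getting the alignment of the buffer, offset, and length wrong is exactly the kind of pitfall that keeps many developers away from O_DIRECT.

    /* Minimal sketch of a direct-I/O read; the alignment value is
     * illustrative and depends on the underlying device and filesystem. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        void *buf;
        int fd = open("data.bin", O_RDONLY | O_DIRECT);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        /* O_DIRECT requires the buffer, file offset, and length to be
         * suitably aligned; posix_memalign() handles the buffer part. */
        if (posix_memalign(&buf, 4096, 65536)) {
            close(fd);
            return 1;
        }
        ssize_t n = read(fd, buf, 65536);   /* bypasses the page cache */
        if (n < 0)
            perror("read");
        free(buf);
        close(fd);
        return 0;
    }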

Still, as Axboe describes in the patch-set cover letter, users are often driven toward direct I/O, despite its challenges. That pressure is especially acute in cases where the data being transferred will not be needed again. Storing unneeded data in the page cache costs memory, but the problem is worse than that. Even though once-accessed data is put on the kernel's inactive list, meaning that it will be the first to be reclaimed when free memory runs low, the kernel must still make the effort to process that list and reclaim pages from it. With the right sort of I/O load (randomly accessing a set of files much larger than the system's RAM, for example), much of the available CPU time can be taken by the kernel's kswapd threads, which are working simply to reclaim memory from the page cache.

The solution that he came up with in 2019 was to add a new flag, RWF_UNCACHED, for the preadv2() and pwritev2() system calls. When that flag is present, those calls will perform I/O through the page cache as usual, with one exception: once the operation is complete, the relevant pages are immediately deleted from the page cache, making that memory available to the system without the need to go through the reclaim process. In 2019, the work then wandered into an attempt to avoid the page cache entirely, to get closer to direct-I/O performance, before coming to a stop.
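
As a sketch of how a program would request that behavior (the file name here is illustrative, and the flag only takes effect on kernels and C-library headers that actually define RWF_UNCACHED, so the example compiles it conditionally):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    int main(void)
    {
        char data[] = "a record that will not be read back any time soon\n";
        struct iovec iov = { .iov_base = data, .iov_len = strlen(data) };
        int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);

        if (fd < 0) {
            perror("open");
            return 1;
        }
    #ifdef RWF_UNCACHED
        /* Buffered write, but the pages are dropped from the page cache
         * once writeback completes. */
        if (pwritev2(fd, &iov, 1, -1, RWF_UNCACHED) < 0)
            perror("pwritev2");
    #else
        /* Headers predate the feature: fall back to a plain buffered write. */
        if (pwritev2(fd, &iov, 1, -1, 0) < 0)
            perror("pwritev2");
    #endif
        close(fd);
        return 0;
    }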


The new series picks things up again, returning to transferring data by way of the page cache and removing it afterward. For read operations, the data will be removed from the page cache as soon as it is copied into the user-space buffer (with the exception that, if it was already resident in the page cache prior to the operation, it will be left there afterward). Going through the page cache in this way avoids the coherency pitfalls that come with using direct I/O.

Writes work like buffered writes do now; the data will be written to the page cache, and the pages will be marked for eventual writeback to persistent storage. Once that writeback completes, the pages will be removed from the page cache (except, again, in cases where they were resident there prior to the operation starting). In the meantime, though, multiple writes can be combined into a single writeback operation, maintaining I/O performance.
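
For comparison, applications that want something like this behavior today typically approximate it by hand; a rough sketch follows (the helper name is made up for illustration, and, unlike the proposed flag, this version blocks until writeback finishes and gives up the chance to batch several writes into a single writeback pass):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Approximation of an uncached write with existing interfaces:
     * buffered write, kick off writeback and wait for it, then advise
     * the kernel to drop the now-clean pages. */
    static int write_uncached_approx(int fd, const void *buf, size_t len,
                                     off_t off)
    {
        ssize_t n = pwrite(fd, buf, len, off);

        if (n < 0)
            return -1;
        if (sync_file_range(fd, off, n, SYNC_FILE_RANGE_WAIT_BEFORE |
                            SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER))
            return -1;
        /* Purely advisory; the kernel may keep the pages anyway. */
        return posix_fadvise(fd, off, n, POSIX_FADV_DONTNEED) ? -1 : 0;
    }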

Since the last time around, the kernel's file-I/O infrastructure has improved somewhat, to the point that much of the work of supporting RWF_UNCACHED can be performed in the kernel's iomap layer. Filesystems that use iomap fully will get RWF_UNCACHED support almost for free. Filesystems that use less generic code, including ext4, require a bit more work. The patch series includes the needed changes for ext4; XFS and Btrfs are supported as well.

The effect of these changes can be seen in the associated benchmark results. For the read side, Axboe included results showing how, in the absence of RWF_UNCACHED, a system performing a lot of random reads will bog down once memory fills and reclaim begins. At that point, nearly 28 cores on the 32-core test system are busy just running kswapd full time. With RWF_UNCACHED, that bogging-down no longer happens, and kswapd does not appear among the top CPU-using processes at all. In summary: "Not only is performance 65% better, it's also using half the CPU to do it". The write-side results are similar.

Most of the responses to this work have been positive; XFS developer Darrick Wong, for example, said that "there's plenty of places where this could be useful to me personally". Dave Chinner (also an XFS developer) is less convinced, though. He argued that, rather than adding a new flag for preadv2() and pwritev2(), Axboe should add the desired behavior to existing features in the kernel. Specifically, he said, the POSIX_FADV_NOREUSE flag to posix_fadvise() is meant to provide that functionality. Axboe, though, disagreed, saying that it is better to specify the desired behavior on each I/O operation than as an attribute of an open file:

Per-file settings is fine for sync IO, for anything async per-io is the way to go. It's why we have things like RWF_NOWAIT as well, where O_NONBLOCK exists too. I'd argue that RWF_NOWAIT should always have been a thing, and O_NONBLOCK is a mistake. That's why RWF_UNCACHED exists.

In other words, the O_NONBLOCK flag to open() puts the election of non-blocking behavior in the wrong place. Rather than attaching that behavior to the file descriptor, it should be selected for specific operations. RWF_UNCACHED is a way to easily get that asynchronous behavior when needed.
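
For reference, the per-file alternative Chinner pointed to looks roughly like the sketch below; note that POSIX_FADV_NOREUSE was a no-op on Linux for many years and only gained a real effect relatively recently, so its behavior should be treated as version-dependent.

    #include <fcntl.h>

    /* Per-file advice: data accessed through this descriptor is expected
     * to be used only once.  The hint applies to all subsequent I/O on
     * the descriptor, which is the per-file granularity Axboe objects to. */
    static int open_noreuse(const char *path)
    {
        int fd = open(path, O_RDONLY);

        if (fd >= 0)
            posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);
        return fd;
    }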

The discussion has since wound down for now, doubtless to be revived once the next version of the series is posted to the mailing lists. There would appear to be enough interest in this feature, though, to justify its merging into the mainline. It is too late to put uncached buffered I/O support into 6.13, but the chances of it showing up in 6.14 seem reasonably good.

Index entries for this article
Kernel: Asynchronous I/O
Kernel: Memory management/Page cache


Race?

Posted Dec 4, 2024 16:15 UTC (Wed) by quotemstr (subscriber, #45331) [Link] (5 responses)

> with the exception that, if it was already resident in the page cache prior to the operation, it will be left there afterward

This logic sounds racy. What if reader A starts an unbuffered read, and before this read completes, reader B begins a buffered read? The two reads complete at the same time. We don't want to remove the page from cache in this case: someone expressed an interest in reading it in buffered mode.

Race?

Posted Dec 4, 2024 16:57 UTC (Wed) by axboe (subscriber, #904) [Link] (1 responses)

It's inherently racy if you have competing IO. But what'll happen for this case is that the invalidation will fail, and the page will persist in cache. Uncached isn't a hard promise - it'll always attempt to remove the page(s), but if there are competing users it may fail. And that's fine too, if you occasionally race and don't prune the page. The goal here is to avoid excessive reclaim activity, and having a few pages here or there escape the invalidation won't really change the overall effectiveness of it.

Race?

Posted Dec 4, 2024 17:07 UTC (Wed) by quotemstr (subscriber, #45331) [Link]

Sure. I'm just pointing out that the text of the article suggests that we resolve the race by removing the page from the cache, while the better behavior seems to be keeping the page in cache, as you describe.

Race?

Posted Dec 4, 2024 17:06 UTC (Wed) by andresfreund (subscriber, #69562) [Link] (2 responses)

It seems a fairly inconsequential race. The page cache is an optimization, with a fairly, um, heuristic replacement algorithm. The consequences of a concurrent uncached read/write and a cached read ending up not caching the page seem fairly harmless, given that it's hard to believe it'd be a common occurrence.

Race?

Posted Dec 4, 2024 17:10 UTC (Wed) by axboe (subscriber, #904) [Link] (1 responses)

Exactly, the outcome doesn't really matter; it's supposed to be a (very) rare occurrence. If you have this happening all the time, you're doing something wrong.

Race?

Posted Dec 4, 2024 17:44 UTC (Wed) by Wol (subscriber, #4433) [Link]

IIRC the previous time this came up, a major use was when copying, so if you have competing accesses either (a) you're doing something wrong, or (b) the other process isn't copying, and so the best behaviour for it is NOT the same as the best for you.

Given that istr the speed-up for a large copy could be measured in orders of magnitude, the odd "hey it didn't do what I expected" is a price worth paying.

Cheers,
Wol

Hybrid IO

Posted Dec 4, 2024 19:21 UTC (Wed) by Paf (subscriber, #91811) [Link]

This is very similar to something I've recently added to the Lustre file system, which is unaligned direct IO and hybrid IO, but we come at the problem from different directions. Both your approach and the Lustre approach make buffered IO act more like direct IO, but you modified the page cache to be more direct IO-like, and I modified direct IO to have some aspects of the page cache approach.

Here's an actual explanation:

We found the page allocation portion of the page cache - the locking - and the additional setup we needed to do were extremely costly and the main driver of slow IO, because they couldn't be done in parallel. We actually split creating IO from userspace across multiple threads, which is great for direct IO, but that gets us little benefit when faced with page cache locking. You just pile up on the xarray lock for the mapping.

So how to avoid that? I created a version of direct IO which optionally copies to a bounce buffer (which exists only during the read() or write() call), to allow supporting unaligned IO. Then the direct IO path can accept any IO.

We use that to hybridize the buffered IO path - large buffered IO goes through the unaligned direct IO path. The work is split internally to multiple threads, which can do the bounce buffer allocation and data copying in parallel. We can get about 40 GiB/s from a single userspace thread doing large IOs, but if they do small writes or small reads, it falls back to the page cache.

Obviously you lose the page cache for those larger IO, but we found observationally that large read/write IO is very rarely accessed again in the page cache.

You can see the various resemblances here, but in our case we use direct IO as the basis for our implementation, and - at least for us - it's much faster than a flush-after-fill approach. (It's also maybe a bit more complicated, which is a downside.)

There's a moderately technical presentation on it here, that mostly leaves out the parallel part:
https://wiki.lustre.org/images/a/a0/LUG2024-Hybrid_IO_Pat...
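
For readers who want a concrete picture, a highly simplified sketch of the bounce-buffer idea follows (this is not Lustre code; the 4096-byte alignment and the zero-padding strategy are illustrative assumptions, and a real implementation splits the work across threads and deals with lengths that are not multiples of the alignment):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Accept an arbitrarily aligned user buffer, copy it into an aligned
     * bounce buffer that lives only for the duration of the call, and
     * write it out through an O_DIRECT descriptor. */
    static ssize_t bounce_direct_write(int dio_fd, const void *ubuf,
                                       size_t len, off_t off)
    {
        void *bounce;
        size_t padded = (len + 4095) & ~(size_t)4095;
        ssize_t ret;

        if (posix_memalign(&bounce, 4096, padded))
            return -1;
        memcpy(bounce, ubuf, len);
        memset((char *)bounce + len, 0, padded - len);

        ret = pwrite(dio_fd, bounce, padded, off);
        free(bounce);
        return ret;
    }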

useful for databases

Posted Dec 4, 2024 20:47 UTC (Wed) by mokki (subscriber, #33200) [Link]

PostgreSQL writes WAL via normal writes, but does not read it unless there is a crash. So it should use this new flag if supported by the kernel.

Similarly, after a crash or failover, the RDBMS reads in all the WAL files once; that should be an uncached read so they do not pollute the kernel buffer cache, leaving more space for the actually useful data.

On the other hand, the situation can be more complex, and the user should decide when to use uncached operations. For example, restoring a database backup is done only once in production.
But a test system might be reset back to a known state every few minutes and benefits from the cached backup.

TTL

Posted Dec 4, 2024 22:24 UTC (Wed) by HIGHGuY (subscriber, #62277) [Link] (1 responses)

It would have been cool if the user could specify the flag as a TTL; two bits would be enough:
- 00: normal behavior, nice backwards compat
- 01: keep around for 2 subsequent accesses
- 10: keep around for 1 subsequent access
- 11: current flag behavior, drop immediately

Consider the compiler setting this flag on an object file so the data is dropped after the linker has made its pass. It's a bad example, because you relink a file more often than you recompile the source, but the idea should be clear…

TTL

Posted Jan 16, 2025 17:29 UTC (Thu) by Spudd86 (guest, #51683) [Link]

The kernel isn't storing anything to implement this, everything is happening before the return of the syscall. A TTL would require adding extra data to the page cache. That would have lots of overhead, and require that the TTL be checked all over the place.

It would likely be bad for performance. Plus, as the one issuing the I/O, you have no idea what else might be using the file; this is a hint about how the program doing the I/O expects to use it, not a system-wide policy.

must have feature

Posted Dec 5, 2024 9:56 UTC (Thu) by amarao (subscriber, #87073) [Link]

Every backup application should support this mode. When the data are larger than memory, it is obvious that they will be evicted before reuse, so there is no point in buffering them. The same goes for any other one-time 'larger than memory' operation, such as copying a big file. Basically, if software is doing something once with a too-big file, it should set this flag.

Synchronous

Posted Dec 5, 2024 14:49 UTC (Thu) by Sesse (subscriber, #53779) [Link] (7 responses)

Why is direct I/O inherently synchronous? I'm not sure I understand why.

Synchronous

Posted Dec 5, 2024 15:10 UTC (Thu) by Wol (subscriber, #4433) [Link]

I guess because, with no buffer, you don't want the OS to be optimising your access order and moving a read in front of a write ...

Cheers,
Wol

Synchronous

Posted Dec 5, 2024 15:25 UTC (Thu) by corbet (editor, #1) [Link] (5 responses)

Direct I/O is inherently synchronous because read() and write() are inherently synchronous; that's how Unix was designed. A buffered write is synchronous in that, at the completion of the call, the data has been copied out of your buffer and you can safely put new data there. A direct write has to provide the same property; there is no way other than the completion of the write() call to know that the operation is done.

Now, of course, you can use io_uring to make it all asynchronous, but that adds significantly to the complexity of the whole operation.
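
As a rough illustration of that asynchronous pattern, a minimal liburing sketch follows (the file name, buffer size, and alignment are illustrative, and most error handling is omitted):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        void *buf;
        int fd = open("data.bin", O_RDONLY | O_DIRECT);

        if (fd < 0 || posix_memalign(&buf, 4096, 4096))
            return 1;
        io_uring_queue_init(8, &ring, 0);

        /* Queue the read; submission returns immediately and the I/O
         * proceeds in the background. */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, 4096, 0);
        io_uring_submit(&ring);

        /* ... other work can happen here ... */

        /* Reap the completion when the data is actually needed. */
        io_uring_wait_cqe(&ring, &cqe);
        printf("read returned %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return 0;
    }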

Synchronous

Posted Dec 5, 2024 17:02 UTC (Thu) by Sesse (subscriber, #53779) [Link] (4 responses)

OK, but why call out direct I/O as a special thing then, if buffered I/O is also synchronous? I mean, you can argue otherwise for write() due to writeback, but read() is just as synchronous in both cases, right?

Synchronous

Posted Dec 5, 2024 17:10 UTC (Thu) by corbet (editor, #1) [Link] (3 responses)

Reads are always synchronous, which is why the kernel often prioritizes them. From the perspective of user space, though, writes are not synchronous, in that you do not have to wait for the data to land in persistent storage. Plus you get elimination of many duplicate writes and combining of operations, which helps performance a lot. Direct I/O does not give you that; part of the point of RWF_UNCACHED is to make those benefits available without flooding the page cache.

Synchronous

Posted Dec 5, 2024 20:19 UTC (Thu) by andresfreund (subscriber, #69562) [Link]

And sequential reads are also not really synchronous - the kernel will perform readahead for you. DIO won't. IME it's often a lot easier to convert write paths to use DIO than read paths, since more complicated heuristics are needed to implement application-level readahead.

Synchronous

Posted Dec 5, 2024 20:25 UTC (Thu) by malmedal (subscriber, #56172) [Link] (1 responses)

Also, I believe you avoid some kernel buffering on direct io reads, that's the reason for the alignment restrictions?

Synchronous

Posted Dec 5, 2024 20:27 UTC (Thu) by corbet (editor, #1) [Link]

Direct I/O always avoids kernel buffering — that's the "direct" part :)

What if this was on by default?

Posted Dec 6, 2024 1:03 UTC (Fri) by walters (subscriber, #7396) [Link] (7 responses)

An interesting thing to think about is: what would things be like if this was the default? What codebases/workloads would then want to opt in to persisting in the page cache? Software compilation would probably be one; as noted in a comment somewhere, one would likely want the object files that were just written by the compiler to be found by the linker in memory (and maybe not even written to disk in a non-incremental scenario, i.e. they could be unlinked before writeback starts). And then if you go to run the binary for tests, you probably want that in memory too, etc.

Although I guess there's also the theoretical idea of not caching reads by default, but caching writes still by default. What workloads would be badly hit by that? I guess back to software compilation with C/C++ with separate header files, we do want those headers in the page cache.

Obviously this is just theoretical because I'm sure trying to do this would cause fallout, but I am just curious to think about what the fallout would be.

What if this was on by default?

Posted Dec 6, 2024 13:22 UTC (Fri) by epa (subscriber, #39769) [Link] (5 responses)

Older operating systems had a wider choice of modes for opening a file. You could open for read or write, obviously, but also specify whether you needed the ability to seek, rather than read (or write) the whole file from beginning to end. (Perhaps this came originally from systems with both tape and disk storage.)

That could make a comeback. If you tell the kernel that you will only read sequentially, it can make a smarter choice about page caching. If the file is bigger than system memory and it is read from start to end, keeping the most recently used pages in cache will never help. On the other hand it does often help to cache random access to a large file. Similarly, if a file is opened for writing only then its cache treatment might differ a bit from a file opened for both read and write.

So perhaps we need more old-style file modes, even if they were treated purely as hints and you wouldn't actually get an error on trying to seek() a file you had opened for sequential access only.
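
The closest existing Linux analogue to such an open-time hint is posix_fadvise(); a sketch of the "sequential scan" case follows (purely advisory, and the exact effect of the hint varies across kernel versions):

    #include <fcntl.h>

    /* Hint that the file will be read sequentially from start to end;
     * the kernel may ramp up readahead in response, but nothing is
     * guaranteed. */
    static int open_sequential(const char *path)
    {
        int fd = open(path, O_RDONLY);

        if (fd >= 0)
            posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        return fd;
    }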

What if this was on by default?

Posted Dec 6, 2024 16:35 UTC (Fri) by pj (subscriber, #4506) [Link] (4 responses)

This approach makes sense to me: have user space describe how it intends to use the file. The kernel then may or may not be able to use that information, but it at least has a chance, and for best compatibility, that description shouldn't be tied to the internal workings of the kernel.

Describing how you use the file to the kernel

Posted Dec 7, 2024 11:45 UTC (Sat) by farnz (subscriber, #17727) [Link] (3 responses)

The trouble such schemes run into is that people do start relying on things working a particular way, and get very upset if it changes, even if the change is an improvement for people using the flags as documented. It is very hard to stop people depending on implementation details, unless you change them regularly - I've had, for example, people complain when I added compression to a storage backend, because it meant that they had to ask for data in the chunk size they wanted, rather than relying on the system not being able to produce chunks larger than 512 KiB.

And that wasn't even documented behaviour - that was found by users asking for "largest possible chunk size" and discovering that it was never more than 512 KiB on the implementation at hand. They could trivially have asked for a 512 KiB chunk size, and had that, or prepared for an arbitrarily large chunk size, and been OK.

Describing how you use the file to the kernel

Posted Dec 7, 2024 15:04 UTC (Sat) by adobriyan (subscriber, #30858) [Link] (2 responses)

The answer is always more -E and something like Vulkan validation layers (and let's not forget typestate pattern).

One day I'll write a debugging patch which kills a process if a dirty file wasn't closed properly.

Then I'll write a patch which kills a process if written file wasn't synced properly.

Then I'll write a patch which kills a process if it created or renamed a file without syncing parent directory properly.

I tried a similar thing once with a patch which forces short reads; it was quite entertaining -- an unnamed distro didn't even boot properly.

Describing how you use the file to the kernel

Posted Dec 9, 2024 22:36 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (1 responses)

> Then I'll write a patch which kills a process if written file wasn't synced properly.
>
> Then I'll write a patch which kills a process if it created or renamed a file without syncing parent directory properly.

Those operations are entirely legal and even desirable in some contexts. If a file is anywhere under /tmp, it almost certainly does not need to be sync'd, and probably should not be sync'd for performance reasons. The same goes for most if not all of the following:

* /var/run, /var/lock, and many other things under /var.
* /dev/shm (see shm_overview(7), and note that POSIX intentionally does not require these file descriptors to support the full range of file-like operations in the first place, so fsyncing them would be highly non-portable).
* Probably some files under /opt, depending on what is installed there and how it is configured. E.g. you might have a Jenkins setup that does CI out of some directory under /opt (if the system crashes mid-build, we probably want to delete it and start over anyway).
* Any other file which is intended to be temporary or ephemeral, regardless of where it lives.

Describing how you use the file to the kernel

Posted Dec 9, 2024 23:10 UTC (Mon) by andresfreund (subscriber, #69562) [Link]

Agreed, there's lots of cases where this is legitimate. Even for longer lived files.

E.g. postgres won't fsync data files until a checkpoint, as all modifications would be performed again in case the system / the database crashes and performs journal replay. It'd cause a major performance regression to always fsync modified files before a process exits (PG is multi process for now, each connection can have an FD open for a file).

What if this was on by default?

Posted Dec 11, 2024 16:52 UTC (Wed) by ScottMinster (subscriber, #67541) [Link]

There are an awful lot of reads that the system does where caching is very important. For example, every time you execute a command, like `bash`, the system has to read `/bin/bash` and all the library files that it depends on. It also has to read various configuration files like `~/.bashrc` or things in `/etc`. All those files are rarely written, and probably haven't been written since the last system boot, but they are read often. If they all had to go back to the disk because they weren't being cached, the system would run a lot slower.


Copyright © 2024, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds