The return of RWF_UNCACHED
By saving data in the page cache, buffered I/O can accelerate many I/O operations. Cached data need not be reread from the storage device, and multiple write operations can be combined in the cache, reducing the number of writes back to persistent storage. But that caching comes at a cost; the page cache is typically the largest user of memory on a Linux system, and the CPU must spend time copying data to and from the cache. Direct I/O avoids this memory use and copying, but it is inherently synchronous, adds complexity to a program, and provides a number of inconvenient pitfalls, especially in cases where a file is accessed concurrently. Developers normally only reach for direct I/O if they really need it.
Still, as Jens Axboe describes in the patch-set cover letter, users are often driven toward direct I/O, despite its challenges. That pressure is especially acute in cases where the data being transferred will not be needed again. Storing unneeded data in the page cache costs memory, but the problem is worse than that. Even though once-accessed data is put on the kernel's inactive list, meaning that it will be the first to be reclaimed when free memory runs low, the kernel must still make the effort to process that list and reclaim pages from it. With the right sort of I/O load (randomly accessing a set of files much larger than the system's RAM, for example), much of the available CPU time can be taken by the kernel's kswapd threads, which are working simply to reclaim memory from the page cache.
The solution that he came up with in 2019 was to add a new flag, RWF_UNCACHED, for the preadv2() and pwritev2() system calls. When that flag is present, those calls will perform I/O through the page cache as usual, with one exception: once the operation is complete, the relevant pages are immediately deleted from the page cache, making that memory available to the system without the need to go through the reclaim process. The work then wandered into an attempt to avoid the page cache entirely, to get closer to direct-I/O performance, before coming to a stop.
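As a rough illustration of the interface (not code from the patch series), a read using the flag might look like the sketch below. RWF_UNCACHED is not in released kernel headers, so the value defined here is the one used in the posted patches and should be treated as an assumption; an unpatched kernel will simply reject the flag.

    /* Minimal sketch: read through the page cache, but ask the kernel to
     * drop the pages once the data has been copied to user space.
     * RWF_UNCACHED is defined here as an assumption (the value from the
     * posted patches); an unpatched kernel will reject the flag. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #ifndef RWF_UNCACHED
    #define RWF_UNCACHED 0x00000080
    #endif

    int main(int argc, char **argv)
    {
        if (argc < 2)
            return 1;

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        char buf[64 * 1024];
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

        /* The flag applies to this single operation only. */
        ssize_t n = preadv2(fd, &iov, 1, 0, RWF_UNCACHED);
        if (n < 0)
            perror("preadv2");
        else
            printf("read %zd bytes; their pages are no longer cached\n", n);

        close(fd);
        return 0;
    }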
The new series picks things up again, returning to transferring data by way of the page cache and removing it afterward. For read operations, the data will be removed from the page cache as soon as it is copied into the user-space buffer (with the exception that, if it was already resident in the page cache prior to the operation, it will be left there afterward). Going through the page cache in this way avoids the coherency pitfalls that come with using direct I/O.
Writes work like buffered writes do now; the data will be written to the page cache, and the pages will be marked for eventual writeback to persistent storage. Once that writeback completes, the pages will be removed from the page cache (except, again, in cases where they were resident there prior to the operation starting). In the meantime, though, multiple writes can be combined into a single writeback operation, maintaining I/O performance.
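From user space, the write side looks the same as the read side; a minimal sketch, again assuming the RWF_UNCACHED value from the posted patches, might be:

    /* Sketch of an uncached write: the data is copied into the page
     * cache and scheduled for writeback as with a normal buffered write;
     * once writeback completes, the pages are dropped rather than kept.
     * RWF_UNCACHED is an assumed definition, as in the read example. */
    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <sys/uio.h>

    #ifndef RWF_UNCACHED
    #define RWF_UNCACHED 0x00000080
    #endif

    ssize_t uncached_write(int fd, const void *data, size_t len, off_t off)
    {
        struct iovec iov = { .iov_base = (void *)data, .iov_len = len };

        /* Returns once the data is in the page cache; writeback and the
         * subsequent page removal happen asynchronously. */
        return pwritev2(fd, &iov, 1, off, RWF_UNCACHED);
    }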
Since the last time around, the kernel's file-I/O infrastructure has improved somewhat, to the point that much of the work of supporting RWF_UNCACHED can be performed in the kernel's iomap layer. Filesystems that use iomap fully will get RWF_UNCACHED support almost for free. Filesystems that use less generic code, including ext4, require a bit more work. The patch series includes the needed changes for ext4; XFS and Btrfs are supported as well.
The effect of these changes can be seen in the associated benchmark results. For the read side, Axboe included results showing how, in the absence of RWF_UNCACHED, a system performing a lot of random reads will bog down once memory fills and reclaim begins. At that point, nearly 28 cores on the 32-core test system are busy just running kswapd full time. With RWF_UNCACHED, that bogging-down no longer happens, and kswapd does not appear among the top CPU-using processes at all. In summary: "Not only is performance 65% better, it's also using half the CPU to do it". The write-side results are almost the same.
Most of the responses to this work have been positive; XFS developer Darrick Wong, for example, said that "there's plenty of places where this could be useful to me personally". Dave Chinner (also an XFS developer) is less convinced, though. He argued that, rather than adding a new flag for preadv2() and pwritev2(), Axboe should add the desired behavior to existing features in the kernel. Specifically, he said, the POSIX_FADV_NOREUSE flag to posix_fadvise() is meant to provide that functionality. Axboe, though, disagreed, saying that it is better to specify the desired behavior on each I/O operation than as an attribute of an open file:
Per-file settings is fine for sync IO, for anything async per-io is the way to go. It's why we have things like RWF_NOWAIT as well, where O_NONBLOCK exists too. I'd argue that RWF_NOWAIT should always have been a thing, and O_NONBLOCK is a mistake. That's why RWF_UNCACHED exists.
In other words, the O_NONBLOCK flag to open() puts the election of non-blocking behavior in the wrong place. Rather than attaching that behavior to the file descriptor, it should be selected for specific operations. RWF_UNCACHED is a way to easily get that asynchronous behavior when needed.
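The distinction is perhaps easiest to see side by side. The sketch below uses the existing posix_fadvise() and preadv2() interfaces; the RWF_UNCACHED value is once more taken from the posted patches and is an assumption.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/uio.h>

    #ifndef RWF_UNCACHED
    #define RWF_UNCACHED 0x00000080   /* assumed value from the patches */
    #endif

    /* Per-file: a hint attached to the open file, affecting all
     * subsequent I/O on it (the POSIX_FADV_NOREUSE / O_NONBLOCK style). */
    void hint_per_file(int fd)
    {
        posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);
    }

    /* Per-operation: the behavior is requested for this one read only,
     * in the style of RWF_NOWAIT; other I/O on the same file descriptor
     * is unaffected. */
    ssize_t read_uncached(int fd, struct iovec *iov, int cnt, off_t off)
    {
        return preadv2(fd, iov, cnt, off, RWF_UNCACHED);
    }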
The discussion has since wound down for now, doubtless to be revived once the next version of the series is posted to the mailing lists. There would appear to be enough interest in this feature, though, to justify its merging into the mainline. It is too late to put uncached buffered I/O support into 6.13, but the chances of it showing up in 6.14 seem reasonably good.
Index entries for this article:
Kernel: Asynchronous I/O
Kernel: Memory management/Page cache
Race?
Posted Dec 4, 2024 16:15 UTC (Wed) by quotemstr (subscriber, #45331) [Link] (5 responses)
This logic sounds racy. What if reader A starts an unbuffered read, and before this read completes, reader B begins a buffered read? The two reads complete at the same time. We don't want to remove the page from cache in this case: someone expressed an interest in reading it in buffered mode.
Race?
Posted Dec 4, 2024 16:57 UTC (Wed) by axboe (subscriber, #904) [Link] (1 responses)
Race?
Posted Dec 4, 2024 17:07 UTC (Wed) by quotemstr (subscriber, #45331) [Link]
Race?
Posted Dec 4, 2024 17:06 UTC (Wed) by andresfreund (subscriber, #69562) [Link] (2 responses)
Race?
Posted Dec 4, 2024 17:10 UTC (Wed) by axboe (subscriber, #904) [Link] (1 responses)
Race?
Posted Dec 4, 2024 17:44 UTC (Wed) by Wol (subscriber, #4433) [Link]
Given that istr the speed-up for a large copy could be measured in orders of magnitude, the odd "hey it didn't do what I expected" is a price worth paying.
Cheers,
Wol
Hybrid IO
Posted Dec 4, 2024 19:21 UTC (Wed) by Paf (subscriber, #91811) [Link]
Here's an actual explanation:
We found the page allocation portion of the page cache - the locking - and the additional setup we needed to do were extremely costly and the main driver of slow IO, because they couldn't be done in parallel. We actually split creating IO from userspace across multiple threads, which is great for direct IO, but that gets us little benefit when faced with page cache locking. You just pile up on the xarray lock for the mapping.
So how to avoid that? I created a version of direct IO which optionally copies to a bounce buffer (which exists only during the read() or write() call), to allow supporting unaligned IO. Then the direct IO path can accept any IO.
We use that to hybridize the buffered IO path - large buffered IO goes through the unaligned direct IO path. The work is split internally to multiple threads, which can do the bounce buffer allocation and data copying in parallel. We can get about 40 GiB/s from a single userspace thread doing large IOs, but if they do small writes or small reads, it falls back to the page cache.
Obviously you lose the page cache for those larger IO, but we found observationally that large read/write IO is very rarely accessed again in the page cache.
You can see the various resemblances here, but in our case we use direct IO as the basis for our implementation, and - at least for us - it's much faster than a flush-after-fill approach. (It's also maybe a bit more complicated, which is a downside.)
There's a moderately technical presentation on it here, that mostly leaves out the parallel part:
https://wiki.lustre.org/images/a/a0/LUG2024-Hybrid_IO_Pat...
useful for databases
Posted Dec 4, 2024 20:47 UTC (Wed) by mokki (subscriber, #33200) [Link]
Similarly, after a crash or failover an RDBMS reads in all the WAL files once, and that should be an uncached read so they do not pollute the kernel buffer cache, leaving more space for the actually useful data.
On the other hand, the situation can be more complex, and the user should decide when to use uncached operations. For example, restoring a database backup is done only once in production. But a test system might be reset back to a known state every few minutes and benefits from the cached backup.
TTL
Posted Dec 4, 2024 22:24 UTC (Wed) by HIGHGuY (subscriber, #62277) [Link] (1 responses)
- 00: normal behavior, nice backwards compat
- 01: ? Keep around for 2 subsequent accesses
- 10: keep around for 1 subsequent access
- 11: current flag behavior, drop immediately
Consider the compiler setting this flag on an object file so the data is dropped after the linker has passed. Bad example, because you relink a file more often than you recompile the source, but the idea should be clear…
TTL
Posted Jan 16, 2025 17:29 UTC (Thu) by Spudd86 (guest, #51683) [Link]
It would likely be bad for performance. Plus, as the one issuing the I/O, you have no idea what else might be using the file; this is a hint about how the program doing the I/O expects to use it, not a system-wide policy.
must have feature
Posted Dec 5, 2024 9:56 UTC (Thu) by amarao (subscriber, #87073) [Link]
Synchronous
Posted Dec 5, 2024 14:49 UTC (Thu) by Sesse (subscriber, #53779) [Link] (7 responses)
Synchronous
Posted Dec 5, 2024 15:10 UTC (Thu) by Wol (subscriber, #4433) [Link]
Cheers,
Wol
Synchronous
Posted Dec 5, 2024 15:25 UTC (Thu) by corbet (editor, #1) [Link] (5 responses)
Direct I/O is inherently synchronous because read() and write() are inherently synchronous; that's how Unix was designed. A buffered write is synchronous in that, at the completion of the call, the data has been copied out of your buffer and you can safely put new data there. A direct write has to provide the same property; there is no way other than the completion of the write() call to know that the operation is done.
Now, of course, you can use io_uring to make it all asynchronous, but that adds significantly to the complexity of the whole operation.
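As an illustration of that added complexity, a minimal liburing-based sketch of a single asynchronous O_DIRECT read might look like the following; the file name and the 4096-byte alignment are assumptions, and most error handling is omitted.

    /* Build with: cc uring_read.c -luring
     * Submits one O_DIRECT read and then reaps its completion. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdlib.h>

    int main(void)
    {
        struct io_uring ring;
        if (io_uring_queue_init(8, &ring, 0) < 0)
            return 1;

        int fd = open("data.bin", O_RDONLY | O_DIRECT);
        void *buf;
        posix_memalign(&buf, 4096, 4096);   /* O_DIRECT needs aligned buffers */

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, 4096, 0);
        io_uring_submit(&ring);             /* does not block on the read */

        /* ... other work could happen here ... */

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);     /* reap the completion */
        int res = cqe->res;                 /* bytes read, or -errno */
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return res < 0;
    }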
Synchronous
Posted Dec 5, 2024 17:02 UTC (Thu) by Sesse (subscriber, #53779) [Link] (4 responses)
Synchronous
Posted Dec 5, 2024 17:10 UTC (Thu) by corbet (editor, #1) [Link] (3 responses)
Reads are always synchronous, which is why the kernel often prioritizes them. From the perspective of user space, though, writes are not synchronous, in that you do not have to wait for the data to land in persistent storage. Plus you get elimination of many duplicate writes and combining of operations, which helps performance a lot. Direct I/O does not give you that; part of the point of RWF_UNCACHED is to make those benefits available without flooding the page cache.
Synchronous
Posted Dec 5, 2024 20:19 UTC (Thu) by andresfreund (subscriber, #69562) [Link]
Synchronous
Posted Dec 5, 2024 20:25 UTC (Thu) by malmedal (subscriber, #56172) [Link] (1 responses)
Synchronous
Posted Dec 5, 2024 20:27 UTC (Thu) by corbet (editor, #1) [Link]
Direct I/O always avoids kernel buffering — that's the "direct" part :)
Posted Dec 5, 2024 20:27 UTC (Thu) by corbet (editor, #1) [Link]
What if this was on by default?
Posted Dec 6, 2024 1:03 UTC (Fri) by walters (subscriber, #7396) [Link] (7 responses)
Although I guess there's also the theoretical idea of not caching reads by default, but caching writes still by default. What workloads would be badly hit by that? I guess back to software compilation with C/C++ with separate header files, we do want those headers in the page cache.
Obviously this is just theoretical because I'm sure trying to do this would cause fallout, but I am just curious to think about what the fallout would be.
What if this was on by default?
Posted Dec 6, 2024 13:22 UTC (Fri) by epa (subscriber, #39769) [Link] (5 responses)
That could make a comeback. If you tell the kernel that you will only read sequentially, it can make a smarter choice about page caching. If the file is bigger than system memory and it is read from start to end, keeping the most recently used pages in cache will never help. On the other hand it does often help to cache random access to a large file. Similarly, if a file is opened for writing only then its cache treatment might differ a bit from a file opened for both read and write.
So perhaps we need more old-style file modes, even if they were treated purely as hints and you wouldn't actually get an error on trying to seek() a file you had opened for sequential access only.
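The closest existing equivalents to such hints are probably the posix_fadvise() advice values, which are advisory in exactly this sense; a small sketch:

    #include <fcntl.h>

    /* Advisory only: the kernel may adjust readahead and caching, but a
     * seek on the file still works. */

    /* Before a start-to-end scan: ask for aggressive readahead. */
    void hint_will_scan(int fd)
    {
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    }

    /* After the scan: the cached pages are unlikely to be reused, so let
     * the kernel drop them (a length of zero means "to the end"). */
    void hint_done_scanning(int fd)
    {
        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    }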
What if this was on by default?
Posted Dec 6, 2024 16:35 UTC (Fri) by pj (subscriber, #4506) [Link] (4 responses)
Describing how you use the file to the kernel
Posted Dec 7, 2024 11:45 UTC (Sat) by farnz (subscriber, #17727) [Link] (3 responses)
The trouble such schemes run into is that people do start relying on things working a particular way, and get very upset if it changes, even if the change is an improvement for people using the flags as documented. It is very hard to stop people depending on implementation details, unless you change them regularly - I've had, for example, people complain when I added compression to a storage backend, because it meant that they had to ask for data in the chunk size they wanted, rather than relying on the system not being able to produce chunks larger than 512 KiB.
And that wasn't even documented behaviour - that was found by users asking for "largest possible chunk size" and discovering that it was never more than 512 KiB on the implementation at hand. They could trivially have asked for a 512 KiB chunk size, and had that, or prepared for an arbitrarily large chunk size, and been OK.
Describing how you use the file to the kernel
Posted Dec 7, 2024 15:04 UTC (Sat) by adobriyan (subscriber, #30858) [Link] (2 responses)
One day I'll write a debugging patch which kills a process if a dirty file wasn't closed properly.
Then I'll write a patch which kills a process if written file wasn't synced properly.
Then I'll write a patch which kills a process if it created or renamed a file without syncing parent directory properly.
I tried a similar thing once with a patch which forces short reads; it was quite entertaining -- an unnamed distro didn't even boot properly.
Describing how you use the file to the kernel
Posted Dec 9, 2024 22:36 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (1 responses)
> Then I'll write a patch which kills a process if it created or renamed a file without syncing parent directory properly.
Those operations are entirely legal and even desirable in some contexts. If a file is anywhere under /tmp, it almost certainly does not need to be sync'd, and probably should not be sync'd for performance reasons. The same goes for most if not all of the following:
* /var/run, /var/lock, and many other things under /var.
* /dev/shm (see shm_overview(7), and note that POSIX intentionally does not require these file descriptors to support the full range of file-like operations in the first place, so fsyncing them would be highly non-portable).
* Probably some files under /opt, depending on what is installed there and how it is configured. E.g. you might have a Jenkins setup that does CI out of some directory under /opt (if the system crashes mid-build, we probably want to delete it and start over anyway).
* Any other file which is intended to be temporary or ephemeral, regardless of where it lives.
Describing how you use the file to the kernel
Posted Dec 9, 2024 23:10 UTC (Mon) by andresfreund (subscriber, #69562) [Link]
E.g. postgres won't fsync data files until a checkpoint, as all modifications would be performed again in case the system / the database crashes and performs journal replay. It'd cause a major performance regression to always fsync modified files before a process exits (PG is multi process for now, each connection can have an FD open for a file).
What if this was on by default?
Posted Dec 11, 2024 16:52 UTC (Wed) by ScottMinster (subscriber, #67541) [Link]