VM followup: page migration and fragmentation avoidance

[Posted November 16, 2005 by corbet]

Page migration is the act of moving a process's pages from one part of the system to another. Often, the motivation is moving pages between NUMA nodes in the hope of improving performance. When this page last looked at the page migration patch set, it worked by forcing target pages out to the swap device. When the owning process later faults them in, these pages will end up on the desired node. This technique works, but it is not optimal: it would be nicer to avoid having to write the pages to disk and read them back in.

Christoph Lameter has now followed up with the direct migration patch set, which does away with the side-trip to the swap device. A look at the patch shows why things were not done this way in the first place; direct page migration involves rather more than simply copying the data over. The first step, after choosing a target page, is to lock that page so that nobody else will mess with it. There might currently be I/O active which involves that page, so the kernel must wait for any such I/O to complete. Only then can the real migration work begin.

The kernel must establish a swap cache entry for the page, even though it intends to avoid writing the page to swap. This entry will cause the right thing to happen if a process faults on the page while it is being moved. Then all references to the page (page table entries) are unmapped. With luck, all references will go away; if references remain for any reason, the page cannot be moved.

Actually moving the page involves copying a subset of the page status bits over, copying the page data itself, then copying the rest of the status bits. The old page is cleared out and freed. If any writeback has been queued up for the new page, it is set in motion. Then it's just a matter of cleaning up, and the page has been successfully moved.

If the kernel runs out of free pages on the target node, it will fall back to the swap-based mechanism. So that stage of this patch's evolution remains useful.
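
The sequence described above can be modelled with a small user-space toy in C. This is only a sketch under stated assumptions, not code from the patch: struct toy_page, toy_migrate_page(), and the result codes are invented for illustration, waiting for I/O is treated as instantaneous, writeback is not modelled, and both the split of status bits between the two copy passes and the point at which the swap fallback is chosen are guesses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Hypothetical, heavily simplified stand-in for the kernel's struct page. */
struct toy_page {
    bool locked;
    bool io_active;      /* I/O currently in flight on this page */
    bool in_swap_cache;  /* swap cache entry present */
    bool uptodate;       /* example status bit */
    bool dirty;          /* example status bit */
    int mapcount;        /* page table entries referring to the page */
    int pinned;          /* references that cannot be unmapped */
    int node;            /* NUMA node holding the page */
    unsigned char data[TOY_PAGE_SIZE];
};

enum toy_result { TOY_MIGRATED, TOY_SWAPPED_OUT, TOY_FAILED };

/* Drop every reference that can be dropped; return how many remain. */
static int toy_unmap_all(struct toy_page *p)
{
    p->mapcount = p->pinned;
    return p->mapcount;
}

/*
 * Move old_page to new_page. Passing new_page == NULL models "no free
 * pages on the target node", which selects the swap-based fallback
 * (the swap path itself is not modelled here).
 */
enum toy_result toy_migrate_page(struct toy_page *old_page,
                                 struct toy_page *new_page)
{
    int old_node;

    assert(old_page != NULL);

    if (new_page == NULL)
        return TOY_SWAPPED_OUT;          /* assumed fallback point */

    old_page->locked = true;             /* lock the page */
    old_page->io_active = false;         /* "wait" for pending I/O */
    old_page->in_swap_cache = true;      /* catches faults during the move */

    if (toy_unmap_all(old_page) > 0) {   /* references remain: give up */
        old_page->in_swap_cache = false;
        old_page->locked = false;
        return TOY_FAILED;
    }

    new_page->uptodate = old_page->uptodate;            /* first status bits */
    memcpy(new_page->data, old_page->data, TOY_PAGE_SIZE);
    new_page->dirty = old_page->dirty;                  /* remaining bits */

    /* Clear out the old page; it stays on its original node. */
    old_node = old_page->node;
    memset(old_page, 0, sizeof(*old_page));
    old_page->node = old_node;

    return TOY_MIGRATED;
}
```

A caller would pass the page to be moved along with a freshly allocated page on the destination node, and would treat TOY_FAILED as "leave the page where it is", which matches the best-effort behavior described in the article.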

With this code in place, the kernel has the support it needs to try to keep a process's pages in local memory. The migration code might also prove useful for hotplug memory uses, where all pages must be vacated from a given region. Indeed, some of this code was originally written for hotplug applications. But, at this point, the migration is done on a best-effort basis. For NUMA systems, failure to move a page results in worse performance, but nothing particularly severe. For hotplug memory, however, this sort of failure will block a memory remove operation altogether. Moving all pages in a region with 100% certainty remains a difficult problem without a complete solution at this time.

One of the pieces of such a solution might be active memory defragmentation which, among other things, works to keep non-movable memory allocations out of memory regions which might be removed. When we looked at active defragmentation last week, that patch set looked like it was in trouble. The overhead of the defragmentation code seemed to be too high, and a number of developers (Linus included) felt that this sort of functionality should be implemented using the kernel's zone system, rather than with a new layer in the memory allocator.

Defragmentation hacker Mel Gorman doesn't give up that easily, however. He has posted a new, "light" version of the defragmentation patch which, he hopes, will be better received. As he describes it:

This is a much simplified anti-defragmentation approach that simply tries to keep kernel allocations in groups of 2^(MAX_ORDER-1) and easily reclaimed allocations in groups of 2^(MAX_ORDER-1). It uses no balancing, tunables special reserves and it introduces no new branches in the main path. For small memory systems, it can be disabled via a config option. In total, it adds 275 new lines of code with minimum changes made to the main path.

In this version of the patch, a new GFP flag (__GFP_EASYRCLM) is added; its presence indicates an allocation which the kernel can easily get back should the need arise. It is used for user-space pages (which can usually be forced out to backing store) and in a few other situations, such as for some kernel buffers. The buddy allocator already keeps track of memory in large chunks; the new code simply steers reclaimable allocations toward some chunks, while keeping the non-reclaimable allocations in others. In this way, it is hoped, there will be no situations where one non-movable page blocks the freeing of the large, contiguous region in which it is located.
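
The steering idea can be illustrated with a toy allocator in C. This is a sketch under stated assumptions rather than the patch's code: every toy_ name is invented, TOY_GFP_EASYRCLM merely stands in for __GFP_EASYRCLM with an arbitrary value, and TOY_MAX_ORDER is set to 11 for illustration, giving blocks of 2^10 = 1024 pages. Unlike a real allocator, this toy simply fails when no suitable block is left.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_ORDER    11                            /* assumed value */
#define TOY_BLOCK_PAGES  (1u << (TOY_MAX_ORDER - 1))   /* 1024 pages */
#define TOY_NR_BLOCKS    4
#define TOY_GFP_EASYRCLM 0x1u   /* hypothetical stand-in for __GFP_EASYRCLM */

enum toy_block_type { TOY_BLOCK_UNUSED, TOY_BLOCK_KERNEL, TOY_BLOCK_EASYRCLM };

struct toy_block {
    enum toy_block_type type;   /* what this block is reserved for */
    unsigned int used;          /* pages already handed out */
};

struct toy_zone {
    struct toy_block blocks[TOY_NR_BLOCKS];
};

/* Allocate one page; returns its page frame number, or -1 on failure. */
long toy_alloc_page(struct toy_zone *z, unsigned int gfp_flags)
{
    enum toy_block_type want = (gfp_flags & TOY_GFP_EASYRCLM)
                                   ? TOY_BLOCK_EASYRCLM
                                   : TOY_BLOCK_KERNEL;
    size_t i;

    assert(z != NULL);

    /* Prefer a block that already holds this kind of allocation. */
    for (i = 0; i < TOY_NR_BLOCKS; i++) {
        struct toy_block *b = &z->blocks[i];

        if (b->type == want && b->used < TOY_BLOCK_PAGES)
            return (long)(i * TOY_BLOCK_PAGES + b->used++);
    }

    /* Otherwise claim a block that nobody has used yet. */
    for (i = 0; i < TOY_NR_BLOCKS; i++) {
        struct toy_block *b = &z->blocks[i];

        if (b->type == TOY_BLOCK_UNUSED) {
            b->type = want;
            return (long)(i * TOY_BLOCK_PAGES + b->used++);
        }
    }

    return -1;   /* a real allocator would have further fallbacks */
}
```

Starting from a zeroed struct toy_zone, alternating kernel and easily-reclaimed requests end up in separate 1024-page blocks, so no block mixes the two kinds of allocation.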

The patch works by creating a "usemap" array tracking which kind of allocation is being done from each large chunk of memory. Mel also had to split the per-CPU free lists which are used to perform fast single-page allocations; now there are two such lists, one for each allocation type. From there, it is just a matter of taking allocations from the proper pile, depending on the __GFP_EASYRCLM flag.
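
A rough C model of the usemap and the split per-CPU lists might look like the following. It is a sketch built on assumptions: the toy_ names, the list size, and the flag value are all invented, and the choice to consult the usemap when a page is freed (to decide which pile it returns to) is a guess at one plausible design rather than a description of the patch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BLOCK_SHIFT  10     /* 2^10 pages per block; assumed value */
#define TOY_NR_BLOCKS    4
#define TOY_PCP_MAX      8      /* capacity of each per-CPU pile */
#define TOY_GFP_EASYRCLM 0x1u   /* hypothetical stand-in for __GFP_EASYRCLM */

enum toy_type { TOY_KERNEL = 0, TOY_EASYRCLM = 1, TOY_NR_TYPES };

struct toy_pcp_list {
    unsigned long pfn[TOY_PCP_MAX];
    size_t count;
};

struct toy_mem {
    unsigned char usemap[TOY_NR_BLOCKS];      /* allocation type per block */
    struct toy_pcp_list lists[TOY_NR_TYPES];  /* one pile per type, one CPU */
};

/* Look up which kind of allocation the page's block is used for. */
static enum toy_type toy_type_of_pfn(const struct toy_mem *m, unsigned long pfn)
{
    size_t block = pfn >> TOY_BLOCK_SHIFT;

    assert(block < TOY_NR_BLOCKS);
    return (enum toy_type)m->usemap[block];
}

/* Return a page to the pile matching its block; -1 if that pile is full. */
int toy_free_page(struct toy_mem *m, unsigned long pfn)
{
    struct toy_pcp_list *l = &m->lists[toy_type_of_pfn(m, pfn)];

    if (l->count == TOY_PCP_MAX)
        return -1;
    l->pfn[l->count++] = pfn;
    return 0;
}

/* Take a page from the pile selected by the flag; -1 if it is empty. */
long toy_alloc_page_pcp(struct toy_mem *m, unsigned int gfp_flags)
{
    enum toy_type t = (gfp_flags & TOY_GFP_EASYRCLM) ? TOY_EASYRCLM
                                                     : TOY_KERNEL;
    struct toy_pcp_list *l = &m->lists[t];

    if (l->count == 0)
        return -1;
    return (long)l->pfn[--l->count];
}
```

With block 0 marked as a kernel block and block 1 as easily reclaimed, freed pages from each block land on different piles, and a request only ever sees pages from the pile that matches its flag.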

This version certainly reduces the footprint and overhead of the defragmentation patches. It is still not the zone-based approach that others were pushing for, however. So it remains to be seen whether "active defragmentation lite" is, in the end, better received than its predecessors.

Posted Nov 24, 2005 12:29 UTC (Thu) by markryde (guest, #33361)

> Often, the motivation is moving pages between NUMA nodes in the hope of improving performance.

Does anybody know if there is any other use for page migration apart from moving pages between NUMA nodes? (in clusters or virtualization solutions maybe?)

Posted Nov 25, 2005 11:30 UTC (Fri) by farnz (subscriber, #17727)

Uses I can think of (other people will no doubt correct me when I've got things wrong):

Memory hotplug. This needs a guarantee that pages will move off the chips you're about to unplug, so these patches are only a beginning for that use.
Driver DMA buffer allocation. Some devices can't do scatter-gather DMA (thankfully these are getting rare), so they need to allocate large buffers as a single contiguous lump. More common is 32-bit devices on a 64-bit system without an IOMMU; rather than use bounce buffering, you can migrate pages in and out of the DMA32 zone and get the same effect. This isn't necessarily a win though.
Large page size support - migrating pages allows you to defragment memory. There's little support for this in Linux at the moment (hugetlbfs only), but the idea is to merge lots of small pages into bigger ones where possible. For example, x86 hardware supports a 2MB or 4MB page size in addition to the normal 4K page size; some MIPS and IA-64 hardware support 4K, 16K and 64K pages. For now, you can access large pages (subject to physical memory fragmentation) via the hugetlb code; in theory, Linux could be extended to support large pages transparently (defragment physical memory so that a large page sized virtual allocation gets backed by large pages). This gets you more bytes in the TLB (as the TLB counts pages, not byte addresses), and smaller page tables; the question is whether the increased code complexity (and corresponding bug opportunity) outweighs the potential gains.

Just a few uses to think about, anyway.

Posted Nov 25, 2005 21:14 UTC (Fri) by oak (guest, #2786)

Wouldn't this help also with something like Xen?

Posted Dec 1, 2005 15:15 UTC (Thu) by farnz (subscriber, #17727)

My (limited) understanding of Xen suggests not; Xen maps pages directly into the guest's address space (it just indirects page table changes through the hypervisor). Therefore, Xen doesn't gain from physical page migration or defragmentation (it always maps page-by-page, not in blocks).