SMP alternatives

[Posted December 14, 2005 by corbet]

The i386 processor family poses a challenge for kernel builders. These processors have maintained instruction set compatibility for many years; code built for early Pentium processors will likely still run on current hardware. The problem is that code built for these older processors will fail to take advantage of features added later on. The "least common denominator" approach can thus lead to sub-optimal use of current CPUs.

The kernel has a number of ways of dealing with this challenge. In some cases it can make decisions at run time, using processor features only if they are found to be present. Other features are only available by way of build-time configuration options; selecting these will result in a kernel which will not run on older systems. Yet another mechanism is the "alternatives" feature, which allows the kernel to optimize itself at boot time. Consider this example of alternatives use (from include/asm-i386/system.h):

    #define mb() alternative("lock; addl $0,0(%%esp)", \
                             "mfence", \
                             X86_FEATURE_XMM2)

This macro places a memory barrier in the code, ensuring that all memory reads and writes initiated before the barrier complete before execution continues. The default implementation is essentially a bus-locked no-op; it will work anywhere. On newer systems, however, the more efficient mfence instruction is available, and it would be nice to use it.
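For readers who want to see what such a barrier buys, here is a rough user-space analogy, not kernel code. It assumes a C11 compiler with `<stdatomic.h>`; the names `publish()`, `try_consume()`, `payload`, and `ready` are invented for this sketch. `atomic_thread_fence(memory_order_seq_cst)` plays the role that mb() plays in the kernel: the write to the payload cannot be reordered after the write to the flag.

```c
#include <assert.h>
#include <stdatomic.h>

static int payload;          /* ordinary data handed from writer to reader */
static atomic_int ready;     /* flag announcing that payload is valid */

/* Writer side: store the data, then a full barrier, then raise the flag. */
static void publish(int value)
{
    payload = value;
    atomic_thread_fence(memory_order_seq_cst);   /* full barrier, akin to mb() */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Reader side: returns 1 and fills *out once the flag has been seen. */
static int try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return 0;
    atomic_thread_fence(memory_order_seq_cst);   /* pairs with the writer's fence */
    *out = payload;
    return 1;
}
```

On x86 a compiler may well implement that fence with mfence or with a locked instruction, which is the same choice the mb() macro is making.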

The alternative() macro compiles in the default code, but also makes a note of its location (and alternative implementation) in a special ELF section. Early in the boot process, the kernel calls apply_alternatives(), which makes a pass through that special section. Every alternative instruction which is supported by the running processor is patched directly into the loaded kernel image; if the replacement is shorter than the code it overwrites, the remaining bytes are filled with no-op instructions. Once apply_alternatives() has finished its work, the kernel behaves as if it had been compiled for the processor it is actually running on. This mechanism allows distributors to ship generic kernels which can optimize themselves at boot time.
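The patching pass can be modeled in user space with a byte buffer standing in for the kernel image. This is only a sketch under stated assumptions: `struct alt_entry`, `apply_alts()`, `demo_patch_mb()`, and the `FEAT_XMM2` bit are hypothetical names, not the kernel's own structures, and padding uses the single-byte no-op (0x90) for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NOP        0x90        /* single-byte x86 no-op */
#define FEAT_XMM2  (1u << 0)   /* hypothetical feature bit for this sketch */

/* Hypothetical record for one patch site; the kernel keeps comparable
 * records in its special ELF section. */
struct alt_entry {
    size_t offset;               /* where the default code sits in the image */
    size_t orig_len;             /* bytes occupied by the default code */
    const unsigned char *repl;   /* replacement instruction bytes */
    size_t repl_len;             /* must not exceed orig_len */
    unsigned feature;            /* CPU feature the replacement requires */
};

/* Patch every site whose feature is present; return how many were patched. */
static size_t apply_alts(unsigned char *image, const struct alt_entry *tab,
                         size_t n, unsigned cpu_features)
{
    size_t patched = 0;

    for (size_t i = 0; i < n; i++) {
        const struct alt_entry *a = &tab[i];

        if (!(cpu_features & a->feature))
            continue;                          /* keep the default code */
        assert(a->repl_len <= a->orig_len);
        memcpy(image + a->offset, a->repl, a->repl_len);
        /* Fill whatever is left of the original instruction with no-ops. */
        memset(image + a->offset + a->repl_len, NOP,
               a->orig_len - a->repl_len);
        patched++;
    }
    return patched;
}

/* Sample site modeled on mb(): a 5-byte default and a 3-byte replacement. */
static const unsigned char mb_default[] = { 0xF0, 0x83, 0x04, 0x24, 0x00 }; /* lock; addl $0,0(%esp) */
static const unsigned char mb_mfence[]  = { 0x0F, 0xAE, 0xF0 };             /* mfence */

/* Load the default code into image[5], then patch it for the given CPU. */
static size_t demo_patch_mb(unsigned char image[5], unsigned cpu_features)
{
    const struct alt_entry tab[] = {
        { 0, sizeof mb_default, mb_mfence, sizeof mb_mfence, FEAT_XMM2 },
    };

    memcpy(image, mb_default, sizeof mb_default);
    return apply_alts(image, tab, 1, cpu_features);
}
```

Calling `demo_patch_mb(image, FEAT_XMM2)` leaves mfence followed by two no-ops in the buffer; calling it with no feature bits leaves the locked addl untouched.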

The 2.6 mainline uses alternatives sparingly: for barriers, prefetch hints, and saving the floating point unit state. Gerd Knorr, however, believes that the use of alternatives could be expanded to further reduce the range of kernels which distributors need to ship - and to improve runtime flexibility as well. In particular, he thinks that kernels can be optimized for single- or multiprocessor systems on the fly.

Gerd's SMP alternatives patch is an implementation of this concept. It creates a new macro (alternative_smp()) which can be used to specify optimal implementations of an operation on both uniprocessor and SMP systems; the proper version will then be selected at runtime. The main use of SMP alternatives in his patch is with spinlock operations; spinlocks can be patched in or edited out, as dictated by the configuration of the system at boot time.

There are a couple of interesting features in Gerd's patch. One is in the handling of the i386 architecture's lock prefix. This prefix, when applied to specific instructions, causes the instruction to run in a bus-locked, atomic manner. It is used for operations which must be seen coherently across a multiprocessor system; these include semaphore operations and the atomic_t implementation. Use of the lock prefix on uniprocessor systems imposes a runtime cost with no benefit; it would be nice to edit those out. The SMP alternatives patch takes a shortcut here; it simply remembers each location where a lock prefix appears. If the kernel boots on a uniprocessor system, all of those prefixes can be quickly overwritten with no-ops.
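The lock-prefix shortcut amounts to a list of byte offsets. A minimal sketch, assuming the recorded sites are simple offsets into an image buffer (the function name `strip_lock_prefixes()` is made up for illustration and does not come from the patch):

```c
#include <assert.h>
#include <stddef.h>

#define LOCK_PREFIX 0xF0   /* x86 lock prefix byte */
#define NOP         0x90   /* single-byte x86 no-op */

/* Overwrite each recorded lock prefix with a no-op, as would be done
 * after booting on a uniprocessor system. Returns the number of bytes
 * actually changed. */
static size_t strip_lock_prefixes(unsigned char *image,
                                  const size_t *sites, size_t n)
{
    size_t changed = 0;

    for (size_t i = 0; i < n; i++) {
        /* Only touch bytes that still hold the prefix. */
        if (image[sites[i]] == LOCK_PREFIX) {
            image[sites[i]] = NOP;
            changed++;
        }
    }
    return changed;
}
```

Because the prefix is a single byte and the instruction after it is left alone, no length bookkeeping is needed; that is what makes this cheaper than a general alternative() entry.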

A more interesting - and more controversial - feature of this patch is that, when the kernel is converted between SMP and uniprocessor modes, the overwritten instructions are remembered. At some point in the future, then, the alternatives code can reverse the change, switching the kernel back to the full SMP implementation. The code is then run whenever a CPU hotplug event happens, optimizing the kernel for the system's new configuration. A system can be initially booted with a single processor, and the alternatives code will edit out all of the SMP-related instructions. If another processor is added later on, the kernel will be automatically converted back into a fully SMP-capable mode. If processors are removed, the SMP code can be taken out too. All within a running system, with no need to reboot.
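The reversible part can be pictured as patch sites that keep a copy of whatever they overwrite. The sketch below is a user-space model only; `struct smp_site`, `switch_mode()`, and `on_cpu_count_change()` are hypothetical names, and it ignores the quiescing a real kernel would need before touching live code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PATCH 8   /* largest patch this sketch handles */

/* One reversible patch site (hypothetical layout). */
struct smp_site {
    size_t offset;                      /* location in the image */
    size_t len;                         /* bytes to swap, <= MAX_PATCH */
    unsigned char up_code[MAX_PATCH];   /* uniprocessor variant */
    unsigned char saved[MAX_PATCH];     /* SMP bytes remembered while in UP mode */
    int is_up;                          /* nonzero while the UP variant is live */
};

/* Switch every site to the requested mode, saving or restoring bytes. */
static void switch_mode(unsigned char *image, struct smp_site *sites,
                        size_t n, int want_up)
{
    for (size_t i = 0; i < n; i++) {
        struct smp_site *s = &sites[i];

        assert(s->len <= MAX_PATCH);
        if (want_up && !s->is_up) {
            memcpy(s->saved, image + s->offset, s->len);    /* remember SMP code */
            memcpy(image + s->offset, s->up_code, s->len);
            s->is_up = 1;
        } else if (!want_up && s->is_up) {
            memcpy(image + s->offset, s->saved, s->len);    /* put SMP code back */
            s->is_up = 0;
        }
    }
}

/* Hypothetical hotplug hook: use UP code only when one CPU is online. */
static void on_cpu_count_change(unsigned char *image, struct smp_site *sites,
                                size_t n, unsigned online_cpus)
{
    switch_mode(image, sites, n, online_cpus <= 1);
}
```

Tracking `is_up` per site makes the hook idempotent: going from two CPUs to four changes nothing, and the saved SMP bytes are never overwritten by a second switch to UP mode.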

This feature may seem useful to a rather small minority of users - and it is. But that minority may be bigger than one thinks. Virtualization systems (and Xen in particular) are implementing the ability to configure the number of (virtual) CPUs in each running instance on the fly, in response to the load on each. So it may really be that a busy, virtualized server will have CPUs hot-plugged into it, and that those processors will go away when the load drops. Enabling the kernel to reconfigure itself on the fly when this happens will allow each Xen instance to run a kernel which is optimized for its current situation.

The CPU hotplug support may be a hard sell - self-modifying code in a running kernel tends to make people nervous. The rest of the SMP alternatives patch seems likely to find a place in the mainline, eventually.


That's why I love Linux...

Posted Dec 15, 2005 3:19 UTC (Thu) by sbishop (guest, #33061) [Link] (5 responses)

The difference between this idea and the way Windows works is incredible. Self-modifying/optimizing code versus "Another processor? Do you have a license for that?" Wow.

I've realized recently that Linux often benefits technically from its license. This idea is an example of that.

I've seen another example at work, where I maintain a Linux USB driver used internally. It's based off the example USB driver that's included in the Linux kernel code, and it's about 600 lines long. The corresponding example Windows driver that comes with Microsoft's DDK (Driver Development Kit) consists of multiple files and around 8,000 lines total. Why the difference? Because the Linux developers aren't tied to an ABI, or API, for that matter. They can adjust the interface between the kernel and drivers (because they have the source for those too) until they get it right. The Windows driver contains much that really ought to be built-in. A +10x difference in code size--that's huge.

And guess which is more than twice as fast as the other? :)

That's why I love Linux...

Posted Dec 15, 2005 10:35 UTC (Thu) by Duncan (guest, #6647) [Link] (4 responses)

I'm certainly no MSWormOS fan, and I like your point, but in actuality,
from what I've read, MS is one of the better proprietaryware vendors in
regard to multicore, at least. While the likes of Oracle continue to bill
per core, MS has been looking pretty flexible by comparison, saying it
will continue to license per CPU, even as the number of cores per CPU
begins to climb.

As to the USB driver, if it's useful for you, unless you are developing it
for as yet unreleased hardware, it's almost certainly going to be useful
to others as well. The GPL doesn't require publishing code for internal
work, but consider the benefits of either /not/ having to do that internal
maintenance, as it's done for you by the kernel crew, or submitting it for
kernel inclusion and becoming the maintainer yourself, thus gaining
visibility and favorable press throughout the Linux world, /plus/ having a
bit of the work still done for you, when someone changes out kernel code
you depend on from under you.

Of course, you likely have your reasons for not doing so, but come on, you
gotta admit such a posting to LWN is just /begging/ someone to ask why you
haven't yet made source available! <g>

Duncan

That's why I love Linux...

Posted Dec 15, 2005 16:18 UTC (Thu) by sbishop (guest, #33061) [Link] (1 responses)

Yea, I suppose that I should have seen that coming. :)

I work for a manufacturing company. The hardware is a tester that only we'd be interested in. And the driver really isn't far removed from usb-skeleton.c, so it isn't very interesting either.

That's why I love Linux...

Posted Jun 1, 2006 16:41 UTC (Thu) by cventers (guest, #31465) [Link]

Is it a tester that no one else on Earth owns? :)

I read that GregKH re-emphasized at the recent FreedomHEC that the kernel
has drivers with only a handful of users. Perhaps mainline inclusion
isn't an impossible thing after all.

Internal-use kernel code

Posted Dec 16, 2005 16:51 UTC (Fri) by giraffedata (guest, #1954) [Link] (1 responses)

You silently merge the idea of making source code available and of getting it in the kernel.org tree. You also seem to merge, as many people do, the idea of submitting code for inclusion and of that code being included.

There's a significant cost to getting code into kernel.org. I myself write lots of kernel code, and while the world is welcome to all of it and much of it is published, I have never attempted to get any of it into kernel.org.

First, I'd have to translate it to a coding style I don't like and package it according to some pretty specific rules. Then I'd have to run the gauntlet of some mailing list, probably having to rework the code a few times. Some of that rework would be stuff I don't agree with. Some would be stuff I have no use for. At no point would I have any guarantee my work would result in any code going in.

So I suffer the costs of being out of tree (mainly, I can't use the most current kernel.org code), but it beats the cost of getting in tree.

Internal-use kernel code

Posted Dec 16, 2005 19:42 UTC (Fri) by Duncan (guest, #6647) [Link]

Well, I'm aware of the difference, but didn't go to my usual lengths to
specify it. How come other folks can take shortcuts, but every time I
abridge a detail, I get called on it? <g> I suppose it's likely because
folks are used to me being so detailed, tho some might be just
coincidence.

Anyway, yeah, thanks for making the code available, even if it's not
targeted at the kernel tree ATM. That's an important right of Free
source, too, being able to NOT have to go for merge, if desired, tho
because it's Free source, others can take it and go for that merge if they
want to, again, another important right.

Thanks also for filling in those details. It's quite possible the
additional information will be of use to future readers, and I /did/ fail
to mention it, so it's good someone came along to fill the gap. =8^)

Duncan

SMP alternatives

Posted Dec 15, 2005 9:44 UTC (Thu) by dw (guest, #12017) [Link] (5 responses)

Please excuse my ignorance of kernel development, but I don't understand how this can be made to work safely. I am not sure under which conditions CPU hotplug events are processed, but I would imagine that at that time numerous locks in the kernel would be in the locked state.

If the CPU hotplug code causes the unlock functions to become a NOOP, then after the event has been processed, any scheduled tasks that perform an unlock of an existing lock will not cause any change in the state of the kernel.

If the kernel is then patched for SMP again, and numerous locks which were 'unlocked' by tasks running under the non-SMP kernel are actually still marked in memory as locked, would that not cause severe system instability or a deadlock condition?

Again, I only know the kernel by concept, and have never written a line of code for it. Excuse my ignorance. :)

SMP alternatives

Posted Dec 15, 2005 14:05 UTC (Thu) by kpfleming (subscriber, #23250) [Link] (2 responses)

You are confusing mutual exclusion/semaphore locks with 'bus' locks. There is no 'unlock' operation involved here; the 'lock' prefix instruction being referred to only affects the instruction it precedes, and there is an automatic 'unlock' of the bus when that instruction completes.

Switching from uni- to multi-processor mode won't even require holding all the kernel threads/processes in an idle state while this happens, it would just have to complete all the instruction patching before any threads could be allowed to run on the new CPU.

This is a very, very cool idea :-)

SMP alternatives

Posted Dec 15, 2005 18:39 UTC (Thu) by Ross (guest, #4065) [Link] (1 responses)

That's not how I read it.

"The main use of SMP alternatives in his patch is with spinlock operations; spinlocks can be patched in or edited out, as dictated by the configuration of the system at boot time."

It sounds like spinlocks will be turned into noops. This may be ok when going SMP->UP, but maybe not the other direction, and I wonder what kind of lock state would be retained when going SMP->UP->SMP...

SMP alternatives

Posted Dec 15, 2005 18:59 UTC (Thu) by jzbiciak (guest, #5246) [Link]

To enable hotplug SMP -> UP -> SMP, you'd definitely need to retain spinlocks, or at least put a lightweight "take lock" instruction there as opposed to the full spin. Fully NOPping spinlocks out would be a disaster, as you note.

Elsewhere, though, you could nuke/replace LOCK prefixes as needed.

Safety

Posted Dec 15, 2005 14:55 UTC (Thu) by corbet (editor, #1) [Link]

Changing the functioning of spinlocks could certainly create trouble if parts of the kernel are currently holding locks! By my reading of the patch, there are a couple of defenses against that problem, though:

A kernel built with the SMP alternatives maintains the counters for spinlocks, so lock state should be preserved in all configurations, and
The hotplug CPU code has to quiesce the system anyway, so no atomic code should be running while alternatives are being applied.

SMP alternatives

Posted Dec 25, 2005 19:46 UTC (Sun) by efexis (guest, #26355) [Link]

Um, my guess is that you'd stop processes from running on the second processor -before- switching down to a single processor kernel. As only one processor could then even be holding locks, the locking mechanism becomes irrelevant, so can be NOOPed.

SMP alternatives

Posted Dec 15, 2005 12:35 UTC (Thu) by NAR (subscriber, #1313) [Link] (3 responses)

Virtualization systems (and Xen in particular) are implementing the ability to configure the number of (virtual) CPUs in each running instance on the fly, in response to the load on each. So it may really be that a busy, virtualized server will have CPUs hot-plugged into it, and that those processors will go away when the load drops.

Why bother with hotplugging virtual processors? Isn't it simpler to add more resources (e.g. more CPU slices per second) on the host to the particular virtualized server?

Bye,NAR

SMP alternatives

Posted Dec 15, 2005 15:26 UTC (Thu) by bronson (subscriber, #4806) [Link] (2 responses)

Yes, as long as you have more slices per second to give.

Once you've maxed out a single processor, your only recourse is to run the VM on more processors. AFAIK there's no good way of doing this unless the VM is partitioned (parallelized?) as well. Multiple virtual CPUs is the best way of doing this.

But the VM doesn't care! Why not always run it in a quad-CPU configuration? Because that wastes cycles if it's just being run on a single CPU.

That's just a guess. It seems an overcomplex solution to me but I haven't looked at the code yet.

SMP alternatives

Posted Dec 18, 2005 19:26 UTC (Sun) by bk (guest, #25617) [Link] (1 responses)

Can these processors be on different physical machines? This sounds like an implementation of SSI clustering, which is pretty cool.

SMP alternatives

Posted Dec 21, 2005 7:08 UTC (Wed) by csamuel (✭ supporter ✭, #2624) [Link]

No.

Xen is still running on a single system and whilst you can migrate a
system between Xen servers (if you've got a shared filesystem that can
cope) it only runs on one or the other with a tiny pause for the actual
transition.

SMP alternatives

Posted Dec 15, 2005 20:14 UTC (Thu) by captrb (guest, #2291) [Link] (1 responses)

Can anybody speculate whether this functionality will make various
power management easier on SMP machines? It seems (without any
knowledge to back it up) that each CPU could be removed until there
was a single CPU left, then the same procedure that works on
uniprocessor machines could be performed.

It's a real shame, especially with dual-core CPUs and
hyperthreading, that there is a choice between fancy power
management and multiple processors.

SMP alternatives

Posted Dec 25, 2005 19:51 UTC (Sun) by efexis (guest, #26355) [Link]

No. This is talking about removing code once you've already switched down to a single processor, and adding code when you want to switch up.

Adding/removing processors at runtime, I believe, is already possible (otherwise this code would be pretty useless)

SMP alternatives

Posted Dec 16, 2005 17:38 UTC (Fri) by norsk (guest, #30746) [Link] (2 responses)

Back in 1998, while working at Novell on the Netware SMP project, I did the same thing on self-modifying kernel mods. The kernel was built in SMP mode and when installed on a UP system, all the "lock" instructions, SMP and atomics called their respective init routines to determine whether UP or SMP and applied the correct op-codes. Gave us 2-5% improvement and a major cost in shipping of different kernels.

I like the idea of reversing the mods when a CPU hotplug event occurs. Hardware at my time did not have the feature set.

I was wondering why this had not yet happened before in Linux. Still a far better world to work in than the Netware kernel.

doug "norsk" thompson

SMP alternatives

Posted Dec 27, 2005 17:13 UTC (Tue) by cajal (guest, #4167) [Link] (1 responses)

If I'm interpreting your post right, you're saying this is a bad thing. A 2-5% improvement is pretty negligible, and it came at a major cost. Sounds like this is something Linux should avoid.

SMP alternatives

Posted Dec 27, 2005 23:17 UTC (Tue) by turpie (guest, #5219) [Link]

He meant that it saved the major cost of shipping different kernels, by allowing them to ship a single kernel for both UP and SMP.

User base

Posted Dec 22, 2005 11:22 UTC (Thu) by ringerc (subscriber, #3071) [Link] (3 responses)

The potential user base is larger than hot-plugging SMP users and Xen users. Many of the next generation of laptops will have dual-core CPUs, and it's likely to be desirable to shut down a core when on battery (for example).

A dual-core system is essentially SMP... and laptops are very power / performance conscious. It would not hurt in the slightest to eke out every bit of speed (or efficiency, i.e. potentially less power use) while on a half-clocked single-core CPU for battery saving, then safely switch to full-bore dual-core mode when on mains.

This patch seems like a really good way to handle that need.

User base

Posted Dec 22, 2005 11:59 UTC (Thu) by rrw (guest, #9757) [Link] (2 responses)

OK, I have a problem with this. Why would I need to clock down a notebook on batteries and clock it up on mains?

When a Centrino notebook on mains works with full speed it soon gets so hot that the fan works non stop, and the keyboard itself gets annoyingly warm.

On the other hand, when notebook runs on battery and I want to do something cpu intensive it is painstakingly slow. But where's the power consumption advantage? I have to do this anyway, slower or faster, and if I do something with downclocked CPU, power consumption per unit of time in display, gpu, harddisk doesn't decrease, so I actually eat more juice.

Isn't it better to just use something like powernowd, which dynamically clocks the processor depending on system load, whether you use mains or battery?

Robert

CPU usage optimisation

Posted Mar 30, 2006 5:20 UTC (Thu) by xoddam (guest, #2322) [Link] (1 responses)

Exactly. The CPU configuration and speed should be adjusted according to
demand for processing power, not supply of electricity.

There is never a case for consuming more power just because you *can*.
Otherwise we'd all leave all our computers and all our lights switched on
all of the time. Bring on global warming.

CPU usage optimisation

Posted Jul 22, 2021 1:41 UTC (Thu) by muzg666 (guest, #139506) [Link]

hahahahah lol

Single kernel image for UP & SMP

Posted Dec 22, 2005 15:41 UTC (Thu) by zdzichu (subscriber, #17118) [Link] (4 responses)

Could this patch cause removing CONFIG_SMP in future? With kernel image identical for SMP and UP machines and runtime patching?

Single kernel image for UP & SMP

Posted Dec 22, 2005 23:30 UTC (Thu) by renox (guest, #23785) [Link] (1 responses)

Well, as some others have remarked, for this to work you have to remember which (spin)locks are taken, even on a UP processor, thus losing some performance in UP.

I think that even a minimal loss would be a hard sell to kernel developers considering how little this UP<->SMP feature is going to be used.

Single kernel image for UP & SMP

Posted Dec 25, 2005 19:59 UTC (Sun) by efexis (guest, #26355) [Link]

There's no reason why you would have to do that (I won't repeat my last post). The speed differences would be down to the processor having to deal with the few no-ops, eating up slightly more L1/L2 cache, and keeping in memory (and in the kernel image file itself) - for those who worry about that - code for both scenarios.

Single kernel image for UP & SMP

Posted Jun 9, 2006 23:42 UTC (Fri) by niner (subscriber, #26151) [Link] (1 responses)

I can't imagine that. Think of embedded applications. When you only have 2MB of flash and a couple of megabytes of RAM, you really start to care about the few kilobytes such a feature costs of both memory footprints.

Single kernel image for UP & SMP

Posted Jun 11, 2006 16:14 UTC (Sun) by tyhik (guest, #14747) [Link]

RAM and flash chips get larger and larger all the time. For smaller design companies it may already be cheaper to get a 4MB than a 2MB flash chip.

But of course the code size matters. Smaller code implies better cache usage.