
GEM v. TTM


By Jonathan Corbet
May 28, 2008
Getting high-performance, three-dimensional graphics working under Linux is quite a challenge even when the fundamental hardware programming information is available. One component of this problem is memory management: a graphics processor (GPU) is, essentially, a computer of its own with a distinct view of memory. Managing the GPU's memory - and its view of system RAM - must be done carefully if the resulting system is intended to work at all, much less with acceptable performance.

Not that long ago, it appeared that this problem had been solved with the translation table maps (TTM) subsystem. TTM remains outside of the mainline kernel, though, as do all drivers which use it. A recent query about what would be required to get TTM merged led to an interesting discussion where it turned out that, in fact, TTM may not be the future of graphics memory management after all.

A number of complaints about TTM have been raised. Its API is far larger than is needed for any free Linux driver; it has, in other words, a certain amount of code dedicated to the needs of binary-only drivers. The fencing mechanism (which manages concurrency between the host CPUs and the GPU) is seen as being complex, difficult to work with, and not always yielding the best performance. Heavy use of memory-mapped buffers can create performance problems of its own. The TTM API is an exercise in trying to provide for everything in all situations; as a result it is, according to some driver developers, hard to match to any specific hardware, hard to get started with, and still insufficiently flexible. And, importantly, there is a distinct shortage of working free drivers which use TTM. So Dave Airlie worries:

I was hoping that by now, one of the radeon or nouveau drivers would have adopted TTM, or at least demoed something working using it, this hasn't happened which worries me... The real question is whether TTM suits the driver writers for use in Linux desktop and embedded environments, and I think so far I'm not seeing enough positive feedback from the desktop side

All of these worries would seem to be moot, since TTM is available and there is nothing else out there. Except, as it turns out, there is something out there: it's called the Graphics Execution Manager, or GEM. The Intel-sponsored GEM project is all of one month old, as of this writing. The GEM developers had not really intended to announce their work quite yet, but the TTM discussion brought the issue to the fore.

Keith Packard's introduction to GEM includes a document describing the API as it exists so far. There are a number of significant differences in how GEM does things. To begin with, GEM allocates graphical buffer objects using normal, anonymous, user-space memory. That means that these buffers can be forced out to swap when memory gets tight. There are clear advantages to this approach, and not just in memory flexibility: it also makes the implementation of suspend and resume easier by automatically providing backing store for all buffer objects.
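
One way to get that kind of swappable backing store is to build on the kernel's shmem layer. The sketch below is a simplified illustration of the idea; the gem_object structure and the gem_object_create() function are invented names for this example, not code taken from GEM itself:

/*
 * Illustrative sketch only: back a buffer object with swappable,
 * shmem-based memory.  "struct gem_object" and gem_object_create()
 * are made-up names, not the real GEM code.
 */
#include <linux/err.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/slab.h>

struct gem_object {
    struct file *filp;   /* shmem file providing the backing store */
    size_t size;
};

static struct gem_object *gem_object_create(size_t size)
{
    struct gem_object *obj;

    obj = kzalloc(sizeof(*obj), GFP_KERNEL);
    if (!obj)
        return ERR_PTR(-ENOMEM);

    /*
     * The pages behind this file live in the page cache and can be
     * written out to swap when memory gets tight - which also gives
     * suspend/resume a place to save the object's contents.
     */
    obj->filp = shmem_file_setup("gem object", size, 0);
    if (IS_ERR(obj->filp)) {
        long err = PTR_ERR(obj->filp);

        kfree(obj);
        return ERR_PTR(err);
    }

    obj->size = size;
    return obj;
}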

The GEM API tries to do away with the mapping of buffers into user space. That mapping is expensive to do and brings all sorts of interesting issues with cache coherency between the CPU and GPU. So, instead, buffer objects are accessed with simple read() and write() calls. Or, at least, that's the way it would be if the GEM developers could attach a file descriptor to each buffer object. The kernel, however, does not make the management of that many file descriptors easy (yet), so the real API uses separate handles for buffer objects and a series of ioctl() calls.
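
What a handle-plus-ioctl() interface might look like from user space is sketched below. The ioctl numbers and structure names are hypothetical, chosen only to show the shape of such an API; they are not the real GEM interface:

/*
 * Hypothetical user-space sketch of a handle-based buffer API driven
 * by ioctl() calls.  The structures and ioctl numbers are invented.
 */
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

struct gem_create {
    uint64_t size;      /* in: requested object size in bytes */
    uint32_t handle;    /* out: per-process handle for the object */
};

struct gem_pwrite {
    uint32_t handle;    /* which object to write to */
    uint64_t offset;    /* offset within the object */
    uint64_t size;      /* number of bytes to copy */
    uint64_t data_ptr;  /* user pointer to the source data */
};

#define HYP_IOCTL_GEM_CREATE  _IOWR('d', 0x10, struct gem_create)
#define HYP_IOCTL_GEM_PWRITE  _IOW('d', 0x11, struct gem_pwrite)

/* Create an object on the DRM file descriptor and upload some data. */
static int upload_data(int drm_fd, const void *data, size_t len,
                       uint32_t *handle_out)
{
    struct gem_create create = { .size = len };
    struct gem_pwrite pwrite;

    if (ioctl(drm_fd, HYP_IOCTL_GEM_CREATE, &create) < 0)
        return -1;

    memset(&pwrite, 0, sizeof(pwrite));
    pwrite.handle = create.handle;
    pwrite.offset = 0;
    pwrite.size = len;
    pwrite.data_ptr = (uintptr_t)data;

    if (ioctl(drm_fd, HYP_IOCTL_GEM_PWRITE, &pwrite) < 0)
        return -1;

    *handle_out = create.handle;
    return 0;
}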

That said, it is possible to map a buffer object into user space. But then the user-space driver must take explicit responsibility for the management of cache coherency. To that end there is a set of ioctl() calls for managing the "domain" of a buffer; the domain, essentially, describes which component of the system owns the buffer and is entitled to operate on it. Changing the domains (there are two, one for read access and one for writes) of a buffer will perform the necessary cache flushes. In a sense, this mechanism resembles the streaming DMA API, where the ownership of DMA buffers can be switched between the CPU and the peripheral controller. That is not entirely surprising, as a very similar problem is being solved.
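
For comparison, this is how the streaming DMA API expresses that kind of ownership hand-off in ordinary kernel code (generic DMA code, not GEM): the buffer is explicitly passed back and forth between the device and the CPU, and the API performs whatever cache maintenance the architecture requires at each transition.

/*
 * Streaming DMA analogy (generic kernel code, not GEM): ownership of a
 * buffer is handed back and forth between the device and the CPU, with
 * the DMA API doing the required cache flushes or invalidations.
 * Error handling is omitted for brevity.
 */
#include <linux/device.h>
#include <linux/dma-mapping.h>

static void dma_ownership_example(struct device *dev, void *buf, size_t len)
{
    dma_addr_t dma;

    /* Hand the buffer to the device; the CPU must not touch it now. */
    dma = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);

    /* ... start the device and wait for it to fill the buffer ... */

    /* Give ownership back to the CPU; it is now safe to read buf. */
    dma_sync_single_for_cpu(dev, dma, len, DMA_FROM_DEVICE);
    /* ... the CPU examines the data ... */

    /* Hand the buffer to the device again for another transfer. */
    dma_sync_single_for_device(dev, dma, len, DMA_FROM_DEVICE);

    /* ... and, when the buffer is no longer needed for DMA ... */
    dma_unmap_single(dev, dma, len, DMA_FROM_DEVICE);
}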

This API also does away with the need for explicit fence operations. Instead, a CPU operation which requires access to a buffer will simply wait, if necessary, for the GPU to finish any outstanding operations involving that buffer.

Finally, the GEM API does not try to solve the entire problem; a number of important operations (such as the execution of a set of GPU commands) are left for the hardware-specific driver to implement. GEM is, thus, quite specific to the needs of Intel's driver at this time; it does not try for the same sort of generality that was a goal of TTM. As described by Eric Anholt:

The problem with TTM is that it's designed to expose one general API for all hardware, when that's not what our drivers want... We're trying to come at it from the other direction: Implement one driver well. When someone else implements another driver and finds that there's code that should be common, make it into a support library and share it.

The advantage to this approach is that it makes it relatively easy to create something which works well with Intel drivers. And that may well be a good start; one working set of drivers is better than none. On the other hand, that means that a significant amount of work may be required to get GEM to the point where it can support drivers for other hardware. There seem to be two points of view on how that might be done: (1) add capabilities to GEM when needed by other drivers, or (2) have each driver use its own memory manager.

The first approach is, in many ways, more pleasing. But it implies that the GEM API could change significantly over time. And that, in turn, could delay the merging of the whole thing; the GEM API is exported to user space, and, as a result, must remain compatible as things change. So there may be resistance to a quick merge of an API which looks like it may yet have to evolve for some time.

The second approach, instead, is best described by Dave Airlie:

Well the thing is I can't believe we don't know enough to do this in some way generically, but maybe the TTM vs GEM thing proves its not possible. So we can then punt to having one memory manager per driver, but I suspect this will be a maintenance nightmare, so if people decide this is the way forward, I'm happy to see it happen. However the person submitting the memory manager n+1 must damn well be willing to stand behind the interface until time ends, and explain why they couldn't re-use 1..n memory managers.

One other remaining issue is performance. Keith Whitwell posted some benchmark results showing that the i915 driver performs significantly worse with either TTM or GEM than without. Keith Packard gets different results, though; his tests show that the GEM-based driver is significantly faster. Clearly there is a need for a set of consistent benchmarks; performance of graphics drivers is important, but performance cannot be optimized if it cannot be reliably measured.

The use of anonymous memory also raises some performance concerns: a first-person shooter game will not provide the same experience if its blood-and-gore textures must be continually paged in. Anonymous memory can also be high memory, and, thus, not necessarily accessible via a 32-bit pointer. Some GPU hardware cannot address high memory; that will likely force the use of bounce buffers within the kernel. In the end, GEM will have to prove that it can deliver good performance; GEM's developers are highly motivated to make their hardware look good, so there is a reasonable chance that things will work out on this front.

The conclusion to draw from all of this is that the GPU memory management problem cannot yet be considered solved. GEM might eventually become that solution, but it is a very new API, and a great deal of work remains to be done in this area.

(Thanks to Timo Jyrinki for suggesting this topic.)



GEM v. TTM

Posted May 28, 2008 15:08 UTC (Wed) by zooko (guest, #2589) [Link] (1 responses)

"A number of complaints about TTM have been raised. Its API is far larger than is needed for
any free Linux driver; it has, in other words, a certain amount of code dedicated to the needs
of binary-only drivers."

How are the needs of binary-only drivers different than the needs of open source drivers?  Is
TTM offering API pieces that are particularly useful for DRM or something like that?

GEM v. TTM

Posted May 29, 2008 4:51 UTC (Thu) by dberkholz (guest, #23346) [Link]

It's more that the needs of embedded hardware only supported by binary-only drivers are
different.

No Bounce buffers

Posted May 28, 2008 15:43 UTC (Wed) by arjan (subscriber, #36785) [Link]

Since with shmfs you can set a DMA mask (effectively) on the inode, there's no need to use
bounce buffers... you just allocate the memory in the right place from the start.
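
A minimal sketch of the approach arjan describes, assuming shmem-backed objects and using the kernel's mapping_set_gfp_mask() helper (the gem_restrict_to_32bit() name is made up for this illustration):

/*
 * Illustration of the "allocate it in the right place from the start"
 * idea: restrict the GFP mask on a shmem-backed object's mapping so
 * that its pages always come from memory the GPU can address.
 * gem_restrict_to_32bit() is an invented name.
 */
#include <linux/fs.h>
#include <linux/gfp.h>
#include <linux/pagemap.h>

static void gem_restrict_to_32bit(struct file *shmem_file)
{
    /*
     * Any page allocated for this mapping from now on must come from
     * the DMA32 zone (below 4GB), i.e. within reach of a GPU that can
     * only generate 32-bit addresses - so no bounce buffers are needed.
     */
    mapping_set_gfp_mask(shmem_file->f_mapping,
                         GFP_KERNEL | __GFP_DMA32);
}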

Clarification on benchmark results

Posted May 28, 2008 17:55 UTC (Wed) by keithw (guest, #3127) [Link] (1 responses)

Note that the benchmark results I posted don't exactly show what is claimed in the article.

In particular, the version of the driver labeled "i915tex" is the original TTM version of the
i915 driver and has good performance, while "master/ttm" is a newer one which seems to have
suffered some degree of performance regression relative to both i915tex and the original
non-ttm version...  at least in the couple of machines I've looked at...

To make things even more confusing, it seems that Keith Packard's testing may have revealed
yet another regression in the non-ttm versions of the driver, which I haven't had a chance to
dig into at this point.  

All this testing is pretty preliminary & hampered by lack of time & travel schedules, etc.
So, nobody really has all the answers.

Anyway, the biggest win at this point would be getting some sort of a memory manager interface
that everyone agrees on & can move forward with, *providing* that it doesn't encode design
decisions which preclude a properly performant implementation -- and I'm hopeful that's the
case.

Keith


Clarification on benchmark results

Posted May 30, 2008 20:40 UTC (Fri) by drag (guest, #31333) [Link]

This may be one of those situations where you just are not going to know the right way to do
it.

Like the Linux developers stumbling over themselves to deal with wireless drivers... First they
treated them as generic ethernet devices, which meant that each driver had to do way too much
work on its own.

Then Intel introduced their open source 802.11 stack; unfortunately, it was not generic enough
to work with all sorts of different drivers.

Now they've finally got the devicescape stuff far enough along that writing Linux wireless
drivers is a sane thing to do.

Who knows? It may just be that an Intel video card and an Nvidia/ATI card are so different
architecturally that they simply can't be managed with the same API, and that maybe Nvidia and
ATI cards can be managed together or something like that.

How can you tell? The only two ways I can see are to wait years and years and end up with
half-made drivers supporting obsolete hardware, or just go for it, know it's going to be a
learning experience, and get something done quickly enough that it can actually benefit
end users.

GEM v. TTM

Posted May 28, 2008 18:27 UTC (Wed) by sylware (guest, #35259) [Link]

There is also OpenGL 3 in the pipeline. But OpenGL 3 is supposed to be quite high level, and I
find it hard to imagine a modern, fast 3D engine without fine-grained control over video RAM.
Carmack said that id's next engine (number 5) will stream giant textures in order to render
outdoor landscapes. Of course you can do that with OpenGL interfaces, but common sense pushes
for a low-level video RAM management interface in order to make such an engine fast and
performant: will we see low-level memory management appear as an OpenGL extension?
To make things harder, all GPU manufacturers have announced hardware-accelerated raytracing as
coming soon, and have started to provide APIs for GPU "general programming". The GPU market is
in high entropy and tension is rising, and that's not helping the design of the new Linux
graphics stack. Intel wants to become serious about GPUs... of course those who "saw" Larrabee
performing were "stunned"; better to wait and see it in a real-life context. And NVIDIA has
suggested that in its next GPUs much of what was done on the CPU will be offloaded to the
GPU... and that's not pleasing Intel...

GEM v. TTM

Posted May 28, 2008 19:41 UTC (Wed) by MisterIO (guest, #36192) [Link]

"The first approach is, in many ways, more pleasing. But it implies that the GEM API could
change significantly over time. And that, in turn, could delay the merging of the whole thing;
the GEM API is exported to user space, and, as a result, must remain compatible as things
change. So there may be resistance to a quick merge of an API which looks like it may yet have
to evolve for some time. "

Why? If it's not standardized anywhere, you could just label it experimental and actually try
it, before starting to say that it will remain compatible as things change.

GEM v. TTM

Posted May 28, 2008 22:01 UTC (Wed) by anholt (subscriber, #52292) [Link]

The early benchmarking is kind of unfortunate -- we just started writing this code, and have
needed to spend more time on correctness than performance so far.  I've still got issues on
the 965 to resolve.  But keithp put in changes last week that got another 16% performance
improvement on my 945 system with GEM; I think we've got room for improvement on 915-class
still, and I know there's serious low-hanging fruit in 965 with GEM.

Right now, though, I care most about getting a solid user API that we can feel comfortable
putting into the kernel and maintaining for the foreseeable future.  The only issue I have with
the GEM API at the moment is the cache domain setting being a general API rather than a
driver-specific one.  So far, when we try to make a general API describing some bit of hardware
state with an N-bit field, it seems some other driver developer says he needs about 4N bits.


Just merge something!

Posted May 29, 2008 8:56 UTC (Thu) by nim-nim (subscriber, #34454) [Link] (3 responses)

I still remember seeing the first Utah GLX demos and thinking 3D was on its way to be solved.
What a fool I was.

After all these years I feel the Linux GFX developers suffer from a perpetual-alpha mindset.
Stuff is started, advances enough to be used on some dev systems and be demoed at a few
conferences, then is declared "not good enough" and killed before it reaches most user systems
(because if actual users were exposed to it, they might file issues and demand that the result
be minimally maintained, and it's much more comfortable to just work on new prototype after
new prototype).

Other systems (wireless) have gone through several API rewrites in-tree while graphics
developers were still debating whether something should be merged at all. While the wireless
rewrites have been painful, they've been a lot less painful than having them happen
out-of-tree.

So please just merge something. If it needs to be rewritten it will be rewritten, and the
rewrite will be painful, but at least users will have something to use in the meantime, and
they won't have to fish for new alpha code all over the internet.

Re: Just merge something!

Posted May 29, 2008 21:49 UTC (Thu) by anholt (subscriber, #52292) [Link] (2 responses)

We have been told by Linus that we're not allowed to break the userland API once the code gets
merged into the Linux kernel.  We've got mistakes made 8 years ago, and fixed in a better API 5
years ago, that we still have to implement because we're not allowed to break API.

It means that if you're unsure of maintaining an API today, you're really scared of merging it
and having to maintain it 5 years down the line when you've added better APIs and nobody in
their right mind is using the old software stack.

Not a userland API?

Posted May 30, 2008 18:17 UTC (Fri) by jhohm (guest, #7225) [Link] (1 responses)

I think TTM and GEM are not userland APIs, but in-kernel driver APIs; Linus's demand for
compatibility might not apply.

Not a userland API?

Posted May 30, 2008 20:34 UTC (Fri) by drag (guest, #31333) [Link]

Open Source 3D drivers in Linux do their acceleration using userspace drivers. 

The *_dri.so drivers are loaded by the X-server side of things, and the in-kernel DRM code is
what opens up a hole for those userspace drivers to interact with the kernel. The Linux kernel
is now slowly taking on additional duties to manage display modesetting and memory management,
which should help move X out of the root account and improve display performance.

Even with very expensive cards there still isn't going to be enough memory on board to manage
a very large display with many applications open on a 3D desktop. So you're going to need some
intelligent way to move memory in and out of a video card.

GEM v. TTM

Posted May 30, 2008 5:42 UTC (Fri) by jzbiciak (guest, #5246) [Link]

The use of anonymous memory also raises some performance concerns: a first-person shooter game will not provide the same experience if its blood-and-gore textures must be continually paged in.

Ah, you brought back some nightmares... er, memories...

//
// Z_Malloc
// You can pass a NULL user if the tag is < PU_PURGELEVEL.
//
#define MINFRAGMENT             64


void*
Z_Malloc
( int           size,
  int           tag,
  void*         user )
{
    int         extra;
    memblock_t* start;
    memblock_t* rover;
    memblock_t* newblock;
    memblock_t* base;

    size = (size + 3) & ~3;

    // scan through the block list,
    // looking for the first free block
    // of sufficient size,
    // throwing out any purgable blocks along the way.

    // account for size of block header
    size += sizeof(memblock_t);

    // if there is a free block behind the rover,
    //  back up over them
    base = mainzone->rover;
....

DOOM had a zone allocator setup where you could allocate purgable blocks. If you ran out of space, it would start purging blocks until there was room for the new allocation. Objects would register callbacks to handle being purged. :-)

The reason I remember it is that I had to hack around it when I made an embedded version of DOOM that directly memory mapped the WAD file rather than Z_Malloc'ing it. Finding all the places where WAD elements were being explicitly managed was no walk in the park. :-)

GEM v. TTM

Posted Dec 26, 2008 1:13 UTC (Fri) by dibbles_you (guest, #45004) [Link] (1 responses)

"The GEM API tries to do away with the mapping of buffers into user space. That mapping is expensive to do and brings all sorts of interesting issues with cache coherency between the CPU and GPU. So, instead, buffer objects are accessed with simple read() and write() calls. Or, at least, that's the way it would be if the GEM developers could attach a file descriptor to each buffer object. The kernel, however, does not make the management of that many file descriptors easy (yet), so the real API uses separate handles for buffer objects and a series of ioctl() calls."

Ok, so open/read/write/seek/mmap would be great (if the kernel could efficiently handle that many objects). Fine, use ioctls to emulate this behavior, but shouldn't we be adding macros so the calls look like open, read and write: gopen, gread and gwrite? Then when the kernel is ready, it's a simple change of a #define gwrite write ?
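
A minimal sketch of what such wrappers might look like, assuming the hypothetical ioctl-based interface sketched earlier in the article (the names are invented, not the real GEM API):

/*
 * Sketch of the idea above: hide the ioctl()-based call behind a
 * read()-shaped wrapper so that a later switch to real per-object file
 * descriptors only changes the wrapper, not its callers.  The ioctl
 * number and structure are hypothetical.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/types.h>

struct gem_pread {
    uint32_t handle;    /* object to read from */
    uint64_t offset;    /* offset within the object */
    uint64_t size;      /* number of bytes to read */
    uint64_t data_ptr;  /* user pointer to the destination buffer */
};

#define HYP_IOCTL_GEM_PREAD  _IOW('d', 0x12, struct gem_pread)

/* Today: emulate read() on a buffer-object handle with an ioctl(). */
static inline ssize_t gread(int drm_fd, uint32_t handle, uint64_t offset,
                            void *buf, size_t len)
{
    struct gem_pread req = {
        .handle   = handle,
        .offset   = offset,
        .size     = len,
        .data_ptr = (uintptr_t)buf,
    };

    if (ioctl(drm_fd, HYP_IOCTL_GEM_PREAD, &req) < 0)
        return -1;
    return (ssize_t)len;
}

/*
 * Later, if buffer objects gain real file descriptors, gread() could be
 * redefined in terms of pread() without touching its callers.
 */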

GEM v. TTM

Posted Dec 26, 2008 1:57 UTC (Fri) by quotemstr (subscriber, #45331) [Link]

Or, you know, fix whatever file descriptor infrastructure makes the management problematic.

