
Semaphore slowdown #97545

Closed

Description


This started in #93222 (comment).

Here are two of my posts there, repeated:


The test program spawns oodles of trivial tasks and tries to rate-limit them by making each task acquire a semaphore first. The semaphore allows 50 tasks at a time. With Python 3.11rc2, on my Mac it does nearly 25,000 iterations per second. With the new code it does about 900, or about 27 times slower.
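For reference, here is a minimal sketch of a benchmark along these lines (the exact program from #93222 is not reproduced here; the task count, the way tasks are spawned, and the counter are stand-ins):

```python
import asyncio
import time

NUM_TASKS = 10_000    # stand-in for "oodles of trivial tasks"
CONCURRENCY = 50      # the semaphore allows 50 tasks at a time

async def worker(sem, counter):
    async with sem:               # each task acquires the semaphore first
        await asyncio.sleep(0)    # trivial body
        counter[0] += 1

async def main():
    sem = asyncio.Semaphore(CONCURRENCY)
    counter = [0]
    t0 = time.perf_counter()
    await asyncio.gather(*(worker(sem, counter) for _ in range(NUM_TASKS)))
    elapsed = time.perf_counter() - t0
    print(f"{counter[0] / elapsed:,.0f} iterations/sec")

asyncio.run(main())
```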

The first 50 tasks take the easy path through acquire(); every following task gets put in the queue first. What makes it so slow is that once 50 tasks have acquired the semaphore in a single event loop iteration, they will also all release it in a single iteration -- but the new algorithm only wakes up the first waiting task, and the next 49 release() calls do nothing to the queue. So from then on we only wake up one task per iteration.
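To make that concrete, here is a sketch of the kind of release() logic described (illustrative only, not the actual stdlib source):

```python
class _SketchSemaphore:
    def __init__(self, value):
        self._value = value
        self._waiters = []       # futures of blocked acquire() calls, FIFO

    def release(self):
        self._value += 1
        if not self._waiters:
            return
        fut = self._waiters[0]   # only ever looks at the *first* waiter
        if not fut.done():
            fut.set_result(True)
        # If the first waiter was already woken earlier in this same event
        # loop iteration, this does nothing: the remaining release() calls
        # are no-ops until that waiter runs and dequeues itself.
```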

Of course, it's easy to complain when the broken code is fast. :-) But I think we can do better.

(UPDATE: The same code without using a semaphore can do over 200,000 loops/sec. Same if I keep the semaphore but set the initial value to 50,000. If I also remove the sleep(0) from the task it can do over 400,000 loops/sec. A loop that doesn't create tasks but calls sleep(0) and bumps the counter runs around 50,000.)


Actually the behavior is guaranteed. Under "Scheduling Callbacks" in the asyncio documentation I read:

Callbacks are called in the order in which they are registered.

So we can definitely rely on this. (Sorry I hadn't mentioned this earlier, I wasn't actually aware of the guarantee, just of how the code works.)
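A quick toy example (not from the issue) that shows the guarantee:

```python
import asyncio

async def main():
    loop = asyncio.get_running_loop()
    order = []
    for i in range(3):
        loop.call_soon(order.append, i)
    await asyncio.sleep(0)         # yield so the queued callbacks can run
    assert order == [0, 1, 2]      # callbacks ran in registration order

asyncio.run(main())
```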

Something that doesn't affect correctness but may affect performance is that the event loop goes through a cycle:

  • wait for I/O events (*)
  • register callbacks for I/O events
  • call all callbacks that are registered and ready at this point (I/O and otherwise)
  • go back to top

(*) The timeout for the I/O wait is zero if there are callbacks that are immediately ready; otherwise it is the time until the first callback scheduled for a particular time in the future (call_later()).

The I/O wait has expensive fixed overhead, so we want to call as many callbacks in a single iteration as possible.
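Schematically, something like this toy loop (illustrative only; the real logic lives in BaseEventLoop._run_once and has considerably more bookkeeping):

```python
import collections
import selectors
import time

class ToyLoop:
    def __init__(self):
        self.selector = selectors.DefaultSelector()
        self.ready = collections.deque()   # callbacks runnable right now
        self.timers = []                   # (deadline, callback) pairs

    def run_once(self):
        # (*) Zero timeout if callbacks are already ready; otherwise wait
        # until the earliest timer deadline (the call_later() case).
        if self.ready:
            timeout = 0
        elif self.timers:
            timeout = max(0.0, min(d for d, _ in self.timers) - time.monotonic())
        else:
            timeout = None
        events = self.selector.select(timeout)    # the expensive I/O wait
        for key, _mask in events:
            self.ready.append(key.data)           # register I/O callbacks
        now = time.monotonic()
        for deadline, cb in list(self.timers):
            if deadline <= now:
                self.timers.remove((deadline, cb))
                self.ready.append(cb)
        # Run every callback ready *at this point*; anything scheduled
        # while these run is deferred to the next iteration.
        for _ in range(len(self.ready)):
            self.ready.popleft()()
```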

Therefore I think it behooves us to make ready all the futures for which we have room. I think it can work like the following abstract algorithm (a rough Python sketch follows after the list):

  • Definitions:
    • Level: L = self._value
    • Waiters: W = self._waiters
    • Ready: R = [w for w in W if w.done() and not w.cancelled()]
    • Cancelled: C = [w for w in W if w.cancelled()]
    • Blocked: B = [w for w in W if not w.done()]
    • Note that R, C and B are views on W, not separate data structures
  • Invariant that should hold at all times:
    • L >= |R|
      (I.e., we should not promise more guests to seat than we have open tables)
  • Operations:
    • Equalize: while |B| > 0 and L > |R|: make the first item of B ready (move it to R)
    • Release: L++; Equalize
    • Acquire:
      • if L > 0 and |R| == 0 and |B| == 0: L--; return
      • create a future F, append it to B, await it
      • when awakened (with or without exception):
        • assertion: F should be in either R or C
        • remove F from W (hence from R or C)
        • if no exception caught: L--; Equalize; return
        • if CancelledError caught and F.cancelled(): return
        • if CancelledError caught and not F.cancelled(): Equalize; return
        • (other exceptions are not expected and will bubble out unhandled)
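A rough Python sketch of that algorithm, just to make the bookkeeping concrete (the class name, the single deque holding R, C, and B, and re-raising on cancellation are my assumptions, not a proposed patch):

```python
import asyncio
import collections

class EqualizingSemaphore:
    """Sketch of the abstract algorithm above; not the stdlib Semaphore."""

    def __init__(self, value=1):
        self._value = value                   # L
        self._waiters = collections.deque()   # W: pending (B) or done (R/C) futures

    def _ready_count(self):
        # |R|: waiters already promised a slot but not yet resumed.
        return sum(1 for w in self._waiters if w.done() and not w.cancelled())

    def _equalize(self):
        # While there are blocked waiters and spare capacity (L > |R|),
        # promise the next blocked waiter a slot.
        ready = self._ready_count()
        for w in self._waiters:
            if ready >= self._value:
                break
            if not w.done():
                w.set_result(True)
                ready += 1

    def release(self):
        self._value += 1
        self._equalize()

    async def acquire(self):
        # Fast path: L > 0 and |R| == |B| == 0, i.e. every waiter is cancelled.
        if self._value > 0 and all(w.cancelled() for w in self._waiters):
            self._value -= 1
            return True
        fut = asyncio.get_running_loop().create_future()
        self._waiters.append(fut)
        try:
            await fut
        except asyncio.CancelledError:
            self._waiters.remove(fut)
            if not fut.cancelled():
                # We were already promised a slot; pass it on to someone else.
                self._equalize()
            raise
        self._waiters.remove(fut)
        self._value -= 1
        self._equalize()
        return True
```

With this, 50 release() calls inside a single iteration promise slots to 50 blocked waiters, so all of them can resume in the next iteration instead of one per iteration.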

Metadata

Assignees: no one assigned
Labels: performance (Performance or resource usage), topic-asyncio, type-bug (An unexpected behavior, bug, or error)
Status: Done
Milestone: no milestone
Development: no branches or pull requests