Optimize nv_kthread_q batching and reduce per-item wakeup overhead#1050

Open
arch-fan wants to merge 1 commit into NVIDIA:main from arch-fan:perf/nv-kthread-q-batching

Conversation

@arch-fan arch-fan commented Mar 7, 2026

Summary

This change reduces the scheduling overhead of nv_kthread_q, which is used across
nvidia, nvidia-drm, nvidia-modeset, and nvidia-uvm.

The previous implementation used a counting semaphore and dequeued exactly one item
per wakeup. This change switches the queue to a wait_queue_head_t + pending_count
model and drains queued items in batches.

What changed

  • Replace the per-item semaphore with:
    • wait_queue_head_t q_wait_queue
    • atomic_t pending_count
  • Wake the worker thread only on the 0 -> 1 transition.
  • Drain the full queue into a local list with list_splice_init() and process it
    in one pass.
  • Keep the existing queue API and semantics unchanged.
  • Add a small user-space queue model benchmark under:
    • tools/nv-kthread-q-bench/

Why

nv_kthread_q sits on hot paths used by deferred work, bottom halves, and driver-side
event handling. Waking the worker once per queued item adds avoidable scheduler and
locking overhead when the queue receives bursts of small work items.

Batch draining reduces:

  • wakeups
  • lock/unlock frequency on dequeue
  • queue management overhead under bursty load

Expected impact

This is primarily a latency/overhead optimization.

Potential user-visible impact:

  • better frame pacing in bursty driver activity
  • less CPU overhead in deferred work paths
  • lower overhead in UVM/ISR-related queue usage

This is not expected to materially increase peak GPU throughput by itself.

Testing

Build validation

Built successfully against the local Xanmod kernel development tree:

make -j$(nproc) modules \
  SYSSRC=/nix/store/4ddw927f74js8ra4cahm3ism430a8zqi-linux-xanmod-6.18.16-dev/lib/modules/6.18.16-xanmod1/source \
  SYSOUT=/nix/store/4ddw927f74js8ra4cahm3ism430a8zqi-linux-xanmod-6.18.16-dev/lib/modules/6.18.16-xanmod1/build \
  CC=gcc LD=ld \
  NV_KERNEL_MODULES="nvidia nvidia-drm nvidia-modeset nvidia-uvm"

Generated successfully:

  • nvidia.ko
  • nvidia-drm.ko
  • nvidia-modeset.ko
  • nvidia-uvm.ko

Queue model benchmark

Built and ran:

make -C tools/nv-kthread-q-bench
./tools/nv-kthread-q-bench/nvq_model_bench 300000 600000 8 10

Observed results on this machine:

  • single-producer median:
    • 121.24 ns/item -> 32.17 ns/item
    • 3.77x improvement
  • 8-producer median:
    • 89.46 ns/item -> 55.67 ns/item
    • 1.61x improvement

Limitations

  • I did not load the modified kernel modules on the running system in this environment.
  • The benchmark is a user-space model of the queue design, not a live in-kernel runtime benchmark.

Notes

The change is intentionally scoped to the queue internals and preserves the external
behavior of nv_kthread_q.

@CLAassistant
CLAassistant commented Mar 7, 2026

CLA assistant check
All committers have signed the CLA.
