Optimize nv_kthread_q batching and reduce per-item wakeup overhead#1050

Open
arch-fan wants to merge 1 commit into NVIDIA:main from arch-fan:perf/nv-kthread-q-batching

Conversation

@arch-fan arch-fan commented Mar 7, 2026

Summary

This change reduces the scheduling overhead of nv_kthread_q, which is used across
nvidia, nvidia-drm, nvidia-modeset, and nvidia-uvm.

The previous implementation used a counting semaphore and dequeued exactly one item
per wakeup. This change switches the queue to a wait_queue_head_t + pending_count
model and drains queued items in batches.

What changed

  • Replace the per-item semaphore with:
    • wait_queue_head_t q_wait_queue
    • atomic_t pending_count
  • Wake the worker thread only on the 0 -> 1 transition.
  • Drain the full queue into a local list with list_splice_init() and process it
    in one pass.
  • Keep the existing queue API and semantics unchanged.
  • Add a small user-space queue model benchmark under:
    • tools/nv-kthread-q-bench/

Why

nv_kthread_q sits on hot paths used by deferred work, bottom halves, and driver-side
event handling. Waking the worker once per queued item adds avoidable scheduler and
locking overhead when the queue receives bursts of small work items.

Batch draining reduces:

  • wakeups
  • lock/unlock frequency on dequeue
  • queue management overhead under bursty load

Expected impact

This is primarily a latency/overhead optimization.

Potential user-visible impact:

  • better frame pacing in bursty driver activity
  • less CPU overhead in deferred work paths
  • lower overhead in UVM/ISR-related queue usage

This is not expected to materially increase peak GPU throughput by itself.

Testing

Build validation

Built successfully against the local Xanmod kernel development tree:

make -j$(nproc) modules \
  SYSSRC=/nix/store/4ddw927f74js8ra4cahm3ism430a8zqi-linux-xanmod-6.18.16-dev/lib/modules/6.18.16-xanmod1/source \
  SYSOUT=/nix/store/4ddw927f74js8ra4cahm3ism430a8zqi-linux-xanmod-6.18.16-dev/lib/modules/6.18.16-xanmod1/build \
  CC=gcc LD=ld \
  NV_KERNEL_MODULES="nvidia nvidia-drm nvidia-modeset nvidia-uvm"

Generated successfully:

  • nvidia.ko
  • nvidia-drm.ko
  • nvidia-modeset.ko
  • nvidia-uvm.ko

Queue model benchmark

Built and ran:

make -C tools/nv-kthread-q-bench
./tools/nv-kthread-q-bench/nvq_model_bench 300000 600000 8 10

Observed results on this machine:

  • single-producer median:
    • 121.24 ns/item -> 32.17 ns/item
    • 3.77x improvement
  • 8-producer median:
    • 89.46 ns/item -> 55.67 ns/item
    • 1.61x improvement

Limitations

  • I did not load the modified kernel modules on the running system in this environment.
  • The benchmark is a user-space model of the queue design, not a live in-kernel runtime benchmark.

Notes

The change is intentionally scoped to the queue internals and preserves the external
behavior of nv_kthread_q.

@CLAassistant
CLAassistant commented Mar 7, 2026

CLA assistant check
All committers have signed the CLA.
