Debugging Linux Kernel Freezes: An eBPF Spinlock Saga
This article details the journey of debugging mysterious system freezes caused by eBPF programs in the Linux kernel. We uncovered an issue where an NMI-driven eBPF sampling program would self-deadlock by attempting to acquire a spinlock already held by another eBPF program on the same CPU, leading to 250ms kernel timeouts. The analysis highlights the complexities of spinlocks, NMIs, and cache coherence in kernel development.

As developers, we often pride ourselves on creating robust software that "just works." So, when our CPU profiler, Superluminal, started causing periodic full system freezes on a tester's Fedora 42 machine (kernel 6.17.4-200), we knew we had a serious challenge on our hands. This wasn't just a simple crash; the entire system would become unresponsive for short bursts, making traditional debugging nearly impossible. The hunt for this elusive bug led us deep into the Linux kernel's intricate world of eBPF and spinlocks.
Initial Clues from a Frozen System
Our first step was to analyze the system's behavior. Superluminal captures revealed suspicious periods, over 250 milliseconds long, where all threads appeared busy, yet no samples were being collected. Concurrently, dmesg output showed alarming messages like:
```
INFO: NMI handler (perf_event_nmi_handler) took too long to run: 250.424 msecs
```
These messages perfectly matched the freeze durations, strongly suggesting a kernel-level issue, specifically within a Non-Maskable Interrupt (NMI) handler. However, trying to attach a debugger to a freezing kernel instance proved futile; gdb itself would crash or time out, leaving us without direct insight into the kernel's state during these critical moments.
Isolating the Problem with a Minimal Repro
With direct debugging stalled, our strategy shifted to creating a minimal reproduction. Superluminal's Linux backend is substantial, involving around 2000 lines of eBPF code. We suspected the issue lay in how our eBPF programs interacted with kernel events. We categorize our eBPF events into three main types: sampling, context switch, and wake events.
Through systematic testing, enabling and disabling these event types, we made a crucial observation:
- Neither sampling events alone nor context switch/wake events alone caused freezes.
- Freezes only occurred when both sampling events and context switch events were enabled, even with wake events disabled.
- Reducing the sampling frequency decreased the frequency of freezes but didn't eliminate them.
This pointed to an interaction bug. We then painstakingly stripped down our eBPF code, keeping only the bare essentials for sampling and context switch events, until we arrived at this minimal, freeze-inducing eBPF program:
```c
// Includes and the event struct definitions are omitted here for brevity;
// they live in the surrounding eBPF source.

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 512 * 1024 * 1024);
} ringBuffer SEC(".maps");

SEC("tp_btf/sched_switch")
int cswitch(struct bpf_raw_tracepoint_args* inContext)
{
    struct CSwitchEvent* event = bpf_ringbuf_reserve(&ringBuffer, sizeof(struct CSwitchEvent), 0);
    if (event == NULL)
        return 1;
    bpf_ringbuf_discard(event, 0);
    return 0;
}

SEC("perf_event")
int sample(struct bpf_perf_event_data* inContext)
{
    struct SampleEvent* event = bpf_ringbuf_reserve(&ringBuffer, sizeof(struct SampleEvent), 0);
    if (event == NULL)
        return 1;
    bpf_ringbuf_discard(event, 0);
    return 0;
}
```
These programs do almost nothing beyond attempting to reserve and then immediately discard space in a BPF ring buffer using bpf_ringbuf_reserve and bpf_ringbuf_discard.
Unmasking the Spinlock Issue
Given the minimal code, bpf_ringbuf_reserve became our prime suspect. A quick look at its kernel implementation revealed it's guarded by a spinlock: raw_res_spin_lock_irqsave and raw_res_spin_unlock_irqrestore. These functions are designed to disable local interrupts and preemption to protect critical sections of code. However, the local_irq_save component only disables maskable interrupts.
Our key observation about sampling events, which trigger Non-Maskable Interrupts (NMIs), immediately sparked a hypothesis:
- An eBPF program, perhaps the context switch handler, acquires the ring buffer spinlock on a CPU.
- This spinlock disables maskable interrupts but critically, not NMIs.
- While the lock is held, a sampling NMI occurs on the same CPU.
- The NMI handler, running on the same CPU, then also attempts to acquire the same ring buffer spinlock.
Since the spinlock is already held by the initial eBPF program on that CPU, the NMI handler would enter a spin-wait loop. Crucially, the spinlock implementation includes a timeout to prevent indefinite spinning: the RES_DEF_TIMEOUT constant that bounds these spin-wait loops is defined as NSEC_PER_SEC / 4, which is precisely 0.25 seconds, or 250 milliseconds.
This was our "smoking gun." The 250ms timeout in the spinlock perfectly matched the observed 250+ ms system freezes and the NMI handler dmesg warnings. The system was effectively self-deadlocking: an NMI handler on a CPU would attempt to acquire a spinlock already held by code on the same CPU, which it could never release because the NMI blocked its execution. The spinlock would eventually time out, causing the observed freezes.
A Primer on Spinlocks and Their Pitfalls
This incident highlights some fundamental challenges with spinlocks, especially in kernel contexts. A basic spinlock works by repeatedly attempting an atomic compare-and-swap (CAS) operation until it successfully changes a locked flag from 0 to 1. If the CAS fails, it means another thread holds the lock, and the current thread "spins" in a loop, wasting CPU cycles.
Beyond wasted cycles, spinlocks can suffer from severe performance degradation due to "cache line bouncing." Modern CPUs use protocols like MESI to maintain cache coherence. When multiple CPUs contend for a spinlock, they repeatedly try to write to the locked flag, which sits in a single cache line. Each write attempt requires a CPU to acquire the cache line in a Modified state, invalidating it in all other CPUs' caches. This generates a constant "storm" of expensive inter-core communication over the memory bus, with performance degrading quadratically with the number of contenders. This also contributes to "unfairness," where no guarantee exists that a waiting thread will eventually acquire the lock, potentially leading to starvation if other threads continually win the race.
In our case, the specific issue wasn't just general contention, but a critical interaction with NMIs and the interrupt masking properties of raw_res_spin_lock_irqsave. The fact that NMIs cannot be masked meant that they could interrupt code holding a spinlock, then attempt to acquire the same lock, leading to a self-deadlock scenario and the subsequent timeout-induced freezes.
Upon reporting our findings to the eBPF kernel mailing list, our analysis was confirmed, leading to further investigations and fixes by kernel maintainers. This journey underscored the importance of understanding the deep interactions between eBPF, kernel primitives, and hardware specifics like NMIs and cache coherence.
FAQ
Q: What distinguishes a Non-Maskable Interrupt (NMI) from a regular interrupt, and why is this relevant to the eBPF spinlock issue?
A: NMIs are special hardware interrupts that cannot be disabled or "masked" by software, unlike regular maskable interrupts. This is critical because kernel code often acquires spinlocks after disabling local interrupts to protect critical sections. If an NMI occurs while such a spinlock is held on the same CPU, and the NMI handler then tries to acquire the same spinlock, it will spin indefinitely (or until a timeout) because the original holder cannot release the lock while the NMI handler is executing.
Q: How does "cache line bouncing" impact spinlock performance, especially in highly contended scenarios?
A: Cache line bouncing, or ping-ponging, occurs when a shared memory location (like a spinlock's locked flag) is frequently written to by multiple CPUs. According to the MESI protocol, a CPU writing to a cache line must acquire it in a Modified state, which requires invalidating that cache line in all other CPUs' caches. When many CPUs contend for a spinlock, they continuously invalidate and re-acquire the cache line, leading to a surge of expensive inter-core communication that dramatically slows down access to the shared resource, often worsening quadratically with the number of contending CPUs.
Q: Why did reducing the eBPF sampling frequency only make the freezes less frequent rather than eliminating them entirely?
A: Reducing the sampling frequency decreases the probability of an NMI (which triggers the eBPF sampling program) occurring precisely when the context switch eBPF program holds the problematic ring buffer spinlock on the same CPU. While the likelihood of this specific race condition decreases, the fundamental flaw in the spinlock's interaction with NMIs remains. Therefore, given enough time or sufficient system load, the specific timing conditions for the freeze can still be met, just less often.