Debugging Linux Kernel Freezes: An eBPF Spinlock Saga
This article details the journey of debugging mysterious system freezes caused by eBPF programs in the Linux kernel. We uncovered an issue where an NMI-driven eBPF sampling program would self-deadlock by attempting to acquire a spinlock already held by another eBPF program on the same CPU, leading to 250ms kernel timeouts. The analysis highlights the complexities of spinlocks, NMIs, and cache coherence in kernel development.

As developers, we often pride ourselves on creating robust software that "just works." So, when our CPU profiler, Superluminal, started causing periodic full system freezes on a tester's Fedora 42 machine (kernel 6.17.4-200), we knew we had a serious challenge on our hands. This wasn't just a simple crash; the entire system would become unresponsive for short bursts, making traditional debugging nearly impossible. The hunt for this elusive bug led us deep into the Linux kernel's intricate world of eBPF and spinlocks.
Initial Clues from a Frozen System
Our first step was to analyze the system's behavior. Superluminal captures revealed suspicious periods, over 250 milliseconds long, where all threads appeared busy, yet no samples were being collected. Concurrently, dmesg output showed alarming messages like:
```
INFO: NMI handler (perf_event_nmi_handler) took too long to run: 250.424 msecs
```
These messages perfectly matched the freeze durations, strongly suggesting a kernel-level issue, specifically within a Non-Maskable Interrupt (NMI) handler. However, trying to attach a debugger to a freezing kernel instance proved futile; gdb itself would crash or time out, leaving us without direct insight into the kernel's state during these critical moments.
Isolating the Problem with a Minimal Repro
With direct debugging stalled, our strategy shifted to creating a minimal reproduction. Superluminal's Linux backend is substantial, involving around 2000 lines of eBPF code. We suspected the issue lay in how our eBPF programs interacted with kernel events. We categorize our eBPF events into three main types: sampling, context switch, and wake events.
Through systematic testing, enabling and disabling these event types, we made a crucial observation:
- Neither sampling events alone nor context switch/wake events alone caused freezes.
- Freezes only occurred when both sampling events and context switch events were enabled, even with wake events disabled.
- Reducing the sampling frequency decreased the frequency of freezes but didn't eliminate them.
This pointed to an interaction bug. We then painstakingly stripped down our eBPF code, keeping only the bare essentials for sampling and context switch events, until we arrived at this minimal, freeze-inducing eBPF program:
```c
// Includes and the event struct definitions are omitted here for brevity;
// they live in the surrounding eBPF source.

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 512 * 1024 * 1024);
} ringBuffer SEC(".maps");

SEC("tp_btf/sched_switch")
int cswitch(struct bpf_raw_tracepoint_args* inContext)
{
    struct CSwitchEvent* event = bpf_ringbuf_reserve(&ringBuffer, sizeof(struct CSwitchEvent), 0);
    if (event == NULL)
        return 1;
    bpf_ringbuf_discard(event, 0);
    return 0;
}

SEC("perf_event")
int sample(struct bpf_perf_event_data* inContext)
{
    struct SampleEvent* event = bpf_ringbuf_reserve(&ringBuffer, sizeof(struct SampleEvent), 0);
    if (event == NULL)
        return 1;
    bpf_ringbuf_discard(event, 0);
    return 0;
}
```
These programs do almost nothing beyond attempting to reserve and then immediately discard space in a BPF ring buffer using bpf_ringbuf_reserve and bpf_ringbuf_discard.
Unmasking the Spinlock Issue
Given the minimal code, bpf_ringbuf_reserve became our prime suspect. A quick look at its kernel implementation revealed it's guarded by a spinlock: raw_res_spin_lock_irqsave and raw_res_spin_unlock_irqrestore. These functions are designed to disable local interrupts and preemption to protect critical sections of code. However, the local_irq_save component only disables maskable interrupts.
Our key observation about sampling events, which trigger Non-Maskable Interrupts (NMIs), immediately sparked a hypothesis:
- An eBPF program, perhaps the context switch handler, acquires the ring buffer spinlock on a CPU.
- This spinlock disables maskable interrupts but critically, not NMIs.
- While the lock is held, a sampling NMI occurs on the same CPU.
- The NMI handler, running on the same CPU, then also attempts to acquire the same ring buffer spinlock.
Since the spinlock is already held by the initial eBPF program on that CPU, the NMI handler would enter a spin-wait loop. Crucially, the spinlock implementation includes a timeout to prevent indefinite spinning: the RES_DEF_TIMEOUT constant that bounds these spin-wait loops is defined as NSEC_PER_SEC / 4, which is precisely 0.25 seconds, or 250 milliseconds.
This was our "smoking gun." The 250ms timeout in the spinlock perfectly matched the observed 250+ ms system freezes and the NMI handler dmesg warnings. The system was effectively self-deadlocking: an NMI handler on a CPU would attempt to acquire a spinlock already held by code on the same CPU, which it could never release because the NMI blocked its execution. The spinlock would eventually time out, causing the observed freezes.
A Primer on Spinlocks and Their Pitfalls
This incident highlights some fundamental challenges with spinlocks, especially in kernel contexts. A basic spinlock works by repeatedly attempting an atomic compare-and-swap (CAS) operation until it successfully changes a locked flag from 0 to 1. If the CAS fails, it means another thread holds the lock, and the current thread "spins" in a loop, wasting CPU cycles.
Beyond wasted cycles, spinlocks can suffer from severe performance degradation due to "cache line bouncing." Modern CPUs use protocols like MESI to maintain cache coherence. When multiple CPUs contend for a spinlock, they repeatedly try to write to the locked flag, which sits in a single cache line. Each write attempt requires a CPU to acquire the cache line in a Modified state, invalidating it in all other CPUs' caches. This generates a constant "storm" of expensive inter-core communication over the memory bus, with performance degrading quadratically with the number of contenders. This also contributes to "unfairness," where no guarantee exists that a waiting thread will eventually acquire the lock, potentially leading to starvation if other threads continually win the race.
In our case, the specific issue wasn't just general contention, but a critical interaction with NMIs and the interrupt masking properties of raw_res_spin_lock_irqsave. The fact that NMIs cannot be masked meant that they could interrupt code holding a spinlock, then attempt to acquire the same lock, leading to a self-deadlock scenario and the subsequent timeout-induced freezes.
Upon reporting our findings to the eBPF kernel mailing list, our analysis was confirmed, leading to further investigations and fixes by kernel maintainers. This journey underscored the importance of understanding the deep interactions between eBPF, kernel primitives, and hardware specifics like NMIs and cache coherence.
FAQ
Q: What distinguishes a Non-Maskable Interrupt (NMI) from a regular interrupt, and why is this relevant to the eBPF spinlock issue?
A: NMIs are special hardware interrupts that cannot be disabled or "masked" by software, unlike regular maskable interrupts. This is critical because kernel code often acquires spinlocks after disabling local interrupts to protect critical sections. If an NMI occurs while such a spinlock is held on the same CPU, and the NMI handler then tries to acquire the same spinlock, it will spin indefinitely (or until a timeout) because the original holder cannot release the lock while the NMI handler is executing.
Q: How does "cache line bouncing" impact spinlock performance, especially in highly contended scenarios?
A: Cache line bouncing, or ping-ponging, occurs when a shared memory location (like a spinlock's locked flag) is frequently written to by multiple CPUs. According to the MESI protocol, a CPU writing to a cache line must acquire it in a Modified state, which requires invalidating that cache line in all other CPUs' caches. When many CPUs contend for a spinlock, they continuously invalidate and re-acquire the cache line, leading to a surge of expensive inter-core communication that dramatically slows down access to the shared resource, often worsening quadratically with the number of contending CPUs.
Q: Why did reducing the eBPF sampling frequency only make the freezes less frequent rather than eliminating them entirely?
A: Reducing the sampling frequency decreases the probability of an NMI (which triggers the eBPF sampling program) occurring precisely when the context switch eBPF program holds the problematic ring buffer spinlock on the same CPU. While the likelihood of this specific race condition decreases, the fundamental flaw in the spinlock's interaction with NMIs remains. Therefore, given enough time or sufficient system load, the specific timing conditions for the freeze can still be met, just less often.