Hidden Latency in Python: Where APM Stops Looking

By Deepanshu Kartikey Last Updated on 08 Jun, 2026

In this blog

A customer told us one of our API was timing out. Their HTTP timeout was 5 seconds. We opened OpenTelemetry and pulled the latency distribution for the same endpoint - p99 was 1 second.

A different alert told us another of our API was spiking to 3, 5, sometimes 9 seconds. The OpenTelemetry parent span confirmed it: duration = 3305ms. But the sum of every child span - every database query, every downstream HTTP call, every Kafka send - added up to about 25 milliseconds.

Two APIs, two stories, one shared symptom: real customer-facing latency that the APM couldn't account for. Almost all of it was happening either before our handler started, or beneath the code APM knew how to watch. This post is about how we found that hidden latency, and the eBPF tooling we built and will open-source to expose it.

What this series is about

This is a two-part series about latency we couldn't see with the standard observability stack, and the tooling we built to expose it.

In Part 1 (this post) we look at hidden latency inside the Python process: the gunicorn handoff queue silently delaying requests before any worker picks them up, garbage-collection pauses blocking sibling threads through the GIL, and lock / off-CPU waits during execution. Between the first two stories alone we cut P95 on the affected endpoints roughly in half and eliminated multi-second customer timeouts that OpenTelemetry could not attribute.

In Part 2, we move down the stack into the network - AWS hypervisor packet drops, TCP retransmits, zero-window stalls, and connection churn - and walk through the eBPF network tracer we built to find them. We will open-source the entire stack (eight trackers in one repo: gunicorn handoff time, GC pauses, GIL waits, off-CPU stacks, TCP accept, TCP connect, TCP loss, and host-level AWS throttling) at the end of part 2.

What APM can and can't - see

APM (OpenTelemetry, Datadog, NewRelic, others) has a model: a request span starts when your code says it starts and ends when your code says it ends. Inside it, you get child spans for the things you instrumented - database queries, HTTP clients, message brokers. This is a powerful model when the slowness happens inside your handler code, on a network call your library knows how to wrap.

It is a useless model for everything else.Specifically, APM cannot tell you:

We've been hunting that hidden latency - in Python and in the network. The two cases below are the same story told twice: pre-handler latency in the gunicorn handoff queue, and runtime-internal latency in the garbage collector. Both were invisible to APM. Both were exposed with eBPF.

The architecture, briefly

Our Python API stack is unremarkable on paper:

figure (1)

A request comes in from the customer, hits an AWS Application Load Balancer, gets forwarded to nginx, which proxies to gunicorn, which hands the request to one of its Python worker threads. The worker runs Django, makes downstream calls (Postgres via PgBouncer, Elasticsearch, third-party APIs, Kafka producers), assembles a response, and sends it back.

Each box in that diagram is well-understood individually. What's not well-understood is what happens between the boxes - the queues, buffers, and waits that OpenTelemetry doesn't instrument because nobody wrote a span for them. Zoom in to a single request and the picture is much busier:

microscopic_with_metrics (2)

Microscopic view with eBPF observation points: every queue, buffer, and process annotated with the eBPF metric that observes it

Let's walk through every queue, buffer, and process in this view - because every one of them is a place latency can hide.

Customer → ALB. The customer's HTTP request leaves their browser or backend and reaches our load balancer over the public internet.

TCP handshake (ALB ↔ EC2). Before any HTTP data flows, the ALB and our EC2 box complete a three-way TCP handshake (SYN, SYN-ACK, ACK). One round trip. Cheap when the connection is reused; not free when it isn't.

SYN queue. A kernel-side queue holding half-completed handshakes - connections where the kernel has received the SYN but is still waiting for the final ACK. Each entry lives here for about one RTT.

Nginx accept queue. Once the handshake completes, the connection sits here until nginx calls accept() on it. If nginx is healthy this is microseconds; if nginx is overwhelmed, established connections pile up and time accumulates that the customer can feel.

Nginx. Reverse proxy. Terminates TLS, applies routing, opens (or reuses) a connection to gunicorn, forwards the request.

TCP handshake (nginx - > gunicorn). Same three-way dance, between two processes on the same box. Loopback handshakes are sub-millisecond but they still happen every time keepalive isn't configured.

Gunicorn accept queue. Same idea as the nginx accept queue, one layer in. Fully-established connections waiting for the gunicorn master process to accept() them. When the kernel-level backlog fills, the next SYN gets dropped and the customer sees a connection failure.

Gunicorn (master). A single process that owns the listening socket. Calls accept(), then hands the connection off to a worker thread.

Per-connection recv buffer. Once gunicorn calls accept() on the socket, the kernel allocates a per-connection receive buffer (default ~85 KB) where incoming HTTP request bytes accumulate. If the worker is slow to read(), the buffer fills and the kernel advertises a zero-window - a TCP-level "stop sending."

Gunicorn handoff queue. This one is gunicorn-internal, not a kernel queue. After accepting a connection, the master puts the socket descriptor on an in-process queue. Worker threads pull from it. If every worker is busy, the connection sits here. OpenTelemetry only starts a span when a worker picks up the connection, so any time spent here is completely invisible to APM. This is the queue that caused our customer's timeouts.

Python worker threads (T0, T1, T2). The threads that actually execute the request handler. Acquire the GIL, run Django, call downstream services, build the response. Three things can stall a worker thread that APM can't see: a sibling thread holding the GIL too long (typically during GC), an application-level lock the thread is waiting on, or an off-CPU block on a slow syscall.

Outbound send buffer. When the worker writes data to a downstream service, it goes into a per-socket kernel send buffer first, then onto the wire. If the network is slow or the peer's receive window is closed, this buffer can fill and the worker's write() will block.

Outer world. PgBouncer, Elasticsearch, Kafka brokers, internal services, third-party APIs. Anything on the other end of a TCP connection from the worker.

Outbound recv buffer. Symmetric to the send buffer. The kernel parks incoming bytes from the downstream service here until the worker reads them.

Failure modes on outbound paths. Four things go wrong on the outbound side, all invisible to APM and all instrumentable with eBPF: every new connection pays a TCP handshake (1 RTT cost - bad when you're not pooling); a lost packet triggers a kernel-level retransmit; multiple losses in a row trigger an RTO and stall the connection for ~200 ms or more; if the peer's receive buffer fills, you see zero-window stalls until they drain. None of these show up in OpenTelemetry as anything except "the call took longer than usual."

OpenTelemetry can see the work you tell it about. The work you don't know to instrument - that's where the missing seconds live.

eBPF, briefly

eBPF lets you attach probes to functions inside any running process - including CPython's own internals - and to scheduler / syscall events in the kernel itself. Probes fire kernel-side, with no application code changes, no recompilation, and overhead measured in nanoseconds per event.

We picked the kinds of events that map directly to the blind spots above:

figure__frame (1)

Every event is stamped with (pid, tid, start_time, end_time) and shipped to ClickHouse. The same (pid, tid, timestamp) is also captured on every OpenTelemetry span (via a small SpanProcessor hook that reads container-namespace PIDs), so we can JOIN the two datasets and finally answer questions like "during this slow request, what was the GC doing on every other thread in the same process?" or "how long did this request wait in gunicorn's handoff queue before the parent span even started?"

This is exactly the question that broke open both stories below.

Story one: the 11-second queue

Our customer's HTTP client had a 5-second timeout. They were seeing it fire on POST /api/v1/clickpost-api-1. Reproducible. Frustrating for them. Mysterious for us - because when we opened OpenTelemetry, p99 latency on the same endpoint was 1 second.

The customer's tooling said the API took more than 5 seconds. Ours said p99 was 1 second. Neither was lying. The gap between them was hidden time - time that happened to the request before it ever reached our code.

When the gunicorn master accepts a connection, the connection doesn't go straight to a worker. It sits on gunicorn's internal handoff queue waiting for one of the worker threads to be free. The OpenTelemetry span starts only after a worker picks the connection up and calls into our Django handler - so any wait in that internal queue is completely invisible to APM. We suspected our requests were stuck there, but we couldn't prove it without measuring it.

So we built a tracker for exactly that: time between accept() in the gunicorn master and the first recv() by a worker. Same connection, different threads.

Building the handoff tracker

This is what made eBPF beautiful here. The Python application doesn't need to know we're measuring it. We don't add code. We don't modify gunicorn. We hook a few syscalls at the kernel boundary:

🐍 Python

The idea, in words: every connection has a unique (pid, fd) key. When the gunicorn master returns from accept4() with a new fd, we record the timestamp under that key. The connection then sits in gunicorn's internal handoff queue waiting for a worker. When some worker thread eventually picks up that connection and calls recvfrom() (or recvmsg()) to read the HTTP request body, we look up the key, compute now - ts_queued, and that delta is handoff_ms - the time the request spent in the queue. The same hashmap also gives us the master TID and worker TID, so we know exactly which thread accepted the connection and which one finally picked it up. Less than 30 lines of BPF C. No change to gunicorn or Django.

What we found

A day later, we JOIN'd ebpf_handoff against the OpenTelemetry trace for the slow recommendation API requests:

data__scroll (4)

The customer's experience was in the right-hand column. Ours was in the left. Once a worker thread picked up the request, the handler ran in 140-240 milliseconds - that's what APM saw, and APM was correct: the handler was fast. But every request had been stuck in gunicorn's handoff queue for up to 11 seconds before a worker thread became free to pick it up. The customer's 5-second client-side timeout fired more than twice over before our code ran at all.

The fix

Once we knew where the time was going, the fix was obvious: gunicorn didn't have enough worker threads to keep up with the request rate on this endpoint. We increased the thread count. Connections stopped piling up in the internal handoff queue, handoff_ms dropped back to single-digit milliseconds across the board, and the customer's 5-second timeouts stopped firing.

The interesting work wasn't tuning the thread count - that's a one-line config change anyone can guess at. The interesting work was being able to prove that the queue was the problem before changing anything. Without handoff_ms, every "maybe we need more workers?" suggestion was a guess; with it, we knew exactly which endpoint was queueing, by how much, and could confirm the fix worked the moment we deployed.

What carries forward

The handoff tracker is now running continuously across the gunicorn fleet. Any customer who complains about timeouts can be JOIN'd against ebpf_handoff in two minutes: we know within seconds whether the wait was in the queue or in our code. APM tells us during the span. eBPF tells us before the span.

Story two: The 3-second pause

The second case started the same way the first did - a customer-facing API that was slow - but the answer turned out to live in a completely different place. A slow trace from a prediction endpoint, total duration 3305 ms:

trace__frame (1)

The handoff tracker said this trace had handoff_ms = 8 - the request reached the worker thread cleanly. So the 3.28-second gap was inside the worker, not before it. The thread was alive and held the connection, it just wasn't progressing. The candidates: a GC pause, a GIL wait, an off-CPU block. We pulled ebpf_gc_pause for that worker's pid during the silent window:

data

The mechanism is straightforward once you see it. CPython's GIL is process-wide - every thread must hold the GIL to execute bytecode. When the garbage collector runs, the thread doing the collection holds the GIL the entire time. So if any worker thread inside the process starts a 3-second gen-2 sweep, every other thread in that process is frozen for those 3 seconds - regardless of whether they're handling a request, regardless of whether they're doing real work, regardless of whether they're on a different CPU.

In our case, ThreadPoolExecutor-2_3 was running a background task that triggered a gen-2 collection. Our request thread (tid=707) was holding an accepted connection and waiting to issue its first Postgres query. It couldn't - the GIL was elsewhere. So it sat. For 3178 milliseconds. Then GC finished, our thread got the GIL, and the request completed in 25 ms.

OpenTelemetry has no ability to see this. There's no span for "waiting to acquire the GIL while a sibling thread runs gen-2 GC." There can't be - the wait isn't in your code, isn't in any library, isn't in any framework. It's in the runtime, beneath your code.

The pattern was consistent across every slow trace we examined. 80–95% of the latency on these slow requests was a single gen-2 GC pause on a sibling thread. Aggregated over a minute, one worker had spent 11.6 seconds in gen-2 collection. At 13,857 gen-2 collections per hour, GC was eating roughly 20% of every worker's wall-clock time. APM only ever saw "the request was slow."

How we hooked GC

CPython's gc_collect_main is the C function that drives every collection - generation 0, 1, or 2. We attach a uprobe on entry, a uretprobe on exit, and compute the duration:

🐍 Python

PT_REGS_PARM2 reads the second argument of gc_collect_main - that's the generation number. The FILTER_PID macro is templated at load time to filter to our gunicorn worker PIDs. Each event flows through a perf buffer to a Python reader process, which enriches the event with the thread name (resolved via py-spy every 30 seconds) and writes batched JSON to ClickHouse. The ClickHouse table is designed with the JOIN in mind:

🗄️ SQL

(pid, tid, start_time) is the sort key - the exact JOIN key against OpenTelemetry traces. DateTime64(6) gives microsecond precision (matching OTel). The 7-day TTL keeps storage bounded without manual cleanup.

Mirror tables exist for ebpf_gil_wait, ebpf_off_cpu, and ebpf_handoff.

The fix: gc.freeze()

Once we knew GC was the problem, the fix was small but non-obvious.

CPython's garbage collector is generational. New objects start in generation 0; if they survive a collection, they get promoted to gen 1; if they survive longer, gen 2. Gen-2 collections are the expensive ones - they walk every long-lived object in the heap to find references.

In our case, the "long-lived" objects were almost entirely startup state: Django models, settings, OpenTelemetry instrumentors, third-party module objects loaded at import time. Things that get created once and live forever. Every gen-2 collection was walking 1.07 million of these objects, finding nothing collectable, and finishing in seconds.

gc.freeze() is a CPython API that moves every object currently alive into a special "permanent" generation that gen-2 collections skip. We added one line to gunicorn's post_fork hook, after Django and OpenTelemetry finished initializing:

After this change:

Objects visible to the collector dropped from 1.07M to ~22k per worker. Everything frozen at startup is now invisible to gen-2 sweeps.
Max gen-2 pause dropped from 5244 ms to 226 ms. Still not free, but no longer customer-visible.
P95 latency on affected endpoints dropped roughly 66%; P99 dropped about 56%.

The fix was three lines. The work was finding it - three months of "why is this API slow" with no answer, until we had data the existing tools couldn't produce.

Bonus: GIL waits and off-CPU stacks

The same eBPF infrastructure exposes two more invisible-to-APM signals: how long each thread waited for the GIL (and which thread was holding it), and where threads got parked off-CPU with full kernel + user stacks. We use them less often than handoff and GC pause, but when the answer isn't in either of those two, it's almost always in one of these.

GIL waits - who is the culprit thread?

The tracker hooks CPython's take_gil (on entry) and drop_gil (to record the current holder). Every time a thread finally acquires the GIL, we know which thread it took the GIL from, and how long it waited. The culprit - the thread that was hogging the GIL - shows up as holder_thread_name on every wait event.

data__scroll (5)

Read row 4: MainThread waited 17.5 seconds for the GIL - held by ThreadPoolExecutor-2_4. Row 1: ThreadPoolExecutor-2_7 waited 18.9 seconds, blocked by ThreadPoolExecutor-2_1. Every one of those waits was a thread blocked from executing Python bytecode because another thread held the GIL too long. APM has no way to see this - GIL waits aren't in your handler code, they're the runtime arbitrating between threads.

When you see one holder_thread_name showing up repeatedly across many slow waits - like ThreadPoolExecutor-2_6 above - that thread is the one to investigate.

Off-CPU stacks - why was the thread sleeping?

When a thread isn't waiting on the GIL but still isn't progressing, it's parked off-CPU. The sched_switch tracepoint catches the moment the scheduler deschedules it, with full user and kernel stacks attached:

data__scroll (7)

Three patterns to recognize:

· PyThread_acquire_lock_timed in the user stack: a Python-level threading.Lock, Semaphore, or bounded queue.Queue is being waited on. This is application-level lock contention - not the GIL (the GIL shows up as take_gil). If this dominates your off-CPU time, there's a lock in your app code being held too long.

· pthread_cond_wait + a library's background_thread_entry (e.g. polars.abi3.so): an idle background worker from a third-party library. Normal. Filter out.

· Empty stacks at 60+ seconds: long-sleeping daemon / heartbeat threads where we couldn't unwind a stack. Expected, not a problem.

The kernel stack __schedule → __schedule → schedule confirms the thread was genuinely parked by the scheduler. eBPF caught it via the sched_switch tracepoint with prev_state set to non-runnable.

A note on user stacks. Python is an interpreted language, so the symbols you see today are CPython's C-level functions (PyThread_acquire_lock_timed, do_futex_wait, etc.) rather than the Python source frame that originally triggered the wait. We're actively working on resolving Python frames as part of the stack - until then, the C-level stack is enough to classify the cause (app lock vs library idle vs daemon sleep), even if it doesn't pinpoint the exact line of Python code.

Coming in part 2

Both stories above lived above the network layer - pre-handler queue waits, runtime-internal pauses. The network layer has its own set of invisible-to-APM latency sources, and in part 2 we go down into it: AWS hypervisor packet drops on the ENA driver, TCP retransmits, RTO stalls of 200 ms+ that look like "the call was slow," zero-window stalls when a peer's receive buffer fills, and the connection-churn problem (every new HTTP connection paying a fresh TCP handshake). We'll show how we built the eBPF network tracer for it, and at the end of part 2 we'll open-source the whole observability stack - eight trackers, one repo - so you can run this in your own environment.

The bigger lesson from this post isn't use gc.freeze or watch your handoff queue. It's that APM measures what you instrument; eBPF measures what the kernel sees; and the gap between them is exactly where hidden latency lives.

Deepanshu Kartikey

Platform Engineer

Linux kernel developer and performance engineer with patches merged across BPF, networking, ext4, and memory management subsystems. I specialize in eBPF-based observability, building tools that provide kernel-level visibility into application performance and system optimization.

View profile