Hunting Python Memory Leaks at the C Level

By Deepanshu Kartikey. Last updated on 09 Mar, 2026.

Python’s garbage collector handles most memory management automatically — until it doesn’t. In long-running production services, memory usage creeps up over hours or days, OOM kills start hitting your containers, and tracemalloc tells you “yeah, there are allocations” without giving you the full picture. We needed something better.

So we built a C-based memory profiler that intercepts allocations below the Python interpreter, uses sampling to keep overhead minimal, attributes each sampled allocation back to the originating Python function, and ships the data to Elasticsearch in near real time. No code changes. No rebuilds. Just set an environment variable before the process starts:

🐍 Python

This post covers how we built it, the Linux dynamic linking tricks that make it possible, what we actually found in production, and why this approach catches leaks that pure-Python profilers miss.

The Problem: Python Memory Leaks Are Harder Than They Look

Python’s reference counting and cyclic garbage collector are great for typical application code. But in production services — think microservices processing millions of tracking updates per day — memory leaks happen for reasons invisible to Python-level tools:

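C extension leaks: Libraries like lxml, numpy, or database drivers allocate memory through C’s malloc. Python’s tracemalloc can’t see these allocations unless the library explicitly reports them to it (NumPy does this for array data; most extensions don’t).
Reference cycles involving __del__: Before Python 3.4 (PEP 442), cycles containing objects with __del__ were never collected and leaked silently. Modern CPython collects them, but only when the cyclic GC next runs, so they can still inflate memory between collections.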
Arena fragmentation: CPython’s memory allocator (pymalloc) uses arenas. Even when objects are freed, the arena may not be returned to the OS.
Hidden allocations in standard library code: Things like warnings.warn() or time.strftime() that allocate strings on every call — invisible unless you’re watching at the C level.

Why Not tracemalloc or Memray?

We evaluated every major option. Here’s the comparison that drove our decision:

Feature	tracemalloc	Memray	Our Profiler
Sees C malloc?	No	Yes (with --native)	Yes
Overhead	200-400MB extra memory	GB-sized trace files	~3-5% CPU
Config changes?	tracemalloc.start()	PYTHONMALLOC=malloc	Just LD_PRELOAD
Real-time monitoring?	No	No (offline)	Yes (Elasticsearch)
Customizable?	Limited	No (black box)	Fully
Production-safe?	Risky (OOM)	attach can crash	Yes
Leak detection	Python objects only	Needs pymalloc off	Retention sampling

tracemalloc: High Memory Cost, Blind to C

It eats memory. tracemalloc stores a full Python traceback for every live allocation. We saw 200–400MB of overhead on services already tight on memory. You’re trying to diagnose an OOM, and the profiler makes it happen faster.

It can’t see C malloc. When lxml calls xmlMalloc, when your database driver allocates result buffers in C — tracemalloc sees none of it. In our services, ~30–40% of memory was allocated through C extensions.

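It usually requires code changes. Tracing starts with tracemalloc.start(nframes) (or the PYTHONTRACEMALLOC environment variable), the frame depth has to be decided upfront, and you still need code in the service to take and compare snapshots.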
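The blind spot is easy to reproduce. The sketch below is not taken from the profiler; it assumes a Linux-like system where ctypes.CDLL(None) exposes libc's malloc and free, and the helper name traced_growth is made up for illustration. It allocates the same number of bytes once through Python and once through C, and asks tracemalloc how much it saw each time.

```python
import ctypes
import tracemalloc

SIZE = 8 * 1024 * 1024  # 8 MiB per test allocation

# Assumption: dlopen(NULL) gives access to the libc already loaded in this process.
libc = ctypes.CDLL(None)
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.restype = None
libc.free.argtypes = [ctypes.c_void_p]


def traced_growth(allocate):
    """Return (bytes tracemalloc attributed to allocate(), the allocated object)."""
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    obj = allocate()
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before, obj


# Allocation made through CPython's allocator: visible to tracemalloc.
python_growth, buf = traced_growth(lambda: bytearray(SIZE))

# Allocation made directly through libc's malloc: invisible to tracemalloc.
c_growth, ptr = traced_growth(lambda: libc.malloc(SIZE))
libc.free(ptr)

print(f"bytearray:   tracemalloc saw {python_growth} bytes")
print(f"libc.malloc: tracemalloc saw {c_growth} bytes")
```

The first number should be at least 8 MiB; the second should be a few hundred bytes at most, which is only the bookkeeping for the ctypes call itself.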

Memray: Not Built for Production Monitoring

Bloomberg’s Memray is impressive for debugging. But it needs PYTHONMALLOC=malloc for accurate leak detection (changing program behavior), generates gigabyte trace files, and memray attach can crash your process. There’s no way to stream data to Elasticsearch or set up Kibana dashboards.

The fundamental gap

Existing tools are either blind to C allocations (tracemalloc) or not designed for continuous production monitoring with custom pipelines (Memray). We needed both.

Understanding Dynamic Linking: The Foundation

To understand why our profiler works, you need to understand how programs find their functions at runtime.

Static vs Dynamic: A Live Demo

Take a simple C program and compile it two ways:

⚙️ C

The static binary is 70x larger because libc’s code, including malloc, is copied into it at fixed addresses. The dynamic binary just says “I need malloc from somewhere.”

🐍 Python

Key insight

With static linking, LD_PRELOAD is useless. With dynamic linking, malloc is an undefined symbol (U) that the dynamic linker resolves at load time or on first call, and we can change where it resolves to. The python3 binaries shipped by mainstream Linux distributions link to libc dynamically.

Figure: Static vs Dynamic Linking

How Python’s Memory Functions Are Linked

Where Python’s allocation functions live determines what we can intercept.

Reading Python’s Symbol Table

🐍 Python

U (undefined) = always interceptable via LD_PRELOAD. T (defined) = depends on how the binary was built.

Figure: Reading Python’s Symbol Table

The Two Python Builds

Python can be built with libpython statically baked in (Debian/Ubuntu) or dynamically linked as a shared library (Fedora, custom builds). This changes interception behavior:

Build 1: Static libpython (Debian/Ubuntu aarch64)

🐍 Python

Internal calls are hardcoded direct jumps — the dynamic linker is never involved.

Build 2: Dynamic libpython (custom build / Fedora)

🐍 Python

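Calls to PyMem_Malloc go through the PLT, so LD_PRELOAD can intercept them. One caveat: if libpython was linked with -Bsymbolic-functions or compiled with -fno-semantic-interposition, calls made from inside libpython bind directly and only calls from other modules remain interceptable, so it is worth verifying on your own build.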

Figure: Two Python Builds, Two Interception Behaviors

Proving It with LD_DEBUG

Watch the dynamic linker resolve symbols with the profiler loaded:

🐍 Python

Two lines — the complete chain:

libpython calls PyMem_Malloc → lands in trackleak.so (our profiler intercepts it)
trackleak.so calls the real one via dlsym(RTLD_NEXT) → lands in libpython (we pass it through)

Why the Profiler Works on Both Builds

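The malloc symbol is U (undefined) on every dynamically linked build, so calls to it always go through the PLT. Even when PyMem_Malloc is baked into the binary, requests too large for pymalloc (more than 512 bytes) are passed on to malloc, and that call goes through the dynamic linker. pymalloc’s own arenas are a different story: on Linux they are typically obtained with mmap, so they never show up as malloc calls.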

🐍 Python

On static-libpython builds, you catch every allocation that reaches malloc through the malloc safety net. On dynamic-libpython builds, you catch both levels. Either way, no malloc-backed allocation escapes.

The Allocation Flow: Normal vs Intercepted

Normal flow:

🐍 Python

With LD_PRELOAD=trackleak.so — double interception:

🐍 Python

Figure: Python Memory Allocation Flow, Normal vs Intercepted

Symbol Interposition: How LD_PRELOAD Works

The dynamic linker resolves function symbols at runtime by searching shared libraries in order. LD_PRELOAD lets you jump the queue. dlsym(RTLD_NEXT, "malloc") means “skip me, find the next definition” — giving us a pointer to libc’s real malloc.
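The lookup order can be pictured with a toy model. This is only an illustration in Python, not how the dynamic linker is implemented: each "library" is a dictionary of symbol names, the search order is a list, and resolve_next stands in for dlsym(RTLD_NEXT, ...). The entry for the python3 executable and the implementation labels are assumptions made for the example.

```python
def resolve(search_order, symbol):
    """Return (library, implementation) for the first library that defines symbol."""
    for lib_name, symbols in search_order:
        if symbol in symbols:
            return lib_name, symbols[symbol]
    raise LookupError(symbol)


def resolve_next(search_order, symbol, caller):
    """Model dlsym(RTLD_NEXT, symbol): search only the libraries after caller."""
    names = [name for name, _symbols in search_order]
    start = names.index(caller) + 1
    return resolve(search_order[start:], symbol)


executable = ("python3", {})  # the executable itself does not define malloc
profiler = ("trackleak.so", {"malloc": "wrapper malloc", "free": "wrapper free"})
libc = ("libc.so.6", {"malloc": "real malloc", "free": "real free"})

normal = [executable, libc]
preloaded = [executable, profiler, libc]  # LD_PRELOAD puts the profiler ahead of libc

print(resolve(normal, "malloc"))                          # ('libc.so.6', 'real malloc')
print(resolve(preloaded, "malloc"))                       # ('trackleak.so', 'wrapper malloc')
print(resolve_next(preloaded, "malloc", "trackleak.so"))  # ('libc.so.6', 'real malloc')
```

The third call is the important one: the wrapper needs a way to reach the definition it is shadowing, and searching from "after me" is what provides it.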

A Minimal Example

⚙️ C

🐍 Python

Every malloc call — from CPython’s interpreter loop, from NumPy’s array allocations, from OpenSSL inside your HTTP client — now prints to stderr. No recompilation, no code changes.

Figure: How LD_PRELOAD Intercepts malloc

Architecture of the Profiler

Figure: Profiler Architecture, From malloc to Elasticsearch

Functions We Intercept

We intercept six functions total:

⚙️ C

On dynamic-libpython builds, all six are intercepted. On static-libpython builds, malloc/free are always intercepted, and PyMem_* are intercepted only from C extensions.

Avoiding Infinite Recursion

Our profiler uses functions that call malloc. We use a thread-local guard (__thread int in_malloc) to detect re-entry:

⚙️ C
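The guard pattern itself is language-independent. As a rough Python analogue (not the profiler's C code), threading.local plays the role of the __thread flag: a wrapper whose bookkeeping allocates again checks the flag and passes re-entrant calls straight through. The names tracked_alloc and record are hypothetical.

```python
import threading

_guard = threading.local()  # one flag per thread, like a __thread variable in C
tracked_sizes = []


def record(size):
    """Bookkeeping that itself allocates, standing in for the profiler's internals."""
    scratch = tracked_alloc(64)  # re-enters the wrapper
    tracked_sizes.append(size)
    return scratch


def tracked_alloc(size):
    """Allocate size bytes, recording the request unless already inside the wrapper."""
    if getattr(_guard, "active", False):
        return bytearray(size)  # re-entrant call: pass through without tracking
    _guard.active = True
    try:
        record(size)
        return bytearray(size)
    finally:
        _guard.active = False  # always clear the flag, even if tracking fails


buf = tracked_alloc(1024)
print(tracked_sizes)  # [1024]
```

Only the outer 1024-byte request is recorded. The nested 64-byte request made by the bookkeeping is served but not tracked, which is what stops the recursion.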

Walking the Python Stack from C

From inside our C malloc wrapper, we acquire the GIL, walk the frame chain, extract function name / file path / line number, then release the GIL:

🐍 Python

Why the GIL is necessary: This isn’t about protecting your globals — it’s about protecting CPython’s internal heap objects. PyUnicode_AsUTF8() may write to the object (lazy UTF-8 cache creation). PyFrame_GetCode() does Py_INCREF (a refcount write). Frames can be freed by other threads at any moment. Without the GIL: corrupted refcounts, dangling pointers, segfaults.
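For intuition about what the walk collects, the same traversal can be written in Python, where the interpreter already holds the GIL. This is a sketch of the idea only: the C version would use frame accessors such as PyFrame_GetCode() and PyFrame_GetBack() instead of the f_code and f_back attributes, and the function names below are made up.

```python
import sys


def walk_python_stack(max_depth=16):
    """Collect (function, file, line) for each frame, starting at our caller."""
    frames = []
    frame = sys._getframe(1)  # skip walk_python_stack itself
    while frame is not None and len(frames) < max_depth:
        code = frame.f_code
        frames.append((code.co_name, code.co_filename, frame.f_lineno))
        frame = frame.f_back  # move one level up the call chain
    return frames


def handler():
    return walk_python_stack()


def serve():
    return handler()


stack = serve()
for name, path, line in stack[:2]:
    print(f"{name} ({path}:{line})")
```

The first entry is the function that triggered the walk (handler), followed by its caller (serve). Capping the depth keeps the cost of each sampled allocation bounded.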

Leak Detection Through Retention Rate

The key metric:

🐍 Python

On malloc: sample 1-in-50 and track the pointer in a hash table.
On free: look up the pointer — if it was sampled, credit the freed bytes to the originating function.

High retention = memory leak candidate. A function with 95% retention is allocating memory that nobody is freeing.
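The bookkeeping can be sketched in a few lines of Python. This is a model under stated assumptions, not the profiler's implementation: retention is taken to be (allocated - freed) / allocated over sampled bytes, sampling is a deterministic every-Nth counter, the table has a fixed capacity and is wiped when full, and the class and method names are hypothetical.

```python
class RetentionTracker:
    """Sampled allocation tracker keyed by pointer, aggregated per function."""

    def __init__(self, sample_every=50, capacity=4096):
        self.sample_every = sample_every
        self.capacity = capacity
        self.calls = 0
        self.live = {}   # pointer -> (function, size) for sampled allocations
        self.stats = {}  # function -> [allocated_bytes, freed_bytes]

    def on_malloc(self, ptr, size, function):
        self.calls += 1
        if self.calls % self.sample_every:
            return  # not sampled
        if len(self.live) >= self.capacity:
            self.live.clear()  # table full: reset instead of resizing
        self.live[ptr] = (function, size)
        self.stats.setdefault(function, [0, 0])[0] += size

    def on_free(self, ptr):
        entry = self.live.pop(ptr, None)
        if entry is not None:  # only sampled pointers are credited
            function, size = entry
            self.stats[function][1] += size

    def retention(self, function):
        allocated, freed = self.stats[function]
        return (allocated - freed) / allocated


tracker = RetentionTracker(sample_every=50)
for ptr in range(5000):
    tracker.on_malloc(ptr, 1024, "leaky")  # never freed
for ptr in range(5000, 10000):
    tracker.on_malloc(ptr, 1024, "well_behaved")
    tracker.on_free(ptr)

print(tracker.retention("leaky"))         # 1.0
print(tracker.retention("well_behaved"))  # 0.0
```

One consequence of the reset in this model: frees that arrive after a wipe can no longer be credited, so retention is biased upward for a while. That argues for reading retention as a trend over time rather than as a single value.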

The pymalloc Blind Spot

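CPython doesn’t call malloc for every small object. For requests of 512 bytes or less, pymalloc grabs large arenas (256 KiB in older releases, 1 MiB in newer ones) and carves them up internally, so a malloc-level profiler never sees the individual objects. We filter for allocations above 500 bytes: the leaks that matter (large buffers, HTTP bodies, deserialized data) are big enough to bypass pymalloc and hit malloc directly.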

Figure: Why pymalloc Arenas Are Opaque to malloc-level Profilers

What We Found in Production

On our tracking and shipment update services — high-throughput consumers processing millions of messages — we found two memory leaks hiding in plain sight.

Bug #1: Django’s warnings.warn() on Every Database Write

🐍 Python

Every database write with a naive datetime triggered warnings.warn(). Under the hood, that builds formatted strings, creates WarningMessage objects, and walks the entire call stack. Dozens of string allocations per call, thousands of times per second.

🐍 Python

Fix: Comment out the warning. We know our datetimes are naive — no need for Python to warn us on every single database write.

Bug #2: Python’s Logging formatTime on Every Log Line

🐍 Python

Two string allocations per log line via time.strftime(). Millions of string objects per hour on a heavily-logging shipment tracking service.

🐍 Python

Fix: Return the epoch timestamp as a string and parse it in the ELK pipeline instead. One tiny allocation instead of two formatted strings.
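A minimal sketch of that fix, assuming the service uses the standard logging module. The class name EpochFormatter and the three-decimal precision are choices made for this example, not details from the original change.

```python
import logging


class EpochFormatter(logging.Formatter):
    """Emit record.created as a plain epoch string instead of calling time.strftime()."""

    def formatTime(self, record, datefmt=None):
        # record.created is the float timestamp captured when the record was made.
        return f"{record.created:.3f}"


formatter = EpochFormatter("%(asctime)s %(levelname)s %(message)s")

# Build a record by hand so the output is reproducible.
record = logging.LogRecord(
    "shipments", logging.INFO, "example.py", 1, "status updated", None, None
)
record.created = 1700000000.25

print(formatter.format(record))  # 1700000000.250 INFO status updated
```

Attaching it is the usual handler.setFormatter(formatter) call. The log pipeline then has to parse the first field as epoch seconds rather than as a formatted date.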

How We Found These

Both bugs share a pattern: death by a thousand cuts. Each individual allocation is tiny. No single call leaks memory. But at thousands of calls per second, allocations accumulate faster than the garbage collector can clean them up.

The profiler’s Kibana dashboards showed these functions with high allocation volume and elevated retention rates. Neither would have been caught by code review, tracemalloc, or Memray trace analysis.

Deployment

One environment variable:

🐍 Python

Auto-detects NOMAD_JOB_NAME for tagging. 10-second startup delay to skip noisy initialization. Ships to Elasticsearch every 5 minutes via async bulk API. ~3–5% CPU overhead. Kibana dashboards show top allocators, retention trends, and per-service memory profiles.
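For readers unfamiliar with the bulk API, the request body is newline-delimited JSON: one action line, then one document line, repeated, with a trailing newline. The sketch below builds such a body in Python for illustration only. The profiler does this in C, and the index name and document fields shown here are hypothetical.

```python
import json


def build_bulk_body(index, documents):
    """Serialize documents into the newline-delimited body the _bulk endpoint expects."""
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                           # document line
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)


# Hypothetical per-function samples accumulated over one reporting interval.
samples = [
    {"job": "tracking-consumer", "function": "parse_update", "allocated_bytes": 51200, "freed_bytes": 2048},
    {"job": "tracking-consumer", "function": "format_log", "allocated_bytes": 10240, "freed_bytes": 10240},
]

body = build_bulk_body("trackleak-samples", samples)
print(body, end="")
```

Batching a whole interval into one request keeps the network cost independent of how many allocations were sampled.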

Lessons Learned

Sampling is non-negotiable. Our early prototype tracked every allocation. It worked in testing and fell over under 10K req/s. 1-in-50 gives valid retention rates at <5% overhead.
Hash table resets beat resizes. When the table fills up, wipe it clean. A crash in the profiler is worse than losing data.
GIL acquisition from C is tricky. Calling PyGILState_Ensure() from inside a malloc wrapper that might be called from any thread requires a recursion guard.
Know your Python build. Whether PyMem_Malloc is interceptable depends on static vs dynamic libpython. Use ldd $(which python3) to check: a libpython line in the output means the shared-library build. Either way, malloc interception is your safety net.
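The build check in the last point can also be done from inside a running interpreter. The sketch below is a Linux-only assumption, not part of the profiler: it scans /proc/self/maps for a mapped libpython shared object, and the helper name is hypothetical. An extension that loads libpython on its own would also show up, so treat the result as a hint.

```python
import os


def shared_libpython_paths(maps_text):
    """Return the libpython shared objects named in /proc/<pid>/maps text."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, pathname (pathname may be absent).
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            name = os.path.basename(path)
            if name.startswith("libpython") and ".so" in name:
                paths.add(path)
    return sorted(paths)


try:
    with open("/proc/self/maps") as fh:
        found = shared_libpython_paths(fh.read())
except OSError:
    found = []  # not Linux, or /proc is unavailable

print("shared libpython:", found or "none mapped (interpreter likely embeds it)")
```

An empty result on Linux suggests the static-libpython case, where only the malloc-level hooks see the interpreter's own allocations.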

Wrapping Up

The code is straightforward C — LD_PRELOAD, dlsym, and the Python C API. One .so file, one environment variable. The bugs it found — warnings.warn() firing on every DB write, time.strftime() allocating on every log line — were invisible to every other tool we tried.

Sometimes the leak isn’t in your code. It’s in the code your code calls, at a layer below where Python can see.

🔗 TrackLeak is open source!

Get the code, file issues, and contribute:

github.com/deepanshuclickpost/TrackLeak

Deepanshu Kartikey

Platform Engineer

Linux kernel developer and performance engineer with patches merged across BPF, networking, ext4, and memory management subsystems. I specialize in eBPF-based observability, building tools that provide kernel-level visibility into application performance and system optimization.
