
From Compiler Theory to Production: How AST Powers Our PII Protection


By Deepanshu Kartikey


    At ClickPost, we process millions of shipments every day. Behind every shipment is a layer of customer PII passed through courier APIs, order payloads, and tracking systems. Protecting that data is not optional. It is foundational.

    So we asked ourselves: what if we could catch every potential PII exposure at the code level before it ever reaches databases and logs? What if we borrowed a 50-year-old idea from compiler design and turned it into a security tool?

    That's exactly what we built. 

    Why This Matters in Logistics

    Logistics platforms move data through deep function chains. A customer's phone number gets decrypted, placed into a payload, forwarded to a courier partner API, and sometimes passes through 5-6 internal functions along the way.

    With 31,000+ Python files in our codebase, relying on manual code reviews to catch every place sensitive data might get exposed is not scalable. We needed automated, intelligent enforcement, something that understands how data actually flows through code.

    Not a regex scanner. Not a keyword search. Something that thinks like a compiler.

    What is an AST? (And Why Compilers Use It)

    The core technology behind our tool is the Abstract Syntax Tree, the same data structure every compiler has used since the 1960s.

    When you write code, the compiler does not read it the way you do. It transforms it through a pipeline:

How compilers use an AST

    Lexing: Your source code x = 1 + 2 is broken into tokens: x, =, 1, +, 2. Just raw words, no meaning yet.

    Parsing: The parser reads the tokens and builds a tree based on the language's grammar rules. It understands that = is an assignment, + is an operation, and 1 and 2 are operands.

    The AST is born: The result is a structured tree:

Every tool you use walks this exact tree: linters like pylint, formatters like black, type checkers like mypy. Compilers optimize it and generate machine code from it.

    Python gives us this tree for free. The ast module in the standard library parses any Python file into this structure without executing the code. Every assignment, function call, and return statement becomes a node you can inspect.

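You can see the tree for yourself in a few lines (a minimal sketch using only the standard library; requires Python 3.9+ for the `indent` argument):

```python
import ast

# Parse a one-line program into its AST without executing it.
tree = ast.parse("x = 1 + 2")

# Pretty-print the whole tree: a Module containing one Assign node
# whose value is a BinOp (1 + 2).
print(ast.dump(tree, indent=4))

assign = tree.body[0]
print(type(assign).__name__)        # Assign
print(assign.targets[0].id)         # x
print(type(assign.value).__name__)  # BinOp
```

Every node carries its line and column, which is what lets a tool point at the exact line of a violation later.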

    We took this compiler primitive and applied it to a security problem: tracking where sensitive data flows through our codebase.

    The Approach: Taint Analysis

The concept comes from security research: taint analysis. You mark certain functions as "sources" of sensitive data, then track every variable that touches that data as it flows through assignments, dicts, lists, and function calls. If that variable reaches a dangerous "sink" like print() or a logging function, you flag it before it ships.

    How taint flows through code

    Think of it as putting a dye marker on decrypted data and watching where the color spreads.
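In code, the dye spreading looks like this (an illustrative snippet; `decrypt_phone` is a stand-in for a real source function):

```python
def decrypt_phone(enc):
    # Stand-in for a real decryption source; returns readable PII.
    return "999-555-0100"

phone = decrypt_phone("ZW5j")              # phone is now "dyed" (tainted)
payload = {"order": 42, "phone": phone}    # the dye spreads into the dict
log_line = f"sending {payload}"            # ...and into the string
# print(log_line)  <- this is the sink the analyzer would flag
```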

    How Our Tool Works

    Step 1: Define the Sources

    We have a limited set of core decryption functions, the only places in our codebase where encrypted phone numbers become readable:

    Any variable assigned from these becomes tainted. That's where the trail begins.
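A sketch of what recognizing a source call looks like (the function names here are hypothetical; the real list matches our internal helpers):

```python
import ast

# Hypothetical source names; the real config lists our decryption helpers.
TAINT_SOURCES = {"decrypt_phone", "decrypt_pii_field"}

def call_is_source(node):
    """True if this ast.Call node invokes a decryption source."""
    return isinstance(node.func, ast.Name) and node.func.id in TAINT_SOURCES

call = ast.parse("decrypt_phone(enc)").body[0].value
print(call_is_source(call))  # True
```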

    Step 2: Walk the AST, Follow the Taint

    Our AST visitor (a class inheriting from ast.NodeVisitor) handles each node type:

    • visit_Assign: does the right side contain tainted vars? If yes, taint the left side

    • visit_Call: is a tainted var passed as an argument? Flag it

    • visit_Return: is tainted data being returned? Track it

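In condensed form, the visitor looks something like this (a simplified sketch; the source and sink names are illustrative, and the production version handles many more node types):

```python
import ast

TAINT_SOURCES = {"decrypt_phone"}   # hypothetical source functions
BLOCKED_SINKS = {"print"}           # e.g. print, logging calls

class TaintVisitor(ast.NodeVisitor):
    def __init__(self):
        self.tainted = set()        # variable names carrying sensitive data
        self.violations = []

    def _call_name(self, call):
        f = call.func
        return f.id if isinstance(f, ast.Name) else getattr(f, "attr", "")

    def visit_Assign(self, node):
        # Taint the left side if the right side is a source call
        # or mentions an already-tainted variable.
        rhs_names = {n.id for n in ast.walk(node.value) if isinstance(n, ast.Name)}
        is_source = isinstance(node.value, ast.Call) and \
            self._call_name(node.value) in TAINT_SOURCES
        if is_source or rhs_names & self.tainted:
            for t in node.targets:
                if isinstance(t, ast.Name):
                    self.tainted.add(t.id)
        self.generic_visit(node)

    def visit_Call(self, node):
        # Flag tainted variables passed into blocked sinks.
        name = self._call_name(node)
        bad = {a.id for a in node.args if isinstance(a, ast.Name)} & self.tainted
        if name in BLOCKED_SINKS and bad:
            self.violations.append(f"CRITICAL: {name}({', '.join(bad)})")
        self.generic_visit(node)

code = """
phone = decrypt_phone(enc)
payload = phone
print(payload)
"""
v = TaintVisitor()
v.visit(ast.parse(code))
print(v.violations)   # ['CRITICAL: print(payload)']
```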

    The taint spreads through the AST, from simple assignments to dicts, lists, nested subscripts, and container methods. If the color touches something it should not, we know.

    Step 3: Classify the Destination

    When tainted data reaches a function call, we classify it:

    • Blocked sink: print(phone), logger.info(payload) | CRITICAL, build blocked

    • Unknown function: some_func(phone) | HIGH, must verify and whitelist

    • Allowed sink: requests.post(data=payload) | INFO, passes, reviewer notified

    This three-tier system gives developers clear, actionable signals rather than a wall of noise.
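The classification itself reduces to a small lookup (sink names here are illustrative examples, not the full config):

```python
BLOCKED_SINKS = {"print", "logger.info"}   # logging decrypted data
ALLOWED_SINKS = {"requests.post"}          # legitimate outbound calls

def classify_sink(func_name):
    """Return the verdict tier for a call that receives tainted data."""
    if func_name in BLOCKED_SINKS:
        return "CRITICAL"   # build blocked
    if func_name in ALLOWED_SINKS:
        return "INFO"       # passes, reviewer notified
    return "HIGH"           # unknown function: verify and whitelist

print(classify_sink("logger.info"))    # CRITICAL
print(classify_sink("requests.post"))  # INFO
print(classify_sink("some_func"))      # HIGH
```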

    Step 4: Cross-Function Tracing

Cross-function taint tracing

    The biggest engineering challenge: phone data crosses 5-6 functions before reaching its final destination. Within a single function, taint analysis is straightforward. But when send_to_partner_api() receives api_payload as a parameter, the AST of that function alone does not reveal it is tainted.

    We solved this with pattern matching on function names:
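A sketch of what such patterns look like (the regexes here are hypothetical; the real config matches our internal naming conventions):

```python
import re

# One pattern covers a whole family of functions.
TAINTED_RETURN_PATTERNS = [
    re.compile(r".*decrypt.*phone.*"),
    re.compile(r"get_.*_payload"),
    re.compile(r"build_.*_request"),
]

def returns_tainted(func_name):
    """True if a function's return value should be treated as tainted."""
    return any(p.fullmatch(func_name) for p in TAINTED_RETURN_PATTERNS)

print(returns_tainted("build_courier_request"))  # True
print(returns_tainted("format_address"))         # False
```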

    Functions matching these patterns are recognized as carrying sensitive data. Their return values are treated as tainted by callers. One pattern covers dozens of functions, no manual listing needed.

    These functions can return phone data (that's their job) and pass it to allowed sinks like courier APIs, but they cannot log or print it. If they do, it's a CRITICAL violation regardless of the function's purpose.

    Scaling to 31,000 Files

    Our codebase has 31,580 Python files. Parsing all of them would take minutes, too slow for a PR pipeline.

    The insight:

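In sketch form, the search step looks like this (the pattern `decrypt_` and the script name `taint_check.py` are illustrative):

```shell
# Step 1: cheap text search narrows 31,580 files to the handful
# that even mention a decryption helper.
grep -rl --include='*.py' 'decrypt_' . > tainted_files.txt || true

# Step 2: run the AST taint analyzer only on those files.
# python taint_check.py $(cat tainted_files.txt)
wc -l < tainted_files.txt
```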

    47 files. Out of 31,580. That's 0.15% of the codebase.

Only 47 files ever touch a decryption function. We use grep, a fast C-level string search, to find them, then parse only those with the AST. Total pipeline time: under 5 seconds for the entire codebase.

    Bitbucket Integration


    The tool runs on every pull request and posts inline annotations directly on the changed lines using Bitbucket's Code Insights API. Bitbucket Pipelines provides a built-in authentication proxy, so there are no tokens or credentials to manage.

    Reviewers see the annotation on the exact line where sensitive data flows. It tells them the variable name, the destination function, and whether it needs verification. No context switching, no separate report to read.

    Our pipeline config is four lines:
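It looks roughly like this (an illustrative sketch of a Bitbucket Pipelines pull-request step, not our literal config):

```yaml
pipelines:
  pull-requests:
    '**':
      - step:
          name: PII taint check
          script:
            - python pii_taint_check.py
```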

    Developer Experience: What Happens When It Flags Something

    When a developer's build fails, they see a clear message:

    • CRITICAL: print(phone) | Logging decrypted data | Remove the print/log

    • CRITICAL: do_log(payload) | Payload contains phone | Mask before logging, or remove

    • HIGH: xyz() not whitelisted | Unknown function receives phone data | Verify it's safe, add to config

    • HIGH: return from function | Function returns tainted data | Add to tainted_return_functions

    The config is self-service. Developers verify the function is legitimate, add one line to the config, and re-run. No security team bottleneck.

    What We Learned

    AST is underrated. Most Python developers know the ast module exists but never touch it. It is the same tree structure that compilers have relied on for 50+ years, and Python hands it to you for free. If you can describe a code pattern in terms of assignments, calls, and returns, you can detect it automatically.

    Start with the data, not the code. Only 47 out of 31,580 files matter. The sensitive data is the protagonist of this story. Follow it, and the relevant code reveals itself.

    False positives kill adoption. We invested heavily in precision: passthrough functions for built-ins like len() and str(), safe dict key tracking so payload['city'] does not trigger even when payload['phone'] is tainted, and pattern-based whitelisting that covers entire function families with a single rule.
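Safe dict key tracking, for example, means taint is recorded per key rather than per container (a minimal sketch of the idea):

```python
# Per-key taint map: only payload['phone'] is tainted, not the whole dict.
tainted_keys = {"payload": {"phone"}}

def subscript_is_tainted(var, key):
    """True only if this specific dict key carries sensitive data."""
    return key in tainted_keys.get(var, set())

print(subscript_is_tainted("payload", "phone"))  # True
print(subscript_is_tainted("payload", "city"))   # False
```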

    Automated enforcement beats manual review. A tool that runs on every PR and gives instant, specific feedback is worth more than a quarterly security audit. Developers learn the patterns. The codebase gets cleaner over time.

    What's Next

    We're building a call graph tracer that eliminates manual configuration entirely:

    1. grep for the 47 files with decrypt calls (2 seconds)

    2. Parse those files, find what functions they call

    3. grep for each called function's definition across the codebase

    4. Recursively trace until hitting external libraries like requests, then stop

    5. At every hop, check: does this function log or print the tainted parameter?

    Full cross-function taint tracking. Zero manual configuration. Under 15 seconds for 31K files.
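The recursive hop-by-hop loop can be sketched like this (the toy call graph and function names are hypothetical stand-ins for what grep plus AST parsing would discover):

```python
# Toy call graph standing in for grep + AST parsing of real files.
CALLS = {
    "decrypt_phone_view": ["build_payload"],
    "build_payload": ["send_to_partner_api"],
    "send_to_partner_api": ["requests.post", "print"],
}
EXTERNAL = {"requests.post"}   # stop recursing at library boundaries
SINKS = {"print"}              # logging/printing tainted data

def trace(func, seen=None):
    """Recursively follow calls from a tainted function, flagging sinks."""
    seen = seen or set()
    if func in seen or func in EXTERNAL:
        return []
    seen.add(func)
    hits = []
    for callee in CALLS.get(func, []):
        if callee in SINKS:
            hits.append(f"{func} -> {callee}")
        hits += trace(callee, seen)
    return hits

print(trace("decrypt_phone_view"))  # ['send_to_partner_api -> print']
```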

    The AST has been powering compilers since Fortran. We're using the same tree to protect customer data at scale. If your codebase handles sensitive data, the tools to protect it are already built into your language. You just have to use them.
