
From Compiler Theory to Production: How AST Powers Our PII Protection


By Deepanshu Kartikey


    At ClickPost, we process millions of shipments every day. Behind every shipment is a layer of customer PII passed through courier APIs, order payloads, and tracking systems. Protecting that data is not optional. It is foundational.

    So we asked ourselves: what if we could catch every potential PII exposure at the code level before it ever reaches databases and logs? What if we borrowed a 50-year-old idea from compiler design and turned it into a security tool?

    That's exactly what we built. 

    Why This Matters in Logistics

    Logistics platforms move data through deep function chains. A customer's phone number gets decrypted, placed into a payload, forwarded to a courier partner API, and sometimes passes through 5-6 internal functions along the way.

    With 31,000+ Python files in our codebase, relying on manual code reviews to catch every place sensitive data might get exposed is not scalable. We needed automated, intelligent enforcement, something that understands how data actually flows through code.

    Not a regex scanner. Not a keyword search. Something that thinks like a compiler.

    What is an AST? (And Why Compilers Use It)

    The core technology behind our tool is the Abstract Syntax Tree, the same data structure every compiler has used since the 1960s.

    When you write code, the compiler does not read it the way you do. It transforms it through a pipeline:

How compilers use an AST

    Lexing: Your source code x = 1 + 2 is broken into tokens: x, =, 1, +, 2. Just raw words, no meaning yet.

    Parsing: The parser reads the tokens and builds a tree based on the language's grammar rules. It understands that = is an assignment, + is an operation, and 1 and 2 are operands.

    The AST is born: The result is a structured tree:

Every tool you use walks this exact tree: linters like pylint, formatters like black, type checkers like mypy. Compilers optimize it and generate machine code from it.

    Python gives us this tree for free. The ast module in the standard library parses any Python file into this structure without executing the code. Every assignment, function call, and return statement becomes a node you can inspect.

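You can see the tree for yourself in a few lines (a minimal sketch using only the standard library; requires Python 3.9+ for the `indent` argument):

```python
import ast

# Parse a one-line program into its AST without executing it.
tree = ast.parse("x = 1 + 2")

# Pretty-print the whole tree: a Module containing one Assign node
# whose value is a BinOp (1 + 2).
print(ast.dump(tree, indent=4))

assign = tree.body[0]
print(type(assign).__name__)        # Assign
print(assign.targets[0].id)         # x
print(type(assign.value).__name__)  # BinOp
```

Every node carries its line and column, which is what lets a tool point at the exact line of a violation later.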

    We took this compiler primitive and applied it to a security problem: tracking where sensitive data flows through our codebase.

    The Approach: Taint Analysis

The concept comes from security research: taint analysis. You mark certain functions as "sources" of sensitive data, then track every variable that touches that data as it flows through assignments, dicts, lists, and function calls. If that variable reaches a dangerous "sink" like print() or a logging function, you flag it before it ships.

    How taint flows through code

    Think of it as putting a dye marker on decrypted data and watching where the color spreads.
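In code, the dye spreading looks like this (an illustrative snippet; `decrypt_phone` is a stand-in for a real source function):

```python
def decrypt_phone(enc):
    # Stand-in for a real decryption source; returns readable PII.
    return "999-555-0100"

phone = decrypt_phone("ZW5j")              # phone is now "dyed" (tainted)
payload = {"order": 42, "phone": phone}    # the dye spreads into the dict
log_line = f"sending {payload}"            # ...and into the string
# print(log_line)  <- this is the sink the analyzer would flag
```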

    How Our Tool Works

    Step 1: Define the Sources

    We have a limited set of core decryption functions, the only places in our codebase where encrypted phone numbers become readable:

    Any variable assigned from these becomes tainted. That's where the trail begins.
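A sketch of what recognizing a source call looks like (the function names here are hypothetical; the real list matches our internal helpers):

```python
import ast

# Hypothetical source names; the real config lists our decryption helpers.
TAINT_SOURCES = {"decrypt_phone", "decrypt_pii_field"}

def call_is_source(node):
    """True if this ast.Call node invokes a decryption source."""
    return isinstance(node.func, ast.Name) and node.func.id in TAINT_SOURCES

call = ast.parse("decrypt_phone(enc)").body[0].value
print(call_is_source(call))  # True
```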

    Step 2: Walk the AST, Follow the Taint

    Our AST visitor (a class inheriting from ast.NodeVisitor) handles each node type:

    • visit_Assign: does the right side contain tainted vars? If yes, taint the left side

    • visit_Call: is a tainted var passed as an argument? Flag it

    • visit_Return: is tainted data being returned? Track it

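In condensed form, the visitor looks something like this (a simplified sketch; the source and sink names are illustrative, and the production version handles many more node types):

```python
import ast

TAINT_SOURCES = {"decrypt_phone"}   # hypothetical source functions
BLOCKED_SINKS = {"print"}           # e.g. print, logging calls

class TaintVisitor(ast.NodeVisitor):
    def __init__(self):
        self.tainted = set()        # variable names carrying sensitive data
        self.violations = []

    def _call_name(self, call):
        f = call.func
        return f.id if isinstance(f, ast.Name) else getattr(f, "attr", "")

    def visit_Assign(self, node):
        # Taint the left side if the right side is a source call
        # or mentions an already-tainted variable.
        rhs_names = {n.id for n in ast.walk(node.value) if isinstance(n, ast.Name)}
        is_source = isinstance(node.value, ast.Call) and \
            self._call_name(node.value) in TAINT_SOURCES
        if is_source or rhs_names & self.tainted:
            for t in node.targets:
                if isinstance(t, ast.Name):
                    self.tainted.add(t.id)
        self.generic_visit(node)

    def visit_Call(self, node):
        # Flag tainted variables passed into blocked sinks.
        name = self._call_name(node)
        bad = {a.id for a in node.args if isinstance(a, ast.Name)} & self.tainted
        if name in BLOCKED_SINKS and bad:
            self.violations.append(f"CRITICAL: {name}({', '.join(bad)})")
        self.generic_visit(node)

code = """
phone = decrypt_phone(enc)
payload = phone
print(payload)
"""
v = TaintVisitor()
v.visit(ast.parse(code))
print(v.violations)   # ['CRITICAL: print(payload)']
```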

    The taint spreads through the AST, from simple assignments to dicts, lists, nested subscripts, and container methods. If the color touches something it should not, we know.

    Step 3: Classify the Destination

    When tainted data reaches a function call, we classify it:

    • Blocked sink: print(phone), logger.info(payload) | CRITICAL, build blocked

    • Unknown function: some_func(phone) | HIGH, must verify and whitelist

    • Allowed sink: requests.post(data=payload) | INFO, passes, reviewer notified

    This three-tier system gives developers clear, actionable signals rather than a wall of noise.
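The classification itself reduces to a small lookup (sink names here are illustrative examples, not the full config):

```python
BLOCKED_SINKS = {"print", "logger.info"}   # logging decrypted data
ALLOWED_SINKS = {"requests.post"}          # legitimate outbound calls

def classify_sink(func_name):
    """Return the verdict tier for a call that receives tainted data."""
    if func_name in BLOCKED_SINKS:
        return "CRITICAL"   # build blocked
    if func_name in ALLOWED_SINKS:
        return "INFO"       # passes, reviewer notified
    return "HIGH"           # unknown function: verify and whitelist

print(classify_sink("logger.info"))    # CRITICAL
print(classify_sink("requests.post"))  # INFO
print(classify_sink("some_func"))      # HIGH
```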

    Step 4: Cross-Function Tracing

Cross-function taint tracing

    The biggest engineering challenge: phone data crosses 5-6 functions before reaching its final destination. Within a single function, taint analysis is straightforward. But when send_to_partner_api() receives api_payload as a parameter, the AST of that function alone does not reveal it is tainted.

    We solved this with pattern matching on function names:
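A sketch of what such patterns look like (the regexes here are hypothetical; the real config matches our internal naming conventions):

```python
import re

# One pattern covers a whole family of functions.
TAINTED_RETURN_PATTERNS = [
    re.compile(r".*decrypt.*phone.*"),
    re.compile(r"get_.*_payload"),
    re.compile(r"build_.*_request"),
]

def returns_tainted(func_name):
    """True if a function's return value should be treated as tainted."""
    return any(p.fullmatch(func_name) for p in TAINTED_RETURN_PATTERNS)

print(returns_tainted("build_courier_request"))  # True
print(returns_tainted("format_address"))         # False
```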

    Functions matching these patterns are recognized as carrying sensitive data. Their return values are treated as tainted by callers. One pattern covers dozens of functions, no manual listing needed.

    These functions can return phone data (that's their job) and pass it to allowed sinks like courier APIs, but they cannot log or print it. If they do, it's a CRITICAL violation regardless of the function's purpose.

    Scaling to 31,000 Files

    Our codebase has 31,580 Python files. Parsing all of them would take minutes, too slow for a PR pipeline.

    The insight:

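In sketch form, the search step looks like this (the pattern `decrypt_` and the script name `taint_check.py` are illustrative):

```shell
# Step 1: cheap text search narrows 31,580 files to the handful
# that even mention a decryption helper.
grep -rl --include='*.py' 'decrypt_' . > tainted_files.txt || true

# Step 2: run the AST taint analyzer only on those files.
# python taint_check.py $(cat tainted_files.txt)
wc -l < tainted_files.txt
```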

    47 files. Out of 31,580. That's 0.15% of the codebase.

Only 47 files ever touch a decryption function. We use grep, a fast C-level string search, to find them, then parse only those with the AST. Total pipeline time: under 5 seconds for the entire codebase.

    Bitbucket Integration


    The tool runs on every pull request and posts inline annotations directly on the changed lines using Bitbucket's Code Insights API. Bitbucket Pipelines provides a built-in authentication proxy, so there are no tokens or credentials to manage.

    Reviewers see the annotation on the exact line where sensitive data flows. It tells them the variable name, the destination function, and whether it needs verification. No context switching, no separate report to read.

    Our pipeline config is four lines:
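It looks roughly like this (an illustrative sketch of a Bitbucket Pipelines pull-request step, not our literal config):

```yaml
pipelines:
  pull-requests:
    '**':
      - step:
          name: PII taint check
          script:
            - python pii_taint_check.py
```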

    Developer Experience: What Happens When It Flags Something

    When a developer's build fails, they see a clear message:

    • CRITICAL: print(phone) | Logging decrypted data | Remove the print/log

    • CRITICAL: do_log(payload) | Payload contains phone | Mask before logging, or remove

    • HIGH: xyz() not whitelisted | Unknown function receives phone data | Verify it's safe, add to config

    • HIGH: return from function | Function returns tainted data | Add to tainted_return_functions

    The config is self-service. Developers verify the function is legitimate, add one line to the config, and re-run. No security team bottleneck.

    What We Learned

    AST is underrated. Most Python developers know the ast module exists but never touch it. It is the same tree structure that compilers have relied on for 50+ years, and Python hands it to you for free. If you can describe a code pattern in terms of assignments, calls, and returns, you can detect it automatically.

    Start with the data, not the code. Only 47 out of 31,580 files matter. The sensitive data is the protagonist of this story. Follow it, and the relevant code reveals itself.

    False positives kill adoption. We invested heavily in precision: passthrough functions for built-ins like len() and str(), safe dict key tracking so payload['city'] does not trigger even when payload['phone'] is tainted, and pattern-based whitelisting that covers entire function families with a single rule.
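Safe dict key tracking, for example, means taint is recorded per key rather than per container (a minimal sketch of the idea):

```python
# Per-key taint map: only payload['phone'] is tainted, not the whole dict.
tainted_keys = {"payload": {"phone"}}

def subscript_is_tainted(var, key):
    """True only if this specific dict key carries sensitive data."""
    return key in tainted_keys.get(var, set())

print(subscript_is_tainted("payload", "phone"))  # True
print(subscript_is_tainted("payload", "city"))   # False
```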

    Automated enforcement beats manual review. A tool that runs on every PR and gives instant, specific feedback is worth more than a quarterly security audit. Developers learn the patterns. The codebase gets cleaner over time.

    What's Next

    We're building a call graph tracer that eliminates manual configuration entirely:

    1. grep for the 47 files with decrypt calls (2 seconds)

    2. Parse those files, find what functions they call

    3. grep for each called function's definition across the codebase

    4. Recursively trace until hitting external libraries like requests, then stop

    5. At every hop, check: does this function log or print the tainted parameter?

    Full cross-function taint tracking. Zero manual configuration. Under 15 seconds for 31K files.
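The recursive hop-by-hop loop can be sketched like this (the toy call graph and function names are hypothetical stand-ins for what grep plus AST parsing would discover):

```python
# Toy call graph standing in for grep + AST parsing of real files.
CALLS = {
    "decrypt_phone_view": ["build_payload"],
    "build_payload": ["send_to_partner_api"],
    "send_to_partner_api": ["requests.post", "print"],
}
EXTERNAL = {"requests.post"}   # stop recursing at library boundaries
SINKS = {"print"}              # logging/printing tainted data

def trace(func, seen=None):
    """Recursively follow calls from a tainted function, flagging sinks."""
    seen = seen or set()
    if func in seen or func in EXTERNAL:
        return []
    seen.add(func)
    hits = []
    for callee in CALLS.get(func, []):
        if callee in SINKS:
            hits.append(f"{func} -> {callee}")
        hits += trace(callee, seen)
    return hits

print(trace("decrypt_phone_view"))  # ['send_to_partner_api -> print']
```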

    The AST has been powering compilers since Fortran. We're using the same tree to protect customer data at scale. If your codebase handles sensitive data, the tools to protect it are already built into your language. You just have to use them.
