From Compiler Theory to Production: How AST Powers Our PII Protection
At ClickPost, we process millions of shipments every day. Behind every shipment is a layer of customer PII passed through courier APIs, order payloads, and tracking systems. Protecting that data is not optional. It is foundational.
So we asked ourselves: what if we could catch every potential PII exposure at the code level before it ever reaches databases and logs? What if we borrowed a 50-year-old idea from compiler design and turned it into a security tool?
That's exactly what we built.
Why This Matters in Logistics
Logistics platforms move data through deep function chains. A customer's phone number gets decrypted, placed into a payload, forwarded to a courier partner API, and sometimes passes through 5-6 internal functions along the way.
With 31,000+ Python files in our codebase, relying on manual code reviews to catch every place sensitive data might get exposed is not scalable. We needed automated, intelligent enforcement, something that understands how data actually flows through code.
Not a regex scanner. Not a keyword search. Something that thinks like a compiler.
What is an AST? (And Why Compilers Use It)
The core technology behind our tool is the Abstract Syntax Tree, the same data structure every compiler has used since the 1960s.
When you write code, the compiler does not read it the way you do. It transforms it through a pipeline:
[Figure: the compiler pipeline]
Lexing: Your source code x = 1 + 2 is broken into tokens: x, =, 1, +, 2. Just raw words, no meaning yet.
Parsing: The parser reads the tokens and builds a tree based on the language's grammar rules. It understands that = is an assignment, + is an operation, and 1 and 2 are operands.
The AST is born: The result is a structured tree:
        Assign(=)
       /        \
   Name(x)    BinOp(+)
              /      \
          Num(1)    Num(2)
Every tool you use walks this exact tree: linters like pylint, formatters like black, type checkers like mypy. Compilers optimize it and generate machine code from it.
Python gives us this tree for free. The ast module in the standard library parses any Python file into this structure without executing the code. Every assignment, function call, and return statement becomes a node you can inspect.
import ast
code = "phone = decrypt_pii_field(data)"
tree = ast.parse(code)
# tree.body[0] -> Assign node
# .targets[0] -> Name(id='phone')
# .value -> Call(func=Name(id='decrypt_pii_field'))
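To see how inspectable this tree is, here is a short sketch that walks a parsed module and records every variable assigned directly from a decryption source. The snippet is illustrative, not our production scanner; the source names mirror the examples in this post.

```python
import ast

# Illustrative sketch: walk a parsed module and record every variable
# assigned directly from a known decryption source.
SOURCES = {"decrypt_pii_field", "unmask_pii_for_channel", "unmask_pii_for_order"}

code = """
phone = decrypt_pii_field(data)
city = lookup_city(data)
"""

tainted = set()
for node in ast.walk(ast.parse(code)):
    if isinstance(node, ast.Assign) and isinstance(node.value, ast.Call):
        func = node.value.func
        # Handle both plain names and attribute calls like obj.method()
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name in SOURCES:
            tainted.update(t.id for t in node.targets if isinstance(t, ast.Name))

print(tainted)  # {'phone'}
```

Note that `city` stays clean: only assignments whose right-hand side calls a source function get marked.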
We took this compiler primitive and applied it to a security problem: tracking where sensitive data flows through our codebase.
The Approach: Taint Analysis
The concept comes from security research: taint analysis. You mark certain functions as "sources" of sensitive data, then track every variable that touches that data as it flows through assignments, dicts, lists, and function calls. If that variable reaches a dangerous "sink" like print() or a logging function, you flag it before it ships.

Think of it as putting a dye marker on decrypted data and watching where the color spreads.
How Our Tool Works
Step 1: Define the Sources
We have a limited set of core decryption functions, the only places in our codebase where encrypted phone numbers become readable:
decrypt_pii_field, unmask_pii_for_channel, unmask_pii_for_order
Any variable assigned from these becomes tainted. That's where the trail begins.
Step 2: Walk the AST, Follow the Taint
Our AST visitor (a class inheriting from ast.NodeVisitor) handles each node type:
- visit_Assign: does the right side contain tainted vars? If yes, taint the left side
- visit_Call: is a tainted var passed as an argument? Flag it
- visit_Return: is tainted data being returned? Track it
phone = decrypt_pii_field(encrypted_data) # Assign: phone is tainted
payload = {"drop_phone": phone} # Dict: payload is tainted
final_array.append(payload) # Call: final_array is tainted
response['result']['data'] = final_array # Subscript: response is tainted
The taint spreads through the AST, from simple assignments to dicts, lists, nested subscripts, and container methods. If the color touches something it should not, we know.
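A stripped-down version of such a visitor might look like this. This is a sketch only: the class and set names are illustrative, and our production tool handles many more node types.

```python
import ast

SOURCES = {"decrypt_pii_field"}
BLOCKED_SINKS = {"print"}

class TaintVisitor(ast.NodeVisitor):
    """Toy taint tracker: seeds taint at source calls, spreads it through
    assignments, and flags tainted arguments reaching blocked sinks."""

    def __init__(self):
        self.tainted = set()
        self.violations = []

    def _is_tainted(self, node):
        # Tainted if any Name inside the expression is a tainted variable,
        # or the expression calls a source function directly.
        for sub in ast.walk(node):
            if isinstance(sub, ast.Name) and sub.id in self.tainted:
                return True
            if (isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name)
                    and sub.func.id in SOURCES):
                return True
        return False

    def visit_Assign(self, node):
        if self._is_tainted(node.value):
            for target in ast.walk(node.targets[0]):
                if isinstance(target, ast.Name):
                    self.tainted.add(target.id)
        self.generic_visit(node)

    def visit_Call(self, node):
        if isinstance(node.func, ast.Name) and node.func.id in BLOCKED_SINKS:
            if any(self._is_tainted(arg) for arg in node.args):
                self.violations.append((node.lineno, node.func.id))
        self.generic_visit(node)

code = """
phone = decrypt_pii_field(data)
payload = {"drop_phone": phone}
print(payload)
"""
visitor = TaintVisitor()
visitor.visit(ast.parse(code))
print(visitor.violations)  # [(4, 'print')]
```

The taint jumps from `phone` to `payload` through the dict literal, so the `print(payload)` call is flagged even though `phone` never appears in it directly.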
Step 3: Classify the Destination
When tainted data reaches a function call, we classify it:
- Blocked sink: print(phone), logger.info(payload) | CRITICAL, build blocked
- Unknown function: some_func(phone) | HIGH, must verify and whitelist
- Allowed sink: requests.post(data=payload) | INFO, passes, reviewer notified
This three-tier system gives developers clear, actionable signals rather than a wall of noise.
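In code, the classification can be as simple as a lookup. The sink lists below are illustrative examples, not our full configuration.

```python
# Illustrative sketch of the three-tier classification.
BLOCKED_SINKS = {"print", "logger.info", "logger.error", "do_log"}
ALLOWED_SINKS = {"requests.post", "requests.put"}

def classify_call(func_name: str) -> tuple:
    """Return (severity, action) for a call receiving tainted data."""
    if func_name in BLOCKED_SINKS:
        return ("CRITICAL", "build blocked")
    if func_name in ALLOWED_SINKS:
        return ("INFO", "passes, reviewer notified")
    return ("HIGH", "must verify and whitelist")

print(classify_call("print"))          # ('CRITICAL', 'build blocked')
print(classify_call("requests.post"))  # ('INFO', 'passes, reviewer notified')
print(classify_call("some_func"))      # ('HIGH', 'must verify and whitelist')
```

Anything not explicitly known defaults to HIGH, which is what forces the whitelist review described below.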
Step 4: Cross-Function Tracing

The biggest engineering challenge: phone data crosses 5-6 functions before reaching its final destination. Within a single function, taint analysis is straightforward. But when send_to_partner_api() receives api_payload as a parameter, the AST of that function alone does not reveal it is tainted.
We solved this with pattern matching on function names:
"decrypt_pii" -> matches all decrypt_pii_* variants
"unmask_pii" -> matches unmask_pii_for_channel, unmask_pii_for_order, etc.
"build_customer" -> matches build_customer_info, build_customer_payload, etc.
"build_api_payload" -> matches all payload builders
Functions matching these patterns are recognized as carrying sensitive data. Their return values are treated as tainted by callers. One pattern covers dozens of functions, no manual listing needed.
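The matching itself is simple prefix matching, roughly like this (the pattern list mirrors the examples above; the helper name is illustrative):

```python
# One prefix pattern covers a whole family of functions.
TAINTED_RETURN_PATTERNS = [
    "decrypt_pii",
    "unmask_pii",
    "build_customer",
    "build_api_payload",
]

def returns_tainted_data(func_name: str) -> bool:
    """True if this function's return value should be treated as tainted."""
    return any(func_name.startswith(p) for p in TAINTED_RETURN_PATTERNS)

print(returns_tainted_data("unmask_pii_for_channel"))  # True
print(returns_tainted_data("build_customer_info"))     # True
print(returns_tainted_data("lookup_city"))             # False
```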
These functions can return phone data (that's their job) and pass it to allowed sinks like courier APIs, but they cannot log or print it. If they do, it's a CRITICAL violation regardless of the function's purpose.
Scaling to 31,000 Files
Our codebase has 31,580 Python files. Parsing all of them would take minutes, too slow for a PR pipeline.
The insight:
grep -rl "decrypt_pii\|unmask_pii\|fetch_pii" . --include="*.py" | wc -l
47 files. Out of 31,580. That's 0.15% of the codebase.
Only 47 files ever touch a decryption function. We use grep, C-level string search, to find them, then parse only those with AST. Total pipeline time: under 5 seconds for the entire codebase.
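In Python, the same two-stage pipeline might be sketched as follows. This assumes grep is available on the PATH; the function names are illustrative, and the pattern mirrors the shell command above.

```python
import ast
import pathlib
import subprocess

def find_candidate_files(root: str) -> list:
    """Stage 1: C-level string search narrows the codebase to the few
    files that mention a decryption function at all."""
    result = subprocess.run(
        ["grep", "-rl", r"decrypt_pii\|unmask_pii\|fetch_pii",
         root, "--include=*.py"],
        capture_output=True, text=True,
    )
    return [line for line in result.stdout.splitlines() if line]

def parse_candidates(paths: list) -> dict:
    """Stage 2: only the candidates get the (comparatively) expensive
    AST parse. ast.parse never executes the code."""
    return {path: ast.parse(pathlib.Path(path).read_text()) for path in paths}
```

Running the expensive analysis on 47 files instead of 31,580 is what keeps the whole check under a few seconds.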
Bitbucket Integration

The tool runs on every pull request and posts inline annotations directly on the changed lines using Bitbucket's Code Insights API. Bitbucket Pipelines provides a built-in authentication proxy, so there are no tokens or credentials to manage.
Reviewers see the annotation on the exact line where sensitive data flows. It tells them the variable name, the destination function, and whether it needs verification. No context switching, no separate report to read.
Our pipeline config is just a few lines:
pull-requests:
  '**':
    - step:
        name: PII Taint Check
        script:
          - python3 ci_scripts/pii_taint_tracker.py --changed
Developer Experience: What Happens When It Flags Something
When a developer's build fails, they see a clear message:
- CRITICAL: print(phone) | Logging decrypted data | Remove the print/log
- CRITICAL: do_log(payload) | Payload contains phone | Mask before logging, or remove
- HIGH: xyz() not whitelisted | Unknown function receives phone data | Verify it's safe, add to config
- HIGH: return from function | Function returns tainted data | Add to tainted_return_functions
The config is self-service. Developers verify the function is legitimate, add one line to the config, and re-run. No security team bottleneck.
What We Learned
AST is underrated. Most Python developers know the ast module exists but never touch it. It is the same tree structure that compilers have relied on for 50+ years, and Python hands it to you for free. If you can describe a code pattern in terms of assignments, calls, and returns, you can detect it automatically.
Start with the data, not the code. Only 47 out of 31,580 files matter. The sensitive data is the protagonist of this story. Follow it, and the relevant code reveals itself.
False positives kill adoption. We invested heavily in precision: passthrough functions for built-ins like len() and str(), safe dict key tracking so payload['city'] does not trigger even when payload['phone'] is tainted, and pattern-based whitelisting that covers entire function families with a single rule.
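Safe dict key tracking, for instance, can be sketched at the key level rather than the variable level. The names below are illustrative, not our production data structures.

```python
# Illustrative sketch: taint individual dict keys, not whole dicts, so
# payload['city'] stays clean even when payload['phone'] is tainted.
tainted_vars = {"phone"}
tainted_keys = {}  # variable name -> set of tainted keys

def record_dict_assign(var: str, key: str, value_names: set):
    """Called for `var[key] = value`; taints only that key when the
    value expression mentions a tainted variable."""
    if value_names & tainted_vars:
        tainted_keys.setdefault(var, set()).add(key)

record_dict_assign("payload", "drop_phone", {"phone"})
record_dict_assign("payload", "city", {"city_name"})

print("drop_phone" in tainted_keys["payload"])  # True
print("city" in tainted_keys["payload"])        # False
```

Key-level granularity is most of what keeps the signal-to-noise ratio high: flagging the whole `payload` dict would flag every harmless field alongside the phone number.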
Automated enforcement beats manual review. A tool that runs on every PR and gives instant, specific feedback is worth more than a quarterly security audit. Developers learn the patterns. The codebase gets cleaner over time.
What's Next
We're building a call graph tracer that eliminates manual configuration entirely:
- grep for the 47 files with decrypt calls (2 seconds)
- Parse those files, find what functions they call
- grep for each called function's definition across the codebase
- Recursively trace until hitting external libraries like requests, then stop
- At every hop, check: does this function log or print the tainted parameter?
Full cross-function taint tracking. Zero manual configuration. Under 15 seconds for 31K files.
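The recursive trace over the call graph could be sketched as a simple worklist walk. This is purely illustrative; the planned tool builds the graph from grep plus AST parsing, and the names here are hypothetical.

```python
# Hedged sketch of the planned recursive trace: follow called functions
# from a decryption source until hitting external libraries, then stop.
EXTERNAL_PREFIXES = ("requests.", "json.", "logging.")

def trace(call_graph: dict, start: str) -> set:
    """call_graph maps a function name to the names it calls."""
    seen, frontier = set(), [start]
    while frontier:
        fn = frontier.pop()
        if fn in seen or fn.startswith(EXTERNAL_PREFIXES):
            continue  # already visited, or an external library boundary
        seen.add(fn)
        frontier.extend(call_graph.get(fn, []))
    return seen

graph = {
    "decrypt_pii_field": ["build_customer_info"],
    "build_customer_info": ["send_to_partner_api"],
    "send_to_partner_api": ["requests.post"],
}
print(trace(graph, "decrypt_pii_field"))
```

At every function the walk visits, the checker would then ask the same question as today: does this function log or print its tainted parameter?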
The AST has been powering compilers since Fortran. We're using the same tree to protect customer data at scale. If your codebase handles sensitive data, the tools to protect it are already built into your language. You just have to use them.