Safe Harbor De-Identification

Six weeks of legal review. Or one command.

You need production data for realistic testing. Legal needs a six-week review to approve the data copy request. The test deadline is Friday. You have been here before. This time, skip the request entirely.

The compliance bottleneck

The compliance bottleneck that kills testing timelines.

Every integration team hits the same wall. Staging needs realistic data. Realistic data means production messages. Production messages contain PHI. PHI requires legal review, compliance sign-off, data use agreements, and a chain of custody you will be auditing for years.

So you compromise. You hand-build 14 test patients named John Doe. You copy a production ADT, open it in a text editor, and start replacing names and MRNs by hand. You are careful. You catch the PID. You catch the NK1. But you miss the SSN buried in an NTE segment on message 147.

Now you have a reportable PHI breach because your scrubbing was Find-and-Replace in Notepad++. The problem is not carelessness. The problem is that manual de-identification does not scale, and production copy-and-scrub treats a compliance requirement as a text editing exercise.

HIPAA Safe Harbor Method

All 18 identifier categories. Every time.

One missed SSN in a free-text NTE segment is a reportable breach. Regex misses it. Pidgeon parses the full HL7 abstract syntax tree and removes every structural identifier with certainty — not guesswork.
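
To make the difference from Find-and-Replace concrete, here is a minimal Python sketch that walks every field of every segment and reports where an SSN-shaped value appears. It is an illustration only, not Pidgeon's implementation: the `find_ssn_locations` helper, the pattern, and the sample message are all assumptions, and a real parser would also handle repetitions, components, and escape sequences.

```python
import re

# Hypothetical pattern for illustration: dashed SSN format only.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssn_locations(message: str) -> list[tuple[str, int]]:
    """Return (segment name, field position) pairs whose value contains an SSN-shaped string."""
    hits = []
    for segment in message.split("\r"):  # HL7 v2 segments end with a carriage return
        fields = segment.split("|")
        # Position 0 is the segment name, so start counting fields at 1.
        # (MSH numbering is offset by one in the standard; ignored in this sketch.)
        for position, value in enumerate(fields[1:], start=1):
            if SSN_PATTERN.search(value):
                hits.append((fields[0], position))
    return hits


# Made-up sample message: the SSN sits in NTE free text, not in the PID segment.
message = "\r".join([
    "MSH|^~\\&|ADT|HOSP|||20240115||ADT^A01|1|P|2.5.1",
    "PID|1||12345^^^HOSP^MR||DOE^JANE||19800101|F",
    "NTE|1||Patient confirmed SSN 123-45-6789 at registration",
])

locations = find_ssn_locations(message)
```

Because the scan is driven by message structure, the hit is reported as a segment and field position rather than as a line of text someone has to remember to check.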

Deterministic hashing means the same input with the same salt produces the same output. Your team gets identical de-identified datasets. Reproducible. Auditable.

Names
MRN / Account numbers
Social Security Numbers
Dates of birth
Street addresses
Phone numbers
Fax numbers
Email addresses
IP addresses
Geographic subdivisions
Dates (except year)
Age over 89
Certificate / license numbers
Device identifiers
Web URLs
Vehicle identifiers
Biometric identifiers
Full-face photographs
A. De-identify production messages.

When you need the real production payload — the message with the Z-segment your mapper has never seen, the OBX with the non-standard reference range — Post strips every HIPAA identifier locally. Zero cloud extraction. Zero data transmission. The messages never leave your machine.

terminal
$ pidgeon deident --in ./prod_samples --out ./safe_samples --date-shift 90d --salt "project-2026"
Processing 847 messages...
  18 HIPAA identifier categories detected and replaced
  Date shifting applied (+90 days) to all temporal fields
  Cross-message referential integrity preserved
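
For readers who want to see what a fixed date shift does to a timestamp, here is a minimal Python sketch. It is not Pidgeon's code: the `shift_hl7_date` helper is hypothetical, and it assumes HL7 v2 date values that begin with `YYYYMMDD`. Shifting by whole days leaves the time of day untouched and keeps the interval between any two events the same.

```python
from datetime import datetime, timedelta


def shift_hl7_date(value: str, days: int) -> str:
    """Shift the YYYYMMDD prefix of an HL7 v2 date/time value by a fixed number of days."""
    if len(value) < 8 or not value[:8].isdigit():
        return value  # year-only or malformed values are left alone in this sketch
    shifted = datetime.strptime(value[:8], "%Y%m%d") + timedelta(days=days)
    # Anything after the date (time of day, timezone offset) is carried over unchanged.
    return shifted.strftime("%Y%m%d") + value[8:]


admit = shift_hl7_date("20240115083000", 90)   # "20240414083000"
discharge = shift_hl7_date("20240118", 90)     # "20240417"
```

The three-day gap between admission and discharge survives the shift, which is what keeps length-of-stay logic testable on de-identified data.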

Zero transmission

Your PHI is not uploaded to a cloud service. It never touches our servers. It never crosses your network boundary. Your CISO can verify this in the first meeting.

Deterministic hashing

Same input, same salt, same output. Share the salt with your team and everyone gets identical de-identified datasets across every run.
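
One common way to get this property is a keyed hash. The Python sketch below uses HMAC-SHA256 from the standard library. It is an assumption about the general technique, not a description of Pidgeon's internals; the `pseudonymize` function and the token length are made up for illustration.

```python
import hashlib
import hmac


def pseudonymize(value: str, salt: str, length: int = 12) -> str:
    """Map an identifier to a stable token using HMAC-SHA256 keyed by the salt."""
    digest = hmac.new(salt.encode("utf-8"), value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length].upper()


# Same input and same salt always produce the same token.
token_a = pseudonymize("MRN123456", "project-2026")
token_b = pseudonymize("MRN123456", "project-2026")

# A different salt produces an unrelated token for the same input.
token_c = pseudonymize("MRN123456", "another-project")
```

Anyone holding the salt can reproduce the mapping, so treat the salt as a secret in its own right.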

Referential integrity

A patient MRN replaced with a synthetic value is replaced consistently across every message in the batch. Cross-message references stay coherent.
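
A small Python sketch shows why consistent replacement keeps a batch coherent. Everything here is illustrative rather than Pidgeon's implementation: `surrogate` and `replace_mrn` are hypothetical helpers, the sample messages are made up, and the sketch only touches the first component of PID-3, ignoring repetitions.

```python
import hashlib
import hmac


def surrogate(value: str, salt: str) -> str:
    """Derive a stable replacement identifier from the original value and a salt."""
    digest = hmac.new(salt.encode("utf-8"), value.encode("utf-8"), hashlib.sha256)
    return "X" + digest.hexdigest()[:10].upper()


def replace_mrn(message: str, salt: str) -> str:
    """Replace the ID number (first component of PID-3) in one HL7 v2 message."""
    segments = message.split("\r")
    for i, segment in enumerate(segments):
        fields = segment.split("|")
        if fields[0] == "PID" and len(fields) > 3 and fields[3]:
            components = fields[3].split("^")
            components[0] = surrogate(components[0], salt)
            fields[3] = "^".join(components)
            segments[i] = "|".join(fields)
    return "\r".join(segments)


# Two messages about the same patient, from different feeds.
admit = "MSH|^~\\&|ADT|HOSP|||20240115||ADT^A01|1|P|2.5.1\rPID|1||12345^^^HOSP^MR||DOE^JANE"
result = "MSH|^~\\&|LAB|HOSP|||20240116||ORU^R01|2|P|2.5.1\rPID|1||12345^^^HOSP^MR||DOE^JANE"

safe_admit = replace_mrn(admit, "project-2026")
safe_result = replace_mrn(result, "project-2026")
```

Both outputs carry the same surrogate in PID-3, so a lab result can still be matched to its admission after de-identification.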

B. Generate from nothing.

When you do not need the production payload and just need realistic test data, generate it from scratch. There is no PHI to de-identify because no PHI ever existed. No legal review. No compliance risk. No waiting.

terminal
$ pidgeon generate ADT^A01 --count 500 --vendor epic --output ./test_data/
Generated 500 HL7 v2.5.1 messages
  Clinically correlated demographics
  Vendor-realistic field patterns
  Zero PHI by construction

No data use agreement. No six-week legal review. No chain of custody. The data was never real.

Either way

Prove it to compliance.

Post generates a compliance report based on actual detection results. Hand it to your privacy officer or attach it to the data use agreement.

terminal — generate HTML compliance report
$ pidgeon deident --in ./prod_samples --out ./safe --date-shift 90d --report compliance.html

Audit-ready documentation

The report documents every identifier category scanned, every substitution made, and every field preserved. Attach it directly to your data use agreement.

Actual detection results

Not a policy document. An evidence document. The compliance report reflects what Pidgeon actually found and removed from your specific dataset.

The conversation with legal changes.

Before

“We need production data for testing.”
“File a data use agreement. We will review in six weeks.”

After

“The test data was generated synthetically. No PHI was involved at any stage. Here is the compliance report.”

There is nothing to scrub, nothing to approve, nothing to breach.

For QA and test data managers

This is the workflow that removes legal from the testing critical path. De-identify when you need production structure. Generate when you need volume. Either way, your test deadline is no longer blocked by a compliance review.

For integration engineers

The free CLI includes full de-identification. No Pro tier required. Point it at a directory and have safe test data before lunch.

Free. No trial. No subscription.

De-identification is free. Right now.

The full Safe Harbor workflow — all 18 HIPAA identifier categories, deterministic hashing, date shifting, referential integrity — ships with the free CLI. No trial. No subscription. No strings.

Download the CLI (Mac / Windows / Linux)
