PII redaction in transcripts: how automated redaction works

A customer reads their sixteen-digit card number aloud to confirm a payment. The call is recorded and transcribed for quality assurance, and now that card number sits in full in your transcript database and your audio archive, in two places, indefinitely. Under PCI-DSS, that is a violation waiting to be found. The same is true of a Social Security number under privacy law, or a diagnosis under HIPAA. A recording kept to improve service has become something you can be fined for holding.

Redaction makes that audio and transcript safe to keep.^[1] Most of the work is routine; what is unusual is how lopsided the errors are, which the end of this page returns to. The walkthrough builds the pipeline in order, because each step depends on the one before it.

Transcribe with timestamps

Redaction starts from a transcript, but a specific kind: one with word-level timestamps. Audio redaction is the reason: it requires knowing exactly when each sensitive word was spoken, and that timing comes from the speech recognizer.

{ "tokens": [
  { "text": "card", "start_ms": 8100, "end_ms": 8300 },
  { "text": "is",   "start_ms": 8300, "end_ms": 8420 },
  { "text": "4539", "start_ms": 8500, "end_ms": 9100 },
  { "text": "8821", "start_ms": 9100, "end_ms": 9700 }
]}
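Later steps detect PII in text but redact audio by time, so it helps to keep a mapping between the two. The sketch below is one way to do that, assuming the recognizer returns JSON shaped like the sample above; the `Token` class, the `char_start`/`char_end` offsets, and the `span_to_ms` helper are illustrative names, not part of any particular recognizer's API.

```python
import json
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    start_ms: int
    end_ms: int
    char_start: int = 0  # offsets into the space-joined transcript (our addition)
    char_end: int = 0


def load_tokens(payload):
    """Parse recognizer output and record where each token sits in the text."""
    tokens = [Token(**item) for item in json.loads(payload)["tokens"]]
    cursor = 0
    for tok in tokens:
        tok.char_start = cursor
        tok.char_end = cursor + len(tok.text)
        cursor = tok.char_end + 1  # one space separates tokens in the joined text
    return tokens


def span_to_ms(tokens, char_start, char_end):
    """Return the (start_ms, end_ms) covered by tokens overlapping a text span."""
    hits = [t for t in tokens if t.char_start < char_end and t.char_end > char_start]
    if not hits:
        return None
    return hits[0].start_ms, hits[-1].end_ms


PAYLOAD = """{ "tokens": [
  { "text": "card", "start_ms": 8100, "end_ms": 8300 },
  { "text": "is",   "start_ms": 8300, "end_ms": 8420 },
  { "text": "4539", "start_ms": 8500, "end_ms": 9100 },
  { "text": "8821", "start_ms": 9100, "end_ms": 9700 }
]}"""

tokens = load_tokens(PAYLOAD)
text = " ".join(t.text for t in tokens)
```

With this in place, a detector can work on `text` alone and hand back character offsets, and `span_to_ms` turns those offsets into the time range to mute.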

Detect the sensitive information

Find the PII by combining two techniques: pattern matching for structured data (card numbers, SSNs, and emails have recognizable shapes), and named-entity recognition or a language model for the contextual cases (names, addresses, health terms). The categories map to the compliance regime: direct identifiers (names, IDs) for privacy law such as the GDPR, financial data (cards, accounts) for PCI-DSS, and protected health information (diagnoses, medications) for HIPAA.
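The pattern-matching half can be sketched with a few regular expressions. This is a minimal illustration, not a production rule set: the three patterns and the `Span` and `detect_structured` names are assumptions made for the example, and the contextual half (NER or a language model) is deliberately left out. The card pattern is kept loose and is not filtered by a checksum, since a single misheard digit would fail the check and let the number through.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    label: str
    start: int  # character offsets into the transcript text
    end: int


# Illustrative patterns only; real deployments need broader, tested rules.
PATTERNS = {
    # 13 to 19 digits, optionally separated by single spaces or hyphens
    "CARD": re.compile(r"\b(?:\d[ -]?){12,18}\d\b"),
    # 3-2-4 digit groups, with or without separators
    "SSN": re.compile(r"\b\d{3}[ -]?\d{2}[ -]?\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def detect_structured(text):
    """Return every pattern match as a Span, ordered by position in the text."""
    spans = [
        Span(label, match.start(), match.end())
        for label, pattern in PATTERNS.items()
        for match in pattern.finditer(text)
    ]
    return sorted(spans, key=lambda s: (s.start, s.end))
```

Spans from a contextual detector can be appended to the same list, so later steps do not need to know which technique found a given span.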

Select a redaction method

Redaction covers several operations, and for each category you decide which one applies, trading compliance against usability.

Masking ([CARD], [NAME]) is the simplest and safest, because the information is gone. Pseudonymizing replaces it with a consistent token, so "John" becomes "Person A" everywhere and the transcript still reads like a conversation while the identity is removed. Hashing and tokenizing are for when you need to match records later without storing the raw value. The category usually dictates the choice: health and card data must be fully removed, while names in a meeting transcript might only need pseudonymizing to stay readable.

Redact the transcript

Apply the chosen action to each detected span in the text. This is the easy half, and the reason the next step so often gets forgotten.
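As a rough sketch of this step, the function below masks each detected span in the text. It assumes spans arrive as `(label, start, end)` character offsets, which is a convention chosen for this example; overlapping spans are merged first so that no fragment of a sensitive value survives between two replacements.

```python
def redact_text(text, spans):
    """Replace each (label, start, end) span in text with a [LABEL] mask."""
    # Merge overlapping spans, keeping the label of whichever starts first.
    merged = []
    for label, start, end in sorted(spans, key=lambda s: (s[1], s[2])):
        if merged and start < merged[-1][2]:
            prev_label, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_label, prev_start, max(prev_end, end))
        else:
            merged.append((label, start, end))

    # Rebuild the text left to right so the original offsets stay valid.
    pieces = []
    cursor = 0
    for label, start, end in merged:
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

Masking is used here for brevity; a pseudonymizing or hashing function could be substituted per category without changing the span handling.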

Redact the audio

This is the step people forget, and the reason the transcript needed timestamps. A redacted transcript does not protect the recording, because the card number is still spoken aloud in the audio. For true compliance you must redact the audio too, muting or bleeping the segments where PII was spoken, located by the timestamps of the detected spans.

Timing precision matters here. A bleep placed a beat late leaks the first digit, and one placed too wide clips the surrounding words, so the audio redaction is only as good as the timestamps it works from.
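The trade-off between leaking and clipping shows up as a single padding parameter. The sketch below assumes audio is available as a plain sequence of PCM samples and that PII spans are given in milliseconds; `pad_ms` and both function names are hypothetical, and the default padding is a placeholder to tune against real recordings, not a recommended value.

```python
def mute_intervals(spans_ms, pad_ms=100):
    """Pad each (start_ms, end_ms) span and merge any that touch or overlap."""
    padded = sorted((max(0, start - pad_ms), end + pad_ms) for start, end in spans_ms)
    merged = []
    for start, end in padded:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def mute(samples, sample_rate, intervals_ms):
    """Return a copy of samples with every interval replaced by silence."""
    out = list(samples)
    for start_ms, end_ms in intervals_ms:
        lo = start_ms * sample_rate // 1000
        hi = min(len(out), (end_ms * sample_rate + 999) // 1000)  # round up
        for i in range(lo, hi):
            out[i] = 0
    return out
```

For the two digit tokens in the sample transcript (8500 to 9100 ms and 9100 to 9700 ms), a padding of 150 ms yields one merged interval from 8350 to 9850 ms. That also clips the tail of the preceding word "is", which ends at 8420 ms: wider padding protects against late timestamps at the cost of surrounding speech.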

flowchart TB
    A[Audio] --> B[Transcribe<br/>+ timestamps]
    B --> C[Detect PII]
    C --> D[Redact transcript]
    C --> E[Redact audio<br/>by timestamp]

Redaction touches both outputs. The transcript and the audio each have to be cleaned, joined by the timestamps.

Prioritize recall

One property sets redaction apart from most extraction tasks: the two kinds of error cost wildly different amounts. Over-redacting (a masked word that was not sensitive) makes a transcript slightly harder to read. Under-redacting (a real SSN that slipped through) is a breach with a legal bill attached. Because the costs are that lopsided, you tune for recall, catch everything sensitive, and accept the false positives as the price. Verify by measuring recall on real data, and treat every miss as a serious failure rather than as part of a tolerable error rate.
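Measuring recall needs a labeled sample: spans a human marked as sensitive, compared against what the pipeline redacted. The sketch below assumes both are lists of `(start, end)` offsets (characters or milliseconds, as long as they agree) and uses a strict rule, chosen for this example, that a labeled span counts as caught only if it is covered end to end. A half-masked card number still leaks digits.

```python
def covered(gold, predicted):
    """True if the gold (start, end) span lies entirely inside redacted spans."""
    start, end = gold
    cursor = start
    for p_start, p_end in sorted(predicted):
        if p_start > cursor:
            break  # a gap before the gold span is fully covered
        cursor = max(cursor, p_end)
        if cursor >= end:
            return True
    return cursor >= end


def recall_report(gold_spans, predicted_spans):
    """Return (recall, misses); misses are gold spans not fully redacted."""
    misses = [g for g in gold_spans if not covered(g, predicted_spans)]
    if not gold_spans:
        return 1.0, misses
    return 1 - len(misses) / len(gold_spans), misses
```

The list of misses matters more than the ratio: each entry is a specific value that would have been stored unredacted, so it is worth reviewing one by one.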

Difficult redaction cases

The worst case is spoken numbers and codes. Card numbers, SSNs, and account IDs are read aloud as strings of digits and letters, the hardest thing for recognition to get right, and a misheard digit corrupts the transcript and confuses detection. The rule that saves you is to mute the audio wherever a number was spoken, using the timestamps, even when the transcribed digits are wrong. Context is the other gap: whether a number or name is sensitive often depends on context a pattern matcher cannot see, which is why contextual detection (NER, language models) runs alongside regex and the recall-first stance backstops both.
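One way to apply that rule is to mute any sufficiently long run of digit-like tokens by time, without checking whether the digits form a valid number. The sketch below assumes tokens shaped like the sample recognizer output; the number-word list, `min_digits`, and `max_gap_ms` are illustrative choices for this example and would over-redact on purpose (for instance, a spoken order quantity).

```python
NUMBER_WORDS = {
    "zero", "oh", "one", "two", "three", "four", "five",
    "six", "seven", "eight", "nine",
}


def digit_count(text):
    """How many digits a token contributes: 1 per number word, N per numeral."""
    word = text.lower().strip(".,?!")
    if word in NUMBER_WORDS:
        return 1
    digits = word.replace("-", "")
    return len(digits) if digits.isdigit() else 0


def digit_run_intervals(tokens, min_digits=4, max_gap_ms=1500):
    """Return (start_ms, end_ms) for each run of at least min_digits digits."""
    intervals = []
    run_start = run_end = None
    count = 0

    def flush():
        nonlocal run_start, run_end, count
        if run_start is not None and count >= min_digits:
            intervals.append((run_start, run_end))
        run_start = run_end = None
        count = 0

    for tok in tokens:
        n = digit_count(tok["text"])
        if n == 0:
            flush()  # a non-digit word ends the run
            continue
        if run_start is not None and tok["start_ms"] - run_end > max_gap_ms:
            flush()  # a long pause ends the run
        if run_start is None:
            run_start = tok["start_ms"]
        run_end = tok["end_ms"]
        count += n
    flush()
    return intervals
```

Because the check looks only at how many digits were spoken and when, a card number transcribed with one wrong digit produces the same mute interval as a correct one.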

Real-time redaction adds the streaming twist. In a live call you must detect and bleep PII as it is spoken, working from partial results. That is why card-capture flows often pause recording or route the digits to a separate secure path rather than relying on live redaction alone.

Common questions

What counts as PII in a call transcript?

Anything that identifies a person or is otherwise sensitive: names, addresses, phone and account numbers, credit-card numbers, government IDs, dates of birth, and under HIPAA health details like diagnoses and medications. The exact list is set by the compliance regime you fall under, so map your categories to HIPAA, PCI-DSS, or the GDPR rather than redacting by gut feel.

Why redact the audio and not just the transcript?

Because the card number is still spoken aloud in the recording even after the transcript is clean, and anyone who listens hears it. Real redaction mutes or bleeps the audio at the timestamps of each detected span, which is why the pipeline starts from a transcript with word-level timestamps in the first place.

Should redaction favor catching everything or avoiding mistakes?

Catching everything: tune for recall rather than precision. The two errors are wildly asymmetric, since over-redacting a harmless word is mildly annoying while one missed SSN is a breach with legal weight, so accept extra false positives to avoid a single miss.

How is sensitive data redacted from a live call?

In real time, working from partial results before the text settles, which makes it inherently imperfect. That is why card-capture flows often pause recording or route the digits to a separate secure path instead of trusting live detect-and-bleep to catch every one.

References

Ahmed, S., Chowdhury, A. R., Fawaz, K., & Ramanathan, P. (2019). Preech: A System for Privacy-Preserving Speech Transcription. arXiv preprint arXiv:1909.04198.
