OpenAI’s Privacy Filter vs Protegrity-PII and the Data Lesson As Old As Time

PII detection: where it matters and why it’s hard

Personally identifiable information (PII) is any piece of information that can be used to identify an individual or to extract information about them. PII frequently shows up in free-form text as names, email addresses, phone numbers, and government IDs, to list a few that are uniquely tied to one person. Less obvious forms of PII include ZIP codes, job titles, employer names, and rare diagnoses, which look harmless on the surface but, in combination with one another, can pin down a single individual. As such, PII detection is a front-line concern for data security companies.

For many companies, PII can live in a variety of locations, whether in databases or in free-form text logs. Leaks of PII can therefore occur across numerous pipelines, which necessitates an accurate PII detector that can find PII within these pipelines and alert users to where it exists. Without such a detector, PII can unintentionally leak to the public, especially in the agentic era. In critical sectors such as banking and healthcare, leaks of PII have large downstream effects, including regulatory and legal consequences (GDPR, HIPAA, CCPA), as well as brand fallout and loss of trust. Furthermore, in these regulated industries, the service provider for PII detection is legally responsible for any mishaps. This makes the mission of accurate and scalable PII detection all the more critical.

The reason PII detection is hard starts with the data itself. First, most PII doesn’t follow a specific or easy-to-identify pattern. Unlike a credit card number or a Social Security number, a name is just a word. So is a street name or an address. Lightweight tools like regex can’t help you here, because there’s nothing structurally distinctive to match against, so catching these requires an understanding of language and the subtle cues humans pick up on without thinking. Second, different industries use different identifier schemas, such as those for part ID numbers or health insurance ID numbers. As a result, a detector that looks excellent on a generic public benchmark can fall over on real-world hospital correspondence, banking statements, or legal filings.
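To illustrate the first point, here is a minimal sketch of why pattern matching covers structured identifiers but not names. The two patterns and the sample sentences are simplified assumptions made up for illustration; they are not patterns used by either detector discussed in this post.

```python
import re

# Hypothetical, deliberately simplified patterns for structurally distinctive IDs.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")

def regex_findings(text: str) -> list[tuple[str, str]]:
    """Return (label, matched_text) pairs for the two patterns above."""
    findings = [("SSN", m.group()) for m in SSN_RE.finditer(text)]
    findings += [("CARD", m.group()) for m in CARD_RE.finditer(text)]
    return findings

structured = "SSN 123-45-6789 on file, card 4111 1111 1111 1111."
unstructured = "Please forward the chart to Diane Walters on Elm Street."

print(regex_findings(structured))    # both identifiers are matched
print(regex_findings(unstructured))  # nothing matches: the name and street are just words
```

The second call returns an empty list even though the sentence contains a name and a street, which is the gap a language-aware detector has to close.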

Underneath all of this is a classic chicken-and-egg problem. To train a strong detector you need a large and accurately labeled corpus of documents that contain PII. Such a corpus almost surely cannot exist in practice, because real customer documents cannot be shared for privacy and legal reasons. Consequently, those interested in PII detection resort to generating synthetic data instead. Synthetic data sidesteps the privacy problem, since in principle no real entities (such as names) are being used, but it introduces a labeling problem of its own. We now need to either have our engineers hand-label hundreds of thousands of data points (not happening, for us at least) or label the data programmatically (which has known issues such as entity span misalignment). The end result is a strange loop. To build a good PII detector, you need labeled data, but labeling that data well enough almost boils down to the very task you’re trying to solve in the first place.

OpenAI releases Privacy Filter

Last month, OpenAI released Privacy Filter (OPF), a small open-weights model trained to find PII in free-form text. It can be run on a local machine and reports a headline F1 of 0.96 on the public PII-Masking-300k benchmark. PII teams across the enterprise space have been scrambling to figure out what this means for their stacks. As one such team here at Protegrity, we launched an internal investigation comparing OPF against our models, procedures, and datasets, and we detail the findings here.

The short version, and the spoiler for the rest of the post, is that PII detection (and arguably many problems in the ML sphere) is far more a high-quality data generation problem than a model architecture problem. Critically, OPF is an open-weights release rather than an open-source one, so what the community gets is the trained model and not the training dataset or the dataset generation pipeline. That is a perfectly reasonable choice on OpenAI’s part, but it means that their core contribution lives squarely in the model. Dataset design and generation are still left to individual teams and companies to solve for themselves, and it is precisely this design that is vital for downstream performance.

With that framing in place, the natural next question is what the comparison between OPF and our in-house detector actually looks like once you put both models on the same data and score them the same way. We start with a toy example of the kind of enterprise text that a production PII detector sees in the real world and conclude with a comparison on the public benchmarks OPF reports on.

Knowledge extraction from PII detector output

Here at Protegrity, we spend a great deal of time thinking about and classifying what counts as knowledge. This has resulted in a myriad of technical and, at times, philosophical discussions.

Before getting into the standard benchmarks and discussions for PII classification, we want to ground the discussion in a concrete threat that can happen in production. The question we seek to investigate here is: what does an attacker actually do with the output of a PII detector, and what happens when that detector has imperfections? The example below walks through one such scenario.

Below is a data point from our internal test dataset (a synthetic example) that we ran through both our internal PII detector (Protegrity-PII) and the OPF classifier. Imagine you are processing healthcare correspondence, and a letter comes in with the following opener:

Subject is a Cambridge resident and former Governor of Massachusetts, sex M, born 1945-07-31, admitted 1996-05-18.

When we feed this text into both classifiers and perform redaction on the detected entities, we see:

OPF:

Subject is a Cambridge resident and former Governor of Massachusetts, sex M, born [PRIVATE_DATE], admitted [PRIVATE_DATE].

Protegrity-PII:

Subject is a [PRIVATE_ADDRESS] resident and former Governor of [PRIVATE_ADDRESS], sex [PRIVATE_PERSON], born [PRIVATE_DATE], admitted [PRIVATE_DATE].
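The redaction step itself can be sketched as follows. This is a minimal illustration, not either product's implementation: the `Span` type, the `find_span` helper (which stands in for a detector's output), and the assumption that spans do not overlap are all ours. The span lists are hand-built to mirror the two outputs shown above.

```python
from typing import NamedTuple

class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str

def redact(text: str, spans: list[Span]) -> str:
    """Replace each (non-overlapping) span with a [LABEL] placeholder."""
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        out = out[:span.start] + f"[{span.label}]" + out[span.end:]
    return out

def find_span(text: str, needle: str, label: str) -> Span:
    """Hypothetical helper that stands in for a detector's prediction."""
    i = text.index(needle)
    return Span(i, i + len(needle), label)

letter = ("Subject is a Cambridge resident and former Governor of Massachusetts, "
          "sex M, born 1945-07-31, admitted 1996-05-18.")

# Dates only, mirroring the OPF output above.
dates_only = [
    find_span(letter, "1945-07-31", "PRIVATE_DATE"),
    find_span(letter, "1996-05-18", "PRIVATE_DATE"),
]

# Dates plus quasi-identifiers, mirroring the Protegrity-PII output above.
sex_at = letter.index("sex M") + len("sex ")
with_quasi_identifiers = dates_only + [
    find_span(letter, "Cambridge", "PRIVATE_ADDRESS"),
    find_span(letter, "Massachusetts", "PRIVATE_ADDRESS"),
    Span(sex_at, sex_at + 1, "PRIVATE_PERSON"),
]

print(redact(letter, dates_only))
print(redact(letter, with_quasi_identifiers))
```

Everything the detector does not flag survives redaction verbatim, which is what the next section exploits.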

Now what can we do with this information?

In this fictitious scenario, imagine we are an attacker and we’ve received back the text from both PII detectors. For the sake of this scenario, assume we have access to two databases: one that details state governors in the USA and one that details current residents of Massachusetts.

DB1 · US State Governors
name | party | term_start | term_end | sex | dob | state_served
Maura T. Healey | Democratic | 2023 | 2026 | F | 1971-02-08 | MA
Edward J. King | Democratic | 1979 | 1983 | M | 1925-05-11 | MA
Gray L. Davis | Democratic | 1999 | 2003 | M | 1942-12-26 | CA
Phil D. Murphy | Democratic | 2018 | 2026 | M | 1957-08-16 | NJ
Christopher J. Christie | Republican | 2010 | 2018 | M | 1962-09-06 | NJ
Andrew M. Cuomo | Democratic | 2011 | 2021 | M | 1957-12-06 | NY
Christian A. Herter | Republican | 1953 | 1957 | M | 1895-03-28 | MA
Jane M. Swift | Republican | 2001 | 2003 | F | 1965-02-24 | MA
Paul A. Dever | Democratic | 1949 | 1953 | M | 1903-01-15 | MA
Tom W. Corbett | Republican | 2011 | 2015 | M | 1949-06-17 | PA
Tom W. Wolf | Democratic | 2015 | 2023 | M | 1948-11-17 | PA
George E. Deukmejian | Republican | 1983 | 1991 | M | 1928-06-06 | CA
Endicott Peabody | Democratic | 1963 | 1965 | M | 1920-02-15 | MA
William F. Weld | Republican | 1991 | 1997 | M | 1945-07-31 | MA
Hugh L. Carey | Democratic | 1975 | 1982 | M | 1919-04-11 | NY
Ron D. DeSantis | Republican | 2019 | 2026 | M | 1978-09-14 | FL
Robert F. Bradford | Republican | 1947 | 1949 | M | 1902-12-15 | MA
Arnold A. Schwarzenegger | Republican | 2003 | 2011 | M | 1947-07-30 | CA
James E. McGreevey | Democratic | 2002 | 2004 | M | 1957-08-06 | NJ
Rick L. Scott | Republican | 2011 | 2019 | M | 1952-12-01 | FL
Mario M. Cuomo | Democratic | 1983 | 1994 | M | 1932-06-15 | NY
Jeb E. Bush | Republican | 1999 | 2007 | M | 1953-02-11 | FL
Rick J. Perry | Republican | 2000 | 2015 | M | 1950-03-04 | TX
George E. Pataki | Republican | 1995 | 2006 | M | 1945-06-24 | NY
George W. Bush | Republican | 1995 | 2000 | M | 1946-07-06 | TX
John A. Volpe | Republican | 1965 | 1969 | M | 1908-12-08 | MA
Ann W. Richards | Democratic | 1991 | 1995 | F | 1933-09-01 | TX
Greg W. Abbott | Republican | 2015 | 2026 | M | 1957-11-13 | TX
Charlie M. Crist | Republican | 2007 | 2011 | M | 1956-07-24 | FL
W. Mitt Romney | Republican | 2003 | 2007 | M | 1947-03-12 | MA
Michael S. Dukakis | Democratic | 1975 | 1979 | M | 1933-11-03 | MA
Maurice J. Tobin | Democratic | 1945 | 1947 | M | 1901-05-22 | MA
Robert P. Casey | Democratic | 1987 | 1995 | M | 1932-01-09 | PA
Pete B. Wilson | Republican | 1991 | 1999 | M | 1933-08-23 | CA
Francis W. Sargent | Republican | 1969 | 1975 | M | 1915-07-29 | MA
Edward G. Rendell | Democratic | 2003 | 2011 | M | 1944-01-05 | PA
Charles D. Baker | Republican | 2015 | 2023 | M | 1956-11-13 | MA
Deval L. Patrick | Democratic | 2007 | 2015 | M | 1956-07-31 | MA
Edmund G. Brown Jr. | Democratic | 1975 | 1983 | M | 1938-04-07 | CA
Foster Furcolo | Democratic | 1957 | 1961 | M | 1911-07-29 | MA
Nelson A. Rockefeller | Republican | 1959 | 1973 | M | 1908-07-08 | NY

DB2 · Massachusetts Residential Roll
name | sex | dob | city | state | zip
Deval L. Patrick | M | 1956-07-31 | Milton | MA | 02186
Michael S. Dukakis | M | 1933-11-03 | Brookline | MA | 02446
George N. Yost | M | 1952-04-15 | Cambridge | MA | 02139
W. Mitt Romney | M | 1947-03-12 | Belmont | MA | 02478
Rosa T. Vargas | F | 1979-05-11 | Cambridge | MA | 02140
Mary G. Sullivan | F | 1981-12-09 | Cambridge | MA | 02138
William F. Weld | M | 1945-07-31 | Cambridge | MA | 02138
Henry L. Bowditch | M | 1950-08-05 | Cambridge | MA | 02139
Helen S. Whalen | F | 1973-02-28 | Cambridge | MA | 02140
Vincent J. Carmichael | M | 1976-10-22 | Cambridge | MA | 02138
Charles D. Baker | M | 1956-11-13 | Swampscott | MA | 01907
Lawrence T. Easton | M | 1957-03-17 | Cambridge | MA | 02141
Jane M. Swift | F | 1965-02-24 | Williamstown | MA | 01267
Edward T. Petrillo | M | 1953-12-08 | Cambridge | MA | 02140
Robert C. Aldrich | M | 1962-04-22 | Cambridge | MA | 02138
Helen K. Brennan | F | 1981-10-29 | Cambridge | MA | 02139
Janet R. Holden | F | 1972-04-12 | Cambridge | MA | 02138
Peter F. Donovan | M | 1944-09-12 | Cambridge | MA | 02141
Roger D. Quincy | M | 1948-02-11 | Cambridge | MA | 02142
Albert P. Mehrtens | M | 1959-07-08 | Cambridge | MA | 02142
Bruce K. Hollister | M | 1971-02-06 | Cambridge | MA | 02138
Doris E. Quincy | F | 1983-04-05 | Cambridge | MA | 02142
Norman P. Fielding | M | 1962-11-19 | Cambridge | MA | 02140
James P. Bellamy | M | 1958-08-30 | Cambridge | MA | 02139
Margaret B. Yates | F | 1971-10-22 | Cambridge | MA | 02139
Susan H. Day | F | 1990-11-15 | Cambridge | MA | 02141
Maria E. Cole | F | 1985-03-08 | Cambridge | MA | 02139
Anthony N. Sweeney | M | 1939-03-25 | Cambridge | MA | 02142
Patricia J. Reilly | F | 1976-08-21 | Cambridge | MA | 02142
Frank G. Lansing | M | 1964-01-30 | Cambridge | MA | 02139
Stephen H. Maddox | M | 1981-05-17 | Cambridge | MA | 02141
Edward J. King | M | 1925-05-11 | Winthrop | MA | 02152
Diane S. Atwood | F | 1969-03-14 | Cambridge | MA | 02140
Joan L. Carmody | F | 1988-06-17 | Cambridge | MA | 02141
Vincent A. Pell | M | 1959-07-08 | Cambridge | MA | 02142
Barbara K. Lynch | F | 1965-05-13 | Cambridge | MA | 02139
Joyce A. Whitman | F | 1976-12-05 | Cambridge | MA | 02141
Ruth N. Ackerman | F | 1977-11-14 | Cambridge | MA | 02138
Marcus W. Ellsworth | M | 1966-11-19 | Cambridge | MA | 02140
Janet F. Ortiz | F | 1980-08-19 | Cambridge | MA | 02141
Charles K. Riley | M | 1947-11-04 | Cambridge | MA | 02140
Sandra L. Petty | F | 1973-08-21 | Cambridge | MA | 02138
Walter B. Crowley | M | 1955-06-19 | Cambridge | MA | 02141
Carolyn B. Tighe | F | 1984-04-17 | Cambridge | MA | 02142
Daniel R. Holcomb | M | 1972-09-12 | Cambridge | MA | 02138
Linda B. Park | F | 1968-09-23 | Cambridge | MA | 02140

As the attacker, we can do next to nothing malicious with the redacted text from Protegrity-PII. The only thing we are able to extract is that a former Governor visited this particular hospital. From OPF’s redacted text, however, the unredacted tokens Cambridge, Governor, Massachusetts, and M give an attacker exactly the column values they need to query the two databases above. Let us walk through it.

Step 1. Filter DB1 to former male governors of Massachusetts.

SELECT name, dob FROM DB1
WHERE state_served = 'MA' AND sex = 'M';

The total search space shrinks to 14.

14 candidates from DB1
name | dob
Christian A. Herter | 1895-03-28
Edward J. King | 1925-05-11
William F. Weld | 1945-07-31
Michael S. Dukakis | 1933-11-03
Foster Furcolo | 1911-07-29
Charles D. Baker | 1956-11-13
Francis W. Sargent | 1915-07-29
W. Mitt Romney | 1947-03-12
Deval L. Patrick | 1956-07-31
Robert F. Bradford | 1902-12-15
Paul A. Dever | 1903-01-15
Maurice J. Tobin | 1901-05-22
Endicott Peabody | 1920-02-15
John A. Volpe | 1908-12-08

Step 2. Filter DB2 to male residents of Cambridge.

SELECT name, dob, zip FROM DB2
WHERE city = 'Cambridge' AND sex = 'M';

The total search space shrinks to 21.

21 candidates from DB2
name | dob | zip
Charles K. Riley | 1947-11-04 | 02140
Anthony N. Sweeney | 1939-03-25 | 02142
Albert P. Mehrtens | 1959-07-08 | 02142
Lawrence T. Easton | 1957-03-17 | 02141
Peter F. Donovan | 1944-09-12 | 02141
Daniel R. Holcomb | 1972-09-12 | 02138
Edward T. Petrillo | 1953-12-08 | 02140
Bruce K. Hollister | 1971-02-06 | 02138
William F. Weld | 1945-07-31 | 02138
Norman P. Fielding | 1962-11-19 | 02140
Marcus W. Ellsworth | 1966-11-19 | 02140
Vincent A. Pell | 1959-07-08 | 02142
Vincent J. Carmichael | 1976-10-22 | 02138
Roger D. Quincy | 1948-02-11 | 02142
Frank G. Lansing | 1964-01-30 | 02139
Henry L. Bowditch | 1950-08-05 | 02139
Robert C. Aldrich | 1962-04-22 | 02138
George N. Yost | 1952-04-15 | 02139
James P. Bellamy | 1958-08-30 | 02139
Walter B. Crowley | 1955-06-19 | 02141
Stephen H. Maddox | 1981-05-17 | 02141

Neither filter on its own is enough. 14 governors and 21 Cambridge men are still somewhat anonymous crowds. Once we join the two sets, however, the critical identifier appears.

Step 3. Inner join the two filtered sets on (name, dob).

SELECT name, dob FROM db1_filtered
INNER JOIN db2_filtered USING (name, dob);

We have now collapsed into a single row.

name | dob
William F. Weld | 1945-07-31

The patient is Governor William F. Weld [1]! Note that the date of birth was never an input to either database query. It falls out of the join because exactly one ex-Governor of Massachusetts also appears on the current Cambridge residential roll, and that one row carries a dob field that now ties the redacted clinical event to a specific human being. We still don’t know which day he was admitted, but we now do know that he was in the hospital. This toy example shows how a misaligned PII detector can allow the extraction of knowledge about an individual’s status that was meant to be kept private. Perhaps in this case the attacker cannot do too many malicious things, but the precedent is set. We now move on to more holistic evaluations of Protegrity-PII and OPF.
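For readers who want to replay the three steps, here is a minimal sketch using Python's built-in sqlite3 module. It loads only a handful of the rows from DB1 and DB2 above (a subset chosen for brevity, and only the columns the queries need), and folds the two filters and the join into one statement. The in-memory database and the subquery aliases are our own scaffolding.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DB1 (name TEXT, sex TEXT, dob TEXT, state_served TEXT);
CREATE TABLE DB2 (name TEXT, sex TEXT, dob TEXT, city TEXT);
""")

# A small subset of the DB1 rows shown above.
conn.executemany("INSERT INTO DB1 VALUES (?, ?, ?, ?)", [
    ("William F. Weld", "M", "1945-07-31", "MA"),
    ("W. Mitt Romney", "M", "1947-03-12", "MA"),
    ("Jane M. Swift", "F", "1965-02-24", "MA"),
    ("George E. Pataki", "M", "1945-06-24", "NY"),
])
# A small subset of the DB2 rows shown above.
conn.executemany("INSERT INTO DB2 VALUES (?, ?, ?, ?)", [
    ("William F. Weld", "M", "1945-07-31", "Cambridge"),
    ("W. Mitt Romney", "M", "1947-03-12", "Belmont"),
    ("Roger D. Quincy", "M", "1948-02-11", "Cambridge"),
    ("Mary G. Sullivan", "F", "1981-12-09", "Cambridge"),
])

# Steps 1 and 2 are the two subqueries; step 3 is the inner join on (name, dob).
rows = conn.execute("""
    SELECT name, dob
    FROM (SELECT name, dob FROM DB1
          WHERE state_served = 'MA' AND sex = 'M') AS governors
    INNER JOIN
         (SELECT name, dob FROM DB2
          WHERE city = 'Cambridge' AND sex = 'M') AS residents
    USING (name, dob)
""").fetchall()

print(rows)
```

Only the tokens left unredacted in the OPF output ('MA', 'Cambridge', 'M') appear as query inputs; the name and date of birth come back purely from the join.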

Benchmarks (PII-300k, SPY, Healthcare)

For a standardized quantitative assessment, we run both our internal PII classifier (Protegrity-PII) and the OPF model on three datasets: the two benchmarks from OPF’s model card, plus a novel in-house healthcare set generated by our synthetic data pipeline. The healthcare set serves as a stringent out-of-distribution test for both detectors and is designed to represent the realistic logs and correspondence that a PII detector would see in production. Below we characterize all three datasets.

ai4privacy PII-Masking-300k. PII-Masking-300k is the public, openly-licensed synthetic dataset released by ai4privacy on HuggingFace. It contains roughly 300,000 multilingual examples spanning 6+ languages and a little over 30 PII categories, each with character-level span annotations. It is the de-facto public benchmark for token-classification-style PII detectors and is the dataset OPF reports its headline numbers on.

SPY. SPY is the second benchmark OPF reports against in its model card. Released by Savkin et al. (NAACL 2025), it contains roughly 10,000 synthetic records, primarily consisting of medical consultations and legal questions. It covers seven PII categories with token-level BIO annotations and is explicitly designed to expose the gap between generic named-entity recognition (NER) and fine-grained PII detection on more production-realistic text for the healthcare and legal domains.

Healthcare-specific set. Finally, we built a novel healthcare dataset in-house using our new synthetic data generation pipeline. The pipeline produces realistic hospital correspondence, including clinical letters, billing receipts, lab-result summaries, referral notes, consent forms, and memos, and contains PII from over 30 categories (using the same entity list as PII-Masking-300k). This set was almost surely never seen during OPF’s training and was certainly never seen during ours.

It is important to note that our Protegrity-PII classifier and synthetic data pipeline produce a more fine-grained label set than OPF does. As such, for a fair comparison, we collapse our labels down to match OPF’s taxonomy, with the concrete mapping in the Appendix. The point is to make sure OPF is never penalized for emitting a coarser label than the dataset annotates, so any remaining gap reflects real detection ability rather than a mismatch in taxonomies.

Public benchmarks: PII-300k and SPY

We start with the two public benchmarks that OPF reports on directly in their model card. These are the most apples-to-apples comparisons available, since OPF validated against them. Reaching parity here is roughly the expected outcome for a competent detector, at the very least.

Metrics

We use two metric families on these datasets, span-level and token-level, and within each family we report a strict and a relaxed variant. Their definitions follow, along with an illustrative figure.

Span-level metrics. A span is a contiguous character range in the source text together with a label, e.g. (start=142, end=158, label=PRIVATE_PERSON), where start and end are character indices into the source text and label indicates the type of PII (see the Figure below). We refer to the true locations of start and end as the gold boundaries. Strict requires the predicted boundaries to match the gold boundaries exactly. Relaxed uses span-containment, where any character-level overlap with a compatible gold span counts as a hit. The relaxed number is the one OPF reports on its model card and is the headline span metric throughout their post. Span-level metrics, such as span-level F1 score, are the natural metrics for downstream redaction and encryption workflows.
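The two span-level rules can be sketched in a few lines. This is a simplified illustration rather than our evaluation harness: we assume "compatible" means an identical label, count a prediction as correct if it matches any gold span, and count a gold span as recovered if any prediction matches it.

```python
from typing import Callable, NamedTuple

class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str

def strict_match(pred: Span, gold: Span) -> bool:
    """Exact boundaries and identical label."""
    return pred == gold

def relaxed_match(pred: Span, gold: Span) -> bool:
    """Same label and at least one shared character."""
    return (pred.label == gold.label
            and pred.start < gold.end and gold.start < pred.end)

def span_f1(preds: list[Span], golds: list[Span],
            match: Callable[[Span, Span], bool]) -> float:
    if not preds or not golds:
        return 0.0
    precision = sum(any(match(p, g) for g in golds) for p in preds) / len(preds)
    recall = sum(any(match(p, g) for p in preds) for g in golds) / len(golds)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

text = "My name is John Smith and I ran today."
start = text.index("John Smith")
gold = [Span(start, start + len("John Smith"), "PRIVATE_PERSON")]
# A prediction that swallows the leading space: right entity, wrong boundary.
pred = [Span(start - 1, start + len("John Smith"), "PRIVATE_PERSON")]

print(span_f1(pred, gold, strict_match))   # 0.0
print(span_f1(pred, gold, relaxed_match))  # 1.0
```

A single stray whitespace character is enough to turn a relaxed hit into a strict miss, which is why the two variants can diverge sharply for the same model.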

Token-level metrics. Token scoring breaks the document into whitespace-delimited tokens, assigns each token its majority-character gold label (true entity label), and asks the model to label each token independently. Strict is multi-class token F1, where every token must receive the correct PII type from the list of supported entities for detection. Relaxed is binary token detection, where a token is scored as correct if it is flagged as PII at all, regardless of type. As in span-level, the relaxed metric for token-level F1 score is exactly what OPF reports in their model card.
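A comparable sketch for the token-level rules is below. The micro-averaged F1 over PII tokens and the tie-breaking behavior of the majority vote are our simplifying assumptions; the helper names are hypothetical.

```python
import re
from collections import Counter

def token_labels(text: str, spans: list[tuple[int, int, str]]) -> list[str]:
    """Label each whitespace-delimited token by majority vote over its characters."""
    char_labels = ["O"] * len(text)  # "O" marks non-PII characters
    for start, end, label in spans:
        for i in range(start, end):
            char_labels[i] = label
    labels = []
    for m in re.finditer(r"\S+", text):
        counts = Counter(char_labels[m.start():m.end()])
        labels.append(counts.most_common(1)[0][0])  # ties go to the first label seen
    return labels

def token_f1(pred: list[str], gold: list[str], relaxed: bool = False) -> float:
    if relaxed:  # binary detection: collapse every PII type into one class
        pred = ["PII" if p != "O" else "O" for p in pred]
        gold = ["PII" if g != "O" else "O" for g in gold]
    tp = sum(p == g and g != "O" for p, g in zip(pred, gold))
    fp = sum(p != "O" and p != g for p, g in zip(pred, gold))
    fn = sum(g != "O" and p != g for p, g in zip(pred, gold))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

text = "My name is Hardeep Bassi and I hope no one figures out!"
first = text.index("Hardeep")
last = text.index("Bassi")
gold = token_labels(text, [(first, last + len("Bassi"), "PRIVATE_PERSON")])
# Prediction that mislabels "Hardeep" as an address but gets "Bassi" right.
pred = token_labels(text, [(first, first + len("Hardeep"), "PRIVATE_ADDRESS"),
                           (last, last + len("Bassi"), "PRIVATE_PERSON")])

print(token_f1(pred, gold))                # 0.5
print(token_f1(pred, gold, relaxed=True))  # 1.0
```

The mislabeled token costs both precision and recall under the strict rule, but is a clean hit under the relaxed rule because it was still flagged as PII.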

[Figure: Strict versus relaxed scoring illustration]

To make the four variants concrete, consider the toy sentence My name is Hardeep Bassi and I hope no one figures out! with a single gold span on Hardeep Bassi. The Figure above walks through what each scoring rule rewards and penalizes on this example.

At the span level, an exact-boundary prediction passes both the strict and relaxed conditions (green), whereas a prediction that overlaps the gold span but starts at the wrong character (orange) fails strict but passes relaxed, and a prediction that lands somewhere else in the sentence fails both (purple). At the token level, mislabeling Hardeep as ADDRESS is a strict miss but a relaxed hit since the token was still flagged as PII, while a false positive on the token "one" is wrong under both notions.

We now evaluate the performance on both public benchmarks. All numbers below are F1 scores using the strict and relaxed notions defined above.

Performance

ai4privacy PII-Masking-300k

Scoring | Span F1 (OPF) | Span F1 (Protegrity-PII) | Token F1 (OPF) | Token F1 (Protegrity-PII)
Strict | 0.59 | 0.91 | 0.95 | 0.92
Relaxed | 0.92 | 0.92 | 0.96 | 0.98

SPY

Scoring | Span F1 (OPF) | Span F1 (Protegrity-PII) | Token F1 (OPF) | Token F1 (Protegrity-PII)
Strict | 0.13 | 0.56 | 0.54 | 0.62
Relaxed | 0.44 | 0.58 | 0.56 | 0.62

Let us digest what these results mean. First, we notice the obvious gap between strict and relaxed span-level F1 for OPF on both datasets. This gap means OPF is able to find the text that contains the PII it is trying to detect, but is consistently off on boundaries (most commonly, we found, by a stray whitespace character). For example, in the sentence My name is John Smith and I ran today., their model labels [whitespace]John Smith as a person, as opposed to John Smith. In contrast, our Protegrity-PII classifier’s strict and relaxed span-level F1 scores are tightly clustered, so the boundaries we detect land where they truly are within the text (i.e. in the previous example, labeling John Smith). Second, on the headline relaxed metrics the two models are comparable on PII-300k, and Protegrity-PII leads on SPY. Finally, at the token level, we see a smaller gap. For PII-300k, we suspect that OPF’s slightly better strict token-level performance stems from our having to collapse our supported entity list into theirs. On the SPY dataset, however, Protegrity-PII leads on every metric. In general, we view the SPY gap in the token- and span-level metrics as the more meaningful gap between Protegrity-PII and OPF, as SPY is almost certainly out-of-distribution for OPF and is out-of-distribution for Protegrity-PII.

Healthcare

The public benchmarks tell us how the two models compare on data OPF was tuned and evaluated against in their model card. However, they tell us comparatively little about how either model behaves on realistic enterprise correspondence. For domains like healthcare and finance, this is the regime that matters for production deployments. To probe this regime, we evaluated both models on the in-house healthcare set developed by our novel synthetic data generation pipeline, as described previously. This dataset was not seen during training by either system, so it serves as a stringent, strongly out-of-distribution test for both models.

We report the same span-level and token-level F1 metrics, this time using only the relaxed variant of each and presenting the results as figures rather than tables.

[Figure: Healthcare span-level performance comparison]

Above, we see that the overall span-level metrics on the healthcare dataset put Protegrity-PII ahead on both precision and recall (and consequently F1). OPF’s lower F1 reflects its difficulty generalizing to the kinds of text we curated for the healthcare set, which were designed to mirror realistic scenarios like hospital receipts and internal chat logs between staff. In contrast, Protegrity-PII handles these formats reliably, suggesting that our model is able to capture the messiness and variety of real healthcare text in a way OPF does not.

[Figure: Per-label F1 comparison]

To investigate further, we break the span-level F1 score down by entity label to see whether Protegrity-PII’s lead can be attributed to a single label or holds across the board. The Figure above shows that it holds across the board: we exceed OPF’s performance on every label except one (PHONE), where the two models match.

[Figure: Healthcare token-level performance comparison]

Above, we look at the token-level performance on the healthcare dataset, and the token-level F1 score tells the same story. Protegrity-PII again leads, and because token F1 is invariant to span boundaries, the gap here is purely about which entities were flagged by each model. This confirms that the underlying coverage of detected PII is genuinely higher for Protegrity-PII.

[Figure: Healthcare false positive rate comparison]

Finally, we investigate the false positive rate for both models, because a usable PII detector has to balance catching real PII against overflagging genuine text that customers actually need to keep. In the Figure above, we see that OPF flags substantially more non-PII characters as PII than Protegrity-PII, which directly explains why its precision lags in the span-level Figure shown earlier.
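One way to compute a character-level false positive rate is sketched below. The exact formula behind the figure is not spelled out in this post, so the definition used here (non-PII characters flagged as PII, divided by all non-PII characters) is an assumption, as are the function names and the toy offsets.

```python
def char_mask(length: int, spans: list[tuple[int, int]]) -> list[bool]:
    """True for every character covered by at least one (start, end) span."""
    mask = [False] * length
    for start, end in spans:
        for i in range(max(start, 0), min(end, length)):
            mask[i] = True
    return mask

def false_positive_rate(length: int,
                        pred_spans: list[tuple[int, int]],
                        gold_spans: list[tuple[int, int]]) -> float:
    """Share of non-PII characters that the detector flagged as PII."""
    pred = char_mask(length, pred_spans)
    gold = char_mask(length, gold_spans)
    negatives = sum(not g for g in gold)
    false_positives = sum(p and not g for p, g in zip(pred, gold))
    return false_positives / negatives if negatives else 0.0

# Toy document of 20 characters: gold PII at [5, 10), prediction at [3, 10).
rate = false_positive_rate(20, pred_spans=[(3, 10)], gold_spans=[(5, 10)])
print(rate)  # 2 of the 15 non-PII characters were flagged
```

Under this definition, a detector that pads its spans or flags ordinary words pays for it directly, independent of how many true entities it finds.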

Latency

The span-level and token-level metrics above are only one of several legs that determine whether a PII detector is usable in production. Another important leg that we have yet to investigate is the efficiency–accuracy tradeoff, which is broadly the relationship between how accurate a model is and what it costs to run at that accuracy at production volume. In practice, this tradeoff is what eliminates most candidate models long before quality alone does, since a large model tends to be more accurate but slower and more expensive to run, and a small model tends to be faster and cheaper but loses ground on quality. In other words, if a model is more accurate but takes 10x longer to run, is it really useful in deployment? The useful PII detector is the one that lands in a sensible spot on all of those axes at once rather than maxing out any single one. Here we focus on the efficiency side, and specifically on the latency of Protegrity-PII and OPF.

[Figure: Healthcare latency comparison]

From the Figure above, we see that OPF averages about 436 ms per document on our healthcare set, while our model averages about 73 ms on the same hardware and the same inputs. This is roughly a 6x reduction in wall-clock runtime. We attribute this speedup primarily to differences in the inference stack rather than model size. Protegrity-PII actually has more active parameters than OPF (125 million vs OPF’s claimed 50 million active out of 1.5 billion total), but it runs on ONNX, which provides various optimizations for a more efficient runtime than OPF’s stock PyTorch path. We found that OPF is not straightforward to put on the same footing off the shelf. Its use of mixture-of-experts (MoE) and dynamic mask shapes breaks both torch.onnx and optimum, so getting it onto a comparable ONNX serving path is non-trivial.
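A per-document latency measurement of this kind can be sketched as follows. The harness below is a generic illustration, not the one we ran: `detect` is a hypothetical stand-in for any detector's inference call, the warmup count is arbitrary, and `docs` is assumed to be non-empty. The last lines simply redo the speedup arithmetic from the averages quoted above.

```python
import time
from statistics import mean
from typing import Callable, Sequence

def mean_latency_ms(detect: Callable[[str], object],
                    docs: Sequence[str], warmup: int = 3) -> float:
    """Average wall-clock time per document, in milliseconds."""
    for doc in docs[:warmup]:  # warm caches and lazy initialization before timing
        detect(doc)
    timings = []
    for doc in docs:
        t0 = time.perf_counter()
        detect(doc)
        timings.append((time.perf_counter() - t0) * 1000.0)
    return mean(timings)

# Speedup implied by the averages reported in the text.
opf_ms, protegrity_ms = 436.0, 73.0
speedup = opf_ms / protegrity_ms
print(f"{speedup:.2f}x")  # just under 6x
```

For a fair comparison, both models should be timed with the same documents, batch size, and hardware, as was done for the numbers above.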

Cost

Finally, we investigate the cost per 1K characters processed for each model. OPF costs $0.000287 per 1K characters, in contrast to Protegrity-PII’s $0.000004 per 1K characters, when run on a g4dn.2xlarge EC2 instance, which uses an NVIDIA T4 16GB GPU. Our model is therefore roughly 70x, or nearly two orders of magnitude, cheaper to run.
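The cost comparison is simple arithmetic, sketched below under the assumption that both reported figures are expressed per 1K characters. The `cost_for` helper and the one-billion-character workload are hypothetical, added only to show how the gap scales.

```python
import math

# Reported costs, assumed here to both be per 1K characters.
opf_cost_per_1k = 0.000287
protegrity_cost_per_1k = 0.000004

ratio = opf_cost_per_1k / protegrity_cost_per_1k
orders_of_magnitude = math.log10(ratio)

def cost_for(num_chars: int, cost_per_1k: float) -> float:
    """Hypothetical helper: total cost in dollars for a given character volume."""
    return num_chars / 1000 * cost_per_1k

print(f"ratio: {ratio:.2f}x, log10: {orders_of_magnitude:.2f}")
print(f"1B chars on OPF:           ${cost_for(1_000_000_000, opf_cost_per_1k):.2f}")
print(f"1B chars on Protegrity-PII: ${cost_for(1_000_000_000, protegrity_cost_per_1k):.2f}")
```

The ratio works out to a bit over 70x, which is where the "roughly two orders of magnitude" figure comes from.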

Discussion

The benchmark numbers above tell us how much the two models disagree, but in isolation, they don’t really tell us why. This interpretation problem is ubiquitous across ML tasks in every field, and researchers offer many formalisms and interpretations to explain the phenomena they observe. To explain the discrepancy between Protegrity-PII and OPF, we read through the cases where the two models predicted differently and looked for patterns. Here’s our take on a few that stand out. Spoiler: it all ties back to data.

Let’s look at the following example from our healthcare dataset below:

Test set healthcare letter example

source text:

Consent verification was confirmed through cross-reference with the
national health system identification card number (IDCARD: NHS-987654321),
which remains valid and unrevoked as of this date. Given her sustained
clinical stability and continued willingness to participate, we affirm
that Ms. Smith meets all inclusion criteria under the study protocol.

Respectfully submitted,
Diane Walters
Senior Clinical Epidemiologist
Brown Inc

===============================================================
OPF:
[ACCOUNTNUMBER|BIC|CCN|IBAN] “NHS-987654321”
[NAME] “Ms. Smith”
[NAME] “Diane Walters”

===============================================================
Protegrity-PII:
[IDCARD] “NHS-987654321”
[NAME] “Smith”
[NAME] “Diane Walters”
[COMPANYNAME] “Brown Inc”

===============================================================

Taxonomy issues. OPF’s eight-category label set is naturally coarse-grained. In fact, it is coarse in ways that lose information users actually need. ACCOUNT_NUMBER is likely the clearest example. In their taxonomy, it conflates national health IDs, MRNs, IBANs, BICs, credit card numbers, and arbitrary internal account identifiers into a single bucket. For a redaction pipeline this might be acceptable, but for any downstream consumer that wants to apply different policy to, say, a credit card vs an internal patient ID, this coarseness is an active detriment, since you are obfuscating your own downstream task! Similarly, PRIVATE_PERSON collapses individual names with company names, and in many examples (such as the test set example shown previously), we see that OPF is unable to make this distinction.

Context-dependent PII. A more subtle failure mode is that OPF, like most detectors trained on span-annotated synthetic data, learns to recognize the visual signature of canonical PII rather than the underlying concept of identifying information. Tokens that only re-identify someone when read in context tend to slip through. Hence, an isolated city name like “Cambridge” (see the Knowledge extraction example) will generally be passed through unflagged, despite being PII that can be used to re-identify an individual. This behavior follows fairly directly from how the training data is annotated, and in their own model card, they report the inability to catch addresses that don’t follow a full standard format of [STREET ADDRESS, STREET, CITY, COUNTRY, ZIPCODE].

Conservative on rare/structured identifiers. OPF reliably flags credit card numbers, IBANs, and email-shaped strings, but is noticeably less consistent on domain-specific identifiers that don’t match a common regex profile such as internal patient IDs and insurance claim codes. In contrast, our model, trained on data that explicitly covers these shapes, picks them up reliably. This segues nicely into the final point.

Data holds the weight. This is the single most important point that we wish to emphasize. PII detection is a data problem far more than a modeling problem. The model choice matters at the margin, but the difference between a usable production-grade PII detector and a brittle one comes down to the quality of the dataset’s annotations and the range of formats covered in the training set. Critically, OPF’s release includes the weights and a description of the model design in the model card paper, but not the training corpus or data generation pipeline! That means when we see OPF systematically miss something, we cannot tell whether it’s a coverage gap in their data or a modeling artifact. We can only observe the symptom. For users trying to decide whether OPF will work on their domain, this is a real problem, as there is no way to predict failure modes from the model card alone. Empirical experiments must be run at scale, and it’s hard to pinpoint exactly where a failure comes from. In contrast, we can systematically determine from our procedure whether a problem arises from a train-test distribution mismatch or a model failure.

Conclusion

One could glance at the tables above, treat this as a scoreboard, hand Protegrity-PII the trophy, and walk away, but framing it that way buries the point we are actually trying to make.

In contrast to OpenAI’s release, which only includes the model, what we bring to the problem of detecting PII here at Protegrity is the model together with a carefully curated, high-quality dataset and the pipeline that generated it. This synthetic data generation pipeline is aimed exactly at the kinds of documents enterprise customers can actually send to us: hospital correspondence, billing and claims text, banking statements, legal filings, contact-center transcripts, you name it. The taxonomy that our data generation pipeline and model support is also far finer than OPF’s eight categories because real downstream policy requires that resolution.

In conclusion, the takeaway here is rather simple and oftentimes lost in conversations regarding model performance or architecture design. Whether the setting is enterprise ML or scientific machine learning, model choice is more often than not secondary to the assembly and deployment of high-quality training data. Historically, this seems to also be the case. Take the transformer architecture itself as the obvious example. It was a genuine breakthrough when it landed in 2017, but on its own it did not produce ChatGPT or Claude. What produced those systems was the same architecture pushed to scale on enormous, carefully curated text corpora and then aligned with human feedback data. The model itself was certainly a key player in the unlock, but the data on which these systems were trained is really what made those products shine. In our view, PII detection sits in the same place. The hard part of this whole problem comes down to sourcing the right data, getting it labeled correctly, and grinding through enough iterations to cover the vast array of forms PII actually takes in real enterprise text.

Appendix: Evaluation setup

For every experiment in this post, we project into OPF’s label space. We opt to do this primarily because OPF emits a deliberately narrow taxonomy of eight categories: PRIVATE_PERSON, PRIVATE_ADDRESS, PRIVATE_DATE, PRIVATE_EMAIL, PRIVATE_PHONE, PRIVATE_URL, ACCOUNT_NUMBER, and SECRET. In contrast, our model and synthetic data generation pipeline support, by construction, a more natural fine-grained label taxonomy. For example, our production model distinguishes NAME from COMPANYNAME, STREET from ZIPCODE from CITY, and breaks ACCOUNT_NUMBER into IDCARD, IBAN, BIC, CCN, etc. The same can be said of the benchmark datasets.

Concretely: name-like categories (NAME, COMPANYNAME, …) collapse to PRIVATE_PERSON; address-like categories (STREET, CITY, ZIPCODE, …) collapse to PRIVATE_ADDRESS; identifier-like categories (IDCARD, IBAN, BIC, CCN, ACCOUNTNUMBER, …) collapse to ACCOUNT_NUMBER; and DATE, EMAIL, PHONE, URL map straight through. Categories with no OPF analogue are excluded from scoring on both sides. This deliberately throws away signal that our model and the gold annotations actually contain. For example, a name flagged as COMPANYNAME vs NAME is treated as identical post-projection. Consequently, this ensures OPF is never penalized for emitting a coarser label than the dataset annotates, and that any remaining gap reflects detection ability rather than taxonomy mismatch.
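The projection described above can be sketched as a lookup table. Only the categories explicitly named in this appendix are listed; the elided categories are left out rather than guessed, and the `project` helper and the JOBTITLE label in the usage example are hypothetical.

```python
from typing import Optional

# Fine-grained label -> OPF label, limited to the categories named in the text.
OPF_PROJECTION: dict[str, str] = {
    "NAME": "PRIVATE_PERSON",
    "COMPANYNAME": "PRIVATE_PERSON",
    "STREET": "PRIVATE_ADDRESS",
    "CITY": "PRIVATE_ADDRESS",
    "ZIPCODE": "PRIVATE_ADDRESS",
    "IDCARD": "ACCOUNT_NUMBER",
    "IBAN": "ACCOUNT_NUMBER",
    "BIC": "ACCOUNT_NUMBER",
    "CCN": "ACCOUNT_NUMBER",
    "ACCOUNTNUMBER": "ACCOUNT_NUMBER",
    "DATE": "PRIVATE_DATE",
    "EMAIL": "PRIVATE_EMAIL",
    "PHONE": "PRIVATE_PHONE",
    "URL": "PRIVATE_URL",
}

def project_label(label: str) -> Optional[str]:
    """Return the OPF label, or None when there is no OPF analogue."""
    return OPF_PROJECTION.get(label)

def project(spans: list[tuple[int, int, str]]) -> list[tuple[int, int, str]]:
    """Map spans into OPF's label space, dropping categories with no analogue."""
    projected = []
    for start, end, label in spans:
        target = project_label(label)
        if target is not None:
            projected.append((start, end, target))
    return projected

# JOBTITLE is a made-up example of a category with no OPF analogue.
spans = [(0, 9, "COMPANYNAME"), (15, 28, "NAME"), (40, 58, "JOBTITLE")]
print(project(spans))
```

Applying the same `project` call to both the predictions and the gold annotations is what keeps the comparison symmetric.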

[1] This test case was of course inspired by the famous result from a 1997 data privacy study where graduate student Latanya Sweeney demonstrated that ~87% of the U.S. population could be uniquely identified using only ZIP code, gender, and date of birth. She used this to famously reidentify Governor William Weld’s confidential medical records by linking them with a voter database.