Most redaction and data-protection failures come from mismatches between narrow token-level detectors and varied real-world text. This dataset supplies 300,000 synthetic and curated text examples with span-level PII annotations across six European languages, aiming to reduce that mismatch and give practitioners a single, sizable benchmark for token-level masking workflows.
What Sets It Apart
- Scope and scale: at 300k annotated examples, it is one of the larger public PII masking corpora, sized for multilingual model training and evaluation rather than for one-off test suites.
- Multilingual coverage: includes English, French, German, Italian, Spanish, and Dutch, which is useful when evaluating cross-lingual transfer and multilingual token classifiers.
- Task-oriented annotations: annotations are provided at the span/token level for masking/redaction tasks, making the dataset directly usable for token-classification and sequence-labeling pipelines.
- Reproducible citation: the dataset has a DOI (10.57967/hf/1995), which makes it easy to reference in academic and industry work.
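To make the span-level annotations mentioned above concrete, here is a minimal sketch of how character-offset spans could be turned into masked text and into BIO tags for a token classifier. The `(start, end, label)` tuple layout, the label names `NAME` and `EMAIL`, and the whitespace tokenizer are assumptions made for illustration; they are not taken from the dataset's actual schema.

```python
import re
from typing import List, Tuple

# Hypothetical span layout: (start, end, label) with an exclusive end offset.
Span = Tuple[int, int, str]


def find_span(text: str, surface: str, label: str) -> Span:
    """Locate the first occurrence of `surface` and return it as a span."""
    start = text.index(surface)
    return (start, start + len(surface), label)


def mask_text(text: str, spans: List[Span]) -> str:
    """Replace each (non-overlapping) span with a [LABEL] placeholder."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def spans_to_bio(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Tag whitespace-delimited tokens with BIO labels based on span overlap."""
    tagged = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            if match.start() < end and match.end() > start:
                # The token that covers the span start opens the entity.
                prefix = "B" if match.start() <= start else "I"
                tag = f"{prefix}-{label}"
                break
        tagged.append((match.group(), tag))
    return tagged


text = "Contact Anna Meier at anna@example.org today."
spans = [
    find_span(text, "Anna Meier", "NAME"),
    find_span(text, "anna@example.org", "EMAIL"),
]
print(mask_text(text, spans))  # Contact [NAME] at [EMAIL] today.
print(spans_to_bio(text, spans))
```

A real pipeline would normally align spans to subword tokens using the tokenizer's offset mapping instead of splitting on whitespace; the whitespace split only keeps the sketch short.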
Who It's For and Trade-offs
Great fit if you need a large, multilingual training or benchmark set for PII detection/masking, or to measure model robustness across languages and domains. It works well for token-classification models, data-privacy tooling, and pre-/post-processing evaluation in redaction pipelines. Look elsewhere if you need human-verified legal adjudication of what counts as PII (this dataset mixes synthetic and curated examples and does not replace legal review), or if your target language set falls outside the six covered languages.
Where It Fits
Use this dataset to pretrain or fine-tune token-level PII detectors, to evaluate masking strategies in data pipelines, or to benchmark LLM prompt-based redaction against supervised baselines. It complements smaller, hand-curated legal PII corpora and larger general-purpose corpora by providing focused, span-level redaction annotations.
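One way to evaluate a masking strategy against the span annotations is exact-match span precision, recall, and F1. The sketch below assumes gold and predicted spans are available as `(start, end, label)` tuples; that layout and the label names are hypothetical, and exact matching is only one of several reasonable scoring choices (partial-overlap scoring is another).

```python
from typing import Dict, Set, Tuple

# Hypothetical span layout: (start, end, label) with an exclusive end offset.
Span = Tuple[int, int, str]


def span_prf(gold: Set[Span], pred: Set[Span]) -> Dict[str, float]:
    """Exact-match span scores: a prediction counts only if offsets and label agree."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


gold = {(8, 18, "NAME"), (22, 38, "EMAIL")}
# The second prediction truncates the e-mail span, so it does not count as a match.
pred = {(8, 18, "NAME"), (22, 30, "EMAIL")}
print(span_prf(gold, pred))
```

Exact matching is strict on purpose: a redaction that leaves part of an identifier visible is usually still a leak, so counting truncated spans as misses tends to reflect the privacy risk better than token-level accuracy does.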
How It Was Built (brief)
The creators aggregated synthetic and curated text samples, annotated the spans that correspond to PII categories, and packaged the data in JSON format with metadata and provenance tags. The dataset card lists several task tags (token-classification, QA, summarization, etc.) because the corpus can be repurposed for multiple downstream tasks, but its primary value remains token-level masking.
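Because the data ships as JSON, a quick sanity check on span offsets before training can catch malformed records. The sketch below is illustrative only: the field names `text`, `language`, `spans`, `start`, `end`, and `label` are hypothetical stand-ins, so check them against the dataset card before reusing this.

```python
import json
from typing import Any, Dict, List


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one record's span offsets."""
    problems = []
    text = record["text"]
    previous_end = 0
    for span in sorted(record["spans"], key=lambda s: s["start"]):
        start, end = span["start"], span["end"]
        if not (0 <= start < end <= len(text)):
            problems.append(f"span {start}-{end} is out of bounds")
            continue
        if start < previous_end:
            problems.append(f"span {start}-{end} overlaps an earlier span")
        previous_end = max(previous_end, end)
    return problems


# A made-up record using the hypothetical field names described above.
raw = (
    '{"text": "Ruf mich an: 030 1234567", "language": "de", '
    '"spans": [{"start": 13, "end": 24, "label": "PHONE"}]}'
)
record = json.loads(raw)
print(validate_record(record))  # []
```

An empty list means every span lies inside the text and no two spans overlap; anything else is worth inspecting before the record reaches a tokenizer.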
Overall, the dataset is a practical benchmark for teams building automated PII redaction and detection systems. It is valuable for multilingual evaluation and model comparison, but it is not a substitute for legal or domain-specific human review.
