TheoDB/french-pii-eval
Source: Hugging Face. Updated 2026-04-27, indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/TheoDB/french-pii-eval
Description:
---
language:
- fr
- en
task_categories:
- token-classification
tags:
- pii
- privacy
- french
- evaluation
- NER
size_categories:
- 10K<n<100K
---
# French PII Evaluation Dataset
A curated dataset for training and evaluating French PII detection, built for benchmarking [TheoDB/privacy-filter-fr](https://huggingface.co/TheoDB/privacy-filter-fr).
## Dataset Structure
| Split | Examples | Purpose |
|-------|----------|---------|
| `test.jsonl` | 2,500 | **Held-out evaluation** — never used in training |
| `test_english.jsonl` | 426 | English regression check |
| `train.jsonl` | 57,248 | Training data |
| `val.jsonl` | 500 | Validation data |
## Label Taxonomy
The dataset uses 8 PII classes (the same taxonomy as openai/privacy-filter):
| Class | Test Spans |
|-------|-----------|
| private_address | 1,827 |
| private_person | 1,579 |
| account_number | 991 |
| private_date | 493 |
| secret | 454 |
| private_email | 249 |
| private_url | 190 |
| private_phone | 185 |
All classes have ≥100 spans in the test set (minimum class support requirement).
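The per-class support check can be reproduced with a short script. This is a minimal sketch, not the author's tooling: it assumes each record follows the format documented in this card, where every key of `spans` has the form `label: surface text` and maps to a list of offset pairs. The function names `count_spans` and `classes_below` are hypothetical.

```python
from collections import Counter

# The 8 classes listed in the table above.
PII_CLASSES = [
    "private_address", "private_person", "account_number", "private_date",
    "secret", "private_email", "private_url", "private_phone",
]

def count_spans(records):
    """Count spans per class across an iterable of parsed records."""
    counts = Counter()
    for record in records:
        for key, offsets in record["spans"].items():
            # Assumption: the label is everything before the first ": ".
            label = key.partition(": ")[0]
            counts[label] += len(offsets)
    return counts

def classes_below(counts, classes=PII_CLASSES, minimum=100):
    """Return the classes whose span count is under the minimum support."""
    return [name for name in classes if counts[name] < minimum]
```

To run it on a split, parse each line of the file with `json.loads` and pass the resulting list to `count_spans`.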
## Format
OPF JSONL format compatible with `opf eval`:
```json
{
  "text": "Bonjour Koby, rappel pour votre bilan...",
  "spans": {
    "private_person: Koby": [[8, 12]],
    "private_date: 19/03/2007": [[60, 70]]
  },
  "info": {
    "id": "abc123",
    "source": "TypicaAI/pii-masking-60k_fr",
    "language": "fr"
  }
}
```
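A record in this format can be unpacked and sanity-checked in a few lines. The sketch below assumes that offsets are character positions with an exclusive end, which is consistent with `Koby` at `[8, 12]` in the example above but is not stated explicitly in this card. The helper names `iter_spans` and `check_record` and the sample sentence are invented for illustration.

```python
import json

def iter_spans(record):
    """Yield (label, surface, start, end) for every span in one record."""
    for key, offsets in record["spans"].items():
        # Keys look like "private_person: Koby"; split on the first ": ".
        label, _, surface = key.partition(": ")
        for start, end in offsets:
            yield label, surface, start, end

def check_record(record):
    """Return the spans whose offsets do not reproduce the surface text."""
    text = record["text"]
    return [
        span for span in iter_spans(record)
        if text[span[2]:span[3]] != span[1]
    ]

# Hypothetical one-line JSONL record, made up for this example.
line = json.dumps({
    "text": "Bonjour Koby, rendez-vous le 19/03/2007.",
    "spans": {
        "private_person: Koby": [[8, 12]],
        "private_date: 19/03/2007": [[29, 39]],
    },
    "info": {"id": "example", "source": "illustration", "language": "fr"},
})
record = json.loads(line)
mismatches = check_record(record)
```

An empty `mismatches` list means every offset pair slices out exactly the surface text named in its key.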
## Sources
- [TypicaAI/pii-masking-60k_fr](https://huggingface.co/datasets/TypicaAI/pii-masking-60k_fr) — French-only PII data
- [Isotonic/pii-masking-200k](https://huggingface.co/datasets/Isotonic/pii-masking-200k) — Multilingual, filtered for French
- [DataikuNLP/kiji-pii-training-data](https://huggingface.co/datasets/DataikuNLP/kiji-pii-training-data) — Multilingual with coreference, filtered for French
## Label Mapping
Labels from source datasets are mapped to the 8-class taxonomy. See [label_mapping.md](https://huggingface.co/TheoDB/privacy-filter-fr/blob/main/reports/label_mapping.md) for the full mapping documentation.
## Construction
1. Loaded all French data from the 3 sources
2. Deduplicated by text hash, leaving 60,248 unique examples
3. Built a stratified test set with ≥100 spans per class
4. Used random seed 42
5. Kept the test set strictly held out: it is never used for training or hyperparameter tuning
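The deduplication and seeding steps above could look roughly like the following. This is a sketch under stated assumptions rather than the actual build script: the card does not say which hash function was used, so SHA-256 over the UTF-8 text is an assumption, and the stratified selection itself is not reproduced because its procedure is not described here. `dedupe_by_text` and `seeded_shuffle` are hypothetical names.

```python
import hashlib
import random

def dedupe_by_text(records):
    """Keep the first record for each distinct text, compared by hash."""
    seen = set()
    unique = []
    for record in records:
        # Assumption: SHA-256 of the UTF-8 encoded text as the dedup key.
        digest = hashlib.sha256(record["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique

def seeded_shuffle(records, seed=42):
    """Return a shuffled copy; the same seed always gives the same order."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

Using a dedicated `random.Random(seed)` instance keeps the shuffle reproducible without touching the global random state.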
Provider:
TheoDB



