TheoDB/french-pii-eval

Name: TheoDB/french-pii-eval
Creator: TheoDB
Published: 2026-04-27 15:09:02
License: not specified

Hugging Face · Updated 2026-04-27 · Indexed 2026-05-03

Download link:

https://hf-mirror.com/datasets/TheoDB/french-pii-eval


Description:

---
language:
- fr
- en
task_categories:
- token-classification
tags:
- pii
- privacy
- french
- evaluation
- NER
size_categories:
- 10K<n<100K
---

# French PII Evaluation Dataset

A curated French PII detection evaluation and training dataset, built for benchmarking [TheoDB/privacy-filter-fr](https://huggingface.co/TheoDB/privacy-filter-fr).

## Dataset Structure

| Split | Examples | Purpose |
|-------|----------|---------|
| `test.jsonl` | 2,500 | **Held-out evaluation**, never used in training |
| `test_english.jsonl` | 426 | English regression check |
| `train.jsonl` | 57,248 | Training data |
| `val.jsonl` | 500 | Validation data |

## Label Taxonomy

8 PII classes (same as openai/privacy-filter):

| Class | Test Spans |
|-------|-----------|
| private_address | 1,827 |
| private_person | 1,579 |
| account_number | 991 |
| private_date | 493 |
| secret | 454 |
| private_email | 249 |
| private_url | 190 |
| private_phone | 185 |

All classes have ≥100 spans in the test set (minimum class support requirement).

## Format

OPF JSONL format compatible with `opf eval`:

```json
{
  "text": "Bonjour Koby, rappel pour votre bilan...",
  "spans": {
    "private_person: Koby": [[8, 12]],
    "private_date: 19/03/2007": [[60, 70]]
  },
  "info": {
    "id": "abc123",
    "source": "TypicaAI/pii-masking-60k_fr",
    "language": "fr"
  }
}
```

## Sources

- [TypicaAI/pii-masking-60k_fr](https://huggingface.co/datasets/TypicaAI/pii-masking-60k_fr): French-only PII data
- [Isotonic/pii-masking-200k](https://huggingface.co/datasets/Isotonic/pii-masking-200k): Multilingual, filtered for French
- [DataikuNLP/kiji-pii-training-data](https://huggingface.co/datasets/DataikuNLP/kiji-pii-training-data): Multilingual with coreference, filtered for French

## Label Mapping

Labels from source datasets are mapped to the 8-class taxonomy. See [label_mapping.md](https://huggingface.co/TheoDB/privacy-filter-fr/blob/main/reports/label_mapping.md) for the full mapping documentation.

## Construction

1. Loaded all French data from 3 sources
2. Deduplicated by text hash (60,248 unique examples)
3. Stratified test set construction ensuring ≥100 spans per class
4. Random seed: 42
5. Test set is strictly held out, never used for training or hyperparameter tuning
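
The record format and the deduplication and class-support checks described above can be illustrated with a short Python sketch. It is not the authors' build script. It assumes that each key under `spans` has the form `<label>: <surface text>`, that offsets are `[start, end)` character positions (both inferred from the single example record), and that the text hash is SHA-256, which the card does not state. All function names and the sample records are hypothetical.

```python
import hashlib
import json
from collections import Counter


def parse_records(lines):
    """Parse JSONL lines into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def iter_spans(record):
    """Yield (label, surface, start, end) for every span in one record.

    Assumption: keys look like "<label>: <surface text>" and values are
    lists of [start, end) character offsets, as in the card's example.
    """
    for key, offsets in record["spans"].items():
        label, _, surface = key.partition(": ")
        for start, end in offsets:
            yield label, surface, start, end


def check_offsets(record):
    """Return the spans whose offsets do not reproduce the surface text."""
    text = record["text"]
    return [
        (label, surface, start, end)
        for label, surface, start, end in iter_spans(record)
        if text[start:end] != surface
    ]


def count_spans(records):
    """Count spans per label, e.g. to check a minimum class support."""
    counts = Counter()
    for record in records:
        for label, _, _, _ in iter_spans(record):
            counts[label] += 1
    return counts


def dedupe_by_text_hash(records):
    """Keep the first record per distinct text (SHA-256 is an assumption)."""
    seen = set()
    unique = []
    for record in records:
        digest = hashlib.sha256(record["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique


# Hypothetical sample lines standing in for a real split such as test.jsonl.
sample_lines = [
    json.dumps({
        "text": "Bonjour Koby, rappel pour votre bilan.",
        "spans": {"private_person: Koby": [[8, 12]]},
        "info": {"id": "a1", "language": "fr"},
    }),
    json.dumps({
        "text": "Bonjour Koby, rappel pour votre bilan.",  # duplicate text
        "spans": {"private_person: Koby": [[8, 12]]},
        "info": {"id": "a2", "language": "fr"},
    }),
    json.dumps({
        "text": "Contact: jean@example.org",
        "spans": {"private_email: jean@example.org": [[9, 25]]},
        "info": {"id": "a3", "language": "fr"},
    }),
]

records = dedupe_by_text_hash(parse_records(sample_lines))
span_counts = count_spans(records)
misaligned = [r["info"]["id"] for r in records if check_offsets(r)]
```

To run the same checks on a real split, pass the lines of an opened file (for example `open("test.jsonl", encoding="utf-8")`) to `parse_records`, then compare `count_spans` against the minimum of 100 spans per class.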

Provider:

TheoDB
