
TheoDB/french-pii-eval

Source: Hugging Face. Updated 2026-04-27; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/TheoDB/french-pii-eval
Resource description:
---
language:
- fr
- en
task_categories:
- token-classification
tags:
- pii
- privacy
- french
- evaluation
- NER
size_categories:
- 10K<n<100K
---

# French PII Evaluation Dataset

A curated French PII detection evaluation and training dataset, built for benchmarking [TheoDB/privacy-filter-fr](https://huggingface.co/TheoDB/privacy-filter-fr).

## Dataset Structure

| Split | Examples | Purpose |
|-------|----------|---------|
| `test.jsonl` | 2,500 | **Held-out evaluation**: never used in training |
| `test_english.jsonl` | 426 | English regression check |
| `train.jsonl` | 57,248 | Training data |
| `val.jsonl` | 500 | Validation data |

## Label Taxonomy

8 PII classes (same as openai/privacy-filter):

| Class | Test Spans |
|-------|-----------|
| private_address | 1,827 |
| private_person | 1,579 |
| account_number | 991 |
| private_date | 493 |
| secret | 454 |
| private_email | 249 |
| private_url | 190 |
| private_phone | 185 |

All classes have ≥100 spans in the test set (minimum class support requirement).

## Format

OPF JSONL format compatible with `opf eval`:

```json
{
  "text": "Bonjour Koby, rappel pour votre bilan...",
  "spans": {
    "private_person: Koby": [[8, 12]],
    "private_date: 19/03/2007": [[60, 70]]
  },
  "info": {
    "id": "abc123",
    "source": "TypicaAI/pii-masking-60k_fr",
    "language": "fr"
  }
}
```

## Sources

- [TypicaAI/pii-masking-60k_fr](https://huggingface.co/datasets/TypicaAI/pii-masking-60k_fr): French-only PII data
- [Isotonic/pii-masking-200k](https://huggingface.co/datasets/Isotonic/pii-masking-200k): Multilingual, filtered for French
- [DataikuNLP/kiji-pii-training-data](https://huggingface.co/datasets/DataikuNLP/kiji-pii-training-data): Multilingual with coreference, filtered for French

## Label Mapping

Labels from source datasets are mapped to the 8-class taxonomy. See [label_mapping.md](https://huggingface.co/TheoDB/privacy-filter-fr/blob/main/reports/label_mapping.md) for the full mapping documentation.

## Construction

1. Loaded all French data from 3 sources
2. Deduplicated by text hash (60,248 unique examples)
3. Stratified test set construction ensuring ≥100 spans per class
4. Random seed: 42
5. Test set is strictly held out: never used for training or hyperparameter tuning
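
## Example: Checking Span Offsets

The record layout shown under Format can be sanity-checked with a few lines of Python. This is a minimal sketch, not part of the dataset tooling, and it rests on assumptions inferred from the single example record: each key in `spans` has the form `"<label>: <surface text>"`, and each offset pair is a character range `[start, end)` into `text` with an exclusive end. The helper names (`iter_spans`, `find_mismatches`, `count_by_class`, `load_jsonl`) and the `sample` record are hypothetical and exist only for illustration.

```python
import json
from collections import Counter


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def iter_spans(record):
    """Yield (label, surface, start, end) for every span in one record."""
    for key, offsets in record["spans"].items():
        # Assumed key layout: "<label>: <surface>". Split on the first ": "
        # only, because the surface text may itself contain colons (URLs).
        label, _, surface = key.partition(": ")
        for start, end in offsets:
            yield label, surface, start, end


def find_mismatches(record):
    """Return spans whose offsets do not select the stated surface text."""
    text = record["text"]
    return [
        (label, surface, start, end)
        for label, surface, start, end in iter_spans(record)
        if text[start:end] != surface  # assumes end-exclusive char offsets
    ]


def count_by_class(records):
    """Count spans per label across an iterable of records."""
    counts = Counter()
    for record in records:
        for label, _surface, _start, _end in iter_spans(record):
            counts[label] += 1
    return counts


# Hypothetical record written for this sketch (not taken from the dataset).
sample = {
    "text": "Bonjour Koby, rendez-vous le 19/03/2007.",
    "spans": {
        "private_person: Koby": [[8, 12]],
        "private_date: 19/03/2007": [[29, 39]],
    },
    "info": {"id": "example", "source": "illustration", "language": "fr"},
}

if __name__ == "__main__":
    print(find_mismatches(sample))
    print(count_by_class([sample]))
```

Running `count_by_class(load_jsonl("test.jsonl"))` on a local copy of the test split would give per-class totals to compare against the Label Taxonomy table, provided the assumptions above hold for the real files.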
Provided by:
TheoDB