
Pritesh-2711/pii-bench

Source: Hugging Face. Updated 2026-04-20. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/Pritesh-2711/pii-bench
Resource description:
---
license: apache-2.0
pretty_name: PIIBench
task_categories:
- token-classification
language:
- en
tags:
- pii
- ner
- privacy
- benchmark
size_categories:
- 1M<n<10M
---

# PIIBench

## Description

PIIBench is a unified benchmark dataset for PII detection across multiple domains.

## Paper

- arXiv: http://arxiv.org/abs/2604.15776

## Dataset Summary

- Total records: ~1.39M
- Entity types: 48
- Format: BIO tagging

## Structure

Each example contains:

- `tokens`: list of tokens
- `labels`: BIO labels
- `source`: original data source of the sample

## Splits

- `train.jsonl`
- `validation.jsonl`
- `test.jsonl`

## Source

Ten datasets are downloaded from Hugging Face and consolidated into a unified BIO-tagged format:

| Dataset | Rows | Domain |
|---|---:|---|
| ai4privacy/pii-masking-400k | ~400k | General, 63 PII classes |
| ai4privacy/pii-masking-300k | ~300k | General + Finance (FinPII-80k) |
| gretelai/synthetic_pii_finance_multilingual | ~56k | Finance (100 doc types) |
| nvidia/Nemotron-PII | ~100k | General (50+ industries) |
| wikiann (en) | ~20k | Wikipedia, PER/ORG/LOC only |
| Babelscape/multinerd (en) | varies | Wikipedia + news, 15 types |
| DFKI-SLT/few-nerd | ~188k | Wikipedia, 66 fine-grained types |
| conll2003 | ~14k | News (Reuters), 4 types |
| nlpaueb/finer-139 | ~1.1M | Finance (SEC filings), 139 XBRL tags |
| Isotonic/pii-masking-200k | ~200k | General, 54 PII classes |

`finer-139` is capped at 150k records during data preparation. Entity types with fewer than 500 B-mentions globally are collapsed to `O`.

## License

This dataset is derived from multiple sources. Users must comply with the original dataset licenses of the constituent datasets.

## Citation

```bibtex
@article{jha2026piibench,
  title={PIIBench: A Unified Multi-Source Benchmark Corpus for PII Detection},
  author={Jha, Pritesh},
  year={2026}
}
```
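
The record layout under "Structure" and the rare-type rule under "Source" can be illustrated with a short Python sketch. It assumes each line of a split file is one JSON object with the `tokens`, `labels`, and `source` fields listed above. The helper names (`read_jsonl`, `bio_to_spans`, `collapse_rare_types`) and the label names `PER` and `EMAIL` are hypothetical and chosen only for illustration; the card does not list the 48 entity types, and this is not the author's preparation code.

```python
import json
from collections import Counter


def read_jsonl(lines):
    """Parse an iterable of JSON Lines strings into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def bio_to_spans(tokens, labels):
    """Group BIO labels into (entity_type, start, end_exclusive, text) spans."""
    spans = []
    start, current = None, None
    for i, label in enumerate(labels):
        # An I- tag continues the open span only if its type matches.
        continues = label.startswith("I-") and current == label[2:]
        if current is not None and not continues:
            spans.append((current, start, i, " ".join(tokens[start:i])))
            start, current = None, None
        if label.startswith("B-"):
            start, current = i, label[2:]
        # A stray I- tag with no open span is ignored in this sketch.
    if current is not None:
        spans.append((current, start, len(labels), " ".join(tokens[start:])))
    return spans


def collapse_rare_types(records, min_b_mentions=500):
    """Relabel entity types with fewer than min_b_mentions B- tags as 'O'.

    Counts are taken over all records passed in; inputs are not mutated.
    """
    b_counts = Counter(
        label[2:]
        for record in records
        for label in record["labels"]
        if label.startswith("B-")
    )
    keep = {etype for etype, n in b_counts.items() if n >= min_b_mentions}
    collapsed = []
    for record in records:
        new_labels = [
            label if label == "O" or label[2:] in keep else "O"
            for label in record["labels"]
        ]
        collapsed.append({**record, "labels": new_labels})
    return collapsed


# Hypothetical sample lines standing in for the contents of a split file.
sample = [
    '{"tokens": ["Jane", "Doe", "emailed", "jane@example.com"], '
    '"labels": ["B-PER", "I-PER", "O", "B-EMAIL"], "source": "demo"}',
    '{"tokens": ["Call", "Jane"], "labels": ["O", "B-PER"], "source": "demo"}',
]
records = read_jsonl(sample)
spans = bio_to_spans(records[0]["tokens"], records[0]["labels"])
# A threshold of 2 is used here only so the tiny sample shows the effect.
collapsed = collapse_rare_types(records, min_b_mentions=2)
```

To read a real split, the same `read_jsonl` helper could be given an open file handle, for example `open("train.jsonl", encoding="utf-8")`, since iterating a text file yields one line at a time. The default threshold of 500 mirrors the rule stated in the card, but whether the original pipeline counted mentions before or after capping `finer-139` is not stated there.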
Provider:
Pritesh-2711