575-lab/kiji-inspector-reviewed-pairs
Source: Hugging Face (updated 2026-04-14, indexed 2026-04-26)
Download link: https://hf-mirror.com/datasets/575-lab/kiji-inspector-reviewed-pairs
---
language:
- da
- de
- en
- es
- fr
- nl
license: apache-2.0
task_categories:
- token-classification
task_ids:
- named-entity-recognition
tags:
- pii
- privacy
- ner
- coreference-resolution
- synthetic
pretty_name: Kiji PII Detection Training Data
size_categories:
- 10K<n<100K
---
# Kiji PII Detection Training Data
Synthetic multilingual dataset for training PII (Personally Identifiable Information) detection models with token-level entity annotations and coreference resolution.
## Dataset Summary
| | |
|---|---|
| **Samples** | 99,990 (train: 89,991, test: 9,999) |
| **Languages** | 6 (Dutch, Spanish, German, English, Danish, French) |
| **Countries** | 20 |
| **PII entity types** | 26 |
| **Total entity annotations** | 814,306 (avg 8.1 per sample) |
| **Coreference clusters** | 142,142 (99% of samples) |
| **Text length** | 130–1,149 chars (avg 446) |
## Usage
```python
from datasets import load_dataset
ds = load_dataset("575-lab/kiji-inspector-reviewed-pairs")
# Access a sample
sample = ds["train"][0]
print(sample["text"])
print(sample["privacy_mask"]) # PII entity annotations
print(sample["coreferences"]) # Coreference clusters
print(sample["language"]) # e.g. "English"
print(sample["country"]) # e.g. "United States"
```
## Schema
Each sample contains:
| Column | Type | Description |
|--------|------|-------------|
| `text` | `string` | Natural language text with embedded PII |
| `privacy_mask` | `list[{"value": str, "label": str}]` | PII entities, each given as its surface text (`value`) and its PII label (`label`); no character offsets are stored |
| `coreferences` | `list[{"mentions": list[str], "entity_type": str, "cluster_id": int}]` | Coreference clusters linking mentions of the same entity |
| `language` | `string` | Language of the text |
| `country` | `string` | Country context for the PII (affects address/ID formats) |
### Example sample
```json
{
  "text": "Contact Dr. Maria Santos at maria.santos@hospital.org or call +1-555-123-4567.",
  "privacy_mask": [
    {"value": "Maria", "label": "FIRSTNAME"},
    {"value": "Santos", "label": "SURNAME"},
    {"value": "maria.santos@hospital.org", "label": "EMAIL"},
    {"value": "+1-555-123-4567", "label": "PHONENUMBER"}
  ],
  "coreferences": [
    {
      "mentions": ["Dr. Maria Santos", "maria.santos"],
      "entity_type": "FIRSTNAME",
      "cluster_id": 0
    }
  ],
  "language": "English",
  "country": "United States"
}
```
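Because `privacy_mask` stores entity text rather than offsets, token-level training usually needs character spans first. The following is a minimal sketch of one way to recover them, not the dataset authors' method. The helper name `find_entity_spans` is hypothetical, and it assumes each `value` appears verbatim in `text`, usually in annotation order.

```python
def find_entity_spans(text, privacy_mask):
    """Return (start, end, label) character spans for annotated values.

    Assumption: values occur verbatim in the text. The search moves left
    to right so that repeated values map to successive occurrences.
    """
    spans = []
    cursor = 0
    for entity in privacy_mask:
        value, label = entity["value"], entity["label"]
        start = text.find(value, cursor)
        if start == -1:
            # Annotation order may differ from text order; retry from the start.
            start = text.find(value)
        if start == -1:
            continue  # not found verbatim; skip rather than guess
        end = start + len(value)
        spans.append((start, end, label))
        cursor = end
    return spans


# The example sample shown above.
sample = {
    "text": "Contact Dr. Maria Santos at maria.santos@hospital.org or call +1-555-123-4567.",
    "privacy_mask": [
        {"value": "Maria", "label": "FIRSTNAME"},
        {"value": "Santos", "label": "SURNAME"},
        {"value": "maria.santos@hospital.org", "label": "EMAIL"},
        {"value": "+1-555-123-4567", "label": "PHONENUMBER"},
    ],
}

spans = find_entity_spans(sample["text"], sample["privacy_mask"])
for start, end, label in spans:
    print(label, start, end, repr(sample["text"][start:end]))
```

The resulting spans can then be aligned to tokenizer offsets to produce per-token labels.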
## PII Labels
| Label | Count |
|-------|------:|
| `FIRSTNAME` | 86,136 |
| `CITY` | 80,194 |
| `BUILDINGNUM` | 76,268 |
| `SURNAME` | 74,278 |
| `STREET` | 71,946 |
| `ZIP` | 55,054 |
| `STATE` | 43,194 |
| `COUNTRY` | 22,728 |
| `COMPANYNAME` | 16,036 |
| `DATEOFBIRTH` | 15,114 |
| `PHONENUMBER` | 13,740 |
| `EMAIL` | 13,322 |
| `DRIVERLICENSENUM` | 13,078 |
| `SSN` | 12,886 |
| `SECURITYTOKEN` | 12,816 |
| `PASSPORTID` | 12,702 |
| `IBAN` | 12,688 |
| `PASSWORD` | 12,586 |
| `NATIONALID` | 12,510 |
| `TAXNUM` | 12,346 |
| `LICENSEPLATENUM` | 12,320 |
| `IDCARDNUM` | 12,140 |
| `URL` | 12,096 |
| `AGE` | 11,610 |
| `CREDITCARDNUMBER` | 7,130 |
| `USERNAME` | 6,468 |
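For token classification, the 26 labels above are typically expanded into a tag vocabulary. The sketch below assumes a BIO tagging scheme, which the dataset card does not prescribe; the function name `build_bio_label_map` is hypothetical.

```python
# The 26 PII labels listed in the table above.
PII_LABELS = [
    "FIRSTNAME", "CITY", "BUILDINGNUM", "SURNAME", "STREET", "ZIP", "STATE",
    "COUNTRY", "COMPANYNAME", "DATEOFBIRTH", "PHONENUMBER", "EMAIL",
    "DRIVERLICENSENUM", "SSN", "SECURITYTOKEN", "PASSPORTID", "IBAN",
    "PASSWORD", "NATIONALID", "TAXNUM", "LICENSEPLATENUM", "IDCARDNUM",
    "URL", "AGE", "CREDITCARDNUMBER", "USERNAME",
]


def build_bio_label_map(labels):
    """Build label2id / id2label dicts for an assumed BIO tagging scheme."""
    tags = ["O"]  # "outside" tag for non-PII tokens
    for label in labels:
        tags.append(f"B-{label}")  # first token of an entity
        tags.append(f"I-{label}")  # continuation tokens
    label2id = {tag: idx for idx, tag in enumerate(tags)}
    id2label = {idx: tag for tag, idx in label2id.items()}
    return label2id, id2label


label2id, id2label = build_bio_label_map(PII_LABELS)
print(len(label2id))  # 53 = 1 "O" tag + 2 tags for each of the 26 labels
```

These two dictionaries match the shape that token-classification model configs commonly expect.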
## Language Distribution
| Language | Samples | % |
|----------|--------:|--:|
| Spanish | 16,846 | 16.8% |
| German | 16,708 | 16.7% |
| Dutch | 16,704 | 16.7% |
| Danish | 16,692 | 16.7% |
| English | 16,546 | 16.5% |
| French | 16,494 | 16.5% |
## Country Distribution
| Country | Samples | % |
|---------|--------:|--:|
| Denmark | 16,692 | 16.7% |
| Belgium | 11,654 | 11.7% |
| Switzerland | 8,938 | 8.9% |
| Netherlands | 8,418 | 8.4% |
| Canada | 5,924 | 5.9% |
| Germany | 5,574 | 5.6% |
| Austria | 5,512 | 5.5% |
| Luxembourg | 3,304 | 3.3% |
| France | 3,296 | 3.3% |
| Peru | 2,890 | 2.9% |
| Mexico | 2,886 | 2.9% |
| Colombia | 2,824 | 2.8% |
| New Zealand | 2,810 | 2.8% |
| United Kingdom | 2,804 | 2.8% |
| Chile | 2,790 | 2.8% |
| *(5 other countries)* | 13,674 | 13.7% |
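Since country context affects address and ID formats, it can be useful to slice the data by `language` and `country`. The sketch below works on plain dictionaries so it runs without downloading anything; the rows are made-up placeholders, and the helpers `count_by` and `select` are hypothetical names. With a loaded `datasets.Dataset`, the same predicate could be passed to `Dataset.filter`.

```python
from collections import Counter


def count_by(rows, key):
    """Count samples per distinct value of a column such as 'country'."""
    return Counter(row[key] for row in rows)


def select(rows, language=None, country=None):
    """Keep rows matching the given language and/or country (None = any)."""
    return [
        row for row in rows
        if (language is None or row["language"] == language)
        and (country is None or row["country"] == country)
    ]


# Placeholder rows for illustration only; not taken from the dataset.
rows = [
    {"text": "...", "language": "Danish", "country": "Denmark"},
    {"text": "...", "language": "French", "country": "Belgium"},
    {"text": "...", "language": "Dutch", "country": "Belgium"},
    {"text": "...", "language": "French", "country": "Switzerland"},
]

print(count_by(rows, "country"))
print(len(select(rows, language="French", country="Belgium")))
```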
## Data Generation
Samples are synthetically generated using LLMs with structured outputs. The generation pipeline:
1. **NER generation** — LLM produces text with embedded PII and entity annotations
2. **Coreference generation** — second pass links pronouns and references to their antecedent entities
3. **Review (optional)** — additional LLM pass validates and corrects annotations
4. **Format conversion** — samples are converted to a clean, standardized schema
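The review step described above is an LLM pass whose details are not given here. As a complement, a simple rule-based sanity check can flag obviously inconsistent annotations before training. This is a sketch of such a check, not the pipeline's actual review logic; `check_sample` and the abbreviated `KNOWN_LABELS` set are illustrative assumptions.

```python
# Abbreviated label set for the demo; in practice use all 26 labels.
KNOWN_LABELS = {"FIRSTNAME", "SURNAME", "EMAIL", "PHONENUMBER"}


def check_sample(sample, known_labels):
    """Return a list of human-readable problems found in one sample."""
    problems = []
    text = sample["text"]
    for entity in sample.get("privacy_mask", []):
        if entity["value"] not in text:
            problems.append(f"entity value not in text: {entity['value']!r}")
        if entity["label"] not in known_labels:
            problems.append(f"unknown label: {entity['label']!r}")
    for cluster in sample.get("coreferences", []):
        for mention in cluster["mentions"]:
            if mention not in text:
                problems.append(f"mention not in text: {mention!r}")
    return problems


good = {
    "text": "Contact Dr. Maria Santos at maria.santos@hospital.org.",
    "privacy_mask": [{"value": "Maria", "label": "FIRSTNAME"}],
    "coreferences": [
        {"mentions": ["Dr. Maria Santos", "maria.santos"],
         "entity_type": "FIRSTNAME", "cluster_id": 0}
    ],
}
bad = {
    "text": "Contact Dr. Maria Santos.",
    "privacy_mask": [{"value": "Bob", "label": "NICKNAME"}],
    "coreferences": [],
}

print(check_sample(good, KNOWN_LABELS))
print(check_sample(bad, KNOWN_LABELS))
```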
## Intended Use
This dataset is designed for training token-classification models that detect and classify PII in text. The coreference annotations enable training models that can also resolve entity mentions (e.g., linking "he" back to "John Smith").
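To illustrate how the coreference clusters might be consumed, the sketch below maps every mention in a cluster to a representative mention. It assumes the first mention in each cluster is the antecedent, which the dataset card does not guarantee, and `build_mention_index` is a hypothetical helper. The toy cluster mirrors the "he" and "John Smith" example above and is not taken from the dataset.

```python
def build_mention_index(coreferences):
    """Map each mention string to the first mention of its cluster.

    Assumption: the first mention in a cluster is the antecedent.
    If the same string appears in several clusters, the first one wins,
    because mentions are plain strings without positions.
    """
    index = {}
    for cluster in coreferences:
        mentions = cluster["mentions"]
        if not mentions:
            continue
        canonical = mentions[0]
        for mention in mentions:
            index.setdefault(mention, canonical)
    return index


# Toy cluster for illustration only.
coreferences = [
    {"mentions": ["John Smith", "he", "his"],
     "entity_type": "FIRSTNAME", "cluster_id": 0}
]

index = build_mention_index(coreferences)
print(index["he"])
```

Because mentions carry no offsets, a pronoun that refers to different people within one text cannot be disambiguated by string lookup alone.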
## Limitations
- All data is **synthetically generated** — entity distributions may not match real-world text
- Coreference annotations are LLM-generated and may contain errors
- Address and ID formats are country-specific but may not cover all regional variations
Provider: 575-lab



