jellewas/ai4privacy-pii-nl
Source: Hugging Face. Updated 2026-03-26; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/jellewas/ai4privacy-pii-nl
Dataset description:
---
license: apache-2.0
language:
- nl
tags:
- pii
- dutch
- privacy
- ner
- benchmark
- gdpr
size_categories:
- 10K<n<100K
---
# ai4privacy Dutch PII samples
Dutch-language samples extracted from [ai4privacy/pii-masking-300k](https://huggingface.co/datasets/ai4privacy/pii-masking-300k).
Used as evaluation data for [pii-nl-bench](https://huggingface.co/spaces/jellewas/pii-nl-bench).
## Files
- `ai4privacy_nl_validation.jsonl` — **6829 samples** (use for evaluation)
- `ai4privacy_nl_train.jsonl` — **25930 samples** (use for training only)
- `ai4privacy_nl.jsonl` — combined (both splits)
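If you start from the combined file, the `split` field described under Schema can be used to keep training and evaluation data apart. The following is a minimal sketch, not part of the dataset tooling: `group_by_split` and the `demo_lines` records are hypothetical, and the value `"train"` for the training split is an assumption (the card only shows `"validation"`).

```python
import json
from collections import defaultdict


def group_by_split(lines):
    """Group JSONL records by their `split` field."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        sample = json.loads(line)
        groups[sample["split"]].append(sample)
    return dict(groups)


# Hand-made records standing in for lines of ai4privacy_nl.jsonl (hypothetical).
demo_lines = [
    '{"id": "demo_0", "split": "train", "annotations": []}',
    '{"id": "demo_1", "split": "validation", "annotations": []}',
    '{"id": "demo_2", "split": "train", "annotations": []}',
]

groups = group_by_split(demo_lines)
print({name: len(items) for name, items in groups.items()})
```

With the real file, pass the open file handle instead of `demo_lines`.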
## Validation label distribution
| Label | Count |
|-------|-------|
| TIME | 4033 |
| USERNAME | 2830 |
| IDCARD | 2750 |
| SOCIALNUMBER | 2685 |
| DRIVERLICENSE | 2645 |
| EMAIL | 2635 |
| PASSPORT | 2562 |
| LASTNAME1 | 2476 |
| IP | 2294 |
| BOD | 2273 |
| SEX | 2180 |
| GIVENNAME1 | 2135 |
| POSTCODE | 2061 |
| TEL | 2023 |
| STREET | 2007 |
| CITY | 1986 |
| BUILDING | 1962 |
| STATE | 1939 |
| DATE | 1849 |
| COUNTRY | 1756 |
| TITLE | 1727 |
| PASS | 1664 |
| SECADDRESS | 820 |
| LASTNAME2 | 627 |
| GIVENNAME2 | 541 |
| GEOCOORD | 239 |
| LASTNAME3 | 213 |
| CARDISSUER | 2 |
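A table like the one above can be recomputed from loaded samples. This is a minimal sketch that assumes each count is the number of annotated spans carrying that label; the card does not say whether spans or samples are counted. `label_distribution` and the `demo` records are illustrative names, not part of the dataset.

```python
from collections import Counter


def label_distribution(samples):
    """Count annotation labels (one count per annotated span)."""
    counts = Counter()
    for sample in samples:
        for ann in sample.get("annotations", []):
            counts[ann["label"]] += 1
    return counts


# Tiny hand-made samples (hypothetical, not taken from the dataset).
demo = [
    {"annotations": [{"label": "TIME"}, {"label": "EMAIL"}]},
    {"annotations": [{"label": "TIME"}]},
    {"annotations": []},
]

# Print rows in the same Markdown table layout, most frequent label first.
for label, count in label_distribution(demo).most_common():
    print(f"| {label} | {count} |")
```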
## Schema
Each line is a JSON object:
```json
{
  "id": "ai4privacy_nl_validation_00000",
  "text": "the source text with PII",
  "language": "nl",
  "source": "ai4privacy",
  "split": "validation",
  "annotations": [
    {"start": 10, "end": 25, "text": "Jan de Vries", "label": "FIRSTNAME"}
  ]
}
```
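The annotation offsets can be used to check spans or to mask PII in the text. The sketch below assumes `start` is inclusive and `end` is exclusive (Python slice semantics) and that spans do not overlap; the card states neither, so verify against real samples first. The helper names and the `sample` record are hypothetical.

```python
def check_annotation(sample, ann):
    """Return True if the offsets reproduce the stored span text.

    Assumption: `start` is inclusive and `end` is exclusive.
    """
    return sample["text"][ann["start"]:ann["end"]] == ann["text"]


def mask_text(sample):
    """Replace each annotated span with a [LABEL] placeholder."""
    text = sample["text"]
    # Work right to left so earlier offsets stay valid after each replacement.
    for ann in sorted(sample["annotations"], key=lambda a: a["start"], reverse=True):
        text = text[:ann["start"]] + f"[{ann['label']}]" + text[ann["end"]:]
    return text


# Hand-made sample following the schema above (hypothetical, not from the dataset).
sample = {
    "id": "demo_00000",
    "text": "Mijn naam is Jan en ik woon in Utrecht.",
    "language": "nl",
    "source": "ai4privacy",
    "split": "validation",
    "annotations": [
        {"start": 13, "end": 16, "text": "Jan", "label": "GIVENNAME1"},
        {"start": 31, "end": 38, "text": "Utrecht", "label": "CITY"},
    ],
}

print(all(check_annotation(sample, ann) for ann in sample["annotations"]))
print(mask_text(sample))  # Mijn naam is [GIVENNAME1] en ik woon in [CITY].
```

If `check_annotation` fails on real data, the offset convention differs from the one assumed here and `mask_text` needs adjusting accordingly.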
## Usage
```python
import json
from huggingface_hub import hf_hub_download

# Download the validation file from the dataset repo.
path = hf_hub_download(repo_id="jellewas/ai4privacy-pii-nl", filename="ai4privacy_nl_validation.jsonl", repo_type="dataset")

# Parse one JSON object per line.
with open(path, encoding="utf-8") as f:
    samples = [json.loads(line) for line in f]

print(f"Loaded {len(samples)} Dutch PII validation samples")
```
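Once the validation samples are loaded, predictions can be compared against the gold annotations. The following is one possible scoring sketch using exact-match spans; it is not necessarily how pii-nl-bench computes its scores. `span_prf` and the `gold` and `predicted` lists are hypothetical.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold_set = {(a["start"], a["end"], a["label"]) for a in gold}
    pred_set = {(a["start"], a["end"], a["label"]) for a in predicted}
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hand-made gold and predicted spans for one sample (hypothetical).
gold = [
    {"start": 13, "end": 16, "label": "GIVENNAME1"},
    {"start": 31, "end": 38, "label": "CITY"},
]
predicted = [
    {"start": 13, "end": 16, "label": "GIVENNAME1"},  # correct span and label
    {"start": 31, "end": 38, "label": "STATE"},       # right span, wrong label
]

precision, recall, f1 = span_prf(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is strict: a span with the right boundaries but the wrong label counts as both a false positive and a false negative. For corpus-level scores, accumulate the span sets across samples (keyed by sample `id`) rather than averaging per-sample values.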
Provided by:
jellewas



