
jellewas/ai4privacy-pii-nl

Source: Hugging Face. Updated 2026-03-26. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/jellewas/ai4privacy-pii-nl
Description:
---
license: apache-2.0
language:
- nl
tags:
- pii
- dutch
- privacy
- ner
- benchmark
- gdpr
size_categories:
- 1K<n<10K
---

# ai4privacy Dutch PII samples

Dutch-language samples extracted from [ai4privacy/pii-masking-300k](https://huggingface.co/datasets/ai4privacy/pii-masking-300k).

Used as evaluation data for [pii-nl-bench](https://huggingface.co/spaces/jellewas/pii-nl-bench).

## Files

- `ai4privacy_nl_validation.jsonl`: **6829 samples** (use for evaluation)
- `ai4privacy_nl_train.jsonl`: **25930 samples** (use for training only)
- `ai4privacy_nl.jsonl`: combined (both splits)

## Validation label distribution

| Label | Count |
|-------|-------|
| TIME | 4033 |
| USERNAME | 2830 |
| IDCARD | 2750 |
| SOCIALNUMBER | 2685 |
| DRIVERLICENSE | 2645 |
| EMAIL | 2635 |
| PASSPORT | 2562 |
| LASTNAME1 | 2476 |
| IP | 2294 |
| BOD | 2273 |
| SEX | 2180 |
| GIVENNAME1 | 2135 |
| POSTCODE | 2061 |
| TEL | 2023 |
| STREET | 2007 |
| CITY | 1986 |
| BUILDING | 1962 |
| STATE | 1939 |
| DATE | 1849 |
| COUNTRY | 1756 |
| TITLE | 1727 |
| PASS | 1664 |
| SECADDRESS | 820 |
| LASTNAME2 | 627 |
| GIVENNAME2 | 541 |
| GEOCOORD | 239 |
| LASTNAME3 | 213 |
| CARDISSUER | 2 |

## Schema

Each line is a JSON object:

```json
{
  "id": "ai4privacy_nl_validation_00000",
  "text": "the source text with PII",
  "language": "nl",
  "source": "ai4privacy",
  "split": "validation",
  "annotations": [
    {"start": 10, "end": 25, "text": "Jan de Vries", "label": "FIRSTNAME"}
  ]
}
```

## Usage

```python
import json

from huggingface_hub import hf_hub_download

# Download the validation split from the dataset repository (cached locally).
path = hf_hub_download(
    repo_id="jellewas/ai4privacy-pii-nl",
    filename="ai4privacy_nl_validation.jsonl",
    repo_type="dataset",
)

# Each line of the JSONL file is one JSON object; read it as UTF-8.
with open(path, encoding="utf-8") as f:
    samples = [json.loads(line) for line in f]

print(f"Loaded {len(samples)} Dutch PII validation samples")
```
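Once samples are loaded in the schema described above, a label distribution like the one in the table can be recomputed and the annotation spans sanity-checked. The following is a minimal sketch, not part of the dataset's tooling: the helper names `label_counts` and `mismatched_spans` and the `demo` records are hypothetical, and it assumes `start`/`end` are end-exclusive character offsets into `text`, which the card does not state explicitly.

```python
from collections import Counter


def label_counts(samples):
    """Count how often each annotation label occurs across all samples."""
    return Counter(
        ann["label"]
        for sample in samples
        for ann in sample.get("annotations", [])
    )


def mismatched_spans(sample):
    """Return annotations whose text differs from sample["text"][start:end].

    Assumption: start/end are end-exclusive character offsets.
    """
    return [
        ann
        for ann in sample.get("annotations", [])
        if sample["text"][ann["start"]:ann["end"]] != ann["text"]
    ]


# Hand-made records in the card's schema (illustrative only, not real data).
demo = [
    {
        "id": "demo_0",
        "text": "Mail naar jan@example.nl om 10:30.",
        "annotations": [
            {"start": 10, "end": 24, "text": "jan@example.nl", "label": "EMAIL"},
            {"start": 28, "end": 33, "text": "10:30", "label": "TIME"},
        ],
    },
    {
        "id": "demo_1",
        "text": "Bel om 09:15.",
        "annotations": [
            {"start": 7, "end": 12, "text": "09:15", "label": "TIME"},
        ],
    },
]

print(label_counts(demo).most_common())
print([s["id"] for s in demo if mismatched_spans(s)])
```

Passing the `samples` list from the usage snippet instead of `demo` should give counts comparable to the table above; any ids reported by the second call would suggest the offset assumption does not hold for those records.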

Provider: jellewas