tanaos/synthetic-text-anonymizer-dataset-v1

Name: tanaos/synthetic-text-anonymizer-dataset-v1
Creator: tanaos
Published: 2025-12-21 15:12:20
License: No description available

Source: Hugging Face. Updated: 2025-12-21. Indexed: 2026-01-03.

Download link:

https://hf-mirror.com/datasets/tanaos/synthetic-text-anonymizer-dataset-v1

Description:

This dataset was created synthetically by Tanaos with the Artifex Python library. It is designed to train and evaluate Named Entity Recognition (NER) systems for text anonymization, that is, models that identify and redact Personally Identifiable Information (PII) in text. The dataset contains text samples labeled with named entity tags. Each sample consists of a sentence or paragraph whose entities are annotated word by word with one of the following categories: `O` (no entity is present), `PERSON` (individual people, fictional characters), `LOCATION` (geographical areas), `DATE` (absolute or relative dates, including years, months and/or days), `ADDRESS` (full addresses), `PHONE_NUMBER` (telephone numbers). Punctuation and special characters are not labeled.
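To illustrate how word-level labels of this kind can drive redaction, here is a minimal sketch. It assumes a sample is available as two parallel lists, one of words and one of tags. The description above does not state the dataset's actual column names or storage format, so the `redact` helper, the variable names, and the example sentence are all hypothetical.

```python
from typing import List

# The six categories named in the dataset description.
LABELS = {"O", "PERSON", "LOCATION", "DATE", "ADDRESS", "PHONE_NUMBER"}


def redact(words: List[str], tags: List[str]) -> str:
    """Replace each run of identically tagged entity words with one placeholder.

    Hypothetical helper: it assumes one tag per word, which matches the
    word-by-word annotation described above but not necessarily the
    dataset's on-disk schema.
    """
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")

    output: List[str] = []
    previous_tag = "O"
    for word, tag in zip(words, tags):
        if tag not in LABELS:
            raise ValueError(f"unknown tag: {tag}")
        if tag == "O":
            # Non-entity words are kept unchanged.
            output.append(word)
        elif tag != previous_tag:
            # First word of a new entity run: emit a single placeholder.
            output.append(f"[{tag}]")
        # Further words of the same run are dropped.
        previous_tag = tag
    return " ".join(output)


# Made-up example, not taken from the dataset.
words = ["Maria", "Rossi", "moved", "to", "Lisbon", "in", "March", "2021"]
tags = ["PERSON", "PERSON", "O", "O", "LOCATION", "O", "DATE", "DATE"]
redacted = redact(words, tags)
print(redacted)  # [PERSON] moved to [LOCATION] in [DATE]
```

Because the tags listed above carry no begin/inside markers, this sketch cannot tell two adjacent entities of the same type apart and collapses them into one placeholder.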

Provider:

tanaos
