five

BoB14TeamSentinel/sentinel-kr-sensitive-entities-synthetic-v3

收藏
Hugging Face2025-12-17 更新2025-12-20 收录
下载链接:
https://hf-mirror.com/datasets/BoB14TeamSentinel/sentinel-kr-sensitive-entities-synthetic-v3
下载链接
链接失效反馈
官方服务:
资源简介:
**Sentinel KR Sensitive Entities (Synthetic) v3** 是一个韩语的**合成(AI生成)**数据集,用于在DLP / LLM护栏场景中进行**仅白名单敏感实体检测**。该数据集中的所有敏感值(如电话号码、电子邮件、ID、令牌、密钥)都是**由AI人工生成**的,**不来自真实个人、真实事件或收集的私人数据集**。数据集的主要格式为JSON / JSONL(聊天风格的SFT样本),包含系统、用户和助理消息。助理消息包含JSON输出,其中包含检测到的敏感实体及其标签。数据集旨在训练和评估模型,以便在将文本发送到外部LLM服务之前检测敏感信息,为企业环境构建护栏(阻止/警告/屏蔽/编辑),并为韩语敏感实体提取提供基准。数据集不适用于重新识别或跟踪真实个人。数据集采用CC BY 4.0许可发布,使用时需提供适当的署名。

**Sentinel KR Sensitive Entities (Synthetic) v3** is a Korean **synthetic (AI-generated)** dataset for **whitelist-only sensitive-entity detection** in DLP / LLM guardrail scenarios. All sensitive values in this dataset (e.g., phone numbers, emails, IDs, tokens, keys) are **artificially generated by AI** and **do not come from real individuals, real incidents, or collected private datasets**. The primary format is JSON / JSONL (chat-style SFT samples), including system, user, and assistant messages. The assistant messages contain JSON outputs with detected sensitive entities and their labels. The dataset is intended for training and evaluating models to detect sensitive information before sending text to external LLM services, building guardrails (block / warn / mask / redact) for enterprise environments, and benchmarking Korean sensitive-entity extraction. It is not intended for re-identification or tracking real individuals. The dataset is released under the CC BY 4.0 license and requires proper attribution.
提供机构:
BoB14TeamSentinel
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作