CordwainerSmith/GolemGuard

Name: CordwainerSmith/GolemGuard
Creator: CordwainerSmith
Published: 2024-11-04 13:34:51
License: no description available

Source: Hugging Face. Updated: 2024-11-04. Indexed: 2024-12-14.

Download link:

https://hf-mirror.com/datasets/CordwainerSmith/GolemGuard


Resource summary:

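GolemGuard is a Hebrew dataset built for training and evaluating models that detect and mask personally identifiable information (PII). It contains about 600MB of synthetic text covering the document types and communication formats common in Israeli professional and administrative settings. It includes multiple entity types, such as names, addresses, and phone numbers, and is split into training and test sets. The data is stored in JSONL format, where each instance contains the original text, the masked text, the entity positions, and related information. The dataset aims to improve privacy protection in Hebrew text processing and to support compliance with regulations such as GDPR and PPLA.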

GolemGuard is a comprehensive Hebrew language dataset specifically designed for training and evaluating models for Personally Identifiable Information (PII) detection and masking. It contains approximately 600MB of synthetic text, 115,453 examples in JSONL format, representing document types and communication formats commonly found in Israeli professional and administrative contexts. It combines synthetic data generated from multiple authoritative sources, covering names, addresses, and synthetic identifiers. The PII entity types include first names, last names, Israeli ID numbers, phone numbers, dates, street addresses, city names, email addresses, postal codes, bank account numbers, credit card numbers, and credit card providers. The examples are built from 2,607 unique document templates. The dataset is split into a training set of 97,453 instances and a test set of 18,000 instances. Its goal is to improve privacy protection in Hebrew text processing by enabling better PII detection and masking.
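As a rough illustration of how a JSONL record with entity positions could be turned into masked text, here is a minimal Python sketch. The field names `text`, `entities`, `start`, `end`, and `label`, the label strings, and the sample record are all hypothetical and chosen for illustration only; the listing does not document the actual schema, so check the dataset files before relying on any of them. The sketch also assumes that entity positions are character offsets with an exclusive end.

```python
import json

# Hypothetical record layout: "text", "entities", "start", "end", and "label"
# are assumed names for illustration, not the documented GolemGuard schema.
SAMPLE_JSONL = (
    '{"text": "Call Dana at 050-1234567", '
    '"entities": ['
    '{"start": 5, "end": 9, "label": "FIRST_NAME"}, '
    '{"start": 13, "end": 24, "label": "PHONE"}]}\n'
)


def read_jsonl(lines):
    """Parse an iterable of JSONL lines into a list of dicts, skipping blanks."""
    return [json.loads(line) for line in lines if line.strip()]


def mask_text(text, entities):
    """Replace each entity span with a bracketed label placeholder.

    Spans are applied from right to left so that earlier offsets stay valid
    after a replacement changes the string length. Offsets are assumed to be
    character positions with an exclusive end.
    """
    masked = text
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['label']}]"
        masked = masked[:ent["start"]] + placeholder + masked[ent["end"]:]
    return masked


records = read_jsonl(SAMPLE_JSONL.splitlines())
masked = [mask_text(r["text"], r["entities"]) for r in records]
print(masked[0])
```

The sample uses English text only to keep the offsets easy to read. Python strings are indexed by Unicode code point, so the same slicing applies to Hebrew text as long as the stored positions are character offsets rather than byte offsets.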

Provided by:

CordwainerSmith
