tomekkorbak/pile-pii-scrubadub

Name: tomekkorbak/pile-pii-scrubadub
Creator: tomekkorbak
Published: 2023-02-07 15:26:41
License: MIT (per the dataset card metadata)

Source: Hugging Face. Updated: 2023-02-07. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/tomekkorbak/pile-pii-scrubadub

Resource description:

---
annotations_creators:
- machine-generated
language:
- en
language_creators:
- found
license:
- mit
multilinguality:
- monolingual
pretty_name: pile-pii-scrubadub
size_categories:
- 1M<n<10M
source_datasets:
- extended|the_pile
tags:
- pii
- personal
- identifiable
- information
- pretraining-with-human-feedback
task_categories:
- text-classification
- other
task_ids:
- acceptability-classification
- text-scoring
---

# Dataset Card for pile-pii-scrubadub

## Dataset Description

- **Repository:** https://github.com/tomekkorbak/aligned-pretraining-objectives
- **Paper:** arXiv link to be added

### Dataset Summary

This dataset contains text from [The Pile](https://huggingface.co/datasets/the_pile), annotated according to the personally identifiable information (PII) in each sentence. Each document (a row in the dataset) is segmented into sentences, and each sentence is given a score: the percentage of its words that [Scrubadub](https://scrubadub.readthedocs.io/en/stable/) classifies as PII.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

This dataset is taken from [The Pile](https://huggingface.co/datasets/the_pile), which is English text.

## Dataset Structure

### Data Instances

1949977

### Data Fields

- `texts` (sequence): a list of the sentences in the document (segmented using [SpaCy](https://spacy.io/))
- `meta` (dict): the section of [The Pile](https://huggingface.co/datasets/the_pile) from which the document originated
- `scores` (sequence): a score for each sentence in the `texts` column, indicating the percentage of words detected as PII by [Scrubadub](https://scrubadub.readthedocs.io/en/stable/)
- `avg_score` (float64): the average of the scores listed in the `scores` column
- `num_sents` (int64): the number of sentences (and scores) in the document

### Data Splits

Training set only

## Dataset Creation

### Curation Rationale

This is labeled text from [The Pile](https://huggingface.co/datasets/the_pile), a large dataset of text in English. The PII is labeled so that generative language models can be trained to avoid generating PII.

### Source Data

#### Initial Data Collection and Normalization

This is labeled text from [The Pile](https://huggingface.co/datasets/the_pile).

#### Who are the source language producers?

Please see [The Pile](https://huggingface.co/datasets/the_pile) for the source of the dataset.

### Annotations

#### Annotation process

For each sentence, [Scrubadub](https://scrubadub.readthedocs.io/en/stable/) was used to detect:

- email addresses
- addresses and postal codes
- phone numbers
- credit card numbers
- US social security numbers
- vehicle plate numbers
- dates of birth
- URLs
- login credentials

#### Who are the annotators?

[Scrubadub](https://scrubadub.readthedocs.io/en/stable/)

### Personal and Sensitive Information

This dataset contains all PII that was originally contained in [The Pile](https://huggingface.co/datasets/the_pile), with all detected PII annotated.

## Considerations for Using the Data

### Social Impact of Dataset

This dataset contains examples of real PII (conveniently annotated in the text!). Please take care to avoid misusing it or putting anybody in danger by publicizing their information. This dataset is intended for research purposes only. We cannot guarantee that all PII has been detected, and we cannot guarantee that models trained using it will avoid generating PII. We do not recommend deploying models trained on this data.

### Discussion of Biases

This dataset contains all biases from The Pile discussed in their paper: https://arxiv.org/abs/2101.00027

### Other Known Limitations

The PII in this dataset was detected using imperfect automated detection methods. We cannot guarantee that the labels are 100% accurate.

## Additional Information

### Dataset Curators

[The Pile](https://huggingface.co/datasets/the_pile)

### Licensing Information

From [The Pile](https://huggingface.co/datasets/the_pile): PubMed Central: [MIT License](https://github.com/EleutherAI/pile-pubmedcentral/blob/master/LICENSE)

### Citation Information

Paper information to be added

### Contributions

[The Pile](https://huggingface.co/datasets/the_pile)

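Provider:

tomekkorbak

Summary of original information

Dataset overview: pile-pii-scrubadub

Dataset description

Language: English
License: MIT
Multilinguality: monolingual
Dataset size: 1M<n<10M
Source: extended from The Pile
Tags: personally identifiable information (PII), personal, identifiable, information, pretraining with human feedback
Task categories: text classification, other
Task IDs: acceptability classification, text scoring

Dataset summary

This dataset contains text from The Pile, annotated according to the personally identifiable information (PII) in each sentence. Each document (a row in the dataset) is segmented into sentences, and each sentence is assigned a score: the percentage of its words that Scrubadub classifies as PII.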
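
The scoring rule described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the card does not say how words are tokenized, how detector output is mapped onto words, or whether the stored value is a fraction in [0, 1] or scaled to 0-100. The sketch assumes whitespace tokenization, counts a word as PII if it overlaps any detected character span, and returns a fraction. The function `pii_word_fraction` and the hand-written span are hypothetical; in practice the spans would come from a detector such as Scrubadub.

```python
import re


def pii_word_fraction(sentence, pii_spans):
    """Share of whitespace-separated words that overlap any PII span.

    pii_spans: list of (begin, end) character offsets, end exclusive.
    """
    words = [(m.start(), m.end()) for m in re.finditer(r"\S+", sentence)]
    if not words:
        return 0.0
    flagged = sum(
        1
        for w_beg, w_end in words
        if any(w_beg < end and beg < w_end for beg, end in pii_spans)
    )
    return flagged / len(words)


sentence = "Email jane.doe@example.com for details"
email = "jane.doe@example.com"
start = sentence.index(email)
spans = [(start, start + len(email))]  # made-up detector output
score = pii_word_fraction(sentence, spans)  # 1 of 4 words flagged
```

Overlap-based counting is only one possible convention; a stricter variant could require a word to lie entirely inside a span.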

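Dataset structure

Data instances

Total: 1949977

Data fields

texts: the list of sentences in the document (segmented using SpaCy)
meta: the section of The Pile from which the document originated
scores: one score per sentence in the texts column, indicating the percentage of words detected as PII by Scrubadub
avg_score: the average of the scores in the scores column
num_sents: the number of sentences (and scores) in the document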
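
The relationship between these fields can be made concrete with a toy record. The values below are invented for illustration, and the `pile_set_name` key inside `meta` is an assumption based on The Pile's own metadata, not something the card states. The hypothetical helper `check_record` only verifies the invariants that the field descriptions imply: one score per sentence, `num_sents` equal to that count, and `avg_score` equal to the mean of `scores`.

```python
import math
from statistics import fmean


def check_record(record, tol=1e-9):
    """Return True if the derived fields agree with texts and scores."""
    texts, scores = record["texts"], record["scores"]
    if len(texts) != len(scores) or record["num_sents"] != len(scores):
        return False
    expected_avg = fmean(scores) if scores else 0.0
    return math.isclose(record["avg_score"], expected_avg, abs_tol=tol)


# Toy record with made-up values (not taken from the dataset).
record = {
    "texts": ["Email jane.doe@example.com for details.", "Thanks for reading."],
    "meta": {"pile_set_name": "PubMed Central"},  # key name is an assumption
    "scores": [0.25, 0.0],
    "avg_score": 0.125,
    "num_sents": 2,
}
ok = check_record(record)
```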

Data splits

Splits: training set only
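
Because only a training split is published, loading reduces to requesting `split="train"`. The sketch below assumes the Hugging Face `datasets` library is installed and that the dataset is reachable under the identifier shown at the top of this page; streaming is used so that rows are not all downloaded up front. The helper `low_pii_documents` and its default threshold are hypothetical: filtering on `avg_score` is one simple way to use the labels, not the method of the dataset authors.

```python
def load_train_stream():
    """Open the train split as a stream (needs network and `datasets`)."""
    from datasets import load_dataset  # third-party package, imported lazily

    return load_dataset(
        "tomekkorbak/pile-pii-scrubadub", split="train", streaming=True
    )


def low_pii_documents(rows, max_avg_score=0.0):
    """Yield rows whose avg_score does not exceed an arbitrary threshold."""
    for row in rows:
        if row["avg_score"] <= max_avg_score:
            yield row


# Offline demonstration on made-up rows; with network access, pass
# load_train_stream() instead of this list.
toy_rows = [
    {"texts": ["Hello."], "scores": [0.0], "avg_score": 0.0, "num_sents": 1},
    {"texts": ["a@b.co"], "scores": [1.0], "avg_score": 1.0, "num_sents": 1},
]
kept = list(low_pii_documents(toy_rows))
```

Since the labels come from imperfect automated detection, a document with `avg_score` of zero is not guaranteed to be free of PII.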

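Dataset creation

Source data

Initial data collection and normalization: labeled text from The Pile
Source language producers: see The Pile

Annotations

Annotation process: Scrubadub is used to detect PII in each sentence
Annotators: Scrubadub

Considerations for using the data

Social impact: this dataset contains examples of real PII. Use it with care to avoid misuse or the risk of publicizing other people's information. The dataset is intended for research use only.
Biases: the dataset contains all the biases discussed for The Pile
Other known limitations: the PII was detected with imperfect automated methods, so the labels cannot be guaranteed to be 100% accurate.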
