grammarly/pseudonymization-data

Source: Hugging Face | Updated: 2023-08-23 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/grammarly/pseudonymization-data
Resource description:
---
license: apache-2.0
task_categories:
- text-classification
- summarization
language:
- en
pretty_name: Pseudonymization data
size_categories:
- 100M<n<1T
---

This repository contains all the datasets used in our paper "Privacy- and Utility-Preserving NLP with Anonymized data: A case study of Pseudonymization" (https://aclanthology.org/2023.trustnlp-1.20).

# Dataset Card for Pseudonymization data

## Dataset Description

- **Homepage:** https://huggingface.co/datasets/grammarly/pseudonymization-data
- **Paper:** https://aclanthology.org/2023.trustnlp-1.20/
- **Point of Contact:** oleksandr.yermilov@ucu.edu.ua

### Dataset Summary

This dataset repository contains all the datasets used in our paper. It includes:

- datasets for different NLP tasks, pseudonymized by different algorithms;
- a dataset for training a Seq2Seq model that translates text from original to "pseudonymized";
- a dataset for training a model that detects whether a text was pseudonymized.

### Languages

English.

## Dataset Structure

Each folder contains preprocessed train versions of different datasets (e.g., the `cnn_dm` folder contains the preprocessed CNN/Daily Mail dataset). Each file name corresponds to the algorithm from the paper that was used for its preprocessing (e.g., `ner_ps_spacy_imdb.csv` is the IMDB dataset preprocessed with NER-based pseudonymization using the spaCy system).

## Dataset Creation

Datasets in the `imdb` and `cnn_dm` folders were created by pseudonymizing the corresponding datasets with different pseudonymization algorithms.

Datasets in the `detection` folder combine the original datasets with the pseudonymized datasets, grouped by the pseudonymization algorithm used.

Datasets in the `seq2seq` folder are for training a Seq2Seq transformer-based pseudonymization model. First, a dataset was collected from Wikipedia articles; it was then preprocessed with either the NER-PS<sub>FLAIR</sub> or the NER-PS<sub>spaCy</sub> algorithm.

### Personal and Sensitive Information

These datasets contain no sensitive or personal information; they are based entirely on data available in open sources (Wikipedia, standard datasets for NLP tasks).

## Considerations for Using the Data

### Known Limitations

Only English texts are present in the datasets. Only a limited subset of named entity types is replaced in the datasets. Please also check the Limitations section of our paper.

## Additional Information

### Dataset Curators

Oleksandr Yermilov (oleksandr.yermilov@ucu.edu.ua)

### Citation Information

```
@inproceedings{yermilov-etal-2023-privacy,
    title = "Privacy- and Utility-Preserving {NLP} with Anonymized data: A case study of Pseudonymization",
    author = "Yermilov, Oleksandr and Raheja, Vipul and Chernodub, Artem",
    booktitle = "Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.trustnlp-1.20",
    doi = "10.18653/v1/2023.trustnlp-1.20",
    pages = "232--241",
    abstract = "This work investigates the effectiveness of different pseudonymization techniques, ranging from rule-based substitutions to using pre-trained Large Language Models (LLMs), on a variety of datasets and models used for two widely used NLP tasks: text classification and summarization. Our work provides crucial insights into the gaps between original and anonymized data (focusing on the pseudonymization technique) and model quality and fosters future research into higher-quality anonymization techniques better to balance the trade-offs between data protection and utility preservation. We make our code, pseudonymized datasets, and downstream models publicly available.",
}
```
Provider:
grammarly
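Summary of Original Information

Dataset Overview

Dataset Name

  • Name: Pseudonymization data

License

  • License: Apache-2.0

Task Categories

  • Task categories:
    • text-classification
    • summarization

Language

  • Language: English

Size Category

  • Size category: 100M<n<1T

Dataset Description

  • Overview: The repository contains datasets for several NLP tasks, pseudonymized with different algorithms. It also includes a dataset for training a Seq2Seq model that translates text from its original form to a "pseudonymized" form, and a dataset for training a model that detects whether a text has been pseudonymized.

Dataset Structure

  • Structure: The data is organized into folders, each containing preprocessed train versions of a dataset. Each file name corresponds to the algorithm from the paper that was used for preprocessing.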
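
The naming convention described in the dataset card (for example, `ner_ps_spacy_imdb.csv`) can be unpacked mechanically. The following is a minimal sketch, not the authors' tooling. It assumes that each file name has the form `<algorithm>_<dataset>.csv`, that the file sits in a folder named after the dataset, and that the repository's default branch is `main`; none of these is confirmed by the card beyond the single example it gives. `describe_file` and `PseudonymizedFile` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

REPO_ID = "grammarly/pseudonymization-data"
# Standard Hugging Face pattern for direct file access in a dataset repo.
# Assumption: the default branch is "main".
BASE_URL = f"https://huggingface.co/datasets/{REPO_ID}/resolve/main"

# Assumption: dataset suffixes in file names mirror the folder names
# mentioned in the dataset card.
KNOWN_DATASETS = ("imdb", "cnn_dm")


@dataclass(frozen=True)
class PseudonymizedFile:
    algorithm: str  # e.g. "ner_ps_spacy"
    dataset: str    # e.g. "imdb"
    url: str        # assumed download URL, not verified against the repo


def describe_file(filename: str) -> PseudonymizedFile:
    """Split '<algorithm>_<dataset>.csv' into its parts and build a URL."""
    stem = PurePosixPath(filename).stem
    for dataset in KNOWN_DATASETS:
        suffix = "_" + dataset
        if stem.endswith(suffix):
            algorithm = stem[: -len(suffix)]
            # Assumption: the file is stored under a folder named after the dataset.
            url = f"{BASE_URL}/{dataset}/{filename}"
            return PseudonymizedFile(algorithm, dataset, url)
    raise ValueError(f"Unrecognized dataset suffix in file name: {filename!r}")


info = describe_file("ner_ps_spacy_imdb.csv")
print(info.algorithm, info.dataset)
print(info.url)
```

If the assumed path is correct, the resulting URL could be passed to a CSV reader such as `pandas.read_csv`; the column layout of the files is not documented in the card, so inspect the header before relying on any column name.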

Dataset Creation

  • Creation: The datasets were created by processing the original datasets with different pseudonymization algorithms.
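
The dataset card describes the `detection` folder as a combination of original and pseudonymized datasets, grouped by algorithm. A minimal sketch of how such a combined set might be assembled for one algorithm follows. The label scheme (0 for original, 1 for pseudonymized), the shuffling step, and the function name `build_detection_examples` are assumptions made for illustration; the card does not state the actual schema of the detection files. The sample sentences are toy data, not taken from the repository.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[str, int]  # (text, label)


def build_detection_examples(
    original_texts: Sequence[str],
    pseudonymized_texts: Sequence[str],
    seed: int = 0,
) -> List[Example]:
    """Combine original and pseudonymized texts into one labeled list.

    Assumed labels: 0 = original text, 1 = pseudonymized text.
    """
    examples: List[Example] = [(text, 0) for text in original_texts]
    examples += [(text, 1) for text in pseudonymized_texts]
    # Shuffle with a fixed seed so the two classes are interleaved reproducibly.
    random.Random(seed).shuffle(examples)
    return examples


# Toy data for illustration only.
originals = ["Alice met Bob in Paris."]
pseudonymized = ["Carol met Dave in Rome."]
for text, label in build_detection_examples(originals, pseudonymized):
    print(label, text)
```

Building one such list per pseudonymization algorithm would match the card's "grouped by pseudonymization algorithm" description.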

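Considerations for Use

  • Limitations: The datasets contain only English text, and only a subset of named entity types is replaced.

Dataset Curators

  • Curator: Oleksandr Yermilov (oleksandr.yermilov@ucu.edu.ua)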

