grammarly/pseudonymization-data

Name: grammarly/pseudonymization-data
Creator: grammarly
Published: 2023-08-23 21:07:17
License: Apache-2.0

Source: Hugging Face; updated 2023-08-23; indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/grammarly/pseudonymization-data

Resource description:

---
license: apache-2.0
task_categories:
- text-classification
- summarization
language:
- en
pretty_name: Pseudonymization data
size_categories:
- 100M<n<1T
---

This repository contains all the datasets used in our paper "Privacy- and Utility-Preserving NLP with Anonymized data: A case study of Pseudonymization" (https://aclanthology.org/2023.trustnlp-1.20).

# Dataset Card for Pseudonymization data

## Dataset Description

- **Homepage:** https://huggingface.co/datasets/grammarly/pseudonymization-data
- **Paper:** https://aclanthology.org/2023.trustnlp-1.20/
- **Point of Contact:** oleksandr.yermilov@ucu.edu.ua

### Dataset Summary

This repository contains all the datasets used in our paper. It includes datasets for different NLP tasks, pseudonymized by different algorithms; a dataset for training a Seq2Seq model that translates text from original to "pseudonymized"; and a dataset for training a model that detects whether a text was pseudonymized.

### Languages

English.

## Dataset Structure

Each folder contains preprocessed train versions of different datasets (e.g., the `cnn_dm` folder contains the preprocessed CNN/Daily Mail dataset). Each file name corresponds to the algorithm from the paper used for its preprocessing (e.g., `ner_ps_spacy_imdb.csv` is the IMDb dataset, preprocessed with NER-based pseudonymization using the spaCy system).

## Dataset Creation

Datasets in the `imdb` and `cnn_dm` folders were created by pseudonymizing the corresponding datasets with different pseudonymization algorithms.

Datasets in the `detection` folder combine the original and pseudonymized datasets, grouped by the pseudonymization algorithm used.

Datasets in the `seq2seq` folder are for training the Seq2Seq transformer-based pseudonymization model. First, a dataset was fetched from Wikipedia articles and preprocessed with either the NER-PS<sub>FLAIR</sub> or the NER-PS<sub>spaCy</sub> algorithm.

### Personal and Sensitive Information

These datasets contain no sensitive or personal information; they are based entirely on data available in open sources (Wikipedia, standard datasets for NLP tasks).

## Considerations for Using the Data

### Known Limitations

Only English texts are present in the datasets. Only a limited subset of named entity types is replaced in the datasets. Please also check the Limitations section of our paper.

## Additional Information

### Dataset Curators

Oleksandr Yermilov (oleksandr.yermilov@ucu.edu.ua)

### Citation Information

```
@inproceedings{yermilov-etal-2023-privacy,
    title = "Privacy- and Utility-Preserving {NLP} with Anonymized data: A case study of Pseudonymization",
    author = "Yermilov, Oleksandr and
      Raheja, Vipul and
      Chernodub, Artem",
    booktitle = "Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.trustnlp-1.20",
    doi = "10.18653/v1/2023.trustnlp-1.20",
    pages = "232--241",
    abstract = "This work investigates the effectiveness of different pseudonymization techniques, ranging from rule-based substitutions to using pre-trained Large Language Models (LLMs), on a variety of datasets and models used for two widely used NLP tasks: text classification and summarization. Our work provides crucial insights into the gaps between original and anonymized data (focusing on the pseudonymization technique) and model quality and fosters future research into higher-quality anonymization techniques better to balance the trade-offs between data protection and utility preservation. We make our code, pseudonymized datasets, and downstream models publicly available.",
}
```

Provider:

grammarly

Summary of original information

Dataset overview

Dataset name

Name: Pseudonymization data

License

License: Apache-2.0

Task categories

Task categories:
- text-classification
- summarization

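Language

Language: English

Size category

Size category: 100M<n<1T

Dataset description

Overview: The repository contains datasets for different NLP tasks, pseudonymized by different algorithms. It also includes a dataset for training a Seq2Seq model that translates text from original to "pseudonymized", and a dataset for training a model that detects whether a text was pseudonymized.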

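Dataset structure

Structure: The data is organized into folders, each containing preprocessed train versions of a dataset. Each file name corresponds to the algorithm from the paper that was used for preprocessing.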
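
The naming convention can be illustrated with a small helper. This is a minimal sketch, assuming file names follow the pattern `<algorithm>_<dataset>.csv` seen in the card's example `ner_ps_spacy_imdb.csv`, and that the dataset suffix is one of the folder names the card mentions (`imdb`, `cnn_dm`). The function name `split_file_name` and the `KNOWN_DATASETS` tuple are hypothetical, and other files in the repository may not follow this pattern.

```python
from pathlib import PurePosixPath

# Assumption: dataset names are taken from the folder names mentioned in the card.
KNOWN_DATASETS = ("cnn_dm", "imdb")


def split_file_name(path: str) -> tuple[str, str]:
    """Split '<algorithm>_<dataset>.csv' into (algorithm, dataset)."""
    stem = PurePosixPath(path).stem  # drop folder and extension
    for dataset in KNOWN_DATASETS:
        suffix = "_" + dataset
        # Require a non-empty algorithm part before the dataset suffix.
        if stem.endswith(suffix) and len(stem) > len(suffix):
            return stem[: -len(suffix)], dataset
    raise ValueError(f"unrecognized file name: {path!r}")


print(split_file_name("imdb/ner_ps_spacy_imdb.csv"))
```

For the example file from the card, the helper returns the algorithm tag `ner_ps_spacy` and the dataset name `imdb`.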

Dataset creation

Creation: The datasets were created by applying different pseudonymization algorithms to the original datasets.
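
To make the idea of pseudonymization concrete, here is a toy sketch that replaces known entity mentions with consistent substitutes. It is not one of the paper's algorithms: the lookup table `REPLACEMENTS`, the sample sentence, and the function `pseudonymize` are all hypothetical, and a real system would detect entities with an NER model rather than a fixed dictionary.

```python
import re

# Hypothetical lookup table; a real system would detect entities with an NER model.
REPLACEMENTS = {"Alice Smith": "Maria Jones", "Berlin": "Madrid"}


def pseudonymize(text: str, mapping: dict[str, str]) -> str:
    """Replace every whole-word occurrence of a known entity with its pseudonym."""
    if not mapping:
        return text
    # Longest keys first so multi-word names win over shorter overlapping ones.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(0)], text)


original = "Alice Smith moved to Berlin. Alice Smith likes Berlin."
print(pseudonymize(original, REPLACEMENTS))
```

Each entity is mapped to the same substitute every time it appears, so the text stays internally coherent while the original names are removed.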

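Considerations for use

Limitations: The datasets contain only English text, and only a subset of named entity types is replaced.

Dataset curators

Curator: Oleksandr Yermilov (oleksandr.yermilov@ucu.edu.ua)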

