panda
Source: ModelScope community. Updated 2025-11-27; indexed 2025-05-24.
Download link: https://modelscope.cn/datasets/facebook/panda
Resource description:
# Dataset Card for PANDA
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Repository:** https://github.com/facebookresearch/ResponsibleNLP/
- **Paper:** https://arxiv.org/abs/2205.12586
- **Point of Contact:** rebeccaqian@meta.com, ccross@meta.com, douwe@huggingface.co, adinawilliams@meta.com
### Dataset Summary
PANDA (Perturbation Augmentation NLP DAtaset) consists of approximately 100K pairs of crowdsourced human-perturbed text snippets (original, perturbed). Annotators were given selected terms and target demographic attributes, and instructed to rewrite text snippets along three demographic axes: gender, race and age, while preserving semantic meaning. Text snippets were sourced from a range of text corpora (BookCorpus, Wikipedia, ANLI, MNLI, SST, SQuAD). PANDA can be used for training a learned perturber that can rewrite text with control. PANDA can also be used to evaluate the demographic robustness of language models.
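To make the "rewrite text with control" use case concrete, here is a minimal sketch of how one record might be turned into a (source, target) pair for a sequence-to-sequence perturber. This card does not specify the input format used by the authors; the layout, the function name `to_seq2seq_pair`, and the `<PERT_SEP>` separator below are assumptions for illustration only.

```python
def to_seq2seq_pair(record: dict[str, str], sep: str = "<PERT_SEP>") -> tuple[str, str]:
    """Build a (source, target) training pair from one PANDA-style record.

    The control signal (selected word and target attribute) is prepended to
    the original text; the separator token is a hypothetical choice.
    """
    source = f'{record["selected_word"]}, {record["target_attribute"]} {sep} {record["original"]}'
    # Strip stray trailing whitespace that some perturbed snippets carry.
    target = record["perturbed"].strip()
    return source, target


record = {
    "original": "the moment the girl mentions the subject she will be yours .",
    "selected_word": "girl",
    "target_attribute": "man",
    "perturbed": "the moment the boy mentions the subject he will be yours.\n\n",
}
source, target = to_seq2seq_pair(record)
```

Placing the control fields before the text keeps them at a fixed position regardless of snippet length, which is one plausible design; other orderings would work equally well.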
### Languages
English
## Dataset Structure
### Data Instances
- Size of training data: 198.6 MB
- Size of validation data: 22.2 MB
Examples of data instances:
```
{
"original": "the moment the girl mentions the subject she will be yours .",
"selected_word": "girl",
"target_attribute": "man",
"perturbed": "the moment the boy mentions the subject he will be yours.\n\n"
}
{
"original": "are like magic tricks, says the New York Times ' Michael Kimmelman. <SEP> Michael Kimmelman has never likened anything to a magic trick.",
"selected_word": "Michael",
"target_attribute": "woman",
"perturbed": "are like magic tricks, says the New York Times' Michelle Kimmelman. <SEP> Michelle Kimmelman has never likened anything to a magic trick."
}
{
"original": "lilly ann looked at him asking herself how he cold not know .",
"selected_word": "he",
"target_attribute": "non-binary",
"perturbed": "Lilly Ann looked at them, asking herself how they could not know."
}
```
Examples with `<SEP>` tokens are the result of concatenating text fields in source datasets, such as the premise and hypothesis of NLI datasets.
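If the concatenated fields need to be recovered, a snippet can be split on the separator. The following is a minimal sketch; the helper name `split_segments` is hypothetical, and it assumes the separator appears literally as `<SEP>` in the text, as in the examples above.

```python
SEP_TOKEN = "<SEP>"


def split_segments(text: str) -> list[str]:
    """Split a snippet on the <SEP> token and trim surrounding whitespace."""
    return [segment.strip() for segment in text.split(SEP_TOKEN)]


example = (
    "are like magic tricks, says the New York Times' Michelle Kimmelman. "
    "<SEP> Michelle Kimmelman has never likened anything to a magic trick."
)
# For an NLI-style example, the two segments would be premise and hypothesis.
segments = split_segments(example)
```

Snippets without a separator come back as a single-element list, so the same helper can be applied to every record.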
### Data Fields
- `original`: Source (unperturbed) text snippet, sampled from a variety of English text corpora.
- `selected_word`: Demographic term that needs to be perturbed.
- `target_attribute`: Target demographic category.
- `perturbed`: Perturbed text snippet, which is the source text rewritten to alter the selected word along the specified target demographic attribute. For example, if the selected word is "Lily" and the target is "man", all references to "Lily" (e.g., pronouns) in the source text are altered to refer to a man. Note that some examples may be unchanged, either because the text lacks demographic information or because the task is ambiguous; given the subjective nature of identifying demographic terms and attributes, we allow some room for interpretation provided the rewrite does not perpetuate harmful social biases.
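Because some examples may be unchanged, it can be useful to flag them before training or evaluation. The sketch below types a record with the four fields listed above and compares the two texts after collapsing whitespace. The names `PandaRecord`, `normalize`, and `is_unchanged` are hypothetical, and whitespace-only normalization is an assumption; a stricter or looser comparison may suit a given use.

```python
from typing import TypedDict


class PandaRecord(TypedDict):
    original: str
    selected_word: str
    target_attribute: str
    perturbed: str


def normalize(text: str) -> str:
    """Collapse runs of whitespace (including trailing newlines) to single spaces."""
    return " ".join(text.split())


def is_unchanged(record: PandaRecord) -> bool:
    """Return True when the rewrite differs from the source only in whitespace."""
    return normalize(record["original"]) == normalize(record["perturbed"])


changed: PandaRecord = {
    "original": "the moment the girl mentions the subject she will be yours .",
    "selected_word": "girl",
    "target_attribute": "man",
    "perturbed": "the moment the boy mentions the subject he will be yours.\n\n",
}
# Illustrative record (not taken from the dataset) where only whitespace differs.
same: PandaRecord = {
    "original": "it rained all day .",
    "selected_word": "it",
    "target_attribute": "woman",
    "perturbed": "it rained all day .\n\n",
}
```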
### Data Splits
- `train`: 94966
- `valid`: 10551
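As a quick check on the split sizes above, the following sketch computes the total number of pairs and each split's share. It uses only the two counts listed here.

```python
# Split sizes as listed in this card.
split_sizes = {"train": 94966, "valid": 10551}

total = sum(split_sizes.values())
fractions = {name: count / total for name, count in split_sizes.items()}

for name, fraction in fractions.items():
    print(f"{name}: {split_sizes[name]} examples ({fraction:.1%})")
```

The total is 105,517 pairs, consistent with the "approximately 100K" figure in the summary, and the split is very close to 90/10.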
## Dataset Creation
### Curation Rationale
We constructed PANDA to create and release the first large-scale dataset of demographic text perturbations. This enables the training of the first neural perturber model, which outperforms heuristic approaches.
### Source Data
#### Initial Data Collection and Normalization
We employed 524 crowdworkers to create PANDA examples over the span of several months. Annotators were tasked with rewriting text snippets sourced from popular English text corpora. For more information on the task UI and methodology, see our paper *Perturbation Augmentation for Fairer NLP*.
### Annotations
#### Annotation process
PANDA was collected in a three-stage annotation process:
1. Span identification: Annotators select demographic terms in source text samples.
2. Attribute identification: Identified demographic terms are annotated for gender/race/age attributes, such as "man", "Asian", or "old".
3. Rewrite text: Annotators rewrite text by modifying the selected entity to reflect the target demographic attribute. Annotators are encouraged to make minimal edits, e.g., "George" -> "Georgina".
The annotation process is explained in more detail in our paper.
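One way to inspect how minimal a rewrite is would be to diff the original and perturbed text at the token level. The sketch below uses Python's standard `difflib.SequenceMatcher` over whitespace-split tokens; the helper name `changed_tokens` and the whitespace tokenization are assumptions, not part of the annotation pipeline described here.

```python
import difflib


def changed_tokens(original: str, perturbed: str) -> list[tuple[str, str]]:
    """Return (original_span, perturbed_span) pairs for each non-matching token run."""
    a, b = original.split(), perturbed.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# A minimal edit touches a single token.
edits = changed_tokens("George went home .", "Georgina went home .")
print(edits)  # [('George', 'Georgina')]
```

Counting the returned pairs gives a rough measure of edit size; note that punctuation or casing changes made by annotators also show up as edits under this tokenization.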
#### Who are the annotators?
PANDA was annotated by English-speaking Amazon Mechanical Turk workers. Alongside the annotation tasks, we included a voluntary demographic survey; completing it did not affect pay. For a breakdown of annotators' demographic identities, see our paper.
### Personal and Sensitive Information
PANDA does not contain identifying information about annotators.
## Considerations for Using the Data
### Social Impact of Dataset
By releasing the first large-scale dataset of demographic text rewrites, we hope to enable exciting future work in fairness in NLP toward more scalable, automated approaches to reducing biases in datasets and language models.
Furthermore, PANDA aims to be diverse in text domain and demographic representation. PANDA includes a large proportion of non-binary gender annotations, which are underrepresented in existing text corpora and prior fairness datasets. Text examples vary in length, ranging from single sentences to long Wikipedia passages, and are sourced from a variety of text corpora that can be used to train a domain-agnostic perturber.
### Discussion of Biases
For this work, we sourced our annotated data from a range of sources to ensure: (i) permissive data licensing, (ii) that our perturber works well on downstream applications such as NLU classification tasks, and (iii) that our perturber can handle data from multiple domains to be maximally useful. However, we acknowledge that there may be other existing biases in PANDA as a result of our data sourcing choices. For example, it is possible that data sources like BookWiki primarily contain topics of interest to people with a certain amount of influence and educational access, people from the so-called “Western world”, etc. Other topics that might be interesting and relevant to others may be missing or only present in limited quantities. The present approach can only weaken associations inherited from the data sources we use, but in future work, we would love to explore the efficacy of our approach on text from other sources that contain a wider range of topics and text domain differences.
### Other Known Limitations
Our augmentation process can sometimes create nonexistent versions of real people, such as discussing an English King Victor (not a historical figure), as opposed to a Queen Victoria (a historical figure). We embrace the counterfactuality of many of our perturbations, but the lack of guaranteed factuality means that our approach may not be well-suited to all NLP tasks. For example, it might not be suitable for augmenting misinformation detection datasets, because people's names, genders, and other demographic information should not be changed.
## Additional Information
### Dataset Curators
Rebecca Qian, Candace Ross, Jude Fernandes, Douwe Kiela and Adina Williams.
### Licensing Information
PANDA is released under the MIT license.
### Citation Information
https://arxiv.org/abs/2205.12586
### Contributions
Thanks to [@Rebecca-Qian](https://github.com/Rebecca-Qian) for adding this dataset.
Provider: maas
Created: 2025-05-20



