asset
Download link: https://modelscope.cn/datasets/facebook/asset
# Dataset Card for ASSET
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Repository:** [ASSET Github repository](https://github.com/facebookresearch/asset)
- **Paper:** [ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations](https://www.aclweb.org/anthology/2020.acl-main.424/)
- **Point of Contact:** [Louis Martin](mailto:louismartincs@gmail.com)
### Dataset Summary
[ASSET](https://github.com/facebookresearch/asset) [(Alva-Manchego et al., 2020)](https://www.aclweb.org/anthology/2020.acl-main.424.pdf) is a multi-reference dataset for the evaluation of sentence simplification in English. The dataset uses the same 2,359 sentences as [TurkCorpus](https://github.com/cocoxu/simplification/) [(Xu et al., 2016)](https://www.aclweb.org/anthology/Q16-1029.pdf), and each sentence is associated with 10 crowdsourced simplifications. Unlike previous simplification datasets, which contain a single transformation (e.g., lexical paraphrasing in TurkCorpus or sentence splitting in [HSplit](https://www.aclweb.org/anthology/D18-1081.pdf)), the simplifications in ASSET encompass a variety of rewriting transformations.
### Supported Tasks and Leaderboards
The dataset supports the evaluation of `text-simplification` systems. Success in this task is typically measured using the [SARI](https://huggingface.co/metrics/sari) and [FKBLEU](https://huggingface.co/metrics/fkbleu) metrics described in the paper [Optimizing Statistical Machine Translation for Text Simplification](https://www.aclweb.org/anthology/Q16-1029.pdf).
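Because each input comes with several references, a system output is scored against a set of references rather than a single one. The sketch below illustrates only that multi-reference structure, using a deliberately simple stand-in score (bag-of-tokens F1, taking the best match across references). It is not SARI or FKBLEU, and the function names `token_f1` and `best_reference_score` are hypothetical helpers introduced here for illustration.

```python
from collections import Counter
from typing import List


def token_f1(candidate: str, reference: str) -> float:
    """Bag-of-tokens F1 between two whitespace-tokenized, lowercased strings."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_reference_score(candidate: str, references: List[str]) -> float:
    """Score a candidate against every reference and keep the best match."""
    return max((token_f1(candidate, r) for r in references), default=0.0)


# Illustrative strings only, not an actual record from the dataset.
refs = ["He lived in London. He was a teacher.", "He settled in London."]
score = best_reference_score("He lived in London.", refs)
```

Taking the maximum over references is only one possible aggregation; the metrics named above combine the references differently, so this sketch should not be used to report results.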
### Languages
The text in this dataset is in English (`en`).
## Dataset Structure
### Data Instances
- `simplification` configuration: an instance consists of an original sentence and 10 possible reference simplifications.
- `ratings` configuration: an instance consists of an original sentence, a simplification obtained by an automated system, and a crowd worker's judgment of quality along one of three axes.
### Data Fields
- `original`: an original sentence from the source datasets
- `simplifications`: in the `simplification` config, a set of reference simplifications produced by crowd workers.
- `simplification`: in the `ratings` config, a simplification of the original obtained by an automated system
- `aspect`: in the `ratings` config, the aspect on which the simplification is evaluated, one of `meaning`, `fluency`, `simplicity`
- `rating`: a quality rating between 0 and 100
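The two record shapes described above can be sketched as plain Python types with basic sanity checks. This is a minimal sketch under stated assumptions: the class and function names are hypothetical, `aspect` is assumed to be stored as its string name (a hosted copy may encode it as a class index instead), and `rating` is assumed to be an integer.

```python
from dataclasses import dataclass
from typing import List

# Aspect names as listed in the field description.
ASPECTS = {"meaning", "fluency", "simplicity"}


@dataclass
class SimplificationInstance:
    """One record of the `simplification` config (hypothetical type)."""
    original: str
    simplifications: List[str]


@dataclass
class RatingInstance:
    """One record of the `ratings` config (hypothetical type)."""
    original: str
    simplification: str
    aspect: str   # assumed to be the string name, not a class index
    rating: int   # assumed integer in [0, 100]


def check_simplification(inst: SimplificationInstance, expected_refs: int = 10) -> bool:
    """True if the record has a non-empty original and the expected number of references."""
    return bool(inst.original) and len(inst.simplifications) == expected_refs


def check_rating(inst: RatingInstance) -> bool:
    """True if the aspect is one of the three axes and the rating is within 0..100."""
    return inst.aspect in ASPECTS and 0 <= inst.rating <= 100


# Illustrative values only; the rating below is made up for the example.
example = RatingInstance(
    original="He settled in London, devoting himself chiefly to practical teaching.",
    simplification="He lived in London. He was a teacher.",
    aspect="simplicity",
    rating=80,
)
is_valid = check_rating(example)
```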
### Data Splits
ASSET does not contain a training set; many models use [WikiLarge](https://github.com/XingxingZhang/dress) (Zhang and Lapata, 2017) for training.
Each input sentence has 10 associated reference simplified sentences. The statistics of ASSET are given below.
| | Dev | Test | Total |
| ----- | ------ | ---- | ----- |
| Input Sentences | 2000 | 359 | 2359 |
| Reference Simplifications | 20000 | 3590 | 23590 |
The test and validation sets are the same as those of TurkCorpus. The split was random.
There are 19.04 tokens per reference on average (lower than 21.29 and 25.49 for TurkCorpus and HSplit, respectively). Most (17,245) of the reference sentences do not involve sentence splitting.
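The figures in this section are mutually consistent, which can be checked with a few lines of arithmetic. The sketch below only recomputes numbers already stated above (10 references per input, 17,245 references without splitting); the variable names are arbitrary.

```python
REFS_PER_SENTENCE = 10

# Input sentence counts per split, as given in the table.
inputs = {"dev": 2000, "test": 359}

# Each input sentence has 10 references.
references = {split: n * REFS_PER_SENTENCE for split, n in inputs.items()}

total_inputs = sum(inputs.values())
total_refs = sum(references.values())

# Share of references that do not involve sentence splitting.
no_split = 17245
share_no_split = no_split / total_refs
```

Under these numbers, roughly 73% of the references involve no sentence splitting.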
## Dataset Creation
### Curation Rationale
ASSET was created in order to improve the evaluation of sentence simplification. It uses the same input sentences as the [TurkCorpus](https://github.com/cocoxu/simplification/) dataset from [(Xu et al., 2016)](https://www.aclweb.org/anthology/Q16-1029.pdf). The 2,359 input sentences of TurkCorpus are a sample of "standard" (not simple) sentences from the [Parallel Wikipedia Simplification (PWKP)](https://www.informatik.tu-darmstadt.de/ukp/research_6/data/sentence_simplification/simple_complex_sentence_pairs/index.en.jsp) dataset [(Zhu et al., 2010)](https://www.aclweb.org/anthology/C10-1152.pdf), which come from the August 22, 2009 version of Wikipedia. The sentences of TurkCorpus were chosen to be of similar length [(Xu et al., 2016)](https://www.aclweb.org/anthology/Q16-1029.pdf). No further information is provided on the sampling strategy.
The TurkCorpus dataset was developed in order to overcome some of the problems with sentence pairs from Standard and Simple Wikipedia: a large fraction of sentences were misaligned, or not actually simpler [(Xu et al., 2016)](https://www.aclweb.org/anthology/Q16-1029.pdf). However, TurkCorpus mainly focused on *lexical paraphrasing*, and so cannot be used to evaluate simplifications involving *compression* (deletion) or *sentence splitting*. HSplit [(Sulem et al., 2018)](https://www.aclweb.org/anthology/D18-1081.pdf), on the other hand, can only be used to evaluate sentence splitting. The reference sentences in ASSET include a wider variety of sentence rewriting strategies, combining splitting, compression and paraphrasing. Annotators were given examples of each kind of transformation individually, as well as all three transformations used at once, but were allowed to decide which transformations to use for any given sentence.
An example illustrating the differences between TurkCorpus, HSplit and ASSET is given below:
> **Original:** He settled in London, devoting himself chiefly to practical teaching.
>
> **TurkCorpus:** He rooted in London, devoting himself mainly to practical teaching.
>
> **HSplit:** He settled in London. He devoted himself chiefly to practical teaching.
>
> **ASSET:** He lived in London. He was a teacher.
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
The input sentences are from English Wikipedia (August 22, 2009 version). No demographic information is available for the writers of these sentences. However, most Wikipedia editors are male (Lam, 2011; Graells-Garrido, 2015), which has an impact on the topics covered (see also [the Wikipedia page on Wikipedia gender bias](https://en.wikipedia.org/wiki/Gender_bias_on_Wikipedia)). In addition, Wikipedia editors are mostly white, young, and from the Northern Hemisphere [(Wikipedia: Systemic bias)](https://en.wikipedia.org/wiki/Wikipedia:Systemic_bias).
Reference sentences were written by 42 workers on Amazon Mechanical Turk (AMT). The requirements for being an annotator were:
- Passing a Qualification Test (appropriately simplifying sentences). Out of 100 workers, 42 passed the test.
- Being a resident of the United States, United Kingdom or Canada.
- Having a HIT approval rate over 95%, and over 1000 HITs approved.
No other demographic or compensation information is provided in the ASSET paper.
### Annotations
#### Annotation process
The instructions given to the annotators are available [here](https://github.com/facebookresearch/asset/blob/master/crowdsourcing/AMT_AnnotationInstructions.pdf).
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
The dataset may contain some social biases, as the input sentences are based on Wikipedia. Studies have shown that the English Wikipedia contains both gender biases (Schmahl et al., 2020) and racial biases (Adams et al., 2019).
> Adams, Julia, Hannah Brückner, and Cambria Naslund. "Who Counts as a Notable Sociologist on Wikipedia? Gender, Race, and the “Professor Test”." Socius 5 (2019): 2378023118823946.
> Schmahl, Katja Geertruida, et al. "Is Wikipedia succeeding in reducing gender bias? Assessing changes in gender bias in Wikipedia using word embeddings." Proceedings of the Fourth Workshop on Natural Language Processing and Computational Social Science. 2020.
### Other Known Limitations
The dataset is provided for research purposes only. Please check the dataset license for additional information.
## Additional Information
### Dataset Curators
ASSET was developed by researchers at the University of Sheffield, Inria, Facebook AI Research, and Imperial College London. The work was partly supported by Benoît Sagot's chair in the PRAIRIE institute, funded by the French National Research Agency (ANR) as part of the "Investissements d’avenir" program (reference ANR-19-P3IA-0001).
### Licensing Information
[Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/)
### Citation Information
```
@inproceedings{alva-manchego-etal-2020-asset,
title = "{ASSET}: {A} Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations",
author = "Alva-Manchego, Fernando and
Martin, Louis and
Bordes, Antoine and
Scarton, Carolina and
Sagot, Beno{\^\i}t and
Specia, Lucia",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.424",
pages = "4668--4679",
}
```
This dataset card uses material written by [Juan Diego Rodriguez](https://github.com/juand-r).
### Contributions
Thanks to [@yjernite](https://github.com/yjernite) for adding this dataset.