GEM/cochrane-simplification
Hugging Face | Updated 2022-10-24 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/GEM/cochrane-simplification
Resource description:
---
annotations_creators:
- none
language_creators:
- unknown
language:
- en
license:
- cc-by-4.0
multilinguality:
- unknown
size_categories:
- unknown
source_datasets:
- original
task_categories:
- text2text-generation
task_ids:
- text-simplification
pretty_name: cochrane-simplification
---
# Dataset Card for GEM/cochrane-simplification
## Dataset Description
- **Homepage:** https://github.com/AshOlogn/Paragraph-level-Simplification-of-Medical-Texts
- **Repository:** https://github.com/AshOlogn/Paragraph-level-Simplification-of-Medical-Texts
- **Paper:** https://aclanthology.org/2021.naacl-main.395/
- **Leaderboard:** N/A
- **Point of Contact:** Ashwin Devaraj
### Link to Main Data Card
You can find the main data card on the [GEM Website](https://gem-benchmark.com/data_cards/cochrane-simplification).
### Dataset Summary
Cochrane-simplification is an English dataset for paragraph-level simplification of medical texts. It is drawn from Cochrane, a database of systematic reviews of clinical questions, many of which have plain-English summaries targeting readers without a university education. The dataset comprises about 4,500 such abstract-summary pairs.
You can load the dataset via:
```
# Requires the Hugging Face `datasets` library
import datasets

# Downloads the dataset and returns its splits
data = datasets.load_dataset('GEM/cochrane-simplification')
```
The data loader can be found [here](https://huggingface.co/datasets/GEM/cochrane-simplification).
#### website
[Link](https://github.com/AshOlogn/Paragraph-level-Simplification-of-Medical-Texts)
#### paper
[Link](https://aclanthology.org/2021.naacl-main.395/)
#### authors
Ashwin Devaraj (The University of Texas at Austin), Iain J. Marshall (King's College London), Byron C. Wallace (Northeastern University), Junyi Jessy Li (The University of Texas at Austin)
## Dataset Overview
### Where to find the Data and its Documentation
#### Webpage
<!-- info: What is the webpage for the dataset (if it exists)? -->
<!-- scope: telescope -->
[Link](https://github.com/AshOlogn/Paragraph-level-Simplification-of-Medical-Texts)
#### Download
<!-- info: What is the link to where the original dataset is hosted? -->
<!-- scope: telescope -->
[Link](https://github.com/AshOlogn/Paragraph-level-Simplification-of-Medical-Texts)
#### Paper
<!-- info: What is the link to the paper describing the dataset (open access preferred)? -->
<!-- scope: telescope -->
[Link](https://aclanthology.org/2021.naacl-main.395/)
#### BibTex
<!-- info: Provide the BibTex-formatted reference for the dataset. Please use the correct published version (ACL anthology, etc.) instead of google scholar created Bibtex. -->
<!-- scope: microscope -->
```
@inproceedings{devaraj-etal-2021-paragraph,
title = "Paragraph-level Simplification of Medical Texts",
author = "Devaraj, Ashwin and
Marshall, Iain and
Wallace, Byron and
Li, Junyi Jessy",
booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
month = jun,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.naacl-main.395",
doi = "10.18653/v1/2021.naacl-main.395",
pages = "4972--4984",
}
```
#### Contact Name
<!-- quick -->
<!-- info: If known, provide the name of at least one person the reader can contact for questions about the dataset. -->
<!-- scope: periscope -->
Ashwin Devaraj
#### Contact Email
<!-- info: If known, provide the email of at least one person the reader can contact for questions about the dataset. -->
<!-- scope: periscope -->
ashwin.devaraj@utexas.edu
#### Has a Leaderboard?
<!-- info: Does the dataset have an active leaderboard? -->
<!-- scope: telescope -->
no
### Languages and Intended Use
#### Multilingual?
<!-- quick -->
<!-- info: Is the dataset multilingual? -->
<!-- scope: telescope -->
no
#### Covered Languages
<!-- quick -->
<!-- info: What languages/dialects are covered in the dataset? -->
<!-- scope: telescope -->
`English`
#### License
<!-- quick -->
<!-- info: What is the license of the dataset? -->
<!-- scope: telescope -->
cc-by-4.0: Creative Commons Attribution 4.0 International
#### Intended Use
<!-- info: What is the intended use of the dataset? -->
<!-- scope: microscope -->
The intended use of this dataset is to train models that simplify medical text at the paragraph level so that it may be more accessible to the lay reader.
#### Primary Task
<!-- info: What primary task does the dataset support? -->
<!-- scope: telescope -->
Simplification
#### Communicative Goal
<!-- quick -->
<!-- info: Provide a short description of the communicative goal of a model trained for this task on this dataset. -->
<!-- scope: periscope -->
A model trained on this dataset can be used to simplify medical texts to make them more accessible to readers without medical expertise.
### Credit
#### Curation Organization Type(s)
<!-- info: In what kind of organization did the dataset curation happen? -->
<!-- scope: telescope -->
`academic`
#### Curation Organization(s)
<!-- info: Name the organization(s). -->
<!-- scope: periscope -->
The University of Texas at Austin, King's College London, Northeastern University
#### Dataset Creators
<!-- info: Who created the original dataset? List the people involved in collecting the dataset and their affiliation(s). -->
<!-- scope: microscope -->
Ashwin Devaraj (The University of Texas at Austin), Iain J. Marshall (King's College London), Byron C. Wallace (Northeastern University), Junyi Jessy Li (The University of Texas at Austin)
#### Funding
<!-- info: Who funded the data creation? -->
<!-- scope: microscope -->
National Institutes of Health (NIH) grant R01-LM012086, National Science Foundation (NSF) grant IIS-1850153, Texas Advanced Computing Center (TACC) computational resources
#### Who added the Dataset to GEM?
<!-- info: Who contributed to the data card and adding the dataset to GEM? List the people+affiliations involved in creating this data card and who helped integrate this dataset into GEM. -->
<!-- scope: microscope -->
Ashwin Devaraj (The University of Texas at Austin)
### Dataset Structure
#### Data Fields
<!-- info: List and describe the fields present in the dataset. -->
<!-- scope: telescope -->
- `gem_id`: string, a unique identifier for the example
- `doi`: string, DOI identifier for the Cochrane review from which the example was generated
- `source`: string, an excerpt from an abstract of a Cochrane review
- `target`: string, an excerpt from the plain-language summary of a Cochrane review that roughly aligns with the source text
#### Example Instance
<!-- info: Provide a JSON formatted example of a typical instance in the dataset. -->
<!-- scope: periscope -->
```
{
"gem_id": "gem-cochrane-simplification-train-766",
"doi": "10.1002/14651858.CD002173.pub2",
"source": "Of 3500 titles retrieved from the literature, 24 papers reporting on 23 studies could be included in the review. The studies were published between 1970 and 1997 and together included 1026 participants. Most were cross-over studies. Few studies provided sufficient information to judge the concealment of allocation. Four studies provided results for the percentage of symptom-free days. Pooling the results did not reveal a statistically significant difference between sodium cromoglycate and placebo. For the other pooled outcomes, most of the symptom-related outcomes and bronchodilator use showed statistically significant results, but treatment effects were small. Considering the confidence intervals of the outcome measures, a clinically relevant effect of sodium cromoglycate cannot be excluded. The funnel plot showed an under-representation of small studies with negative results, suggesting publication bias. There is insufficient evidence to be sure about the efficacy of sodium cromoglycate over placebo. Publication bias is likely to have overestimated the beneficial effects of sodium cromoglycate as maintenance therapy in childhood asthma.",
"target": "In this review we aimed to determine whether there is evidence for the effectiveness of inhaled sodium cromoglycate as maintenance treatment in children with chronic asthma. Most of the studies were carried out in small groups of patients. Furthermore, we suspect that not all studies undertaken have been published. The results show that there is insufficient evidence to be sure about the beneficial effect of sodium cromoglycate compared to placebo. However, for several outcome measures the results favoured sodium cromoglycate."
}
```
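The field list and example instance above suggest a simple sanity check for records. The following is a minimal sketch, assuming each record is a plain Python dictionary; the helper name `check_record` is hypothetical and not part of the dataset loader, and the `source` and `target` strings are shortened to the first sentence of the example instance.

```python
from typing import List

# Field names as listed under "Data Fields"; all four are documented as strings.
EXPECTED_FIELDS = ("gem_id", "doi", "source", "target")


def check_record(record: dict) -> List[str]:
    """Return a list of problems found in a record (empty if it looks well-formed)."""
    problems = []
    for field in EXPECTED_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], str):
            problems.append(f"field is not a string: {field}")
    return problems


# First sentence of each text from the example instance (shortened for brevity).
example = {
    "gem_id": "gem-cochrane-simplification-train-766",
    "doi": "10.1002/14651858.CD002173.pub2",
    "source": "Of 3500 titles retrieved from the literature, 24 papers reporting on 23 studies could be included in the review.",
    "target": "In this review we aimed to determine whether there is evidence for the effectiveness of inhaled sodium cromoglycate as maintenance treatment in children with chronic asthma.",
}

print(check_record(example))            # no problems expected
print(check_record({"gem_id": "x"}))    # reports the three missing fields
```

A check like this only confirms the record shape; it says nothing about how well `source` and `target` align.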
#### Data Splits
<!-- info: Describe and name the splits in the dataset if there are more than one. -->
<!-- scope: periscope -->
- `train`: 3568 examples
- `validation`: 411 examples
- `test`: 480 examples
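
As a quick arithmetic check, the split sizes listed above sum to 4,459 examples, which matches the "about 4,500" figure in the summary. The short sketch below only restates those numbers and derives each split's share; the variable names are illustrative.

```python
# Split sizes as listed in this data card.
split_sizes = {"train": 3568, "validation": 411, "test": 480}

total = sum(split_sizes.values())
print(f"total: {total} examples")

# Share of each split, roughly 80% / 9% / 11%.
for name, size in split_sizes.items():
    print(f"{name}: {size} examples ({size / total:.1%})")
```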
## Dataset in GEM
### Rationale for Inclusion in GEM
#### Why is the Dataset in GEM?
<!-- info: What does this dataset contribute toward better generation evaluation and why is it part of GEM? -->
<!-- scope: microscope -->
This dataset is the first paragraph-level simplification dataset published (as prior work had primarily focused on simplifying individual sentences). Furthermore, this dataset is in the medical domain, which is an especially useful domain for text simplification.
#### Similar Datasets
<!-- info: Do other datasets for the high level task exist? -->
<!-- scope: telescope -->
no
#### Ability that the Dataset measures
<!-- info: What aspect of model ability can be measured with this dataset? -->
<!-- scope: periscope -->
This dataset measures the ability of a model to simplify paragraphs of medical text through the omission of non-salient information and the simplification of medical jargon.
### GEM-Specific Curation
#### Modified for GEM?
<!-- info: Has the GEM version of the dataset been modified in any way (data, processing, splits) from the original curated data? -->
<!-- scope: telescope -->
no
#### Additional Splits?
<!-- info: Does GEM provide additional splits to the dataset? -->
<!-- scope: telescope -->
no
### Getting Started with the Task
## Previous Results
### Previous Results
#### Measured Model Abilities
<!-- info: What aspect of model ability can be measured with this dataset? -->
<!-- scope: telescope -->
This dataset measures the ability of a model to simplify paragraphs of medical text through the omission of non-salient information and the simplification of medical jargon.
#### Metrics
<!-- info: What metrics are typically used for this task? -->
<!-- scope: periscope -->
`Other: Other Metrics`, `BLEU`
#### Other Metrics
<!-- info: Definitions of other metrics -->
<!-- scope: periscope -->
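SARI measures the quality of text simplification by comparing the system output against both the source text and the reference, scoring the words that are added, deleted, and kept.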
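
To give a feel for what SARI looks at, here is a deliberately simplified sketch of its three edit operations (add, keep, delete). It is not the official metric: real SARI works on n-grams up to length 4, supports multiple references, and combines precision and recall, whereas this version uses unigram sets, a single reference, and precision only. The function name and the example sentences are hypothetical.

```python
def _ratio(numerator: int, denominator: int) -> float:
    # Convention for this sketch only: an operation that was never attempted scores 1.0.
    return numerator / denominator if denominator else 1.0


def edit_operation_precisions(source: str, output: str, reference: str) -> dict:
    """Unigram, set-based precision of the add / keep / delete operations."""
    src = set(source.lower().split())
    out = set(output.lower().split())
    ref = set(reference.lower().split())

    added = out - src      # words the system introduced
    kept = out & src       # words the system copied from the source
    deleted = src - out    # words the system dropped

    return {
        "add": _ratio(len(added & ref), len(added)),          # added words the reference also uses
        "keep": _ratio(len(kept & ref), len(kept)),           # kept words the reference also keeps
        "delete": _ratio(len(deleted - ref), len(deleted)),   # dropped words the reference also drops
    }


scores = edit_operation_precisions(
    source="the patient exhibited hypertension",
    output="the patient exhibited high pressure",
    reference="the patient had high blood pressure",
)
print(scores)
```

In this toy case the system is penalized on "keep" because it retained "exhibited", which the reference replaced.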
#### Previous results available?
<!-- info: Are previous results available? -->
<!-- scope: telescope -->
yes
#### Relevant Previous Results
<!-- info: What are the most relevant previous results for this task/dataset? -->
<!-- scope: microscope -->
The paper that introduced this dataset trained BART models (pretrained on XSum) with unlikelihood training, producing simplification models that achieved maximum SARI and BLEU scores of 40 and 43, respectively.
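
For readers unfamiliar with unlikelihood training, the sketch below shows the generic idea on a single decoding step: the usual likelihood term rewards the correct token, while an unlikelihood term of the form `-log(1 - p)` penalizes probability mass placed on unwanted tokens. This data card does not say which tokens the paper penalizes or how the terms are weighted, so the toy distribution, the penalized token, and the weight `alpha` are all assumptions for illustration.

```python
import math


def likelihood_loss(probs: dict, target_token: str) -> float:
    """Standard negative log-likelihood of the correct next token."""
    return -math.log(probs[target_token])


def unlikelihood_loss(probs: dict, penalized_tokens: list) -> float:
    """Generic unlikelihood term: grows as the model puts mass on penalized tokens.

    Undefined if a penalized token has probability exactly 1.0.
    """
    return -sum(math.log(1.0 - probs[tok]) for tok in penalized_tokens)


# Toy next-token distribution (hypothetical values).
probs = {"hypertension": 0.5, "high": 0.25, "blood": 0.25}

alpha = 1.0  # hypothetical weight on the unlikelihood term
loss = likelihood_loss(probs, "high") + alpha * unlikelihood_loss(probs, ["hypertension"])
print(round(loss, 4))
```

Lowering the probability of the penalized token reduces the second term, which is how this kind of objective can steer a model away from tokens treated as undesirable.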
## Dataset Curation
### Original Curation
#### Sourced from Different Sources
<!-- info: Is the dataset aggregated from different data sources? -->
<!-- scope: telescope -->
no
### Language Data
#### Data Validation
<!-- info: Was the text validated by a different worker or a data curator? -->
<!-- scope: telescope -->
not validated
#### Was Data Filtered?
<!-- info: Were text instances selected or filtered? -->
<!-- scope: telescope -->
not filtered
### Structured Annotations
#### Additional Annotations?
<!-- quick -->
<!-- info: Does the dataset have additional annotations for each instance? -->
<!-- scope: telescope -->
none
#### Annotation Service?
<!-- info: Was an annotation service used? -->
<!-- scope: telescope -->
no
### Consent
#### Any Consent Policy?
<!-- info: Was there a consent policy involved when gathering the data? -->
<!-- scope: telescope -->
no
### Private Identifying Information (PII)
#### Contains PII?
<!-- quick -->
<!-- info: Does the source language data likely contain Personal Identifying Information about the data creators or subjects? -->
<!-- scope: telescope -->
yes/very likely
#### Any PII Identification?
<!-- info: Did the curators use any automatic/manual method to identify PII in the dataset? -->
<!-- scope: periscope -->
no identification
### Maintenance
#### Any Maintenance Plan?
<!-- info: Does the original dataset have a maintenance plan? -->
<!-- scope: telescope -->
no
## Broader Social Context
### Previous Work on the Social Impact of the Dataset
#### Usage of Models based on the Data
<!-- info: Are you aware of cases where models trained on the task featured in this dataset or related tasks have been used in automated systems? -->
<!-- scope: telescope -->
no
### Impact on Under-Served Communities
#### Addresses needs of underserved Communities?
<!-- info: Does this dataset address the needs of communities that are traditionally underserved in language technology, and particularly language generation technology? Communities may be underserved for example because their language, language variety, or social or geographical context is underrepresented in NLP and NLG resources (datasets and models). -->
<!-- scope: telescope -->
yes
#### Details on how Dataset Addresses the Needs
<!-- info: Describe how this dataset addresses the needs of underserved communities. -->
<!-- scope: microscope -->
This dataset can be used to simplify medical texts that may otherwise be inaccessible to those without medical training.
### Discussion of Biases
#### Any Documented Social Biases?
<!-- info: Are there documented social biases in the dataset? Biases in this context are variations in the ways members of different social categories are represented that can have harmful downstream consequences for members of the more disadvantaged group. -->
<!-- scope: telescope -->
unsure
#### Are the Language Producers Representative of the Language?
<!-- info: Does the distribution of language producers in the dataset accurately represent the full distribution of speakers of the language world-wide? If not, how does it differ? -->
<!-- scope: periscope -->
The dataset was generated from abstracts and plain-language summaries of medical literature reviews that were written by medical professionals, and thus it was not generated by people representative of the entire English-speaking population.
## Considerations for Using the Data
### PII Risks and Liability
### Licenses
### Known Technical Limitations
#### Technical Limitations
<!-- info: Describe any known technical limitations, such as spurious correlations, train/test overlap, annotation biases, or mis-annotations, and cite the works that first identified these limitations when possible. -->
<!-- scope: microscope -->
The main limitation of this dataset is that the information alignment between the abstract and plain-language summary is often rough, so the plain-language summary may contain information that isn't found in the abstract. Furthermore, the plain-language targets often contain formulaic statements like "this evidence is current to [month][year]" not found in the abstracts. Another limitation is that some plain-language summaries do not simplify the technical abstracts very much and still contain medical jargon.
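
One way to inspect the formulaic statements mentioned above is to flag targets that contain them. The sketch below is an assumption-laden illustration: the regular expression covers only the single pattern quoted in this card, the helper name is hypothetical, and the sample sentences (including the date) are invented for the demonstration.

```python
import re

# Matches statements such as "this evidence is current to <month> <year>".
# Only this one pattern is covered; real targets may phrase it differently.
CURRENCY_STATEMENT = re.compile(
    r"\bevidence is current to\s+[A-Za-z]+\s+\d{4}\b", re.IGNORECASE
)


def has_currency_statement(text: str) -> bool:
    """Return True if the text contains an 'evidence is current to ...' statement."""
    return CURRENCY_STATEMENT.search(text) is not None


# Hypothetical target sentences for illustration.
print(has_currency_statement("This evidence is current to May 2014."))
print(has_currency_statement("There is insufficient evidence to be sure about the beneficial effect."))
```

Whether to drop, mask, or keep flagged sentences is a modeling decision this card does not prescribe.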
#### Unsuited Applications
<!-- info: When using a model trained on this dataset in a setting where users or the public may interact with its predictions, what are some pitfalls to look out for? In particular, describe some applications of the general task featured in this dataset that its curation or properties make it less suitable for. -->
<!-- scope: microscope -->
The main pitfall to look out for is errors in factuality. Simplification work so far has not placed a strong emphasis on the logical fidelity of model generations with the input text, and the paper introducing this dataset does not explore modeling techniques to combat this. These kinds of errors are especially pernicious in the medical domain, and the models introduced in the paper do occasionally alter entities like disease and medication names.
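
As a rough illustration of the entity-alteration pitfall, the sketch below flags terms that occur in the source but not in a model output. It is a minimal sketch under stated assumptions: the term list must be supplied by the user (no such list ships with the dataset), matching is a plain case-insensitive substring test, and the model output shown is invented. A flagged term is not necessarily an error, since dropping or rephrasing a term can be a legitimate simplification, so the result is best treated as a prompt for human review.

```python
from typing import List


def missing_terms(source: str, output: str, terms: List[str]) -> List[str]:
    """Return the terms that appear in the source but are absent from the output."""
    src, out = source.lower(), output.lower()
    return [t for t in terms if t.lower() in src and t.lower() not in out]


# Source sentence taken from the example instance in this card.
source = "There is insufficient evidence to be sure about the efficacy of sodium cromoglycate over placebo."
# Hypothetical model output in which the medication name has been altered.
output = "We cannot be sure that sodium chloride works better than placebo."

print(missing_terms(source, output, ["sodium cromoglycate", "placebo"]))
```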
Provider:
GEM
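Summary of Original Information
Dataset Overview
Basic Dataset Information
- Name: cochrane-simplification
- Language: English
- License: Creative Commons Attribution 4.0 International (cc-by-4.0)
- Task type: text simplification
- Data source: original data
Dataset Details
- Description: Cochrane is an English dataset for paragraph-level simplification of medical texts. It contains about 4,500 text pairs, drawn from abstracts in the Cochrane database of systematic reviews and the plain-English summaries written for readers without a university education.
- Intended use: training models that simplify medical text so that it is easier for lay readers to understand.
- Primary task: simplification
- Communicative goal: simplify medical texts so that they are more accessible to readers without medical expertise.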
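Dataset Structure
- Data fields:
  - gem_id: string, a unique identifier
  - doi: string, the DOI of the Cochrane review
  - source: string, an excerpt from the abstract of a Cochrane review
  - target: string, an excerpt from the plain-language summary of a Cochrane review that roughly corresponds to the source text
- Data splits:
  - train: 3568 examples
  - validation: 411 examples
  - test: 480 examples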
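Dataset Creators
- Creators: Ashwin Devaraj (The University of Texas at Austin), Iain J. Marshall (King's College London), Byron C. Wallace (Northeastern University), Junyi Jessy Li (The University of Texas at Austin)
- Funding: National Institutes of Health (NIH) grant R01-LM012086, National Science Foundation (NSF) grant IIS-1850153, Texas Advanced Computing Center (TACC) computational resources
Dataset Maintenance and Use
- Maintenance plan: none
- Technical limitations: The information alignment is often rough, and the plain-language summary may contain information not mentioned in the abstract. In addition, the plain-language targets may contain formulaic statements not found in the abstracts.
- Unsuited applications: When using models trained on this dataset, pay particular attention to factual accuracy, because simplification work has not yet emphasized the logical fidelity of model generations to the input text.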
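Collected Summary
Dataset Introduction

Construction
In the field of medical text simplification, the Cochrane simplification dataset was built systematically from the Cochrane database of systematic reviews. It pairs the abstracts of medical reviews with their corresponding plain-language summaries, yielding about 4,500 aligned text pairs. During construction, the researchers extracted technical abstracts from published reviews of clinical questions and matched them with the plain summaries written for readers without a university education, so that source and target texts correspond at the paragraph level. This approach draws on an authoritative source of medical literature and, through its pairing design, provides a structured data foundation for paragraph-level text simplification.
Characteristics
The dataset focuses on paragraph-level simplification, moving beyond the limits of traditional sentence-level simplification. It has train, validation, and test splits with 3568, 411, and 480 examples respectively; each example contains a unique identifier, a DOI, and a source-target text pair. The source text is the technical abstract of a medical review, rigorous in language and dense with terminology; the target text is the corresponding plain-language summary, intended to improve readability and reach. This design lets the dataset measure a model's ability to omit non-salient information and simplify medical jargon, and makes it particularly suitable for evaluating how well generation models adapt to a specialized domain.
Usage
Researchers can load the dataset directly with the Hugging Face datasets library by calling the load_dataset function with the path 'GEM/cochrane-simplification'. The dataset suits text-to-text generation tasks, especially paragraph-level simplification of medical text. Typical uses include training sequence-to-sequence models, such as the BART architecture, to turn technical medical abstracts into plain language. For evaluation, metrics such as SARI and BLEU are commonly used to measure simplification quality and fidelity. Note that because the alignment between source and target can be rough, and the target may contain information not covered by the source, users should watch the factual accuracy of model outputs to avoid misleading simplifications in a medical context.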
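Background and Challenges
Background Overview
In medical communication, the complex language of the professional literature is often a barrier to the public's access to health knowledge. To address this, a research team from The University of Texas at Austin, King's College London, and Northeastern University built the Cochrane simplification dataset in 2021 to advance research on paragraph-level medical text simplification. The dataset is drawn from the Cochrane database of systematic reviews and contains about 4,500 pairs of technical abstracts and plain-language summaries. Its core research question is how natural language processing can turn dense medical writing into text that non-specialist readers can understand. The resource has notably advanced research on the accessibility of health knowledge and laid a data foundation for tools that interpret medical text for the public.
Current Challenges
The dataset targets the domain-specific problem of medical text simplification. Its core challenge is to make terminology plain and restructure complex sentences while preserving the key medical facts of the original and without introducing factual errors. Construction faced several difficulties: the information alignment between source and target is often rough, and some plain-language summaries contain content that does not appear in the technical abstract; the targets often include formulaic statements, such as the date to which the evidence is current, that have no counterpart in the source; and some plain-language summaries do not simplify the medical terminology much and remain fairly technical, which affects how well simplification models learn and generalize.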
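Common Scenarios
Classic Use Case
In medical communication, the complexity of the professional literature is often a barrier to public access to knowledge. With about 4,500 paragraph-level simplification pairs, the GEM/cochrane-simplification dataset provides a classic example for natural language processing research. It is mainly used to train text simplification models that turn the technical abstracts of Cochrane systematic reviews into plain language for lay readers, moving from complex medical terminology to clear expression. The process involves not only word substitution but also reorganizing information and removing redundant content, setting a new benchmark for paragraph-level simplification.
Practical Applications
In practice, systems built on this dataset can be applied to public health communication, the generation of patient education materials, and the development of clinical decision support tools. For example, healthcare organizations could use models trained on the dataset to automatically turn recent medical research findings into health guidance that patients can understand, improving doctor-patient communication. Medical knowledge platforms could likewise use the technology to give the public accurate, easy-to-understand information on disease prevention; in settings such as epidemic control, where specialist knowledge must be disseminated quickly, it could markedly lower barriers to health information and promote health equity.
Derived and Related Work
The release of the dataset has prompted several lines of research. The original paper used a pretrained model based on the BART architecture combined with unlikelihood training and made notable progress on the simplification task. Later work has explored directions such as improving factual consistency and domain-adaptive transfer, for example by introducing constraints from medical knowledge graphs to keep simplified text accurate. This work has deepened understanding of how medical text simplification works and offers methods that neighboring areas, such as multimodal medical information processing and clinical narrative generation, can draw on, forming a line of medical NLP research centered on accessibility.
The content above was collected and summarized by 遇见数据集.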



