voidful/EQG-RACE-PLUS
Source: Hugging Face. Updated 2023-05-24; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/voidful/EQG-RACE-PLUS
Resource description:
---
dataset_info:
  features:
  - name: questions
    list:
    - name: answer
      struct:
      - name: answer_index
        dtype: int64
      - name: answer_text
        dtype: string
    - name: options
      sequence: string
    - name: question
      dtype: string
    - name: question_type
      dtype: string
  - name: article
    dtype: string
  - name: id
    dtype: string
  splits:
  - name: train_all
    num_bytes: 63952721
    num_examples: 25137
  - name: train_middle
    num_bytes: 12480455
    num_examples: 6409
  - name: dev_high
    num_bytes: 2790766
    num_examples: 1021
  - name: dev_middle
    num_bytes: 712198
    num_examples: 368
  - name: test_middle
    num_bytes: 714595
    num_examples: 362
  - name: train_high
    num_bytes: 51472267
    num_examples: 18728
  - name: test_high
    num_bytes: 2850894
    num_examples: 1045
  download_size: 33312158
  dataset_size: 134973896
---
# Dataset Card for "QGG-RACE Dataset"
Table of Contents
- Dataset Description
  - Dataset Summary
  - Supported Tasks and Leaderboards
  - Languages
- Dataset Structure
  - Data Instances
  - Data Fields
  - Data Splits
- Dataset Creation
  - Curation Rationale
  - Source Data
  - Annotations
  - Personal and Sensitive Information
- Considerations for Using the Data
  - Social Impact of Dataset
  - Discussion of Biases
  - Other Known Limitations
- Additional Information
  - Dataset Curators
  - Licensing Information
  - Citation Information
  - Contributions
## Dataset Description
- GitHub Repository: N/A
- Paper: N/A
- Leaderboard: N/A
- Point of Contact: N/A
## Dataset Summary
The QGG-RACE dataset is a subset of RACE that contains three types of questions: Factoid, Cloze, and Summarization.
Dataset Download: [GitHub Release](https://github.com/p208p2002/QGG-RACE-dataset/releases/download/v1.0/qgg-dataset.zip)
Data Statistics:
Types | Examples | Train | Dev | Test
------------- | ------------------------------------------ | ----- | ---- | ----
Cloze | Yingying is Wangwang's _ . | 43167 | 2405 | 2462
Factoid | What can Mimi do? | 18405 | 1030 | 944
Summarization | According to this passage we know that _ . | 3004 | 175 | 184
## Supported Tasks and Leaderboards
- Question Generation
- Reading Comprehension
- Text Summarization
## Languages
The dataset is in English.
## Dataset Structure
### Data Instances
An example data instance from the dataset is shown below:
```json
{
  "answers": [
    "D",
    "A",
    "B",
    "C"
  ],
  "options": [
    [
      "States",
      "Doubts",
      "Confirms",
      "Removes"
    ],
    [
      "shows the kind of male birds females seek out.",
      "indicates the wandering albatross is the most faithful.",
      "is based on Professor Stutchbury's 20 years' research.",
      "suggests that female birds select males near their home."
    ],
    [
      "young birds' quality depends on their feather.",
      "some male birds care for others' young as their own.",
      "female birds go to find males as soon as autumn comes.",
      "female birds are responsible for feeding the hungry babies."
    ],
    [
      "A book about love-birds.",
      "Birds' living habits and love life",
      "The fact that birds don't love their mates forever.",
      "The factors that influence birds to look for another mate."
    ]
  ],
  "questions": [
    "What does the underline word \"dispels\" mean?",
    "The book The Private Lives of Birds _ .",
    "According to the passage, we can infer that _ .",
    "What is the passage mainly about?"
  ],
  "article": "Birds are not as loyal to their partners as you might think ...",
  "id": "high11327.txt",
  "factoid_questions": [
    "What does the underline word \"dispels\" mean?"
  ],
  "cloze_questions": [
    "The book The Private Lives of Birds _ ."
  ],
  "summarization_questions": [
    "According to the passage, we can infer that _ ."
  ]
}
```
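The instance stores questions, options, and answers as parallel lists, so the i-th entry of each list belongs together. The following is a minimal Python sketch of that pairing. The helper name `resolve_answers` and the trimmed `instance` dict are illustrative, not part of the dataset's tooling, and the sketch assumes the letters A to D map to option positions 0 to 3.

```python
def resolve_answers(instance):
    """Pair each question with the text of its correct option.

    Assumption: answer letters "A".."D" map to option positions 0..3.
    """
    resolved = []
    for question, options, letter in zip(
        instance["questions"], instance["options"], instance["answers"]
    ):
        index = ord(letter) - ord("A")  # "A" -> 0, "B" -> 1, ...
        resolved.append(
            {
                "question": question,
                "answer_index": index,
                "answer_text": options[index],
            }
        )
    return resolved


# Trimmed copy of the instance shown above (first two questions only).
instance = {
    "answers": ["D", "A"],
    "options": [
        ["States", "Doubts", "Confirms", "Removes"],
        [
            "shows the kind of male birds females seek out.",
            "indicates the wandering albatross is the most faithful.",
            "is based on Professor Stutchbury's 20 years' research.",
            "suggests that female birds select males near their home.",
        ],
    ],
    "questions": [
        'What does the underline word "dispels" mean?',
        "The book The Private Lives of Birds _ .",
    ],
}

for row in resolve_answers(instance):
    print(row["answer_index"], row["answer_text"])
```

Under that assumption, the first question resolves to "Removes" and the second to "shows the kind of male birds females seek out."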
### Data Fields
- id: Unique identifier for the example.
- article: The main text passage.
- questions: List of questions related to the passage.
- options: List of answer options for each question.
- answers: Correct answer for each question, given as an option letter (A-D) that identifies an entry in that question's options.
- factoid_questions: List of factoid questions.
- cloze_questions: List of cloze questions.
- summarization_questions: List of summarization questions.
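The three typed lists can be used to label each entry of `questions`. The sketch below is one hedged way to do that. The function name `label_question_types` and the short type names are hypothetical choices, and it assumes a question's type is determined by exact string membership in one of the typed lists. In the instance shown in this card, the fourth question appears in none of the three lists, so the sketch returns `None` for it.

```python
def label_question_types(instance):
    """Return (question, type) pairs; type is None if no typed list has the question."""
    type_fields = {
        "factoid": "factoid_questions",
        "cloze": "cloze_questions",
        "summarization": "summarization_questions",
    }
    lookup = {}
    for type_name, field in type_fields.items():
        for question in instance.get(field, []):
            lookup[question] = type_name
    return [(question, lookup.get(question)) for question in instance["questions"]]


# Question fields copied from the example instance in this card.
instance = {
    "questions": [
        'What does the underline word "dispels" mean?',
        "The book The Private Lives of Birds _ .",
        "According to the passage, we can infer that _ .",
        "What is the passage mainly about?",
    ],
    "factoid_questions": ['What does the underline word "dispels" mean?'],
    "cloze_questions": ["The book The Private Lives of Birds _ ."],
    "summarization_questions": ["According to the passage, we can infer that _ ."],
}

for question, question_type in label_question_types(instance):
    print(question_type, "|", question)
```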
### Data Splits
- Train: Contains 65,576 examples.
- Dev: Contains 3,610 examples.
- Test: Contains 3,590 examples.
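The split totals can be cross-checked against the per-type counts in the Data Statistics table. The short Python sketch below does the arithmetic. It assumes the split figures count questions rather than articles, which the card does not state explicitly.

```python
# Per-type question counts copied from the Data Statistics table.
per_type = {
    "Cloze": {"train": 43167, "dev": 2405, "test": 2462},
    "Factoid": {"train": 18405, "dev": 1030, "test": 944},
    "Summarization": {"train": 3004, "dev": 175, "test": 184},
}

# Split totals as stated in the Data Splits list.
stated = {"train": 65576, "dev": 3610, "test": 3590}

# Sum the three question types for each split.
summed = {
    split: sum(counts[split] for counts in per_type.values()) for split in stated
}

for split in stated:
    difference = summed[split] - stated[split]
    print(f"{split}: summed={summed[split]} stated={stated[split]} diff={difference}")
```

The Dev and Test totals match the table exactly. The Train column sums to 64,576, which is 1,000 fewer than the stated 65,576, so one of the Train figures probably contains a typo; the card does not indicate which.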
## Dataset Creation
### Curation Rationale
The QGG-RACE dataset was created as a subset of RACE, focusing on three types of questions: Factoid, Cloze, and Summarization. It is intended to facilitate research in question generation and reading comprehension.
### Source Data
#### Initial Data Collection and Normalization
The QGG-RACE dataset is derived from the RACE dataset.
#### Who are the source language producers?
The source language producers are the authors of the RACE dataset.
### Annotations
#### Annotation process
The dataset is annotated with questions and their corresponding answer options.
#### Who are the annotators?
The annotators are the authors of the RACE dataset.
### Personal and Sensitive Information
The dataset does not contain any personal or sensitive information.
## Considerations for Using the Data
### Social Impact of Dataset
The QGG-RACE dataset can be used for research in question generation and reading comprehension, leading to improvements in these fields.
### Discussion of Biases
The dataset may inherit some biases from the RACE dataset as it is a subset of it.
### Other Known Limitations
No other known limitations.
## Additional Information
### Dataset Curators
The QGG-RACE dataset is curated by the authors of the QGG-RACE dataset GitHub repository.
### Licensing Information
The dataset is released under the [CC BY 4.0 License](https://creativecommons.org/licenses/by/4.0/).
### Citation Information
No citation information is available for the QGG-RACE dataset.
### Contributions
Thanks to @p208p2002 for creating the QGG-RACE dataset.
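Provider:
voidful
Original Information Summary
Dataset Overview
Dataset Name
QGG-RACE Dataset
Dataset Content
The QGG-RACE dataset is a subset of the RACE dataset that contains three types of questions: Factoid, Cloze, and Summarization.
Dataset Structure
Data Fields
- id: Unique identifier
- article: The main text passage
- questions: List of questions related to the passage
- options: List of answer options for each question
- answers: Correct answer for each question, given as an option letter (A-D)
- factoid_questions: List of factoid questions
- cloze_questions: List of cloze questions
- summarization_questions: List of summarization questions
Data Splits
- Train: Contains 65,576 examples
- Dev: Contains 3,610 examples
- Test: Contains 3,590 examples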
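Dataset Statistics
| Type | Example | Train | Dev | Test |
|---|---|---|---|---|
| Cloze | Yingying is Wangwang's _ . | 43167 | 2405 | 2462 |
| Factoid | What can Mimi do? | 18405 | 1030 | 944 |
| Summarization | According to this passage we know that _ . | 3004 | 175 | 184 |
Dataset Download
[GitHub Release](https://github.com/p208p2002/QGG-RACE-dataset/releases/download/v1.0/qgg-dataset.zip)
Dataset Size
- Download size: 33312158 bytes
- Dataset size: 134973896 bytes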
Split Details
| Split | Bytes | Examples |
|---|---|---|
| train_all | 63952721 | 25137 |
| train_middle | 12480455 | 6409 |
| dev_high | 2790766 | 1021 |
| dev_middle | 712198 | 368 |
| test_middle | 714595 | 362 |
| train_high | 51472267 | 18728 |
| test_high | 2850894 | 1045 |
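The per-split figures can be checked against each other with a few lines of Python. The sketch below uses only the numbers from the split table and the stated dataset size. That `train_all` is the union of `train_middle` and `train_high` is an inference from the matching example counts, not something the card states.

```python
# (num_bytes, num_examples) per split, copied from the split table.
splits = {
    "train_all": (63952721, 25137),
    "train_middle": (12480455, 6409),
    "dev_high": (2790766, 1021),
    "dev_middle": (712198, 368),
    "test_middle": (714595, 362),
    "train_high": (51472267, 18728),
    "test_high": (2850894, 1045),
}
stated_dataset_size = 134973896  # dataset_size from the metadata

# Example counts: middle + high training splits versus train_all.
train_parts = splits["train_middle"][1] + splits["train_high"][1]
print("train_middle + train_high examples:", train_parts)
print("train_all examples:", splits["train_all"][1])

# Byte counts: the sum over all seven splits versus dataset_size.
total_bytes = sum(num_bytes for num_bytes, _ in splits.values())
print("sum of split bytes:", total_bytes)
print("stated dataset_size:", stated_dataset_size)
```

The example counts agree (6,409 + 18,728 = 25,137), and the seven byte counts sum exactly to the stated `dataset_size`. If `train_all` does duplicate the two training splits, the stated dataset size counts the training data twice.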
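Dataset Language
English
Intended Use
Research in question generation and reading comprehension.
Collected Summary
Dataset Introduction

Construction
In natural language processing, datasets for reading comprehension and question generation are often built by mining and restructuring existing resources. EQG-RACE-PLUS is such a dataset: a subset of RACE whose construction extracts and classifies three core question types (factoid, cloze, and summarization). Through systematic filtering and annotation, the original RACE articles are reorganized so that each article is linked to questions of several types and their options, which gives the data task-oriented diversity. This approach preserves the linguistic richness of the source data, and the explicit question-type labels provide finer-grained supervision for model training.
Features
The dataset's core feature is its structured presentation of question types and its suitability for multiple tasks. It covers factoid, cloze, and summarization questions, each with distinct linguistic patterns and cognitive demands, so it can assess models across different reading comprehension scenarios. Each article comes with multiple questions, full option lists, and correct-answer indexes, which supports research on both answer prediction and question generation. The dataset is also divided into training, development, and test sets, each further split into middle and high difficulty levels, which allows fine-grained evaluation and comparison of models.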
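Usage
The dataset can be used for several natural language processing tasks, notably question generation, reading comprehension, and text summarization. It is provided in a standardized JSON format; each instance contains the article, the question list, the options, and the answers, so it can be loaded and processed directly. For training, choose the subsets that match the research goal, for example train on `train_middle` and `train_high` and validate and tune on `dev_middle` and `dev_high`. For final evaluation, use `test_middle` and `test_high` to measure how well the model generalizes across difficulty levels.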
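A minimal loading sketch in Python, assuming the Hugging Face `datasets` library is installed and that records follow the schema in the metadata above (a `questions` list of structs with `question`, `question_type`, `options`, and `answer`). The helper names `load_split` and `flatten_questions` are illustrative, and the sample record is hand-written with placeholder values, including the `question_type` label, whose real vocabulary is not documented here.

```python
def load_split(split_name):
    """Download one split from the Hugging Face Hub (needs network access)."""
    from datasets import load_dataset  # third-party: pip install datasets

    return load_dataset("voidful/EQG-RACE-PLUS", split=split_name)


def flatten_questions(example):
    """Turn one article-level record into one row per question."""
    rows = []
    for item in example["questions"]:
        rows.append(
            {
                "id": example["id"],
                "article": example["article"],
                "question": item["question"],
                "question_type": item["question_type"],
                "options": item["options"],
                "answer_text": item["answer"]["answer_text"],
            }
        )
    return rows


# Hand-written placeholder record that mimics the metadata schema.
sample = {
    "id": "high11327.txt",
    "article": "Birds are not as loyal to their partners as you might think ...",
    "questions": [
        {
            "question": "The book The Private Lives of Birds _ .",
            "question_type": "cloze",  # placeholder label, not verified
            "options": ["option 1", "option 2", "option 3", "option 4"],
            "answer": {"answer_index": 0, "answer_text": "option 1"},
        }
    ],
}

rows = flatten_questions(sample)
print(len(rows), rows[0]["question_type"], rows[0]["answer_text"])

# With network access, the same helper could be applied to a real split, e.g.:
# train_rows = [row for ex in load_split("train_middle") for row in flatten_questions(ex)]
```

Flattening to one row per question is a convenience for question generation models that consume (article, answer, question) triples; keeping the article-level records is equally valid for reading comprehension.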
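Background and Challenges
Background Overview
In natural language processing, machine reading comprehension and question generation are key tasks for evaluating how well AI systems understand language. EQG-RACE-PLUS, a subset of the RACE dataset, was built by the researcher p208p2002 around 2020 to provide structured annotations for three question types: factoid, cloze, and summarization. The data comes from real English exams for Chinese secondary school students, and the central research question is how diverse question types can advance model reasoning and generation in complex contexts. The resource broadens the evaluation dimensions for reading comprehension and question generation and supports fine-grained training and validation of models.
Current Challenges
The challenges fall into two areas. At the task level, generating multiple question types is complex: factoid questions require precise information extraction, while summarization questions require high-level semantic abstraction, which tests model adaptability across cognitive levels. At the construction level, the dataset inherits the biases of its source, RACE, including its specific cultural background and language style, which may limit model generalization. In addition, balanced annotation and quality control across the three question types must overcome the consistency and scale problems of manual annotation to keep the dataset reliable and representative.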
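Common Scenarios
Typical Use Cases
As a curated subset of RACE, EQG-RACE-PLUS is typically used for question generation and reading comprehension. By providing factoid, cloze, and summarization questions, it gives researchers a benchmark for evaluating models along several dimensions. In academic experiments, a model must generate or answer varied questions about a given article, which tests its grasp of deeper text semantics and encourages generative and discriminative methods to develop together.
Practical Applications
In practice, the dataset serves as a training resource for intelligent education systems and automated content generation tools. Models trained on it can support adaptive learning platforms by generating practice and assessment questions from teaching material. In news summarization or knowledge base construction, such models can extract key information from long texts and turn it into question-answer pairs, making information retrieval and interactive systems more capable.
Derived and Related Work
Work derived from this dataset centers on multi-task learning and generative reading comprehension models. For example, researchers have used its multiple question types to develop unified question generation frameworks that combine factual querying, text completion, and summarization in a single architecture. Attention-based sequence-to-sequence models refined on this dataset have also improved the relevance and diversity of generated questions, laying methodological groundwork for later research on cross-domain question answering systems.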
The content above was collected and summarized by 遇见数据集.