openbookqa-es
Source: ModelScope community (updated 2025-12-05, indexed 2025-02-01)
Download link: https://modelscope.cn/datasets/BSC-LT/openbookqa-es
# Dataset Card for openbookqa_es
<!-- Provide a quick summary of the dataset. -->
openbookqa_es is a question answering dataset in Spanish, professionally translated from the main version of the [OpenBookQA](https://leaderboard.allenai.org/open_book_qa) dataset in English.
## Dataset Details
### Dataset Description
<!-- Provide a longer summary of what this dataset is. -->
openbookqa_es (Open Book Question Answering - Spanish) is designed to simulate open book exams and assess human-like understanding of a subject. The dataset comprises 500 instances in the validation split and another 500 instances in the test split. Each instance contains a question stem, four possible choices, and the letter indicating the correct answer.
- **Curated by:** [Language Technologies Unit | BSC-CNS](https://www.bsc.es/discover-bsc/organisation/research-departments/language-technologies-unit)
- **Funded by:** [ILENIA](https://proyectoilenia.es/en/)
<!-- - **Shared by [optional]:** [More Information Needed] -->
- **Language(s) (NLP):** Spanish (`es-ES`)
- **License:** [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/deed) ([Original](https://github.com/allenai/OpenBookQA)) **
### Dataset Sources [optional]
<!-- Provide the basic links for the dataset. -->
- **Repository:** [HuggingFace](https://huggingface.co/datasets/BSC-LT)
<!-- - **Paper [optional]:** [More Information Needed] -->
<!-- - **Demo [optional]:** [More Information Needed] -->
## Uses
<!-- Address questions around how the dataset is intended to be used. -->
openbookqa_es is intended to evaluate the scientific commonsense knowledge of language models. Some potential uses are listed below.
### Direct Use
<!-- This section describes suitable use cases for the dataset. -->
- Commonsense Question Answering: openbookqa_es contains questions that require basic background knowledge, like the material of a spoon.
- Multiple Choice Test: for each problem, openbookqa_es contains 4 candidate answers, which requires reasoning across the options.
- Reading Comprehension Evaluation: problems and answers in openbookqa_es are formulated in natural language.
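The multiple-choice use case can be sketched roughly as follows. This is only an illustration, not the evaluation protocol used by the dataset authors: `format_prompt` and `accuracy` are hypothetical helper names, the prompt layout (including the `Respuesta:` suffix) is an assumption, and the toy question is invented rather than taken from the dataset.

```python
def format_prompt(question_stem: str, texts: list[str], labels: list[str]) -> str:
    """Render a question stem and its labeled choices as a plain-text prompt."""
    options = "\n".join(f"{label}. {text}" for label, text in zip(labels, texts))
    return f"{question_stem}\n{options}\nRespuesta:"


def accuracy(gold: list[str], predicted: list[str]) -> float:
    """Fraction of predicted letters that match the gold answer letters."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Toy, made-up inputs (not rows from openbookqa_es).
prompt = format_prompt("El sol es", ["una estrella.", "un planeta."], ["A", "B"])
score = accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"])
print(prompt)
print(score)
```

Scoring by exact match on the answer letter keeps the metric independent of how a model phrases its answer, provided its output has already been mapped to one of the labels.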
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->
openbookqa_es-test should **not** be used to train any language model. To facilitate removal from training corpora, we add a canary GUID string to the test file. The GUID string is ###TODO
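
To illustrate how a canary string can support removal from training corpora, the sketch below drops every document that contains the canary. The `CANARY` value is a placeholder (the actual GUID is not given in this card), and `drop_contaminated` is a hypothetical helper name.

```python
from typing import Iterable, Iterator

# Placeholder only: the real canary GUID is not stated in this card.
CANARY = "PLACEHOLDER-CANARY-GUID"


def drop_contaminated(docs: Iterable[str], canary: str = CANARY) -> Iterator[str]:
    """Yield only the documents that do not contain the canary string."""
    for doc in docs:
        if canary not in doc:
            yield doc


# Toy corpus: the second document simulates a leaked copy of the test file.
corpus = [
    "Un documento cualquiera sobre ciencia.",
    f"Archivo de test filtrado {CANARY} con preguntas.",
]
clean = list(drop_contaminated(corpus))
print(clean)
```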
## Dataset Structure
<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->
The dataset is provided as two separate JSONL files, one per split. Each row corresponds to one question and contains an instance identifier, the question stem, a dictionary with the possible answers (A/B/C/D), and the letter of the correct answer. Each row contains the following fields:
- `id`: text string containing the question-answer pair identifier.
- `question_stem`: text string with the question stem, to be completed with one of the choices.
- `choices`: dictionary containing a `text` key with the answers and a `label` key with their corresponding labels.
- `answerKey`: text string containing the letter for the correct answer.
For example:
```json
{
  "id": "8-376",
  "question_stem": "Los tiburones anguila y los rapes viven muy por debajo de la superficie del océano, y por eso se les conoce como",
  "choices": {
    "text": [
      "fauna abisal.",
      "peces.",
      "peces de mar adentro.",
      "fauna de alta mar."
    ],
    "label": [
      "A",
      "B",
      "C",
      "D"
    ]
  },
  "answerKey": "A"
}
```
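
A minimal sketch of reading one row, assuming the field names shown in the example above (`question_stem`, `choices`, `answerKey`). The helper `answer_text` is hypothetical and only illustrates how the parallel `label` and `text` lists line up; the question stem is abbreviated here.

```python
import json

# One JSONL row, abbreviated from the example above.
row = json.loads(
    '{"id": "8-376", "question_stem": "...", '
    '"choices": {"text": ["fauna abisal.", "peces.", '
    '"peces de mar adentro.", "fauna de alta mar."], '
    '"label": ["A", "B", "C", "D"]}, "answerKey": "A"}'
)


def answer_text(record: dict) -> str:
    """Return the choice text whose label matches `answerKey`."""
    labels = record["choices"]["label"]
    texts = record["choices"]["text"]
    return texts[labels.index(record["answerKey"])]


print(answer_text(row))  # fauna abisal.
```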
openbookqa_es contains the validation and test splits from the main version of the original dataset.
| Metric | validation | test |
|----------------------------------|-----------:|-----:|
| Input Sentences | 500 | 500 |
| Average Row Length in Words | TODO | TODO |
| Average Row Length in Characters | TODO | TODO |
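
The card does not define how row length is measured, so the following is only one possible way to compute the two averages. It assumes a row's text is the question stem joined with all choice texts and that words are whitespace-separated; `row_text` and `average_lengths` are hypothetical helpers, and the sample row is invented for the demonstration.

```python
import json
from statistics import mean


def row_text(record: dict) -> str:
    """Join the question stem and all choice texts (an assumed definition of a row)."""
    return " ".join([record["question_stem"], *record["choices"]["text"]])


def average_lengths(jsonl_lines):
    """Return (mean words, mean characters) over the non-empty JSONL lines."""
    texts = [row_text(json.loads(line)) for line in jsonl_lines if line.strip()]
    return mean(len(t.split()) for t in texts), mean(len(t) for t in texts)


# Toy, made-up row; in practice the lines would come from a split's JSONL file.
sample = [
    json.dumps({
        "question_stem": "El sol es",
        "choices": {"text": ["una estrella.", "un planeta."], "label": ["A", "B"]},
        "answerKey": "A",
    })
]
avg_words, avg_chars = average_lengths(sample)
print(avg_words, avg_chars)
```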
## Dataset Creation
### Curation Rationale
<!-- Motivation for the creation of this dataset. -->
From the paper (Mihaylov, T. et al. (2018). Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering.):
> While existing QA datasets over documents or knowledge bases, being generally self-contained, focus on linguistic understanding, OpenBookQA probes a deeper understanding of both the topic—in the context of common knowledge—and the language it is expressed in.
We have translated this dataset to improve the Spanish support in the NLP field and to allow cross-lingual comparisons in language models.
### Source Data
<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->
openbookqa_es comes from the main version of [OpenBookQA](https://leaderboard.allenai.org/open_book_qa), which is inspired by recurring science themes and principles. The question-answer pairs were annotated in a crowd-sourcing process, with external knowledge coming from ConceptNet, Wikipedia, and a corpus of 14M science-related sentences.
#### Data Collection and Processing
<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->
Data was gathered from the main version of [OpenBookQA](https://huggingface.co/datasets/openbookqa). We did not modify the original dataset.
The translation process to Spanish was based on the following guidelines:
- **Date & Unit conversion**: Adapt dates, metric systems, currencies, etc., to our context, except when the task involves metric system conversion.
- **Personal Names**: Translate English names with clear Spanish equivalents; otherwise, use common names in our context. Maintain consistency in translated names throughout the text. Names of individual figures are not translated.
- **Language Style**: Avoid uniformity in translation, maintaining a rich and varied language reflecting our linguistic depth. In science texts, maintain precision and terminology while avoiding monotony.
- **Dataset Logic**: Ensure internal logic of datasets is maintained; answers should remain relevant and accurate. Factual accuracy is key in question-answer datasets. Maintain the correct option in multiple-choice datasets.
- **Error Handling**: Fix errors in the English text during translation unless otherwise specified for the specific dataset. Spelling mistakes must be corrected in Spanish.
- **Avoiding Patterns and Maintaining Length**: Avoid including patterns that could hint at the correct option, maintaining difficulty. Match the length of responses to the original text as closely as possible. Handle scientific terminology carefully to ensure consistency.
#### Who are the source data producers?
<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->
openbookqa_es is a professional translation of the [OpenBookQA dataset](https://huggingface.co/datasets/allenai/openbookqa), completed by a single translator who is a native speaker of Spanish. The translator was provided with the entire validation and test splits, as well as a set of translation preferences and guidelines, along with a brief explanation of the original corpus. To ensure ongoing communication, the translator was asked to provide sample translations at intervals of 50, 100, 250, and 500 examples. These translations were then reviewed by a Spanish speaker within our team.
Additionally, the translator was encouraged to seek clarification on any specific doubts they had, and any necessary corrections were applied to the entire dataset.
### Annotations [optional]
<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->
#### Annotation process
<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->
Refer to the original paper (Mihaylov, T. et al. (2018). Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering.).
#### Who are the annotators?
<!-- This section describes the people or systems who created the annotations. -->
Refer to the original paper (Mihaylov, T. et al. (2018). Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering.).
#### Personal and Sensitive Information
<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->
No personal or sensitive information included.
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
[More Information Needed]
### Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.
## Citation [optional]
<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->
**BibTeX:**
[More Information Needed]
**APA:**
[More Information Needed]
## Glossary [optional]
<!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. -->
[More Information Needed]
## More Information [optional]
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública, funded by the EU (NextGenerationEU), within the framework of the [project ILENIA](https://proyectoilenia.es/) with reference 2022/TL22/00215337.
** The license was changed to CC-BY 4.0 because the original authors only specified the default GitHub license, Apache 2.0, which is meant for software rather than data artifacts and does not require derivative works to be licensed under the same terms.
## Dataset Card Authors [optional]
[More Information Needed]
## Dataset Card Contact
Language Technologies Unit (langtech@bsc.es) at the Barcelona Supercomputing Center (BSC).
Provider: maas
Created: 2025-01-26



