openbookqa-es

Name: openbookqa-es
Creator: maas
Published: 2025-12-05 16:21:47
License: No description available

ModelScope community · Updated 2025-12-05 · Indexed 2025-02-01

Download link:

https://modelscope.cn/datasets/BSC-LT/openbookqa-es


Resource description:

# Dataset Card for openbookqa_es

openbookqa_es is a question answering dataset in Spanish, professionally translated from the main version of the [OpenBookQA](https://leaderboard.allenai.org/open_book_qa) dataset in English.

## Dataset Details

### Dataset Description

openbookqa_es (Open Book Question Answering - Spanish) is designed to simulate open book exams and assess human-like understanding of a subject. The dataset comprises 500 instances in the validation split and another 500 instances in the test split. Each instance contains a question stem, four possible choices, and the letter indicating the correct answer.

- **Curated by:** [Language Technologies Unit | BSC-CNS](https://www.bsc.es/discover-bsc/organisation/research-departments/language-technologies-unit)
- **Funded by:** [ILENIA](https://proyectoilenia.es/en/)
- **Language(s) (NLP):** Spanish (`es-ES`)
- **License:** [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/deed) ([Original](https://github.com/allenai/OpenBookQA)) **

### Dataset Sources [optional]

- **Repository:** [HuggingFace](https://huggingface.co/datasets/BSC-LT)

## Uses

openbookqa_es is intended to evaluate science commonsense knowledge of language models. Below are some potential uses:

### Direct Use

- Commonsense Question Answering: openbookqa_es contains questions that require basic background knowledge, like the material of a spoon.
- Multiple Choice Test: for each problem, openbookqa_es contains 4 different solutions, which requires reasoning between different options.
- Reading Comprehension Evaluation: problems and answers in openbookqa_es are formulated in natural language.

### Out-of-Scope Use

openbookqa_es-test should **not** be used to train any language model. To facilitate removal from training corpora, we add a canary GUID string to the test file.
The GUID string is ###TODO

## Dataset Structure

The dataset is provided in two separate files in JSONL format, where each row corresponds to a question with multiple answers and contains an instance identifier, the question, a dictionary that contains possible answers (A/ B/ C/ D), and the corresponding letter for the correct answer. Each row contains the following fields:

- `id`: text string containing the question-answer pair identifier.
- `question_stem`: text string with the question stem, to be completed with one of the choices.
- `choices`: dictionary containing a `text` key with the answers and a `label` key with their corresponding labels.
- `answerKey`: text string containing the letter for the correct answer.

For example:

```
{
  "id": "8-376",
  "question_stem": "Los tiburones anguila y los rapes viven muy por debajo de la superficie del océano, y por eso se les conoce como",
  "choices": {
    "text": [
      "fauna abisal.",
      "peces.",
      "peces de mar adentro.",
      "fauna de alta mar."
    ],
    "label": [
      "A",
      "B",
      "C",
      "D"
    ]
  },
  "answerKey": "A"
}
```

openbookqa_es contains the validation and test splits from the main version of the original dataset.

| Metric                           | validation | test |
|----------------------------------|-----------:|-----:|
| Input Sentences                  |        500 |  500 |
| Average Row Length in Words      |       TODO | TODO |
| Average Row Length in Characters |       TODO | TODO |

## Dataset Creation

### Curation Rationale

From the paper (Mihaylov, T. et al. (2018). Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering.):

> While existing QA datasets over documents or knowledge bases, being generally self-contained, focus on linguistic understanding, OpenBookQA probes a deeper understanding of both the topic—in the context of common knowledge—and the language it is expressed in.

We have translated this dataset to improve the Spanish support in the NLP field and to allow cross-lingual comparisons in language models.
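The row layout described under Dataset Structure can be checked programmatically. The sketch below is a minimal illustration and not part of the original card: the helper names `parse_row` and `correct_answer_text` are hypothetical, and the key names follow the example row (`id`, `question_stem`, `choices`, `answerKey`).

```python
import json

# Keys as they appear in the example row of the card (assumed to hold for every row).
REQUIRED_KEYS = {"id", "question_stem", "choices", "answerKey"}


def parse_row(line: str) -> dict:
    """Parse one JSONL line and check that it matches the documented layout."""
    row = json.loads(line)
    missing = REQUIRED_KEYS - row.keys()
    if missing:
        raise ValueError(f"row {row.get('id')!r} is missing keys: {sorted(missing)}")
    labels = row["choices"]["label"]
    texts = row["choices"]["text"]
    if len(labels) != len(texts):
        raise ValueError(f"row {row['id']!r}: labels and texts differ in length")
    if row["answerKey"] not in labels:
        raise ValueError(f"row {row['id']!r}: answerKey is not one of the labels")
    return row


def correct_answer_text(row: dict) -> str:
    """Return the text of the choice whose label equals answerKey."""
    labels = row["choices"]["label"]
    return row["choices"]["text"][labels.index(row["answerKey"])]


# The example row from the card, serialized as a single JSONL line.
example_line = json.dumps(
    {
        "id": "8-376",
        "question_stem": "Los tiburones anguila y los rapes viven muy por debajo "
        "de la superficie del océano, y por eso se les conoce como",
        "choices": {
            "text": ["fauna abisal.", "peces.", "peces de mar adentro.", "fauna de alta mar."],
            "label": ["A", "B", "C", "D"],
        },
        "answerKey": "A",
    },
    ensure_ascii=False,
)

row = parse_row(example_line)
print(correct_answer_text(row))  # fauna abisal.
```

Because `label` and `text` are parallel lists, the lookup goes through the index of the label rather than assuming that `A` is always at position 0.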
### Source Data

openbookqa_es comes from the main version of [OpenBookQA](https://leaderboard.allenai.org/open_book_qa), which is inspired by recurring science themes and principles. The question-answer pairs were annotated in a crowd-sourcing process with external knowledge coming from ConceptNet, Wikipedia, and a corpus with 14M science-related sentences.

#### Data Collection and Processing

Data was gathered from the main version of [OpenBookQA](https://huggingface.co/datasets/openbookqa). We did not modify the original dataset. The translation process to Spanish was based on the following guidelines:

- **Date & Unit conversion**: Adapt dates, metric systems, currencies, etc., to our context, except when the task involves metric system conversion.
- **Personal Names**: Translate English names with clear Spanish equivalents; otherwise, use common names in our context. Maintain consistency in translated names throughout the text. Names of individual figures are not translated.
- **Language Style**: Avoid uniformity in translation, maintaining a rich and varied language reflecting our linguistic depth. In science texts, maintain precision and terminology while avoiding monotony.
- **Dataset Logic**: Ensure internal logic of datasets is maintained; answers should remain relevant and accurate. Factual accuracy is key in question-answer datasets. Maintain the correct option in multiple-choice datasets.
- **Error Handling**: Fix errors in the English text during translation unless otherwise specified for the specific dataset. Spelling mistakes must be corrected in Spanish.
- **Avoiding Patterns and Maintaining Length**: Avoid including patterns that could hint at the correct option, maintaining difficulty. Match the length of responses to the original text as closely as possible. Handle scientific terminology carefully to ensure consistency.

#### Who are the source data producers?

openbookqa_es is a professional translation of the [OpenBookQA dataset](https://huggingface.co/datasets/allenai/openbookqa), completed by a single translator who is a native speaker of Spanish. The translator was provided with the entire validation and test splits, as well as a set of translation preferences and guidelines, along with a brief explanation of the original corpus. To ensure ongoing communication, the translator was asked to provide sample translations at intervals of 50, 100, 250, and 500 examples. These translations were then reviewed by a Spanish speaker within our team. Additionally, the translator was encouraged to seek clarification on any specific doubts they had, and any necessary corrections were applied to the entire dataset.

### Annotations [optional]

#### Annotation process

Refer to the original paper (Mihaylov, T. et al. (2018). Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering.).

#### Who are the annotators?

Refer to the original paper (Mihaylov, T. et al. (2018). Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering.).

#### Personal and Sensitive Information

No personal or sensitive information included.

## Bias, Risks, and Limitations

[More Information Needed]

### Recommendations

Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.

## Citation [optional]

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

[More Information Needed]

## More Information [optional]

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the [project ILENIA](https://proyectoilenia.es/) with reference 2022/TL22/00215337.

** License was changed to CC-BY-4.0 since the authors only specified the default GitHub license Apache 2.0, which is meant for software and not for data artifacts, and does not require derivative works to be licensed under the same terms.

## Dataset Card Authors [optional]

[More Information Needed]

## Dataset Card Contact

Language Technologies Unit (langtech@bsc.es) at the Barcelona Supercomputing Center (BSC).
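
For the multiple-choice evaluation use described under Uses, scoring reduces to comparing a predicted letter with `answerKey`. The following is a minimal sketch under stated assumptions: `format_prompt`, `accuracy`, and the `predict` callable are hypothetical names, the prompt layout (including the Spanish cue `Respuesta:`) is illustrative rather than a format prescribed by the dataset authors, and the stand-in predictor that always answers `A` exists only to make the example runnable.

```python
from typing import Callable, Iterable


def format_prompt(row: dict) -> str:
    """Render the question stem followed by one labelled line per choice."""
    lines = [row["question_stem"]]
    for label, text in zip(row["choices"]["label"], row["choices"]["text"]):
        lines.append(f"{label}. {text}")
    lines.append("Respuesta:")  # illustrative answer cue, not mandated by the card
    return "\n".join(lines)


def accuracy(rows: Iterable[dict], predict: Callable[[str], str]) -> float:
    """Fraction of rows where the first letter of the prediction equals answerKey."""
    total = 0
    correct = 0
    for row in rows:
        predicted = predict(format_prompt(row)).strip().upper()[:1]
        correct += predicted == row["answerKey"]
        total += 1
    if total == 0:
        raise ValueError("no rows to score")
    return correct / total


# One row copied from the card's example; a real run would iterate over a full split.
rows = [
    {
        "id": "8-376",
        "question_stem": "Los tiburones anguila y los rapes viven muy por debajo "
        "de la superficie del océano, y por eso se les conoce como",
        "choices": {
            "text": ["fauna abisal.", "peces.", "peces de mar adentro.", "fauna de alta mar."],
            "label": ["A", "B", "C", "D"],
        },
        "answerKey": "A",
    }
]


def always_a(prompt: str) -> str:
    """Hypothetical stand-in for a language model: always answers 'A'."""
    return "A"


print(accuracy(rows, always_a))  # 1.0
```

Passing the model as a plain callable keeps the scoring loop independent of any particular inference library, so the same function can wrap an API client or a local model.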


Provider:

maas

Created:

2025-01-26

Collected summary

Dataset introduction

Background and challenges

Background overview

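openbookqa-es is a Spanish question answering dataset professionally translated from the OpenBookQA dataset, intended to evaluate the science commonsense knowledge of language models. It contains 500 instances each in the validation and test splits; every instance consists of a question stem, four choices, and the correct answer, and the data is stored in JSONL format. The dataset was curated by the Language Technologies Unit at BSC-CNS, is funded by ILENIA, is written in Spanish, and is released under the CC-BY 4.0 license.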

The above content was collected and summarized by 遇见数据集.