COPA-es

Name: COPA-es
Creator: maas
Published: 2025-12-05 16:21:47
License: No description available

ModelScope Community. Updated: 2025-12-05. Indexed: 2025-02-01.

Download link:

https://modelscope.cn/datasets/BSC-LT/COPA-es

Resource description:

# Dataset Card for COPA-es

COPA-es is a textual entailment dataset in Spanish, professionally translated from the English COPA dataset. It consists of 600 premises, each paired with a question and two choices, plus a label encoding which of the two choices the annotator judged more plausible.

## Dataset Details

### Dataset Description

COPA-es (Choice of Plausible Alternatives - Spanish) is designed to evaluate causal reasoning over text on commonsense subjects. The dataset comprises 100 instances in the validation split and another 500 instances in the test split. Each instance contains a premise, two possible choices, and a label indicating the correct choice.

- **Curated by:** [Language Technologies Unit | BSC-CNS](https://www.bsc.es/discover-bsc/organisation/research-departments/language-technologies-unit)
- **Funded by:** [ILENIA](https://proyectoilenia.es/en/)
- **Language(s) (NLP):** Spanish (`es-ES`)
- **License:** [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) ([Original](https://github.com/felipessalvatore/NLI_datasets))

### Dataset Sources [optional]

- **Repository:** [HuggingFace](https://huggingface.co/datasets/BSC-LT)

## Uses

COPA-es is intended to evaluate textual entailment in language models. Below are some potential uses:

### Direct Use

- Causal Reasoning: COPA-es contains premise sentences, where the system must determine either the cause or the effect of the premise.
- Multiple Choice Test: for each premise, COPA-es contains 2 different choices, which requires reasoning between different options.
- Reading Comprehension Evaluation: problems and answers in COPA-es are formulated in natural language.

### Out-of-Scope Use

COPA-es-test should **not** be used to train any language model. To facilitate removal from training corpora, we add a canary GUID string to the test file.
The GUID string is ###TODO

## Dataset Structure

The dataset is provided in two separate files in JSONL format. Each row corresponds to a premise with two possible entailed sentences and contains an instance identifier, the premise, the two possible entailed sentences, the entailed relation, and the number of the correct choice. Each row contains the following fields:

- `id`: text string containing the premise number identifier.
- `premise`: text string with the premise, to be completed with one of the choices.
- `choice1`: text string with the first possible entailed sentence.
- `choice2`: text string with the second possible entailed sentence.
- `question`: text string containing the entailed relation between the premise and the correct choice.
- `label`: text string containing the number for the correct answer.

For example:

```
{
  "id": "0",
  "premise": "El hombre abrió el grifo.",
  "choice1": "El retrete se llenó de agua.",
  "choice2": "El agua fluyó del grifo.",
  "question": "effect",
  "label": "1"
}
```

In this example the label `"1"` points to `choice2`, the plausible effect, which suggests that labels are zero-based (`"0"` for `choice1`, `"1"` for `choice2`).

COPA-es contains the validation and test splits from the original dataset.

| Metric                           | validation | test |
|----------------------------------|-----------:|-----:|
| Input Sentences                  |        100 |  500 |
| Average Row Length in Words      |       TODO | TODO |
| Average Row Length in Characters |       TODO | TODO |

## Dataset Creation

### Curation Rationale

From the paper (Roemmele, M. et al. (2011). Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning):

> Research in open-domain commonsense reasoning has been hindered by the lack of evaluation metrics for judging progress and comparing alternative approaches. Taking inspiration from large-scale question sets used in natural language processing research, we authored one thousand English-language questions that directly assess commonsense causal reasoning, called the Choice Of Plausible Alternatives (COPA) evaluation.

We have translated this dataset to improve Spanish support in the NLP field and to allow cross-lingual comparisons of language models.

### Source Data

COPA-es comes from the original [COPA](https://huggingface.co/datasets/nyu-mll/glue) as it is implemented in the GLUE benchmark, and focuses on topics from blogs and a photography-related encyclopedia.

#### Data Collection and Processing

Data was gathered from the GLUE benchmark version of [COPA](https://huggingface.co/datasets/nyu-mll/glue). We did not modify the original dataset.

The translation process to Spanish was based on the following guidelines:

- **Date & Unit conversion**: Adapt dates, metric systems, currencies, etc., to our context, except when the task involves metric system conversion.
- **Personal Names**: Translate English names with clear Spanish equivalents; otherwise, use common names in our context. Maintain consistency in translated names throughout the text. Names of individual figures are not translated.
- **Language Style**: Avoid uniformity in translation, maintaining a rich and varied language reflecting our linguistic depth. In technical texts, maintain precision and terminology while avoiding monotony.
- **Dataset Logic**: Ensure the internal logic of datasets is maintained; answers should remain relevant and accurate. Factual accuracy is key in question-answer datasets. Maintain the correct option in multiple-choice datasets.
- **Error Handling**: Fix errors in the English text during translation unless otherwise specified for the specific dataset. Spelling mistakes must be corrected in Spanish.
- **Avoiding Patterns and Maintaining Length**: Avoid including patterns that could hint at the correct option, maintaining difficulty. Match the length of responses to the original text as closely as possible. Handle technical terminology carefully to ensure consistency.

#### Who are the source data producers?
COPA-es is a professional translation of [COPA](https://huggingface.co/datasets/nyu-mll/glue), completed by a single translator who is a native speaker of Spanish. The translator was provided with the entire validation and test splits, as well as a set of translation preferences and guidelines, along with a brief explanation of the original corpus. To ensure ongoing communication, the translator was asked to provide sample translations at intervals of 50, 150 and 300 examples. These translations were then reviewed by a Spanish speaker within our team. Additionally, the translator was encouraged to seek clarification on any specific doubts they had, and any necessary corrections were applied to the entire dataset.

### Annotations [optional]

#### Annotation process

Refer to the original paper (Roemmele, M. et al. (2011). Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning).

#### Who are the annotators?

Refer to the original paper (Roemmele, M. et al. (2011). Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning).

#### Personal and Sensitive Information

No personal or sensitive information is included.

## Bias, Risks, and Limitations

[More Information Needed]

### Recommendations

Users should be made aware of the risks, biases and limitations of the dataset. More information is needed for further recommendations.

## Citation [optional]

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

[More Information Needed]

## More Information [optional]

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the [project ILENIA](https://proyectoilenia.es/) with reference 2022/TL22/00215337.

## Dataset Card Authors [optional]

[More Information Needed]

## Dataset Card Contact

Language Technologies Unit (langtech@bsc.es) at the Barcelona Supercomputing Center (BSC).
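
## Example Usage

The record layout described under Dataset Structure can be read and scored with a few lines of Python. What follows is a minimal sketch, not an official loader: the helper names (`parse_jsonl`, `gold_choice`, `accuracy`, `always_second`) are hypothetical, the only data used is the example row from this card, and the zero-based reading of `label` is an assumption inferred from that example.

```python
import json

# The single example row from the card, written as one JSONL line.
SAMPLE_JSONL = (
    '{"id": "0", "premise": "El hombre abrió el grifo.", '
    '"choice1": "El retrete se llenó de agua.", '
    '"choice2": "El agua fluyó del grifo.", '
    '"question": "effect", "label": "1"}\n'
)


def parse_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def gold_choice(row):
    """Return the text of the correct choice for one row.

    Assumption: labels are zero-based ("0" -> choice1, "1" -> choice2),
    inferred from the card's example rather than stated by the card.
    """
    return row["choice1"] if int(row["label"]) == 0 else row["choice2"]


def accuracy(rows, predict):
    """Fraction of rows where predict(row) equals the gold label as an int."""
    if not rows:
        return 0.0
    correct = sum(1 for row in rows if predict(row) == int(row["label"]))
    return correct / len(rows)


def always_second(row):
    """Hypothetical baseline that always answers with choice2 (label 1)."""
    return 1


rows = parse_jsonl(SAMPLE_JSONL)
print(gold_choice(rows[0]))
print(accuracy(rows, always_second))
```

To score a real model, replace `always_second` with a function that returns `0` or `1` for each row, and read the text from the validation file rather than from `SAMPLE_JSONL`. In keeping with the Out-of-Scope Use section, the test split should be used for evaluation only.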


Provider:

maas

Created:

2025-01-26
