
COPA-es

ModelScope Community · Updated 2025-12-05 · Indexed 2025-02-01
Download link:
https://modelscope.cn/datasets/BSC-LT/COPA-es
Resource description:
# Dataset Card for COPA-es

COPA-es is a textual entailment dataset in Spanish, professionally translated from the English COPA dataset. It consists of 600 premises, each paired with a question and two choices, plus a label encoding which choice the annotator judged more plausible.

## Dataset Details

### Dataset Description

COPA-es (Choice of Plausible Alternatives - Spanish) is designed to assess causal reasoning over text on commonsense subjects. The dataset comprises 100 instances in the validation split and 500 instances in the test split. Each instance contains a premise, a question type, two possible choices, and a label indicating the correct choice.

- **Curated by:** [Language Technologies Unit | BSC-CNS](https://www.bsc.es/discover-bsc/organisation/research-departments/language-technologies-unit)
- **Funded by:** [ILENIA](https://proyectoilenia.es/en/)
- **Language(s) (NLP):** Spanish (`es-ES`)
- **License:** [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) ([Original](https://github.com/felipessalvatore/NLI_datasets))

### Dataset Sources

- **Repository:** [HuggingFace](https://huggingface.co/datasets/BSC-LT)

## Uses

COPA-es is intended to evaluate textual entailment in language models. Below are some potential uses:

### Direct Use

- Causal Reasoning: COPA-es contains premise sentences for which the system must determine either the cause or the effect.
- Multiple Choice Test: for each premise, COPA-es provides 2 choices, so the system must reason between alternatives.
- Reading Comprehension Evaluation: problems and answers in COPA-es are formulated in natural language.

### Out-of-Scope Use

COPA-es-test should **not** be used to train any language model. To facilitate removal from training corpora, we add a canary GUID string to the test file. The GUID string is ###TODO

## Dataset Structure

The dataset is provided in two separate files in JSONL format. Each row corresponds to a premise with two possible entailed sentences and contains the following fields:

- `id`: text string containing the premise number identifier.
- `premise`: text string with the premise, to be completed with one of the choices.
- `choice1`: text string with the first possible entailed sentence.
- `choice2`: text string with the second possible entailed sentence.
- `question`: text string containing the entailed relation (cause or effect) between the premise and the correct choice.
- `label`: text string containing the number for the correct answer.

For example:

```
{
  "id": "0",
  "premise": "El hombre abrió el grifo.",
  "choice1": "El retrete se llenó de agua.",
  "choice2": "El agua fluyó del grifo.",
  "question": "effect",
  "label": "1"
}
```

COPA-es contains the validation and test splits from the original dataset.
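The row format described above can be read with a few lines of Python. The following is a minimal sketch rather than official tooling: the helper names `parse_rows`, `correct_choice`, and `accuracy` are hypothetical, and it assumes that `label` is a zero-based index (`"0"` for `choice1`, `"1"` for `choice2`), which is what the example row suggests but the card does not state explicitly.

```python
import json

# Field names as documented in the dataset card.
REQUIRED_FIELDS = ("id", "premise", "choice1", "choice2", "question", "label")


def parse_rows(jsonl_text):
    """Parse JSONL text into a list of dicts and check the documented fields."""
    rows = []
    for line_no, line in enumerate(jsonl_text.split("\n"), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in row]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        rows.append(row)
    return rows


def correct_choice(row):
    """Return the text of the labelled choice.

    Assumption: label "0" selects choice1 and label "1" selects choice2.
    """
    return (row["choice1"], row["choice2"])[int(row["label"])]


def accuracy(rows, predictions):
    """Fraction of rows whose predicted index (0 or 1) matches the label."""
    if not rows:
        return 0.0
    hits = sum(int(row["label"]) == pred for row, pred in zip(rows, predictions))
    return hits / len(rows)


# The example row from the card, written as a single JSONL line.
sample = (
    '{"id": "0", "premise": "El hombre abrió el grifo.", '
    '"choice1": "El retrete se llenó de agua.", '
    '"choice2": "El agua fluyó del grifo.", '
    '"question": "effect", "label": "1"}'
)

rows = parse_rows(sample)
print(correct_choice(rows[0]))
print(accuracy(rows, [1]))
```

In practice `jsonl_text` would come from reading one of the two split files; the file names are not given in the card, so they are left out here.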
| Metric                           | validation | test |
|----------------------------------|-----------:|-----:|
| Input Sentences                  |        100 |  500 |
| Average Row Length in Words      |       TODO | TODO |
| Average Row Length in Characters |       TODO | TODO |

## Dataset Creation

### Curation Rationale

From the paper (Roemmele, M. et al. (2011). Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning):

> Research in open-domain commonsense reasoning has been hindered by the lack of evaluation metrics for judging progress and comparing alternative approaches. Taking inspiration from large-scale question sets used in natural language processing research, we authored one thousand English-language questions that directly assess commonsense causal reasoning, called the Choice Of Plausible Alternatives (COPA) evaluation.

We have translated this dataset to improve Spanish support in the NLP field and to allow cross-lingual comparisons of language models.

### Source Data

COPA-es comes from the original [COPA](https://huggingface.co/datasets/nyu-mll/glue) as it is implemented in the GLUE benchmark, and focuses on topics from blogs and a photography-related encyclopedia.

#### Data Collection and Processing

Data was gathered from the GLUE benchmark version of [COPA](https://huggingface.co/datasets/nyu-mll/glue). We did not modify the original dataset.

The translation process to Spanish was based on the following guidelines:

- **Date & Unit conversion**: Adapt dates, metric systems, currencies, etc., to our context, except when the task involves metric system conversion.
- **Personal Names**: Translate English names with clear Spanish equivalents; otherwise, use common names in our context. Maintain consistency in translated names throughout the text. Names of individual figures are not translated.
- **Language Style**: Avoid uniformity in translation, maintaining a rich and varied language reflecting our linguistic depth. In technical texts, maintain precision and terminology while avoiding monotony.
- **Dataset Logic**: Ensure the internal logic of datasets is maintained; answers should remain relevant and accurate. Factual accuracy is key in question-answer datasets. Maintain the correct option in multiple-choice datasets.
- **Error Handling**: Fix errors in the English text during translation unless otherwise specified for the specific dataset. Spelling mistakes must be corrected in Spanish.
- **Avoiding Patterns and Maintaining Length**: Avoid including patterns that could hint at the correct option, maintaining difficulty. Match the length of responses to the original text as closely as possible. Handle technical terminology carefully to ensure consistency.

#### Who are the source data producers?

COPA-es is a professional translation of [COPA](https://huggingface.co/datasets/nyu-mll/glue), completed by a single translator who is a native speaker of Spanish. The translator was provided with the entire validation and test splits, as well as a set of translation preferences and guidelines, along with a brief explanation of the original corpus. To ensure ongoing communication, the translator was asked to provide sample translations at intervals of 50, 150 and 300 examples. These translations were then reviewed by a Spanish speaker within our team.
Additionally, the translator was encouraged to seek clarification on any specific doubts they had, and any necessary corrections were applied to the entire dataset.

### Annotations

#### Annotation process

Refer to the original paper (Roemmele, M. et al. (2011). Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning).

#### Who are the annotators?

Refer to the original paper (Roemmele, M. et al. (2011). Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning).

#### Personal and Sensitive Information

No personal or sensitive information included.

## Bias, Risks, and Limitations

[More Information Needed]

### Recommendations

Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.
## Citation

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary

[More Information Needed]

## More Information

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the [project ILENIA](https://proyectoilenia.es/) with reference 2022/TL22/00215337.

## Dataset Card Authors

[More Information Needed]

## Dataset Card Contact

Language Technologies Unit (langtech@bsc.es) at the Barcelona Supercomputing Center (BSC).

Provider:
maas
Created:
2025-01-26