
sapienzanlp/winogrande_italian

Hugging Face · Updated 2025-12-02 · Indexed 2024-07-22
Download link:
https://hf-mirror.com/datasets/sapienzanlp/winogrande_italian
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- it
- en
size_categories:
- 10K<n<100K
configs:
- config_name: winogrande_xl
  data_files:
  - split: train
    path: winogrande_xl.train.json
  - split: validation
    path: winogrande_xl.validation.json
---

# Winogrande - Italian (IT)

This dataset is an Italian translation of [Winogrande](https://arxiv.org/abs/1907.10641). Winogrande is a large-scale dataset for coreference resolution, commonsense reasoning, and world knowledge. It is based on the original Winograd Schema Challenge dataset.

## Dataset Details

The dataset consists of almost 40K examples, each containing a sentence with a blank and two possible fill-in-the-blank options. The task is to choose the correct option that correctly fills in the blank based on the context provided in the sentence, so that the sentence makes sense.

This dataset contains the following splits translated to Italian:

* **Winogrande XL:**
  * Train: 35,547 rows
  * Validation: 1,164 rows

### Differences with the original dataset

* The number of instances in this dataset is smaller than the original dataset due to the translation process, during which some instances were filtered out.

### Languages

This dataset is **fully parallel** between English and Italian. This allows us to have comparable evaluation setups and results across the two languages.

### Translation Process

The translation has been carried out using [🍱 OBenTO-LLM](https://github.com/c-simone/llm-data-translation), an open-source tool for LLM-based translation. The main motivation for using an open-source LLM is to encourage free, open, reproducible, and transparent research in LLM evaluation. See [🍱 OBenTO-LLM](https://github.com/c-simone/llm-data-translation) for more details on the translation process.

**Model used to translate:** [Unbabel/TowerInstruct-7B-v0.2](https://huggingface.co/Unbabel/TowerInstruct-7B-v0.2)

### Other Information

- **Original dataset by:** [Sakaguchi et al.](https://arxiv.org/abs/1907.10641)
- **Translation by:** [Simone Conia](https://scholar.google.com/citations?user=S1tqbTcAAAAJ)
- **Languages:** Italian, English
- **License:** Apache 2.0

## Dataset Format

This is an example that shows the format of the dataset, where:

* `id`: a unique ID for each sample in the split;
* `category`: type of task;
* `input_text`: the original English sentence in the dataset;
* `input_text_translation`: the translation of the sentence in Italian;
* `choices`: the original English choices;
* `choice_translations`: the translation of the choices in Italian;
* `gold_index`: the index of the correct answer.

```json
{
  "id": "winogrande_3",
  "category": "fill_in_the_blank",
  "input_text": "Terry tried to bake the eggplant in the toaster oven but the _ was too big.",
  "input_text_translation": "Terry ha provato a cuocere la melanzana nel tostapane, ma la _ era troppo grande.",
  "choices": [
    "eggplant",
    "toaster"
  ],
  "choice_translations": [
    "melanzana",
    "tostapane"
  ],
  "gold_index": 0,
  "metadata": {}
}
```

## License

The dataset is distributed under the Apache 2.0 license.

## Acknowledgements

I would like to thank the authors of the original dataset for making it available to the research community. I would also like to thank [Future AI Research](https://future-ai-research.it/) for supporting this work and funding my research.

### Special Thanks

My special thanks go to:

* Pere-Lluís Huguet Cabot and Riccardo Orlando for their help with [🍱 OBenTO-LLM](https://github.com/c-simone/llm-data-translation).

## Dataset Card Authors

* [Simone Conia](https://scholar.google.com/citations?user=S1tqbTcAAAAJ): simone.conia@uniroma1.it

This dataset is an Italian translation of Winogrande, a dataset for coreference resolution, commonsense reasoning, and world knowledge. It contains almost 40K examples, each with a sentence containing a blank and two possible fill-in-the-blank options. The task is to choose the option that makes the sentence sensible given its context. The dataset is fully parallel between English and Italian, allowing for comparable evaluations across the two languages. The translation was produced with OBenTO-LLM, an open-source tool for LLM-based translation.
Provided by:
sapienzanlp
Summary of original information

Winogrande - Italian (IT)

Dataset Overview

  • Task category: text generation
  • Languages: Italian, English
  • Size: 10K<n<100K
  • Configs:
    • config_name: winogrande_xl
    • Data files:
      • train: winogrande_xl.train.json
      • validation: winogrande_xl.validation.json
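
The configuration above could be loaded roughly as follows. This is a minimal sketch, assuming the Hugging Face `datasets` library is installed and the Hub is reachable; the helper name `load_winogrande_it` is hypothetical, while the repository ID, config name, and split names are taken from the listing above.

```python
# Repository ID, config name, and split names as listed on the dataset card.
REPO_ID = "sapienzanlp/winogrande_italian"
CONFIG_NAME = "winogrande_xl"
SPLITS = ("train", "validation")


def load_winogrande_it(split: str = "validation"):
    """Hypothetical helper: load one split of the winogrande_xl config."""
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so the constants above are usable without `datasets`.
    from datasets import load_dataset

    return load_dataset(REPO_ID, CONFIG_NAME, split=split)
```

Rejecting unknown split names before touching the network keeps a typo such as `"test"` from turning into a slow, confusing download error.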

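Dataset Details

  • Dataset size: almost 40K examples
  • Task description: each example contains a sentence with a blank and two candidate fill-in options; the task is to choose the option that makes the sentence coherent given the context.
  • Splits:
    • Winogrande XL:
      • Train: 35,547 rows
      • Validation: 1,164 rows

Dataset Features

  • Languages: the dataset is fully parallel between English and Italian, which allows comparable evaluation across the two languages.
  • Translation tool: translated with 🍱 OBenTO-LLM.
  • Data format:
    • id: a unique ID for each sample
    • category: type of task
    • input_text: the original English sentence
    • input_text_translation: the Italian translation of the sentence
    • choices: the original English choices
    • choice_translations: the Italian translations of the choices
    • gold_index: the index of the correct answer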
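
The field list above can be mirrored in a small typed record. The sketch below is illustrative only: the class name `WinograndeItem`, its `from_dict` constructor, and the consistency checks are assumptions made here, not part of the dataset or its tooling; only the field names come from the card.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class WinograndeItem:
    """Hypothetical typed view of one record; field names follow the card."""

    id: str
    category: str
    input_text: str
    input_text_translation: str
    choices: List[str]
    choice_translations: List[str]
    gold_index: int

    def __post_init__(self) -> None:
        # Assumed invariant: choices and translations line up one-to-one.
        if len(self.choices) != len(self.choice_translations):
            raise ValueError("choices and choice_translations differ in length")
        # gold_index must point at one of the listed choices.
        if not 0 <= self.gold_index < len(self.choices):
            raise ValueError(f"gold_index {self.gold_index} is out of range")

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "WinograndeItem":
        # Keys not listed above (e.g. "metadata") are ignored.
        return cls(
            id=raw["id"],
            category=raw["category"],
            input_text=raw["input_text"],
            input_text_translation=raw["input_text_translation"],
            choices=list(raw["choices"]),
            choice_translations=list(raw["choice_translations"]),
            gold_index=int(raw["gold_index"]),
        )

    @property
    def gold_choice(self) -> str:
        """The English choice selected by gold_index."""
        return self.choices[self.gold_index]
```

Because `gold_index` indexes both `choices` and `choice_translations`, checking that the two lists have equal length is what makes the same index safe to use in either language.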

Dataset Example

    {
      "id": "winogrande_3",
      "category": "fill_in_the_blank",
      "input_text": "Terry tried to bake the eggplant in the toaster oven but the _ was too big.",
      "input_text_translation": "Terry ha provato a cuocere la melanzana nel tostapane, ma la _ era troppo grande.",
      "choices": [ "eggplant", "toaster" ],
      "choice_translations": [ "melanzana", "tostapane" ],
      "gold_index": 0,
      "metadata": {}
    }
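
To make the task concrete, the example record can be expanded into its two candidate sentences per language by substituting each choice for the blank. This is a sketch under two assumptions: the blank is marked by a single `_` character, as in the example, and only the first occurrence is replaced. The function names `fill_blank` and `candidates` are hypothetical.

```python
from typing import Any, Dict, List


def fill_blank(sentence: str, choice: str, blank: str = "_") -> str:
    """Replace the first blank marker in `sentence` with `choice`."""
    if blank not in sentence:
        raise ValueError("sentence has no blank marker")
    return sentence.replace(blank, choice, 1)


def candidates(record: Dict[str, Any], translated: bool = False) -> List[str]:
    """Build one filled-in sentence per choice, in English or Italian."""
    text_key = "input_text_translation" if translated else "input_text"
    choice_key = "choice_translations" if translated else "choices"
    return [fill_blank(record[text_key], c) for c in record[choice_key]]


# The example record from the dataset card.
record = {
    "id": "winogrande_3",
    "category": "fill_in_the_blank",
    "input_text": "Terry tried to bake the eggplant in the toaster oven but the _ was too big.",
    "input_text_translation": "Terry ha provato a cuocere la melanzana nel tostapane, ma la _ era troppo grande.",
    "choices": ["eggplant", "toaster"],
    "choice_translations": ["melanzana", "tostapane"],
    "gold_index": 0,
    "metadata": {},
}

english = candidates(record)
italian = candidates(record, translated=True)
gold_en = english[record["gold_index"]]
gold_it = italian[record["gold_index"]]
print(gold_en)
print(gold_it)
```

A model under evaluation could then score both candidates in a given language and be counted correct when the higher-scoring one sits at `gold_index`; the scoring step is left out here because the card does not prescribe one.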

License

  • License: Apache 2.0