sapienzanlp/winogrande_italian
Source: Hugging Face · Updated: 2025-12-02 · Indexed: 2024-07-22
Download link:
https://hf-mirror.com/datasets/sapienzanlp/winogrande_italian
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- it
- en
size_categories:
- 10K<n<100K
configs:
- config_name: winogrande_xl
  data_files:
  - split: train
    path: winogrande_xl.train.json
  - split: validation
    path: winogrande_xl.validation.json
---
# Winogrande - Italian (IT)
This dataset is an Italian translation of [Winogrande](https://arxiv.org/abs/1907.10641). Winogrande is a large-scale dataset for coreference resolution, commonsense reasoning, and world knowledge. It is based on the original Winograd Schema Challenge dataset.
## Dataset Details
The dataset consists of almost 40K examples (36,711 across the two splits listed below), each containing a sentence with a blank (`_`) and two candidate options. The task is to choose the option that fills the blank so that the sentence makes sense in context.
This dataset contains the following splits translated to Italian:
* **Winogrande XL:**
  * Train: 35,547 rows
  * Validation: 1,164 rows
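
As a rough sketch, the splits above could be loaded with the Hugging Face `datasets` library. This assumes the library is installed and that the repository ID `sapienzanlp/winogrande_italian` and the config name `winogrande_xl` from the card header resolve as written. The helper name `load_winogrande_italian` and the `EXPECTED_ROWS` table are illustrative and not part of the dataset.

```python
import warnings

# Row counts as stated in this card; used only as a sanity check.
EXPECTED_ROWS = {"train": 35_547, "validation": 1_164}


def load_winogrande_italian(split: str = "validation"):
    """Load one split and compare its size with the card (hypothetical helper)."""
    if split not in EXPECTED_ROWS:
        raise ValueError(f"unknown split: {split!r}")

    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset(
        "sapienzanlp/winogrande_italian", "winogrande_xl", split=split
    )
    if len(dataset) != EXPECTED_ROWS[split]:
        # The hosted files may have changed since the card was written.
        warnings.warn(
            f"{split}: expected {EXPECTED_ROWS[split]} rows, got {len(dataset)}"
        )
    return dataset
```

Calling `load_winogrande_italian("train")` needs network access on first use; afterwards the library normally reads from its local cache.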
### Differences from the original dataset
* This dataset has fewer instances than the original because some instances were filtered out during the translation process.
### Languages
This dataset is **fully parallel** between English and Italian. This allows us to have comparable evaluation setups and results across the two languages.
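
To illustrate what the parallel structure makes possible, the sketch below splits one record into an English view and an Italian view that share the same gold label. It assumes the field names listed under Dataset Format; the `View` type and the `parallel_views` helper are hypothetical names introduced here.

```python
from typing import NamedTuple


class View(NamedTuple):
    """One language-specific view of a record (illustrative type)."""
    sentence: str
    choices: list[str]
    gold_index: int


def parallel_views(record: dict) -> dict[str, View]:
    """Return the English ("en") and Italian ("it") views of one record."""
    en_choices = record["choices"]
    it_choices = record["choice_translations"]
    if len(en_choices) != len(it_choices):
        raise ValueError("choices and choice_translations differ in length")
    # The same gold index is used for both languages.
    gold = record["gold_index"]
    return {
        "en": View(record["input_text"], en_choices, gold),
        "it": View(record["input_text_translation"], it_choices, gold),
    }
```

Evaluating a model on both views of the same records gives English and Italian scores computed over identical items.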
### Translation Process
The translation has been carried out using [🍱 OBenTO-LLM](https://github.com/c-simone/llm-data-translation), an open-source tool for LLM-based translation.
The main motivation for using an open-source LLM is to encourage free, open, reproducible, and transparent research in LLM evaluation.
See [🍱 OBenTO-LLM](https://github.com/c-simone/llm-data-translation) for more details on the translation process.
**Model used to translate:** [Unbabel/TowerInstruct-7B-v0.2](https://huggingface.co/Unbabel/TowerInstruct-7B-v0.2)
### Other Information
- **Original dataset by:** [Sakaguchi et al.](https://arxiv.org/abs/1907.10641)
- **Translation by:** [Simone Conia](https://scholar.google.com/citations?user=S1tqbTcAAAAJ)
- **Languages:** Italian, English
- **License:** Apache 2.0
## Dataset Format
The example below shows the format of the dataset, where:
* `id`: a unique ID for each sample in the split;
* `category`: the type of task;
* `input_text`: the original English sentence in the dataset;
* `input_text_translation`: the translation of the sentence in Italian;
* `choices`: the original English choices;
* `choice_translations`: the translation of the choices in Italian;
* `gold_index`: the index of the correct answer in `choices` and `choice_translations`;
* `metadata`: additional metadata (empty in this example).
```json
{
  "id": "winogrande_3",
  "category": "fill_in_the_blank",
  "input_text": "Terry tried to bake the eggplant in the toaster oven but the _ was too big.",
  "input_text_translation": "Terry ha provato a cuocere la melanzana nel tostapane, ma la _ era troppo grande.",
  "choices": [
    "eggplant",
    "toaster"
  ],
  "choice_translations": [
    "melanzana",
    "tostapane"
  ],
  "gold_index": 0,
  "metadata": {}
}
```
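
A record in this format can be turned into candidate sentences by substituting each choice for the `_` placeholder. The sketch below does that and computes accuracy for an arbitrary predictor. It assumes `gold_index` is zero-based, which matches the example above (index 0 selects "eggplant"); the function names and the `predict` callback interface are hypothetical.

```python
from typing import Callable, Iterable, Sequence


def fill_blank(sentence: str, choice: str) -> str:
    """Replace the first `_` placeholder with a candidate answer."""
    if "_" not in sentence:
        raise ValueError("sentence has no blank")
    return sentence.replace("_", choice, 1)


def candidates(record: dict, italian: bool = True) -> list[str]:
    """Build one completed sentence per choice, in Italian or English."""
    sentence = record["input_text_translation" if italian else "input_text"]
    choices = record["choice_translations" if italian else "choices"]
    return [fill_blank(sentence, choice) for choice in choices]


def accuracy(
    records: Iterable[dict],
    predict: Callable[[Sequence[str]], int],
    italian: bool = True,
) -> float:
    """Fraction of records where `predict` returns the gold index."""
    total = correct = 0
    for record in records:
        total += 1
        # Assumption: gold_index is a zero-based position in the choice list.
        correct += int(predict(candidates(record, italian)) == record["gold_index"])
    return correct / total if total else 0.0
```

In practice `predict` would wrap a model, for example by scoring each candidate sentence and returning the index of the most likely one; passing `italian=False` runs the same evaluation on the English side.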
## License
The dataset is distributed under the Apache 2.0 license.
## Acknowledgements
I would like to thank the authors of the original dataset for making it available to the research community.
I would also like to thank [Future AI Research](https://future-ai-research.it/) for supporting this work and funding my research.
### Special Thanks
My special thanks go to:
* Pere-Lluís Huguet Cabot and Riccardo Orlando for their help with [🍱 OBenTO-LLM](https://github.com/c-simone/llm-data-translation).
## Dataset Card Authors
* [Simone Conia](https://scholar.google.com/citations?user=S1tqbTcAAAAJ): simone.conia@uniroma1.it
Provider: sapienzanlp



