mideind/icelandic-winogrande
Icelandic WinoGrande Dataset Overview
Dataset Description
- Name: Icelandic WinoGrande dataset
- Source: The dataset is described in the IceBERT paper, available at https://aclanthology.org/2022.lrec-1.464.pdf.
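Translation and Localization
- Translation method: The records were manually translated and localized from English; records that could not be localized were skipped. For examples that were single sentences rather than sentence pairs, a corresponding sentence was added.
- Translation accuracy: The translations are often not exact, because accurately preserving the original semantics is not important for this task. For some words, for example, it was difficult to satisfy every constraint at once: gender, number, and case all have to match, and the word choice still has to be appropriate.
- Translation tools: Because of inflection, each candidate word had to be chosen with great precision, so machine translation was not used, either as a starting point or as a reference.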
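Evaluation
- Evaluation tool: An example evaluation script, eval.py, is provided. It sets up a simple baseline task on the dataset for evaluating models that have not been instruction-tuned.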
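The following is a minimal sketch of how such a baseline is commonly set up for WinoGrande-style data; it is not the contents of eval.py. It fills the blank with each candidate, scores both filled-in sentences, and picks the higher-scoring one. The field names (sentence, option1, option2, answer) and the underscore blank marker follow the original English WinoGrande and are assumptions here, as is the hypothetical score_fn, which stands in for a language model's sentence log-likelihood.

```python
from typing import Callable, Dict, Iterable

BLANK = "_"  # assumed blank marker, as in the original English WinoGrande

Example = Dict[str, str]
ScoreFn = Callable[[str], float]  # hypothetical: higher means more plausible


def fill_blank(sentence: str, option: str) -> str:
    """Substitute one candidate into the first blank of the sentence."""
    if BLANK not in sentence:
        raise ValueError("sentence has no blank to fill")
    return sentence.replace(BLANK, option, 1)


def predict(example: Example, score_fn: ScoreFn) -> str:
    """Return "1" or "2" for whichever filled-in sentence scores higher."""
    score1 = score_fn(fill_blank(example["sentence"], example["option1"]))
    score2 = score_fn(fill_blank(example["sentence"], example["option2"]))
    return "1" if score1 >= score2 else "2"  # ties go to option 1


def accuracy(examples: Iterable[Example], score_fn: ScoreFn) -> float:
    """Fraction of examples where the prediction matches the gold answer."""
    examples = list(examples)
    if not examples:
        return 0.0
    correct = sum(predict(ex, score_fn) == ex["answer"] for ex in examples)
    return correct / len(examples)


# Toy English example and toy scorer, for illustration only.
# Neither comes from the dataset or from a real model.
toy_example = {
    "sentence": "The trophy does not fit in the suitcase because the _ is too big.",
    "option1": "trophy",
    "option2": "suitcase",
    "answer": "1",
}


def toy_score(text: str) -> float:
    """Stand-in for a model log-likelihood; prefers one fixed phrase."""
    return 0.0 if "the trophy is too big" in text else -1.0


print(accuracy([toy_example], toy_score))
```

In practice score_fn would wrap a pretrained language model, for instance by summing token log-probabilities over the filled-in sentence. That choice, and any length normalization, is left open in this sketch.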
Citation
- Citation format:
@inproceedings{snaebjarnarson-etal-2022-warm,
    title = "A Warm Start and a Clean Crawled Corpus - A Recipe for Good Language Models",
    author = "Sn{\ae}bjarnarson, V{\'e}steinn and S{\'\i}monarson, Haukur Barri and Ragnarsson, P{\'e}tur Orri and Ing{\'o}lfsd{\'o}ttir, Svanhv{\'\i}t Lilja and J{\'o}nsson, Haukur and Thorsteinsson, Vilhjalmur and Einarsson, Hafsteinn",
    editor = "Calzolari, Nicoletta and B{\'e}chet, Fr{\'e}d{\'e}ric and Blache, Philippe and Choukri, Khalid and Cieri, Christopher and Declerck, Thierry and Goggi, Sara and Isahara, Hitoshi and Maegaard, Bente and Mariani, Joseph and Mazo, H{\'e}l{\`e}ne and Odijk, Jan and Piperidis, Stelios",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.464",
    pages = "4356--4366",
    abstract = "We train several language models for Icelandic, including IceBERT, that achieve state-of-the-art performance in a variety of downstream tasks, including part-of-speech tagging, named entity recognition, grammatical error detection and constituency parsing. To train the models we introduce a new corpus of Icelandic text, the Icelandic Common Crawl Corpus (IC3), a collection of high quality texts found online by targeting the Icelandic top-level-domain .is. Several other public data sources are also collected for a total of 16GB of Icelandic text. To enhance the evaluation of model performance and to raise the bar in baselines for Icelandic, we manually translate and adapt the WinoGrande commonsense reasoning dataset. Through these efforts we demonstrate that a properly cleaned crawled corpus is sufficient to achieve state-of-the-art results in NLP applications for low to medium resource languages, by comparison with models trained on a curated corpus. We further show that initializing models using existing multilingual models can lead to state-of-the-art results for some downstream tasks.",
}



