
mideind/icelandic-winogrande

Source: Hugging Face | Updated: 2024-06-25 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/mideind/icelandic-winogrande
Summary:

This is the Icelandic WinoGrande dataset described in the IceBERT paper. The records were manually translated and localized from English, preserving the original meaning only approximately. Because Icelandic is a heavily inflected language, the selection and translation of certain words required extreme precision. The dataset also includes an evaluation script, `eval.py`, for benchmarking models that have not been instruction-tuned.
Provider:
mideind
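Summary of the original dataset card

Icelandic WinoGrande dataset overview

Dataset description

Translation and localization

  • Translation method: Records were manually translated and localized from English; records that could not be localized were skipped. For examples that were single sentences rather than sentence pairs, a corresponding sentence was added.
  • Translation fidelity: The translations are not exact, because accurately preserving the original semantics is not important. For some words it was hard to satisfy every constraint (gender, number, and case must agree), or the literal word choice was unsuitable.
  • Translation tools: Because of inflection, every candidate word had to be chosen with great precision, so machine translation was not used, either as a starting point or as a reference.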

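Evaluation

  • Evaluation tool: A sample evaluation script, eval.py, is provided. It sets up a simple benchmark task on the dataset for evaluating models that have not been instruction-tuned.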
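
The listing does not show what eval.py does internally, so the following is only a minimal sketch of one common way to benchmark a non-instruction-tuned model on WinoGrande-style data: fill the blank with each candidate and keep the one the model scores higher. The field names (`sentence`, `option1`, `option2`, `answer`) and the `_` placeholder follow the English WinoGrande and are assumptions here, not a confirmed schema for this dataset. `toy_score` is a hypothetical stand-in; in practice it would be replaced by a language model's log-likelihood for the filled-in sentence.

```python
from typing import Callable, Iterable, Mapping

BLANK = "_"  # assumed placeholder, as in the English WinoGrande


def fill_blank(sentence: str, option: str) -> str:
    """Replace the first blank placeholder with a candidate word."""
    if BLANK not in sentence:
        raise ValueError("sentence has no blank placeholder")
    return sentence.replace(BLANK, option, 1)


def pick_option(record: Mapping[str, str],
                score_fn: Callable[[str], float]) -> str:
    """Return "1" or "2", whichever filled-in sentence scores higher."""
    score1 = score_fn(fill_blank(record["sentence"], record["option1"]))
    score2 = score_fn(fill_blank(record["sentence"], record["option2"]))
    return "1" if score1 >= score2 else "2"


def accuracy(records: Iterable[Mapping[str, str]],
             score_fn: Callable[[str], float]) -> float:
    """Fraction of records where the higher-scoring option is the answer."""
    records = list(records)
    if not records:
        raise ValueError("no records to evaluate")
    correct = sum(pick_option(r, score_fn) == str(r["answer"]) for r in records)
    return correct / len(records)


def toy_score(text: str) -> float:
    """Hypothetical scorer that simply prefers shorter text.

    A real run would return the model's log-likelihood of `text`.
    """
    return -float(len(text))


# Two illustrative English records (not taken from the dataset).
toy_records = [
    {"sentence": "The trophy did not fit in the suitcase because the _ was too big.",
     "option1": "trophy", "option2": "suitcase", "answer": "1"},
    {"sentence": "The trophy did not fit in the suitcase because the _ was too small.",
     "option1": "trophy", "option2": "suitcase", "answer": "2"},
]

print(accuracy(toy_records, toy_score))  # 0.5: the toy scorer always picks "trophy"
```

Comparing whole-sentence scores avoids any prompt format, which is why this style of evaluation suits models that have not been instruction-tuned. A scorer that ignores context, like the toy one above, lands at chance level on paired examples.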

Citation

  • Citation format:

@inproceedings{snaebjarnarson-etal-2022-warm,
    title = "A Warm Start and a Clean Crawled Corpus - A Recipe for Good Language Models",
    author = "Sn{\ae}bjarnarson, V{\'e}steinn and
      S{\'\i}monarson, Haukur Barri and
      Ragnarsson, P{\'e}tur Orri and
      Ing{\'o}lfsd{\'o}ttir, Svanhv{\'\i}t Lilja and
      J{\'o}nsson, Haukur and
      Thorsteinsson, Vilhjalmur and
      Einarsson, Hafsteinn",
    editor = "Calzolari, Nicoletta and
      B{\'e}chet, Fr{\'e}d{\'e}ric and
      Blache, Philippe and
      Choukri, Khalid and
      Cieri, Christopher and
      Declerck, Thierry and
      Goggi, Sara and
      Isahara, Hitoshi and
      Maegaard, Bente and
      Mariani, Joseph and
      Mazo, H{\'e}l{\`e}ne and
      Odijk, Jan and
      Piperidis, Stelios",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.464",
    pages = "4356--4366",
    abstract = "We train several language models for Icelandic, including IceBERT, that achieve state-of-the-art performance in a variety of downstream tasks, including part-of-speech tagging, named entity recognition, grammatical error detection and constituency parsing. To train the models we introduce a new corpus of Icelandic text, the Icelandic Common Crawl Corpus (IC3), a collection of high quality texts found online by targeting the Icelandic top-level-domain .is. Several other public data sources are also collected for a total of 16GB of Icelandic text. To enhance the evaluation of model performance and to raise the bar in baselines for Icelandic, we manually translate and adapt the WinoGrande commonsense reasoning dataset. Through these efforts we demonstrate that a properly cleaned crawled corpus is sufficient to achieve state-of-the-art results in NLP applications for low to medium resource languages, by comparison with models trained on a curated corpus. We further show that initializing models using existing multilingual models can lead to state-of-the-art results for some downstream tasks.",
}
