danish-foundation-models/icelandic-dynaword
Source: Hugging Face. Updated 2026-04-23; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/danish-foundation-models/icelandic-dynaword
Description:
Icelandic Dynaword is a collection of Icelandic free-form text datasets from various domains. All datasets in the collection are openly licensed and deemed permissible for training large language models. The collection is under continual development and is updated as new datasets become available; contributions are welcome. It contains 345 samples totalling 32.40M tokens, with an average document length of 93.92K tokens. The primary language is Icelandic, with possibly small amounts of English and other languages in quotations or embedded references. The dataset card documents data instances, data fields, and data splits. The dataset was created to make openly licensed Icelandic data available for language model development and other uses. The data generally carries no annotation beyond the metadata attached to each sample. Source data includes Icelandic parliamentary speeches, among others.
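The three figures quoted in the description (sample count, total tokens, average document length) can be cross-checked against each other. The sketch below is only an arithmetic sanity check on the rounded numbers as listed; the constant and function names are illustrative and not part of the dataset or any library.

```python
# Figures as quoted in the description (rounded values, not recomputed from the data).
NUM_SAMPLES = 345
MEAN_TOKENS_PER_DOC = 93.92e3    # 93.92K tokens per document
REPORTED_TOTAL_TOKENS = 32.40e6  # 32.40M tokens in total


def implied_total_tokens(num_samples: int, mean_tokens: float) -> float:
    """Total token count implied by a sample count and a mean document length."""
    return num_samples * mean_tokens


implied = implied_total_tokens(NUM_SAMPLES, MEAN_TOKENS_PER_DOC)
relative_gap = abs(implied - REPORTED_TOTAL_TOKENS) / REPORTED_TOTAL_TOKENS

print(f"implied total: {implied / 1e6:.2f}M tokens")
print(f"relative gap vs. reported total: {relative_gap:.4%}")
```

A small relative gap here only indicates that the listed figures are mutually consistent up to rounding; it says nothing about how the tokens were counted.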
Provided by:
danish-foundation-models



