alexandrainst/scandi-wiki

Name: alexandrainst/scandi-wiki
Creator: alexandrainst
Published: 2023-01-16 13:55:38
License: CC BY-SA 4.0

Source: Hugging Face
Updated: 2023-01-16
Indexed: 2024-03-04

Download link:

https://hf-mirror.com/datasets/alexandrainst/scandi-wiki


Description:

---
pretty_name: ScandiWiki
language:
- da
- sv
- no
- nb
- nn
- is
- fo
license:
- cc-by-sa-4.0
multilinguality:
- multilingual
size_categories:
- 1M<n<10M
source_datasets:
- wikipedia
task_categories:
- fill-mask
- text-generation
- feature-extraction
task_ids:
- language-modeling
---

# Dataset Card for ScandiWiki

## Dataset Description

- **Point of Contact:** [Dan Saattrup Nielsen](mailto:dan.nielsen@alexandra.dk)
- **Total amount of disk used:** 4485.90 MB

### Dataset Summary

ScandiWiki is a parsed and deduplicated Wikipedia dump in Danish, Norwegian Bokmål, Norwegian Nynorsk, Swedish, Icelandic and Faroese.

### Supported Tasks and Leaderboards

This dataset is intended for general language modelling.

### Languages

The dataset is available in Danish (`da`), Swedish (`sv`), Norwegian Bokmål (`nb`), Norwegian Nynorsk (`nn`), Icelandic (`is`) and Faroese (`fo`).

## Dataset Structure

### Data Instances

- **Total amount of disk used:** 4485.90 MB

An example from the `train` split of the `fo` subset looks as follows.

```
{
    'id': '3380',
    'url': 'https://fo.wikipedia.org/wiki/Enk%C3%B6pings%20kommuna',
    'title': 'Enköpings kommuna',
    'text': 'Enköpings kommuna (svenskt: Enköpings kommun), er ein kommuna í Uppsala län í Svøríki. Enköpings kommuna hevur umleið 40.656 íbúgvar (2013).\n\nKeldur \n\nKommunur í Svøríki'
}
```

### Data Fields

The data fields are the same among all splits.

- `id`: a `string` feature.
- `url`: a `string` feature.
- `title`: a `string` feature.
- `text`: a `string` feature.

### Data Subsets

| name     |   samples |
|----------|----------:|
| sv       | 2,469,978 |
| nb       |   596,593 |
| da       |   287,216 |
| nn       |   162,776 |
| is       |    55,418 |
| fo       |    12,582 |

## Dataset Creation

### Curation Rationale

It takes quite a long time to parse the Wikipedia dump as well as to deduplicate it, so this dataset is primarily for convenience.

### Source Data

The original data is from the [wikipedia dataset](https://huggingface.co/datasets/wikipedia).

## Additional Information

### Dataset Curators

[Dan Saattrup Nielsen](https://saattrupdan.github.io/) from [The Alexandra Institute](https://alexandra.dk/) curated this dataset.

### Licensing Information

The dataset is licensed under the [CC BY-SA 4.0 license](https://creativecommons.org/licenses/by-sa/4.0/), the same license as the [wikipedia dataset](https://huggingface.co/datasets/wikipedia).
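The card's example mentions the `train` split of the `fo` subset, which suggests that each language code is a separate configuration. A minimal loading sketch under that assumption, using the Hugging Face `datasets` library, might look like this. The helper name `load_subset` is hypothetical, and the configuration names are assumed to match the language codes listed in the card.

```python
# Language codes listed in the card; assumed here to double as configuration names.
SUBSETS = ("da", "sv", "nb", "nn", "is", "fo")


def load_subset(name: str, split: str = "train"):
    """Load one language subset of ScandiWiki (hypothetical helper)."""
    if name not in SUBSETS:
        raise ValueError(f"unknown subset {name!r}; expected one of {SUBSETS}")
    # Imported lazily so the validation above works without the library installed.
    from datasets import load_dataset

    # Assumption: the repository id matches the page title and the
    # subset name is passed as the configuration name.
    return load_dataset("alexandrainst/scandi-wiki", name, split=split)
```

Calling `load_subset("fo")` would then download the Faroese subset, the smallest one, which makes it a reasonable first check before pulling the much larger `sv` subset.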

Provider:

alexandrainst

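Summary of the original information

ScandiWiki dataset overview

Dataset description

Dataset name: ScandiWiki
Languages: Danish (da), Swedish (sv), Norwegian (Bokmål nb, Nynorsk nn), Icelandic (is), Faroese (fo)
License: CC BY-SA 4.0
Multilinguality: multilingual
Size category: 1M<n<10M
Source dataset: Wikipedia
Task categories: fill-mask, text generation, feature extraction
Task IDs: language modeling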

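Dataset summary

ScandiWiki is a parsed and deduplicated Wikipedia dataset covering Danish, Norwegian (Bokmål and Nynorsk), Swedish, Icelandic and Faroese.

Supported tasks and leaderboards

The dataset is intended for general language modeling.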

Dataset structure

Data instances

Total disk usage: 4485.90 MB

Data fields

id: string
url: string
title: string
text: string
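To make the field list concrete, the sketch below checks that a record has exactly these four string fields, using the Faroese example from the dataset card (its `text` is shortened here). It also shows that the percent-encoded last segment of `url` decodes to the `title` for this record. The helper names `is_valid_record` and `title_from_url` are hypothetical, and whether the URL-to-title relationship holds for every record is an assumption, not something the card states.

```python
from urllib.parse import unquote

FIELDS = ("id", "url", "title", "text")

# Example record from the dataset card; the text is shortened for brevity.
record = {
    "id": "3380",
    "url": "https://fo.wikipedia.org/wiki/Enk%C3%B6pings%20kommuna",
    "title": "Enköpings kommuna",
    "text": "Enköpings kommuna (svenskt: Enköpings kommun), er ein kommuna í Uppsala län í Svøríki.",
}


def is_valid_record(rec: dict) -> bool:
    """Return True if rec has exactly the four documented fields, all strings."""
    return set(rec) == set(FIELDS) and all(isinstance(rec[f], str) for f in FIELDS)


def title_from_url(url: str) -> str:
    """Decode the percent-encoded last path segment of a Wikipedia URL."""
    return unquote(url.rsplit("/", 1)[-1])


print(is_valid_record(record))
print(title_from_url(record["url"]))
```

Note that `id` is a string, not an integer, so a record such as `{"id": 3380, ...}` would fail the check.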

Data subsets

| name | samples   |
|------|----------:|
| sv   | 2,469,978 |
| nb   |   596,593 |
| da   |   287,216 |
| nn   |   162,776 |
| is   |    55,418 |
| fo   |    12,582 |
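As a quick sanity check on the subset sizes, the counts can be summed and compared against the stated size category of 1M<n<10M. The sketch below does only arithmetic on the published figures; the variable names are illustrative.

```python
# Sample counts per subset, copied from the table.
SAMPLES = {
    "sv": 2_469_978,
    "nb": 596_593,
    "da": 287_216,
    "nn": 162_776,
    "is": 55_418,
    "fo": 12_582,
}

total = sum(SAMPLES.values())
shares = {name: count / total for name, count in SAMPLES.items()}

print(f"total samples: {total:,}")  # total samples: 3,584,563
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The total of 3,584,563 falls inside the 1M<n<10M category, and Swedish alone accounts for roughly 69% of all samples, which is worth keeping in mind when mixing subsets for training.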

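Dataset creation

Curation rationale

The dataset exists primarily for convenience, because parsing and deduplicating the Wikipedia dump takes quite a long time.

Source data

The original data comes from the Wikipedia dataset.

Licensing information

The dataset is licensed under CC BY-SA 4.0, the same license as the Wikipedia dataset.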
