alexandrainst/scandi-wiki
Source: Hugging Face. Updated: 2023-01-16. Indexed: 2024-03-04.
Download link: https://hf-mirror.com/datasets/alexandrainst/scandi-wiki
Resource description:
---
pretty_name: ScandiWiki
language:
- da
- sv
- no
- nb
- nn
- is
- fo
license:
- cc-by-sa-4.0
multilinguality:
- multilingual
size_categories:
- 1M<n<10M
source_datasets:
- wikipedia
task_categories:
- fill-mask
- text-generation
- feature-extraction
task_ids:
- language-modeling
---
# Dataset Card for ScandiWiki
## Dataset Description
- **Point of Contact:** [Dan Saattrup Nielsen](mailto:dan.nielsen@alexandra.dk)
- **Total amount of disk used:** 4485.90 MB
### Dataset Summary
ScandiWiki is a parsed and deduplicated Wikipedia dump in Danish, Norwegian Bokmål,
Norwegian Nynorsk, Swedish, Icelandic and Faroese.
### Supported Tasks and Leaderboards
This dataset is intended for general language modelling.
### Languages
The dataset is available in Danish (`da`), Swedish (`sv`), Norwegian Bokmål (`nb`),
Norwegian Nynorsk (`nn`), Icelandic (`is`) and Faroese (`fo`).
## Dataset Structure
### Data Instances
An example from the `train` split of the `fo` subset looks as follows.
```
{
'id': '3380',
'url': 'https://fo.wikipedia.org/wiki/Enk%C3%B6pings%20kommuna',
'title': 'Enköpings kommuna',
'text': 'Enköpings kommuna (svenskt: Enköpings kommun), er ein kommuna í Uppsala län í Svøríki. Enköpings kommuna hevur umleið 40.656 íbúgvar (2013).\n\nKeldur \n\nKommunur í Svøríki'
}
```
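In the example above, the `text` field separates the article body from trailing section headings (such as `Keldur`) with blank lines. A minimal sketch of splitting a record into paragraphs follows. It assumes this blank-line convention holds for other records too, which is inferred from this single example rather than documented. `split_paragraphs` is a hypothetical helper, not part of the dataset or any library.

```python
def split_paragraphs(text: str) -> list[str]:
    """Split an article's text on blank lines, dropping empty chunks."""
    return [chunk.strip() for chunk in text.split("\n\n") if chunk.strip()]


# The example record from the `fo` subset shown above.
example = {
    "id": "3380",
    "url": "https://fo.wikipedia.org/wiki/Enk%C3%B6pings%20kommuna",
    "title": "Enköpings kommuna",
    "text": (
        "Enköpings kommuna (svenskt: Enköpings kommun), er ein kommuna í "
        "Uppsala län í Svøríki. Enköpings kommuna hevur umleið 40.656 "
        "íbúgvar (2013).\n\nKeldur \n\nKommunur í Svøríki"
    ),
}

# Assumption: paragraphs and section headings are separated by "\n\n".
paragraphs = split_paragraphs(example["text"])
for paragraph in paragraphs:
    print(repr(paragraph))
```

Short trailing chunks such as `Keldur` ("Sources") appear to be leftover section headings, so a language-modelling pipeline may want to filter them out.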
### Data Fields
The data fields are the same among all splits.
- `id`: a `string` feature.
- `url`: a `string` feature.
- `title`: a `string` feature.
- `text`: a `string` feature.
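
The field list above can be turned into a quick sanity check. The sketch below is illustrative only: `has_expected_schema` and `load_subset` are hypothetical helpers. `load_subset` assumes that the subset code (for example `fo`) is passed as the configuration name to `datasets.load_dataset` and that each subset has a `train` split, as the example above suggests. It needs the `datasets` package and network access, so it is defined but not called here.

```python
EXPECTED_FIELDS = ("id", "url", "title", "text")


def has_expected_schema(record: dict) -> bool:
    """Return True if the record has exactly the four string fields listed above."""
    return set(record) == set(EXPECTED_FIELDS) and all(
        isinstance(record[name], str) for name in EXPECTED_FIELDS
    )


def load_subset(language: str = "fo"):
    """Load the `train` split of one subset.

    Assumption: the language code doubles as the configuration name.
    """
    # Imported lazily so the schema check above also works without `datasets`.
    from datasets import load_dataset

    return load_dataset("alexandrainst/scandi-wiki", language, split="train")


# A shortened copy of the `fo` example record, used to exercise the check offline.
sample = {
    "id": "3380",
    "url": "https://fo.wikipedia.org/wiki/Enk%C3%B6pings%20kommuna",
    "title": "Enköpings kommuna",
    "text": "Enköpings kommuna (svenskt: Enköpings kommun), er ein kommuna í Uppsala län í Svøríki.",
}
print(has_expected_schema(sample))
```

Note that `id` is a string rather than an integer, so a record with `'id': 3380` would fail this check.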
### Data Subsets
| name | samples |
|----------|----------:|
| sv | 2,469,978 |
| nb | 596,593 |
| da | 287,216 |
| nn | 162,776 |
| is | 55,418 |
| fo | 12,582 |
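
As a worked check on the table above, the subset sizes sum to 3,584,563 samples, which is consistent with the `1M<n<10M` size category in the metadata. Swedish alone accounts for roughly 69% of them. The snippet below simply redoes that arithmetic from the table's numbers; the variable names are illustrative.

```python
# Sample counts copied from the table above.
SUBSET_SAMPLES = {
    "sv": 2_469_978,
    "nb": 596_593,
    "da": 287_216,
    "nn": 162_776,
    "is": 55_418,
    "fo": 12_582,
}

total = sum(SUBSET_SAMPLES.values())
shares = {name: count / total for name, count in SUBSET_SAMPLES.items()}

for name, count in SUBSET_SAMPLES.items():
    print(f"{name}: {count:>9,} samples ({shares[name]:.1%} of total)")
print(f"total: {total:,} samples")
```

Because the subsets differ in size by more than two orders of magnitude, anyone training on several of them together may want to resample the smaller languages.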
## Dataset Creation
### Curation Rationale
Parsing and deduplicating the Wikipedia dump takes a long time, so this dataset is
provided primarily for convenience.
### Source Data
The original data is from the [wikipedia
dataset](https://huggingface.co/datasets/wikipedia).
## Additional Information
### Dataset Curators
[Dan Saattrup Nielsen](https://saattrupdan.github.io/) from [The Alexandra
Institute](https://alexandra.dk/) curated this dataset.
### Licensing Information
The dataset is licensed under the [CC BY-SA 4.0
license](https://creativecommons.org/licenses/by-sa/4.0/), the same license as the
[wikipedia dataset](https://huggingface.co/datasets/wikipedia).
Provider: alexandrainst



