alexandrainst/wiki40b-da

Name: alexandrainst/wiki40b-da
Creator: alexandrainst
Published: 2023-10-27 19:08:09
License: cc-by-sa-4.0

Hugging Face · updated 2023-10-27 · indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/alexandrainst/wiki40b-da

Resource description:

---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: wikidata_id
    dtype: string
  - name: text
    dtype: string
  - name: version_id
    dtype: string
  splits:
  - name: train
    num_bytes: 220855898
    num_examples: 109486
  - name: validation
    num_bytes: 12416304
    num_examples: 6173
  - name: test
    num_bytes: 12818380
    num_examples: 6219
  download_size: 150569852
  dataset_size: 246090582
license: cc-by-sa-4.0
task_categories:
- text-generation
language:
- da
pretty_name: Wiki40b-da
size_categories:
- 100K<n<1M
---

# Dataset Card for "wiki40b-da"

## Dataset Description

- **Point of Contact:** [Dan Saattrup Nielsen](mailto:dan.nielsen@alexandra.dk)
- **Size of downloaded dataset files:** 150.57 MB
- **Size of the generated dataset:** 246.09 MB
- **Total amount of disk used:** 396.66 MB

### Dataset Summary

This dataset is an upload of the Danish part of the [Wiki40b dataset](https://aclanthology.org/2020.lrec-1.297), which is a cleaned version of a Wikipedia dump.

The dataset is identical in content to [this dataset on the Hugging Face Hub](https://huggingface.co/datasets/wiki40b), but that one requires `apache_beam`, `tensorflow` and `mwparserfromhell`, which can lead to dependency issues because these packages are not compatible with several newer packages.

The training, validation and test splits are the original ones.

### Languages

The dataset is available in Danish (`da`).

## Dataset Structure

### Data Instances

An example from the dataset looks as follows.

```
{
    'wikidata_id': 'Q17341862',
    'text': "\n_START_ARTICLE_\nÆgyptiske tekstiler\n_START_PARAGRAPH_\nTekstiler havde mange (...)",
    'version_id': '9018011197452276273'
}
```

### Data Fields

The data fields are the same among all splits.

- `wikidata_id`: a `string` feature.
- `text`: a `string` feature.
- `version_id`: a `string` feature.

### Dataset Statistics

There are 109,486 samples in the training split, 6,173 samples in the validation split and 6,219 samples in the test split.

#### Document Length Distribution

![image/png](https://cdn-uploads.huggingface.co/production/uploads/60d368a613f774189902f555/dn-7_ugJObyF-CkD6XoO-.png)

## Additional Information

### Dataset Curators

[Dan Saattrup Nielsen](https://saattrupdan.github.io/) from [the Alexandra Institute](https://alexandra.dk/) uploaded it to the Hugging Face Hub.

### Licensing Information

The dataset is licensed under the [CC-BY-SA license](https://creativecommons.org/licenses/by-sa/4.0/).
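Since the card's configuration points at plain data files under `data/`, the dataset should be loadable with the Hugging Face `datasets` library alone. The following is a minimal sketch under that assumption; the helper name `load_wiki40b_da` and the constants are illustrative and not part of the dataset or the library, and calling the helper requires network access and an installed `datasets` package.

```python
from typing import Any

REPO_ID = "alexandrainst/wiki40b-da"  # repository name taken from the card
SPLITS = ("train", "validation", "test")  # splits listed in the card's config


def load_wiki40b_da(split: str = "train") -> Any:
    """Load one split from the Hugging Face Hub (hypothetical helper)."""
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so that defining this helper does not require `datasets`.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split)
```

A call such as `load_wiki40b_da("validation")` would then return the validation split, whose records carry the `wikidata_id`, `text` and `version_id` fields described above.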

Provided by:

alexandrainst

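Summary of original information

Dataset card "wiki40b-da"

Dataset description

Dataset summary

This dataset is the Danish part of the Wiki40b dataset, a cleaned version of a Wikipedia dump.

The training, validation and test splits are the original ones.

Languages

The dataset is available in Danish (da).

Dataset structure

Data instances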

An example from the dataset looks as follows:

    {
      "wikidata_id": "Q17341862",
      "text": "\n_START_ARTICLE_\nÆgyptiske tekstiler\n_START_PARAGRAPH_\nTekstiler havde mange (...)",
      "version_id": "9018011197452276273"
    }
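The `text` field in the card's example is a single string in which `_START_ARTICLE_` and `_START_PARAGRAPH_` markers delimit the title and the paragraphs. A minimal sketch of splitting such a string back into its parts is shown below. The `Article` class and `parse_text` function are illustrative names, and the handling of `_START_SECTION_` and `_NEWLINE_` is an assumption based on the upstream Wiki40b format, because the truncated example in the card does not show them.

```python
import re
from dataclasses import dataclass, field

# _START_ARTICLE_ and _START_PARAGRAPH_ appear in the card's example record.
# _START_SECTION_ and _NEWLINE_ are assumed from the upstream Wiki40b format.
MARKER = re.compile(r"(_START_ARTICLE_|_START_SECTION_|_START_PARAGRAPH_)")


@dataclass
class Article:
    title: str = ""
    sections: list[str] = field(default_factory=list)
    paragraphs: list[str] = field(default_factory=list)


def parse_text(text: str) -> Article:
    """Split a marker-delimited `text` value into title, sections and paragraphs."""
    article = Article()
    # With a capturing group, re.split yields [prefix, marker, body, marker, body, ...].
    parts = MARKER.split(text)
    for marker, body in zip(parts[1::2], parts[2::2]):
        body = body.strip()
        if marker == "_START_ARTICLE_":
            article.title = body
        elif marker == "_START_SECTION_":
            article.sections.append(body)
        else:
            # Assumed convention: _NEWLINE_ marks a line break inside a paragraph.
            article.paragraphs.append(body.replace("_NEWLINE_", "\n"))
    return article


# The (truncated) example record from the dataset card.
sample = "\n_START_ARTICLE_\nÆgyptiske tekstiler\n_START_PARAGRAPH_\nTekstiler havde mange (...)"
article = parse_text(sample)
print(article.title)
print(article.paragraphs)
```

Splitting on a capturing group keeps each marker next to the text it introduces, so the parser does not depend on the exact placement of newlines around the markers.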

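Data fields

The data fields are the same across all splits:

wikidata_id: a string feature.
text: a string feature.
version_id: a string feature.

Dataset statistics

The training split has 109,486 samples, the validation split has 6,173 samples and the test split has 6,219 samples.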
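The split sizes can be turned into proportions and average record sizes with a few lines of arithmetic. The sketch below uses only the `num_examples` and `num_bytes` figures reported in the card's metadata; the variable names are illustrative.

```python
# Split sizes as reported in the dataset card's metadata.
num_examples = {"train": 109_486, "validation": 6_173, "test": 6_219}
num_bytes = {"train": 220_855_898, "validation": 12_416_304, "test": 12_818_380}

total_examples = sum(num_examples.values())
total_bytes = sum(num_bytes.values())

for split, n in num_examples.items():
    share = 100 * n / total_examples  # percentage of all examples
    mean_bytes = num_bytes[split] / n  # average size of one example
    print(f"{split:<10} {n:>7} examples  {share:5.1f}%  ~{mean_bytes:,.0f} bytes/example")

print(f"total      {total_examples:>7} examples  {total_bytes:,} bytes")
```

The totals come to 121,878 examples and 246,090,582 bytes, the latter matching the `dataset_size` value in the card, and the splits are divided roughly 90/5/5 between training, validation and test.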

Additional information

Dataset license

The dataset is licensed under the CC-BY-SA license.
