jealk/wiki40b-da-clean

Name: jealk/wiki40b-da-clean
Creator: jealk
Published: 2024-11-17 20:31:54
License: not specified

Source: Hugging Face. Updated 2024-11-17; indexed 2024-12-14.

Download link:

https://hf-mirror.com/datasets/jealk/wiki40b-da-clean

Description:

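This dataset is a slightly modified and filtered version of the [Wiki40b-da dataset](https://huggingface.co/datasets/alexandrainst/wiki40b-da/), which is itself a fork of [this dataset on the Hugging Face Hub](https://huggingface.co/datasets/wiki40b). It contains two subsets, and the original columns `wikidata_id` and `version_id` have been removed from both. `text` contains the filtered text of the Wikipedia paragraphs with formatting removed (markers such as `_START_ARTICLE_` and `_START_PARAGRAPH_` are stripped). `sentences` contains the sentences from the whole `text` subset, keeping only sentences of 5 to 100 words; sentences are split after any punctuation mark (!, ?, .) that is followed by a space and a capital letter. The dataset is curated so that the `text` config can be used for masked next token prediction (MNTP) and the `sentences` config for SimCSE when training encoder and decoder models. The train, validation, and test splits are the original ones. The language of the dataset is Danish (`da`).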

This dataset is a slightly modified and filtered version of the Wiki40b-da dataset, containing two subsets: text and sentences. The text subset contains filtered Wikipedia paragraph text with formatting removed. The sentences subset contains sentences extracted from the text subset, filtered to include only sentences with lengths between 5 and 100 words. The dataset is suitable for masked next token prediction (MNTP) and SimCSE model training. The dataset is in Danish and includes train, validation, and test splits.
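The sentence rule described above (split after `.`, `!`, or `?` when followed by a space and a capital letter, then keep sentences of 5 to 100 words) can be sketched roughly as follows. This is an illustrative reconstruction, not the creator's actual preprocessing code: the function names, the inclusive treatment of the 5 and 100 bounds, the whitespace-based word count, and the explicit handling of the Danish capitals Æ, Ø, Å are all assumptions.

```python
import re

# Assumed boundary: '.', '!' or '?' followed by whitespace and an uppercase
# letter. Danish capitals Æ, Ø, Å are added explicitly (an assumption).
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-ZÆØÅ])")


def split_sentences(paragraph):
    """Split a paragraph into sentences at the assumed boundary."""
    return [s.strip() for s in _SENTENCE_BOUNDARY.split(paragraph) if s.strip()]


def filter_by_word_count(sentences, min_words=5, max_words=100):
    """Keep sentences whose whitespace-separated word count is within bounds.

    The bounds are treated as inclusive here, which is an assumption.
    """
    return [s for s in sentences if min_words <= len(s.split()) <= max_words]


paragraph = (
    "København er Danmarks hovedstad og landets største by. "
    "Ja! "
    "Århus ligger i Jylland og er den næststørste by."
)

# The one-word sentence "Ja!" is dropped by the length filter.
sentences = filter_by_word_count(split_sentences(paragraph))
for sentence in sentences:
    print(sentence)
```

A regex split of this kind is cheap but will mis-split after abbreviations followed by a capitalized word, so the resulting `sentences` subset should be read as approximate sentence segmentation.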

Provider:

jealk
