Ba2han/merged-datasets_11-12

Name: Ba2han/merged-datasets_11-12
Creator: Ba2han
Published: 2025-12-12 01:28:18
License: not specified

Source: Hugging Face. Updated 2025-12-12. Indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/Ba2han/merged-datasets_11-12

Overview:

This dataset merges text from multiple sources, filtered for length and quality. It primarily contains Turkish, English, Azerbaijani, and Turkmen text. Processing drops examples shorter than a minimum length (the description cites both 300 and 250 characters) and examples longer than 5,800 characters. Some sources received special handling: long BILGE-Wiki articles were split by sentence, and Ultra-FineWeb was filtered by score and capped at 2.5 million samples. The dataset statistics list, for each sub-dataset, the sample count and the mean, median, maximum, and minimum length.
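The filtering steps described above can be sketched roughly as follows. This is a minimal illustration, not the creator's actual pipeline: the function names, the `score` field, the regex-based sentence splitter, and the choice to keep the first rows that pass the score threshold are all assumptions, and the description does not say which sources use the 250-character minimum versus the 300-character one.

```python
import re

# Thresholds taken from the description; which minimum applies to which
# source is not stated, so it is left as a parameter (assumption).
MIN_CHARS = 300
MAX_CHARS = 5800


def keep_by_length(text, min_chars=MIN_CHARS, max_chars=MAX_CHARS):
    """Return True if the text is within the inclusive character bounds."""
    return min_chars <= len(text) <= max_chars


def split_long_article(text, max_chars=MAX_CHARS):
    """Greedily pack whole sentences into chunks of at most max_chars.

    Hypothetical stand-in for the BILGE-Wiki sentence splitting. A single
    sentence longer than max_chars is emitted as its own oversized chunk,
    which a later length filter would drop.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def filter_by_score(rows, threshold, cap=2_500_000):
    """Keep rows whose 'score' meets the threshold, stopping at cap rows.

    Hypothetical stand-in for the Ultra-FineWeb step; the real threshold
    and the selection order are not given in the description.
    """
    kept = []
    for row in rows:
        if row["score"] >= threshold:
            kept.append(row)
            if len(kept) >= cap:
                break
    return kept
```

For example, `keep_by_length(text, min_chars=250)` would apply the lower of the two cited minimums, and `split_long_article` can be run before the length filter so that long articles contribute several in-range examples instead of being dropped.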

Provider:

Ba2han
