ytu-ce-cosmos/Cosmos-Turkish-Corpus-v1.0

Name: ytu-ce-cosmos/Cosmos-Turkish-Corpus-v1.0
Creator: ytu-ce-cosmos
Published: 2025-12-02 15:05:34
License: cc-by-4.0

Source: Hugging Face. Updated 2025-12-02; indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/ytu-ce-cosmos/Cosmos-Turkish-Corpus-v1.0

Description:

---
dataset_info:
  features:
  - name: url
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 56947576117
    num_examples: 9075453
  download_size: 22825493949
  dataset_size: 56947576117
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc-by-4.0
language:
- tr
pretty_name: c
---

This is the Turkish pretraining corpus of the Cosmos AI Research Group. It contains ~15B tokens and demonstrates competitive performance across various Turkish benchmarks when used in continual pretraining setups.

Cosmos-Turkish-Corpus is collected from a wide range of Turkish websites, including forums, news sources, Wikipedia, and more. URL-based deduplication has been applied; however, additional content-level deduplication and filtering may be required before use.
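Since the description notes that only URL-based deduplication has been applied, the following is a minimal sketch of one possible content-level pass: exact deduplication on the `text` field after light normalization. It is not the authors' pipeline. The helper names (`normalize`, `dedup_exact`, `stream_corpus`) and the sample records are hypothetical, and the normalization (NFKC plus whitespace collapsing, no case folding) is an assumption chosen to avoid Turkish dotted/dotless i pitfalls. Near-duplicates would need a different technique, such as MinHash.

```python
import hashlib
import unicodedata
from typing import Dict, Iterable, Iterator


def normalize(text: str) -> str:
    """NFKC-normalize and collapse whitespace so trivially different copies hash alike."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def dedup_exact(records: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Yield only the first record seen for each distinct normalized `text` value."""
    seen = set()
    for record in records:
        digest = hashlib.sha256(normalize(record["text"]).encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield record


def stream_corpus():
    """Stream the train split instead of downloading it up front.

    Requires the `datasets` package and network access; not called below.
    """
    from datasets import load_dataset

    return load_dataset(
        "ytu-ce-cosmos/Cosmos-Turkish-Corpus-v1.0", split="train", streaming=True
    )


# Hypothetical in-memory records with the same two fields as the corpus.
sample = [
    {"url": "https://example.com/a", "text": "Merhaba dünya"},
    {"url": "https://example.com/b", "text": "Merhaba   dünya\n"},
    {"url": "https://example.com/c", "text": "Başka bir metin"},
]
unique = list(dedup_exact(sample))
print([r["url"] for r in unique])
```

To run the same pass over the real corpus, `dedup_exact(stream_corpus())` could be iterated lazily. The set of digests grows with the number of distinct documents, so with roughly 9 million examples it should stay within a few hundred megabytes of memory.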

Provider:

ytu-ce-cosmos
