alperngll/Cosmos-Turkish-Corpus-v1.0
Source: Hugging Face · Updated: 2026-02-27 · Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/alperngll/Cosmos-Turkish-Corpus-v1.0
Resource description:
---
dataset_info:
  features:
  - name: url
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 56947576117
    num_examples: 9075453
  download_size: 22825493949
  dataset_size: 56947576117
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc-by-4.0
language:
- tr
pretty_name: c
---
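The size fields in the metadata above can be turned into a few rough per-document figures. The sketch below is only arithmetic on the reported numbers; the function name `corpus_stats` is hypothetical, and the token count is the approximate "~15B" figure from the description, so the bytes-per-token value is a coarse estimate at best.

```python
# Size fields copied from the dataset_info metadata.
NUM_BYTES = 56_947_576_117       # num_bytes / dataset_size of the train split
NUM_EXAMPLES = 9_075_453         # num_examples of the train split
DOWNLOAD_SIZE = 22_825_493_949   # download_size
APPROX_TOKENS = 15e9             # "~15B tokens" from the description (approximate)


def corpus_stats() -> dict:
    """Derive rough averages from the reported sizes (hypothetical helper)."""
    return {
        "avg_bytes_per_example": NUM_BYTES / NUM_EXAMPLES,
        "download_to_dataset_ratio": DOWNLOAD_SIZE / NUM_BYTES,
        "approx_bytes_per_token": NUM_BYTES / APPROX_TOKENS,
    }


if __name__ == "__main__":
    for name, value in corpus_stats().items():
        print(f"{name}: {value:.2f}")
```

The reported byte counts cover both the `url` and `text` columns, so the per-token figure slightly overstates the text-only size.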
This is the Turkish pretraining corpus of the Cosmos AI Research Group.
It contains ~15B tokens, and models continually pretrained on it show competitive performance across various Turkish benchmarks.
Cosmos-Turkish-Corpus is collected from a wide range of Turkish websites, including forums, news sources, Wikipedia, and more.
URL-based deduplication has been applied; however, additional content-level deduplication and filtering may be required before use.
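As one possible starting point for the content-level deduplication mentioned above, here is a minimal sketch of exact-match deduplication on whitespace-normalized text. It is not the authors' pipeline: the helpers `normalize`, `dedup_exact`, and `stream_corpus` are hypothetical names, the `min_chars` threshold is an illustrative assumption, and near-duplicate detection (for example MinHash) would need a different technique. Loading uses `load_dataset` from the Hugging Face `datasets` library in streaming mode, imported lazily so the deduplication logic can be tried without it.

```python
import hashlib
import re
from typing import Dict, Iterable, Iterator

_WHITESPACE = re.compile(r"\s+")


def normalize(text: str) -> str:
    """Collapse whitespace runs and trim. Case is left untouched on purpose,
    since Turkish casing (I/ı, İ/i) is locale-sensitive."""
    return _WHITESPACE.sub(" ", text).strip()


def dedup_exact(records: Iterable[Dict[str, str]],
                min_chars: int = 1) -> Iterator[Dict[str, str]]:
    """Yield the first record for each distinct normalized text.

    Records whose normalized text is shorter than min_chars are dropped.
    Memory grows with the number of distinct documents seen.
    """
    seen = set()
    for record in records:
        norm = normalize(record["text"])
        if len(norm) < min_chars:
            continue
        digest = hashlib.sha1(norm.encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield record


def stream_corpus():
    """Stream the train split without downloading it fully (needs `datasets`)."""
    from datasets import load_dataset  # lazy import: only needed for real data
    return load_dataset("alperngll/Cosmos-Turkish-Corpus-v1.0",
                        split="train", streaming=True)


if __name__ == "__main__":
    sample = [
        {"url": "a", "text": "Merhaba  dünya"},
        {"url": "b", "text": "Merhaba dünya\n"},
        {"url": "c", "text": "Başka metin"},
    ]
    for rec in dedup_exact(sample):
        print(rec["url"])
```

On the real corpus the same generator could be applied as `dedup_exact(stream_corpus())`, since each streamed row is a dict with `url` and `text` fields.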
Provided by:
alperngll



