lumees/turkish-corpus-100b
Source: Hugging Face. Updated 2025-11-30; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/lumees/turkish-corpus-100b
Description:
---
language:
- tr
license: apache-2.0
task_categories:
- text-generation
- question-answering
- translation
pretty_name: Lumees Turkish Corpus 100B
size_categories:
- 100B<n<1T
tags:
- turkish
- llm
- foundation-model
- pretraining
- sft
- fineweb
- cosmos
configs:
- config_name: pretrain
  data_files: "pretrain/*.parquet"
- config_name: sft
  data_files: "sft/*.parquet"
---
# Lumees Turkish Corpus 100B (LTC-100B)
## Dataset Summary
The **Lumees Turkish Corpus 100B (LTC-100B)** is a large-scale, deduplicated, and cleaned dataset designed for training foundation models in Turkish. Comprising approximately **105 billion tokens** (estimated with the Qwen/Llama-3 tokenizers), it is one of the largest open resources for Turkish LLM pretraining.
The dataset is engineered for a two-stage training pipeline:
1. **Pretrain Subset (~103B Tokens):** A diverse mix of high-quality web data, synthetic reasoning, encyclopedic knowledge, and flattened instructions for continual pretraining.
2. **SFT Subset (~2.2B Tokens):** A massive collection of instruction-following, mathematical reasoning, and translation pairs for "Instruction Pretraining" or large-scale SFT.
### 🚀 Pilot Subset Available (10B)
For researchers and organizations running single-node experiments (e.g., 1x H100), we provide a **10 Billion Token Pilot Subset**. This subset uses weighted priority sampling (keeping 100% of Synthetic/Wiki data and downsampling the Web data) to ensure high density.
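
As a rough illustration of weighted priority sampling, the sketch below keeps every document from high-priority sources and only a fixed fraction of web documents. The rate table, the `finewiki` source label, and the `keep_document` helper are assumptions made for this example; the card does not publish the actual rates or the sampling code. Hashing the document id (rather than drawing random numbers) is a design choice of the sketch so that reruns select the same documents.

```python
import hashlib

# Hypothetical keep rates. The card states only that synthetic and wiki data
# are kept at 100% and web data is downsampled; 0.08 is an invented figure.
KEEP_RATES = {"cosmos": 1.0, "finewiki": 1.0, "fineweb-2": 0.08}


def keep_document(doc_id: str, source: str, rates=KEEP_RATES, default: float = 1.0) -> bool:
    """Decide deterministically whether a document enters the subset."""
    rate = rates.get(source, default)
    # Map the id to a stable 64-bit bucket so the decision is reproducible.
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    # Keep the document when its bucket falls in the first `rate` share of the range.
    return bucket < rate * 2**64
```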
---
## Dataset Statistics
*Estimates based on Qwen/Llama-3 Tokenization.*
| Subset | Format | File Type | Token Count |
| :--- | :--- | :--- | :--- |
| **Pretrain** | Universal Schema | Parquet (ZSTD) | **~103.26 Billion** |
| **SFT** | ChatML | JSONL | **~2.27 Billion** |
| **Total** | - | - | **~105.53 Billion** |
---
## Data Structure
### 1. Pretraining Subset (`pretrain`)
Optimized for high-throughput streaming with libraries like `datatrove` or `nanotron`.
| Column | Type | Description |
| :--- | :--- | :--- |
| `id` | `string` | Unique UUIDv4 identifier (used to track deduplication). |
| `text` | `string` | The cleaned, deduplicated content. |
| `source` | `string` | Origin dataset (e.g., `fineweb-2`, `cosmos`). |
| `language` | `string` | ISO Code (`tr`). |
| `meta` | `string` | Original metadata (URL, date, title) serialized as JSON string. |
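
Because `meta` is stored as a JSON string rather than a nested column, it has to be decoded before use. The sketch below shows one way to do that while streaming. It assumes the Hugging Face `datasets` library, the `pretrain` config name from the card's front matter, and a default `train` split; the key names inside `meta` in the usage note are illustrative and not confirmed by the card.

```python
import json


def parse_meta(record: dict) -> dict:
    """Return a copy of the record with `meta` decoded from its JSON string."""
    out = dict(record)
    raw = record.get("meta")
    # Empty or missing metadata becomes an empty dict.
    out["meta"] = json.loads(raw) if raw else {}
    return out


def stream_pretrain(repo_id: str = "lumees/turkish-corpus-100b"):
    """Yield decoded records without downloading the full corpus first."""
    # Imported lazily so parse_meta stays usable without `datasets` installed.
    from datasets import load_dataset

    # Assumption: the config exposes a single split named "train".
    ds = load_dataset(repo_id, name="pretrain", split="train", streaming=True)
    for record in ds:
        yield parse_meta(record)
```

With this in place, `next(stream_pretrain())["meta"].get("url")` would return the source URL of the first record, if that key is present.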
### 2. SFT Subset (`sft`)
Optimized for "Instruction Pretraining" or Fine-Tuning.
| Column | Type | Description |
| :--- | :--- | :--- |
| `messages` | `list` | Standard OpenAI format: `[{"role": "user", ...}, {"role": "assistant", ...}]` |
| `source` | `string` | Origin task (e.g., `instruc_turca`, `open_math`). |
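
For trainers that expect a single string per example, the `messages` list can be rendered into ChatML text. This is a minimal sketch: `to_chatml` is a hypothetical helper, and it assumes each message carries a `content` key alongside `role`, as in the OpenAI chat format (the card elides the remaining keys).

```python
def to_chatml(messages: list) -> str:
    """Render OpenAI-style messages as a ChatML-formatted string."""
    return "".join(
        # Each turn is wrapped in the ChatML start and end markers.
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    )
```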
---
## Data Composition
This corpus was built using a **Weighted Priority** strategy, blending massive web scale with high-density reasoning data.
| Source | Type | Usage Phase | Description |
| :--- | :--- | :--- | :--- |
| **[FineWeb-2 (Turkish)](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2)** | Web Crawl | Pretrain | The backbone of the corpus (cleaned web text). |
| **[Cosmos Synthetic](https://huggingface.co/datasets/Berkesule/COSMOS-Sentetic-Turkish-Corpus-2GB)** | Synthetic | Pretrain | Textbook-quality reasoning and explanations. |
| **[FineWiki TR](https://huggingface.co/datasets/HuggingFaceFW/finewiki)** | Knowledge | Pretrain | Full Turkish Wikipedia dump. |
| **[Turkish News](https://huggingface.co/datasets/habanoz/news-tr-1.8M)** | Formal Text | Pretrain | High-quality, grammatically correct news articles. |
| **[Instruc Turca (90%)](https://huggingface.co/datasets/turkish-nlp-suite/InstrucTurca)** | Instructions | Pretrain | Flattened instruction pairs (User/Assistant) treated as raw text. |
| **[Instruc Turca (10%)](https://huggingface.co/datasets/turkish-nlp-suite/InstrucTurca)** | Chat | SFT | High-quality conversational data. |
| **[Open Math TR](https://huggingface.co/datasets/oztrkoguz/Open_Math_Instruct_Turkish)** | Reasoning | SFT | Step-by-step mathematical problem solving. |
| **[xP3x](https://huggingface.co/datasets/CohereLabs/xP3x)** | NLP Tasks | SFT | Multilingual generalization tasks. |
| **[En-Tr Translation](https://huggingface.co/datasets/cturan/high-quality-english-turkish-sentences)** | Translation | SFT | Parallel translation pairs. |
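
The "flattened" instruction rows in the table above can be pictured as chat turns joined into one plain-text document and stored in the pretraining schema. The card does not specify the exact template, so the `User:`/`Assistant:` labels, the empty `meta` object, and the `flatten_to_pretrain` helper below are assumptions for illustration only.

```python
import json
import uuid

# Hypothetical display labels for each chat role.
ROLE_LABELS = {"user": "User", "assistant": "Assistant"}


def flatten_to_pretrain(messages: list, source: str = "instruc_turca") -> dict:
    """Join chat turns into raw text and wrap them in the pretrain schema."""
    lines = []
    for m in messages:
        label = ROLE_LABELS.get(m["role"], m["role"].capitalize())
        lines.append(f"{label}: {m['content']}")
    return {
        "id": str(uuid.uuid4()),      # fresh UUIDv4 per document
        "text": "\n".join(lines),     # turns become ordinary running text
        "source": source,
        "language": "tr",
        "meta": json.dumps({}, ensure_ascii=False),  # meta stays a JSON string
    }
```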
-----
## Processing Pipeline
This dataset was engineered for **Foundation Model** training standards:
1. **Normalization:** All 60+ raw data sources were mapped to a single `id, text, source, meta` schema.
2. **Disk-Based Deduplication:** Exact deduplication (MD5) was performed across the entire ~100M-document collection to avoid wasting training FLOPs on repeated text.
3. **PII Sanitization:**
    * **Regex Cleaning:** Automated removal of email addresses, IP addresses, and Turkish phone numbers (+90...).
    * *Note:* Synthetic sources (Cosmos) and FineWeb were excluded from aggressive regexing to preserve token distribution.
4. **Sharding:** Data is split into `2.0 GB` Parquet shards for optimal GPU cluster streaming.
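
The deduplication and PII steps above can be sketched as follows. This is not the pipeline's actual code: the SQLite-backed hash store is one possible way to keep the set of seen hashes on disk instead of in memory, and the regular expressions are simplified assumptions that will miss some real-world formats.

```python
import hashlib
import re
import sqlite3


class ExactDeduper:
    """Track MD5 digests of seen texts in SQLite (pass a file path to use disk)."""

    def __init__(self, db_path: str = ":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS seen (digest BLOB PRIMARY KEY)")

    def is_new(self, text: str) -> bool:
        digest = hashlib.md5(text.encode("utf-8")).digest()
        # INSERT OR IGNORE changes zero rows when the digest already exists.
        cur = self.conn.execute("INSERT OR IGNORE INTO seen (digest) VALUES (?)", (digest,))
        return cur.rowcount == 1

    def close(self) -> None:
        self.conn.commit()  # persist pending inserts for on-disk databases
        self.conn.close()


# Simplified, illustrative patterns (assumptions, not the authors' regexes).
PHONE_TR = re.compile(r"\+90[\s-]?\(?\d{3}\)?[\s-]?\d{3}[\s-]?\d{2}[\s-]?\d{2}")
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def scrub_pii(text: str) -> str:
    """Remove phone numbers, email addresses, and IPv4 addresses from text."""
    for pattern in (PHONE_TR, EMAIL, IPV4):
        text = pattern.sub("", text)
    return text
```

Exact hashing only catches byte-identical documents; two texts that differ by a single trailing space count as distinct.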
## Limitations
* **Web Bias:** A significant portion of the data (FineWeb) comes from the open internet and may reflect societal biases.
* **Synthetic Nature:** The `Cosmos` subset is synthetic; while high quality, it may contain hallucinated reasoning patterns common to LLM outputs.
-----
## Citation & Attribution
If you use this dataset in your research or product, please cite:
```bibtex
@misc{lumees2025turkish100b,
author = {Hasan KURŞUN and Kerem Berkay YANIK},
title = {Lumees Turkish Corpus 100B},
year = {2025},
publisher = {Lumees AI},
howpublished = {\url{https://lumees.io}},
email = {hello@lumees.io}
}
```
Provider: lumees



