lumees/turkish-corpus-100b

Name: lumees/turkish-corpus-100b
Creator: lumees
Published: 2025-11-30 03:10:19
License: apache-2.0 (as declared in the dataset card metadata)

Source: Hugging Face. Updated 2025-11-30; indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/lumees/turkish-corpus-100b


Description:

---
language:
- tr
license: apache-2.0
task_categories:
- text-generation
- question-answering
- translation
pretty_name: Lumees Turkish Corpus 100B
size_categories:
- 100B<n<1T
tags:
- turkish
- llm
- foundation-model
- pretraining
- sft
- fineweb
- cosmos
configs:
- config_name: pretrain
  data_files: "pretrain/*.parquet"
- config_name: sft
  data_files: "sft/*.parquet"
---

# Lumees Turkish Corpus 100B (LTC-100B)

## Dataset Summary

The **Lumees Turkish Corpus 100B (LTC-100B)** is a large-scale, deduplicated, and cleaned dataset designed for training foundation models in Turkish. At approximately **105 billion tokens** (measured with the Qwen/Llama 3 tokenizer), it is one of the largest open resources for Turkish LLM pretraining.

The dataset is engineered for a two-stage training pipeline:

1. **Pretrain Subset (~103B tokens):** A diverse mix of high-quality web data, synthetic reasoning, encyclopedic knowledge, and flattened instructions for continual pretraining.
2. **SFT Subset (~2.2B tokens):** A large collection of instruction-following, mathematical reasoning, and translation pairs for "Instruction Pretraining" or large-scale SFT.

### 🚀 Pilot Subset Available (10B)

For researchers and organizations running single-node experiments (e.g., 1x H100), we provide a **10 billion token Pilot Subset**. This subset uses weighted priority sampling (keeping 100% of the Synthetic/Wiki data and downsampling the Web data) to keep information density high.

---

## Dataset Statistics

*Estimates based on Qwen/Llama 3 tokenization.*

| Subset | Format | File Type | Token Count |
| :--- | :--- | :--- | :--- |
| **Pretrain** | Universal Schema | Parquet (ZSTD) | **~103.26 Billion** |
| **SFT** | ChatML | JSONL | **~2.27 Billion** |
| **Total** | - | - | **~105.53 Billion** |

---

## Data Structure

### 1. Pretraining Subset (`pretrain`)

Optimized for high-throughput streaming with libraries like `datatrove` or `nanotron`.

| Column | Type | Description |
| :--- | :--- | :--- |
| `id` | `string` | Unique UUIDv4 (vital for deduplication tracking). |
| `text` | `string` | The cleaned, deduplicated content. |
| `source` | `string` | Origin dataset (e.g., `fineweb-2`, `cosmos`). |
| `language` | `string` | ISO code (`tr`). |
| `meta` | `string` | Original metadata (URL, date, title) serialized as a JSON string. |

### 2. SFT Subset (`sft`)

Optimized for "Instruction Pretraining" or fine-tuning.

| Column | Type | Description |
| :--- | :--- | :--- |
| `messages` | `list` | Standard OpenAI format: `[{"role": "user", ...}, {"role": "assistant", ...}]` |
| `source` | `string` | Origin task (e.g., `instruc_turca`, `open_math`). |

---

## Data Composition

This corpus was built using a **Weighted Priority** strategy, blending web-scale text with high-density reasoning data.

| Source | Type | Usage Phase | Description |
| :--- | :--- | :--- | :--- |
| **[FineWeb-2 (Turkish)](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2)** | Web Crawl | Pretrain | The backbone of the corpus (cleaned web text). |
| **[Cosmos Synthetic](https://huggingface.co/datasets/Berkesule/COSMOS-Sentetic-Turkish-Corpus-2GB)** | Synthetic | Pretrain | Textbook-quality reasoning and explanations. |
| **[FineWiki TR](https://huggingface.co/datasets/HuggingFaceFW/finewiki)** | Knowledge | Pretrain | Full Turkish Wikipedia dump. |
| **[Turkish News](https://huggingface.co/datasets/habanoz/news-tr-1.8M)** | Formal Text | Pretrain | High-quality, grammatically correct news articles. |
| **[Instruc Turca (90%)](https://huggingface.co/datasets/turkish-nlp-suite/InstrucTurca)** | Instructions | Pretrain | Flattened instruction pairs (User/Assistant) treated as raw text. |
| **[Instruc Turca (10%)](https://huggingface.co/datasets/turkish-nlp-suite/InstrucTurca)** | Chat | SFT | High-quality conversational data. |
| **[Open Math TR](https://huggingface.co/datasets/oztrkoguz/Open_Math_Instruct_Turkish)** | Reasoning | SFT | Step-by-step mathematical problem solving. |
| **[xP3x](https://huggingface.co/datasets/CohereLabs/xP3x)** | NLP Tasks | SFT | Multilingual generalization tasks. |
| **[En-Tr Translation](https://huggingface.co/datasets/cturan/high-quality-english-turkish-sentences)** | Translation | SFT | Parallel translation pairs. |

-----

## Processing Pipeline

This dataset was engineered to **Foundation Model** training standards:

1. **Normalization:** All 60+ raw data sources were mapped to a single `id, text, source, meta` schema.
2. **Disk-Based Deduplication:** Exact deduplication (MD5) was performed across the entire ~100M document collection to avoid wasting training FLOPs on repeated text.
3. **PII Sanitization:**
   * **Regex Cleaning:** Automated removal of email addresses, IP addresses, and Turkish phone numbers (+90...).
   * *Note:* Synthetic sources (Cosmos) and FineWeb were excluded from aggressive regexing to preserve token distribution.
4. **Sharding:** Data is split into `2.0 GB` Parquet shards for optimal GPU cluster streaming.

## Limitations

* **Web Bias:** A significant portion of the data (FineWeb) comes from the open internet and may reflect societal biases.
* **Synthetic Nature:** The `Cosmos` subset is synthetic; while high quality, it may contain hallucinated reasoning patterns common to LLM outputs.

-----

## Citation & Attribution

If you use this dataset in your research or product, please cite:

```bibtex
@misc{lumees2025turkish100b,
  author = {Hasan KURŞUN and Kerem Berkay YANIK},
  title = {Lumees Turkish Corpus 100B},
  year = {2025},
  publisher = {Lumees AI},
  howpublished = {\url{https://lumees.io}},
  email = {hello@lumees.io}
}
```
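The normalization, exact-deduplication, and PII steps listed under Processing Pipeline can be illustrated with a small sketch. The card does not publish the actual regexes or pipeline code, so everything below is an assumption for illustration only: the three patterns, the helper names `scrub_pii`, `normalize`, and `dedup_exact`, the `SKIP_PII_SOURCES` set, and the in-memory hash set (the card describes a disk-based pass over roughly 100M documents).

```python
import hashlib
import json
import re
import uuid

# Hypothetical patterns: the dataset card names the PII categories
# (emails, IPs, +90 phone numbers) but not the regexes actually used.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
TR_PHONE_RE = re.compile(
    r"\+90[\s-]?\(?\d{3}\)?[\s-]?\d{3}[\s-]?\d{2}[\s-]?\d{2}"
)
# Loose dotted-quad match; it does not check that each octet is <= 255.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# The card says Cosmos and FineWeb were excluded from aggressive regexing.
# The exact source labels used here are assumed from the `source` examples.
SKIP_PII_SOURCES = {"cosmos", "fineweb-2"}


def scrub_pii(text: str) -> str:
    """Remove emails, Turkish phone numbers, and IPv4 addresses."""
    text = EMAIL_RE.sub("", text)
    text = TR_PHONE_RE.sub("", text)
    return IPV4_RE.sub("", text)


def normalize(text: str, source: str, meta: dict) -> dict:
    """Map one raw document onto the pretrain schema from the card."""
    if source not in SKIP_PII_SOURCES:
        text = scrub_pii(text)
    return {
        "id": str(uuid.uuid4()),
        "text": text,
        "source": source,
        "language": "tr",
        # `meta` is stored as a JSON string, per the column description.
        "meta": json.dumps(meta, ensure_ascii=False),
    }


def dedup_exact(records):
    """Yield only the first record for each distinct MD5 of `text`."""
    seen = set()  # in-memory stand-in for the card's disk-based index
    for rec in records:
        digest = hashlib.md5(rec["text"].encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield rec


raw = [
    ("Bana yazın: ali@example.com veya +90 532 123 45 67.", "news",
     {"url": "https://example.com/a"}),
    ("Bana yazın: ali@example.com veya +90 532 123 45 67.", "news",
     {"url": "https://example.com/b"}),
    ("Sunucu 192.168.1.10 adresinde.", "cosmos", {}),
]
records = list(dedup_exact(normalize(t, s, m) for t, s, m in raw))
```

In this sketch the two identical news documents collapse into one record with the email and phone number removed, while the `cosmos` document keeps its IP address because that source is skipped. Hashing after scrubbing is a design choice of the sketch, not something the card specifies.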

