
lumees/turkish-corpus-100b

Source: Hugging Face. Updated 2025-11-30. Indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/lumees/turkish-corpus-100b
Resource description:
---
language:
- tr
license: apache-2.0
task_categories:
- text-generation
- question-answering
- translation
pretty_name: Lumees Turkish Corpus 100B
size_categories:
- 100B<n<1T
tags:
- turkish
- llm
- foundation-model
- pretraining
- sft
- fineweb
- cosmos
configs:
- config_name: pretrain
  data_files: "pretrain/*.parquet"
- config_name: sft
  data_files: "sft/*.parquet"
---

# Lumees Turkish Corpus 100B (LTC-100B)

## Dataset Summary

The **Lumees Turkish Corpus 100B (LTC-100B)** is a massive-scale, deduplicated, and cleaned dataset designed for training Foundation Models in Turkish. Comprising approximately **105 Billion tokens** (measured with Qwen/Llama3 tokenizer), it represents one of the largest open resources for Turkish LLM pretraining.

The dataset is engineered for a two-stage training pipeline:

1. **Pretrain Subset (~103B Tokens):** A diverse mix of high-quality web data, synthetic reasoning, encyclopedic knowledge, and flattened instructions for continual pretraining.
2. **SFT Subset (~2.2B Tokens):** A massive collection of instruction-following, mathematical reasoning, and translation pairs for "Instruction Pretraining" or large-scale SFT.

### 🚀 Pilot Subset Available (10B)

For researchers and organizations running single-node experiments (e.g., 1x H100), we provide a **10 Billion Token Pilot Subset**. This subset uses weighted priority sampling (keeping 100% of Synthetic/Wiki data and downsampling the Web data) to ensure high density.

---

## Dataset Statistics

*Estimates based on Qwen/Llama-3 Tokenization.*

| Subset | Format | File Type | Token Count |
| :--- | :--- | :--- | :--- |
| **Pretrain** | Universal Schema | Parquet (ZSTD) | **~103.26 Billion** |
| **SFT** | ChatML | JSONL | **~2.27 Billion** |
| **Total** | - | - | **~105.53 Billion** |

---

## Data Structure

### 1. Pretraining Subset (`pretrain`)

Optimized for high-throughput streaming with libraries like `datatrove` or `nanotron`.
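
As a rough illustration of streaming access, the sketch below reads a few rows of the `pretrain` config with the Hugging Face `datasets` library instead of `datatrove` or `nanotron`. It assumes the repository ID `lumees/turkish-corpus-100b` from the download link, the config name from the card's front matter, and a default `train` split, which the card does not state explicitly. The `decode_meta` and `stream_pretrain` helpers are hypothetical names; `decode_meta` relies only on the card's statement that the `meta` column is a JSON-serialized string.

```python
import json
from typing import Any, Dict, Iterator


def decode_meta(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the record with `meta` parsed from its JSON string."""
    out = dict(record)
    raw = record.get("meta")
    try:
        out["meta"] = json.loads(raw) if raw else {}
    except (TypeError, ValueError):
        # Fall back to an empty dict if a row carries malformed metadata.
        out["meta"] = {}
    return out


def stream_pretrain(limit: int = 3) -> Iterator[Dict[str, Any]]:
    """Yield the first `limit` rows without downloading the full corpus."""
    # Imported lazily so decode_meta works without the library installed.
    from datasets import load_dataset  # pip install datasets

    ds = load_dataset(
        "lumees/turkish-corpus-100b",  # assumed repository ID
        "pretrain",                    # config name from the front matter
        split="train",                 # assumed default split name
        streaming=True,
    )
    for row in ds.take(limit):
        yield decode_meta(row)


if __name__ == "__main__":
    for row in stream_pretrain():
        print(row["source"], row["meta"], row["text"][:80])
```

Streaming keeps the example usable on a single machine, since the shards are fetched lazily rather than materialized on disk first.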

| Column | Type | Description |
| :--- | :--- | :--- |
| `id` | `string` | Unique UUIDv4 (vital for deduplication tracking). |
| `text` | `string` | The cleaned, deduplicated content. |
| `source` | `string` | Origin dataset (e.g., `fineweb-2`, `cosmos`). |
| `language` | `string` | ISO code (`tr`). |
| `meta` | `string` | Original metadata (URL, date, title) serialized as a JSON string. |

### 2. SFT Subset (`sft`)

Optimized for "Instruction Pretraining" or Fine-Tuning.

| Column | Type | Description |
| :--- | :--- | :--- |
| `messages` | `list` | Standard OpenAI format: `[{"role": "user", ...}, {"role": "assistant", ...}]` |
| `source` | `string` | Origin task (e.g., `instruc_turca`, `open_math`). |

---

## Data Composition

This corpus was built using a **Weighted Priority** strategy, blending massive web scale with high-density reasoning data.

| Source | Type | Usage Phase | Description |
| :--- | :--- | :--- | :--- |
| **[FineWeb-2 (Turkish)](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2)** | Web Crawl | Pretrain | The backbone of the corpus (cleaned web text). |
| **[Cosmos Synthetic](https://huggingface.co/datasets/Berkesule/COSMOS-Sentetic-Turkish-Corpus-2GB)** | Synthetic | Pretrain | Textbook-quality reasoning and explanations. |
| **[FineWiki TR](https://huggingface.co/datasets/HuggingFaceFW/finewiki)** | Knowledge | Pretrain | Full Turkish Wikipedia dump. |
| **[Turkish News](https://huggingface.co/datasets/habanoz/news-tr-1.8M)** | Formal Text | Pretrain | High-quality, grammatically correct news articles. |
| **[Instruc Turca (90%)](https://huggingface.co/datasets/turkish-nlp-suite/InstrucTurca)** | Instructions | Pretrain | Flattened instruction pairs (User/Assistant) treated as raw text. |
| **[Instruc Turca (10%)](https://huggingface.co/datasets/turkish-nlp-suite/InstrucTurca)** | Chat | SFT | High-quality conversational data. |
| **[Open Math TR](https://huggingface.co/datasets/oztrkoguz/Open_Math_Instruct_Turkish)** | Reasoning | SFT | Step-by-step mathematical problem solving. |
| **[XP3X](https://huggingface.co/datasets/CohereLabs/xP3x)** | NLP Tasks | SFT | Multilingual generalization tasks. |
| **[En-Tr Translation](https://huggingface.co/datasets/cturan/high-quality-english-turkish-sentences)** | Translation | SFT | Parallel translation pairs. |

-----

## Processing Pipeline

This dataset was engineered for **Foundation Model** training standards:

1. **Normalization:** All 60+ raw data sources were mapped to a single `id, text, source, meta` schema.
2. **Disk-Based Deduplication:** Exact deduplication (MD5) was performed across the entire ~100M document collection to avoid wasting training FLOPs on repeated text.
3. **PII Sanitization:**
   * **Regex Cleaning:** Automated removal of email addresses, IP addresses, and Turkish phone numbers (+90...).
   * *Note:* Synthetic sources (Cosmos) and FineWeb were excluded from aggressive regexing to preserve token distribution.
4. **Sharding:** Data is split into `2.0 GB` Parquet shards for optimal GPU cluster streaming.

## Limitations

* **Web Bias:** A significant portion of the data (FineWeb) comes from the open internet and may reflect societal biases.
* **Synthetic Nature:** The `Cosmos` subset is synthetic; while high quality, it may contain hallucinated reasoning patterns common to LLM outputs.

-----

## Citation & Attribution

If you use this dataset in your research or product, please cite:

```bibtex
@misc{lumees2025turkish100b,
  author = {Hasan KURŞUN and Kerem Berkay YANIK},
  title = {Lumees Turkish Corpus 100B},
  year = {2025},
  publisher = {Lumees AI},
  howpublished = {\url{https://lumees.io}},
  email = {hello@lumees.io}
}
```
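
The normalization, exact-deduplication, and PII steps described in the Processing Pipeline section can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the regular expressions are guesses at what such cleaning might look like (the card does not publish its patterns), the `seen` set lives in memory whereas the card describes a disk-based approach, and the `scrub_pii` and `process` names are hypothetical. The source labels `cosmos` and `fineweb-2` are taken from the card's own examples.

```python
import hashlib
import re
import uuid
from typing import Dict, Iterable, Iterator

# Illustrative patterns only; the actual regexes are not published.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
TR_PHONE_RE = re.compile(
    r"\+90[\s-]?\(?\d{3}\)?[\s-]?\d{3}[\s-]?\d{2}[\s-]?\d{2}"
)

# Sources the card says were excluded from aggressive regex cleaning.
SKIP_PII_SOURCES = {"cosmos", "fineweb-2"}


def scrub_pii(text: str) -> str:
    """Remove email addresses, IPv4 addresses, and +90 phone numbers."""
    text = EMAIL_RE.sub("", text)
    text = IPV4_RE.sub("", text)
    return TR_PHONE_RE.sub("", text)


def process(docs: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Map docs to the card's schema, drop exact duplicates, scrub PII."""
    seen = set()  # in-memory stand-in for a disk-based hash store
    for doc in docs:
        text = doc["text"]
        # Exact deduplication: identical text yields an identical MD5 digest.
        digest = hashlib.md5(text.encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        if doc["source"] not in SKIP_PII_SOURCES:
            text = scrub_pii(text)
        yield {
            "id": str(uuid.uuid4()),
            "text": text,
            "source": doc["source"],
            "language": "tr",
            "meta": doc.get("meta", "{}"),
        }
```

Because MD5 matching is exact, near-duplicates that differ by a single character pass through this step; catching those would need a fuzzy method, which the card does not mention.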

Provider:
lumees