Ba2han/1711-mix-pt-tr

Name: Ba2han/1711-mix-pt-tr
Creator: Ba2han
Published: 2025-11-17 14:15:33
License: No description provided

Source: Hugging Face. Updated 2025-11-17. Indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/Ba2han/1711-mix-pt-tr


Description:

---
license: apache-2.0
task_categories:
- text-generation
size_categories:
- 10M<n<100M
---

# Turkish Language Dataset Mix (1711-mix-pt-tr)

This dataset is a curated collection of Turkish language texts from multiple sources, processed and filtered for pretraining language models.

## Dataset Composition

This dataset combines the following sources:

1. **hcsolakoglu/turkish-wikipedia-qa-4-million** - Turkish Wikipedia Q&A pairs (original_text column)
2. **turkish-nlp-suite/ForumSohbetleri** - Turkish forum discussions from:
   - donanimarsivi
   - donanimhaber
   - memurlar
   - wardom
   - technopatsosyal
3. **turkish-nlp-suite/OzenliDerlem** - Curated Turkish corpus (all subsets)
4. **PleIAs/SYNTH** - Selected synthetic data (synth_009-012.parquet files)
5. **musabg/wikipedia-tr-summarization** - Turkish Wikipedia summaries
6. **HuggingFaceFW/finewiki** - Turkish Wikipedia subset (tr/trwiki)

## Processing Pipeline

1. **Column Normalization**: All text columns renamed to "text"
2. **Chunking**: Large texts split using delimiters: `["# ", "## ", "### ", ".\n\n", ".\n"]`
3. **Filtering**: Texts kept only if 150 ≤ length ≤ 9000 characters
4. **Deduplication**: Exact match deduplication applied
5. **Splitting**: Dataset split into 250k row chunks for easier handling

## Statistics

- **Total Examples**: 13,550,333
- **Splits**: 55
- **Character Range**: 150-9000 characters per example
- **Language**: Turkish (tr)

## Usage

```python
from datasets import load_dataset

# Load full dataset
dataset = load_dataset("Ba2han/1711-mix-pt-tr")

# Load specific split
dataset = load_dataset("Ba2han/1711-mix-pt-tr", split="train_000")
```

## License

This dataset combines multiple sources with various licenses. Please check individual source datasets for specific licensing terms.

## Citation

If you use this dataset, please cite the original sources listed above.
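The chunking, filtering, deduplication, and splitting steps from the Processing Pipeline section can be illustrated with a minimal sketch. This is not the author's code: the dataset card lists the delimiters and thresholds but not how they are applied, so the greedy packing strategy and the helper names (`split_keep`, `chunk_text`, `process`, `split_rows`) are assumptions made for illustration only.

```python
import math

# Values taken from the dataset card.
DELIMITERS = ["# ", "## ", "### ", ".\n\n", ".\n"]
MIN_CHARS, MAX_CHARS = 150, 9000
ROWS_PER_SPLIT = 250_000


def split_keep(text, delim):
    """Split on delim without losing characters: ''.join(pieces) == text."""
    parts = text.split(delim)
    if delim.startswith("#"):
        # Assumption: a heading marker opens the following piece.
        pieces = [parts[0]] + [delim + p for p in parts[1:]]
    else:
        # Assumption: a sentence or paragraph end closes the current piece.
        pieces = [p + delim for p in parts[:-1]] + [parts[-1]]
    return [p for p in pieces if p]


def chunk_text(text, max_chars=MAX_CHARS, delimiters=DELIMITERS):
    """Hypothetical chunker: try delimiters in order, pack pieces greedily."""
    if len(text) <= max_chars:
        return [text]
    for i, delim in enumerate(delimiters):
        if delim not in text:
            continue
        chunks, current = [], ""
        for piece in split_keep(text, delim):
            if current and len(current) + len(piece) > max_chars:
                chunks.append(current)
                current = ""
            current += piece
        if current:
            chunks.append(current)
        # Pieces that are still too long are retried with the remaining delimiters.
        out = []
        for chunk in chunks:
            out.extend(chunk_text(chunk, max_chars, delimiters[i + 1:]))
        return out
    return [text]  # No usable delimiter; the length filter below drops it.


def process(texts):
    """Chunk, keep 150 <= length <= 9000, and drop exact duplicates."""
    seen, kept = set(), []
    for text in texts:
        for chunk in chunk_text(text):
            if MIN_CHARS <= len(chunk) <= MAX_CHARS and chunk not in seen:
                seen.add(chunk)
                kept.append(chunk)
    return kept


def split_rows(rows, rows_per_split=ROWS_PER_SPLIT):
    """Cut the row list into consecutive blocks of at most rows_per_split."""
    return [rows[i:i + rows_per_split] for i in range(0, len(rows), rows_per_split)]


# 13,550,333 rows at 250,000 rows per split needs 55 splits,
# which agrees with the Statistics section.
expected_splits = math.ceil(13_550_333 / ROWS_PER_SPLIT)
```

The split count is a quick consistency check on the published figures: 54 full splits hold 13,500,000 rows, and the remaining 50,333 rows form the 55th.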


