Ba2han/1711-mix-pt-tr
Source: Hugging Face. Updated 2025-11-17; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/Ba2han/1711-mix-pt-tr
Description:
---
license: apache-2.0
task_categories:
- text-generation
size_categories:
- 10M<n<100M
---
# Turkish Language Dataset Mix (1711-mix-pt-tr)
This dataset is a curated collection of Turkish language texts from multiple sources, processed and filtered for pretraining language models.
## Dataset Composition
This dataset combines the following sources:
1. **hcsolakoglu/turkish-wikipedia-qa-4-million** - Turkish Wikipedia Q&A pairs (original_text column)
2. **turkish-nlp-suite/ForumSohbetleri** - Turkish forum discussions from:
   - donanimarsivi
   - donanimhaber
   - memurlar
   - wardom
   - technopatsosyal
3. **turkish-nlp-suite/OzenliDerlem** - Curated Turkish corpus (all subsets)
4. **PleIAs/SYNTH** - Selected synthetic data (synth_009-012.parquet files)
5. **musabg/wikipedia-tr-summarization** - Turkish Wikipedia summaries
6. **HuggingFaceFW/finewiki** - Turkish Wikipedia subset (tr/trwiki)
## Processing Pipeline
1. **Column Normalization**: All text columns renamed to "text"
2. **Chunking**: Large texts split using delimiters: `["# ", "## ", "### ", ".\n\n", ".\n"]`
3. **Filtering**: Texts kept only if 150 ≤ length ≤ 9000 characters
4. **Deduplication**: Exact match deduplication applied
5. **Splitting**: Dataset split into 250k row chunks for easier handling
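The card does not include the processing code, so the following is only a minimal sketch of steps 2 to 4 (chunking, filtering, exact-match deduplication). The function names `split_keep`, `pack`, `chunk_text` and `process` are hypothetical, and greedily re-merging pieces up to the 9000-character limit is an assumption, not something the card states.

```python
# Hypothetical sketch of chunking, length filtering and exact-match
# deduplication; the delimiters and length limits come from the card,
# everything else is an assumption.
DELIMITERS = ["# ", "## ", "### ", ".\n\n", ".\n"]
MIN_CHARS, MAX_CHARS = 150, 9000


def split_keep(text, delim):
    """Split on delim without losing it, so the pieces join back to text."""
    parts = text.split(delim)
    if delim.startswith("#"):
        # A heading marker opens the following piece.
        return [parts[0]] + [delim + p for p in parts[1:]]
    # A sentence terminator closes the preceding piece.
    return [p + delim for p in parts[:-1]] + [parts[-1]]


def pack(pieces, max_chars):
    """Greedily merge adjacent pieces while the result stays within max_chars."""
    packed, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > max_chars:
            packed.append(current)
            current = piece
        else:
            current += piece
    if current:
        packed.append(current)
    return packed


def chunk_text(text, max_chars=MAX_CHARS, delimiters=DELIMITERS):
    """Split oversized text, trying the delimiters in the listed order."""
    if len(text) <= max_chars or not delimiters:
        return [text]  # may still be oversized; the length filter drops it
    head, rest = delimiters[0], delimiters[1:]
    pieces = []
    for piece in split_keep(text, head):
        pieces.extend(chunk_text(piece, max_chars, rest))
    return pack(pieces, max_chars)


def process(texts):
    """Yield chunks that pass the length filter and were not seen before."""
    seen = set()
    for text in texts:
        for chunk in chunk_text(text):
            if MIN_CHARS <= len(chunk) <= MAX_CHARS and chunk not in seen:
                seen.add(chunk)
                yield chunk
```

Calling `list(process(rows))` on an iterable of strings returns the surviving chunks in input order. Because `"# "` is a substring of `"## "`, a literal split in the listed order can cut inside a deeper heading marker; this sketch does not try to correct for that.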
## Statistics
- **Total Examples**: 13,550,333
- **Splits**: 55
- **Character Range**: 150-9000 characters per example
- **Language**: Turkish (tr)
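As a quick consistency check, the split count follows from the total number of examples and the 250k-row chunk size. The sketch below assumes that every split except the last is full and that split names continue the `train_000` pattern sequentially; neither is confirmed by the card.

```python
import math

TOTAL_EXAMPLES = 13_550_333  # from the statistics above
ROWS_PER_SPLIT = 250_000     # from the processing pipeline

# Number of 250k-row chunks needed to hold every example.
num_splits = math.ceil(TOTAL_EXAMPLES / ROWS_PER_SPLIT)

# Rows left for the final split, assuming all earlier splits are full.
last_split_rows = TOTAL_EXAMPLES - (num_splits - 1) * ROWS_PER_SPLIT

# Assumed naming scheme, extrapolated from the "train_000" example.
split_names = [f"train_{i:03d}" for i in range(num_splits)]

print(num_splits, last_split_rows)  # 55 50333
```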
## Usage
```python
from datasets import load_dataset

# Load the full dataset (returns a DatasetDict keyed by split name)
dataset = load_dataset("Ba2han/1711-mix-pt-tr")

# Load a specific split (returns a single Dataset)
first_split = load_dataset("Ba2han/1711-mix-pt-tr", split="train_000")
```
## License
This dataset combines multiple sources with various licenses. Please check individual source datasets for specific licensing terms.
## Citation
If you use this dataset, please cite the original sources listed above.
Provider:
Ba2han



