Ba2han/en-tur_corpus
Source: Hugging Face. Updated 2026-03-06. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Ba2han/en-tur_corpus
Resource description:
---
task_categories:
- text-generation
---
# English - Turkish Filtered Corpus
This dataset was compiled from multiple English and Turkish sources, heavily filtered, and deduplicated.
## Processing Applied:
1. **Length Filtering:** Min characters = 100, Max characters = 2600
2. **Deduplication:** Exact deduplication, followed by fast heuristic fingerprinting on normalized text (whitespace and punctuation removed)
3. **Shuffled:** Seed = 42
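
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the script used to build the dataset: the `fingerprint` and `process` helpers are hypothetical, and the exact normalization (NFKC plus lowercasing), the ASCII-only punctuation set, and the inclusive length bounds are all assumptions, since the card only says "normalized, whitespace & punctuation removed".

```python
import random
import string
import unicodedata

MIN_CHARS = 100   # from the card
MAX_CHARS = 2600  # from the card
SEED = 42         # from the card

# Assumption: only ASCII punctuation is stripped; the card does not define the set.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def fingerprint(text: str) -> str:
    """Hypothetical fingerprint: normalize, then drop punctuation and whitespace."""
    # Assumption: "normalized" means NFKC normalization plus lowercasing.
    normalized = unicodedata.normalize("NFKC", text).lower()
    without_punct = normalized.translate(_PUNCT_TABLE)
    return "".join(without_punct.split())


def process(texts, seed=SEED):
    """Length-filter, deduplicate, and shuffle a list of documents."""
    # 1. Length filtering (bounds assumed to be inclusive).
    kept = [t for t in texts if MIN_CHARS <= len(t) <= MAX_CHARS]

    # 2a. Exact deduplication, keeping the first occurrence.
    kept = list(dict.fromkeys(kept))

    # 2b. Heuristic deduplication: drop texts whose fingerprint was already seen.
    seen = set()
    unique = []
    for t in kept:
        fp = fingerprint(t)
        if fp not in seen:
            seen.add(fp)
            unique.append(t)

    # 3. Deterministic shuffle with a fixed seed.
    random.Random(seed).shuffle(unique)
    return unique
```

Under these assumptions, two documents that differ only in casing, spacing, or punctuation collapse to a single row, which is one plausible reason the web-crawled sources in the table below shrink far more than the curated ones.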
## Dataset Composition:
| Dataset Source | Original Rows (Loaded) | Final Rows (After Filters & Dedup) |
|---|---|---|
| jordiclive/wikipedia-summary-dataset | 7,750,007 | 5,993,863 |
| LocalDoc/news_azerbaijan | 447,197 | 411,617 |
| ccdv/pubmed-summarization | 119,924 | 119,783 |
| turkish-nlp-suite/temiz-Wiki | 360,175 | 290,193 |
| Ba2han/vngrs-web-filtered | 1,000,000 | 183,604 |
| argilla/cnn-dailymail-summaries | 287,113 | 287,051 |
| turkish-nlp-suite/Havadis | 744,868 | 334,579 |
| bigscience-data/roots_en_the_pile_uspto | 1,000,000 | 395,842 |
| SocialGrep/one-million-reddit-jokes | 10,449 | 6,650 |
| karpathy/fineweb-edu-100b-shuffle | 1,000,000 | 423,426 |
| musabg/wikipedia-tr-summarization | 119,110 | 118,550 |
| common-pile/arxiv_abstracts_filtered | 2,504,679 | 2,498,158 |
| HuggingFaceFW/finetranslations | 1,000,000 | 378,533 |
| **TOTAL** | **16,343,522** | **11,441,849** |
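
To see how much each source shrank, the retention rate (final rows divided by loaded rows) can be computed directly from the table. The snippet below is a small sketch that copies the row counts from the table; the `ROWS` dictionary and the `retention` helper are illustrative names, not part of the dataset.

```python
# Row counts copied from the table above: (loaded, final).
ROWS = {
    "jordiclive/wikipedia-summary-dataset": (7_750_007, 5_993_863),
    "LocalDoc/news_azerbaijan": (447_197, 411_617),
    "ccdv/pubmed-summarization": (119_924, 119_783),
    "turkish-nlp-suite/temiz-Wiki": (360_175, 290_193),
    "Ba2han/vngrs-web-filtered": (1_000_000, 183_604),
    "argilla/cnn-dailymail-summaries": (287_113, 287_051),
    "turkish-nlp-suite/Havadis": (744_868, 334_579),
    "bigscience-data/roots_en_the_pile_uspto": (1_000_000, 395_842),
    "SocialGrep/one-million-reddit-jokes": (10_449, 6_650),
    "karpathy/fineweb-edu-100b-shuffle": (1_000_000, 423_426),
    "musabg/wikipedia-tr-summarization": (119_110, 118_550),
    "common-pile/arxiv_abstracts_filtered": (2_504_679, 2_498_158),
    "HuggingFaceFW/finetranslations": (1_000_000, 378_533),
}


def retention(loaded: int, final: int) -> float:
    """Fraction of loaded rows that survive filtering and deduplication."""
    return final / loaded


total_loaded = sum(loaded for loaded, _ in ROWS.values())
total_final = sum(final for _, final in ROWS.values())

# Print sources from most to least heavily filtered.
for name, (loaded, final) in sorted(ROWS.items(), key=lambda kv: retention(*kv[1])):
    print(f"{name}: {retention(loaded, final):.1%}")

print(f"TOTAL: {retention(total_loaded, total_final):.1%}")
```

The per-source counts sum to the totals reported in the table, and roughly 70% of all loaded rows survive.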
Provider:
Ba2han



