
Ba2han/en-tur_corpus

Source: Hugging Face. Updated: 2026-03-06. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Ba2han/en-tur_corpus
Resource description:
---
task_categories:
- text-generation
---

# English - Turkish Filtered Corpus

This dataset was compiled from multiple English and Turkish sources, heavily filtered, and deduplicated.

## Processing Applied:

1. **Length Filtering:** Min characters = 100, Max characters = 2600
2. **Deduplication:** Exact deduplication followed by fast heuristic fingerprinting (normalized, whitespace & punctuation removed)
3. **Shuffled:** Seed = 42

## Dataset Composition:

| Dataset Source | Original Rows (Loaded) | Final Rows (After Filters & Dedup) |
|---|---|---|
| jordiclive/wikipedia-summary-dataset | 7,750,007 | 5,993,863 |
| LocalDoc/news_azerbaijan | 447,197 | 411,617 |
| ccdv/pubmed-summarization | 119,924 | 119,783 |
| turkish-nlp-suite/temiz-Wiki | 360,175 | 290,193 |
| Ba2han/vngrs-web-filtered | 1,000,000 | 183,604 |
| argilla/cnn-dailymail-summaries | 287,113 | 287,051 |
| turkish-nlp-suite/Havadis | 744,868 | 334,579 |
| bigscience-data/roots_en_the_pile_uspto | 1,000,000 | 395,842 |
| SocialGrep/one-million-reddit-jokes | 10,449 | 6,650 |
| karpathy/fineweb-edu-100b-shuffle | 1,000,000 | 423,426 |
| musabg/wikipedia-tr-summarization | 119,110 | 118,550 |
| common-pile/arxiv_abstracts_filtered | 2,504,679 | 2,498,158 |
| HuggingFaceFW/finetranslations | 1,000,000 | 378,533 |
| **TOTAL** | **16,343,522** | **11,441,849** |
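
The three processing steps listed above could be approximated along the following lines. This is only a minimal sketch and not the author's actual pipeline: the function names `fingerprint` and `filter_dedup_shuffle` are hypothetical, and the exact normalization is an assumption (here: Unicode NFKC, lowercasing, then removal of whitespace and punctuation). Only the length bounds and the seed are taken from the description.

```python
import random
import unicodedata

MIN_CHARS = 100   # from the dataset card
MAX_CHARS = 2600  # from the dataset card
SEED = 42         # from the dataset card


def fingerprint(text: str) -> str:
    """Hypothetical heuristic fingerprint (assumed normalization):
    NFKC-normalize, lowercase, drop whitespace and punctuation."""
    normalized = unicodedata.normalize("NFKC", text).lower()
    return "".join(
        ch
        for ch in normalized
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def filter_dedup_shuffle(texts):
    """Length-filter, deduplicate (exact, then by fingerprint), and shuffle."""
    seen_exact = set()
    seen_fingerprints = set()
    kept = []
    for text in texts:
        # Step 1: length filtering on character count.
        if not (MIN_CHARS <= len(text) <= MAX_CHARS):
            continue
        # Step 2a: exact deduplication.
        if text in seen_exact:
            continue
        seen_exact.add(text)
        # Step 2b: near-duplicate removal via the normalized fingerprint.
        fp = fingerprint(text)
        if fp in seen_fingerprints:
            continue
        seen_fingerprints.add(fp)
        kept.append(text)
    # Step 3: deterministic shuffle with a fixed seed.
    random.Random(SEED).shuffle(kept)
    return kept
```

For a corpus of this size, one would likely store hashes of the fingerprints rather than the full strings to limit memory use. Python's default `lower()` also does not apply Turkish casing rules for dotted and dotless I, so a real pipeline for Turkish text might handle casing differently.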
Provider:
Ba2han