Ethosoft/Turkish_corpus
Source: Hugging Face | Updated: 2026-04-24 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/Ethosoft/Turkish_corpus
Description:
Turkish Corpus is a large-scale, cleaned Turkish text dataset intended for Turkish natural language processing (NLP) research, language model pretraining, tokenizer training, embedding models, retrieval systems, and general Turkish language understanding tasks. It was built by collecting public Turkish corpora and extracting the Turkish-language portions of multilingual datasets. The text has been cleaned, normalized, and filtered, and source metadata is preserved for traceability. The dataset contains approximately 6.65 million rows (about 9.31 GB) in Parquet format; the main column is named text.
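Because the files total roughly 9.31 GB, it may be more practical to stream rows than to download everything first. The following is a minimal sketch, not the provider's documented method: it assumes the Hugging Face `datasets` library is installed, that the dataset exposes a split called "train" (the listing does not name its splits), and that each row is a mapping with a text field. The helper names `iter_texts` and `stream_corpus` are hypothetical and introduced only for illustration.

```python
from typing import Iterable, Iterator, Mapping

REPO_ID = "Ethosoft/Turkish_corpus"  # dataset ID taken from the listing
TEXT_COLUMN = "text"                 # main column named in the description
SPLIT = "train"                      # assumption: the listing does not state split names


def iter_texts(rows: Iterable[Mapping], column: str = TEXT_COLUMN) -> Iterator[str]:
    """Yield stripped, non-empty strings from `column`, skipping rows without usable text."""
    for row in rows:
        value = row.get(column)
        if isinstance(value, str) and value.strip():
            yield value.strip()


def stream_corpus(repo_id: str = REPO_ID, split: str = SPLIT):
    """Open the dataset lazily (hypothetical helper).

    Requires `pip install datasets` and network access. The import is local so
    that `iter_texts` stays usable without the library installed.
    """
    from datasets import load_dataset

    # streaming=True returns an iterable that fetches rows on demand
    # instead of downloading all Parquet files up front.
    return load_dataset(repo_id, split=split, streaming=True)


# Example usage (not run here, since it needs network access):
#     from itertools import islice
#     for text in islice(iter_texts(stream_corpus()), 3):
#         print(text[:80])
```

Keeping the filtering step separate from the loading step means the same helper can be applied to rows read from a local Parquet file, should the split name or access method differ from what is assumed here.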
Provider:
Ethosoft



