LVSTCK/macedonian-corpus-cleaned-dedup
Source: Hugging Face. Updated 2025-07-06; indexed 2025-02-15.
Download link:
https://hf-mirror.com/datasets/LVSTCK/macedonian-corpus-cleaned-dedup
Description:
The Macedonian Corpus - Cleaned and Deduplicated is a dataset that consolidates various text resources including books, academic papers, and web content. It has undergone rigorous cleaning and deduplication processes to ensure high-quality text, providing valuable data for natural language processing, linguistic research, and educational applications. The dataset is 16.78 GB in size and contains approximately 1.47 billion words.
Provider:
LVSTCK



