LVSTCK/macedonian-corpus-raw
Hugging Face · Updated 2025-07-06 · Indexed 2025-02-15
Download link:
https://hf-mirror.com/datasets/LVSTCK/macedonian-corpus-raw
Description:
The Macedonian Corpus - Raw is a dataset of raw Macedonian text drawn from more than 10 sources, totaling 37.6 GB and approximately 3.53 billion words. It includes academic texts, public archives, and online resources. Preprocessing is minimal: only basic anonymization has been applied (e.g., removal of emails, phone numbers, and other sensitive information), and the data has not been deduplicated or filtered for noise. The corpus is intended for a variety of uses, including pretraining or fine-tuning LLMs, linguistic analysis, machine translation, and document retrieval. The dataset is split into categories by data origin, including HPLT-2, HuggingFace (fineweb-2), and CLARIN (MaCoCu-mk 2.0), among others, and is licensed under the Creative Commons Attribution 4.0 (CC BY 4.0) license.
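Because the corpus has not been deduplicated, downstream users may want to drop exact duplicates before training. The following is a minimal sketch of one common approach (hashing whitespace-normalized text), not the maintainers' own pipeline. The function names `normalize` and `dedupe_exact` and the sample sentences are hypothetical, introduced here only for illustration.

```python
import hashlib
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    """Collapse whitespace runs and lowercase, so trivially different copies hash the same."""
    return " ".join(text.split()).lower()


def dedupe_exact(docs: Iterable[str]) -> Iterator[str]:
    """Yield each document the first time its normalized form is seen."""
    seen = set()  # SHA-1 digests of normalized documents already emitted
    for doc in docs:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).digest()
        if digest in seen:
            continue  # same normalized text as an earlier document
        seen.add(digest)
        yield doc


# Made-up sample documents; the second differs from the first only in spacing.
sample = [
    "Скопје е главен град.",
    "Скопје  е главен   град.",
    "Охрид е град на езеро.",
]
unique = list(dedupe_exact(sample))
```

This only catches copies that are identical after normalization. Near-duplicates (boilerplate with small edits, overlapping web crawls) would need a fuzzy method such as MinHash, which is outside the scope of this sketch.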
Provided by:
LVSTCK



