
smartcat/MaCoCu_sr_en

Hugging Face | Updated 2024-10-03 | Indexed 2024-12-14
Download link:
https://hf-mirror.com/datasets/smartcat/MaCoCu_sr_en
Description:
The Serbian web corpus MaCoCu-sr 1.0 was built by crawling the .rs and .срб internet top-level domains in 2021 and 2022, extending the crawl dynamically to other domains. This high-quality web corpus is characterized by extensive metadata, making it useful for corpus linguistics studies as well as for training language models and other language technologies. The source data for the Serbian translations is derived from this corpus. The crawler is available at https://github.com/macocu/MaCoCu-crawler. Considerable effort was devoted to cleaning the extracted text: boilerplate was removed using Justext, near-duplicate paragraphs were removed using Onion, very short texts and texts not in the target language were discarded, and extensive metadata filtering was applied using Monotextor.
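
Two of the cleaning steps described above (discarding very short texts and removing near-duplicate paragraphs) can be illustrated with a minimal sketch. This is not the MaCoCu pipeline and does not call Justext, Onion, or Monotextor; it is a simplified stand-in that flags a paragraph as a near-duplicate when most of its word n-grams have already been seen. The function names, the thresholds `MIN_WORDS` and `DUP_THRESHOLD`, and the n-gram size are all assumptions chosen for illustration. Boilerplate removal, language identification, and metadata filtering are not covered.

```python
from __future__ import annotations

import re

# All three values are hypothetical, chosen only for this illustration.
MIN_WORDS = 5        # paragraphs with fewer words count as "very short"
NGRAM = 3            # size of the word n-grams (shingles)
DUP_THRESHOLD = 0.5  # share of already-seen n-grams that marks a near-duplicate


def shingles(paragraph: str, n: int = NGRAM) -> set:
    """Return the set of lower-cased word n-grams in a paragraph."""
    words = re.findall(r"\w+", paragraph.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def clean_paragraphs(paragraphs: list[str]) -> list[str]:
    """Drop very short paragraphs and paragraphs that mostly repeat earlier ones."""
    seen: set = set()  # n-grams from every paragraph kept so far
    kept = []
    for paragraph in paragraphs:
        # Step 1: discard very short texts.
        if len(paragraph.split()) < MIN_WORDS:
            continue
        # Step 2: discard near-duplicates, judged by n-gram overlap.
        grams = shingles(paragraph)
        if grams and len(grams & seen) / len(grams) >= DUP_THRESHOLD:
            continue
        seen |= grams
        kept.append(paragraph)
    return kept
```

Given a list of paragraphs in which one differs from an earlier paragraph only in its final word, the sketch keeps the first occurrence and drops the later one, along with any paragraph shorter than `MIN_WORDS` words.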
Provider:
smartcat