bswac
Source: huggingface.co, indexed 2025-01-15
Download link:
https://huggingface.co/datasets/community-datasets/bswac
Description:
The Bosnian web corpus bsWaC was built by crawling the .ba top-level domain in 2014. The corpus was near-deduplicated at the paragraph level, normalised via diacritic restoration, morphosyntactically annotated, and lemmatised. The paragraphs are shuffled. Each paragraph carries metadata on its URL, domain, and language identification (Bosnian vs. Croatian vs. Serbian).
Version 1.0 of this corpus is described in http://www.aclweb.org/anthology/W14-0405. Version 1.1 contains newer and better linguistic annotations.
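Since each paragraph carries URL, domain, and language-identification metadata, a common first step is to filter or tally paragraphs by those fields. The following is a minimal sketch of that idea only: the field names (`text`, `url`, `domain`, `lang`), the language labels (`"bs"`, `"hr"`, `"sr"`), and the sample records are all hypothetical and are not taken from the actual bsWaC schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Paragraph:
    # Hypothetical field names; the real corpus schema may differ.
    text: str
    url: str
    domain: str
    lang: str  # assumed labels: "bs", "hr", "sr"


def filter_by_language(paragraphs, lang):
    """Keep only paragraphs whose language-identification label equals `lang`."""
    return [p for p in paragraphs if p.lang == lang]


def count_by_domain(paragraphs):
    """Count how many paragraphs come from each domain."""
    return Counter(p.domain for p in paragraphs)


# Invented sample records, for illustration only.
sample = [
    Paragraph("Primjer teksta.", "http://example.ba/a", "example.ba", "bs"),
    Paragraph("Drugi primjer.", "http://example.ba/b", "example.ba", "hr"),
    Paragraph("Treći primjer.", "http://primjer.ba/c", "primjer.ba", "bs"),
]

bosnian = filter_by_language(sample, "bs")
domain_counts = count_by_domain(sample)
```

Because the corpus is shuffled by paragraph, per-paragraph metadata like this is the only way to regroup text by source; document order cannot be recovered from position alone.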
Provider:
huggingface.co



