Montenegrin web corpus MaCoCu-cnr 1.0
Source: hdl.handle.net (indexed 2025-01-16)
Download link:
http://hdl.handle.net/11356/1809
Description:
The Montenegrin web corpus MaCoCu-cnr 1.0 was built by crawling the ".me" internet top-level domain in 2021 and 2022, extending the crawl dynamically to other domains as well. The crawler is available at https://github.com/macocu/MaCoCu-crawler.
Considerable effort was devoted to cleaning the extracted text to provide a high-quality web corpus. This was achieved by removing boilerplate (https://corpus.tools/wiki/Justext) and near-duplicate paragraphs (https://corpus.tools/wiki/Onion), and by discarding very short texts as well as texts that are not in the target language. The dataset comes with extensive metadata, which allows filtering based on text quality and other criteria (https://github.com/bitextor/monotextor), making the corpus useful for corpus linguistics studies as well as for training language models and other language technologies.
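The two filtering ideas mentioned above (dropping very short texts and dropping near-duplicate paragraphs) can be illustrated with a minimal sketch. This is not the jusText or Onion implementation: the word n-gram overlap heuristic, the function names and the thresholds (`min_words`, `max_seen_ratio`, `n`) are all assumptions chosen for illustration.

```python
import re


def shingles(text, n=5):
    """Return the set of word n-grams in text (one shorter tuple if text has fewer than n words)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def clean_paragraphs(paragraphs, min_words=5, max_seen_ratio=0.5, n=5):
    """Keep paragraphs that are long enough and not mostly made of already-seen n-grams.

    Hypothetical thresholds: a paragraph is dropped if it has fewer than
    `min_words` whitespace-separated tokens, or if more than `max_seen_ratio`
    of its n-grams already occurred in previously kept paragraphs.
    """
    seen = set()
    kept = []
    for par in paragraphs:
        if len(par.split()) < min_words:
            continue  # very short text
        grams = shingles(par, n)
        if not grams:
            continue  # no word characters at all
        seen_ratio = len(grams & seen) / len(grams)
        if seen_ratio > max_seen_ratio:
            continue  # near-duplicate of earlier content
        seen |= grams
        kept.append(par)
    return kept
```

Calling `clean_paragraphs` on a list of paragraph strings returns the retained paragraphs in their original order. The actual pipeline also removes boilerplate and filters by language, which this sketch does not attempt.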
In XML format, each document is accompanied by the following metadata: title, crawl date, URL, domain, file type of the original document, distribution of languages inside the document, and a fluency score based on a language model. The text of each document is divided into paragraphs, and each paragraph carries its own metadata: whether the paragraph is a heading; a paragraph quality label (such as “short” or “good”), assigned by the jusText tool (https://corpus.tools/wiki/Justext) based on paragraph length, URL and stopword density; a fluency score between 0 and 1, assigned with the Monocleaner tool (https://github.com/bitextor/monocleaner); the automatically identified language of the paragraph text; and whether the paragraph contains sensitive information, as identified by the Biroamer tool (https://github.com/bitextor/biroamer).
The corpus can be easily read with the prevert parser (https://pypi.org/project/prevert/).
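As a rough illustration of filtering on the paragraph metadata described above, the sketch below parses a small inline sample with the Python standard library only. It does not use the prevert package and does not reproduce its API. The attribute names (`quality`, `lm_score`, `sensitive`, `heading`, `lang`, `url`) and the layout of the sample are assumptions and should be checked against the downloaded files.

```python
import re
from html import unescape

# Matches name="value" pairs inside an opening tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

# Invented sample in an XML-like, one-tag-per-line layout (assumed, not taken from the corpus).
SAMPLE = '''<doc id="1" title="Primjer" url="https://example.me/a" domain="example.me" lm_score="0.91">
<p id="1.1" heading="yes" quality="short" lm_score="0.40" lang="cnr" sensitive="no">
Naslov
</p>
<p id="1.2" heading="no" quality="good" lm_score="0.95" lang="cnr" sensitive="no">
Ovo je primjer pasusa &amp; teksta.
</p>
<p id="1.3" heading="no" quality="good" lm_score="0.88" lang="cnr" sensitive="yes">
Kontakt: ime@example.me
</p>
</doc>'''


def parse_attrs(tag_line):
    """Return the attributes of an opening tag as a dict, with entities unescaped."""
    return {key: unescape(value) for key, value in ATTR_RE.findall(tag_line)}


def iter_paragraphs(lines):
    """Yield (doc_meta, par_meta, text) for every <p> element in the line stream."""
    doc_meta, par_meta, buf = {}, None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(("<doc ", "<doc>")):
            doc_meta = parse_attrs(line)
        elif line.startswith(("<p ", "<p>")):
            par_meta, buf = parse_attrs(line), []
        elif line.startswith("</p>"):
            if par_meta is not None:
                yield doc_meta, par_meta, unescape("\n".join(buf))
            par_meta = None
        elif par_meta is not None:
            buf.append(line)  # text line inside the current paragraph


def good_paragraphs(lines, min_fluency=0.5):
    """Yield (url, text) for fluent, non-sensitive paragraphs labelled "good"."""
    for doc_meta, par_meta, text in iter_paragraphs(lines):
        if (par_meta.get("quality") == "good"
                and par_meta.get("sensitive") == "no"
                and float(par_meta.get("lm_score", 0)) >= min_fluency):
            yield doc_meta.get("url"), text


for url, text in good_paragraphs(SAMPLE.splitlines()):
    print(url, text)
```

For the full corpus, the same filter can be fed an open file object instead of `SAMPLE.splitlines()`, so the data is streamed line by line rather than loaded into memory. The `min_fluency` threshold is a hypothetical value.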
Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please: (1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted. (2) Clearly identify the copyrighted work claimed to be infringed. (3) Clearly identify the material that is claimed to be infringing and information reasonably sufficient in order to allow us to locate the material. (4) Please write to the contact person for this resource whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author’s view. The Agency is not responsible for any use that may be made of the information it contains.
Provider:
hdl.handle.net



