Web corpus MaCoCu

SSH Open MarketPlace | Updated 2024-09-30 | Indexed 2024-10-05

Download link:

https://marketplace.sshopencloud.eu/dataset/liaLlA

Description:

These corpora contain web texts collected by crawling national internet top-level domains (specified below) and by dynamically extending the crawl to other domains. The crawler is available at the [MaCoCu GitHub channel](https://github.com/macocu/MaCoCu-crawler). Considerable effort was devoted to cleaning the extracted text to provide a high-quality web corpus. This was achieved by removing [boilerplate](https://corpus.tools/wiki/Justext) and [near-duplicated paragraphs](https://corpus.tools/wiki/Onion), and by discarding very short texts as well as texts that are not in the target language. Furthermore, samples from the largest 1,500 domains were manually checked, and bad domains, such as machine-translated domains, were removed.

The dataset comes with extensive metadata, which allows filtering based on text quality and [other criteria](https://github.com/bitextor/monotextor), making the corpus highly useful for corpus linguistics studies as well as for training language models and other language technologies. In XML format, each document is accompanied by the following metadata: title, crawl date, URL, domain, file type of the original document, distribution of languages inside the document, and a fluency score based on a language model. The text of each document is divided into paragraphs, and each paragraph carries metadata on whether it is a heading, on its quality (labels such as “short” or “good”, assigned based on paragraph length, URL and stopword density via the [jusText tool](https://corpus.tools/wiki/Justext)), on its fluency (a score between 0 and 1, assigned with the [Monocleaner tool](https://github.com/bitextor/monocleaner)), on the automatically identified language of its text, and on whether it contains sensitive information (identified via the [Biroamer tool](https://github.com/bitextor/biroamer)).

For corpora released as version 2.0, the metadata on the languages of the texts is more accurate than in the previous version. This was achieved by using [Google's Compact Language Detector 2 (CLD2)](https://github.com/CLD2Owners/cld2), a high-performance language detector supporting many languages. Other tools used for web corpus creation and curation have been updated as well, resulting in a cleaner and larger corpus. The corpus is available for download from the Slovenian repository CLARIN.SI and can be easily read with the [prevert parser](https://pypi.org/project/prevert/).
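
The paragraph-level metadata described above lends itself to simple filtering. The following is a minimal sketch, not the official MaCoCu tooling: it uses only the Python standard library on a tiny made-up sample, and the element and attribute names (`doc`, `p`, `quality`, `lm_score`, `lang`, `sensitive`, `heading`) as well as the wrapping `<corpus>` root are assumptions chosen for illustration. Check them against the actual corpus files before reuse.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: attribute names and the <corpus> wrapper are assumptions,
# not the documented MaCoCu schema.
SAMPLE = """<corpus>
<doc title="Example" url="https://example.si/a" domain="example.si" lm_score="0.91">
<p heading="yes" quality="short" lm_score="0.88" lang="sl" sensitive="no">Naslov</p>
<p heading="no" quality="good" lm_score="0.95" lang="sl" sensitive="no">Prvi odstavek besedila.</p>
<p heading="no" quality="good" lm_score="0.40" lang="sl" sensitive="no">Drugi odstavek.</p>
<p heading="no" quality="good" lm_score="0.97" lang="en" sensitive="no">An English paragraph.</p>
<p heading="no" quality="good" lm_score="0.99" lang="sl" sensitive="yes">Oseben podatek.</p>
</doc>
</corpus>"""


def filter_paragraphs(xml_text, lang, min_fluency=0.5, quality="good"):
    """Return (url, text) pairs for paragraphs that pass all metadata filters."""
    root = ET.fromstring(xml_text)
    kept = []
    for doc in root.iter("doc"):
        for p in doc.iter("p"):
            if p.get("quality") != quality:
                continue  # drop paragraphs not carrying the wanted quality label
            if p.get("lang") != lang:
                continue  # drop paragraphs in another language
            if p.get("sensitive") == "yes":
                continue  # drop paragraphs flagged as sensitive
            if float(p.get("lm_score", "0")) < min_fluency:
                continue  # drop paragraphs below the fluency threshold
            kept.append((doc.get("url"), (p.text or "").strip()))
    return kept


if __name__ == "__main__":
    for url, text in filter_paragraphs(SAMPLE, lang="sl"):
        print(url, text)
```

For full-size corpus files, a streaming reader such as the prevert parser mentioned above is likely a better fit than loading everything into memory at once.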

Created:

2024-09-30
