Czech Web Corpus 2017 (csTenTen17)
Source: hdl.handle.net. Updated: 2017-11-01. Indexed: 2025-03-25.
Download link:
http://hdl.handle.net/11234/1-4835
Description:
The Czech Web Corpus 2017 (csTenTen17) is a Czech corpus made up of texts collected from the Internet, mostly from the Czech national top-level domain ".cz". The data was crawled by the web crawler SpiderLing (https://corpus.tools/wiki/SpiderLing).
The data was cleaned by removing boilerplate (using https://corpus.tools/wiki/Justext), removing near-duplicate paragraphs (using https://corpus.tools/wiki/Onion) and discarding paragraphs not in the target language.
The corpus was POS-annotated by the morphological analyser Majka using this POS tagset: https://www.sketchengine.eu/tagset-reference-for-czech/.
Text sources: General web, Wikipedia.
Time span of crawling: May, October and November 2017; October and November 2016; October and November 2015. The Czech Wikipedia part was downloaded in November 2017.
Data format: plain text in vertical format (one token per line), gzip-compressed. The vertical contains the following structures: documents (<doc/>, usually corresponding to web pages), paragraphs (<p/>), sentences (<s/>) and word join markers (<g/>, a "glue" tag indicating that there was no space between the surrounding tokens in the original text).
Document metadata: src (the source of the data), title (the title of the web page), url (the URL of the document), crawl_date (the date of downloading the document).
Paragraph metadata: heading ("1" if the paragraph is a heading, usually an <h1> to <h6> element in the original HTML data).
Paragraph separators: block elements in the case of an HTML source, double blank lines in the case of other source formats. An internal heuristic tool was used to mark sentence breaks.
Positional attributes (tab-separated): word form, morphological annotation, lem-POS (the base form of the word, i.e. the lemma, with a part-of-speech suffix) and gender-respecting lemma (nouns and adjectives only).
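The vertical format described above can be read with a short script. The following is a minimal Python sketch, not an official reader: the function name read_sentences, the Token field names, the tag-detection regular expression, and the sample lines (including the placeholder tags TAG1 to TAG4 and the lem-POS suffixes) are all assumptions made for illustration and are not taken from the corpus documentation. It yields one record per sentence and uses the <g/> glue tag to restore the original spacing.

```python
import gzip
import io
import re
from typing import Dict, Iterable, Iterator, List, NamedTuple, Tuple

# Heuristic (an assumption): a line without a tab that looks like one of the
# four structure tags is treated as markup; everything else is a token line.
STRUCT_RE = re.compile(r'^<(/?)(doc|p|s|g)(\s.*)?/?>$')
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


class Token(NamedTuple):
    word: str          # word form
    tag: str           # morphological annotation
    lempos: str        # lemma with part-of-speech suffix
    gender_lemma: str  # gender-respecting lemma (may be empty)


def read_sentences(
    lines: Iterable[str],
) -> Iterator[Tuple[Dict[str, str], str, List[Token]]]:
    """Yield (document attributes, sentence text, tokens) per sentence."""
    doc_attrs: Dict[str, str] = {}
    tokens: List[Token] = []
    pieces: List[str] = []
    glue = False  # True right after <g/>: no space before the next token
    for raw in lines:
        line = raw.rstrip("\n")
        match = STRUCT_RE.match(line) if "\t" not in line else None
        if match:
            closing, name, attrs = match.group(1), match.group(2), match.group(3) or ""
            if name == "doc" and not closing:
                doc_attrs = dict(ATTR_RE.findall(attrs))
            elif name == "g":
                glue = True
            elif name == "s" and closing:
                if tokens:
                    yield doc_attrs, "".join(pieces), tokens
                tokens, pieces, glue = [], [], False
            continue
        if not line:
            continue
        cols = line.split("\t")
        cols += [""] * (4 - len(cols))  # pad missing trailing attributes
        tok = Token(*cols[:4])
        if pieces and not glue:
            pieces.append(" ")
        pieces.append(tok.word)
        tokens.append(tok)
        glue = False


# Invented sample in the documented layout; tags and lemmas are placeholders.
SAMPLE = "\n".join([
    '<doc src="web" title="Ukázka" url="http://example.cz/" crawl_date="2017-05-01">',
    '<p heading="1">',
    '<s>',
    'Ahoj\tTAG1\tahoj-x\t',
    '<g/>',
    ',\tTAG2\t,-x\t',
    'světe\tTAG3\tsvět-n\tsvět',
    '<g/>',
    '!\tTAG4\t!-x\t',
    '</s>',
    '</p>',
    '</doc>',
])

for attrs, text, toks in read_sentences(io.StringIO(SAMPLE)):
    print(attrs["url"], text, len(toks))

# For the real file (the path is hypothetical), stream it instead of loading it:
# with gzip.open("cstenten17.vert.gz", "rt", encoding="utf-8") as fh:
#     for attrs, text, toks in read_sentences(fh):
#         ...
```

Streaming line by line keeps memory use flat, which matters for a web-scale corpus. The tag heuristic would misclassify a token whose word form is literally one of the structure tags, so it should be checked against the actual data before use.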
Please cite the following paper when using the corpus for your research: Suchomel, Vít. csTenTen17, a Recent Czech Web Corpus. In Recent Advances in Slavonic Natural Language Processing, pp. 111–123. 2018. (https://nlp.fi.muni.cz/raslan/raslan18.pdf#page=119)
Provider:
hdl.handle.net



