Webis Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18)
Source: NIAID Data Ecosystem (indexed 2026-03-13)
Download link:
https://zenodo.org/record/3372484
Description:
The Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18) contains text reuse cases extracted from within Wikipedia and between Wikipedia and a sample of the Common Crawl.
The corpus has the following structure:
wikipedia.jsonl.bz2: Each line represents a Wikipedia article and contains a JSON array of article_id, article_title, and article_body.
within-wikipedia-tr-01.jsonl.bz2: Each line represents a text reuse case and contains a JSON array of s_id (source article id), t_id (target article id), s_text (source text), and t_text (target text).
within-wikipedia-tr-02.jsonl.bz2: Each line represents a text reuse case and contains a JSON array of s_id (source article id), t_id (target article id), s_text (source text), and t_text (target text).
preprocessed-web-sample.jsonl.xz: Each line represents a web page and contains a JSON object of d_id, d_url, and content.
without-wikipedia-tr.jsonl.bz2: Each line represents a text reuse case and contains a JSON array of s_id (Wikipedia article id), d_id (web page id), s_text (article text), and d_content (web page content).
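The file layout above can be read with the Python standard library alone. The following is a minimal sketch, not an official loader: the helper names open_text and iter_records and the REUSE_FIELDS tuple are illustrative assumptions. Because the description calls most lines a "json array" but one a "json object", the sketch accepts both shapes and, for arrays, assumes the values appear in the order the fields are listed.

```python
import bz2
import json
import lzma
from pathlib import Path

# Field order as listed in the corpus description. It is used only when a
# line turns out to be a positional JSON array rather than a JSON object.
# (Assumption: array values follow the listed order.)
REUSE_FIELDS = ("s_id", "t_id", "s_text", "t_text")


def open_text(path):
    """Open a .bz2 or .xz compressed file as UTF-8 text."""
    path = Path(path)
    opener = lzma.open if path.suffix == ".xz" else bz2.open
    return opener(path, mode="rt", encoding="utf-8")


def iter_records(path, fields):
    """Yield one dict per non-empty line of a compressed JSON Lines file."""
    with open_text(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            value = json.loads(line)
            if isinstance(value, dict):
                # The line already carries its own field names.
                yield value
            else:
                # Positional array: pair values with the assumed field order.
                yield dict(zip(fields, value))
```

With a local copy of the corpus, iter_records("within-wikipedia-tr-01.jsonl.bz2", REUSE_FIELDS) would stream the reuse cases one at a time, which avoids loading a whole decompressed file into memory.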
The datasets were extracted in the work by Alshomary et al. (2018), which aimed to study text reuse phenomena related to Wikipedia at scale. A pipeline for large-scale text reuse extraction was developed and applied to Wikipedia and the Common Crawl.
Created:
2022-08-29



