Webis Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18)

NIAID Data Ecosystem, indexed 2026-03-13

Download link:

https://zenodo.org/record/3372484

Description:

The Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18) contains text reuse cases extracted from within Wikipedia and between Wikipedia and a sample of the Common Crawl. The corpus has the following structure:

wikipedia.jsonl.bz2: Each line represents a Wikipedia article and contains a JSON array of article_id, article_title, and article_body.

within-wikipedia-tr-01.jsonl.bz2: Each line represents a text reuse case and contains a JSON array of s_id (source article id), t_id (target article id), s_text (source text), and t_text (target text).

within-wikipedia-tr-02.jsonl.bz2: Each line represents a text reuse case and contains a JSON array of s_id (source article id), t_id (target article id), s_text (source text), and t_text (target text).

preprocessed-web-sample.jsonl.xz: Each line represents a web page and contains a JSON object of d_id, d_url, and content.

without-wikipedia-tr.jsonl.bz2: Each line represents a text reuse case and contains a JSON array of s_id (Wikipedia article id), d_id (web page id), s_text (article text), and d_content (web page content).

The datasets were extracted in the work by Alshomary et al. (2018), which aimed to study text reuse phenomena related to Wikipedia at scale. A pipeline for large-scale text reuse extraction was developed and applied to Wikipedia and the Common Crawl.
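Because the files are line-delimited JSON compressed with bzip2 or xz, they can be streamed line by line without decompressing to disk. The following is a minimal sketch using only the Python standard library, not an official loader. The helper names (open_compressed, iter_jsonl, as_record) are hypothetical, and the positional order in REUSE_FIELDS is an assumption taken from the order in which the description lists the fields; as_record also accepts JSON objects in case a line is keyed rather than positional.

```python
import bz2
import json
import lzma
from pathlib import Path

# Field names as listed in the corpus description. The positional order for
# array-encoded lines is an assumption, not something the description confirms.
REUSE_FIELDS = ("s_id", "t_id", "s_text", "t_text")


def open_compressed(path):
    """Open a .bz2, .xz, or uncompressed file as UTF-8 text."""
    suffix = Path(path).suffix
    if suffix == ".bz2":
        return bz2.open(path, "rt", encoding="utf-8")
    if suffix == ".xz":
        return lzma.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")


def iter_jsonl(path):
    """Yield one parsed JSON value per non-empty line, streaming the file."""
    with open_compressed(path) as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def as_record(row, fields):
    """Return a dict whether the line held a JSON array or a JSON object."""
    if isinstance(row, dict):
        return row
    return dict(zip(fields, row))
```

For example, iterating over iter_jsonl("within-wikipedia-tr-01.jsonl.bz2") and passing each row through as_record(row, REUSE_FIELDS) would give one dictionary per text reuse case, provided the file has been downloaded to the working directory. The same iterator reads preprocessed-web-sample.jsonl.xz, since the compression format is chosen from the file extension.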

Created:

2022-08-29
