jpwahle/machine-paraphrase-dataset
Dataset Overview
Dataset Name
- Name: Machine Paraphrase Dataset (MPC)
- Alias: Machine Paraphrase Dataset (SpinnerChief/SpinBot)
Dataset Properties
- Language: English
- Multilinguality: monolingual
- License: CC-BY-4.0
- Size: 100K<n<1M
- Source data: original
- Tags: spinbot, spinnerchief, plagiarism, paraphrase, academic integrity, arxiv, wikipedia, theses
- Task categories: text-classification, text-generation
- Papers with Code ID: identifying-machine-paraphrased-plagiarism
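Dataset Structure
- Data instances: each instance contains a text, a label, the source dataset, and the method
- Data fields:
  - text: the text content
  - label: whether the text is paraphrased (1) or original (0)
  - dataset: the data source (Wikipedia, arXiv, or theses)
  - method: the method used (SpinBot, SpinnerChief, or original)
- Data splits:
  - Training set: Wikipedia x SpinBot
  - Test set: [Wikipedia, arXiv, theses] x [SpinBot, SpinnerChief]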
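The fields and splits listed above can be modeled in a few lines of Python. This is a minimal sketch, not the dataset's official tooling: the names `Record`, `is_consistent`, `evaluation_grid`, and `select` are hypothetical, and the pairing of label 0 with method "original" is an assumption drawn from the field descriptions. The exact string casing in the released files may differ.

```python
from dataclasses import dataclass
from itertools import product

# Values taken from the field descriptions above; the constant names are
# illustrative and not part of the dataset itself.
SOURCES = ("Wikipedia", "arXiv", "theses")
PARAPHRASERS = ("SpinBot", "SpinnerChief")
ORIGINAL = "original"


@dataclass(frozen=True)
class Record:
    text: str     # paragraph content
    label: int    # 1 = machine-paraphrased, 0 = original
    dataset: str  # one of SOURCES
    method: str   # one of PARAPHRASERS, or ORIGINAL


def is_consistent(rec: Record) -> bool:
    """Check a record against the documented field values.

    Assumption: label 0 goes with method "original" and label 1 with
    one of the paraphrasing tools.
    """
    if rec.dataset not in SOURCES or rec.label not in (0, 1):
        return False
    if rec.label == 0:
        return rec.method == ORIGINAL
    return rec.method in PARAPHRASERS


def evaluation_grid() -> list[tuple[str, str]]:
    """All (source, tool) pairs covered by the test split."""
    return list(product(SOURCES, PARAPHRASERS))


def select(records, dataset: str, method: str) -> list[Record]:
    """Records for one configuration: that tool's paraphrases plus the
    originals from the same source."""
    return [
        r for r in records
        if r.dataset == dataset and r.method in (method, ORIGINAL)
    ]
```

In practice the records would come from the published files, for example via the Hugging Face `datasets` library and the repository ID `jpwahle/machine-paraphrase-dataset`. Whether the hosted copy exposes exactly these column names and values should be checked before relying on the sketch.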
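Dataset Creation
- Source data:
  - Paragraphs from featured articles in English Wikipedia
  - Paragraphs from full-text PDFs of arXMLiv
  - Paragraphs from full-text PDFs of Czech student theses (bachelor's, master's, PhD)
- License: CC BY-NC 4.0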
Citation Information
@inproceedings{10.1007/978-3-030-96957-8_34,
  title = {Identifying Machine-Paraphrased Plagiarism},
  author = {Wahle, Jan Philip and Ruas, Terry and Folt{\'y}nek, Tom{\'a}{\v{s}} and Meuschke, Norman and Gipp, Bela},
  year = 2022,
  booktitle = {Information for a Better World: Shaping the Global Future},
  publisher = {Springer International Publishing},
  address = {Cham},
  pages = {393--413},
  isbn = {978-3-030-96957-8},
  editor = {Smits, Malte},
  abstract = {Employing paraphrasing tools to conceal plagiarized text is a severe threat to academic integrity. To enable the detection of machine-paraphrased text, we evaluate the effectiveness of five pre-trained word embedding models combined with machine learning classifiers and state-of-the-art neural language models. We analyze preprints of research papers, graduation theses, and Wikipedia articles, which we paraphrased using different configurations of the tools SpinBot and SpinnerChief. The best performing technique, Longformer, achieved an average F1 score of 80.99{\%} (F1 = 99.68{\%} for SpinBot and F1 = 71.64{\%} for SpinnerChief cases), while human evaluators achieved F1 = 78.4{\%} for SpinBot and F1 = 65.6{\%} for SpinnerChief cases. We show that the automated classification alleviates shortcomings of widely-used text-matching systems, such as Turnitin and PlagScan.}
}



