Detecting Machine-obfuscated Plagiarism

Source: DataCite Commons. Updated: 2022-04-10. Indexed: 2024-07-13.
Download link:
http://deepblue.lib.umich.edu/data/concern/data_sets/2801pg45f
Description:
This data set comprises multiple folders.

The corpus folder contains the raw text used for training and testing in two splits, "document" and "paragraph". The spun documents and paragraphs were generated using the SpinBot tool (https://spinbot.com/API). The paragraph split was generated by selecting only those paragraphs in the document split that have 3 or more sentences. Each folder is divided into mg (i.e., machine generated through SpinBot) and og (i.e., original generated file).

The human judgement folder contains the human evaluation comparing original and spun documents (sample). It also contains the answers (keys) and the survey results.

The models folder contains the machine learning classifier models for each word embedding technique used (trained on the document split only). The models were exported using pickle (Python 3.6). The grid search for hyperparameter adjustments is described in the paper.

The vector folders (train and test) contain the average of all word vectors for each document and paragraph. Each line holds as many values as the word embedding technique has dimensions (see the paper for more details), followed by the respective class label (i.e., mg or og). Each file belongs to one class, either "mg" or "og". The values are comma-separated, as in a .csv file. The files use the .arff extension but can be read as ordinary .txt files.
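The vector file layout described above (comma-separated values followed by a class label) can be read with a few lines of Python. The following is a minimal sketch, not the authors' code: the function name `parse_vector_lines`, the in-memory sample rows, and the skipping of lines that start with `@` or `%` (in case the files carry ARFF-style header or comment lines) are all assumptions made for illustration.

```python
import csv
import io
from typing import Iterable, List, Tuple


def parse_vector_lines(lines: Iterable[str]) -> Tuple[List[List[float]], List[str]]:
    """Parse rows of comma-separated floats that end with a class label."""
    features: List[List[float]] = []
    labels: List[str] = []
    for row in csv.reader(lines):
        # Assumption: skip blank lines and any ARFF-style header/comment lines.
        if not row or row[0].lstrip().startswith(("@", "%")):
            continue
        # Every cell except the last is a vector component; the last is the label.
        *values, label = [cell.strip() for cell in row]
        features.append([float(v) for v in values])
        labels.append(label)
    return features, labels


# Hypothetical 3-dimensional sample; real files have as many values per line
# as the embedding technique has dimensions.
sample = io.StringIO("0.12,-0.5,0.33,mg\n0.01,0.2,-0.7,og\n")
X, y = parse_vector_lines(sample)
print(len(X), len(X[0]), y)  # prints: 2 3 ['mg', 'og']
```

To read an actual file from the data set, an open file handle can be passed in place of the `io.StringIO` sample, since `csv.reader` accepts any iterable of text lines.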
Provider:
University of Michigan
Created:
2019-12-06