PAN Arabic Intrinsic Plagiarism Detection Shared Task Corpus

NIAID Data Ecosystem, indexed 2026-03-13

Download link:

https://zenodo.org/record/6609195


Description:

Evaluation corpus for ARAbic INtrinsic plagiarism detection (InAra Corpus)

This corpus was used in the AraPlagDet 2015 shared task. More details can be found at https://araplagdet.misc-lab.org/ or https://pan.webis.de/fire15/pan15-web/index.html

I. SYNOPSIS

The InAra corpus comprises 2048 documents; 80% of them contain passages borrowed from other documents, to simulate documents that contain plagiarized fragments. The corpus has 2 parts: training and test.

II. DESCRIPTION

Each part of the corpus (training and test) consists mainly of 2 datasets: text files and XML files. The text files are the suspicious documents, i.e., the documents that contain artificial plagiarism. The XML files are the plagiarism annotations, i.e., they give, for each plagiarized passage, its starting offset in the suspicious document and its length (both expressed in characters). A suspicious document file and its plagiarism annotation file share the same name.

III. PURPOSE

The purpose of the InAra corpus is to evaluate automatic plagiarism detection methods, notably methods following the intrinsic approach. This approach uncovers plagiarized passages on the basis of writing style inconsistencies within a given suspicious document. As opposed to the external approach, the intrinsic approach does not require comparing the suspicious document against potential sources of plagiarism. Hence, the InAra corpus is not appropriate for evaluating external plagiarism detection, because the sources of plagiarism are not provided.

Note that some documents in the InAra corpus contain religious quotations (e.g., from the Quran and the Hadith). These quotations have a distinctive writing style, so a simple intrinsic plagiarism detector may flag them as plagiarism. However, quotations are not plagiarism, and they are not annotated in the XML files in InAra. Hence, plagiarism detection systems evaluated on InAra should not treat religious quotations as plagiarism cases unless they appear as part of a larger plagiarism case.

IV. BUILDING METHODS

The documents that compose the InAra corpus do not contain actual plagiarism cases. They are artificial suspicious documents in which plagiarism was created automatically by software that takes fragments of text from one or more source documents and inserts them into another document according to a set of parameters, namely the percentage of plagiarism and the lengths of the plagiarized passages. This is the same building method used to construct the PAN 2009-2011 plagiarism detection corpora (see http://pan.webis.de for more information on the PAN competition and its corpora).

V. LANGUAGE AND ENCODING

All the text documents of this corpus are written in Arabic and encoded in UTF-8 without BOM.

VI. SOURCES OF TEXTS

The texts used to build this corpus, both the suspicious documents and the inserted passages, are taken mainly from the open library Arabic Wikisource (http://ar.wikisource.org), one of the Wikimedia Foundation projects. A small number of documents were taken from other websites, namely:

Create your own country blog: http://diycountry.blogspot.com
Corpus of Classical Arabic (KSUCCA): http://ksucorpus.ksu.edu.sa
Islamic book web site: http://www.islamicbook.ws

VII. COPYRIGHT AND AVAILABILITY

We were very careful to build the corpus from copyright-free texts only, so that it can be made publicly available without any problems with the texts' owners.

VIII. HOW TO CITE THE CORPUS?

If you publish a paper about your experiments using the InAra corpus, please cite the following paper:

Bensalem, I., Boukhalfa, I., Rosso, P., Abouenour, L., Darwish, K., & Chikhi, S.: Overview of the AraPlagDet PAN@FIRE2015 Shared Task on Arabic Plagiarism Detection. In P. Majumder, M. Mitra, M. Agrawal, & P. Mehta (Eds.), Post Proceedings of the Workshops at the 7th Forum for Information Retrieval Evaluation (FIRE 2015), Gandhinagar, India, December 4-6, CEUR proceedings vol. 1587 (pp. 111–122). CEUR-WS.org (2015).

We encourage you to compare your method tested on InAra with the methods of the AraPlagDet competition described in the paper above.

Additional information on how the corpus was built can be found in the following papers:

Bensalem, I., Rosso, P., Chikhi, S.: A New Corpus for the Evaluation of Arabic Intrinsic Plagiarism Detection. In: Forner, P., Müller, H., Paredes, R., Rosso, P., and Stein, B. (eds.) CLEF 2013, LNCS, vol. 8138. pp. 53–58. Springer, Heidelberg (2013).

Bensalem, I., Rosso, P., Chikhi, S.: Building Arabic Corpora from Wikisource. 10th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA'13), May 27-30, Fes/Ifrane, Morocco (2013). IEEE.

You may wish to compare the results of your experiments with the results of the following papers, which used the InAra corpus:

Bensalem I, Rosso P, Chikhi S (2019) On the use of character n-grams as the only intrinsic evidence of plagiarism. Language Resources and Evaluation 53:363–396. doi: 10.1007/s10579-019-09444-w

Mahgoub AY, Magooda A, Rashwan M, et al (2015) RDI System for Intrinsic Plagiarism Detection (RDI_RID), Working Notes for PAN-AraPlagDet at FIRE 2015. In: Majumder P, Mitra M, Agrawal M, Mehta P (eds) Post Proceedings of the Workshops at the 7th Forum for Information Retrieval Evaluation (FIRE 2015), Gandhinagar, India, December 4-6, CEUR proceedings vol. 1587. CEUR-WS.org, pp 129–130

IX. WARNING

Note that the Arabic texts may contain quotations from the Quran and the Hadith. Because text insertion is automatic and happens at random positions, a plagiarized passage may be inserted unintentionally between Quranic verses or between sentences of a Hadith cited in a document. Hence, the inserted passages may alter the meaning of the original text. For these reasons, this corpus must not be used outside the purpose for which it was built. Examples of inappropriate use include using the corpus documents as a source of knowledge or distributing them without mentioning that they contain borrowed texts. If you are not interested in plagiarism detection and you are retrieving the corpus because it contains books you want to read, then this corpus is not the right source. Please refer instead to the sources mentioned in Section VI, where you can find the original content of the books you are looking for. We emphasize that we are not responsible for the results of any use of this corpus other than the evaluation of intrinsic plagiarism detection methods.

X. CONTACT US

We will be happy to hear from you about your experience in using the InAra corpus. Please do not hesitate to contact us at the following email address: bens.imene@gmail.com

Imene Bensalem¹, Paolo Rosso², Salim Chikhi¹
¹MISC Lab, Constantine 2 University, Algeria
²PRHLT, Universitat Politècnica de València, Spain
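
The file layout described in Section II (a text file and an XML annotation file sharing the same name, with offsets and lengths in characters) could be read along the following lines. This is a minimal sketch and not part of the corpus distribution. It assumes that the annotation files follow the PAN convention of elements carrying `this_offset` and `this_length` attributes; the description above only states that an offset and a length are given. The function names `read_passages` and `load_pair` and the `.txt`/`.xml` suffixes are likewise hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def read_passages(xml_data, offset_attr="this_offset", length_attr="this_length"):
    """Return sorted (offset, length) pairs, in characters, from one annotation.

    ASSUMPTION: the attribute names follow the PAN corpora; change the two
    keyword arguments if the InAra files use different names.
    """
    root = ET.fromstring(xml_data)  # accepts str or bytes
    passages = []
    for elem in root.iter():
        if offset_attr in elem.attrib and length_attr in elem.attrib:
            passages.append((int(elem.attrib[offset_attr]),
                             int(elem.attrib[length_attr])))
    return sorted(passages)


def load_pair(folder, name):
    """Load a suspicious document and the text of its annotated passages."""
    folder = Path(folder)
    # Decode the raw bytes ourselves: offsets count characters, not UTF-8
    # bytes, and this avoids any newline translation that could shift them.
    text = (folder / f"{name}.txt").read_bytes().decode("utf-8")
    passages = read_passages((folder / f"{name}.xml").read_bytes())
    return text, [text[offset:offset + length] for offset, length in passages]
```

Slicing the decoded string rather than the raw bytes matters for Arabic text, where most characters occupy more than one byte in UTF-8.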
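
Section III describes the corpus as a benchmark for detectors, whose output is compared with the annotated passages. One simple way to make such a comparison is character-level precision and recall over (offset, length) spans, sketched below. This is a simplified illustration and is not claimed to be the official measure of the shared task; the function names are hypothetical.

```python
def covered(spans):
    """Return the set of character positions covered by (offset, length) spans."""
    chars = set()
    for offset, length in spans:
        chars.update(range(offset, offset + length))
    return chars


def char_precision_recall(detected, annotated):
    """Character-level precision and recall of detected spans for one document.

    Convention used in this sketch: a score is 0.0 when its denominator is
    empty (no detections, or a document without annotated plagiarism).
    """
    det, ann = covered(detected), covered(annotated)
    overlap = len(det & ann)
    precision = overlap / len(det) if det else 0.0
    recall = overlap / len(ann) if ann else 0.0
    return precision, recall


# One annotated passage covering characters 10..29, one detection covering 20..39.
print(char_precision_recall(detected=[(20, 20)], annotated=[(10, 20)]))  # (0.5, 0.5)
```

Under this scoring, a religious quotation flagged by a detector counts as a false positive, since quotations are not annotated in the XML files.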
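
The building method in Section IV (fragments from source documents inserted into another document according to a plagiarism percentage and a passage length) can be illustrated with a toy generator. This is an illustrative sketch only and not the authors' software: the function `insert_fragments`, its parameters, the uniform random choice of positions, and the single fixed passage length are all assumptions made for the example.

```python
import random


def insert_fragments(host, sources, ratio, passage_len, seed=0):
    """Insert source fragments into `host` at random character positions.

    `ratio` is the target share of inserted characters in the final text
    (0 <= ratio < 1). Returns the suspicious text and a list of
    (offset, length) annotations expressed in characters.
    """
    rng = random.Random(seed)
    total_inserted = ratio * len(host) / (1 - ratio)
    n_passages = round(total_inserted / passage_len)
    # Pick insertion points in the host and process them left to right.
    cuts = sorted(rng.randrange(len(host) + 1) for _ in range(n_passages))
    pieces, annotations, prev, out_len = [], [], 0, 0
    for cut in cuts:
        src = rng.choice(sources)
        start = rng.randrange(max(1, len(src) - passage_len + 1))
        fragment = src[start:start + passage_len]
        pieces.append(host[prev:cut])      # untouched host text
        out_len += cut - prev
        annotations.append((out_len, len(fragment)))
        pieces.append(fragment)            # inserted passage
        out_len += len(fragment)
        prev = cut
    pieces.append(host[prev:])
    return "".join(pieces), annotations
```

Because the positions are chosen without regard to content, an inserted passage can land in the middle of a quotation, which is the situation Section IX warns about.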

Created:

2022-06-03
