
PAN Arabic Intrinsic Plagiarism Detection Shared Task Corpus

Source: NIAID Data Ecosystem, indexed 2026-03-13
Download link:
https://zenodo.org/record/6609195
Resource description:
Evaluation corpus for ARAbic INtrinsic plagiarism detection (InAra Corpus)

This corpus was used in the AraPlagDet 2015 shared task. More details can be found at https://araplagdet.misc-lab.org/ or https://pan.webis.de/fire15/pan15-web/index.html

I. SYNOPSIS

The InAra corpus comprises 2048 documents; 80% of them contain passages borrowed from other documents, to simulate documents that contain plagiarized fragments. The corpus has two parts: training and test.

II. DESCRIPTION

Each part of the corpus (training and test) consists mainly of two datasets: text files and XML files. The text files are the suspicious documents, i.e., the documents that contain artificial plagiarism. The XML files are the plagiarism annotations: for each plagiarized passage, they give its starting offset in the suspicious document and its length (offset and length are both expressed in characters). A suspicious document file and its plagiarism annotation file share the same name.

III. PURPOSE

The purpose of the InAra corpus is to evaluate automatic plagiarism detection methods, notably methods that follow the intrinsic approach. This approach uncovers plagiarized passages on the basis of writing-style inconsistencies within a given suspicious document. Unlike the external approach, the intrinsic approach does not require comparing the suspicious document against the potential sources of plagiarism. Hence, the InAra corpus is not appropriate for evaluating external plagiarism detection, because the sources of plagiarism are not provided.

It should be noted that some documents in the InAra corpus contain religious quotations (e.g., from the Quran and Hadith). These quotations have a distinctive writing style, so a simple intrinsic plagiarism detector may flag them as plagiarism. However, quotations are not plagiarism, and they are not annotated in the XML files in InAra.
Hence, plagiarism detection systems evaluated on InAra should not treat religious quotations as plagiarism cases unless they appear as part of a larger plagiarism case.

IV. BUILDING METHODS

The documents that compose the InAra corpus do not contain actual plagiarism cases. They are artificial suspicious documents in which plagiarism was created automatically by software that takes fragments of text from one or more source documents and inserts them into another document according to a set of parameters, namely the percentage of plagiarism and the lengths of the plagiarized passages. This is the same building method used to construct the PAN 2009-2011 plagiarism detection corpora (see http://pan.webis.de for more information on the PAN competition and its corpora).

V. LANGUAGE AND ENCODING

All the text documents of this corpus are written in Arabic and encoded in UTF-8 without BOM.

VI. SOURCES OF TEXTS

The texts used to build this corpus, both the suspicious documents and the inserted passages, are taken mainly from the open library Arabic Wikisource (http://ar.wikisource.org), one of the Wikimedia Foundation projects. A small number of documents were taken from other websites, namely:

Create your own country blog: http://diycountry.blogspot.com
Corpus of Classical Arabic (KSUCCA): http://ksucorpus.ksu.edu.sa
Islamic book web site: http://www.islamicbook.ws

VII. COPYRIGHT AND AVAILABILITY

We were very careful to build the corpus with copyright-free texts only, so that it can be made publicly available without any problems with the owners of the texts.

VIII. HOW TO CITE THE CORPUS

If you publish a paper about your experiments using the InAra corpus, please cite the following paper:

Bensalem, I., Boukhalfa, I., Rosso, P., Abouenour, L., Darwish, K., & Chikhi, S.: Overview of the AraPlagDet PAN@FIRE2015 Shared Task on Arabic Plagiarism Detection. In P. Majumder, M. Mitra, M. Agrawal, & P. Mehta (Eds.), Post Proceedings of the Workshops at the 7th Forum for Information Retrieval Evaluation (FIRE 2015), Gandhinagar, India, December 4-6, CEUR proceedings vol. 1587 (pp. 111–122). CEUR-WS.org (2015).

We encourage you to compare your method tested on InAra with the methods of the AraPlagDet competition described in the paper above.

Additional information on how the corpus was built is given in the following papers:

Bensalem, I., Rosso, P., Chikhi, S.: A New Corpus for the Evaluation of Arabic Intrinsic Plagiarism Detection. In: Forner, P., Müller, H., Paredes, R., Rosso, P., and Stein, B. (eds.) CLEF 2013, LNCS, vol. 8138. pp. 53–58. Springer, Heidelberg (2013).

Bensalem, I., Rosso, P., Chikhi, S.: Building Arabic Corpora from Wikisource. 10th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA'13), May 27-30, Fes/Ifran, Morocco (2013). IEEE.

You may wish to compare the results of your experiments with the results of the following papers, which used the InAra corpus:

Bensalem I, Rosso P, Chikhi S (2019) On the use of character n-grams as the only intrinsic evidence of plagiarism. Language Resources and Evaluation 53:363–396. doi: 10.1007/s10579-019-09444-w

Mahgoub AY, Magooda A, Rashwan M, et al (2015) RDI System for Intrinsic Plagiarism Detection (RDI_RID), Working Notes for PAN-AraPlagDet at FIRE 2015. In: Majumder P, Mitra M, Agrawal M, Mehta P (eds) Post Proceedings of the Workshops at the 7th Forum for Information Retrieval Evaluation (FIRE 2015), Gandhinagar, India, December 4-6, CEUR proceedings vol. 1587. CEUR-WS.org, pp 129–130

IX. WARNING

It should be noted that the Arabic texts may contain quotations from the Quran and the Hadith. Because text insertion is automatic and at random positions, a plagiarized passage may be inserted unintentionally between Quranic verses or between sentences of a Hadith cited in a document. Hence, the inserted passages may alter the meaning of the original text.
For these reasons, this corpus must not be used outside the purpose for which it was built. Examples of inappropriate use include using the corpus documents as a source of knowledge or distributing them without mentioning that they contain borrowed texts. If you are not interested in plagiarism detection and are keeping the corpus because it contains books you want to read, then this corpus is not the right source. Please refer instead to the sources mentioned in Section VI, where you can find the original content of the books you are looking for. We emphasize that we are not responsible for the results of any use of this corpus other than the evaluation of intrinsic plagiarism detection methods.

X. CONTACT US

We will be happy to hear from you about your experience in using the InAra corpus. Please do not hesitate to contact us at the following email address: bens.imene@gmail.com

Imene Bensalem¹, Paolo Rosso², Salim Chikhi¹
¹MISC Lab. Constantine 2 University, Algeria
²PRHLT, Universitat Politècnica de València, Spain
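The annotation layout described in Section II (one XML file per suspicious document, giving a character offset and a character length for each plagiarized passage) could be read along the following lines. This is only a minimal sketch under stated assumptions: the description above does not name the XML elements or attributes, so the attribute names `this_offset` and `this_length` are assumptions modeled on the PAN-style annotations and should be checked against the actual files. The helper names (`load_document`, `read_annotations`, `extract_passages`) are hypothetical, and "characters" is assumed to mean Unicode code points, which is how Python indexes strings.

```python
import xml.etree.ElementTree as ET


def load_document(path):
    """Read a suspicious document as UTF-8 text.

    newline="" disables newline translation, so character offsets are not
    shifted if a file happens to use \r\n line endings.
    """
    with open(path, encoding="utf-8", newline="") as handle:
        return handle.read()


def read_annotations(xml_text, offset_attr="this_offset", length_attr="this_length"):
    """Return sorted (offset, length) pairs found in an annotation file.

    ASSUMPTION: the attribute names are not given in the corpus description;
    the defaults follow the PAN-style convention and may need adjusting.
    """
    root = ET.fromstring(xml_text)
    spans = []
    for element in root.iter():
        if offset_attr in element.attrib and length_attr in element.attrib:
            spans.append((int(element.attrib[offset_attr]),
                          int(element.attrib[length_attr])))
    return sorted(spans)


def extract_passages(text, spans):
    """Slice the annotated passages out of the suspicious document text."""
    return [text[offset:offset + length] for offset, length in spans]


# Tiny in-memory illustration (not real corpus data).
sample_text = "hello world again"
sample_xml = (
    '<document reference="sample.txt">'
    '<feature name="plagiarism" this_offset="6" this_length="5"/>'
    '</document>'
)
sample_spans = read_annotations(sample_xml)
sample_passages = extract_passages(sample_text, sample_spans)
```

With real corpus files, the text would come from `load_document` and the XML string from the annotation file that shares the document's name. A detector's output can then be compared against these spans at the character level.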
Created:
2022-06-03