AntiplagiatCompany/CL4Lang

Name: AntiplagiatCompany/CL4Lang
Creator: AntiplagiatCompany
Published: 2024-10-10 12:35:24
License: apache-2.0

Source: Hugging Face. Updated: 2024-10-10. Indexed: 2025-04-12.

Download link:

https://hf-mirror.com/datasets/AntiplagiatCompany/CL4Lang


Description:

---
license: apache-2.0
language:
- ru
- hy
- es
- en
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files:
  - split: collection
    path:
    - collection.csv
  - split: query
    path:
    - query.csv
tags:
- paraphrase
- crosslingual
---

# Cross-lingual plagiarism detection: Two are better than one

The widespread availability of scientific documents in multiple languages, coupled with the development of automatic translation and editing tools, has created a demand for efficient methods that can detect plagiarism across different languages.

This is a dataset for evaluating cross-lingual plagiarism detection. The collection is a subset of Wikipedia articles in four languages (ru, hy, es, en). The query set consists of Wikipedia documents in each of the four languages that contain sentences from the collection translated with the Google Translate API, along with XML markup for them.

# Usage of Dataset

## Load Data

```python
from datasets import load_dataset

ds = load_dataset("AntiplagiatCompany/CL4Lang")
```

## Create Index of collection

```python
# Each item is a dictionary with the document id, the document text, and the
# text language. XML data is also present, but it is used only for queries,
# not for indexing.
collection = ds['collection'].to_list()

# The list can be indexed with different methods (vector search or classical
# BM25 indexing). make_index is a placeholder that is not defined here.
index = make_index(collection)
```

## Evaluate The Query Result

```python
# Each item is a dictionary with the document id, the document text, the text
# language, and XML information about text in the query reused from the collection.
queries = ds['query'].to_list()

real, predict = [], []
for query in queries:
    real.append(query['xml'])
    # convert_answer_to_xml is a placeholder that is not defined here.
    predict.append(
        convert_answer_to_xml(
            index.search(text=query['text'], lang=query['lang'])
        )
    )

# For a description of the XML markup and the evaluation, see
# http://pan.webis.de/clef13/pan13-web/plagiarism-detection.html
# evaluate_system is a placeholder that is not defined here.
evaluate_system(real, predict)
```

# Citation

If you use these results in your research, please cite our paper:

```bibtex
@article{10.1134/S0361768823040138,
  author = {Avetisyan, K. and Gritsay, G. and Grabovoy, A.},
  title = {Cross-Lingual Plagiarism Detection: Two Are Better Than One},
  year = {2023},
  issue_date = {Aug 2023},
  publisher = {Plenum Press},
  address = {USA},
  volume = {49},
  number = {4},
  issn = {0361-7688},
  url = {https://doi.org/10.1134/S0361768823040138},
  doi = {10.1134/S0361768823040138},
  journal = {Program. Comput. Softw.},
  month = aug,
  pages = {346–354},
  numpages = {9},
  keywords = {cross-lingual plagiarism detection, cross-lingual plagiarism detection benchmark, under-resourced languages, sequential merger approach}
}
```
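The usage snippets call `make_index(collection)` and `index.search(text=..., lang=...)` without defining them. As a rough illustration of the "classical BM25" option they mention, the sketch below builds a toy in-memory Okapi BM25 index. It is not the authors' implementation: the `BM25Index` class, the `id` field name, the `top_k` parameter, and the toy documents are all assumptions made for this example, and the real column names should be checked against the loaded dataset.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercased runs of word characters; \w is Unicode-aware in Python 3.
    return re.findall(r"\w+", text.lower())


class BM25Index:
    """Toy Okapi BM25 index (hypothetical helper, not part of the dataset)."""

    def __init__(self, docs, id_key="id", k1=1.5, b=0.75):
        # id_key is an assumed field name for the document identifier.
        self.doc_ids = [doc[id_key] for doc in docs]
        self.k1, self.b = k1, b
        self.term_freqs = [Counter(tokenize(doc["text"])) for doc in docs]
        self.doc_lens = [sum(tf.values()) for tf in self.term_freqs]
        total = sum(self.doc_lens)
        self.avg_len = total / len(docs) if total else 1.0
        # Document frequency: number of documents containing each term.
        doc_freq = Counter()
        for tf in self.term_freqs:
            doc_freq.update(tf.keys())
        n = len(docs)
        self.idf = {
            term: math.log(1 + (n - freq + 0.5) / (freq + 0.5))
            for term, freq in doc_freq.items()
        }

    def search(self, text, lang=None, top_k=5):
        # lang is accepted only to mirror the call in the usage snippet;
        # this lexical sketch ignores it.
        query_terms = set(tokenize(text))
        scored = []
        for doc_id, tf, length in zip(self.doc_ids, self.term_freqs, self.doc_lens):
            norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
            score = 0.0
            for term in query_terms:
                freq = tf.get(term, 0)
                if freq:
                    score += self.idf[term] * freq * (self.k1 + 1) / (freq + norm)
            if score > 0:
                scored.append((doc_id, score))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


def make_index(collection):
    return BM25Index(collection)


# Toy documents standing in for ds['collection'].to_list().
toy_collection = [
    {"id": "d1", "lang": "en", "text": "Plagiarism detection compares a suspicious document with sources."},
    {"id": "d2", "lang": "es", "text": "La detección de plagio compara documentos en varios idiomas."},
    {"id": "d3", "lang": "en", "text": "Wikipedia articles cover many unrelated topics."},
]
index = make_index(toy_collection)
results = index.search(text="plagiarism detection of a document", lang="en")
print([doc_id for doc_id, _ in results])  # ['d1']
```

A purely lexical index like this one only matches tokens shared between query and document, so the Spanish document above is not retrieved for the English query. Cross-lingual retrieval would need an extra step, such as translating queries or using multilingual embeddings.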

Provider:

AntiplagiatCompany
