A dataset of Tibetan-Chinese cross-language text plagiarism detection

Name: A dataset of Tibetan-Chinese cross-language text plagiarism detection
Creator: www.doi.org
License: No description available

Indexed from www.doi.org on 2025-03-24

Download link:

https://www.doi.org/10.11922/sciencedb.j00001.00355

Description:

To address practical problems in information processing for China's minority languages, this paper builds a Tibetan-Chinese cross-language text plagiarism detection corpus of 150,000 sentence pairs, based on the SemEval 2014 English evaluation corpus and a data augmentation method. The corpus is intended to remedy the lack of data for Tibetan-Chinese cross-language text plagiarism detection and provides a foundation for that task. It can also be used for Tibetan-Chinese semantic computing and other natural language processing tasks. In addition, the data augmentation method used during construction offers other low-resource languages a way to address corpus scarcity in natural language processing tasks.

Provided by:

www.doi.org
