German Text Similarity Detection Corpus
Source: arXiv · Updated: 2017-03-11 · Indexed: 2024-06-21
Download link:
http://simtex.talne.eu
Overview:
The German Text Similarity Detection Corpus was created by Avignon University and the Informatics Laboratory of the Vaucluse Region to evaluate text similarity detection algorithms. It contains 15 documents divided into three subsets (Basic, Complex, and Control) that assess different levels of text similarity. The text pairs were produced by modifying the original documents in several ways, including lexical replacement and structural adjustment, so that the pairs vary in complexity. The corpus is intended for natural language processing (NLP) tasks that depend on measuring text similarity, such as plagiarism detection and document clustering.
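To make the evaluation setting concrete, the following is a minimal sketch of a baseline similarity measure (word-level Jaccard overlap) of the kind that could be scored against such text pairs. It is not the method used by the corpus authors; the function names and the German example sentences are invented for illustration and are not taken from the corpus.

```python
import re


def tokens(text: str) -> set:
    """Return the set of lowercase word tokens in `text`.

    In Python 3, \\w matches Unicode letters, so umlauts and ß are kept.
    """
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)


# Hypothetical sentences, NOT from the corpus.
original = "Der Hund läuft schnell durch den Park"
lexical = "Der Hund rennt schnell durch den Park"  # one word replaced
unrelated = "Die Börse schloss heute im Minus"     # control-style text

print(jaccard(original, lexical))    # shares 6 of 8 distinct words
print(jaccard(original, unrelated))  # shares no words
```

A single lexical replacement lowers the score only slightly, while an unrelated text scores zero. Heavier rewrites, such as structural adjustment combined with synonym substitution, are where a purely lexical baseline like this one would be expected to struggle.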
Provider:
Avignon University and the Informatics Laboratory of the Vaucluse Region
Created:
2017-03-11



