Finnish Paraphrase Corpus

arXiv | Updated 2021-03-24 | Indexed 2024-07-25

Download link:

https://github.com/TurkuNLP/Turku-paraphrase-corpus

Resource overview:

The Finnish Paraphrase Corpus, created by the Turku NLP Group, is the first fully manually annotated paraphrase corpus for Finnish. It contains 53,572 paraphrase pairs, sourced mainly from alternative subtitles of films and television programs and from news headlines. Of these pairs, 98% are paraphrases at least in their given context, if not in all contexts. Paraphrase candidates were selected entirely by hand to avoid a bias toward short, lexically overlapping candidates. The dataset is intended mainly for paraphrase detection and generation in natural language processing (NLP), along with applications such as question answering, plagiarism detection, and machine translation, and it aims to improve the availability of high-quality paraphrase data for Finnish.

Provided by:

Turku NLP Group, Department of Computing, Faculty of Technology, University of Turku, Finland

Created:

2021-03-24

Summary of original information

Turku-paraphrase-corpus

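Dataset overview

Content: over 100,000 manually annotated paraphrase pairs, drawn from alternative subtitles, news headlines, news articles, discussion forum messages, student translations, and essays.
Features: paraphrase pairs are provided together with their document context.
Languages: mainly Finnish, with a small Swedish test set.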

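File structure

Main files: train,dev,test.json (the train, dev, and test splits), containing the manually annotated corpus data.
Auxiliary file: opus-parsebank-sample-annotated.tsv, a manually annotated sample of sentence pairs selected from OPUS and the Turku Internet Parsebank.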

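File format

Format: JSON, containing a list of data items.
Data item keys:
- txt1 and txt2: the paraphrases extracted from the text.
- rewrites: a list of rewrite pairs produced during annotation.
- label: the main label and any additional flags.
- fold: the data is split into 100 parts that respect document boundaries.
- goeswith: the identifier of the document from which the paraphrase was extracted.
- context: the position of the paraphrase in the original document.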
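The keys above can be read with the Python standard library alone. The following is a minimal sketch, assuming each corpus file is a JSON list of objects carrying those keys. The helper names (parse_items, group_by_document, select_folds) are hypothetical, the sample data is made up, and the value types of fold and goeswith are assumptions that this page does not confirm.

```python
import json
from collections import defaultdict


def parse_items(text):
    """Parse the contents of one corpus file (assumed: a JSON list of dicts)."""
    items = json.loads(text)
    if not isinstance(items, list):
        raise ValueError("expected a JSON list of data items")
    return items


def group_by_document(items):
    """Group items by 'goeswith', the identifier of their source document."""
    groups = defaultdict(list)
    for item in items:
        # Assumption: 'goeswith' is hashable (e.g. a string) or absent.
        groups[item.get("goeswith")].append(item)
    return dict(groups)


def select_folds(items, folds):
    """Keep the items whose 'fold' value is one of the requested folds."""
    wanted = set(folds)
    return [item for item in items if item.get("fold") in wanted]


if __name__ == "__main__":
    # Made-up sample in the described shape; not real corpus data.
    sample = json.dumps([
        {"txt1": "A", "txt2": "B", "label": "4", "fold": 0, "goeswith": "doc-1"},
        {"txt1": "C", "txt2": "D", "label": "4", "fold": 1, "goeswith": "doc-1"},
        {"txt1": "E", "txt2": "F", "label": "3", "fold": 1, "goeswith": "doc-2"},
    ])
    items = parse_items(sample)
    print(len(group_by_document(items)))  # number of distinct documents
    print(len(select_folds(items, [1])))  # number of items in fold 1
```

To read a real file, pass its text to parse_items, for example parse_items(open(path, encoding="utf-8").read()), where path is wherever the downloaded file lives. Because folds are said to respect document boundaries, selecting by fold should keep pairs from the same document together.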

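Flags

i: minor traceable difference.
s: style difference.
<: txt1 is more general than txt2; txt2 is more specific than txt1.
>: txt2 is more general than txt1; txt1 is more specific than txt2.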
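Since the label key holds a main label plus optional flags, downstream code usually needs to separate the two. The sketch below shows one way to do that, assuming the flags are single characters appended to the main label and that the main label itself contains none of the four flag characters. The example strings such as "4<i" and the names split_label and describe_flags are hypothetical, so check them against the actual data before relying on this.

```python
# Flag characters and their meanings, as listed above.
FLAG_MEANINGS = {
    "i": "minor traceable difference",
    "s": "style difference",
    "<": "txt1 is more general than txt2",
    ">": "txt2 is more general than txt1",
}


def split_label(label):
    """Split a label string into (main_label, flags).

    Assumption: flags are single characters appended to the main label
    (for example "4<i"), and the main label contains no flag characters.
    """
    main = "".join(ch for ch in label if ch not in FLAG_MEANINGS)
    flags = [ch for ch in label if ch in FLAG_MEANINGS]
    return main, flags


def describe_flags(label):
    """Return the human-readable meaning of each flag in a label."""
    _, flags = split_label(label)
    return [FLAG_MEANINGS[f] for f in flags]
```

Under these assumptions, split_label("4<i") returns ("4", ["<", "i"]), and a label with no flags returns an empty flag list.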
