merionum/ru_paraphraser

Name: merionum/ru_paraphraser
Creator: merionum
Published: 2022-07-28 15:01:08
License: MIT

Source: Hugging Face. Updated 2022-07-28; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/merionum/ru_paraphraser

Resource description:

---
annotations_creators:
- crowdsourced
- expert-generated
- machine-generated
language_creators:
- crowdsourced
language:
- ru
license:
- mit
multilinguality:
- monolingual
paperswithcode_id: null
pretty_name: ParaPhraser
size_categories:
- 1M<n<10M
source_datasets:
- original
task_categories:
- text-classification
- text-generation
- text2text-generation
- sentence-similarity
task_ids:
- semantic-similarity-scoring
---

# Dataset Card for ParaPhraser

### Dataset Summary

ParaPhraser is a news headlines corpus annotated according to the following schema:

```
 1: precise paraphrases
 0: near paraphrases
-1: non-paraphrases
```

The _Plus_ part is also available. It contains clusters of news headline paraphrases labeled automatically by a fine-tuned paraphrase detection BERT model. In order to load it:

```python
from datasets import load_dataset

corpus = load_dataset('merionum/ru_paraphraser', data_files='plus.jsonl')
```

## Dataset Structure

```
train: 7,227 pairs
test:  1,924 pairs
plus:  1,725,393 clusters (total: ~7m texts)
```

### Citation Information

```
@inproceedings{pivovarova2017paraphraser,
  title={ParaPhraser: Russian paraphrase corpus and shared task},
  author={Pivovarova, Lidia and Pronoza, Ekaterina and Yagunova, Elena and Pronoza, Anton},
  booktitle={Conference on artificial intelligence and natural language},
  pages={211--225},
  year={2017},
  organization={Springer}
}
```

```
@inproceedings{gudkov-etal-2020-automatically,
    title = "Automatically Ranked {R}ussian Paraphrase Corpus for Text Generation",
    author = "Gudkov, Vadim and Mitrofanova, Olga and Filippskikh, Elizaveta",
    booktitle = "Proceedings of the Fourth Workshop on Neural Generation and Translation",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.ngt-1.6",
    doi = "10.18653/v1/2020.ngt-1.6",
    pages = "54--59",
    abstract = "The article is focused on automatic development and ranking of a large corpus for Russian paraphrase generation which proves to be the first corpus of such type in Russian computational linguistics. Existing manually annotated paraphrase datasets for Russian are limited to small-sized ParaPhraser corpus and ParaPlag which are suitable for a set of NLP tasks, such as paraphrase and plagiarism detection, sentence similarity and relatedness estimation, etc. Due to size restrictions, these datasets can hardly be applied in end-to-end text generation solutions. Meanwhile, paraphrase generation requires a large amount of training data. In our study we propose a solution to the problem: we collect, rank and evaluate a new publicly available headline paraphrase corpus (ParaPhraser Plus), and then perform text generation experiments with manual evaluation on automatically ranked corpora using the Universal Transformer architecture.",
}
```

### Contributions

Dataset maintainer: Vadim Gudkov: [@merionum](https://github.com/merionum)

Provider:

merionum

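Summary of original information

Dataset overview

Dataset name

ParaPhraser

Dataset description

ParaPhraser is a news headline corpus annotated according to the following scheme:

1: precise paraphrases
0: near paraphrases
-1: non-paraphrases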
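
The three-way annotation scheme above can be turned into a small lookup for inspecting label distributions. This is a minimal sketch, not part of the dataset's own tooling: the column name `class` and the sample records are assumptions made for illustration, so check the actual field names after loading the corpus.

```python
from collections import Counter

# Label meanings taken from the annotation scheme above.
LABEL_NAMES = {
    1: "precise paraphrase",
    0: "near paraphrase",
    -1: "non-paraphrase",
}


def describe_label(label):
    """Return the readable name for a label given as an int or numeric string."""
    return LABEL_NAMES[int(label)]


def label_distribution(records, label_field="class"):
    """Count records per label name.

    `label_field` is an assumed column name, not confirmed by the dataset card.
    """
    return Counter(describe_label(r[label_field]) for r in records)


# Hypothetical placeholder records, not real corpus data.
sample = [{"class": "1"}, {"class": "0"}, {"class": "-1"}, {"class": "-1"}]
print(label_distribution(sample))
```

Accepting both integers and numeric strings is a defensive choice, since JSON-based corpora sometimes store class labels as text.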

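In addition, a part named "Plus" is provided. It contains clusters of news headline paraphrases labeled automatically by a fine-tuned paraphrase detection BERT model.

Dataset structure

Training set: 7,227 pairs
Test set: 1,924 pairs
Plus part: 1,725,393 clusters (about 7 million texts in total)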
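
To put the sizes above in proportion, the following sketch derives the test share of the manually annotated pairs and a rough average cluster size for the Plus part. The 7,000,000 figure is the approximate total reported above, so the average is only an estimate.

```python
# Sizes as reported in the dataset structure above.
TRAIN_PAIRS = 7_227
TEST_PAIRS = 1_924
PLUS_CLUSTERS = 1_725_393
PLUS_TEXTS_APPROX = 7_000_000  # "about 7 million" is an approximate total

total_pairs = TRAIN_PAIRS + TEST_PAIRS
test_share = TEST_PAIRS / total_pairs
avg_cluster_size = PLUS_TEXTS_APPROX / PLUS_CLUSTERS

print(f"annotated pairs: {total_pairs}")
print(f"test share: {test_share:.1%}")
print(f"approx. texts per Plus cluster: {avg_cluster_size:.2f}")
```

The Plus part is therefore several hundred times larger than the manually annotated splits, which is consistent with its stated purpose as training data for paraphrase generation.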

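Dataset tasks

Text classification
Text generation
Text-to-text generation
Sentence similarity
Semantic similarity scoring

Dataset language

Russian (ru)

Dataset license

MIT License

Dataset size

1M<n<10M

Dataset source

Original data

Dataset maintainer

Vadim Gudkov: @merionum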
