
MFRocket/MFRPC

Source: Hugging Face. Updated: 2022-03-30. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/MFRocket/MFRPC
Resource description:
---
task_categories:
- conditional-text-generation
- paraphrase
- gpt-3
- crowdsourced
---

# MF Rocket Paraphrase Corpus (MFRPC) - A State of the Art Paraphrasing Solution

## Dataset Description

MF Rocket Paraphrase Corpus (MFRPC) is a corpus of 10,000 sentence pairs. Each pair contains a source sentence and a paraphrased version of it. The source sentences were written manually and are intended to represent typical sentences found in online articles. They cover general topics and are not restricted to a specific domain. The paraphrased sentences were created partly with GPT-3 and partly by hand. In this way, we hope to investigate the performance of GPT-3 in a typical real-world setting and to improve the quality of the paraphrased sentences through manual corrections.

By fine-tuning a Pegasus model on this data, we create a paraphraser that performs very well. In a blind test, its results were indistinguishable from human-paraphrased sentences.

We are currently working on a dataset with complete paragraphs or articles.

For more information, use the contact form at https://mf-rocket.de.

### Languages

The BCP-47 code for the dataset's language is en.

## Dataset Structure

### Data Instances

A sample from this dataset looks as follows:

```json
[
  {
    "text": "To overcome these difficulties, you must select an activity or goal that you are enthusiastic about [...]",
    "target": "To overcome these challenges, you need to find an activity or goal that you are passionate about and[...]"
  },
  {
    "text": "If you are unsure about what to do next, seek advice from a close friend or family member you can tr[...]",
    "target": "If you are feeling lost, ask a trusted friend or family member for their opinion about what you shou[...]"
  }
]
```

### Dataset Fields

The dataset has the following fields (also called "features"):

```json
{
  "text": "Value(dtype='string', id=None)",
  "target": "Value(dtype='string', id=None)"
}
```

### Dataset Splits

This dataset is split into a train and a validation split. The split sizes are as follows:

| Split name | Num samples |
| ---------- | ----------- |
| train      | 8000        |
| valid      | 2000        |
Provider:
MFRocket
Summary of original information

MF Rocket Paraphrase Corpus (MFRPC)

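Dataset description

MF Rocket Paraphrase Corpus (MFRPC) is a corpus of 10,000 sentence pairs. Each pair contains a source sentence and a paraphrased version of it. The source sentences were written manually and are intended to represent typical sentences found in online articles; they cover general topics and are not restricted to a specific domain. The paraphrased sentences were created partly with GPT-3 and partly by hand. In this way, we hope to investigate the performance of GPT-3 in a typical real-world setting and to improve the quality of the paraphrased sentences through manual corrections.

By fine-tuning a Pegasus model on this data, we created a paraphraser that performs very well. In a blind test, its results were hard to distinguish from human-paraphrased sentences.

We are currently working on a dataset with complete paragraphs or articles.

Languages

The BCP-47 code for the dataset's language is en.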

Dataset structure

Data instances

A sample from this dataset looks as follows:

    [
      {
        "text": "To overcome these difficulties, you must select an activity or goal that you are enthusiastic about [...]",
        "target": "To overcome these challenges, you need to find an activity or goal that you are passionate about and[...]"
      },
      {
        "text": "If you are unsure about what to do next, seek advice from a close friend or family member you can tr[...]",
        "target": "If you are feeling lost, ask a trusted friend or family member for their opinion about what you shou[...]"
      }
    ]

Data fields

The dataset has the following fields (also called "features"):

    {
      "text": "Value(dtype='string', id=None)",
      "target": "Value(dtype='string', id=None)"
    }
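
The two-field schema is simple enough to check mechanically. The following is a minimal sketch, not part of the original dataset card, that parses records shaped like the sample and reports any record that lacks string-valued `text` and `target` fields. The helper name `validate_pairs` and the shortened example sentences are hypothetical and chosen only for illustration.

```python
import json

REQUIRED_FIELDS = ("text", "target")  # field names taken from the dataset card


def validate_pairs(records):
    """Return the indices of records that do not match the two-string schema."""
    bad = []
    for i, rec in enumerate(records):
        ok = (
            isinstance(rec, dict)
            and set(rec) == set(REQUIRED_FIELDS)
            and all(isinstance(rec[k], str) for k in REQUIRED_FIELDS)
        )
        if not ok:
            bad.append(i)
    return bad


# Hypothetical records in the same shape as the sample (sentences shortened).
raw = """
[
  {"text": "To overcome these difficulties, pick a goal.",
   "target": "To overcome these challenges, find a goal."},
  {"text": "Seek advice from a close friend."}
]
"""
records = json.loads(raw)
invalid = validate_pairs(records)  # the second record has no "target" field
print(invalid)
```

A check like this could be run once after download, before the pairs are passed to any fine-tuning code, so that malformed rows are caught early.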

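Dataset splits

The dataset is split into a train and a validation split. The split sizes are as follows:

| Split name | Num samples |
| ---------- | ----------- |
| train      | 8000        |
| valid      | 2000        |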
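
The card does not say how the 10,000 pairs were divided, so the following is only a sketch of one common way to arrive at an 8,000/2,000 split: a seeded shuffle followed by a cut at 80%. The function name `split_pairs`, the seed, and the placeholder records are assumptions made for illustration, not the authors' procedure.

```python
import random


def split_pairs(pairs, train_fraction=0.8, seed=0):
    """Shuffle a copy of `pairs` deterministically and cut it into train/valid."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return {"train": shuffled[:cut], "valid": shuffled[cut:]}


# Placeholder records standing in for the 10,000 real sentence pairs.
pairs = [{"text": f"source {i}", "target": f"paraphrase {i}"} for i in range(10_000)]
splits = split_pairs(pairs)
print({name: len(rows) for name, rows in splits.items()})
```

Fixing the seed keeps the split reproducible across runs, which matters when validation scores from different fine-tuning runs are compared.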