ParaZh-22M Chinese Text Paraphrase Dataset

Qianyan datasets, indexed 2024-05-15

Download link:

https://www.luge.ai/#/luge/dataDetail?id=63

Resource overview:

We machine-translated the English side of the WMT21 Chinese-English translation data into Chinese and paired each Chinese translation with its original Chinese sentence to form a paraphrase pair, giving 22M (22 million) pairs in total. Line i of `tgt.para` and line i of `out.para` together form one paraphrase pair. The data has been segmented with BPE (32k merge operations), and the two files `zh.vcb.bpe` and `zh.cds` are provided for BPE processing. The dataset can be used to train paraphrase generation models, which can in turn support other NLP tasks through data augmentation, for example low-resource translation and the CLUE long-text and short-text classification tasks. We filtered the original WMT21 sentences when generating the paraphrase set, keeping only 22M of the 30M sentences. To avoid confusion when using the set, we provide both the generated paraphrases (`out.para`) and the corresponding original Chinese sentences (`tgt.para`). The WMT21 dataset can be downloaded from https://www.statmt.org/wmt21/translation-task.html.
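The line-by-line pairing of `tgt.para` and `out.para` can be sketched in Python as follows. This is a minimal illustration, not the authors' tooling: the helper names `read_pairs` and `undo_bpe` are hypothetical, the files are assumed to be UTF-8 text, and the `@@` continuation marker is an assumption borrowed from the common subword-nmt convention, since the listing does not say which marker the BPE files use.

```python
import re
from itertools import zip_longest
from typing import Iterator, Tuple


def read_pairs(original_path: str, paraphrase_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (original, paraphrase) pairs from two line-aligned files.

    Assumes UTF-8 text with one sentence per line, where line i of both
    files belongs to the same paraphrase pair.
    """
    with open(original_path, encoding="utf-8") as f_orig, \
         open(paraphrase_path, encoding="utf-8") as f_para:
        for line_no, (orig, para) in enumerate(zip_longest(f_orig, f_para), start=1):
            # zip_longest pads the shorter file with None, which exposes misalignment.
            if orig is None or para is None:
                raise ValueError(f"files differ in length at line {line_no}")
            yield orig.rstrip("\n"), para.rstrip("\n")


def undo_bpe(line: str) -> str:
    """Join subword pieces back together.

    ASSUMPTION: pieces are marked with a trailing '@@' (subword-nmt style).
    Check the actual files before relying on this.
    """
    return re.sub(r"@@ |@@ ?$", "", line)
```

For example, `for original, paraphrase in read_pairs("tgt.para", "out.para"): ...` streams the pairs without loading all 22M lines into memory, and raising on a length mismatch guards against silently misaligned pairs.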

Provided by:

Zhengzhou University; Tianjin University; Peng Cheng Laboratory
