shijli/wmt16-roen

Name: shijli/wmt16-roen
Creator: shijli
Published: 2023-09-14 07:14:22
License: No description available

Source: Hugging Face. Updated: 2023-09-14. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/shijli/wmt16-roen


Description:

# WMT 2016 Romanian-English Translation Dataset

The original dataset can be downloaded from [here](https://github.com/nyu-dl/dl4mt-nonauto).

You can create this dataset by simply running:

```commandline
git clone https://huggingface.co/datasets/shijli/wmt16-roen
cd wmt16-roen/data
bash prepare-wmt16.sh
```

`binarized.dist.ro-en.zip` and `binarized.dist.en-ro.zip` are distilled datasets generated by a transformer base model. They can be built by running:

```commandline
bash prepare-wmt16-distill.sh /path/to/fairseq/model source-lang target-lang
```

To build a distilled dataset, you need to create `binarized.zip` first. Note that the distilled dataset only uses model-generated target sentences, which means that different translation directions result in different datasets. Therefore, you need to specify `source-lang` and `target-lang` explicitly. You also need to replace `/path/to/fairseq/model` with the path of your pretrained model.


Provider:

shijli

Summary of Original Information

WMT 2016 Romanian-English Translation Dataset

Dataset Creation

Clone and prepare the dataset with the following commands:

    git clone https://huggingface.co/datasets/shijli/wmt16-roen
    cd wmt16-roen/data
    bash prepare-wmt16.sh

Dataset Types

binarized.dist.ro-en.zip and binarized.dist.en-ro.zip are distilled datasets generated by a transformer base model.

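Distilled Dataset Creation

Create a distilled dataset with the following command:

    bash prepare-wmt16-distill.sh /path/to/fairseq/model source-lang target-lang

binarized.zip must be created first.
The distilled dataset only uses model-generated target sentences, so different translation directions produce different datasets.
source-lang and target-lang must be specified explicitly.
/path/to/fairseq/model must be replaced with the path of your pretrained model.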
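
The preconditions for a distillation run can be checked before the script is called. The following is a minimal sketch, not part of the repository: the function name `check_distill_inputs` is hypothetical, and it assumes that `binarized.zip` sits in the directory passed as the first argument. It only validates the inputs; it does not run the distillation itself.

```shell
# Hypothetical helper (not part of the repository): validates the
# preconditions for a distillation run.
# Usage: check_distill_inputs DATA_DIR SOURCE_LANG TARGET_LANG
check_distill_inputs() {
    local data_dir="${1:-}" src="${2:-}" tgt="${3:-}"

    # Both language arguments must be given explicitly.
    if [ -z "$src" ] || [ -z "$tgt" ]; then
        echo "source-lang and target-lang must both be specified" >&2
        return 1
    fi

    # Each translation direction yields a different distilled dataset,
    # so source and target must be two different languages.
    if [ "$src" = "$tgt" ]; then
        echo "source-lang and target-lang must differ" >&2
        return 1
    fi

    # binarized.zip has to exist before distillation can start.
    # Assumption: it is located directly inside DATA_DIR.
    if [ ! -f "$data_dir/binarized.zip" ]; then
        echo "binarized.zip not found in $data_dir; run prepare-wmt16.sh first" >&2
        return 1
    fi

    return 0
}
```

A possible use, assuming the language codes are `ro` and `en` (inferred from the archive names, not confirmed by the source): `check_distill_inputs . ro en && bash prepare-wmt16-distill.sh /path/to/fairseq/model ro en`. Running it a second time with the codes swapped, and a model trained for the opposite direction, would give the other distilled dataset.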
