shijli/wmt16-roen
Source: Hugging Face. Updated: 2023-09-14. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/shijli/wmt16-roen
Description:
# WMT 2016 Romanian-English Translation Dataset
The original dataset can be downloaded from [here](https://github.com/nyu-dl/dl4mt-nonauto).
You can create this dataset by running:
```commandline
git clone https://huggingface.co/datasets/shijli/wmt16-roen
cd wmt16-roen/data
bash prepare-wmt16.sh
```
`binarized.dist.ro-en.zip` and `binarized.dist.en-ro.zip` are distilled datasets generated by a transformer base model.
Each of them can be built by running:
```commandline
bash prepare-wmt16-distill.sh /path/to/fairseq/model source-lang target-lang
```
To build a distilled dataset, you first need to create `binarized.zip`. Note that the distilled dataset uses only
model-generated target sentences, so each translation direction yields a different dataset. You therefore need to
specify `source-lang` and `target-lang` explicitly, and replace `/path/to/fairseq/model` with the path to
your pretrained model.
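
Because each direction needs its own run, the two invocations can be sketched as below. This is a minimal illustration, not part of the dataset's scripts. It assumes the language codes are `ro` and `en` (inferred from the archive names), that `binarized.zip` sits in the current directory, and that the helper names `check_prereq` and `distill_cmd` are hypothetical. The sketch only prints the commands so you can review them before running anything.

```shell
# Sketch only: prints the distillation command for each direction.
# Assumptions: language codes `ro`/`en` and the location of binarized.zip
# are inferred, not confirmed by the dataset card.

# Succeeds if the prerequisite archive exists (default: ./binarized.zip).
check_prereq() {
    [ -f "${1:-binarized.zip}" ]
}

# Prints the command for one direction.
# $1 = path to the pretrained fairseq model, $2 = source-lang, $3 = target-lang
distill_cmd() {
    printf 'bash prepare-wmt16-distill.sh %s %s %s\n' "$1" "$2" "$3"
}

# Placeholder model paths; replace each with your own pretrained model.
RO_EN_MODEL="/path/to/fairseq/model"
EN_RO_MODEL="/path/to/fairseq/model"

if check_prereq; then
    distill_cmd "$RO_EN_MODEL" ro en
    distill_cmd "$EN_RO_MODEL" en ro
else
    echo "binarized.zip not found; run prepare-wmt16.sh first" >&2
fi
```

Once the printed commands look right, run them from the directory that contains `prepare-wmt16-distill.sh`. Model paths containing spaces would need quoting, which this sketch does not handle.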
Provider:
shijli
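Summary of the original information
WMT 2016 Romanian-English Translation Dataset
Dataset creation
- Clone and prepare the dataset by running `git clone https://huggingface.co/datasets/shijli/wmt16-roen`, then `cd wmt16-roen/data`, then `bash prepare-wmt16.sh`.
Dataset type
`binarized.dist.ro-en.zip` and `binarized.dist.en-ro.zip` are distilled datasets generated by a transformer base model.
Distilled dataset creation
- Create a distilled dataset by running `bash prepare-wmt16-distill.sh /path/to/fairseq/model source-lang target-lang`.
- `binarized.zip` must be created first.
- The distilled dataset uses only model-generated target sentences, so different translation directions produce different datasets.
- `source-lang` and `target-lang` must be specified explicitly.
- `/path/to/fairseq/model` must be replaced with the path to the pretrained model.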



