almanach/topxgen-llama-4-scout-and-llama-4-scout
Source: Hugging Face. Updated 2025-09-30. Indexed 2025-10-25.
Download link:
https://hf-mirror.com/datasets/almanach/topxgen-llama-4-scout-and-llama-4-scout
Description:
TopXGen is a synthetic parallel dataset for 10 low-resource languages, created by applying the TopXGen pipeline with recent multilingual LLMs. It is designed for machine translation (MT) fine-tuning and few-shot experiments. The pipeline has three stages: an LLM generates topic-diverse paragraphs, redundant generations are removed with a filter similar to the self-instruct approach, and the remaining paragraphs are split into sentences and translated or back-translated with an MT model. Models trained on this TopXGen dataset achieve translation performance close to that of the generator and the back-translator.
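The redundancy-removal step can be illustrated with a minimal sketch. It assumes the filter works as in Self-Instruct, where a new generation is kept only if its ROUGE-L similarity to every previously kept item is below 0.7. The function names, the whitespace tokenization, and the 0.7 threshold are assumptions made for illustration; the listing does not state the exact metric or threshold TopXGen uses.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over lowercased whitespace tokens (0.0 if either is empty)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def filter_redundant(paragraphs, threshold=0.7):
    """Keep a paragraph only if it is dissimilar to every paragraph kept so far.

    threshold=0.7 is the Self-Instruct value, assumed here for illustration.
    """
    kept = []
    for p in paragraphs:
        if all(rouge_l_f1(p, k) < threshold for k in kept):
            kept.append(p)
    return kept


generated = [
    "the market opens early every morning",
    "the market opens early every single morning",  # near-duplicate of the first
    "rain fell on the hills all night",
]
unique = filter_redundant(generated)
```

In this toy run the second paragraph is dropped because it differs from the first by a single word, while the third is kept. The comparison is quadratic in the number of kept paragraphs, so a real pipeline would likely need batching or an approximate similarity index.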
Provided by:
almanach



