
ICTNLP/XBridge

Hugging Face · Updated 2026-04-20 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/ICTNLP/XBridge
Description:
---
license: apache-2.0
language:
- af
- ar
- az
- bn
- cs
- de
- el
- en
- es
- et
- fa
- fi
- fr
- gl
- gu
- he
- hi
- hr
- id
- it
- ja
- ka
- kk
- km
- lt
- lv
- mk
- ml
- mn
- mr
- my
- ne
- nl
- pl
- ps
- pt
- ro
- ru
- sl
- sv
- sw
- ta
- te
- th
- tr
- uk
- ur
- vi
- xh
- zh
task_categories:
- question-answering
size_categories:
- 1M<n<10M
---

# 💡Data Description

Official data repository for our **ACL 2026 Main Conference** paper "*Language on Demand, Knowledge at Core*: Composing LLMs with Encoder-Decoder Translation Models for Extensible Multilinguality".

## ✨Trilingual Translation Data

`translation_10langs_y2en2x_3.6M.json` contains the trilingual translation data used for Stage 1 (cross-model alignment).

- Source: extracted from OPUS-100
- Augmentation: translated using `NLLB-200-3.3B`
- Format: *x-en-y* trilingual triples
- Size: 50K per *x-y* translation direction, 72 directions

It includes the following 10 languages:

> Bn, De, En, Es, Fr, Ja, Ru, Sw, Th, Zh

## ✨Instruction-following Data

`alpaca-dolly-50langs-2.5M.json` contains multilingual instruction-following data used for Stage 2 (encoder-side adaptation) and Stage 3 (decoder-side adaptation).

- Source: constructed from `Bactrian-X`
- Filtering: removes off-target samples
- Augmentation: responses are expanded into English-centric bilingual outputs using `NLLB-200-3.3B`
- Size: 50K per language, 50 languages

Compared to Stage 1, this dataset scales to 50 languages, leveraging the language-agnostic alignment learned in Stage 1. Additional languages include:

> Af, Ar, Az, Cs, El, Et, Fa, Fi, Gl, Gu, He, Hi, Hr, Id, It, Ka, Kk, Km, Lt, Lv, Mk, Ml, Mn, Mr, My, Ne, Nl, Pl, Ps, Pt, Ro, Sl, Sv, Ta, Te, Tr, Uk, Ur, Vi, Xh

---

See our [paper](https://arxiv.org/abs/2603.17512) for more details, and try our Gradio demo in the [GitHub repository](https://github.com/ictnlp/XBridge)!
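The sizes quoted in the two file names can be cross-checked with a little arithmetic. The sketch below is only an illustration and rests on an assumption the card does not state outright: that the 72 directions are the ordered pairs of distinct non-English languages among the 10 Stage 1 languages, with English serving as the pivot in each *x-en-y* triple. The variable names are made up for this example.

```python
from itertools import permutations

# Stage 1 languages as listed in the dataset card (lowercased codes).
STAGE1_LANGS = ["bn", "de", "en", "es", "fr", "ja", "ru", "sw", "th", "zh"]
PIVOT = "en"  # assumption: English is always the middle element of x-en-y
SAMPLES_PER_DIRECTION = 50_000

# Assumption: x and y are distinct non-English languages, and x->y is
# counted separately from y->x (ordered pairs).
non_pivot = [lang for lang in STAGE1_LANGS if lang != PIVOT]
directions = list(permutations(non_pivot, 2))
stage1_total = len(directions) * SAMPLES_PER_DIRECTION

# Stage 2/3 instruction data: 50K samples for each of the 50 languages.
NUM_INSTRUCTION_LANGS = 50
SAMPLES_PER_LANGUAGE = 50_000
instruction_total = NUM_INSTRUCTION_LANGS * SAMPLES_PER_LANGUAGE

print(f"directions: {len(directions)}")            # 9 * 8 = 72
print(f"stage 1 triples: {stage1_total:,}")        # 3,600,000
print(f"instruction samples: {instruction_total:,}")  # 2,500,000
```

Under that assumption the counts line up with the `3.6M` and `2.5M` suffixes in the file names. Because the instruction data is filtered for off-target samples, the actual per-language counts may differ from the nominal 50K.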
# 📚Citation

If you find this model or our work useful, please cite:

```tex
@misc{bu2026languagedemandknowledgecore,
      title={Language on Demand, Knowledge at Core: Composing LLMs with Encoder-Decoder Translation Models for Extensible Multilinguality},
      author={Mengyu Bu and Yang Feng},
      year={2026},
      eprint={2603.17512},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.17512},
}
```

# 📮Contact

For questions, please contact: `bumengyu23z@ict.ac.cn`
Provider:
ICTNLP