ICTNLP/XBridge

Name: ICTNLP/XBridge
Creator: ICTNLP
Published: 2026-04-20 07:05:09
License: apache-2.0

Source: Hugging Face. Updated: 2026-04-20. Indexed: 2026-04-26.

Download link:

https://hf-mirror.com/datasets/ICTNLP/XBridge

Resource description:

---
license: apache-2.0
language:
- af
- ar
- az
- bn
- cs
- de
- el
- en
- es
- et
- fa
- fi
- fr
- gl
- gu
- he
- hi
- hr
- id
- it
- ja
- ka
- kk
- km
- lt
- lv
- mk
- ml
- mn
- mr
- my
- ne
- nl
- pl
- ps
- pt
- ro
- ru
- sl
- sv
- sw
- ta
- te
- th
- tr
- uk
- ur
- vi
- xh
- zh
task_categories:
- question-answering
size_categories:
- 1M<n<10M
---

# 💡Data Description

Official data repository for our **ACL 2026 Main Conference** paper "*Language on Demand, Knowledge at Core*: Composing LLMs with Encoder-Decoder Translation Models for Extensible Multilinguality".

## ✨Trilingual Translation Data

`translation_10langs_y2en2x_3.6M.json` contains the trilingual translation data used for Stage 1 (cross-model alignment).

- Source: extracted from OPUS-100
- Augmentation: translated using `NLLB-200-3.3B`
- Format: *x-en-y* trilingual triples
- Size: 50K per *x-y* translation direction, 72 directions

It includes the following 10 languages:

> Bn, De, En, Es, Fr, Ja, Ru, Sw, Th, Zh

## ✨Instruction-following Data

`alpaca-dolly-50langs-2.5M.json` contains the multilingual instruction-following data used for Stage 2 (encoder-side adaptation) and Stage 3 (decoder-side adaptation).

- Source: constructed from `Bactrian-X`
- Filtering: removes off-target samples
- Augmentation: responses are expanded into English-centric bilingual outputs using `NLLB-200-3.3B`
- Size: 50K per language, 50 languages

Compared to Stage 1, this dataset scales to 50 languages, leveraging the language-agnostic alignment learned in Stage 1. The additional languages are:

> Af, Ar, Az, Cs, El, Et, Fa, Fi, Gl, Gu, He, Hi, Hr, Id, It, Ka, Kk, Km, Lt, Lv, Mk, Ml, Mn, Mr, My, Ne, Nl, Pl, Ps, Pt, Ro, Sl, Sv, Ta, Te, Tr, Uk, Ur, Vi, Xh

---

See our [paper](https://arxiv.org/abs/2603.17512) for more details, and try our Gradio demo in the [GitHub repository](https://github.com/ictnlp/XBridge)!
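
The sizes quoted above can be sanity-checked with a short Python sketch. It assumes that the 72 Stage 1 directions are the ordered pairs of distinct non-English languages, with English as the pivot of each *x-en-y* triple (9 × 8 = 72); this is our reading of the description, not something the card states explicitly. The helper names `stage1_directions` and `download_file` are hypothetical. `download_file` wraps `hf_hub_download` from the `huggingface_hub` package and is only defined here, not called, since it needs network access; the JSON schema of the files is not documented in this card, so the sketch does not try to parse them.

```python
from itertools import permutations

# Stage 1 languages as listed in the card (lowercased ISO codes).
STAGE1_LANGS = ["bn", "de", "en", "es", "fr", "ja", "ru", "sw", "th", "zh"]
SAMPLES_PER_UNIT = 50_000  # "50K" per direction (Stage 1) or per language (Stages 2-3)
NUM_INSTRUCTION_LANGS = 50


def stage1_directions(langs, pivot="en"):
    """Ordered (x, y) pairs of distinct non-pivot languages.

    Assumption: English is the pivot of every x-en-y triple and is
    therefore never the source or target of a counted direction.
    """
    non_pivot = [lang for lang in langs if lang != pivot]
    return list(permutations(non_pivot, 2))


def download_file(filename, repo_id="ICTNLP/XBridge"):
    """Fetch one data file from the Hub and return its local path.

    Requires the `huggingface_hub` package and network access.
    """
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename, repo_type="dataset")


directions = stage1_directions(STAGE1_LANGS)
stage1_total = len(directions) * SAMPLES_PER_UNIT
instruction_total = NUM_INSTRUCTION_LANGS * SAMPLES_PER_UNIT

print(len(directions), stage1_total, instruction_total)
# prints: 72 3600000 2500000
```

Under that assumption the totals agree with the `3.6M` and `2.5M` suffixes in the two file names. To fetch a file, one would call, for example, `download_file("translation_10langs_y2en2x_3.6M.json")`.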
# 📚Citation

If you find this model or our work useful, please cite:

```tex
@misc{bu2026languagedemandknowledgecore,
      title={Language on Demand, Knowledge at Core: Composing LLMs with Encoder-Decoder Translation Models for Extensible Multilinguality},
      author={Mengyu Bu and Yang Feng},
      year={2026},
      eprint={2603.17512},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.17512},
}
```

# 📮Contact

For questions, please contact: `bumengyu23z@ict.ac.cn`

Provider:

ICTNLP
