
mbasoz/xllora-datasets

Source: Hugging Face. Updated 2026-04-13; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/mbasoz/xllora-datasets
Description:
---
pretty_name: XL-LoRA Multilingual Triplet Dataset
license: cc-by-4.0
task_categories:
- sentence-similarity
- text-retrieval
language:
- afr
- af
- hin
- hi
- tel
- te
- mar
- mr
- ind
- id
- hau
- ha
- kor
- ko
configs:
- config_name: afr
  data_files: xllora-afrikaans.csv
- config_name: hin
  data_files: xllora-hindi.csv
- config_name: tel
  data_files: xllora-telugu.csv
- config_name: mar
  data_files: xllora-marathi.csv
- config_name: ind
  data_files: xllora-indonesian.csv
- config_name: hau
  data_files: xllora-hausa.csv
- config_name: kor
  data_files: xllora-korean.csv
---

# XL-LoRA Multilingual Triplet Dataset

This dataset contains multilingual sentence triplets generated using the **XL-LoRA** method described in the paper:

**[Bootstrapping Embeddings for Low Resource Languages](https://arxiv.org/abs/2603.01732)**

Each subset corresponds to a language and can be loaded using its **ISO 639-3 language code**.

## Dataset Structure

All subsets share the same column schema:

| Column | Description |
|------|-------------|
| `sent0` | Anchor sentence in the target language |
| `sent1` | Positive sentence in English |
| `hard_neg` | Hard negative sentence in English |

The dataset is designed for **contrastive training of multilingual sentence embeddings**. Anchor sentences are written in the **target language**, while both positive and hard negative sentences are in **English**.
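
To make the role of the three columns concrete, the following is a minimal sketch of a contrastive objective over a single `(sent0, sent1, hard_neg)` triplet. It is not taken from the paper or its code: the function names, the temperature value of `0.05`, and the toy 2-dimensional vectors standing in for sentence embeddings are all assumptions made for illustration. A real training setup would obtain embeddings from an encoder and would typically also use other in-batch sentences as negatives.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, hard_neg, temperature=0.05):
    """Cross-entropy of picking the positive over the hard negative.

    anchor   -> embedding of `sent0` (target-language sentence)
    positive -> embedding of `sent1` (English positive)
    hard_neg -> embedding of `hard_neg` (English hard negative)
    The temperature default is an illustrative assumption.
    """
    pos = cosine(anchor, positive) / temperature
    neg = cosine(anchor, hard_neg) / temperature
    # Numerically stable log(exp(pos) + exp(neg)).
    m = max(pos, neg)
    log_denom = m + math.log(math.exp(pos - m) + math.exp(neg - m))
    return log_denom - pos


# Toy vectors (hypothetical, not real embeddings).
anchor_vec = [1.0, 0.0]
close_vec = [0.9, 0.1]
far_vec = [0.1, 0.9]

# The loss is small when the positive is closer to the anchor than the
# hard negative, and large when the two are swapped.
print(contrastive_loss(anchor_vec, close_vec, far_vec))
print(contrastive_loss(anchor_vec, far_vec, close_vec))
```

Minimising this quantity pulls the target-language anchor towards its English positive and pushes it away from the English hard negative, which is the behaviour the triplet schema is meant to support.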

## Available Subsets

| Language | ISO 639-3 Code | File |
|---------|---------------|------|
| Afrikaans | `afr` | `xllora-afrikaans.csv` |
| Hindi | `hin` | `xllora-hindi.csv` |
| Telugu | `tel` | `xllora-telugu.csv` |
| Marathi | `mar` | `xllora-marathi.csv` |
| Indonesian | `ind` | `xllora-indonesian.csv` |
| Hausa | `hau` | `xllora-hausa.csv` |
| Korean | `kor` | `xllora-korean.csv` |

## Usage

Load a specific language subset:

```python
from datasets import load_dataset

dataset = load_dataset("mbasoz/xllora-datasets", "afr")
print(dataset["train"][0])
```

## Related Resources

- **Paper:** [Bootstrapping Embeddings for Low Resource Languages](https://arxiv.org/abs/2603.01732)
- **Code:** https://github.com/mbasoz/xllora-embedding

## Citation

If you use this dataset, please cite:

```
@article{basoz2026bootstrappingembeddings,
  title={Bootstrapping Embeddings for Low Resource Languages},
  author={Merve Basoz and Andrew Horne and Mattia Opper},
  year={2026},
  eprint={2603.01732},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.01732},
  note={Accepted to the LoResLM Workshop at EACL 2026}
}
```
Provider:
mbasoz