mbasoz/xllora-datasets

Name: mbasoz/xllora-datasets
Creator: mbasoz
Published: 2026-04-13 13:09:42
License: no description provided

Hugging Face · Updated 2026-04-13 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/mbasoz/xllora-datasets


Description:

---
pretty_name: XL-LoRA Multilingual Triplet Dataset
license: cc-by-4.0
task_categories:
- sentence-similarity
- text-retrieval
language:
- afr
- af
- hin
- hi
- tel
- te
- mar
- mr
- ind
- id
- hau
- ha
- kor
- ko
configs:
- config_name: afr
  data_files: xllora-afrikaans.csv
- config_name: hin
  data_files: xllora-hindi.csv
- config_name: tel
  data_files: xllora-telugu.csv
- config_name: mar
  data_files: xllora-marathi.csv
- config_name: ind
  data_files: xllora-indonesian.csv
- config_name: hau
  data_files: xllora-hausa.csv
- config_name: kor
  data_files: xllora-korean.csv
---

# XL-LoRA Multilingual Triplet Dataset

This dataset contains multilingual sentence triplets generated using the **XL-LoRA** method described in the paper:

**[Bootstrapping Embeddings for Low Resource Languages](https://arxiv.org/abs/2603.01732)**

Each subset corresponds to a language and can be loaded using its **ISO 639-3 language code**.

## Dataset Structure

All subsets share the same column schema:

| Column | Description |
|------|-------------|
| `sent0` | Anchor sentence in the target language |
| `sent1` | Positive sentence in English |
| `hard_neg` | Hard negative sentence in English |

The dataset is designed for **contrastive training of multilingual sentence embeddings**. Anchor sentences are written in the **target language**, while both positive and hard negative sentences are in **English**.
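
To illustrate how a triplet of this shape is typically used in contrastive training, here is a minimal sketch of a softmax-style loss over one anchor, one positive, and one hard negative. It is a generic formulation, not necessarily the objective used in the XL-LoRA paper. The function names, the temperature value, and the toy vectors are all assumptions made for illustration; in practice the vectors would come from a sentence encoder applied to `sent0`, `sent1`, and `hard_neg`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_contrastive_loss(anchor, positive, hard_negative, temperature=0.05):
    """Cross-entropy over {positive, hard negative} for a single triplet.

    The loss is small when the anchor is more similar to the positive
    than to the hard negative, and large in the opposite case.
    The temperature of 0.05 is an illustrative choice, not a value
    taken from the dataset card.
    """
    pos = cosine(anchor, positive) / temperature
    neg = cosine(anchor, hard_negative) / temperature
    # Numerically stable log-sum-exp over the two candidates
    m = max(pos, neg)
    log_denominator = m + math.log(math.exp(pos - m) + math.exp(neg - m))
    return log_denominator - pos


# Made-up embeddings standing in for encoder outputs (hypothetical values)
sent0_vec = [0.9, 0.1, 0.0]      # anchor, target language
sent1_vec = [0.8, 0.2, 0.1]      # positive, English
hard_neg_vec = [0.1, 0.9, 0.2]   # hard negative, English

good = triplet_contrastive_loss(sent0_vec, sent1_vec, hard_neg_vec)
bad = triplet_contrastive_loss(sent0_vec, hard_neg_vec, sent1_vec)
print(good < bad)
```

Swapping the positive and the hard negative raises the loss, which is the signal that pushes the target-language anchor toward its English positive and away from the English hard negative.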

## Available Subsets

| Language | ISO 639-3 Code | File |
|---------|---------------|------|
| Afrikaans | `afr` | `xllora-afrikaans.csv` |
| Hindi | `hin` | `xllora-hindi.csv` |
| Telugu | `tel` | `xllora-telugu.csv` |
| Marathi | `mar` | `xllora-marathi.csv` |
| Indonesian | `ind` | `xllora-indonesian.csv` |
| Hausa | `hau` | `xllora-hausa.csv` |
| Korean | `kor` | `xllora-korean.csv` |

## Usage

Load a specific language subset:

```python
from datasets import load_dataset

dataset = load_dataset("mbasoz/xllora-datasets", "afr")
print(dataset["train"][0])
```

## Related Resources

- **Paper:** [Bootstrapping Embeddings for Low Resource Languages](https://arxiv.org/abs/2603.01732)
- **Code:** https://github.com/mbasoz/xllora-embedding

## Citation

If you use this dataset, please cite:

```
@article{basoz2026bootstrappingembeddings,
  title={Bootstrapping Embeddings for Low Resource Languages},
  author={Merve Basoz and Andrew Horne and Mattia Opper},
  year={2026},
  eprint={2603.01732},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.01732},
  note={Accepted to the LoResLM Workshop at EACL 2026}
}
```

Provider:

mbasoz
