
lamarr-org/Lima-X

Source: Hugging Face. Updated 2024-10-02. Indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/lamarr-org/Lima-X
Description:
---
configs:
- config_name: DE_FR_IT_ES
  data_files:
  - path: train_limax_DE_FR_IT_ES.jsonl
    split: train
  - path: val_limax_DE_FR_IT_ES.jsonl
    split: val
- config_name: DE
  data_files:
  - path: train_limax_DE.jsonl
    split: train
  - path: val_limax_DE.jsonl
    split: val
- config_name: IT
  data_files:
  - path: train_limax_IT.jsonl
    split: train
  - path: val_limax_IT.jsonl
    split: val
- config_name: EN
  data_files:
  - path: train_limax_EN.jsonl
    split: train
  - path: val_limax_EN.jsonl
    split: val
- config_name: FR
  data_files:
  - path: train_limax_FR.jsonl
    split: train
  - path: val_limax_FR.jsonl
    split: val
- config_name: EN_DE_FR_IT_ES_sampled
  data_files:
  - path: train_limax_EN_DE_FR_IT_ES_sampled.jsonl
    split: train
  - path: val_limax_EN_DE_FR_IT_ES_sampled.jsonl
    split: val
- config_name: EN_DE_FR_IT_ES
  data_files:
  - path: train_limax_EN_DE_FR_IT_ES.jsonl
    split: train
  - path: val_limax_EN_DE_FR_IT_ES.jsonl
    split: val
- config_name: DE_FR_IT_ES_sampled
  data_files:
  - path: train_limax_DE_FR_IT_ES_sampled.jsonl
    split: train
  - path: val_limax_DE_FR_IT_ES_sampled.jsonl
    split: val
- config_name: ES
  data_files:
  - path: train_limax_ES.jsonl
    split: train
  - path: val_limax_ES.jsonl
    split: val
license: other
language:
- en
- de
- fr
- es
- it
size_categories:
- 1K<n<10K
---

# Lima-X

The Lima-X dataset is an extension of the original [LIMA](https://huggingface.co/datasets/GAIR/lima) dataset, consisting of 1,030 carefully curated samples. Lima-X focuses on Indo-European languages, including English, German, French, Spanish, and Italian.

For details about the creation of Lima-X, check out our paper [Investigating Multilingual Instruction-Tuning: Do Polyglot Models Demand for Multilingual Instructions?](https://arxiv.org/abs/2402.13703)

## Usage

We provide train and val splits for each language composition under separate dataset configuration names:

* DE_FR_IT_ES
* DE_FR_IT_ES_sampled
* EN_DE_FR_IT_ES
* EN_DE_FR_IT_ES_sampled
* EN
* DE
* FR
* IT
* ES

Load both splits for a given dataset configuration name, e.g.:

```python
from datasets import load_dataset

load_dataset(path="lamarr-org/Lima-X", name="DE_FR_IT_ES")
```

## License

If the source data of Lima-X has a stricter license than CC BY-NC-SA, the Lima-X dataset follows that license. Otherwise, it follows the CC BY-NC-SA license. In this, we follow the license of [LIMA](https://huggingface.co/datasets/GAIR/lima).

## Citation

```
@misc{weber2024investigatingmultilingualinstructiontuningpolyglot,
  title={Investigating Multilingual Instruction-Tuning: Do Polyglot Models Demand for Multilingual Instructions?},
  author={Alexander Arno Weber and Klaudia Thellmann and Jan Ebert and Nicolas Flores-Herr and Jens Lehmann and Michael Fromm and Mehdi Ali},
  year={2024},
  eprint={2402.13703},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2402.13703},
}
```
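
As an illustration of how the configuration names in the Usage section relate to the files listed in the YAML header, the following is a minimal sketch rather than part of the official card. The helper names `data_files_for`, `read_jsonl`, and `load_limax` are hypothetical. The file-name pattern is inferred from the header, and the record schema inside each JSONL file is not assumed.

```python
import json
from pathlib import Path

# Configuration names as listed in the dataset card's YAML header.
CONFIGS = (
    "DE_FR_IT_ES", "DE_FR_IT_ES_sampled",
    "EN_DE_FR_IT_ES", "EN_DE_FR_IT_ES_sampled",
    "EN", "DE", "FR", "IT", "ES",
)


def data_files_for(config: str) -> dict[str, str]:
    """Return the split -> file name mapping for one configuration.

    The pattern `<split>_limax_<config>.jsonl` is inferred from the header.
    """
    if config not in CONFIGS:
        raise ValueError(f"unknown configuration: {config!r}")
    return {split: f"{split}_limax_{config}.jsonl" for split in ("train", "val")}


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a local file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def load_limax(config: str):
    """Load both splits from the Hub (needs `datasets` and network access)."""
    # Imported lazily so the helpers above work without the package installed.
    from datasets import load_dataset

    data_files_for(config)  # validates the name before any download starts
    return load_dataset(path="lamarr-org/Lima-X", name=config)
```

Assuming the call succeeds, `load_limax("DE")` should return a mapping keyed by the split names `train` and `val`, so `load_limax("DE")["train"]` selects the training split. `read_jsonl` is only useful if the JSONL files have been downloaded locally.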
Provider:
lamarr-org