proxectonos/cpt_instruction_datasets

Name: proxectonos/cpt_instruction_datasets
Creator: proxectonos
Published: 2026-04-21 16:07:09
License: no description provided

Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/proxectonos/cpt_instruction_datasets


Description:

---
pretty_name: CPT Instruction Datasets
language:
- gl
- pt
- en
- es
license:
- cc-by-sa-4.0
- cc-by-4.0
- apache-2.0
- mit
- cc-by-nc-sa-4.0
task_categories:
- text-generation
- text-classification
- token-classification
- question-answering
- summarization
- translation
tags:
- instruction-tuning
- multilingual
- galician
- portuguese
- spanish
- english
- synthetic-data
- nlp
size_categories:
- 100K<n<1M
configs:
- config_name: en_natural_instructions_named_entity_recognition
  data_files:
  - split: train
    path: en_natural_instructions_named-entity-recognition.jsonl
- config_name: en_OpenAssistant_chat
  data_files:
  - split: train
    path: en_OpenAssistant_chat.jsonl
- config_name: en_QASC_QA_multiple_choice
  data_files:
  - split: train
    path: en_QASC_QA_multiple-choice.jsonl
- config_name: es_alexsis_simplificacion_linguistica
  data_files:
  - split: train
    path: es_alexsis_simplificacion_linguistica.jsonl
- config_name: es_COAR_sentiment_analysis
  data_files:
  - split: train
    path: es_COAR_sentiment_analysis.jsonl
- config_name: es_corpus_coah_sentiment_analysis
  data_files:
  - split: train
    path: es_corpus_coah_sentiment_analysis.jsonl
- config_name: gl_codigocero_summarization_w_source
  data_files:
  - split: train
    path: gl_codigocero_summarization_w_source.jsonl
- config_name: gl_EGU_definicions
  data_files:
  - split: train
    path: gl_EGU_definicions.jsonl
- config_name: gl_Galcola_correccion
  data_files:
  - split: train
    path: gl_Galcola_correccion.jsonl
- config_name: gl_SLI_NERC_named_entity_recognition
  data_files:
  - split: train
    path: gl_SLI_NERC_named-entity-recognition.jsonl
- config_name: gl_SLI_PoS_analysis
  data_files:
  - split: train
    path: gl_SLI_PoS_analysis.jsonl
- config_name: gl_wikipedia_multiple_choice_qa
  data_files:
  - split: train
    path: gl_wikipedia_multiple-choice_qa.jsonl
- config_name: gl_wikipedia_summarization_w_source
  data_files:
  - split: train
    path: gl_wikipedia_summarization_w_source.jsonl
- config_name: gl_en_mt_traducion
  data_files:
  - split: train
    path: gl-en_mt_traducion.jsonl
- config_name: gl_es_mt_traducion
  data_files:
  - split: train
    path: gl-es_mt_traducion.jsonl
- config_name: pt_aya_dataset_train
  data_files:
  - split: train
    path: pt_aya_dataset_train.jsonl
- config_name: pt_wikipedia_multiple_choice_qa
  data_files:
  - split: train
    path: pt_wikipedia_multiple-choice_qa.jsonl
- config_name: pt_BR_OpenAssistant_chat
  data_files:
  - split: train
    path: pt-BR_OpenAssistant_chat.jsonl
- config_name: pt_pt_instruct_boolq
  data_files:
  - split: train
    path: pt-pt_instruct_boolq.jsonl
- config_name: pt_pt_instruct_cb
  data_files:
  - split: train
    path: pt-pt_instruct_cb.jsonl
- config_name: pt_pt_instruct_multirc_1
  data_files:
  - split: train
    path: pt-pt_instruct_multirc_1.jsonl
- config_name: pt_pt_instruct_multirc_2
  data_files:
  - split: train
    path: pt-pt_instruct_multirc_2.jsonl
- config_name: pt_pt_instruct_multirc_3
  data_files:
  - split: train
    path: pt-pt_instruct_multirc_3.jsonl
- config_name: pt_pt_instruct_stsb
  data_files:
  - split: train
    path: pt-pt_instruct_stsb.jsonl
- config_name: pt_pt_instruct_wnli
  data_files:
  - split: train
    path: pt-pt_instruct_wnli.jsonl
---

# Instruction datasets

Collection of synthetic instruction datasets used during the continued pretraining of Model-small-instr-1, Model-small-instr-2 and Model-small-instr-3. You can currently find these models under: [Llama-3.1-Carballo-Instr1](proxectonos/Llama-3.1-Carballo-Instr1) and [Llama-3.1-Carballo-Instr3](proxectonos/Llama-3.1-Carballo-Instr3).

## Dataset creation

Datasets were created using two different techniques:

- Adapting existing datasets or corpora by changing their format so that they can be used as instructions during continued pretraining.
- Creating the dataset from scratch by using an LLM, [Salamandra-7B](https://huggingface.co/BSC-LT/salamandra-7b-instruct), to generate diverse instructions based on existing data.

## Dataset information

The following table presents the basic information of each dataset.

| Language | Dataset Name | Type | Entries | Creation Method | File Size (MB) | License |
|----------|--------------|------|---------|-----------------|----------------|---------|
| GL | EGU (Enciclopedia Galega Universal) | Encyclopedic Knowledge | 47,396 | Manually Adapted | 33.00 | CC BY-SA 4.0 |
| GL | MT (GL - ES) | Translations | 275,292 | Manually Adapted | 101.00 | CC BY-SA 4.0 |
| GL | MT (GL - EN) | Translations | 421,974 | Manually Adapted | 101.00 | CC BY-SA 4.0 |
| GL | SLI NER | Named Entity Recognition | 8,138 | Manually Adapted | 2.40 | CC BY 4.0 |
| GL | GalCoLA | Orthographic Correction | 8,160 | Manually Adapted | 1.90 | CC BY-SA 4.0 |
| GL | SLI PoS TAGGING | Morphological Analysis | 46,864 | Manually Adapted | 45.15 | CC BY 4.0 |
| GL | Wikipedia Multiple-Choice QA | QA Multiple-choice | 1,486 | LLM-Generated | 1.60 | CC BY-SA 4.0 |
| GL | CódigoCero Summarization | Summarization | 342 | LLM-Generated | 1.60 | CC BY-SA 4.0 |
| PT | Wikipedia Multiple-Choice QA | QA Multiple-choice | 547 | LLM-Generated | 0.59 | CC BY-SA 4.0 |
| PT | Extraglue-Instruct (Boolean Questions) | QA Simple | 28,281 | Manually Adapted | 25.00 | MIT License |
| PT | Extraglue-Instruct (CB) | NLI (CommitmentBank) | 1,500 | Manually Adapted | 1.20 | MIT License |
| PT | Extraglue-Instruct (MultiRC) | Reading Comprehension | 108,972 | Manually Adapted | 221.00 | MIT License |
| PT | Extraglue-Instruct (STSB) | Text Similarity | 22,996 | Manually Adapted | 18.00 | MIT License |
| PT | Extraglue-Instruct (WNLI) | NLI (Inference) | 3,810 | Manually Adapted | 2.40 | MIT License |
| PT | Aya (Train) | QA Simple | 8,997 | Manually Adapted | 3.00 | Apache License 2.0 |
| PT | OpenAssistant | Chat / Assistant | 287 | Manually Adapted | 1.90 | Apache License 2.0 |
| EN | Natural Instructions - NER | Named Entity Recognition | 1,574 | Manually Adapted | 1.20 | Apache License 2.0 |
| EN | QASC | QA Multiple-choice | 9,980 | Manually Adapted | 8.90 | CC BY 4.0 |
| EN | OpenAssistant | Chat / Assistant | 154 | Manually Adapted | 1.20 | Apache License 2.0 |
| ES | ALEXSIS | Linguistic Simplification | 3,918 | Manually Adapted | 2.20 | CC BY-NC-SA 4.0 |
| ES | COAH | Sentiment Analysis | 1,816 | Manually Adapted | 1.60 | CC BY-NC-SA 4.0 |
| ES | COAR | Sentiment Analysis | 2,202 | Manually Adapted | 1.30 | CC BY-NC-SA 4.0 |

## Usage

Each dataset is exposed as a separate subset/configuration. Example:

```python
from datasets import load_dataset

# The second argument is the configuration name; every configuration has a single "train" split.
ds = load_dataset("proxectonos/cpt_instruction_datasets", "gl_Galcola_correccion", split="train")
print(ds[0])
```

## Disclaimer and limitations

Some datasets are not included because their licenses do not permit public sharing or redistribution.

The datasets included in this repository are released under different licenses: CC BY-SA 4.0, CC BY 4.0, CC BY-NC-SA 4.0, Apache License 2.0, and MIT License. Users must comply with the license of each dataset they use. If a dataset requires attribution, restricts use to non-commercial purposes, or imposes other conditions, those conditions must be respected.

For more details on specific license terms, please refer to the official documentation or the original data sources.

## Acknowledgements

These datasets were developed and compiled within the Nós Project, funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU (NextGenerationEU) within the framework of the [ILENIA project](https://proyectoilenia.es/) with reference 2022/TL22/00215336.

## Cite

These datasets were created and curated for this paper: **Continued Pretraining and Interpretability-Based Evaluation for Low-Resource Languages: A Galician Case Study**. Please cite us if you use any of our resources.

```bibtex
@inproceedings{rodriguez-etal-2025-continued,
    title = "Continued Pretraining and Interpretability-Based Evaluation for Low-Resource Languages: A {G}alician Case Study",
    author = "Rodr{\'i}guez, Pablo and Su{\'a}rez, Silvia Paniagua and Gamallo, Pablo and Docio, Susana Sotelo",
    editor = "Che, Wanxiang and Nabende, Joyce and Shutova, Ekaterina and Pilehvar, Mohammad Taher",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-acl.240/",
    doi = "10.18653/v1/2025.findings-acl.240",
    pages = "4622--4637",
    ISBN = "979-8-89176-256-5",
    abstract = "Recent advances in Large Language Models (LLMs) have led to remarkable improvements in language understanding and text generation. However, challenges remain in enhancing their performance for underrepresented languages, ensuring continual learning without catastrophic forgetting, and developing robust evaluation methodologies. This work addresses these issues by investigating the impact of Continued Pretraining (CPT) on multilingual models and proposing a comprehensive evaluation framework for LLMs, focusing on the case of Galician language. Our first contribution explores CPT strategies for languages with limited representation in multilingual models. We analyze how CPT with Galician corpora improves text generation while assessing the trade-offs between linguistic enrichment and task-solving capabilities. Our findings show that CPT with small, high-quality corpora and diverse instructions enhances both task performance and linguistic quality. Our second contribution is a structured evaluation framework based on distinguishing task-based and language-based assessments, leveraging existing and newly developed benchmarks for Galician. Additionally, we contribute new Galician LLMs, datasets for evaluation and instructions, and an evaluation framework."
}
```

## Citations (Existing datasets)

```bibtex
@incollection{Molina-Gonzalez2014,
  author = {M. D. Molina-González and E. Martínez-Cámara and M. T. Martín-Valdivia and L. A. Ureña-López},
  title = {Cross-domain sentiment analysis using Spanish opinionated words},
  booktitle = {Natural Language Processing and Information Systems},
  volume = {8455},
  pages = {214--219},
  publisher = {Springer International Publishing},
  year = {2014},
  doi = {10.1007/978-3-319-07983-7_28},
}

@inproceedings{naturalinstructions,
  title={Cross-task generalization via natural language crowdsourcing instructions},
  author={Mishra, Swaroop and Khashabi, Daniel and Baral, Chitta and Hajishirzi, Hannaneh},
  booktitle={ACL},
  year={2022}
}

@inproceedings{supernaturalinstructions,
  title={Super-NaturalInstructions:Generalization via Declarative Instructions on 1600+ Tasks},
  author={Wang, Yizhong and Mishra, Swaroop and Alipoormolabashi, Pegah and Kordi, Yeganeh and Mirzaei, Amirreza and Arunkumar, Anjana and Ashok, Arjun and Dhanasekaran, Arut Selvan and Naik, Atharva and Stap, David and others},
  booktitle={EMNLP},
  year={2022}
}

@article{allenai:qasc,
  author = {Tushar Khot and Peter Clark and Michal Guerquin and Peter Jansen and Ashish Sabharwal},
  title = {QASC: A Dataset for Question Answering via Sentence Composition},
  journal = {arXiv:1910.11473v2},
  year = {2020},
}

@misc{köpf2023openassistantconversationsdemocratizing,
  title={OpenAssistant Conversations -- Democratizing Large Language Model Alignment},
  author={Andreas Köpf and Yannic Kilcher and Dimitri von Rütte and Sotiris Anagnostidis and Zhi-Rui Tam and Keith Stevens and Abdullah Barhoum and Nguyen Minh Duc and Oliver Stanley and Richárd Nagyfi and Shahul ES and Sameer Suri and David Glushkov and Arnav Dantuluri and Andrew Maguire and Christoph Schuhmann and Huu Nguyen and Alexander Mattick},
  year={2023},
  eprint={2304.07327},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2304.07327},
}

@inproceedings{agerri-etal-2018-developing,
  title = "Developing New Linguistic Resources and Tools for the {G}alician Language",
  author = "Agerri, Rodrigo and G{\'o}mez Guinovart, Xavier and Rigau, German and Solla Portela, Miguel Anxo",
  editor = "Calzolari, Nicoletta and Choukri, Khalid and Cieri, Christopher and Declerck, Thierry and Goggi, Sara and Hasida, Koiti and Isahara, Hitoshi and Maegaard, Bente and Mariani, Joseph and Mazo, H{\'e}l{\`e}ne and Moreno, Asuncion and Odijk, Jan and Piperidis, Stelios and Tokunaga, Takenobu",
  booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2018)",
  month = may,
  year = "2018",
  address = "Miyazaki, Japan",
  publisher = "European Language Resources Association (ELRA)",
  url = "https://aclanthology.org/L18-1367/"
}

@inproceedings{ferres-saggion@LREC2022,
  title = "ALEXSIS: A Dataset for Lexical Simplification in Spanish.",
  author = "Ferrés, Daniel and Saggion, Horacio",
  booktitle = {Proceedings of the Language Resources and Evaluation Conference},
  month = {June},
  year = {2022},
  address = {Marseille, France},
  publisher = {European Language Resources Association},
  pages = {3582--3594},
  url = {https://aclanthology.org/2022.lrec-1.383}
}

@misc{gervasio,
  title={Advancing Generative AI for Portuguese with Open Decoder Gervásio PT-*},
  author={Rodrigo Santos and João Silva and Luís Gomes and João Rodrigues and António Branco},
  year={2024},
  eprint={2402.18766},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

@misc{singh2024aya,
  title={Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning},
  author={Shivalika Singh and Freddie Vargus and Daniel Dsouza and Börje F. Karlsson and Abinaya Mahendiran and Wei-Yin Ko and Herumb Shandilya and Jay Patel and Deividas Mataciunas and Laura OMahony and Mike Zhang and Ramith Hettiarachchi and Joseph Wilson and Marina Machado and Luisa Souza Moura and Dominik Krzemiński and Hakimeh Fadaei and Irem Ergün and Ifeoma Okoh and Aisha Alaagib and Oshan Mudannayake and Zaid Alyafeai and Vu Minh Chien and Sebastian Ruder and Surya Guthikonda and Emad A. Alghamdi and Sebastian Gehrmann and Niklas Muennighoff and Max Bartolo and Julia Kreutzer and Ahmet Üstün and Marzieh Fadaee and Sara Hooker},
  year={2024},
  eprint={2402.06619},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

@dataset{gamallo_pablo_2023_7671278,
  author = {Gamallo Pablo and Garcia Marcos and de-Dios-Flores Iria and Ramom Pichel Campos José and Sandra Rodríguez Rey and Bardanca Daniel},
  title = {Nos\_ES-GL\_aut},
  month = mar,
  year = 2023,
  publisher = {Zenodo},
  version = {1.0.0},
  doi = {10.5281/zenodo.7671278},
  url = {https://doi.org/10.5281/zenodo.7671278},
}
```
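
## Deriving configuration names from file names

In the card's YAML header, each configuration name looks like the data file name with the `.jsonl` extension dropped and every hyphen replaced by an underscore (for example, `gl-en_mt_traducion.jsonl` is exposed as `gl_en_mt_traducion`). The sketch below encodes that observed pattern. The helper names `config_name_for` and `load_config` are hypothetical and not part of the repository, and the rule is inferred from the listed configurations rather than documented by the authors, so it may not hold for files added later.

```python
from pathlib import PurePosixPath

REPO_ID = "proxectonos/cpt_instruction_datasets"


def config_name_for(data_file: str) -> str:
    """Derive a configuration name from a data file name.

    Assumption inferred from the card's YAML header: the configuration name
    is the file stem with every hyphen replaced by an underscore.
    """
    return PurePosixPath(data_file).stem.replace("-", "_")


def load_config(data_file: str, split: str = "train"):
    """Load the configuration that corresponds to one data file.

    Needs the `datasets` package and network access, so the import is done
    lazily and this function is not called at import time.
    """
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name_for(data_file), split=split)


# Offline check of the naming rule against file names listed in the card.
for name in (
    "gl-en_mt_traducion.jsonl",
    "pt-BR_OpenAssistant_chat.jsonl",
    "en_QASC_QA_multiple-choice.jsonl",
):
    print(name, "->", config_name_for(name))
```

With `datasets` installed, `load_config("gl-es_mt_traducion.jsonl")` would request the `gl_es_mt_traducion` configuration. The record schema of each configuration is not described in the card, so inspect the first row before combining configurations.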
