airesearch/wangchanx-seed-free-synthetic-instruct-thai-120k

Name: airesearch/wangchanx-seed-free-synthetic-instruct-thai-120k
Creator: airesearch
Published: 2024-10-03 08:13:27
License: MIT

Hugging Face, updated 2024-10-03, indexed 2025-04-12

Download link:

https://hf-mirror.com/datasets/airesearch/wangchanx-seed-free-synthetic-instruct-thai-120k

Resource description:

---
language:
- th
- en
license: mit
size_categories:
- 100K<n<1M
task_categories:
- text-generation
- question-answering
- summarization
pretty_name: Seed-Free Synthetic Instruct Thai 120k
tags:
- synthetic-data
- instruction-tuning
- thai-language
- multi-task
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: type
    dtype: string
  - name: context_length
    dtype: int64
  - name: rating
    dtype: float64
  - name: qc_rationale
    dtype: string
  splits:
  - name: train
    num_bytes: 636862793
    num_examples: 118898
  download_size: 238861070
  dataset_size: 636862793
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Dataset Card for WangchanX Seed-Free Synthetic Instruct Thai 120k

## Dataset Summary

This dataset contains about 120k synthetic instruction-following samples in Thai, generated using a novel seed-free approach. It covers a wide range of domains derived from Wikipedia, including both general knowledge and Thai-specific cultural topics. The dataset is designed for instruction-tuning Thai language models to improve their ability to understand and generate Thai text in various contexts and task types.

## Dataset Details

### Overview

- **Size:** ~120k records
- **Generation Technique:** Seed-Free Synthetic Data Generation Framework (ACL SRW 2024)
- **Generation Model:** [Qwen2-72B Instruct](https://huggingface.co/Qwen/Qwen2-72B-Instruct)
- **Scoring Model:** [Llama 3 70B Instruct](https://huggingface.co/meta-llama/Llama-3-70b-instruct)

### Data Fields

- `instruction`: The task or question posed in Thai
- `context`: Additional context or information provided for the task (if applicable)
- `output`: The model-generated response in Thai
- `type`: The category of the task (e.g., conversation, multiple_choice, etc.)
- `context_length`: The length of the context provided
- `rating`: Quality rating assigned by the scoring model for quality control
- `qc_rationale`: The scorer model's output explaining the quality rating

### Task Types Distribution

The dataset covers five main task types:

1. Conversation: Single-turn simulated dialogues and chat-like interactions (24,865 samples)
2. Multiple Choice: Questions with several answer options (23,975 samples)
3. Brainstorming: Open-ended idea generation tasks (23,844 samples)
4. Question Answering: Factual and analytical questions requiring specific answers (23,366 samples)
5. Summarization: Tasks involving condensing longer texts into concise summaries (22,848 samples)

### Dataset Description

- **Curated by:** AIResearch
- **Language(s):** Thai (primary), English (secondary)
- **Domains:** General knowledge, Thai culture, history, current events, and more

### Dataset Sources

- **Repository:** [https://github.com/parinzee/seed-free-synthetic-instruct](https://github.com/parinzee/seed-free-synthetic-instruct)
- **Paper:** [Seed-Free Synthetic Data Generation Framework for Instruction-Tuning LLMs: A Case Study in Thai](https://aclanthology.org/2024.acl-srw.38) (ACL Student Research Workshop 2024)

## Uses

This dataset is primarily intended for:

1. Fine-tuning or instruction-tuning pre-trained language models to improve Thai language understanding and generation.
2. Enhancing the conversational abilities of AI models in Thai.
3. Improving model performance on Thai-specific tasks and cultural contexts.
4. Benchmarking language model performance on various Thai language tasks, including conversation, multiple-choice questions, brainstorming, question answering, and summarization.

## Considerations for Using the Data

### Social Impact of Dataset

- This dataset aims to improve Thai language AI capabilities across various task types, potentially leading to better language technologies for Thai speakers.
- It may help preserve and promote Thai culture through improved AI understanding of Thai-specific contexts.

### Discussion of Biases

- The dataset may reflect biases present in the source material (Wikipedia) and in the generation model.
- Users should be aware of potential gender, cultural, or historical biases in the generated content.
- The distribution of task types may influence model performance on different types of language tasks.

### Other Known Limitations

- As a synthetic dataset, it may not perfectly reflect natural human language use or real-world knowledge accuracy.
- The quality of instructions and responses is dependent on the capabilities of the generating model.
- The dataset's effectiveness may vary across different task types.

## Additional Information

### Licensing Information

This dataset is released under the MIT License.

### Citation Information

If you use this dataset in your research, please cite our paper:

```
@inproceedings{pengpun-etal-2024-seed,
    title = "Seed-Free Synthetic Data Generation Framework for Instruction-Tuning {LLM}s: A Case Study in {T}hai",
    author = "Pengpun, Parinthapat and Udomcharoenchaikit, Can and Buaphet, Weerayut and Limkonchotiwat, Peerat",
    editor = "Fu, Xiyan and Fleisig, Eve",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-srw.38",
    pages = "438--457",
    abstract = "We present a synthetic data approach for instruction-tuning large language models (LLMs) for low-resource languages in a data-efficient manner, specifically focusing on Thai. We identify three key properties that contribute to the effectiveness of instruction-tuning datasets: fluency, diversity, and cultural context. We propose a seed-data-free framework for generating synthetic instruction-tuning data that incorporates these essential properties. Our framework employs an LLM to generate diverse topics, retrieve relevant contexts from Wikipedia, and create instructions for various tasks, such as question answering, summarization, and conversation. The experimental results show that our best-performing synthetic dataset, which incorporates all three key properties, achieves competitive performance using only 5,000 instructions when compared to state-of-the-art Thai LLMs trained on hundreds of thousands of instructions. Our code and dataset are publicly available at https://github.com/parinzee/seed-free-synthetic-instruct.",
}
```
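The task-type counts and the `rating` field described in the card can be sanity-checked with a short Python sketch. It is illustrative only, not part of the original card. The counts and the split size are copied from the card; the helper names (`task_shares`, `filter_by_rating`, `load_train_split`) and the rating threshold are assumptions, and the card does not document the rating scale. Loading the data assumes the Hugging Face `datasets` package is installed and that network access is available.

```python
from typing import Dict, Iterable, List, Mapping

# Task-type counts exactly as listed in the dataset card.
CARD_COUNTS: Dict[str, int] = {
    "Conversation": 24_865,
    "Multiple Choice": 23_975,
    "Brainstorming": 23_844,
    "Question Answering": 23_366,
    "Summarization": 22_848,
}

# num_examples of the train split, from the card's YAML metadata.
CARD_TOTAL = 118_898


def task_shares(counts: Mapping[str, int]) -> Dict[str, float]:
    """Return each task type's fraction of the total sample count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def filter_by_rating(records: Iterable[Mapping], min_rating: float) -> List[Mapping]:
    """Keep records whose `rating` is present and at least `min_rating`.

    The threshold is hypothetical: the card does not state the rating scale,
    so inspect the column before choosing a cut-off.
    """
    return [
        r for r in records
        if r.get("rating") is not None and r["rating"] >= min_rating
    ]


def load_train_split():
    """Load the train split from the Hugging Face Hub (needs `datasets` + network)."""
    from datasets import load_dataset  # imported lazily: optional dependency

    return load_dataset(
        "airesearch/wangchanx-seed-free-synthetic-instruct-thai-120k",
        split="train",
    )


if __name__ == "__main__":
    # The five per-task counts should add up to the split size.
    print("sum of task counts:", sum(CARD_COUNTS.values()), "card total:", CARD_TOTAL)
    for name, share in task_shares(CARD_COUNTS).items():
        print(f"{name}: {share:.1%}")
```

The five per-task counts sum to the 118,898 examples reported in the metadata, so the "~120k" figure is a rounding of that number. A filter such as `filter_by_rating(load_train_split(), min_rating=...)` would be one way to keep only higher-scored samples once the rating scale has been inspected.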

Provided by:

airesearch
