KoAlpaca-RealQA

Name: KoAlpaca-RealQA
Creator: maas
Published: 2025-11-12 16:17:43
License: No description available

ModelScope Community. Updated 2025-11-12; indexed 2024-11-16.

Download link:

https://modelscope.cn/datasets/AI-ModelScope/KoAlpaca-RealQA

Description:

# KoAlpaca-RealQA: A Korean Instruction Dataset Reflecting Real User Scenarios

## Dataset Summary

The **KoAlpaca-RealQA** dataset is a Korean instruction dataset designed to closely reflect real user interactions in Korean. Unlike conventional Korean instruction datasets that rely heavily on translated prompts, this dataset is composed of authentic Korean instructions derived from real-world use cases. Specifically, it was curated from user interactions with the **ChatKoAlpaca** service, which is based on the KoAlpaca model and was in service between 2023 and 2024.

The dataset gives a more accurate picture of typical Korean user behaviors, questions, and language structures, which makes it relevant for developing language models that must understand and respond to Korean speakers. By using GPT-4o to generate high-quality answers, KoAlpaca-RealQA aims to be a robust resource for training models that engage with Korean users in a natural and meaningful way.

## Key Features

- **Authentic Korean Instructions**: The dataset captures genuine Korean instructions and questions, unlike traditional datasets that are often machine-translated and may not reflect natural Korean usage or common inquiries.
- **User-Driven Scenarios**: Instructions were sourced from real user interactions with ChatKoAlpaca, so the data closely represents actual scenarios and requests made by Korean speakers.
- **GPT-4o Generated Responses**: Each instruction is paired with a response generated by GPT-4o, giving high-quality, contextually appropriate answers that enhance the dataset’s reliability.
- **Aligned with Contemporary Korean Linguistic Trends**: The dataset reflects modern Korean usage patterns, making it relevant for current applications and models.

## Dataset Structure

### Data Instances

Each instance in the dataset contains:

- **Instruction**: The user's original input in Korean, extracted from the ChatKoAlpaca interactions.
- **Response**: The answer generated by GPT-4o, tailored to the instruction and ensuring natural language alignment.

Example:

```json
{
  "custom_id": "request-8",
  "question": "타코와 브리또의 차이가 뭐야?",
  "answer": "타코와 브리또는 모두 멕시코 요리에서 유래된 음식이지만, 몇 가지 주요한 차이점이 있습니다. 1. **크기와 형태**: - **타코**: 일반적으로 작은 크기의 옥수수 또는 밀가루 토르티야를 사용하며, 포켓처럼 반으로 접어 재료를 담습니다. 타코는 한 손에 들어오는 크기로, 여러 개를 한 끼에 먹는 경우가 많습니다. - **브리또**: 상대적으로 큰 밀가루 토르티야를 사용하여 재료를 완전히 감싸는 형태로 만들어집니다. 원통형으로 말려 있으며, 한 개만으로도 충분한 식사가 될 수 있습니다. 2. **내용물**: - **타코**: 보통 육류(소고기, 닭고기, 돼지고기, 생선 등), 채소(양파, 고수), 살사, 치즈 등을 포함하며 다양한 소스를 곁들여 먹습니다. - **브리또**: 쌀, 콩, 육류, 채소, 치즈, 살사, 사워크림 등을 포함하며 더 풍성한 재료가 들어가기도 합니다. 3. **포장 방식**: - **타코**: 재료가 오픈된 상태로 토르티야 위에 놓입니다. - **브리또**: 재료를 토르티야로 완전히 감싸기 때문에 들고 먹기 간편합니다. 이 차이점들이 각각의 독특한 맛과 식사 경험을 제공합니다."
}
```

- Question: Extracted from the ChatKoAlpaca service.
- Answer: Generated with the OpenAI `gpt-4o-2024-08-06` model.

### Data Fields

- **instruction** (string): A real user input or question in Korean. In the example record above it appears under the `question` key.
- **response** (string): The model-generated answer, crafted by GPT-4o to align with the instruction and provide relevant, accurate information. In the example record above it appears under the `answer` key.

### Data Splits

The dataset may be split into training, validation, and test sets to support different stages of model development. Specific split ratios depend on the user’s requirements for balancing training and evaluation.

## Dataset Creation

### Source of Instructions

The instructions were collected from interactions with the ChatKoAlpaca service, where Korean users input various questions, requests, or instructions. This approach ensures that the instructions reflect common topics and phrasing typical of natural Korean conversations.

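As an illustration of the record layout and of the user-defined split described under Data Splits, the following is a minimal sketch, not the dataset authors' tooling. It assumes records carry the `custom_id`, `question`, and `answer` keys shown in the example record; the helper names `to_instruction_pair` and `split_dataset`, the 80/10/10 ratio, and the placeholder records are all hypothetical choices made for this example.

```python
import random


def to_instruction_pair(record):
    """Map a raw record (keys as in the example record) to instruction/response names."""
    return {
        "id": record["custom_id"],
        "instruction": record["question"],
        "response": record["answer"],
    }


def split_dataset(records, train=0.8, validation=0.1, seed=42):
    """Shuffle with a fixed seed, then cut into train/validation/test lists.

    The ratios are illustrative only; the dataset card leaves them to the user.
    """
    if not (0 < train < 1 and 0 <= validation < 1 and train + validation < 1):
        raise ValueError("train and validation must leave a non-empty test share")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # deterministic for a given seed
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }


# Placeholder records standing in for real rows (not actual dataset content).
raw = [
    {
        "custom_id": f"request-{i}",
        "question": f"placeholder question {i}",
        "answer": f"placeholder answer {i}",
    }
    for i in range(10)
]

pairs = [to_instruction_pair(r) for r in raw]
splits = split_dataset(pairs)
print({name: len(rows) for name, rows in splits.items()})
```

Fixing the shuffle seed keeps the split reproducible across runs, which matters when the held-out portion is later used for evaluation.
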
### Response Generation

Each response is generated by OpenAI’s GPT-4o (`gpt-4o-2024-08-06`).

### Motivation and Use Cases

The dataset was developed to address the limitations of translated instruction datasets, which often fail to capture the nuances of the Korean language and cultural context. By basing instructions on real user input and generating responses that adhere closely to Korean language norms, KoAlpaca-RealQA enables the development of more effective, culturally relevant Korean language models.

Potential use cases include:

- **Training conversational AI systems**: Developers can use KoAlpaca-RealQA to fine-tune models for Korean-speaking audiences, improving naturalness and relevance in responses.
- **Evaluation of Korean language comprehension**: The dataset provides a benchmark for assessing the ability of AI models to understand and respond to Korean instructions accurately.
- **Linguistic and cultural adaptation studies**: The dataset is valuable for research into linguistic and cultural adaptation of language models for specific demographics.

## Dataset Limitations

- **Limited to Korean Language**: This dataset focuses exclusively on Korean-language interactions and may not generalize well to other languages or cultural contexts.
- **Context Scope**: While the instructions are diverse, responses are generated in a single-round interaction format, which may limit depth in multi-turn conversations.

## Finetuned Models

- QLoRA tuned: [beomi/KoAlpaca-RealQA-Solar-Ko-Recovery-11B](https://huggingface.co/beomi/KoAlpaca-RealQA-Solar-Ko-Recovery-11B)

## Acknowledgements

The KoAlpaca-RealQA dataset was developed with the contributions of the ChatKoAlpaca community and the support of GPT-4o’s capabilities. Special thanks to all Korean users who contributed their instructions to improve the quality and relevance of this dataset.

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{koalpaca_realqa,
  title = {KoAlpaca-RealQA: A Korean Instruction Dataset Reflecting Real User Scenarios},
  year = {2024},
  author = {ChatKoAlpaca Community},
  howpublished = {\url{https://huggingface.co/datasets/beomi/KoAlpaca-RealQA}},
  description = {Dataset capturing real Korean instructions and GPT-4o generated responses.}
}
```

## License

CC-BY-SA-4.0


Provider: maas

Created: 2024-11-07
