LocalDoc/Bilik-Instruct
收藏Hugging Face2025-12-23 更新2026-01-03 收录
下载链接:
https://hf-mirror.com/datasets/LocalDoc/Bilik-Instruct
下载链接
链接失效反馈官方服务:
资源简介:
Bilik-Instruct是一个针对阿塞拜疆语的大规模、高质量的监督微调(SFT)数据集。它基于LocalDoc/wikipedia_azerbaijan数据集,并通过OpenAI的GPT-5采用新颖的persona-driven生成技术进行增强。该数据集的目标是超越正式的百科全书式语言,捕捉各种领域的自然、对话式阿塞拜疆语。数据集包含约100万样本,语言为阿塞拜疆语(拉丁字母),生成模型为GPT-5 (SFT)和GPT-5-Mini (Personas)。数据类型包括多轮对话(30%)、问答(25%)、摘要(25%)和一般指令(20%)。数据集采用了persona-driven方法,确保语言多样性。数据集结构包括messages、persona、type和source_row等字段。数据集遵循CC BY-SA 4.0许可,并需要遵守OpenAI的使用政策。
Bilik-Instruct is a large-scale, high-quality Supervised Fine-Tuning (SFT) dataset for the Azerbaijani language. It is built upon the LocalDoc/wikipedia_azerbaijan dataset and enhanced using a novel persona-driven generation technique via OpenAIs GPT-5. The goal of this dataset is to move beyond formal, encyclopedic language and capture natural, conversational Azerbaijani across various domains. The dataset contains approximately 1 million samples, with the language being Azerbaijani (Latin script) and the generation models being GPT-5 (SFT) and GPT-5-Mini (Personas). Data types include multi-turn dialogue (30%), QA (Question Answering) (25%), summarization (25%), and general instructions (20%). The dataset employs a persona-driven approach to ensure linguistic diversity. The dataset structure includes fields such as messages, persona, type, and source_row. The dataset is licensed under CC BY-SA 4.0 and requires compliance with OpenAIs usage policies.
提供机构:
LocalDoc



