LocalDoc/Bilik-Instruct

Name: LocalDoc/Bilik-Instruct
Creator: LocalDoc
Published: 2025-12-23 12:02:53
License: 暂无描述

Hugging Face2025-12-23 更新2026-01-03 收录

下载链接：

https://hf-mirror.com/datasets/LocalDoc/Bilik-Instruct

下载链接

链接失效反馈

官方服务：

资源简介：

Bilik-Instruct是一个针对阿塞拜疆语的大规模、高质量的监督微调（SFT）数据集。它基于LocalDoc/wikipedia_azerbaijan数据集，并通过OpenAI的GPT-5采用新颖的persona-driven生成技术进行增强。该数据集的目标是超越正式的百科全书式语言，捕捉各种领域的自然、对话式阿塞拜疆语。数据集包含约100万样本，语言为阿塞拜疆语（拉丁字母），生成模型为GPT-5 (SFT)和GPT-5-Mini (Personas)。数据类型包括多轮对话（30%）、问答（25%）、摘要（25%）和一般指令（20%）。数据集采用了persona-driven方法，确保语言多样性。数据集结构包括messages、persona、type和source_row等字段。数据集遵循CC BY-SA 4.0许可，并需要遵守OpenAI的使用政策。

Bilik-Instruct is a large-scale, high-quality Supervised Fine-Tuning (SFT) dataset for the Azerbaijani language. It is built upon the LocalDoc/wikipedia_azerbaijan dataset and enhanced using a novel persona-driven generation technique via OpenAIs GPT-5. The goal of this dataset is to move beyond formal, encyclopedic language and capture natural, conversational Azerbaijani across various domains. The dataset contains approximately 1 million samples, with the language being Azerbaijani (Latin script) and the generation models being GPT-5 (SFT) and GPT-5-Mini (Personas). Data types include multi-turn dialogue (30%), QA (Question Answering) (25%), summarization (25%), and general instructions (20%). The dataset employs a persona-driven approach to ensure linguistic diversity. The dataset structure includes fields such as messages, persona, type, and source_row. The dataset is licensed under CC BY-SA 4.0 and requires compliance with OpenAIs usage policies.

提供机构：

LocalDoc

5,000+

优质数据集

54 个

任务类型

进入经典数据集