five

LocalDoc/Bilik-Instruct

收藏
Hugging Face2025-12-23 更新2026-01-03 收录
下载链接:
https://hf-mirror.com/datasets/LocalDoc/Bilik-Instruct
下载链接
链接失效反馈
官方服务:
资源简介:
Bilik-Instruct是一个针对阿塞拜疆语的大规模、高质量的监督微调(SFT)数据集。它基于LocalDoc/wikipedia_azerbaijan数据集,并通过OpenAI的GPT-5采用新颖的persona-driven生成技术进行增强。该数据集的目标是超越正式的百科全书式语言,捕捉各种领域的自然、对话式阿塞拜疆语。数据集包含约100万样本,语言为阿塞拜疆语(拉丁字母),生成模型为GPT-5 (SFT)和GPT-5-Mini (Personas)。数据类型包括多轮对话(30%)、问答(25%)、摘要(25%)和一般指令(20%)。数据集采用了persona-driven方法,确保语言多样性。数据集结构包括messages、persona、type和source_row等字段。数据集遵循CC BY-SA 4.0许可,并需要遵守OpenAI的使用政策。

Bilik-Instruct is a large-scale, high-quality Supervised Fine-Tuning (SFT) dataset for the Azerbaijani language. It is built upon the LocalDoc/wikipedia_azerbaijan dataset and enhanced using a novel persona-driven generation technique via OpenAIs GPT-5. The goal of this dataset is to move beyond formal, encyclopedic language and capture natural, conversational Azerbaijani across various domains. The dataset contains approximately 1 million samples, with the language being Azerbaijani (Latin script) and the generation models being GPT-5 (SFT) and GPT-5-Mini (Personas). Data types include multi-turn dialogue (30%), QA (Question Answering) (25%), summarization (25%), and general instructions (20%). The dataset employs a persona-driven approach to ensure linguistic diversity. The dataset structure includes fields such as messages, persona, type, and source_row. The dataset is licensed under CC BY-SA 4.0 and requires compliance with OpenAIs usage policies.
提供机构:
LocalDoc
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作