pythainlp/han-instruct-dataset-v4.0
收藏Hugging Face2024-08-01 更新2025-04-12 收录
下载链接:
https://hf-mirror.com/datasets/pythainlp/han-instruct-dataset-v4.0
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
splits:
- name: train
num_bytes: 6960577
num_examples: 4377
download_size: 2373567
dataset_size: 6960577
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
license: cc-by-sa-4.0
task_categories:
- text-generation
language:
- th
---
# Dataset Card for Han Instruct Dataset v4.0 🪿🪿🪿🪿
🪿 Han (ห่าน or goose) Instruct Dataset is a Thai instruction dataset by PyThaiNLP. This dataset collects all Thai instruct datasets that were made by humans and our old model. The dataset can be used to train Instruction Following models like ChatGPT or others.
Data sources:
- [Reference desk at Thai wikipedia](https://th.wikipedia.org/wiki/%E0%B8%A7%E0%B8%B4%E0%B8%81%E0%B8%B4%E0%B8%9E%E0%B8%B5%E0%B9%80%E0%B8%94%E0%B8%B5%E0%B8%A2:%E0%B8%9B%E0%B8%B8%E0%B8%88%E0%B8%89%E0%B8%B2-%E0%B8%A7%E0%B8%B4%E0%B8%AA%E0%B8%B1%E0%B8%8A%E0%B8%99%E0%B8%B2).
- [Law from justicechannel.org](https://justicechannel.org/)
- [pythainlp/final_training_set_v1_enth](https://huggingface.co/datasets/pythainlp/final_training_set_v1_enth): Human checked and edited.
- Self-instruct from [WangChanGLM](https://huggingface.co/pythainlp/wangchanglm-7.5B-sft-en)
- [Wannaphong.com](https://www.wannaphong.com)
- [Blognone](https://www.blognone.com)
- Synthetic dataset from LLM
- Human annotators
### Supported Tasks and Leaderboards
- ChatBot
- Instruction Following
### Languages
Thai
## Dataset Structure
### Data Fields
- messages: ChatML
### Considerations for Using the Data
The dataset can be biased by human annotators and LLM annotators. We recommend you check the dataset to select or remove an instruction before training the model or using it to at your risk.
### Licensing Information
CC-BY-SA 4.0
### Citation
If you use `Han Instruct Dataset (4.0)` in your project or publication, please cite the dataset as follows:
> Phatthiyaphaibun, W. (2024). Han Instruct Dataset (v4.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.13145164
or
```bib
@dataset{phatthiyaphaibun_2024_13145164,
author = {Phatthiyaphaibun, Wannaphong},
title = {Han Instruct Dataset},
month = jul,
year = 2024,
publisher = {Zenodo},
version = {v4.0},
doi = {10.5281/zenodo.13145164},
url = {https://doi.org/10.5281/zenodo.13145164}
}
```
Zenodo: [https://doi.org/10.5281/zenodo.13145164](https://doi.org/10.5281/zenodo.13145164)
数据集信息:
特征字段:
- 名称:messages
类型:列表,包含两个子字段:
- content:数据类型为字符串
- role:数据类型为字符串
数据划分:
- 名称:train(训练集),字节大小:6960577,样本数量:4377
下载大小:2373567,数据集总大小:6960577
配置项:
- 配置名称:default(默认配置),数据文件:
- 数据划分:train,文件路径:data/train-*
许可证:cc-by-sa-4.0(知识共享署名-相同方式共享4.0国际许可协议)
任务类别:
- 文本生成(text-generation)
使用语言:
- th(泰语)
---
# Han指令数据集v4.0 🪿🪿🪿🪿 数据集卡片
🪿 Han(泰语为ห่าน,意为鹅)指令数据集是由PyThaiNLP团队制作的泰语指令数据集。本数据集整合了所有由人类及过往模型生成的泰语指令数据集,可用于训练如ChatGPT等遵循指令的模型。
数据来源:
- [泰语维基百科参考咨询台](https://th.wikipedia.org/wiki/%E0%B8%A7%E0%B8%B4%E0%B8%81%E0%B8%B4%E0%B8%9E%E0%B8%B5%E0%B9%80%E0%B8%94%E0%B8%B5%E0%B8%A2:%E0%B8%9B%E0%B8%B8%E0%B8%88%E0%B8%89%E0%B8%B2-%E0%B8%A7%E0%B8%B4%E0%B8%AA%E0%B8%B1%E0%B8%8A%E0%B8%99%E0%B8%B2)
- [justicechannel.org 法律内容](https://justicechannel.org/)
- [pythainlp/final_training_set_v1_enth](https://huggingface.co/datasets/pythainlp/final_training_set_v1_enth):经人工审核与编辑
- 源自[WangChanGLM](https://huggingface.co/pythainlp/wangchanglm-7.5B-sft-en)的自指令数据
- [Wannaphong.com](https://www.wannaphong.com)
- [Blognone](https://www.blognone.com)
- 由大语言模型(LLM)生成的合成数据集
- 人工标注数据
### 支持任务与排行榜
- 聊天机器人
- 指令遵循
### 使用语言
泰语
## 数据集结构
### 数据字段
- messages:采用ChatML格式
### 数据使用注意事项
本数据集可能存在人工标注者与大语言模型标注者带来的偏差。我们建议在训练模型或使用该数据集前,先对其进行检查以筛选或移除相关指令,由此产生的风险由使用者自行承担。
### 许可信息
CC-BY-SA 4.0(知识共享署名-相同方式共享4.0国际许可协议)
### 引用
若您在项目或学术发表中使用`Han Instruct Dataset (4.0)`,请按如下方式引用该数据集:
> Phatthiyaphaibun, W. (2024). Han Instruct Dataset (v4.0) [数据集]. Zenodo. https://doi.org/10.5281/zenodo.13145164
或使用BibTeX格式:
bib
@dataset{phatthiyaphaibun_2024_13145164,
author = {Phatthiyaphaibun, Wannaphong},
title = {Han Instruct Dataset},
month = jul,
year = 2024,
publisher = {Zenodo},
version = {v4.0},
doi = {10.5281/zenodo.13145164},
url = {https://doi.org/10.5281/zenodo.13145164}
}
Zenodo链接:[https://doi.org/10.5281/zenodo.13145164](https://doi.org/10.5281/zenodo.13145164)
提供机构:
pythainlp



