five

pythainlp/han-instruct-dataset-v4.0

收藏
Hugging Face2024-08-01 更新2025-04-12 收录
下载链接:
https://hf-mirror.com/datasets/pythainlp/han-instruct-dataset-v4.0
下载链接
链接失效反馈
官方服务:
资源简介:
--- dataset_info: features: - name: messages list: - name: content dtype: string - name: role dtype: string splits: - name: train num_bytes: 6960577 num_examples: 4377 download_size: 2373567 dataset_size: 6960577 configs: - config_name: default data_files: - split: train path: data/train-* license: cc-by-sa-4.0 task_categories: - text-generation language: - th --- # Dataset Card for Han Instruct Dataset v4.0 🪿🪿🪿🪿 🪿 Han (ห่าน or goose) Instruct Dataset is a Thai instruction dataset by PyThaiNLP. This dataset collects all Thai instruct datasets that were made by humans and our old model. The dataset can be used to train Instruction Following models like ChatGPT or others. Data sources: - [Reference desk at Thai wikipedia](https://th.wikipedia.org/wiki/%E0%B8%A7%E0%B8%B4%E0%B8%81%E0%B8%B4%E0%B8%9E%E0%B8%B5%E0%B9%80%E0%B8%94%E0%B8%B5%E0%B8%A2:%E0%B8%9B%E0%B8%B8%E0%B8%88%E0%B8%89%E0%B8%B2-%E0%B8%A7%E0%B8%B4%E0%B8%AA%E0%B8%B1%E0%B8%8A%E0%B8%99%E0%B8%B2). - [Law from justicechannel.org](https://justicechannel.org/) - [pythainlp/final_training_set_v1_enth](https://huggingface.co/datasets/pythainlp/final_training_set_v1_enth): Human checked and edited. - Self-instruct from [WangChanGLM](https://huggingface.co/pythainlp/wangchanglm-7.5B-sft-en) - [Wannaphong.com](https://www.wannaphong.com) - [Blognone](https://www.blognone.com) - Synthetic dataset from LLM - Human annotators ### Supported Tasks and Leaderboards - ChatBot - Instruction Following ### Languages Thai ## Dataset Structure ### Data Fields - messages: ChatML ### Considerations for Using the Data The dataset can be biased by human annotators and LLM annotators. We recommend you check the dataset to select or remove an instruction before training the model or using it to at your risk. ### Licensing Information CC-BY-SA 4.0 ### Citation If you use `Han Instruct Dataset (4.0)` in your project or publication, please cite the dataset as follows: > Phatthiyaphaibun, W. (2024). Han Instruct Dataset (v4.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.13145164 or ```bib @dataset{phatthiyaphaibun_2024_13145164, author = {Phatthiyaphaibun, Wannaphong}, title = {Han Instruct Dataset}, month = jul, year = 2024, publisher = {Zenodo}, version = {v4.0}, doi = {10.5281/zenodo.13145164}, url = {https://doi.org/10.5281/zenodo.13145164} } ``` Zenodo: [https://doi.org/10.5281/zenodo.13145164](https://doi.org/10.5281/zenodo.13145164)

数据集信息: 特征字段: - 名称:messages 类型:列表,包含两个子字段: - content:数据类型为字符串 - role:数据类型为字符串 数据划分: - 名称:train(训练集),字节大小:6960577,样本数量:4377 下载大小:2373567,数据集总大小:6960577 配置项: - 配置名称:default(默认配置),数据文件: - 数据划分:train,文件路径:data/train-* 许可证:cc-by-sa-4.0(知识共享署名-相同方式共享4.0国际许可协议) 任务类别: - 文本生成(text-generation) 使用语言: - th(泰语) --- # Han指令数据集v4.0 🪿🪿🪿🪿 数据集卡片 🪿 Han(泰语为ห่าน,意为鹅)指令数据集是由PyThaiNLP团队制作的泰语指令数据集。本数据集整合了所有由人类及过往模型生成的泰语指令数据集,可用于训练如ChatGPT等遵循指令的模型。 数据来源: - [泰语维基百科参考咨询台](https://th.wikipedia.org/wiki/%E0%B8%A7%E0%B8%B4%E0%B8%81%E0%B8%B4%E0%B8%9E%E0%B8%B5%E0%B9%80%E0%B8%94%E0%B8%B5%E0%B8%A2:%E0%B8%9B%E0%B8%B8%E0%B8%88%E0%B8%89%E0%B8%B2-%E0%B8%A7%E0%B8%B4%E0%B8%AA%E0%B8%B1%E0%B8%8A%E0%B8%99%E0%B8%B2) - [justicechannel.org 法律内容](https://justicechannel.org/) - [pythainlp/final_training_set_v1_enth](https://huggingface.co/datasets/pythainlp/final_training_set_v1_enth):经人工审核与编辑 - 源自[WangChanGLM](https://huggingface.co/pythainlp/wangchanglm-7.5B-sft-en)的自指令数据 - [Wannaphong.com](https://www.wannaphong.com) - [Blognone](https://www.blognone.com) - 由大语言模型(LLM)生成的合成数据集 - 人工标注数据 ### 支持任务与排行榜 - 聊天机器人 - 指令遵循 ### 使用语言 泰语 ## 数据集结构 ### 数据字段 - messages:采用ChatML格式 ### 数据使用注意事项 本数据集可能存在人工标注者与大语言模型标注者带来的偏差。我们建议在训练模型或使用该数据集前,先对其进行检查以筛选或移除相关指令,由此产生的风险由使用者自行承担。 ### 许可信息 CC-BY-SA 4.0(知识共享署名-相同方式共享4.0国际许可协议) ### 引用 若您在项目或学术发表中使用`Han Instruct Dataset (4.0)`,请按如下方式引用该数据集: > Phatthiyaphaibun, W. (2024). Han Instruct Dataset (v4.0) [数据集]. Zenodo. https://doi.org/10.5281/zenodo.13145164 或使用BibTeX格式: bib @dataset{phatthiyaphaibun_2024_13145164, author = {Phatthiyaphaibun, Wannaphong}, title = {Han Instruct Dataset}, month = jul, year = 2024, publisher = {Zenodo}, version = {v4.0}, doi = {10.5281/zenodo.13145164}, url = {https://doi.org/10.5281/zenodo.13145164} } Zenodo链接:[https://doi.org/10.5281/zenodo.13145164](https://doi.org/10.5281/zenodo.13145164)
提供机构:
pythainlp
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作