llm-japanese-dataset-vanilla
Source: arXiv · Updated: 2023-11-05 · Indexed: 2024-06-21
Download link:
https://huggingface.co/datasets/izumi-lab/llm-japanese-dataset-vanilla
Description:
llm-japanese-dataset-vanilla is an instruction dataset for Japanese large language models (LLMs), built at the University of Tokyo. It was constructed by filtering and extending existing data: translation tasks were removed so that the content is Japanese only. It contains approximately 2.5 million samples across five task types: commonsense, summarization, reading comprehension, simplification, and correction. The dataset was created to address the scarcity of instruction datasets for non-English languages, where instruction tuning is used to improve model performance on unseen tasks and on downstream tasks. It is intended for instruction tuning of LLMs built on either Japanese or English base models.
Provided by:
The University of Tokyo
Created:
2023-09-07
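
Usage sketch:
The entry describes an instruction dataset hosted on Hugging Face, so a short loading example may be useful. The following is a minimal sketch, not an official recipe. It assumes the `datasets` library is installed, that the repository exposes a `train` split, and that each record uses Alpaca-style `instruction`, `input`, and `output` fields; none of these details are stated in this entry, so check the dataset card before relying on them. The helper names `build_prompt` and `iter_prompts` and the prompt template are hypothetical.

```python
from typing import Dict, Iterator, Optional

# Repository ID taken from the download link in this entry.
DATASET_ID = "izumi-lab/llm-japanese-dataset-vanilla"


def build_prompt(record: Dict[str, Optional[str]]) -> str:
    """Join one instruction record into a single training string.

    Assumption: records carry "instruction", "input", and "output" keys.
    The "### ..." template is illustrative, not prescribed by the dataset.
    """
    instruction = (record.get("instruction") or "").strip()
    context = (record.get("input") or "").strip()
    answer = (record.get("output") or "").strip()

    parts = [f"### Instruction:\n{instruction}"]
    if context:  # skip the input section when the record has no context
        parts.append(f"### Input:\n{context}")
    parts.append(f"### Response:\n{answer}")
    return "\n\n".join(parts)


def iter_prompts(split: str = "train") -> Iterator[str]:
    """Stream records from the Hub and yield formatted prompts.

    Assumption: a split named "train" exists. Streaming avoids downloading
    all of the roughly 2.5 million samples up front.
    """
    # Imported lazily so build_prompt works without the library installed.
    from datasets import load_dataset

    dataset = load_dataset(DATASET_ID, split=split, streaming=True)
    for record in dataset:
        yield build_prompt(record)
```

Calling `next(iter_prompts())` would fetch and format the first record, given network access. The formatting step is kept separate from loading so it can be checked offline on a hand-written record.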



