llm-japanese-dataset-vanilla

Name: llm-japanese-dataset-vanilla
Creator: University of Tokyo
Published: 2023-11-05 14:32:30
License: Not specified

Source: arXiv. Updated: 2023-11-05. Indexed: 2024-06-21.

Download link:

https://huggingface.co/datasets/izumi-lab/llm-japanese-dataset-vanilla

Overview:

llm-japanese-dataset-vanilla is an instruction dataset for Japanese large language models, built by the University of Tokyo. It was constructed by filtering and expanding existing data: translation tasks were removed so that the dataset contains Japanese-only content. It comprises approximately 2.5 million samples across five task types: common sense, summarization, reading comprehension, simplification, and correction. The dataset was created to address the scarcity of instruction datasets for non-English languages. It is intended for both Japanese-based and English-based large language models, with the goal of improving performance on unseen and downstream tasks through instruction tuning.
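As a rough illustration of how the dataset at the download link might be pulled and turned into training text, here is a minimal Python sketch. It assumes the Hugging Face `datasets` library is installed, that a `train` split exists, and that records use Alpaca-style `instruction`, `input`, and `output` keys; none of these details are stated on this page, so check them against the dataset card before relying on them. The helper names (`repo_id_from_url`, `load_vanilla`, `format_example`) are hypothetical.

```python
from urllib.parse import urlparse

DATASET_URL = "https://huggingface.co/datasets/izumi-lab/llm-japanese-dataset-vanilla"


def repo_id_from_url(url: str) -> str:
    """Extract the '<owner>/<name>' repository id from a Hugging Face dataset URL."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "datasets":
        raise ValueError(f"not a Hugging Face dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


def load_vanilla(split: str = "train"):
    """Download the dataset. Needs network access and `pip install datasets`.

    The split name 'train' is an assumption, not taken from this page.
    """
    from datasets import load_dataset  # imported lazily so the helpers work offline

    return load_dataset(repo_id_from_url(DATASET_URL), split=split)


def format_example(record: dict) -> str:
    """Render one record as a single prompt/response string.

    Assumes the keys 'instruction', 'input', and 'output' (hypothetical schema).
    """
    instruction = record.get("instruction", "")
    context = record.get("input", "")
    output = record.get("output", "")
    # Append the optional input below the instruction only when it is non-empty.
    prompt = instruction if not context else f"{instruction}\n\n{context}"
    return f"{prompt}\n\n{output}"
```

With network access, calling `load_vanilla()` and passing one of its records to `format_example` would produce one training string per sample; the formatting template itself is only one of many reasonable choices for instruction tuning.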

Provider:

University of Tokyo

Created:

2023-09-07
