extraction-wiki-ja
收藏魔搭社区2025-12-05 更新2025-12-06 收录
下载链接:
https://modelscope.cn/datasets/llm-jp/extraction-wiki-ja
下载链接
链接失效反馈官方服务:
资源简介:
# extraction-wiki-ja
This repository provides an instruction-tuning dataset developed by LLM-jp, a collaborative project launched in Japan.
This is a Japanese instruction-tuning dataset tailored for information extraction and structuring from Japanese Wikipedia text.
The dataset consists of instruction–response pairs automatically generated from Japanese Wikipedia articles. Instructions are created by prompting [Qwen/Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) with passages from Wikipedia, and the corresponding responses are also generated using the same model.
To ensure quality, both instructions and responses are filtered using Qwen/Qwen2.5-32B-Instruct.
The base corpus is a subset of Japanese Wikipedia data curated as part of the [llm-jp-corpus-v3](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3).
The dataset is divided into the following subsets:
- v0.1: Two-turn dialogue format (instruction + response)
- v0.2: Two-turn dialogue format (instruction + response)
- v0.3: Four-turn dialogue format (instruction + response + instruction + response)
## Send Questions to
llm-jp(at)nii.ac.jp
## Model Card Authors
The names are listed in alphabetical order.
Hirokazu Kiyomaru and Takashi Kodama.
# extraction-wiki-ja
本仓库提供由日本发起的合作项目LLM-jp研发的指令微调数据集。
本数据集为面向日语维基百科文本的信息抽取与结构化任务定制的日语指令微调数据集。
本数据集包含从日语维基百科文章自动生成的指令-回复对。指令通过将维基百科段落输入通义千问Qwen/Qwen2.5-32B-Instruct(Qwen/Qwen2.5-32B-Instruct)进行提示生成,对应的回复亦由同一模型生成。
为保障数据集质量,我们使用通义千问Qwen/Qwen2.5-32B-Instruct(Qwen/Qwen2.5-32B-Instruct)对指令与回复进行了筛选。
本数据集的基础语料库是作为llm-jp-corpus-v3项目一部分所整理的日语维基百科数据子集。
本数据集分为以下子版本:
- v0.1:双轮对话格式(指令+回复)
- v0.2:双轮对话格式(指令+回复)
- v0.3:四轮对话格式(指令+回复+指令+回复)
## 咨询邮箱
llm-jp(at)nii.ac.jp
## 模型卡片作者
作者姓名按字母顺序排列,分别为Hirokazu Kiyomaru与Takashi Kodama.
提供机构:
maas
创建时间:
2025-11-25



