HachiML/Evol-Alpaca-gen3-500
收藏Hugging Face2024-05-20 更新2024-06-12 收录
下载链接:
https://hf-mirror.com/datasets/HachiML/Evol-Alpaca-gen3-500
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: No.
dtype: int64
- name: seed_id
dtype: string
- name: generation
dtype: int64
- name: evol_history
sequence: string
- name: instruction
dtype: string
- name: output
dtype: string
splits:
- name: train
num_bytes: 1062581
num_examples: 507
download_size: 505398
dataset_size: 1062581
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
license: apache-2.0
task_categories:
- text-generation
language:
- ja
tags:
- synthetic
- evol-instruct
size_categories:
- n<1K
---
# Evol-Alpaca-gen3-500
<!-- Provide a quick summary of the dataset. -->
Evol-Alpaca-gen3-500は、
- [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca/tree/main)のseed tasksを日本語化
- [Evol-Instruction](https://arxiv.org/abs/2304.12244)の手法
- [mistralai/Mixtral-8x22B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1)
で作った合成データ(Synthetic data)です。
モデルの利用には[Deepinfra](https://deepinfra.com/mistralai/Mixtral-8x22B-Instruct-v0.1/api?example=openai-python)を利用しています。
<!-- This dataset card aims to be a base template for new datasets. It has been generated using [this raw template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/datasetcard_template.md?plain=1). -->
## Dataset Details
### Dataset Description
<!-- Provide a longer summary of what this dataset is. -->
- **Curated by:** [HachiML](https://huggingface.co/HachiML)
- **Language(s) (NLP):** Japanese
- **License:** Apache 2.0
- **Github:** [Evol-Instruct-jp](https://github.com/Hajime-Y/Evol-Instruct-jp)
## Uses
<!-- Address questions around how the dataset is intended to be used. -->
```Python
# library
from datasets import load_dataset
# Load dataset.
dataset = load_dataset("HachiML/Evol-Alpaca-gen3-500")
```
## Code
**Github:** [Evol-Instruct-jp](https://github.com/Hajime-Y/Evol-Instruct-jp)
にコードを置いています。このコードを元に、以下の設定で生成しました。
```Python
!python main.py \
--input_file "./data/alpaca_seed_tasks_jp.jsonl" \
--output_file "./output/generated.json" \
--eliminated_file "./output/eliminated.json" \
--model "mistralai/Mixtral-8x22B-Instruct-v0.1" \
--hallucination_check_model "" \
--generations 3 \
--num_instructions_to_generate 500
```
提供机构:
HachiML
原始信息汇总
数据集概述
数据集信息
- 名称: Evol-Alpaca-gen3-500
- 语言: 日语
- 许可证: Apache 2.0
- 任务类别: 文本生成
- 标签: 合成数据, Evol-Instruction
- 大小类别: 小于1K
数据集特征
- No.: 整数类型
- seed_id: 字符串类型
- generation: 整数类型
- evol_history: 字符串序列
- instruction: 字符串类型
- output: 字符串类型
数据集分割
- 训练集:
- 大小: 1062581字节
- 示例数量: 507
数据集大小
- 下载大小: 505398字节
- 数据集大小: 1062581字节
配置
- 默认配置:
- 数据文件:
- 分割: 训练
- 路径: data/train-*
- 数据文件:



