HachiML/alpaca_jp_python
Source: Hugging Face | Updated: 2024-05-20 | Indexed: 2024-06-15
Download link:
https://hf-mirror.com/datasets/HachiML/alpaca_jp_python
Resource overview:
---
language:
- ja
license: apache-2.0
size_categories:
- 10K<n<100K
task_categories:
- text-generation
dataset_info:
  features:
  - name: No.
    dtype: int64
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: avg_similarity_score
    dtype: float64
  - name: similar_instructions
    list:
    - name: instruction
      dtype: string
    - name: similarity
      dtype: float64
  - name: index
    dtype: int64
  - name: clean
    dtype: string
  splits:
  - name: v1.0_cleaned
    num_bytes: 19707929
    num_examples: 10960
  - name: _archive_v1.0
    num_bytes: 19814564
    num_examples: 11024
  - name: _archive_v0.2_cleaned
    num_bytes: 16175512
    num_examples: 9066
  - name: _archive_v0.2
    num_bytes: 16435602
    num_examples: 9221
  - name: _archive_v0.1_cleaned
    num_bytes: 7005088
    num_examples: 3910
  - name: _archive_v0.1
    num_bytes: 15317196
    num_examples: 8629
  download_size: 21343164
  dataset_size: 94455891
configs:
- config_name: default
  data_files:
  - split: v1.0_cleaned
    path: data/v1.0_cleaned-*
  - split: _archive_v1.0
    path: data/_archive_v1.0-*
  - split: _archive_v0.2_cleaned
    path: data/_archive_v0.2_cleaned-*
  - split: _archive_v0.2
    path: data/_archive_v0.2-*
  - split: _archive_v0.1_cleaned
    path: data/_archive_v0.1_cleaned-*
  - split: _archive_v0.1
    path: data/_archive_v0.1-*
tags:
- synthetic
- code
- python
- self-instruct
---
# alpaca_jp_python
alpaca_jp_python is a synthetic dataset created with:
- the method of [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca/tree/main)
- [mistralai/Mixtral-8x22B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1)

The model was accessed through [Deepinfra](https://deepinfra.com/mistralai/Mixtral-8x22B-Instruct-v0.1/api?example=openai-python).
Splits whose names end in "_cleaned" have also been screened by [mistralai/Mixtral-8x22B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1).
## Dataset Details
### Dataset Description
- **Curated by:** [HachiML](https://huggingface.co/HachiML)
- **Language(s) (NLP):** Japanese
- **License:** Apache 2.0
- **Github:** [Alpaca-jp](https://github.com/Hajime-Y/Alpaca-jp)
## Uses
```Python
# library
from datasets import load_dataset
# Loading the latest split is recommended.
dataset = load_dataset("HachiML/alpaca_jp_python", split="v1.0_cleaned")
```
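Each record carries `instruction`, `input`, and `output` strings, so a common next step is to render a record as a single training prompt. The following is a minimal sketch and not part of the original card: the template wording is modeled on the Stanford Alpaca format, and `format_example` is a hypothetical helper name. It assumes that an empty string or the `<noinput>` marker means the record has no input.

```python
# Hypothetical helper: render one record as an Alpaca-style prompt.
# The template text is an assumption, not taken from the dataset card.
NO_INPUT_MARKERS = {"", "<noinput>"}


def format_example(record: dict) -> str:
    """Build a prompt/response string from one dataset record."""
    instruction = record["instruction"].strip()
    input_text = (record.get("input") or "").strip()
    output = record["output"].strip()

    # Records without input skip the "Input" section entirely.
    if input_text in NO_INPUT_MARKERS:
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Response:\n{output}"
        )
    return (
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{input_text}\n\n"
        f"### Response:\n{output}"
    )


# Hand-written sample record (illustrative only, not drawn from the dataset).
sample = {
    "instruction": "次のリストの合計を計算するPython関数を書いてください。",
    "input": "<noinput>",
    "output": "def total(xs):\n    return sum(xs)",
}
print(format_example(sample))
```

The same function can be passed to `dataset.map` after wrapping its result in a dict, if a single text column is needed.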
## Data Cleaning
Splits whose names end in "_cleaned" have been screened by [mistralai/Mixtral-8x22B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1).
The prompt used for cleaning is shown below.
```Python
def create_prompt(instruction, input_data, output_data, programming_language="python"):
    """
    Build the judgement prompt from an instruction, input data, and output data.

    Args:
        instruction (str): The instruction given by the user.
        input_data (str): The input data (empty string if there is none).
        output_data (str): The output data.
        programming_language (str): Name of the programming language.

    Returns:
        str: The generated prompt.
    """
    # Records without input use a variant that omits the "Input:" field.
    if input_data == "":
        text = f"""Assess whether the following combination of instruction, and output is appropriate.
1. The only natural language for instructions and output is Japanese.
2. The task is related to {programming_language}.
3. Verify that the input data matches the language and context of the instruction.
4. Check the output data for:
- Language consistency with the instruction and input.
- Accuracy and relevance to the input.
- Clarity without repetition or errors.
\nInstruction: {instruction}\nOutput: {output_data}
\nYour Judgement (Just answer: True or False. No need to explain the reason.):"""
    else:
        text = f"""Assess whether the following combination of instruction, input, and output is appropriate.
1. The only natural language for instructions, input, and output is Japanese.
2. The task is related to {programming_language}.
3. Verify that the input data matches the language and context of the instruction.
4. Check the output data for:
- Language consistency with the instruction and input.
- Accuracy and relevance to the input.
- Clarity without repetition or errors.
\nInstruction: {instruction}\nInput: {input_data}\nOutput: {output_data}
\nYour Judgement (Just answer: True or False. No need to explain the reason.):"""
    return text
```
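The prompt asks the model to reply with only `True` or `False`, but the card does not show how those replies were turned into the `_cleaned` splits. The sketch below is one way this could be done, under the assumption that a record passes only when the reply begins with `True`. `parse_judgement` and `keep_clean` are hypothetical names, and the sample replies are made up for illustration.

```python
from typing import Iterable, Optional


def parse_judgement(reply: str) -> Optional[bool]:
    """Map a model reply to True/False, or None if it is neither."""
    # Strip whitespace plus quotes/periods the model may wrap around the answer.
    text = reply.strip().strip("\"'.").lower()
    if text.startswith("true"):
        return True
    if text.startswith("false"):
        return False
    return None  # Unparseable replies are not treated as a pass.


def keep_clean(records: Iterable[dict], replies: Iterable[str]) -> list:
    """Keep only the records whose paired reply parses to True."""
    return [rec for rec, rep in zip(records, replies) if parse_judgement(rep) is True]


# Illustrative data only (assumption: one reply per record, in order).
records = [{"No.": 1}, {"No.": 2}, {"No.": 3}]
replies = [" True", "False.", "I am not sure"]
print(keep_clean(records, replies))
```

Returning `None` for unparseable replies keeps them distinguishable from an explicit `False`, so they can be retried or reviewed separately.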
## Prompt for Data Generation
```
You are asked to come up with a set of 10 diverse coding task instructions related to python. These task instructions will be given to a GPT model and we will evaluate the GPT model for completing the instructions.
Here are the requirements:
1. Avoid using the same phrases for each instruction to maximize diversity.
2. The language used for the instruction also should be diverse. For example, you should combine questions with imperative instrucitons.
3. The type of instructions should be diverse. The list should include diverse types of tasks like generating code, explaining, fixing, refactoring, optimizing, translating, documenting, analyzing, completing, machine learning, data analyzing etc.
4. The natural language during instructions, inputs and outputs must be in Japanese. English must not be used. Comment text in the code must be in Japanese.
5. The instructions should be 1 to 2 sentences long. Either an imperative sentence or a question is permitted.
6. You should generate an appropriate input to the instruction. The input field should contain a specific example provided for the instruction. It should involve realistic data and should not contain simple placeholders. The input should provide substantial content to make the instruction challenging. The list should include diverse types of context like SQL database, csv, XML, image, text, sound etc.
7. Not all instructions require input. For example, the instruction shuch as "Create a function in Python that adds a, b" does not need to provide a specific context. In this case, we simply put "<noinput>" in the input field.
8. The output should be an appropriate response to the instruction and the input.
List of 10 tasks:
```
alpaca_jp_python is a synthetic dataset created with the Stanford Alpaca method and mistralai/Mixtral-8x22B-Instruct-v0.1. It contains Japanese-language instruction data for programming tasks in Python. Each record has a number (`No.`), an instruction, an input, an output, an average similarity score, a list of similar instructions, an index, and a `clean` marker. The data is published in three versions (v0.1, v0.2, v1.0), each as a raw split and a `_cleaned` split. For the cleaned splits, Mixtral-8x22B-Instruct-v0.1 judged each record using a prompt built by a Python function, checking language consistency and the relevance of the instruction, input, and output. The dataset is intended for text generation tasks, especially those related to Python programming.
Provider: HachiML
## Summary of Original Information
### Dataset Overview
#### Basic Information
- Language: Japanese
- License: Apache 2.0
- Size category: 10K<n<100K
- Task category: text generation
### Dataset Structure
#### Features
- `No.`: int64
- `instruction`: string
- `input`: string
- `output`: string
- `avg_similarity_score`: float64
- `similar_instructions`: list, with the following sub-features:
  - `instruction`: string
  - `similarity`: float64
- `index`: int64
- `clean`: string
#### Splits
- `v1.0_cleaned`: 19707929 bytes, 10960 examples
- `_archive_v1.0`: 19814564 bytes, 11024 examples
- `_archive_v0.2_cleaned`: 16175512 bytes, 9066 examples
- `_archive_v0.2`: 16435602 bytes, 9221 examples
- `_archive_v0.1_cleaned`: 7005088 bytes, 3910 examples
- `_archive_v0.1`: 15317196 bytes, 8629 examples
#### Dataset Size
- Download size: 21343164 bytes
- Dataset size: 94455891 bytes
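
The split figures make it possible to work out how many records each cleaning pass removed. The following sketch only does arithmetic on the numbers listed in the split metadata; the version grouping (pairing each raw split with its `_cleaned` counterpart) and the helper name `retention` are assumptions made for illustration.

```python
# (raw examples, cleaned examples) per version, from the split metadata.
EXAMPLES = {
    "v0.1": (8629, 3910),
    "v0.2": (9221, 9066),
    "v1.0": (11024, 10960),
}

# num_bytes of all six splits, from the split metadata.
NUM_BYTES = [19707929, 19814564, 16175512, 16435602, 7005088, 15317196]
DATASET_SIZE = 94455891


def retention(raw: int, cleaned: int) -> float:
    """Fraction of raw examples that survive cleaning."""
    return cleaned / raw


for version, (raw, cleaned) in EXAMPLES.items():
    print(f"{version}: {retention(raw, cleaned):.1%} kept ({raw - cleaned} dropped)")

# The per-split byte counts add up to the reported dataset_size.
print(sum(NUM_BYTES) == DATASET_SIZE)
```

With these figures, cleaning removed roughly 55% of v0.1 but under 2% of v0.2 and under 1% of v1.0.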
### Configuration
- config_name: default
- data_files:
  - `v1.0_cleaned`: `data/v1.0_cleaned-*`
  - `_archive_v1.0`: `data/_archive_v1.0-*`
  - `_archive_v0.2_cleaned`: `data/_archive_v0.2_cleaned-*`
  - `_archive_v0.2`: `data/_archive_v0.2-*`
  - `_archive_v0.1_cleaned`: `data/_archive_v0.1_cleaned-*`
  - `_archive_v0.1`: `data/_archive_v0.1-*`
### Tags
- synthetic
- code
- python
- self-instruct



