HachiML/alpaca_jp_python

Name: HachiML/alpaca_jp_python
Creator: HachiML
Published: 2024-05-20 01:44:32
License: Apache 2.0

Source: Hugging Face. Updated 2024-05-20. Indexed 2024-06-15.

Download link:

https://hf-mirror.com/datasets/HachiML/alpaca_jp_python

Resource overview:

---
language:
- ja
license: apache-2.0
size_categories:
- 10K<n<100K
task_categories:
- text-generation
dataset_info:
  features:
  - name: No.
    dtype: int64
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: avg_similarity_score
    dtype: float64
  - name: similar_instructions
    list:
    - name: instruction
      dtype: string
    - name: similarity
      dtype: float64
  - name: index
    dtype: int64
  - name: clean
    dtype: string
  splits:
  - name: v1.0_cleaned
    num_bytes: 19707929
    num_examples: 10960
  - name: _archive_v1.0
    num_bytes: 19814564
    num_examples: 11024
  - name: _archive_v0.2_cleaned
    num_bytes: 16175512
    num_examples: 9066
  - name: _archive_v0.2
    num_bytes: 16435602
    num_examples: 9221
  - name: _archive_v0.1_cleaned
    num_bytes: 7005088
    num_examples: 3910
  - name: _archive_v0.1
    num_bytes: 15317196
    num_examples: 8629
  download_size: 21343164
  dataset_size: 94455891
configs:
- config_name: default
  data_files:
  - split: v1.0_cleaned
    path: data/v1.0_cleaned-*
  - split: _archive_v1.0
    path: data/_archive_v1.0-*
  - split: _archive_v0.2_cleaned
    path: data/_archive_v0.2_cleaned-*
  - split: _archive_v0.2
    path: data/_archive_v0.2-*
  - split: _archive_v0.1_cleaned
    path: data/_archive_v0.1_cleaned-*
  - split: _archive_v0.1
    path: data/_archive_v0.1-*
tags:
- synthetic
- code
- python
- self-instruct
---

# alpaca_jp_python

alpaca_jp_python is synthetic data created with:

- the method of [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca/tree/main)
- [mistralai/Mixtral-8x22B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1)

The model was accessed through [Deepinfra](https://deepinfra.com/mistralai/Mixtral-8x22B-Instruct-v0.1/api?example=openai-python). Splits whose names contain "_cleaned" have additionally been screened by [mistralai/Mixtral-8x22B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1).

## Dataset Details

### Dataset Description

- **Curated by:** [HachiML](https://huggingface.co/HachiML)
- **Language(s) (NLP):** Japanese
- **License:** Apache 2.0
- **Github:** [Alpaca-jp](https://github.com/Hajime-Y/Alpaca-jp)

## Uses

```Python
# library
from datasets import load_dataset

# Recommend getting the latest version (split).
dataset = load_dataset("HachiML/alpaca_jp_python", split="v1.0_cleaned")
```

## Data Cleaning

Splits whose names contain "_cleaned" have been screened by [mistralai/Mixtral-8x22B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1). The prompt used for cleaning is shown below.

```Python
def create_prompt(instruction, input_data, output_data, programming_language="python"):
    """
    Build a prompt by combining the instruction, input data, and output data.

    Args:
        instruction (str): Instruction from the user
        input_data (str): Input data
        output_data (str): Output data
        programming_language (str): Name of the programming language

    Returns:
        str: The generated prompt
    """
    if input_data == "":
        # No input field: judge the instruction/output pair only.
        text = f"""Assess whether the following combination of instruction, and output is appropriate.
        1. The only natural language for instructions and output is Japanese.
        2. The task is related to {programming_language}.
        3. Verify that the input data matches the language and context of the instruction.
        4. Check the output data for:
            - Language consistency with the instruction and input.
            - Accuracy and relevance to the input.
            - Clarity without repetition or errors.
        \nInstruction: {instruction}\nOutput: {output_data}
        \nYour Judgement (Just answer: True or False. No need to explain the reason.):"""
    else:
        # Input field present: judge the instruction/input/output triple.
        text = f"""Assess whether the following combination of instruction, input, and output is appropriate.
        1. The only natural language for instructions, input, and output is Japanese.
        2. The task is related to {programming_language}.
        3. Verify that the input data matches the language and context of the instruction.
        4. Check the output data for:
            - Language consistency with the instruction and input.
            - Accuracy and relevance to the input.
            - Clarity without repetition or errors.
        \nInstruction: {instruction}\nInput: {input_data}\nOutput: {output_data}
        \nYour Judgement (Just answer: True or False. No need to explain the reason.):"""
    return text
```

## prompt for data generation

```
You are asked to come up with a set of 10 diverse coding task instructions related to python. These task instructions will be given to a GPT model and we will evaluate the GPT model for completing the instructions.

Here are the requirements:
1. Avoid using the same phrases for each instruction to maximize diversity.
2. The language used for the instruction also should be diverse. For example, you should combine questions with imperative instrucitons.
3. The type of instructions should be diverse. The list should include diverse types of tasks like generating code, explaining, fixing, refactoring, optimizing, translating, documenting, analyzing, completing, machine learning, data analyzing etc.
4. The natural language during instructions, inputs and outputs must be in Japanese. English must not be used. Comment text in the code must be in Japanese.
5. The instructions should be 1 to 2 sentences long. Either an imperative sentence or a question is permitted.
6. You should generate an appropriate input to the instruction. The input field should contain a specific example provided for the instruction. It should involve realistic data and should not contain simple placeholders. The input should provide substantial content to make the instruction challenging. The list should include diverse types of context like SQL database, csv, XML, image, text, sound etc.
7. Not all instructions require input. For example, the instruction shuch as "Create a function in Python that adds a, b" does not need to provide a specific context. In this case, we simply put "<noinput>" in the input field.
8. The output should be an appropriate response to the instruction and the input.

List of 10 tasks:
```
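
Requirement 7 of the generation prompt marks input-free tasks with "<noinput>", while `create_prompt` switches templates on an empty input string. The card does not say how the placeholder was converted between the two steps, so the following is only a minimal sketch, assuming a hypothetical helper `normalize_input` that maps the placeholder to an empty string before cleaning.

```python
NO_INPUT_MARKER = "<noinput>"  # placeholder named in requirement 7 of the generation prompt


def normalize_input(raw_input: str) -> str:
    """Hypothetical helper (not from the card): map "<noinput>" to an empty string."""
    stripped = raw_input.strip()
    # Assumption: compare case-insensitively, since a generated marker may vary in casing.
    if stripped.lower() == NO_INPUT_MARKER:
        return ""
    return stripped
```

With this normalization, a record generated with "<noinput>" would take the first branch of `create_prompt`, and any other input would take the second.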

alpaca_jp_python is a synthetic dataset created with the Stanford Alpaca method, using mistralai/Mixtral-8x22B-Instruct-v0.1 as the generator. It contains Japanese-language instruction data for programming tasks in Python. Each record has the fields No., instruction, input, output, avg_similarity_score, similar_instructions, index, and clean. The dataset is published in several versions (v0.1, v0.2, v1.0), each with a raw split and a "_cleaned" split. Cleaning is done by prompting Mixtral-8x22B-Instruct-v0.1 to judge each record for language consistency, relevance, and clarity; the prompt is built by a Python function. The dataset is intended for text generation tasks, especially those related to Python programming.
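
The cleaning prompt asks the model to answer only True or False. How the replies were parsed is not documented, so this is a minimal sketch under stated assumptions: `parse_judgement` and `keep_record` are hypothetical helpers, and a record is kept only when the reply is an unambiguous True.

```python
import re
from typing import Optional


def parse_judgement(reply: str) -> Optional[bool]:
    """Hypothetical parser: return True/False, or None if the reply is ambiguous."""
    # Collect every standalone "true"/"false" token, ignoring case.
    tokens = re.findall(r"\b(true|false)\b", reply.lower())
    if not tokens or len(set(tokens)) > 1:
        return None  # no verdict, or contradictory verdicts
    return tokens[0] == "true"


def keep_record(reply: str) -> bool:
    """Assumption: keep a record only on an unambiguous True judgement."""
    return parse_judgement(reply) is True
```

Treating ambiguous replies as rejections is a conservative choice for a cleaning step; a different policy (for example, re-querying the model) would be equally plausible.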

Provider:

HachiML

Summary of original information

Dataset overview

Basic information

Language: Japanese
License: Apache 2.0
Dataset size: 10K<n<100K
Task category: text generation

Dataset structure

Features

No.: integer
instruction: string
input: string
output: string
avg_similarity_score: float
similar_instructions: list, with the following sub-features:
- instruction: string
- similarity: float
index: integer
clean: string
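
The feature list above can be written down as a typed record. This is a sketch using only the standard library, not the `datasets` feature API; `Record`, `SimilarInstruction`, and `has_expected_fields` are illustrative names, and the example row is invented for illustration, not taken from the dataset.

```python
from typing import List, TypedDict


class SimilarInstruction(TypedDict):
    instruction: str
    similarity: float


# Functional syntax is needed because "No." is not a valid Python identifier.
Record = TypedDict(
    "Record",
    {
        "No.": int,
        "instruction": str,
        "input": str,
        "output": str,
        "avg_similarity_score": float,
        "similar_instructions": List[SimilarInstruction],
        "index": int,
        "clean": str,
    },
)


def has_expected_fields(row: dict) -> bool:
    """Check that a row has exactly the field names listed for the dataset."""
    return set(row) == set(Record.__annotations__)


# Invented example row (values are placeholders, not real data).
EXAMPLE_ROW = {
    "No.": 1,
    "instruction": "...",
    "input": "",
    "output": "...",
    "avg_similarity_score": 0.0,
    "similar_instructions": [{"instruction": "...", "similarity": 0.0}],
    "index": 0,
    "clean": "...",
}
```

The check compares field names only; it does not validate value types.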

Data splits

v1.0_cleaned: 19707929 bytes, 10960 examples
_archive_v1.0: 19814564 bytes, 11024 examples
_archive_v0.2_cleaned: 16175512 bytes, 9066 examples
_archive_v0.2: 16435602 bytes, 9221 examples
_archive_v0.1_cleaned: 7005088 bytes, 3910 examples
_archive_v0.1: 15317196 bytes, 8629 examples
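
The example counts above show how much each cleaning pass kept. A short worked calculation, assuming each "_cleaned" split is a filtered subset of the raw split of the same version (the card implies this but does not state it):

```python
# (raw examples, cleaned examples) per version, copied from the split list above.
SPLIT_COUNTS = {
    "v1.0": (11024, 10960),
    "v0.2": (9221, 9066),
    "v0.1": (8629, 3910),
}


def retention_rate(raw: int, cleaned: int) -> float:
    """Fraction of raw examples that remain in the cleaned split."""
    return cleaned / raw


for version, (raw, cleaned) in SPLIT_COUNTS.items():
    print(f"{version}: kept {cleaned}/{raw} = {retention_rate(raw, cleaned):.1%}")
```

Under that assumption, v1.0 and v0.2 each kept roughly 98 to 99 percent of their records, while v0.1 kept about 45 percent.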

Dataset size

Download size: 21343164 bytes
Dataset size: 94455891 bytes
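
As a consistency check on the figures above, the per-split byte counts should add up to the reported dataset size. A small sketch, with all numbers copied from this page:

```python
# Byte counts per split, copied from the split list.
SPLIT_BYTES = {
    "v1.0_cleaned": 19707929,
    "_archive_v1.0": 19814564,
    "_archive_v0.2_cleaned": 16175512,
    "_archive_v0.2": 16435602,
    "_archive_v0.1_cleaned": 7005088,
    "_archive_v0.1": 15317196,
}
DATASET_SIZE = 94455891
DOWNLOAD_SIZE = 21343164

total_bytes = sum(SPLIT_BYTES.values())
# The reported dataset size is the sum over all six splits.
assert total_bytes == DATASET_SIZE

# Ratio of the download size to the total dataset size.
download_ratio = DOWNLOAD_SIZE / DATASET_SIZE
print(f"total = {total_bytes} bytes, download/dataset = {download_ratio:.1%}")
```

The download is about 23 percent of the dataset size, which covers all six splits; loading only `v1.0_cleaned` involves a fraction of that.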

Configuration

config_name: default
data_files:
- v1.0_cleaned: data/v1.0_cleaned-*
- _archive_v1.0: data/_archive_v1.0-*
- _archive_v0.2_cleaned: data/_archive_v0.2_cleaned-*
- _archive_v0.2: data/_archive_v0.2-*
- _archive_v0.1_cleaned: data/_archive_v0.1_cleaned-*
- _archive_v0.1: data/_archive_v0.1-*
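
Each split is tied to its files by a glob pattern. The sketch below uses the standard-library `fnmatch` module as a stand-in for whatever pattern resolution the hosting library performs, and the shard file names in the usage note are hypothetical. It illustrates why the trailing hyphen matters: `data/_archive_v0.2-*` does not match files of `_archive_v0.2_cleaned`.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Split name -> glob pattern, copied from the configuration above.
DATA_FILES = {
    "v1.0_cleaned": "data/v1.0_cleaned-*",
    "_archive_v1.0": "data/_archive_v1.0-*",
    "_archive_v0.2_cleaned": "data/_archive_v0.2_cleaned-*",
    "_archive_v0.2": "data/_archive_v0.2-*",
    "_archive_v0.1_cleaned": "data/_archive_v0.1_cleaned-*",
    "_archive_v0.1": "data/_archive_v0.1-*",
}


def split_for(path: str) -> Optional[str]:
    """Return the split whose pattern matches `path`, or None if zero or several match."""
    matches = [name for name, pattern in DATA_FILES.items() if fnmatchcase(path, pattern)]
    # Each file is expected to belong to exactly one split.
    return matches[0] if len(matches) == 1 else None
```

For a hypothetical shard named `data/_archive_v0.2_cleaned-00000-of-00001.parquet`, `split_for` returns `"_archive_v0.2_cleaned"` only, because the raw split's pattern requires a hyphen directly after `v0.2`.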