
HachiML/alpaca_jp_python

Source: Hugging Face. Updated 2024-05-20. Indexed 2024-06-15.
Download link:
https://hf-mirror.com/datasets/HachiML/alpaca_jp_python
Resource description:
---
language:
- ja
license: apache-2.0
size_categories:
- 10K<n<100K
task_categories:
- text-generation
dataset_info:
  features:
  - name: No.
    dtype: int64
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: avg_similarity_score
    dtype: float64
  - name: similar_instructions
    list:
    - name: instruction
      dtype: string
    - name: similarity
      dtype: float64
  - name: index
    dtype: int64
  - name: clean
    dtype: string
  splits:
  - name: v1.0_cleaned
    num_bytes: 19707929
    num_examples: 10960
  - name: _archive_v1.0
    num_bytes: 19814564
    num_examples: 11024
  - name: _archive_v0.2_cleaned
    num_bytes: 16175512
    num_examples: 9066
  - name: _archive_v0.2
    num_bytes: 16435602
    num_examples: 9221
  - name: _archive_v0.1_cleaned
    num_bytes: 7005088
    num_examples: 3910
  - name: _archive_v0.1
    num_bytes: 15317196
    num_examples: 8629
  download_size: 21343164
  dataset_size: 94455891
configs:
- config_name: default
  data_files:
  - split: v1.0_cleaned
    path: data/v1.0_cleaned-*
  - split: _archive_v1.0
    path: data/_archive_v1.0-*
  - split: _archive_v0.2_cleaned
    path: data/_archive_v0.2_cleaned-*
  - split: _archive_v0.2
    path: data/_archive_v0.2-*
  - split: _archive_v0.1_cleaned
    path: data/_archive_v0.1_cleaned-*
  - split: _archive_v0.1
    path: data/_archive_v0.1-*
tags:
- synthetic
- code
- python
- self-instruct
---

# alpaca_jp_python

alpaca_jp_python is synthetic data created with:

- the [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca/tree/main) method
- [mistralai/Mixtral-8x22B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1)

The model was accessed through [Deepinfra](https://deepinfra.com/mistralai/Mixtral-8x22B-Instruct-v0.1/api?example=openai-python).
Splits with "_cleaned" in their name have additionally been vetted by [mistralai/Mixtral-8x22B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1).

## Dataset Details

### Dataset Description

- **Curated by:** [HachiML](https://huggingface.co/HachiML)
- **Language(s) (NLP):** Japanese
- **License:** Apache 2.0
- **GitHub:** [Alpaca-jp](https://github.com/Hajime-Y/Alpaca-jp)

## Uses

```Python
# library
from datasets import load_dataset

# Loading the latest version (split) is recommended.
dataset = load_dataset("HachiML/alpaca_jp_python", split="v1.0_cleaned")
```

## Data Cleaning

The prompt given to Mixtral-8x22B-Instruct-v0.1 for this cleaning pass is shown below.

```Python
def create_prompt(instruction, input_data, output_data, programming_language="python"):
    """
    Build a judgement prompt from an instruction, its input data, and its output data.

    Args:
    instruction (str): The user's instruction
    input_data (str): Input data (empty string if the task has no input)
    output_data (str): Output data
    programming_language (str): Name of the programming language

    Returns:
    str: The generated prompt
    """
    # Tasks without input get a prompt that omits the Input field.
    if input_data == "":
        text = f"""Assess whether the following combination of instruction, and output is appropriate.
        1. The only natural language for instructions and output is Japanese.
        2. The task is related to {programming_language}.
        3. Verify that the input data matches the language and context of the instruction.
        4. Check the output data for:
            - Language consistency with the instruction and input.
            - Accuracy and relevance to the input.
            - Clarity without repetition or errors.
        \nInstruction: {instruction}\nOutput: {output_data}
        \nYour Judgement (Just answer: True or False. No need to explain the reason.):"""
    else:
        text = f"""Assess whether the following combination of instruction, input, and output is appropriate.
        1. The only natural language for instructions, input, and output is Japanese.
        2. The task is related to {programming_language}.
        3. Verify that the input data matches the language and context of the instruction.
        4. Check the output data for:
            - Language consistency with the instruction and input.
            - Accuracy and relevance to the input.
            - Clarity without repetition or errors.
        \nInstruction: {instruction}\nInput: {input_data}\nOutput: {output_data}
        \nYour Judgement (Just answer: True or False. No need to explain the reason.):"""
    return text
```

## Prompt for Data Generation

```
You are asked to come up with a set of 10 diverse coding task instructions related to python. These task instructions will be given to a GPT model and we will evaluate the GPT model for completing the instructions.

Here are the requirements:
1. Avoid using the same phrases for each instruction to maximize diversity.
2. The language used for the instruction also should be diverse. For example, you should combine questions with imperative instrucitons.
3. The type of instructions should be diverse. The list should include diverse types of tasks like generating code, explaining, fixing, refactoring, optimizing, translating, documenting, analyzing, completing, machine learning, data analyzing etc.
4. The natural language during instructions, inputs and outputs must be in Japanese. English must not be used. Comment text in the code must be in Japanese.
5. The instructions should be 1 to 2 sentences long. Either an imperative sentence or a question is permitted.
6. You should generate an appropriate input to the instruction. The input field should contain a specific example provided for the instruction. It should involve realistic data and should not contain simple placeholders. The input should provide substantial content to make the instruction challenging. The list should include diverse types of context like SQL database, csv, XML, image, text, sound etc.
7. Not all instructions require input. For example, the instruction shuch as "Create a function in Python that adds a, b" does not need to provide a specific context. In this case, we simply put "<noinput>" in the input field.
8. The output should be an appropriate response to the instruction and the input.

List of 10 tasks:
```
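The cleaning prompt asks the model to reply with only True or False, and the generation prompt marks input-free tasks with "<noinput>", while `create_prompt` tests for an empty string. The card does not show how replies were parsed or whether the marker was converted, so the following is only a minimal sketch of how those two steps might look. The helper names `normalize_input` and `parse_judgement` are hypothetical and are not part of the dataset's published code.

```python
from typing import Optional

# Placeholder named in the generation prompt for tasks without input.
NO_INPUT_MARKER = "<noinput>"


def normalize_input(raw_input: str) -> str:
    """Map the "<noinput>" marker to "" (assumed convention, not confirmed by the card)."""
    stripped = raw_input.strip()
    return "" if stripped.lower() == NO_INPUT_MARKER else stripped


def parse_judgement(response: str) -> Optional[bool]:
    """Read the first word of a model reply as True/False; return None if it is neither."""
    words = response.strip().lower().replace(".", " ").split()
    if not words:
        return None
    if words[0] == "true":
        return True
    if words[0] == "false":
        return False
    return None
```

Returning `None` for anything other than a clear True or False keeps ambiguous replies separate from rejections, so they can be retried or inspected rather than silently dropped.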

alpaca_jp_python is a synthetic dataset generated with the Stanford Alpaca method using mistralai/Mixtral-8x22B-Instruct-v0.1. It contains Japanese-language instruction data for programming tasks in Python. Each record has the fields No., instruction, input, output, avg_similarity_score, similar_instructions, index, and clean. The dataset is published in three versions (v0.1, v0.2, v1.0), each available as a raw split and a "_cleaned" split. Cleaning is done by prompting Mixtral-8x22B-Instruct-v0.1 to judge each record for Japanese-only natural language, relevance to Python, and consistency among instruction, input, and output. The dataset is intended for text-generation tasks, especially those related to Python programming.
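Provider:
HachiML
Summary of original information

Dataset overview

Basic information

  • Language: Japanese
  • License: Apache 2.0
  • Dataset size: 10K<n<100K
  • Task category: text generation

Dataset structure

Features

  • No.: integer (int64)
  • instruction: string
  • input: string
  • output: string
  • avg_similarity_score: float (float64)
  • similar_instructions: list, with the following sub-features:
    • instruction: string
    • similarity: float (float64)
  • index: integer (int64)
  • clean: string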
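The feature list can be written down as a typed record, which also makes it easy to filter on `avg_similarity_score`. The sketch below is illustrative only: the card does not define how the similarity scores are computed or what values `clean` takes, so the threshold filter, the helper name `keep_distinct`, and every value in `toy_records` are assumptions invented for this example.

```python
from typing import List, TypedDict

# Functional TypedDict syntax is needed because "No." is not a valid identifier.
SimilarInstruction = TypedDict(
    "SimilarInstruction", {"instruction": str, "similarity": float}
)

Record = TypedDict(
    "Record",
    {
        "No.": int,
        "instruction": str,
        "input": str,
        "output": str,
        "avg_similarity_score": float,
        "similar_instructions": List[SimilarInstruction],
        "index": int,
        "clean": str,
    },
)


def keep_distinct(records: List[Record], max_avg_similarity: float) -> List[Record]:
    """Keep records whose average similarity score is at or below the threshold."""
    return [r for r in records if r["avg_similarity_score"] <= max_avg_similarity]


# Toy records with made-up values, not taken from the dataset.
toy_records: List[Record] = [
    {
        "No.": 1,
        "instruction": "リストを逆順にする関数を書いてください。",
        "input": "",
        "output": "def reverse(xs):\n    return xs[::-1]",
        "avg_similarity_score": 0.21,
        "similar_instructions": [{"instruction": "リストを並べ替えてください。", "similarity": 0.30}],
        "index": 0,
        "clean": "True",
    },
    {
        "No.": 2,
        "instruction": "リストを逆順に並べ替える関数を作成してください。",
        "input": "",
        "output": "def reverse(xs):\n    return list(reversed(xs))",
        "avg_similarity_score": 0.74,
        "similar_instructions": [{"instruction": "リストを逆順にする関数を書いてください。", "similarity": 0.90}],
        "index": 1,
        "clean": "True",
    },
]
```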

Data splits

  • v1.0_cleaned: 19707929 bytes, 10960 examples
  • _archive_v1.0: 19814564 bytes, 11024 examples
  • _archive_v0.2_cleaned: 16175512 bytes, 9066 examples
  • _archive_v0.2: 16435602 bytes, 9221 examples
  • _archive_v0.1_cleaned: 7005088 bytes, 3910 examples
  • _archive_v0.1: 15317196 bytes, 8629 examples
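The example counts above show how much each cleaning pass removed. As a small worked calculation, using only the counts listed for each raw and cleaned pair (the function name `retention_rate` is chosen here for illustration):

```python
# Example counts copied from the split list: raw split vs. its "_cleaned" counterpart.
SPLIT_EXAMPLES = {
    "v1.0": {"raw": 11024, "cleaned": 10960},
    "v0.2": {"raw": 9221, "cleaned": 9066},
    "v0.1": {"raw": 8629, "cleaned": 3910},
}


def retention_rate(raw: int, cleaned: int) -> float:
    """Fraction of raw examples that survived cleaning."""
    return cleaned / raw


for version, counts in SPLIT_EXAMPLES.items():
    removed = counts["raw"] - counts["cleaned"]
    rate = retention_rate(counts["raw"], counts["cleaned"])
    print(f"{version}: removed {removed}, kept {rate:.1%}")
```

By these counts, cleaning removed 64 examples from v1.0 and 155 from v0.2, but 4719 from v0.1, where fewer than half of the raw examples were kept.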

Dataset size

  • Download size: 21343164 bytes
  • Dataset size: 94455891 bytes
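The reported dataset size is the sum of the per-split byte counts, which can be checked directly. The download size is smaller, presumably because the hosted files are stored in a compressed format; that explanation is an assumption, not something the listing states.

```python
# Per-split byte counts copied from the split list.
SPLIT_BYTES = {
    "v1.0_cleaned": 19707929,
    "_archive_v1.0": 19814564,
    "_archive_v0.2_cleaned": 16175512,
    "_archive_v0.2": 16435602,
    "_archive_v0.1_cleaned": 7005088,
    "_archive_v0.1": 15317196,
}
DOWNLOAD_SIZE = 21343164
DATASET_SIZE = 94455891

total_bytes = sum(SPLIT_BYTES.values())
download_ratio = DOWNLOAD_SIZE / DATASET_SIZE

print(f"sum of splits: {total_bytes} bytes")
print(f"matches dataset_size: {total_bytes == DATASET_SIZE}")
print(f"download / dataset size: {download_ratio:.3f}")
```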

Configuration

  • config_name: default
  • data_files:
    • v1.0_cleaned: data/v1.0_cleaned-*
    • _archive_v1.0: data/_archive_v1.0-*
    • _archive_v0.2_cleaned: data/_archive_v0.2_cleaned-*
    • _archive_v0.2: data/_archive_v0.2-*
    • _archive_v0.1_cleaned: data/_archive_v0.1_cleaned-*
    • _archive_v0.1: data/_archive_v0.1-*

Tags

  • synthetic
  • code
  • python
  • self-instruct