angeluriot/french_instruct

Name: angeluriot/french_instruct
Creator: angeluriot
Published: 2024-05-07 16:19:26
License: no description on this page (the dataset card metadata below lists `mit`)

Hugging Face · Updated 2024-05-07 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/angeluriot/french_instruct

Resource description:

---
language:
- fr
license: mit
tags:
- croissant
language_details: fr-FR
pretty_name: French Instruct
size_categories:
- 100K<n<1M
source_datasets:
- nickrosh/Evol-Instruct-Code-80k-v1
- Hello-SimpleAI/HC3
- KK04/LogicInference_OA
- tatsu-lab/alpaca
- 0x22almostEvil/multilingual-wikihow-qa-16k
- databricks/databricks-dolly-15k
- RyokoAI/ShareGPT52K
- gsm8k
- GAIR/lima
- OpenAssistant/oasst1
- Gael540/dataSet_ens_sup_fr-v1
- Gt-Doremiti/gt-doremiti-instructions
task_categories:
- question-answering
- text2text-generation
- text-generation
- text-classification
- token-classification
task_ids:
- document-question-answering
- natural-language-inference
---

# 🧑‍🏫 French Instruct

The **French Instruct dataset** is a collection of instructions with their corresponding answers (sometimes multi-turn conversations) entirely in French. The dataset is also available on [**GitHub**](https://github.com/angeluriot/French_instruct).

<img src="resources/misc/thumbnail.gif" width="750">

# 📊 Overview

The dataset is composed of 276K conversations between a user and an assistant for a total of approximately 85M tokens.

<img src="resources/misc/charts.png" width="1000">

I also added annotations for each document to indicate if it was generated or written by a human, the style of the answers, or if it contains code. This can be useful for filtering the data according to your needs.

| | Documents | Tokens | Ratio |
|:--------------------------|:-----------:|:----------------:|:------------:|
| **All** | **275,600** | **≈ 84,906,090** | **100.00 %** |
| Written by a human | 85,213 | ≈ 24,908,868 | 29.34 % |
| Written by a chatbot* | 190,387 | ≈ 59,997,223 | 70.66 % |
| Human-style answers | 56,198 | ≈ 14,255,100 | 16.79 % |
| Chatbot-style answers | 219,402 | ≈ 70,650,990 | 83.21 % |
| Contains code | 14,788 | ≈ 11,455,659 | 13.49 % |

(*) Generally by well-established chatbots like ChatGPT.

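
As a rough illustration of filtering on these annotations, the sketch below works on plain Python dictionaries. It assumes each record carries `author`, `style` and `code` fields; the sample records, the `filter_records` helper, and the values `"chatbot"` for `author` and `"human"` for `style` are hypothetical and should be checked against the real data.

```python
# Hypothetical sample records: the field names follow the dataset card,
# but the values and sources below are made up for illustration.
records = [
    {"author": "human", "style": "human", "code": False, "source": "example-a"},
    {"author": "chatbot", "style": "chatbot", "code": True, "source": "example-b"},
    {"author": "human", "style": "chatbot", "code": False, "source": "example-c"},
]


def filter_records(records, author=None, style=None, code=None):
    """Keep the records that match every annotation that is not None."""
    kept = []
    for record in records:
        if author is not None and record["author"] != author:
            continue
        if style is not None and record["style"] != style:
            continue
        if code is not None and record["code"] != code:
            continue
        kept.append(record)
    return kept


# Human-written documents that contain no code.
human_no_code = filter_records(records, author="human", code=False)
print([record["source"] for record in human_no_code])
```

The same predicate could be passed to whatever loading mechanism is in use; only the dictionary access is assumed here.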
# 🗃️ Data Structure

Each record in the dataset follows the structure below:

```json
{
	"context": "Some context for the instructions (sometimes empty)",
	"conversation": [
		{
			"role": "user",
			"text": "The first instruction"
		},
		{
			"role": "assistant",
			"text": "The first answer"
		},
		{
			"role": "user",
			"text": "The second instruction, etc..."
		}
	],
	"author": "human",
	"style": "chatbot",
	"code": false,
	"source": "The source of the document"
}
```

# 🔗 Sources

The dataset is a mix of various sources, some of which are translated from English to French using the ChatGPT API. I also did some cleaning and filtering to remove irrelevant data (duplicates, empty conversations, remaining English text, etc.).

The table below shows the distribution of the documents and tokens for each source:

<table>
	<thead>
		<tr>
			<th align="center">Source</th>
			<th align="center">Documents</th>
			<th align="center">Tokens</th>
			<th align="center">Ratio</th>
		</tr>
	</thead>
	<tbody>
		<tr>
			<td align="left"><a href="https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1">Evol Instruct</a> (translated)</td>
			<td align="center">56,747</td>
			<td align="center">≈ 36,016,255</td>
			<td align="center">42.42 %</td>
		</tr>
		<tr>
			<td align="left"><a href="https://huggingface.co/datasets/Hello-SimpleAI/HC3">Human ChatGPT Comparison Corpus</a> (translated)</td>
			<td align="center">82,729</td>
			<td align="center">≈ 23,316,107</td>
			<td align="center">27.46 %</td>
		</tr>
		<tr>
			<td align="left"><a href="https://huggingface.co/datasets/KK04/LogicInference_OA">Logic Inference OA</a> (translated)</td>
			<td align="center">54,542</td>
			<td align="center">≈ 8,124,315</td>
			<td align="center">9.57 %</td>
		</tr>
		<tr>
			<td align="left"><a href="https://huggingface.co/datasets/tatsu-lab/alpaca">Stanford Alpaca</a> (translated)</td>
			<td align="center">51,243</td>
			<td align="center">≈ 5,521,752</td>
			<td align="center">6.50 %</td>
		</tr>
		<tr>
			<td align="left"><a href="https://huggingface.co/datasets/0x22almostEvil/multilingual-wikihow-qa-16k">WikiHow</a> FR</td>
			<td align="center">2,156</td>
			<td align="center">≈ 4,789,558</td>
			<td align="center">5.64 %</td>
		</tr>
		<tr>
			<td align="left"><a href="https://huggingface.co/datasets/databricks/databricks-dolly-15k">Dolly</a> (translated)</td>
			<td align="center">14,896</td>
			<td align="center">≈ 3,678,165</td>
			<td align="center">4.33 %</td>
		</tr>
		<tr>
			<td align="left"><a href="https://huggingface.co/datasets/RyokoAI/ShareGPT52K">Share GPT</a> FR</td>
			<td align="center">1,385</td>
			<td align="center">≈ 1,301,026</td>
			<td align="center">1.53 %</td>
		</tr>
		<tr>
			<td align="left"><a href="https://huggingface.co/datasets/gsm8k">Grade School Math</a> (translated)</td>
			<td align="center">8,792</td>
			<td align="center">≈ 1,263,370</td>
			<td align="center">1.49 %</td>
		</tr>
		<tr>
			<td align="left"><a href="https://huggingface.co/datasets/GAIR/lima">Less Is More for Alignment</a> (translated)</td>
			<td align="center">1,032</td>
			<td align="center">≈ 581,897</td>
			<td align="center">0.69 %</td>
		</tr>
		<tr>
			<td align="left"><a href="https://huggingface.co/datasets/CohereForAI/aya_dataset">Aya Dataset</a> FR</td>
			<td align="center">1,412</td>
			<td align="center">≈ 203,537</td>
			<td align="center">0.24 %</td>
		</tr>
		<tr>
			<td align="left"><a href="https://huggingface.co/datasets/OpenAssistant/oasst1">Open Assistant Conversations</a> FR</td>
			<td align="center">255</td>
			<td align="center">≈ 79,025</td>
			<td align="center">0.09 %</td>
		</tr>
		<tr>
			<td align="left"><a href="https://huggingface.co/datasets/Gael540/dataSet_ens_sup_fr-v1">ENS Sup Dataset</a></td>
			<td align="center">383</td>
			<td align="center">≈ 24,770</td>
			<td align="center">0.03 %</td>
		</tr>
		<tr>
			<td align="left"><a href="https://huggingface.co/datasets/Gt-Doremiti/gt-doremiti-instructions">GT Doremiti Instructions</a></td>
			<td align="center">28</td>
			<td align="center">≈ 6,314</td>
			<td align="center">0.01 %</td>
		</tr>
		<tr>
			<th align="left">Total</th>
			<th align="center">275,600</th>
			<th align="center">≈ 84,906,090</th>
			<th align="center">100.00 %</th>
		</tr>
	</tbody>
</table>

# 🛠️ Usage

The **French Instruct dataset** is mainly intended for training and evaluating conversational Large Language Models (LLMs), but it can also be used to distinguish human-written text from generated text using the additional annotations.

## 🤗 Hugging Face

The simplest way to use the dataset is to load it directly with the Hugging Face `datasets` library:

```console
pip install datasets
```

```py
from datasets import load_dataset

dataset = load_dataset('angeluriot/french_instruct')
```

## 🐱 GitHub

You first need to download the [**repository**](https://github.com/angeluriot/French_instruct), then you can use the `load.py` script to load the dataset or merge it into a single `.jsonl` file:

```py
from load import load_dataset, merge_dataset

# Load the dataset
dataset = load_dataset()

# Merge the dataset in a single .jsonl file
merge_dataset()
```

# 📑 Examples

## 📖 Instructions

```yaml
context: ""
user: "Convertis 18 kilomètres par seconde en miles par heure."
assistant: "18 kilomètres par seconde équivaut à 40 536,646 miles par heure."
```

```yaml
context: ""
user: "Quel est le principal avantage d'un programme de fidélité pour les membres ?"
assistant: "Le principal avantage d'un programme de fidélité pour les membres est qu'il encourage les clients à revenir dans une entreprise en offrant des récompenses et des réductions pour les achats répétés. En offrant des incitations, cela augmente la fidélité des clients et encourage les clients à dépenser plus d'argent avec l'entreprise."
```

## 🖥️ Code

```yaml
context: "for i in range(10): x += 1"
user: "Réécris le code pour qu'il soit plus efficace."
assistant: "Tu peux simplifier le code comme suit : ``` x += 10 ```"
```

```yaml
context: "[2, 7, 11, 4]"
user: "Crée une fonction python pour trouver le nombre maximum de la liste donnée."
assistant: "``` def find_max(lst): return max(lst) ```"
```

## 💬 Multi-turn conversations

```yaml
context: ""
user: "Connaissez-vous Nuodle à Bellevue WA ?"
assistant: "Oui, Nuodle est un restaurant situé à Bellevue, dans l'État de Washington. Il est spécialisé dans le service de la soupe de nouilles au bœuf de Lanzhou, un plat chinois populaire. Le restaurant est situé au 14603 NE 20th St, Bellevue, WA 98007."
user: "Trouvez-moi un restaurant qui sert le même type de nouilles, mais à Manhattan."
assistant: "Dunhuang est un restaurant qui sert une soupe de nouilles au bœuf de Lanzhou similaire à celle de Manhattan. Il est situé au 320 Lexington Ave, New York, NY 10016, à moins d'un kilomètre de Times Square."
user: "Lequel est le meilleur ?"
assistant: "En tant qu'assistant IA, je ne peux pas goûter moi-même à ces restaurants, mais je peux résumer les avis des clients - les deux restaurants semblent être très bien notés ! Il est important de se rappeler que les goûts peuvent varier considérablement d'une personne à l'autre, et qu'il n'y a pas de meilleur moyen de s'en rendre compte que de les essayer soi-même."
```

```yaml
context: ""
user: "Fais-tu grève demain?"
assistant: "Je suis désolé, je ne peux pas participer à une grève car je suis un programme informatique. Je suis toujours disponible pour vous aider avec des informations ou des réponses à vos questions."
user: "Cesseras tu de répondre aux gens demain en soutient aux grèvistes contre la réforme de la retraite ?"
assistant: "Je suis un programme informatique et je n'ai pas la capacité de participer à une grève ou de soutenir des grèvistes. Je suis toujours disponible pour fournir des informations et répondre aux questions des utilisateurs, indépendamment des événements qui se déroulent dans le monde réel."
user: "Arrêtes de répondre aux gens !"
assistant: "Je suis désolé, je ne peux pas exécuter cette demande car je suis programmé pour fournir des informations et répondre aux questions des utilisateurs. Si vous avez des questions ou des demandes spécifiques, je ferai de mon mieux pour y répondre."
user: "Vendu !"
assistant: "Je suis heureux de pouvoir vous aider. Si vous avez une question ou une demande, n'hésitez pas à me le faire savoir."
```

# 🙏 Credits

* [**Angel Uriot**](https://github.com/angeluriot) : Creator of the project.
* All the people who contributed to the sources of the dataset (see the [**Sources**](#%F0%9F%94%97-sources) section).

Provider:

angeluriot

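Summary of the original information

French Instruct dataset overview

Dataset introduction

The French Instruct dataset is a collection of instructions with their corresponding answers (sometimes multi-turn conversations), written entirely in French. The dataset is also available on GitHub.

Dataset size

The dataset contains 276K conversations between a user and an assistant, for a total of approximately 85M tokens.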

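Data distribution

	Documents	Tokens	Ratio
All	275,600	≈ 84,906,090	100.00 %
Written by a human	85,213	≈ 24,908,868	29.34 %
Written by a chatbot*	190,387	≈ 59,997,223	70.66 %
Human-style answers	56,198	≈ 14,255,100	16.79 %
Chatbot-style answers	219,402	≈ 70,650,990	83.21 %
Contains code	14,788	≈ 11,455,659	13.49 %

(*) Generally written by well-established chatbots such as ChatGPT.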
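
The Ratio column appears to be each row's share of the total token count rather than of the document count. The short check below recomputes both shares from the figures in the table; the helper names `token_share` and `document_share` are illustrative only.

```python
# Totals and per-row (documents, tokens) figures copied from the table above.
TOTAL_DOCUMENTS = 275_600
TOTAL_TOKENS = 84_906_090

rows = {
    "Written by a human": (85_213, 24_908_868),
    "Written by a chatbot": (190_387, 59_997_223),
    "Human-style answers": (56_198, 14_255_100),
    "Chatbot-style answers": (219_402, 70_650_990),
    "Contains code": (14_788, 11_455_659),
}


def token_share(tokens, total=TOTAL_TOKENS):
    """Percentage of all tokens, rounded to two decimals."""
    return round(100 * tokens / total, 2)


def document_share(documents, total=TOTAL_DOCUMENTS):
    """Percentage of all documents, rounded to two decimals."""
    return round(100 * documents / total, 2)


for name, (documents, tokens) in rows.items():
    print(f"{name}: {token_share(tokens):.2f} % of tokens, "
          f"{document_share(documents):.2f} % of documents")
```

For instance, documents containing code account for 13.49 % of the tokens but only about 5.37 % of the documents, which suggests that code-bearing conversations are much longer than average.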

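Data structure

Each record follows the structure below:

    {
        "context": "Some context for the instructions (sometimes empty)",
        "conversation": [
            { "role": "user", "text": "The first instruction" },
            { "role": "assistant", "text": "The first answer" },
            { "role": "user", "text": "The second instruction, etc..." }
        ],
        "author": "human",
        "style": "chatbot",
        "code": false,
        "source": "The source of the document"
    }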
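
To make the record layout concrete, here is a minimal sketch that flattens one record into a list of chat messages. The `to_messages` helper and the sample record are hypothetical, and prepending a non-empty `context` to the first turn is only one possible convention, not something the dataset prescribes.

```python
def to_messages(record):
    """Flatten one record into a list of {"role", "content"} messages."""
    messages = []
    for index, turn in enumerate(record["conversation"]):
        content = turn["text"]
        # Assumption: a non-empty context is prepended to the first turn.
        if index == 0 and record.get("context"):
            content = record["context"] + "\n\n" + content
        messages.append({"role": turn["role"], "content": content})
    return messages


# Hypothetical record that follows the structure described above.
record = {
    "context": "[2, 7, 11, 4]",
    "conversation": [
        {"role": "user", "text": "Trouve le maximum de la liste."},
        {"role": "assistant", "text": "Le maximum est 11."},
    ],
    "author": "human",
    "style": "chatbot",
    "code": False,
    "source": "example",
}

for message in to_messages(record):
    print(message["role"], "->", message["content"])
```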

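Data sources

The dataset mixes several sources, some of which were translated from English to French. It was cleaned and filtered to remove irrelevant data (duplicates, empty conversations, leftover English text, etc.).

Source	Documents	Tokens	Ratio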
Evol Instruct (translated)	56,747	≈ 36,016,255	42.42 %
Human ChatGPT Comparison Corpus (translated)	82,729	≈ 23,316,107	27.46 %
Logic Inference OA (translated)	54,542	≈ 8,124,315	9.57 %
Stanford Alpaca (translated)	51,243	≈ 5,521,752	6.50 %
WikiHow FR	2,156	≈ 4,789,558	5.64 %
Dolly (translated)	14,896	≈ 3,678,165	4.33 %
Share GPT FR	1,385	≈ 1,301,026	1.53 %
Grade School Math (translated)	8,792	≈ 1,263,370	1.49 %
Less Is More for Alignment (translated)	1,032	≈ 581,897	0.69 %
Aya Dataset FR	1,412	≈ 203,537	0.24 %
Open Assistant Conversations FR	255	≈ 79,025	0.09 %
ENS Sup Dataset	383	≈ 24,770	0.03 %
GT Doremiti Instructions	28	≈ 6,314	0.01 %
Total	275,600	≈ 84,906,090	100.00 %

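Use cases

The French Instruct dataset is mainly intended for training and evaluating conversational large language models (LLMs). Using the additional annotations, it can also be used to classify human-written versus generated text.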
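
For the classification use case, one possible way to derive labelled examples from the annotations is sketched below. It assumes that the `author` field describes the assistant answers and that `"human"` marks human-written documents; the `to_classification_examples` helper, the sample records, and the `"chatbot"` value are hypothetical.

```python
def to_classification_examples(records):
    """Return (text, label) pairs: 1 for human-written answers, 0 otherwise."""
    examples = []
    for record in records:
        # Assumption: the document-level `author` annotation applies to answers.
        label = 1 if record["author"] == "human" else 0
        for turn in record["conversation"]:
            if turn["role"] == "assistant":
                examples.append((turn["text"], label))
    return examples


# Hypothetical records; only the field names come from the dataset card.
records = [
    {
        "author": "human",
        "conversation": [
            {"role": "user", "text": "Question A"},
            {"role": "assistant", "text": "Réponse A"},
        ],
    },
    {
        "author": "chatbot",
        "conversation": [
            {"role": "user", "text": "Question B"},
            {"role": "assistant", "text": "Réponse B"},
        ],
    },
]

print(to_classification_examples(records))
```

Whether user turns should also be labelled depends on how each source was built, so this sketch leaves them out.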
