rishiraj/portuguesechat

Name: rishiraj/portuguesechat
Creator: rishiraj
Published: 2023-11-28 06:19:44
License: 暂无描述

Hugging Face2023-11-28 更新2024-03-04 收录

下载链接：

https://hf-mirror.com/datasets/rishiraj/portuguesechat

下载链接

链接失效反馈

官方服务：

资源简介：

--- dataset_info: features: - name: prompt dtype: string - name: prompt_id dtype: string - name: messages list: - name: content dtype: string - name: role dtype: string - name: category dtype: string - name: text dtype: string splits: - name: train num_bytes: 30628039 num_examples: 9500 - name: test num_bytes: 1644450 num_examples: 500 download_size: 19873853 dataset_size: 32272489 configs: - config_name: default data_files: - split: train path: data/train-* - split: test path: data/test-* task_categories: - conversational - text-generation language: - pt pretty_name: Portuguese Chat license: cc-by-nc-4.0 --- # Dataset Card for Portuguese Chat We know that current English-first LLMs don’t work well for many other languages, both in terms of performance, latency, and speed. Building instruction datasets for non-English languages is an important challenge that needs to be solved. Dedicated towards addressing this problem, I release 3 new datasets [rishiraj/portuguesechat](https://huggingface.co/datasets/rishiraj/portuguesechat/), [rishiraj/bengalichat](https://huggingface.co/datasets/rishiraj/bengalichat/) & [rishiraj/hindichat](https://huggingface.co/datasets/rishiraj/hindichat/) of 10,000 instructions and demonstrations each. This data can be used for supervised fine-tuning (SFT) to make language multilingual models follow instructions better. ### Dataset Summary [rishiraj/portuguesechat](https://huggingface.co/datasets/rishiraj/portuguesechat/) was modelled after the instruction dataset described in OpenAI's [InstructGPT paper](https://huggingface.co/papers/2203.02155), and is translated from [HuggingFaceH4/no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots/) which comprised mostly of single-turn instructions across the following categories: | Category | Count | |:-----------|--------:| | Generation | 4560 | | Open QA | 1240 | | Brainstorm | 1120 | | Chat | 850 | | Rewrite | 660 | | Summarize | 420 | | Coding | 350 | | Classify | 350 | | Closed QA | 260 | | Extract | 190 | ### Languages The data in [rishiraj/portuguesechat](https://huggingface.co/datasets/rishiraj/portuguesechat/) are in Portuguese (BCP-47 pt). ### Data Fields The data fields are as follows: * `prompt`: Describes the task the model should perform. * `prompt_id`: A unique ID for the prompt. * `messages`: An array of messages, where each message indicates the role (system, user, assistant) and the content. * `category`: Which category the example belongs to (e.g. `Chat` or `Coding`). * `text`: Content of `messages` in a format that is compatible with dataset_text_field of SFTTrainer. ### Data Splits | | train_sft | test_sft | |---------------|------:| ---: | | portuguesechat | 9500 | 500 | ### Licensing Information The dataset is available under the [Creative Commons NonCommercial (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/legalcode). ### Citation Information ``` @misc{portuguesechat, author = {Rishiraj Acharya}, title = {Portuguese Chat}, year = {2023}, publisher = {Hugging Face}, journal = {Hugging Face repository}, howpublished = {\url{https://huggingface.co/datasets/rishiraj/portuguesechat}} } ```

提供机构：

rishiraj

原始信息汇总

数据集卡片 - Portuguese Chat

数据集概述

数据集摘要

rishiraj/portuguesechat 是基于 OpenAI 的 InstructGPT 论文描述的指令数据集构建的，并从 HuggingFaceH4/no_robots 翻译而来，主要包含以下类别的单轮指令：

类别	数量
Generation	4560
Open QA	1240
Brainstorm	1120
Chat	850
Rewrite	660
Summarize	420
Coding	350
Classify	350
Closed QA	260
Extract	190

语言

数据集中的数据为葡萄牙语（BCP-47 pt）。

数据字段

数据字段如下：

prompt: 描述模型应执行的任务。
prompt_id: 提示的唯一ID。
messages: 消息数组，每个消息指示角色（系统、用户、助手）和内容。
category: 示例所属的类别（例如 Chat 或 Coding）。
text: messages 的内容，格式兼容 SFTTrainer 的 dataset_text_field。

数据分割

	train_sft	test_sft
portuguesechat	9500	500

许可信息

数据集在 Creative Commons NonCommercial (CC BY-NC 4.0) 许可下可用。

引用信息

@misc{portuguesechat, author = {Rishiraj Acharya}, title = {Portuguese Chat}, year = {2023}, publisher = {Hugging Face}, journal = {Hugging Face repository}, howpublished = {url{https://huggingface.co/datasets/rishiraj/portuguesechat}} }

5,000+

优质数据集

54 个

任务类型

进入经典数据集