rishiraj/portuguesechat
收藏Hugging Face2023-11-28 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/rishiraj/portuguesechat
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: prompt
dtype: string
- name: prompt_id
dtype: string
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: category
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 30628039
num_examples: 9500
- name: test
num_bytes: 1644450
num_examples: 500
download_size: 19873853
dataset_size: 32272489
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
- split: test
path: data/test-*
task_categories:
- conversational
- text-generation
language:
- pt
pretty_name: Portuguese Chat
license: cc-by-nc-4.0
---
# Dataset Card for Portuguese Chat
We know that current English-first LLMs don’t work well for many other languages, both in terms of performance, latency, and speed. Building instruction datasets for non-English languages is an important challenge that needs to be solved.
Dedicated towards addressing this problem, I release 3 new datasets [rishiraj/portuguesechat](https://huggingface.co/datasets/rishiraj/portuguesechat/), [rishiraj/bengalichat](https://huggingface.co/datasets/rishiraj/bengalichat/) & [rishiraj/hindichat](https://huggingface.co/datasets/rishiraj/hindichat/) of 10,000 instructions and demonstrations each. This data can be used for supervised fine-tuning (SFT) to make language multilingual models follow instructions better.
### Dataset Summary
[rishiraj/portuguesechat](https://huggingface.co/datasets/rishiraj/portuguesechat/) was modelled after the instruction dataset described in OpenAI's [InstructGPT paper](https://huggingface.co/papers/2203.02155), and is translated from [HuggingFaceH4/no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots/) which comprised mostly of single-turn instructions across the following categories:
| Category | Count |
|:-----------|--------:|
| Generation | 4560 |
| Open QA | 1240 |
| Brainstorm | 1120 |
| Chat | 850 |
| Rewrite | 660 |
| Summarize | 420 |
| Coding | 350 |
| Classify | 350 |
| Closed QA | 260 |
| Extract | 190 |
### Languages
The data in [rishiraj/portuguesechat](https://huggingface.co/datasets/rishiraj/portuguesechat/) are in Portuguese (BCP-47 pt).
### Data Fields
The data fields are as follows:
* `prompt`: Describes the task the model should perform.
* `prompt_id`: A unique ID for the prompt.
* `messages`: An array of messages, where each message indicates the role (system, user, assistant) and the content.
* `category`: Which category the example belongs to (e.g. `Chat` or `Coding`).
* `text`: Content of `messages` in a format that is compatible with dataset_text_field of SFTTrainer.
### Data Splits
| | train_sft | test_sft |
|---------------|------:| ---: |
| portuguesechat | 9500 | 500 |
### Licensing Information
The dataset is available under the [Creative Commons NonCommercial (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/legalcode).
### Citation Information
```
@misc{portuguesechat,
author = {Rishiraj Acharya},
title = {Portuguese Chat},
year = {2023},
publisher = {Hugging Face},
journal = {Hugging Face repository},
howpublished = {\url{https://huggingface.co/datasets/rishiraj/portuguesechat}}
}
```
提供机构:
rishiraj
原始信息汇总
数据集卡片 - Portuguese Chat
数据集概述
数据集摘要
rishiraj/portuguesechat 是基于 OpenAI 的 InstructGPT 论文 描述的指令数据集构建的,并从 HuggingFaceH4/no_robots 翻译而来,主要包含以下类别的单轮指令:
| 类别 | 数量 |
|---|---|
| Generation | 4560 |
| Open QA | 1240 |
| Brainstorm | 1120 |
| Chat | 850 |
| Rewrite | 660 |
| Summarize | 420 |
| Coding | 350 |
| Classify | 350 |
| Closed QA | 260 |
| Extract | 190 |
语言
数据集中的数据为葡萄牙语(BCP-47 pt)。
数据字段
数据字段如下:
prompt: 描述模型应执行的任务。prompt_id: 提示的唯一ID。messages: 消息数组,每个消息指示角色(系统、用户、助手)和内容。category: 示例所属的类别(例如Chat或Coding)。text:messages的内容,格式兼容 SFTTrainer 的 dataset_text_field。
数据分割
| train_sft | test_sft | |
|---|---|---|
| portuguesechat | 9500 | 500 |
许可信息
数据集在 Creative Commons NonCommercial (CC BY-NC 4.0) 许可下可用。
引用信息
@misc{portuguesechat, author = {Rishiraj Acharya}, title = {Portuguese Chat}, year = {2023}, publisher = {Hugging Face}, journal = {Hugging Face repository}, howpublished = {url{https://huggingface.co/datasets/rishiraj/portuguesechat}} }



