five

rishiraj/portuguesechat

收藏
Hugging Face2023-11-28 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/rishiraj/portuguesechat
下载链接
链接失效反馈
官方服务:
资源简介:
--- dataset_info: features: - name: prompt dtype: string - name: prompt_id dtype: string - name: messages list: - name: content dtype: string - name: role dtype: string - name: category dtype: string - name: text dtype: string splits: - name: train num_bytes: 30628039 num_examples: 9500 - name: test num_bytes: 1644450 num_examples: 500 download_size: 19873853 dataset_size: 32272489 configs: - config_name: default data_files: - split: train path: data/train-* - split: test path: data/test-* task_categories: - conversational - text-generation language: - pt pretty_name: Portuguese Chat license: cc-by-nc-4.0 --- # Dataset Card for Portuguese Chat We know that current English-first LLMs don’t work well for many other languages, both in terms of performance, latency, and speed. Building instruction datasets for non-English languages is an important challenge that needs to be solved. Dedicated towards addressing this problem, I release 3 new datasets [rishiraj/portuguesechat](https://huggingface.co/datasets/rishiraj/portuguesechat/), [rishiraj/bengalichat](https://huggingface.co/datasets/rishiraj/bengalichat/) & [rishiraj/hindichat](https://huggingface.co/datasets/rishiraj/hindichat/) of 10,000 instructions and demonstrations each. This data can be used for supervised fine-tuning (SFT) to make language multilingual models follow instructions better. ### Dataset Summary [rishiraj/portuguesechat](https://huggingface.co/datasets/rishiraj/portuguesechat/) was modelled after the instruction dataset described in OpenAI's [InstructGPT paper](https://huggingface.co/papers/2203.02155), and is translated from [HuggingFaceH4/no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots/) which comprised mostly of single-turn instructions across the following categories: | Category | Count | |:-----------|--------:| | Generation | 4560 | | Open QA | 1240 | | Brainstorm | 1120 | | Chat | 850 | | Rewrite | 660 | | Summarize | 420 | | Coding | 350 | | Classify | 350 | | Closed QA | 260 | | Extract | 190 | ### Languages The data in [rishiraj/portuguesechat](https://huggingface.co/datasets/rishiraj/portuguesechat/) are in Portuguese (BCP-47 pt). ### Data Fields The data fields are as follows: * `prompt`: Describes the task the model should perform. * `prompt_id`: A unique ID for the prompt. * `messages`: An array of messages, where each message indicates the role (system, user, assistant) and the content. * `category`: Which category the example belongs to (e.g. `Chat` or `Coding`). * `text`: Content of `messages` in a format that is compatible with dataset_text_field of SFTTrainer. ### Data Splits | | train_sft | test_sft | |---------------|------:| ---: | | portuguesechat | 9500 | 500 | ### Licensing Information The dataset is available under the [Creative Commons NonCommercial (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/legalcode). ### Citation Information ``` @misc{portuguesechat, author = {Rishiraj Acharya}, title = {Portuguese Chat}, year = {2023}, publisher = {Hugging Face}, journal = {Hugging Face repository}, howpublished = {\url{https://huggingface.co/datasets/rishiraj/portuguesechat}} } ```
提供机构:
rishiraj
原始信息汇总

数据集卡片 - Portuguese Chat

数据集概述

数据集摘要

rishiraj/portuguesechat 是基于 OpenAI 的 InstructGPT 论文 描述的指令数据集构建的,并从 HuggingFaceH4/no_robots 翻译而来,主要包含以下类别的单轮指令:

类别 数量
Generation 4560
Open QA 1240
Brainstorm 1120
Chat 850
Rewrite 660
Summarize 420
Coding 350
Classify 350
Closed QA 260
Extract 190

语言

数据集中的数据为葡萄牙语(BCP-47 pt)。

数据字段

数据字段如下:

  • prompt: 描述模型应执行的任务。
  • prompt_id: 提示的唯一ID。
  • messages: 消息数组,每个消息指示角色(系统、用户、助手)和内容。
  • category: 示例所属的类别(例如 ChatCoding)。
  • text: messages 的内容,格式兼容 SFTTrainer 的 dataset_text_field。

数据分割

train_sft test_sft
portuguesechat 9500 500

许可信息

数据集在 Creative Commons NonCommercial (CC BY-NC 4.0) 许可下可用。

引用信息

@misc{portuguesechat, author = {Rishiraj Acharya}, title = {Portuguese Chat}, year = {2023}, publisher = {Hugging Face}, journal = {Hugging Face repository}, howpublished = {url{https://huggingface.co/datasets/rishiraj/portuguesechat}} }

5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作