sunzeyeah/chinese_chatgpt_corpus

Name: sunzeyeah/chinese_chatgpt_corpus
Creator: sunzeyeah
Published: 2023-03-23 16:53:47
License: no description provided

Source: Hugging Face. Updated 2023-03-23; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/sunzeyeah/chinese_chatgpt_corpus

Resource description:

---
annotations_creators:
- no-annotation
language_creators:
- unknown
language:
- zh
license:
- unknown
multilinguality:
- monolingual
pretty_name: Chinese-ChatGPT-Corpus
size_categories:
- 5M<n<10M
task_categories:
- text-generation
- text2text-generation
- question-answering
- reinforcement-learning
task_ids:
- language-modeling
- masked-language-modeling
---

# Dataset Card for chinese_chatgpt_corpus

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Size of downloaded dataset files:** 5.05 GB
- **Size of the generated dataset:** 0 GB
- **Total amount of disk used:** 5.05 GB

### Dataset Summary

This repo collects Chinese corpora for Supervised Finetuning (SFT) and Reinforcement Learning From Human Feedback (RLHF).

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

Chinese

## Dataset Structure

### Data Instances

#### train_data_external_v1.jsonl

- **Size of downloaded dataset files:** 5.04 GB
- **Size of the generated dataset:** 0 GB
- **Total amount of disk used:** 5.04 GB

An example looks as follows:

```
{
  "prompt": "问题：有没有给未成年贷款的有的联系",
  "answers": [
    {
      "answer": "若通过招行办理，我行规定，贷款人年龄需年满18岁，且年龄加贷款年限不得超过70岁。如果您持有我行信用卡附属卡，可尝试办理预借现金。",
      "score": 1
    }
  ],
  "prefix": "回答："
}
```

#### dev_data_external_v1.jsonl

- **Size of downloaded dataset files:** 9.55 MB
- **Size of the generated dataset:** 0 MB
- **Total amount of disk used:** 9.55 MB

An example looks as follows:

```
{
  "prompt": "初学纹发现1/2\"的管螺纹并不是1\"的一半。不知道其中的原因，请各位指点。",
  "answers": [
    {
      "answer": "管螺纹的名义尺寸是“管子”的孔（内）径，而管子的壁厚不是两倍。所以，1/2\"的管螺纹并不是1\"的一半，",
      "score": 1
    }
  ],
  "prefix": "回答："
}
```

### Data Fields

The data fields are the same among all splits.

#### train_data_external_v1.jsonl

- `prompt`: prompt, `string`
- `answers`: list of answers
  - `answer`: answer, `string`
  - `score`: score of answer, `int`
- `prefix`: prefix to the answer, `string`

#### dev_data_external_v1.jsonl

- `prompt`: prompt, `string`
- `answers`: list of answers
  - `answer`: answer, `string`
  - `score`: score of answer, `int`
- `prefix`: prefix to the answer, `string`

### Data Splits

| name | examples |
|----------|-------:|
| train_data_external_v1.jsonl | 5477982 |
| dev_data_external_v1.jsonl | 10000 |

## Dataset Creation

### Curation Rationale

Link to GitHub: [data_prepare](https://github.com/sunzeyeah/RLHF/blob/master/src/data_prepare.py)

### Source Data

#### Initial Data Collection and Normalization

- [Encyclopedia (百科)](https://github.com/brightmart/nlp_chinese_corpus)
- [Zhidao Q&A (知道问答)](https://github.com/SophonPlus/ChineseNlpCorpus)
- [Couplets (对联)](https://github.com/wb14123/couplet-dataset/releases/download/1.0/couplet.tar.gz)
- [Classical Chinese (古文)](https://github.com/NiuTrans/Classical-Modern)
- [Classical poetry (古诗词)](https://github.com/chinese-poetry/chinese-poetry)
- Weibo news comments (微博新闻评论)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

Provided by:

sunzeyeah

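Summary of original information

Dataset overview

Basic dataset information

Name: Chinese-ChatGPT-Corpus
Language: Chinese
License: unknown
Multilinguality: monolingual
Size: 5M<n<10M

Dataset content

Supported tasks: text generation, text-to-text generation, question answering, reinforcement learning
Task IDs: language modeling, masked language modeling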
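
Given the supported tasks listed above, the following is a minimal sketch of how one record (with the `prompt`, `answers`, and `prefix` fields described in the dataset card) might be turned into an SFT string and into preference pairs for reward modeling. The concatenation order, the reading of a higher `score` as a better answer, and the function names `to_sft_text` and `to_preference_pairs` are assumptions made for illustration; the card does not specify them. The sample record is hypothetical.

```python
from itertools import combinations
from typing import List, Tuple


def to_sft_text(record: dict) -> str:
    """Join prompt, prefix, and the highest-scored answer (assumed layout)."""
    best = max(record["answers"], key=lambda a: a["score"])
    return record["prompt"] + record["prefix"] + best["answer"]


def to_preference_pairs(record: dict) -> List[Tuple[str, str]]:
    """Return (chosen, rejected) answer pairs, assuming higher score is better."""
    pairs = []
    for a, b in combinations(record["answers"], 2):
        if a["score"] == b["score"]:
            continue  # equal scores carry no preference signal
        chosen, rejected = (a, b) if a["score"] > b["score"] else (b, a)
        pairs.append((chosen["answer"], rejected["answer"]))
    return pairs


# Hypothetical record following the schema in the dataset card.
record = {
    "prompt": "问题：example question",
    "answers": [
        {"answer": "good answer", "score": 1},
        {"answer": "weak answer", "score": 0},
    ],
    "prefix": "回答：",
}

print(to_sft_text(record))
print(to_preference_pairs(record))
```

Records with a single answer, like the two examples shown in the card, yield no preference pairs under this sketch.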

Dataset structure

Data instances:
- train_data_external_v1.jsonl: 5.04 GB
- dev_data_external_v1.jsonl: 9.55 MB
Data fields:
- prompt: string
- answers: list
  - answer: string
  - score: integer
- prefix: string
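
The field layout above can be checked while reading a JSONL file one line at a time, which avoids loading a multi-gigabyte file into memory. The sketch below uses only the Python standard library; the helper names `iter_records` and `validate_record` are hypothetical, and an in-memory stream stands in for a real file such as `train_data_external_v1.jsonl`.

```python
import io
import json
from typing import Iterator, TextIO


def validate_record(record: dict, line_number: int) -> None:
    """Check the field types listed above; raise ValueError on a mismatch."""
    for key in ("prompt", "prefix"):
        if not isinstance(record.get(key), str):
            raise ValueError(f"line {line_number}: '{key}' must be a string")
    answers = record.get("answers")
    if not isinstance(answers, list):
        raise ValueError(f"line {line_number}: 'answers' must be a list")
    for item in answers:
        if not (
            isinstance(item, dict)
            and isinstance(item.get("answer"), str)
            and isinstance(item.get("score"), int)
        ):
            raise ValueError(f"line {line_number}: malformed entry in 'answers'")


def iter_records(stream: TextIO) -> Iterator[dict]:
    """Yield one validated record per non-empty JSONL line."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        validate_record(record, line_number)
        yield record


# In-memory stand-in for a JSONL file, using the first example from the card.
example = {
    "prompt": "问题：有没有给未成年贷款的有的联系",
    "answers": [
        {
            "answer": "若通过招行办理，我行规定，贷款人年龄需年满18岁，且年龄加贷款年限不得超过70岁。如果您持有我行信用卡附属卡，可尝试办理预借现金。",
            "score": 1,
        }
    ],
    "prefix": "回答：",
}
sample = io.StringIO(json.dumps(example, ensure_ascii=False) + "\n")

records = list(iter_records(sample))
print(len(records), records[0]["prefix"])
```

For a file on disk, the same generator could be fed with `open(path, encoding="utf-8")` in place of the `io.StringIO` object.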

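Dataset creation

Source data:
- Encyclopedia (百科), Zhidao Q&A (知道问答), couplets, classical Chinese texts, classical poetry, Weibo news comments
Data collection and normalization:
- Datasets drawn from multiple public sources

Considerations for using the data

License information: unknown
Dataset impact, biases, and limitations: insufficient information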
