sunzeyeah/chinese_chatgpt_corpus

Source: Hugging Face. Updated 2023-03-23; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/sunzeyeah/chinese_chatgpt_corpus
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- unknown
language:
- zh
license:
- unknown
multilinguality:
- monolingual
pretty_name: Chinese-ChatGPT-Corpus
size_categories:
- 5M<n<10M
task_categories:
- text-generation
- text2text-generation
- question-answering
- reinforcement-learning
task_ids:
- language-modeling
- masked-language-modeling
---

# Dataset Card for chinese_chatgpt_corpus

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Size of downloaded dataset files:** 5.05 GB
- **Size of the generated dataset:** 0 GB
- **Total amount of disk used:** 5.05 GB

### Dataset Summary

This repo collects Chinese corpora for supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

Chinese

## Dataset Structure

### Data Instances

#### train_data_external_v1.jsonl

- **Size of downloaded dataset files:** 5.04 GB
- **Size of the generated dataset:** 0 GB
- **Total amount of disk used:** 5.04 GB

An example looks as follows:

```
{
  "prompt": "问题:有没有给未成年贷款的有的联系",
  "answers": [
    {
      "answer": "若通过招行办理,我行规定,贷款人年龄需年满18岁,且年龄加贷款年限不得超过70岁。如果您持有我行信用卡附属卡,可尝试办理预借现金。",
      "score": 1
    }
  ],
  "prefix": "回答:"
}
```

#### dev_data_external_v1.jsonl

- **Size of downloaded dataset files:** 9.55 MB
- **Size of the generated dataset:** 0 MB
- **Total amount of disk used:** 9.55 MB

An example looks as follows:

```
{
  "prompt": "初学纹发现1/2\"的管螺纹并不是1\"的一半。不知道其中的原因,请各位指点。",
  "answers": [
    {
      "answer": "管螺纹的名义尺寸是“管子”的孔(内)径,而管子的壁厚不是两倍。所以,1/2\"的管螺纹并不是1\"的一半,",
      "score": 1
    }
  ],
  "prefix": "回答:"
}
```

### Data Fields

The data fields are the same among all splits.

#### train_data_external_v1.jsonl

- `prompt`: prompt, `string`
- `answers`: list of answers
  - `answer`: answer, `string`
  - `score`: score of answer, `int`
- `prefix`: prefix to the answer, `string`

#### dev_data_external_v1.jsonl

- `prompt`: prompt, `string`
- `answers`: list of answers
  - `answer`: answer, `string`
  - `score`: score of answer, `int`
- `prefix`: prefix to the answer, `string`

### Data Splits

| file | examples |
|------|---------:|
| train_data_external_v1.jsonl | 5477982 |
| dev_data_external_v1.jsonl | 10000 |

## Dataset Creation

### Curation Rationale

Link to GitHub: [data_prepare](https://github.com/sunzeyeah/RLHF/blob/master/src/data_prepare.py)

### Source Data

#### Initial Data Collection and Normalization

- [Encyclopedia (百科)](https://github.com/brightmart/nlp_chinese_corpus)
- [Zhidao Q&A (知道问答)](https://github.com/SophonPlus/ChineseNlpCorpus)
- [Couplets (对联)](https://github.com/wb14123/couplet-dataset/releases/download/1.0/couplet.tar.gz)
- [Classical Chinese (古文)](https://github.com/NiuTrans/Classical-Modern)
- [Classical poetry (古诗词)](https://github.com/chinese-poetry/chinese-poetry)
- Weibo news comments (微博新闻评论)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
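
The record layout shown under Data Instances can be read with the Python standard library alone. The sketch below is a minimal illustration, not the curator's pipeline: the helper names `iter_records` and `to_sft_text` are hypothetical, and joining `prompt`, `prefix`, and the highest-scored answer into a single training string is an assumption about how the fields are meant to be combined.

```python
import io
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines between records
            yield json.loads(line)


def to_sft_text(record: dict) -> str:
    """Join prompt, prefix and the highest-scored answer (an assumed layout)."""
    best = max(record["answers"], key=lambda a: a["score"])
    return record["prompt"] + record["prefix"] + best["answer"]


# Record copied from the train_data_external_v1.jsonl example in the card.
sample = {
    "prompt": "问题:有没有给未成年贷款的有的联系",
    "answers": [
        {
            "answer": "若通过招行办理,我行规定,贷款人年龄需年满18岁,且年龄加贷款年限不得超过70岁。如果您持有我行信用卡附属卡,可尝试办理预借现金。",
            "score": 1,
        }
    ],
    "prefix": "回答:",
}

# Simulate a JSONL file in memory; a real file opened with
# open(path, encoding="utf-8") can be passed to iter_records the same way.
jsonl = io.StringIO(json.dumps(sample, ensure_ascii=False) + "\n\n")
records = list(iter_records(jsonl))
text = to_sft_text(records[0])
print(text)
```

Because the training file is about 5 GB, iterating line by line as above avoids loading the whole file into memory. Note that `to_sft_text` raises `ValueError` on a record whose `answers` list is empty; whether such records occur in the corpus is not stated in the card.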
Provider:
sunzeyeah
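Summary of original information

Dataset overview

Basic information

  • Name: Chinese-ChatGPT-Corpus
  • Language: Chinese
  • License: unknown
  • Multilinguality: monolingual
  • Size: 5M<n<10M

Dataset contents

  • Supported tasks: text generation, text-to-text generation, question answering, reinforcement learning
  • Task IDs: language modeling, masked language modeling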

Dataset structure

  • Data instances:
    • train_data_external_v1.jsonl: 5.04 GB
    • dev_data_external_v1.jsonl: 9.55 MB
  • Data fields:
    • prompt: string
    • answers: list
      • answer: string
      • score: integer
    • prefix: string
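
The field list above amounts to a small schema, and a record can be checked against it before training. The following is a sketch only: `validate_record` is a hypothetical helper, not part of the dataset or its repository, and the error messages are illustrative.

```python
from typing import List


def validate_record(record: dict) -> List[str]:
    """Return a list of schema problems; an empty list means the record conforms."""
    problems = []
    for key in ("prompt", "prefix"):
        if not isinstance(record.get(key), str):
            problems.append(f"{key} must be a string")
    answers = record.get("answers")
    if not isinstance(answers, list):
        problems.append("answers must be a list")
        return problems
    for i, item in enumerate(answers):
        if not isinstance(item, dict):
            problems.append(f"answers[{i}] must be an object")
            continue
        if not isinstance(item.get("answer"), str):
            problems.append(f"answers[{i}].answer must be a string")
        score = item.get("score")
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(score, bool) or not isinstance(score, int):
            problems.append(f"answers[{i}].score must be an integer")
    return problems


good = {"prompt": "p", "answers": [{"answer": "a", "score": 1}], "prefix": "x"}
bad = {"prompt": "p", "answers": [{"answer": "a", "score": "1"}]}
print(validate_record(good))
print(validate_record(bad))
```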

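Dataset creation

  • Source data:
    • Encyclopedia, Zhidao Q&A, couplets, classical Chinese, classical poetry, Weibo news comments
  • Data collection and normalization:
    • Datasets drawn from multiple public sources

Considerations for using the data

  • Licensing information: unknown
  • Social impact, biases, and limitations: insufficient information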