ShareGPT-Chinese-English-90k
Source: ModelScope community. Updated: 2026-05-23. Indexed: 2024-06-01.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/ShareGPT-Chinese-English-90k
Resource description:
# ShareGPT-Chinese-English-90k Bilingual Human-Machine QA Dataset
A high-quality Chinese-English parallel bilingual human-machine QA dataset covering user questions from real, complex scenarios. It is intended for training high-quality dialogue models. Its instruction distribution is more robust than that of datasets produced by repeatedly calling an API to simulate machine-generated Q&A, such as Moss.
Features:
- 1. Provides a fully semantically equivalent Chinese-English parallel corpus, which supports bilingual dialogue model training.
- 2. All questions are genuine user inquiries rather than data fabricated by imagination or API polling (as in Moss), so they match the real distribution of user scenarios and question phrasing more closely.
- 3. The ShareGPT data was collected through voluntary sharing by netizens, which acts as a natural human filter that screens out most dialogues with a poor experience.
The Firefly framework is recommended for quick, out-of-the-box loading of this data format: https://github.com/yangjianxin1/Firefly
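For a quick look at the data without any framework, the file can also be read with the standard library alone. The following is a minimal sketch and is not part of Firefly. It assumes each line of the JSONL file is a JSON object with a `conversation` list whose items carry `human` and `assistant` keys, which is the layout implied by the conversion script in this README. The function name `iter_firefly_turns` is hypothetical.

```python
import json
from typing import Iterator, Tuple


def iter_firefly_turns(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (human, assistant) pairs from a Firefly-style JSONL file.

    Assumed layout per line:
    {"conversation": [{"human": "...", "assistant": "..."}, ...]}
    """
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            record = json.loads(line)
            for turn in record.get("conversation", []):
                # Missing keys fall back to an empty string.
                yield turn.get("human", ""), turn.get("assistant", "")
```

A typical use would be `for human, assistant in iter_firefly_turns("input_firefly.jsonl"): ...`, where the path is a placeholder for a local copy of the data.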
Note: This dataset was collected before ChatGPT showed signs of significant cognitive decline. (It is speculated that this decline may be partly because OpenAI replaced the 150B gpt3.5 with a distilled version of about 10B to reduce costs, and partly because the introduction of more refusal responses degraded the model's ability to connect knowledge and logic.)
Training an excellent dialogue LLM requires a high-quality multi-turn dialogue dataset. If you would like to volunteer, you are welcome to join the dataset QQ group (130920969) to exchange, collect, and help build high-quality datasets.
PS: The dataset is currently in Firefly format. The script provided in the repository converts it into the more widely used ShareGPT multi-turn dialogue format.
```python
import json


def convert_jsonl(input_file, output_file):
    """Convert a Firefly-format JSONL file to ShareGPT-format JSONL."""
    with open(input_file, 'r', encoding='utf-8') as f:
        with open(output_file, 'w', encoding='utf-8') as fout:
            for line in f:
                line = line.strip()
                if not line:  # skip blank lines instead of failing in json.loads
                    continue
                data = json.loads(line)
                conversations = data['conversation']
                new_conversations = []
                for conv in conversations:
                    # Each turn is a dict: 'assistant' maps to 'gpt',
                    # any other key is treated as 'human'.
                    for key, value in conv.items():
                        if key == 'assistant':
                            key = 'gpt'
                        else:
                            key = 'human'
                        new_conversations.append({'from': key, 'value': value})
                new_data = {'conversations': new_conversations}
                fout.write(json.dumps(new_data, ensure_ascii=False) + '\n')


# Replace these with your input and output file paths.
input_file = 'input_firefly.jsonl'
output_file = 'output_sharegpt.jsonl'
convert_jsonl(input_file, output_file)
```
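To make the effect of the conversion concrete, the sketch below applies the same key mapping to a single in-memory record. The sample record is invented for illustration and is not taken from the dataset, and `convert_record` is a hypothetical helper name. Like the script above, it assumes each turn is a dict with `human` and `assistant` keys. Unlike the script, it skips any other keys rather than labeling them `human`.

```python
import json


def convert_record(record: dict) -> dict:
    """Map one Firefly-style record to ShareGPT-style turns."""
    role_map = {"human": "human", "assistant": "gpt"}
    turns = []
    for turn in record["conversation"]:
        for key, value in turn.items():
            if key in role_map:  # ignore keys other than the two speaker roles
                turns.append({"from": role_map[key], "value": value})
    return {"conversations": turns}


# Invented sample record, for illustration only.
sample = {"conversation": [{"human": "What is 2 + 2?", "assistant": "4"}]}
print(json.dumps(convert_record(sample), ensure_ascii=False))
```

Each Firefly turn becomes two ShareGPT entries, one with `from` set to `human` and one with `from` set to `gpt`, in the order the keys appear in the source turn.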
Special thanks to "淮北艾阿网络科技有限公司" for sponsoring the cost of the translation work!
<img width="360" src="https://cdn-uploads.huggingface.co/production/uploads/631f5b422225f12fc0f2c838/rnAz74Adg-m8QbRraXhqU.jpeg">
If you use this project in your work, please cite it as follows:
```
@dataset{sharegpt_chinese_english_90k,
author = {{ShareAI Lab}},
title = {ShareGPT-Chinese-English-90k: A Bilingual Chinese-English Human-Machine Dialogue Dataset},
year = {2023},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/shareAI/ShareGPT-Chinese-English-90k}
}
```
Fallback (BibTeX @misc):
```
@misc{sharegpt_chinese_english_90k,
author = {{ShareAI Lab}},
title = {ShareGPT-Chinese-English-90k: A Bilingual Chinese-English Human-Machine Dialogue Dataset},
year = {2023},
howpublished = {\url{https://huggingface.co/datasets/shareAI/ShareGPT-Chinese-English-90k}},
note = {Hugging Face dataset repository}
}
```
Legacy note: Some earlier works cite the dataset author as “shareAI”. The canonical author name is “ShareAI Lab”, while the dataset URL remains unchanged.
Provider:
maas
Created:
2024-05-09
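Summary
Dataset introduction

Background and challenges
Background overview
ShareGPT-Chinese-English-90k is a high-quality Chinese-English parallel bilingual human-machine QA dataset that contains real user questions and is suited to training dialogue models. It provides fully semantically equivalent bilingual text, and the data was collected through voluntary sharing by netizens, which acts as a natural quality filter.
The above content was collected and summarized by 遇见数据集.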



