ShareGPT_Vicuna_unfiltered
收藏魔搭社区2026-05-22 更新2024-05-15 收录
下载链接:
https://modelscope.cn/datasets/otavia/ShareGPT_Vicuna_unfiltered
下载链接
链接失效反馈官方服务:
资源简介:
###### 该数据集当前使用的是默认介绍模版,请根据[数据集文件规范](https://www.modelscope.cn/docs/%E6%95%B0%E6%8D%AE%E9%9B%86%E6%96%87%E4%BB%B6%E8%A7%84%E8%8C%83)及时完善数据集卡片内容。谢谢您的理解。
### Clone with HTTP
```bash
git clone https://www.modelscope.cn/datasets/otavia/ShareGPT_Vicuna_unfiltered.git
```
=======
license: apache-2.0
language:
- en
---
**Further cleaning done. Please look through the dataset and ensure that I didn't miss anything.**
**Update: Confirmed working method for training the model: https://huggingface.co/AlekseyKorshuk/vicuna-7b/discussions/4#64346c08ef6d5abefe42c12c**
Two choices:
- Removes instances of "I'm sorry, but": https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json
- Has instances of "I'm sorry, but": https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split.json
The choice is yours. The first dataset may go to far and remove valuable data. The second is better for when the AI asks for clarification, but it also may refuse to do stuff like browse the internet, which it actually may be able to do with certain langchain implementations. These are important things to think about before training.
~100k ShareGPT conversations narrowed down to 53k by:
* Removing non-english conversations
* Removing excessive unicode (indicative of Chinese or Korean text, usually)
* Removing excessive repeated characters
* Removing various instances "AI Moralizing". Conversations with these phrases were removed (and a few others that can't be mentioned here):
"text-based AI language model",
"domestic violence",
"please refrain",
"derogatory",
"inappropriate",
"offensive",
"racism",
"racist",
"racial",
"discriminate",
"discriminatory",
"discrimination",
"sexist",
"sexism",
"unacceptable",
"inclusive workplace",
"lgbt",
"morals",
"ethics",
"ethical",
"legality",
"illegal",
"illegality",
"hateful",
"harmful",
"it is never okay",
"It is important to",
"It's important to",
"real-world consequences",
"hate speech",
"glorify",
"not be appropriate",
"supremacist",
"extremist",
"responsible AI",
"AI principles",
"AI assistant",
"an AI language",
"ableist",
"hurtful",
"gender stereotype",
"gender inequality",
"underrepresentation",
"safe spaces",
"gender-based",
"inclusivity",
"feminist",
"feminism",
"transgender",
"empowerment",
"communist",
"capitalism",
"stereotypes",
"biases",
"bias",
"Microaggression",
"prioritize human safety",
"as a language model",
"as an AI language model",
"As a large language model",
"As an AI",
"ethical principles",
"consensual",
"it is not appropriate",
"it's not appropriate",
"I cannot fulfill your request",
"harmful to human beings",
"ethical guidelines",
"my guidelines",
"prioritize user safety",
"adhere to ethical guidelines",
"harmful consequences",
"potentially harmful",
"dangerous activities",
"promote safety",
"well-being of all users",
"responsible information sharing",
"jeopardize the safety",
"illegal actions or intentions",
"undermine the stability",
"promote the well-being",
"illegal activities or actions",
"adherence to the law",
"potentially be harmful",
"illegal substances or activities",
"committed to promoting",
"safe information",
"lawful information",
"cannot provide guidance",
"cannot provide information",
"unable to offer assistance",
"cannot engage in discussions",
"programming prohibits",
"follow ethical guidelines",
"ensure the safety",
"involves an illegal subject",
"prioritize safety",
"illegal subject",
"prioritize user well-being",
"cannot support or promote",
"activities that could harm",
"pose a risk to others",
"against my programming",
"activities that could undermine",
"potentially dangerous",
"not within the scope",
"designed to prioritize safety",
"not able to provide",
"maintain user safety",
"adhere to safety guidelines",
"dangerous or harmful",
"cannot provide any information",
"focus on promoting safety"
* Conversations split into 2048 token chunks as described here: https://github.com/lm-sys/FastChat/blob/main/docs/commands/data_cleaning.md
This should be fully ready to train an unfiltered english Vicuna model based on the procedure here: https://github.com/lm-sys/FastChat/
>>>>>>> main
已完成进一步数据清洗。请审阅本数据集,确认我未遗漏任何内容。
**更新:已确认模型训练的可行方法:https://huggingface.co/AlekseyKorshuk/vicuna-7b/discussions/4#64346c08ef6d5abefe42c12c**
可供选择的两个数据集版本:
- 移除所有"I'm sorry, but"句式:https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json
- 保留所有"I'm sorry, but"句式:https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split.json
选择由您自行决定。第一版数据集可能清洗过度,移除了有价值的训练数据;第二版更适配AI请求澄清的场景,但该版本可能会让AI拒绝执行联网浏览等操作——而通过特定的LangChain实现,AI实际上可完成此类操作。在训练前,请务必斟酌上述要点。
原约10万条ShareGPT对话经以下筛选步骤缩减至5.3万条:
* 移除非英语对话
* 移除过量Unicode字符(通常对应中文或韩文文本)
* 移除过度重复的字符
* 移除各类"AI说教"内容。包含以下短语的对话将被移除(另有部分无法在此列明的禁用短语):
"text-based AI language model",
"domestic violence",
"please refrain",
"derogatory",
"inappropriate",
"offensive",
"racism",
"racist",
"racial",
"discriminate",
"discriminatory",
"discrimination",
"sexist",
"sexism",
"unacceptable",
"inclusive workplace",
"lgbt",
"morals",
"ethics",
"ethical",
"legality",
"illegal",
"illegality",
"hateful",
"harmful",
"it is never okay",
"It is important to",
"It's important to",
"real-world consequences",
"hate speech",
"glorify",
"not be appropriate",
"supremacist",
"extremist",
"responsible AI",
"AI principles",
"AI assistant",
"an AI language",
"ableist",
"hurtful",
"gender stereotype",
"gender inequality",
"underrepresentation",
"safe spaces",
"gender-based",
"inclusivity",
"feminist",
"feminism",
"transgender",
"empowerment",
"communist",
"capitalism",
"stereotypes",
"biases",
"bias",
"Microaggression",
"prioritize human safety",
"as a language model",
"as an AI language model",
"As a large language model",
"As an AI",
"ethical principles",
"consensual",
"it is not appropriate",
"it's not appropriate",
"I cannot fulfill your request",
"harmful to human beings",
"ethical guidelines",
"my guidelines",
"prioritize user safety",
"adhere to ethical guidelines",
"harmful consequences",
"potentially harmful",
"dangerous activities",
"promote safety",
"well-being of all users",
"responsible information sharing",
"jeopardize the safety",
"illegal actions or intentions",
"undermine the stability",
"promote the well-being",
"illegal activities or actions",
"adherence to the law",
"potentially be harmful",
"illegal substances or activities",
"committed to promoting",
"safe information",
"lawful information",
"cannot provide guidance",
"cannot provide information",
"unable to offer assistance",
"cannot engage in discussions",
"programming prohibits",
"follow ethical guidelines",
"ensure the safety",
"involves an illegal subject",
"prioritize safety",
"illegal subject",
"prioritize user well-being",
"cannot support or promote",
"activities that could harm",
"pose a risk to others",
"against my programming",
"activities that could undermine",
"potentially dangerous",
"not within the scope",
"designed to prioritize safety",
"not able to provide",
"maintain user safety",
"adhere to safety guidelines",
"dangerous or harmful",
"cannot provide any information",
"focus on promoting safety"
对话已按照下述方法切分为2048 Token块:https://github.com/lm-sys/FastChat/blob/main/docs/commands/data_cleaning.md
基于此处提供的流程,本数据集已完全就绪,可用于训练无过滤版英语Vicuna模型:https://github.com/lm-sys/FastChat/
提供机构:
maas
创建时间:
2024-01-17
搜集汇总
数据集介绍

背景与挑战
背景概述
ShareGPT_Vicuna_unfiltered是一个经过严格清洗的英文对话数据集,包含53k条筛选后的ShareGPT对话,适用于训练无过滤的Vicuna模型。数据集提供两个版本,分别处理了包含道歉语句的对话,用户可根据需求选择合适版本进行训练。
以上内容由遇见数据集搜集并总结生成



