sharegpt-hyperfiltered-3k
Source: ModelScope community | Updated: 2025-12-29 | Indexed: 2024-05-15
Download link:
https://modelscope.cn/datasets/AI-ModelScope/sharegpt-hyperfiltered-3k
Description:
# sharegpt-hyperfiltered-3k
90k ShareGPT conversations reduced to ~3k (3,243) via language filtering, keyword detection, deduplication, and regex. The following steps were applied:
- Deduplication on first message from human
- Remove non-English convos
- Remove censorship, refusals, and alignment
- Remove incorrect/low-quality answers
- Remove creative tasks
  - ChatGPT's creative outputs are very censored and robotic; I think the base model can do better.
- Remove URLs
- Remove cutoffs
- Remove math/reasoning questions
  - Answers to these are poor without chain-of-thought (CoT) prompting, so this data should be mixed with better reasoning examples such as OpenOrca or Dolphin.
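
A few of the steps above (deduplication on the first human message, dropping non-English conversations, refusals, and URLs) can be sketched as below. This is only an illustration, not the script used to build the dataset. It assumes the common ShareGPT layout (a `conversations` list of `{"from", "value"}` turns), reads "Remove URLs" as dropping whole conversations that contain a URL, uses an ASCII-ratio check as a crude stand-in for real language detection, and uses hypothetical refusal markers, since the card does not list the actual keywords.

```python
import re

# Matches http(s) links and bare www. links.
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)

# Hypothetical refusal markers; the dataset card does not list the real keywords.
REFUSAL_MARKERS = ("as an ai language model", "i'm sorry, but", "i cannot fulfill")


def first_human_message(convo):
    """Return the text of the first human turn, or None if there is none."""
    for turn in convo["conversations"]:
        if turn["from"] == "human":
            return turn["value"]
    return None


def is_mostly_ascii(text, threshold=0.9):
    """Crude stand-in for language detection: share of ASCII characters."""
    if not text:
        return False
    return sum(ch.isascii() for ch in text) / len(text) >= threshold


def keep(convo):
    """Apply the language, URL, and refusal filters to one conversation."""
    joined = " ".join(turn["value"] for turn in convo["conversations"])
    if not is_mostly_ascii(joined):
        return False
    if URL_RE.search(joined):
        return False
    lowered = joined.lower()
    return not any(marker in lowered for marker in REFUSAL_MARKERS)


def filter_conversations(convos):
    """Deduplicate on the first human message, then apply the filters."""
    seen = set()
    kept = []
    for convo in convos:
        first = first_human_message(convo)
        if first is None:
            continue
        key = first.strip().lower()  # normalise before comparing
        if key in seen:
            continue
        seen.add(key)
        if keep(convo):
            kept.append(convo)
    return kept
```

The remaining steps (low-quality answers, creative tasks, cutoffs, math/reasoning questions) would need their own heuristics and are left out of this sketch.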
## Example code
```python
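from modelscope import MsDataset
from modelscope.utils.constant import DownloadMode

# Load the train split; FORCE_REDOWNLOAD bypasses any locally cached copy.
ds = MsDataset.load('AI-ModelScope/sharegpt-hyperfiltered-3k', split='train', download_mode=DownloadMode.FORCE_REDOWNLOAD)

# Print the first record to inspect its fields.
print(next(iter(ds)))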
```
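
The card recommends mixing this data with reasoning examples such as OpenOrca or Dolphin but gives no ratio. A minimal sketch of one way to do that with plain Python lists follows. The helper name `mix_datasets` and the 30% default for `reasoning_fraction` are illustrative assumptions, not values from the card.

```python
import random


def mix_datasets(filtered, reasoning, reasoning_fraction=0.3, seed=0):
    """Return a shuffled list in which about `reasoning_fraction` of the
    rows come from `reasoning` and the rest from `filtered`.

    The fraction is an illustrative assumption; the card specifies none.
    """
    if not 0 <= reasoning_fraction < 1:
        raise ValueError("reasoning_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Number of reasoning rows needed to reach the target share of the mix.
    n_reasoning = round(len(filtered) * reasoning_fraction / (1 - reasoning_fraction))
    n_reasoning = min(n_reasoning, len(reasoning))  # cap at what is available
    mixed = list(filtered) + rng.sample(list(reasoning), n_reasoning)
    rng.shuffle(mixed)
    return mixed
```

All rows from the filtered set are kept and the reasoning set is subsampled, so the small 3k dataset is not diluted further than the chosen fraction.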
Provider: maas
Created: 2023-12-04



