sharegpt-hyperfiltered-3k

Name: sharegpt-hyperfiltered-3k
Creator: maas
Published: 2025-12-29 17:07:44
License: No description available

ModelScope community · Updated 2025-12-29 · Indexed 2024-05-15

Download link:

https://modelscope.cn/datasets/AI-ModelScope/sharegpt-hyperfiltered-3k


Resource description:

# sharegpt-hyperfiltered-3k

90k ShareGPT conversations brought down to ~3k (3,243) via language filtering, keyword detection, deduplication, and regex. The following steps were applied:

- Deduplication on the first message from the human
- Remove non-English conversations
- Remove censorship, refusals, and alignment
- Remove incorrect/low-quality answers
- Remove creative tasks
  - ChatGPT's creative outputs are very censored and robotic; I think the base model can do better.
- Remove URLs
- Remove cutoffs
- Remove math/reasoning questions
  - Quality is poor without CoT prompting, so this data should be mixed with better reasoning examples like OpenOrca or Dolphin.

## Example code

```python
from modelscope import MsDataset
from modelscope.utils.constant import DownloadMode

# Load the train split, forcing a fresh download rather than reusing the local cache.
ds = MsDataset.load('AI-ModelScope/sharegpt-hyperfiltered-3k', split='train', download_mode=DownloadMode.FORCE_REDOWNLOAD)
print(next(iter(ds)))  # show the first record
```
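The filtering steps listed above can be illustrated with a minimal sketch. This is not the dataset author's script, which is not shown on this page. It assumes the common ShareGPT record layout (a `conversations` list of turns with `from` and `value` keys), and the names `URL_RE`, `REFUSAL_RE`, `first_human_message`, and `filter_conversations` are hypothetical, as are the example patterns. Only three of the steps are covered: first-message deduplication, URL removal, and a keyword check for refusals.

```python
import re

# Hypothetical patterns for illustration; the author's real keyword lists are unknown.
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
REFUSAL_RE = re.compile(
    r"as an ai language model|i'm sorry, but i can(?:not|'t)", re.IGNORECASE
)


def first_human_message(convo):
    """Return the text of the first human turn, or None if there is none."""
    for turn in convo.get("conversations", []):
        if turn.get("from") == "human":
            return turn.get("value", "")
    return None


def filter_conversations(convos):
    """Drop conversations with URLs or refusals, then dedupe on the first human message."""
    seen = set()
    kept = []
    for convo in convos:
        first = first_human_message(convo)
        if first is None:
            continue  # no human turn to deduplicate on
        text = "\n".join(t.get("value", "") for t in convo["conversations"])
        if URL_RE.search(text) or REFUSAL_RE.search(text):
            continue  # contains a URL or a refusal phrase
        key = " ".join(first.lower().split())  # normalise case and whitespace
        if key in seen:
            continue  # duplicate opening message
        seen.add(key)
        kept.append(convo)
    return kept
```

Content checks run before deduplication here, so a clean conversation is not discarded just because an earlier one with the same opening message was rejected. Language detection and quality filtering would need additional tooling and are left out.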


Provider:

maas

Created:

2023-12-04

Summary

Dataset introduction