betteruncensored/sharegpt

Name: betteruncensored/sharegpt
Creator: betteruncensored
Published: 2024-02-12 22:23:06
License: No description provided

Source: Hugging Face. Updated 2024-02-12; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/betteruncensored/sharegpt

Summary:

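This dataset is copied from `philschmid/sharegpt-raw` and processed with the `Better Uncensored (BUn)` pipeline, which yields two versions: `sharegpt_20230401_clean_bun.json` (57058 conversations) and `sharegpt_20230401_clean_split_bun.json` (103152 conversations). Processing consists of merging the raw JSON files, pretty-printing the JSON, cleaning the data (for example, removing HTML tags), uncensoring, and splitting long conversations. Note that the BUn pipeline removes most conversations in non-ASCII languages, so the dataset is not suitable for use cases that rely mainly on non-ASCII languages such as Chinese or Russian.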
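To gauge whether your own text would be affected by the non-ASCII limitation, a rough character-level check can help. The sketch below is only an illustration: the function names and the 0.9 threshold are assumptions, and it does not reproduce whatever criterion the BUn pipeline actually applies.

```python
def ascii_ratio(text: str) -> float:
    """Return the fraction of characters in `text` that are ASCII."""
    if not text:
        return 1.0  # treat empty text as trivially ASCII
    return sum(ch.isascii() for ch in text) / len(text)


def is_mostly_ascii(text: str, threshold: float = 0.9) -> bool:
    """Hypothetical heuristic; the threshold is an assumed value, not BUn's."""
    return ascii_ratio(text) >= threshold
```

Text that falls well below the threshold is the kind of content this dataset is unlikely to cover.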

Provided by:

betteruncensored

Original information summary

Dataset overview

This dataset is copied from philschmid/sharegpt-raw. Processing it with the Better Uncensored (BUn) pipeline yields two cleaned and uncensored versions:

sharegpt_20230401_clean_bun.json: contains 57058 conversations.
sharegpt_20230401_clean_split_bun.json: contains 103152 conversations; long conversations have been split.
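The conversation counts can be checked after download with a few lines of Python. This is a minimal sketch that assumes the common ShareGPT layout: a top-level JSON array whose records each hold a `conversations` list. The listing does not spell out that layout, and the helper names are hypothetical.

```python
import json


def load_conversations(path):
    """Load a ShareGPT-style file, assumed to be a top-level JSON array."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON array")
    return data


def count_turns(records):
    """Total number of turns, assuming each record has a 'conversations' list."""
    return sum(len(r.get("conversations", [])) for r in records)


# Usage (file name taken from the listing above):
# records = load_conversations("sharegpt_20230401_clean_bun.json")
# print(len(records), count_turns(records))
```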

Dataset processing steps

Merge and format:
- Merge the two raw JSON files and pretty-print the merged file.

    python merge.py sharegpt_90k_raw_dataset/sg_90k_part1.json sharegpt_90k_raw_dataset/sg_90k_part2.json sharegpt_20230401_html_unformatted.json
    python pretty_json.py --in sharegpt_20230401_html_unformatted.json --out sharegpt_20230401_html.json

Validate the JSON file (optional):

    if jq empty sharegpt_20230401_html.json 2>/dev/null; then
      echo "JSON is valid"
    else
      echo "JSON is invalid"
    fi

Clean the data:
- Remove HTML tags and similar markup.

    python3 clean_sharegpt.py --in sharegpt_20230401_html.json --out sharegpt_20230401_clean.json

Uncensor:

    python uncensor_sharegpt.py --in-file sharegpt_20230401_clean.json --out-file sharegpt_20230401_clean_bun.json

Split long conversations:

    python -m fastchat.data.split_long_conversation --in sharegpt_20230401_clean_bun.json --out sharegpt_20230401_clean_split_bun.json --model-name meta-llama/Llama-2-13b-hf
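For readers without the scripts at hand, the merge, pretty-print, and validation steps can be approximated in plain Python. The sketch below is a stand-in under stated assumptions, not the actual `merge.py` or `pretty_json.py`: it assumes each input file is a top-level JSON array, and the function names are hypothetical.

```python
import json


def merge_json_arrays(in_paths, out_path):
    """Concatenate top-level JSON arrays and write one pretty-printed file."""
    merged = []
    for path in in_paths:
        with open(path, encoding="utf-8") as f:
            part = json.load(f)  # raises json.JSONDecodeError on invalid JSON
        if not isinstance(part, list):
            raise ValueError(f"{path}: expected a top-level JSON array")
        merged.extend(part)
    with open(out_path, "w", encoding="utf-8") as f:
        # indent=2 pretty-prints; ensure_ascii=False keeps non-ASCII text readable
        json.dump(merged, f, indent=2, ensure_ascii=False)
    return len(merged)


def is_valid_json(path):
    """Rough equivalent of the optional `jq empty` check."""
    try:
        with open(path, encoding="utf-8") as f:
            json.load(f)
    except (OSError, ValueError):  # JSONDecodeError is a ValueError subclass
        return False
    return True
```

Unlike the two-step shell version, this writes the pretty-printed file directly, so no intermediate unformatted file is produced.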
