CaterinaLac/sharegpt-deduplicated

Name: CaterinaLac/sharegpt-deduplicated
Creator: CaterinaLac
Published: 2023-10-04 14:40:39
License: 暂无描述

Hugging Face2023-10-04 更新2024-03-04 收录

下载链接：

https://hf-mirror.com/datasets/CaterinaLac/sharegpt-deduplicated

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: apache-2.0 task_categories: - conversational language: - en - zh - ko - fr - ja - es - 'no' - et - de - ca - vi - fi size_categories: - 1K<n<10K --- # Dataset Card for Dataset Name ## Dataset Description ### Dataset Summary This dataset is a deduplicated version of [sharegpt4](https://huggingface.co/datasets/shibing624/sharegpt_gpt4). The deduplication process has two steps: 1. The literal duplicates (both input and outputs) are removed 2. The remaining (5749) instances are embedded with the [SentenceTransformer library](https://www.sbert.net/) ("paraphrase-multilingual-mpnet-base-v2" model). Then, we compute the cosine similarity among all the possible pairs, and consider paraphrases those pairs with a similarity > 0.95. For each paraphrase group, we only retain one element. The resulting dataset has 5139 elements. ### Languages The dataset includes several languages, but the vast majority of it is in English. Roughly 600 instances are in more than one language, as detected by [langdetect](https://pypi.org/project/langdetect/). The languages that appear across the dataset, together with the number of instances they appear in, follow: <details> <summary>Language Distribution</summary> en 4053 zh-cn 423 ko 333 fr 168 ja 151 es 142 no 110 et 97 de 81 ca 78 vi 63 fi 52 zh-tw 47 pt 42 tl 39 ru 24 he 24 id 23 it 22 sv 21 pl 16 nl 16 th 15 ro 11 da 9 tr 8 cs 8 hr 6 uk 5 af 5 ar 4 bg 3 cy 2 sk 2 hu 2 so 2 bn 1 sl 1 hi 1 sw 1 lv 1 el 1 </details> ### Data Fields Each instance has two fields: - 'input': one turn of a human-bot conversation, initiated by a human. It starts with 'Human: ', and it ends with 'Assistant: ' - 'output': the bot reply

提供机构：

CaterinaLac

原始信息汇总

数据集卡片

数据集描述

数据集概述

该数据集是sharegpt4的去重版本。去重过程包括两个步骤：

移除字面重复的实例（包括输入和输出）。
剩余的5749个实例通过SentenceTransformer库（使用"paraphrase-multilingual-mpnet-base-v2"模型）进行嵌入。然后计算所有可能对的余弦相似度，相似度大于0.95的被视为同义句，每个同义句组只保留一个元素。最终数据集包含5139个元素。

语言

数据集包含多种语言，但大部分是英语。大约600个实例包含多种语言，这些语言及其出现次数如下： <details> <summary>语言分布</summary> en 4053 zh-cn 423 ko 333 fr 168 ja 151 es 142 no 110 et 97 de 81 ca 78 vi 63 fi 52 zh-tw 47 pt 42 tl 39 ru 24 he 24 id 23 it 22 sv 21 pl 16 nl 16 th 15 ro 11 da 9 tr 8 cs 8 hr 6 uk 5 af 5 ar 4 bg 3 cy 2 sk 2 hu 2 so 2 bn 1 sl 1 hi 1 sw 1 lv 1 el 1 </details>

数据字段

每个实例包含两个字段：

input: 人类与机器人的对话轮次，由人类发起。以Human: 开始，以Assistant: 结束。
output: 机器人的回复

5,000+

优质数据集

54 个

任务类型

进入经典数据集