BramVanroy/WildChat-1M-filtered-gpt-4

Name: BramVanroy/WildChat-1M-filtered-gpt-4
Creator: BramVanroy
Published: 2024-05-04 02:31:45
License: 暂无描述

Hugging Face2024-05-04 更新2024-06-12 收录

下载链接：

https://hf-mirror.com/datasets/BramVanroy/WildChat-1M-filtered-gpt-4

下载链接

链接失效反馈

官方服务：

资源简介：

--- dataset_info: features: - name: messages list: - name: content dtype: string - name: role dtype: string splits: - name: train_sft num_bytes: 1207333657.4581366 num_examples: 124691 - name: test_sft num_bytes: 134152487.54186335 num_examples: 13855 download_size: 702428181 dataset_size: 1341486145.0 configs: - config_name: default data_files: - split: train_sft path: data/train_sft-* - split: test_sft path: data/test_sft-* --- Based on [WildChat](https://huggingface.co/datasets/allenai/WildChat-1M). Only kept gpt-4 completions with no toxicity and removed conversations that had empty content. Only English, Spanish, French, German and Italian samples are kept. ```python from datasets import load_dataset, DatasetDict dataset = load_dataset("allenai/WildChat-1M", num_proc=96, split="train") unique_convos = set(dataset.unique("conversation_hash")) LANGUAGES = ["english", "spanish", "french", "german", "italian"] KEEP_MSG_KEYS = ["content", "role"] avoid_words = ["gpt", "openai"] def filter(sample): if sample["toxic"]: return False if "gpt-4" not in sample["model"]: return False if sample["language"].lower() not in LANGUAGES: return False if any(not m["content"].strip() for m in sample["conversation"]): return False content = " ".join([m["content"].strip() for m in sample["conversation"]]).lower() if any(w in content for w in avoid_words): return False if sample["conversation_hash"] in unique_convos: unique_convos.remove(sample["conversation_hash"]) else: return False return True dataset = dataset.filter(filter, num_proc=96) dataset = dataset.select_columns("conversation").rename_column("conversation", "messages") dataset = dataset.map(lambda sample: {"messages": [{k: v for k, v in message.items() if k in KEEP_MSG_KEYS} for message in sample["messages"]]}, num_proc=96) dataset = dataset.train_test_split(test_size=0.1) dataset = DatasetDict({ "train_sft": dataset["train"], "test_sft": dataset["test"] }) print(dataset) dataset.push_to_hub("BramVanroy/WildChat-1M-filtered-gpt-4") ```

提供机构：

BramVanroy

原始信息汇总

数据集概述

数据集特征

messages
- content: 数据类型为字符串
- role: 数据类型为字符串

数据集分割

train_sft
- 数据大小: 1207333657.4581366 字节
- 示例数量: 124691
test_sft
- 数据大小: 134152487.54186335 字节
- 示例数量: 13855

数据集大小

下载大小: 702428181 字节
数据集总大小: 1341486145.0 字节

配置

config_name: default
- train_sft: 路径为 data/train_sft-*
- test_sft: 路径为 data/test_sft-*

5,000+

优质数据集

54 个

任务类型

进入经典数据集