BramVanroy/WildChat-1M-filtered-gpt-4
收藏Hugging Face2024-05-04 更新2024-06-12 收录
下载链接:
https://hf-mirror.com/datasets/BramVanroy/WildChat-1M-filtered-gpt-4
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
splits:
- name: train_sft
num_bytes: 1207333657.4581366
num_examples: 124691
- name: test_sft
num_bytes: 134152487.54186335
num_examples: 13855
download_size: 702428181
dataset_size: 1341486145.0
configs:
- config_name: default
data_files:
- split: train_sft
path: data/train_sft-*
- split: test_sft
path: data/test_sft-*
---
Based on [WildChat](https://huggingface.co/datasets/allenai/WildChat-1M).
Only kept gpt-4 completions with no toxicity and removed conversations that had empty content. Only English, Spanish, French, German and Italian samples are kept.
```python
from datasets import load_dataset, DatasetDict
dataset = load_dataset("allenai/WildChat-1M", num_proc=96, split="train")
unique_convos = set(dataset.unique("conversation_hash"))
LANGUAGES = ["english", "spanish", "french", "german", "italian"]
KEEP_MSG_KEYS = ["content", "role"]
avoid_words = ["gpt", "openai"]
def filter(sample):
if sample["toxic"]:
return False
if "gpt-4" not in sample["model"]:
return False
if sample["language"].lower() not in LANGUAGES:
return False
if any(not m["content"].strip() for m in sample["conversation"]):
return False
content = " ".join([m["content"].strip() for m in sample["conversation"]]).lower()
if any(w in content for w in avoid_words):
return False
if sample["conversation_hash"] in unique_convos:
unique_convos.remove(sample["conversation_hash"])
else:
return False
return True
dataset = dataset.filter(filter, num_proc=96)
dataset = dataset.select_columns("conversation").rename_column("conversation", "messages")
dataset = dataset.map(lambda sample: {"messages": [{k: v for k, v in message.items() if k in KEEP_MSG_KEYS} for message in sample["messages"]]}, num_proc=96)
dataset = dataset.train_test_split(test_size=0.1)
dataset = DatasetDict({
"train_sft": dataset["train"],
"test_sft": dataset["test"]
})
print(dataset)
dataset.push_to_hub("BramVanroy/WildChat-1M-filtered-gpt-4")
```
提供机构:
BramVanroy
原始信息汇总
数据集概述
数据集特征
- messages
- content: 数据类型为字符串
- role: 数据类型为字符串
数据集分割
- train_sft
- 数据大小: 1207333657.4581366 字节
- 示例数量: 124691
- test_sft
- 数据大小: 134152487.54186335 字节
- 示例数量: 13855
数据集大小
- 下载大小: 702428181 字节
- 数据集总大小: 1341486145.0 字节
配置
- config_name: default
- train_sft: 路径为 data/train_sft-*
- test_sft: 路径为 data/test_sft-*



