five

BramVanroy/WildChat-1M-filtered-gpt-4

收藏
Hugging Face2024-05-04 更新2024-06-12 收录
下载链接:
https://hf-mirror.com/datasets/BramVanroy/WildChat-1M-filtered-gpt-4
下载链接
链接失效反馈
官方服务:
资源简介:
--- dataset_info: features: - name: messages list: - name: content dtype: string - name: role dtype: string splits: - name: train_sft num_bytes: 1207333657.4581366 num_examples: 124691 - name: test_sft num_bytes: 134152487.54186335 num_examples: 13855 download_size: 702428181 dataset_size: 1341486145.0 configs: - config_name: default data_files: - split: train_sft path: data/train_sft-* - split: test_sft path: data/test_sft-* --- Based on [WildChat](https://huggingface.co/datasets/allenai/WildChat-1M). Only kept gpt-4 completions with no toxicity and removed conversations that had empty content. Only English, Spanish, French, German and Italian samples are kept. ```python from datasets import load_dataset, DatasetDict dataset = load_dataset("allenai/WildChat-1M", num_proc=96, split="train") unique_convos = set(dataset.unique("conversation_hash")) LANGUAGES = ["english", "spanish", "french", "german", "italian"] KEEP_MSG_KEYS = ["content", "role"] avoid_words = ["gpt", "openai"] def filter(sample): if sample["toxic"]: return False if "gpt-4" not in sample["model"]: return False if sample["language"].lower() not in LANGUAGES: return False if any(not m["content"].strip() for m in sample["conversation"]): return False content = " ".join([m["content"].strip() for m in sample["conversation"]]).lower() if any(w in content for w in avoid_words): return False if sample["conversation_hash"] in unique_convos: unique_convos.remove(sample["conversation_hash"]) else: return False return True dataset = dataset.filter(filter, num_proc=96) dataset = dataset.select_columns("conversation").rename_column("conversation", "messages") dataset = dataset.map(lambda sample: {"messages": [{k: v for k, v in message.items() if k in KEEP_MSG_KEYS} for message in sample["messages"]]}, num_proc=96) dataset = dataset.train_test_split(test_size=0.1) dataset = DatasetDict({ "train_sft": dataset["train"], "test_sft": dataset["test"] }) print(dataset) dataset.push_to_hub("BramVanroy/WildChat-1M-filtered-gpt-4") ```
提供机构:
BramVanroy
原始信息汇总

数据集概述

数据集特征

  • messages
    • content: 数据类型为字符串
    • role: 数据类型为字符串

数据集分割

  • train_sft
    • 数据大小: 1207333657.4581366 字节
    • 示例数量: 124691
  • test_sft
    • 数据大小: 134152487.54186335 字节
    • 示例数量: 13855

数据集大小

  • 下载大小: 702428181 字节
  • 数据集总大小: 1341486145.0 字节

配置

  • config_name: default
    • train_sft: 路径为 data/train_sft-*
    • test_sft: 路径为 data/test_sft-*
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作