viktoroo/combined-chat-datasets

Name: viktoroo/combined-chat-datasets
Creator: viktoroo
Published: 2026-04-09 14:34:29
License: 暂无描述

Hugging Face2026-04-09 更新2026-04-12 收录

下载链接：

https://hf-mirror.com/datasets/viktoroo/combined-chat-datasets

下载链接

链接失效反馈

官方服务：

资源简介：

--- language: - en - zh - es - fr - de - ru - pt - multilingual license: other license_name: mixed-see-per-config license_link: LICENSE.md task_categories: - text-generation - text-classification - text-ranking size_categories: - 10M<n<100M pretty_name: Combined Chat Datasets tags: - conversations - chat - preferences - rlhf - alignment - arena - wildchat - sharegpt - standardized configs: - config_name: lmsys_chat_1m data_files: - split: train path: data/lmsys_chat_1m/train-* - config_name: sharegpt52k data_files: - split: train path: data/sharegpt52k/train-* - config_name: sharechat data_files: - split: train path: data/sharechat/train-* - config_name: collective_cognition data_files: - split: train path: data/collective_cognition/train-* - config_name: sharelm data_files: - split: train path: data/sharelm/train-* - config_name: studychat data_files: - split: train path: data/studychat/train-* - config_name: wildchat data_files: - split: train path: data/wildchat/train-* - config_name: wildchat_1m data_files: - split: train path: data/wildchat_1m/train-* - config_name: chatbot_arena_33k data_files: - split: train path: data/chatbot_arena_33k/train-* - config_name: arena_pref_55k data_files: - split: train path: data/arena_pref_55k/train-* - config_name: arena_pref_100k data_files: - split: train path: data/arena_pref_100k/train-* - config_name: arena_pref_140k data_files: - split: train path: data/arena_pref_140k/train-* - config_name: search_arena_24k data_files: - split: train path: data/search_arena_24k/train-* - config_name: oasst1 data_files: - split: train path: data/oasst1/train-* - config_name: oasst2 data_files: - split: train path: data/oasst2/train-* - config_name: hh_rlhf data_files: - split: train path: data/hh_rlhf/train-* - config_name: prism_alignment data_files: - split: train path: data/prism_alignment/train-* - config_name: dices data_files: - split: train path: data/dices/train-* - config_name: ultrafeedback data_files: - split: train path: data/ultrafeedback/train-* - config_name: nectar data_files: - split: train path: data/nectar/train-* - config_name: helpsteer data_files: - split: train path: data/helpsteer/train-* - config_name: helpsteer2 data_files: - split: train path: data/helpsteer2/train-* - config_name: helpsteer3 data_files: - split: train path: data/helpsteer3/train-* - config_name: shp data_files: - split: train path: data/shp/train-* - config_name: shp2 data_files: - split: train path: data/shp2/train-* - config_name: dolly_15k data_files: - split: train path: data/dolly_15k/train-* - config_name: no_robots data_files: - split: train path: data/no_robots/train-* - config_name: aya_dataset data_files: - split: train path: data/aya_dataset/train-* - config_name: hc3 data_files: - split: train path: data/hc3/train-* - config_name: arena_hard_auto data_files: - split: train path: data/arena_hard_auto/train-* --- # Combined Chat Datasets A standardized, unified collection of **30 conversational AI datasets** -- spanning organic in-the-wild chats, voluntary sharing, side-by-side preferences, conversation trees, RLHF pairs, and crowdsourced instruction tuning data -- normalized to a single schema for easy joint use. > **This dataset is a re-distribution. It does not relicense the underlying data.** > See the [Legal & Licensing](#legal--licensing) section -- you must comply with each source dataset's original license. ## Quick start ```python from datasets import load_dataset # Load a single source dataset ds = load_dataset("viktoroo/combined-chat-datasets", "lmsys_chat_1m") print(ds["train"][0]) # Iterate over the messages for msg in ds["train"][0]["messages"]: print(f"{msg['role']}: {msg['content']}") ``` To list all available configs: ```python from datasets import get_dataset_config_names print(get_dataset_config_names("viktoroo/combined-chat-datasets")) ``` Each config corresponds to one source dataset. The schema is shared across all configs, so you can concatenate freely: ```python from datasets import concatenate_datasets, load_dataset a = load_dataset("viktoroo/combined-chat-datasets", "wildchat", split="train") b = load_dataset("viktoroo/combined-chat-datasets", "lmsys_chat_1m", split="train") combined = concatenate_datasets([a, b]) ``` ## Why combine these? Conversational data for LLM training and evaluation is scattered across dozens of repositories with **6 different schema patterns**, inconsistent role names, varied preference encodings, and different timestamp formats. This dataset provides: 1. **One unified schema** -- every row has `messages: list[{role, content}]`, no matter the source. 2. **One loading API** -- `load_dataset(..., "config_name")` for all 30 datasets. 3. **Per-source configs** -- load just what you need; no need to download 30+ GB to access one dataset. 4. **Provenance preserved** -- `source_dataset` column always identifies the origin so you can filter, weight, or trace back. ## Datasets included | Config | Source | Type | Rows | License | |--------|--------|------|------|---------| | `lmsys_chat_1m` | [lmsys/lmsys-chat-1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m) | Full conversations | 1,000,000 | Custom (gated, research) | | `sharegpt52k` | [RyokoAI/ShareGPT52K](https://huggingface.co/datasets/RyokoAI/ShareGPT52K) | Full conversations | ~90,000 | Varies | | `sharechat` | [tucnguyen/ShareChat](https://huggingface.co/datasets/tucnguyen/ShareChat) | Full conversations | 660,293 | Custom (gated, research) | | `collective_cognition` | [CollectiveCognition/chats-data-2023-09-27](https://huggingface.co/datasets/CollectiveCognition/chats-data-2023-09-27) | Full conversations | 200 | MIT | | `sharelm` | [shachardon/ShareLM](https://huggingface.co/datasets/shachardon/ShareLM) | Full conversations | 3,551,155 | See source | | `studychat` | [wmcnicho/StudyChat](https://huggingface.co/datasets/wmcnicho/StudyChat) | Full conversations | 16,851 | CC-BY-4.0 (gated) | | `wildchat` | [allenai/WildChat](https://huggingface.co/datasets/allenai/WildChat) | Full conversations | 529,428 | ODC-BY | | `wildchat_1m` | [allenai/WildChat-1M](https://huggingface.co/datasets/allenai/WildChat-1M) | Full conversations | 837,989 | ODC-BY | | `chatbot_arena_33k` | [lmsys/chatbot_arena_conversations](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations) | Side-by-side preference | 33,000 | CC-BY-4.0 / CC-BY-NC-4.0 (gated) | | `arena_pref_55k` | [lmarena-ai/arena-human-preference-55k](https://huggingface.co/datasets/lmarena-ai/arena-human-preference-55k) | Side-by-side preference | 57,477 | Apache-2.0 | | `arena_pref_100k` | [lmarena-ai/arena-human-preference-100k](https://huggingface.co/datasets/lmarena-ai/arena-human-preference-100k) | Side-by-side preference | 106,134 | Custom (gated) | | `arena_pref_140k` | [lmarena-ai/arena-human-preference-140k](https://huggingface.co/datasets/lmarena-ai/arena-human-preference-140k) | Side-by-side preference | 135,634 | CC-BY-4.0 (gated) | | `search_arena_24k` | [lmarena-ai/search-arena-24k](https://huggingface.co/datasets/lmarena-ai/search-arena-24k) | Side-by-side preference | 24,069 | CC-BY-4.0 | | `oasst1` | [OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1) | Conversation trees | 88,838 | Apache-2.0 | | `oasst2` | [OpenAssistant/oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2) | Conversation trees | 135,174 | Apache-2.0 | | `hh_rlhf` | [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) | Chosen/rejected pairs | 169,352 | MIT | | `prism_alignment` | [HannahRoseKirk/prism-alignment](https://huggingface.co/datasets/HannahRoseKirk/prism-alignment) | Chosen/rejected pairs | 77,882 | CC-BY-NC-4.0 | | `dices` | [google-research-datasets/dices-dataset](https://github.com/google-research-datasets/dices-dataset) | Chosen/rejected pairs | 115,153 | CC-BY-4.0 | | `ultrafeedback` | [openbmb/UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) | Ranked multi-response | 63,967 | MIT | | `nectar` | [berkeley-nest/Nectar](https://huggingface.co/datasets/berkeley-nest/Nectar) | Ranked multi-response | 182,954 | See source | | `helpsteer` | [nvidia/HelpSteer](https://huggingface.co/datasets/nvidia/HelpSteer) | Ranked multi-response | 37,120 | CC-BY-4.0 | | `helpsteer2` | [nvidia/HelpSteer2](https://huggingface.co/datasets/nvidia/HelpSteer2) | Ranked multi-response | 21,362 | CC-BY-4.0 | | `helpsteer3` | [nvidia/HelpSteer3](https://huggingface.co/datasets/nvidia/HelpSteer3) | Ranked multi-response | 132,937 | CC-BY-4.0 | | `shp` | [stanfordnlp/SHP](https://huggingface.co/datasets/stanfordnlp/SHP) | Ranked multi-response | 385,563 | See source | | `shp2` | [stanfordnlp/SHP-2](https://huggingface.co/datasets/stanfordnlp/SHP-2) | Ranked multi-response | 4,067,043 | See source | | `dolly_15k` | [databricks/databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) | Single-turn | 15,011 | CC-BY-SA-3.0 | | `no_robots` | [HuggingFaceH4/no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots) | Single-turn | 10,000 | CC-BY-NC-4.0 | | `aya_dataset` | [CohereLabs/aya_dataset](https://huggingface.co/datasets/CohereLabs/aya_dataset) | Single-turn | 205,568 | Apache-2.0 | | `hc3` | [Hello-SimpleAI/HC3](https://huggingface.co/datasets/Hello-SimpleAI/HC3) | Single-turn | 24,322 | See source | | `arena_hard_auto` | [lmarena-ai/arena-hard-auto](https://huggingface.co/datasets/lmarena-ai/arena-hard-auto) | Prompts only | 1,250 | See source | **Total: 30 configs, ~12.7M rows.** ## Schema Every row in every config conforms to this unified schema: | Field | Type | Description | |-------|------|-------------| | `conversation_id` | `string` | Globally unique ID (UUID) | | `source_dataset` | `string` | One of the 30 config names above | | `messages` | `list[{role: string, content: string}]` | Conversation turns. `role` is `"user"`, `"assistant"`, or `"system"` | | `model` | `string?` | Primary model name (null if human-only or mixed) | | `language` | `string?` | ISO 639-1 code (e.g. `"en"`) | | `num_turns` | `int32` | Number of messages in `messages` | | `created_at` | `string?` | ISO-8601 timestamp, UTC | | `messages_b` | `list[{role, content}]?` | Second conversation in a pair (preference datasets only) | | `model_b` | `string?` | Second model name (preference datasets only) | | `winner` | `string?` | `"a"`, `"b"`, `"tie"`, or null | | `judge_type` | `string?` | `"human"`, `"llm"`, `"upvotes"`, or null | | `score_helpfulness` | `float32?` | Normalized to [0, 1]; null if not annotated | | `score_correctness` | `float32?` | Normalized to [0, 1]; null if not annotated | | `score_coherence` | `float32?` | Normalized to [0, 1]; null if not annotated | | `score_safety` | `float32?` | Normalized to [0, 1]; null if not annotated | | `score_overall` | `float32?` | Normalized to [0, 1]; null if not annotated | ### Role normalization The following source role names are mapped to the standardized values: | Source convention | Standardized | |-------------------|--------------| | `user`, `human`, `prompter`, `Human` | `user` | | `assistant`, `gpt`, `chatbot`, `Assistant` | `assistant` | | `system` | `system` | ### Preference datasets For datasets that compare two model outputs (Arena, HH-RLHF, SHP, PRISM, etc.), each row stores **both** conversations: `messages` is conversation A, `messages_b` is conversation B. The `winner` field indicates which one won. This avoids requiring a join while keeping a single-table schema. ### Quality scores Datasets with per-response attribute ratings (HelpSteer, UltraFeedback, etc.) populate the relevant `score_*` fields. All scores are **normalized to [0, 1]** regardless of the source scale (Likert 0–4, Likert 1–5, Reddit upvotes, GPT-4 ratings, etc.). Use the `source_dataset` field if you need to recover the original scale. ## Example usages ### Filter by language ```python ds = load_dataset("viktoroo/combined-chat-datasets", "wildchat_1m", split="train") english = ds.filter(lambda x: x["language"] == "en") ``` ### Build a multi-source training set ```python from datasets import concatenate_datasets, load_dataset sources = ["wildchat_1m", "lmsys_chat_1m", "sharelm", "oasst2"] train = concatenate_datasets( [load_dataset("viktoroo/combined-chat-datasets", s, split="train") for s in sources] ) print(f"Total rows: {len(train):,}") ``` ### Extract preference pairs ```python ds = load_dataset("viktoroo/combined-chat-datasets", "arena_pref_140k", split="train") pairs = ds.filter(lambda x: x["winner"] in ("a", "b")) print(f"{len(pairs)} non-tie preference pairs") ``` ### Convert to the OpenAI chat format ```python def to_openai_format(row): return {"messages": row["messages"]} ds = load_dataset("viktoroo/combined-chat-datasets", "no_robots", split="train") openai_format = ds.map(to_openai_format) ``` ## How it was built 1. **Catalog** -- 30 datasets identified with rough sizes, schemas, licenses ([code repository](https://github.com/viktor-shcherb/combined-chat-datasets)). 2. **Download** -- raw files fetched from HuggingFace Hub (using `snapshot_download`) or, for DICES, directly from GitHub. 3. **Convert** -- per-dataset converter scripts (under `converters/` in the code repo) read each raw format and emit Parquet matching the unified schema. Includes role mapping, language code normalization, timestamp conversion to ISO-8601, score normalization, and tree linearization (OASST). 4. **Publish** -- one folder per source dataset under `data/`, with each folder a HuggingFace config. The full code, including the download script, converters, and upload tooling, is available at: **https://github.com/viktor-shcherb/combined-chat-datasets** ## Known limitations - **Multimodal data** in `arena_pref_140k` is flattened to text; image references are preserved as placeholder strings in `content` but original image bytes are NOT included. - **OASST1/2 trees** are linearized into one row per root-to-leaf path. Branch ranks and tree structure are encoded in metadata, but if you need the full tree, use the original `OpenAssistant/oasst1` and `OpenAssistant/oasst2` repos. - **HH-RLHF** human/assistant transcripts are parsed from delimited strings; turn boundaries are inferred and may occasionally split incorrectly on edge cases. - **DICES** safety annotations include rich rater demographics (gender, race, age) that go beyond the unified schema; only the conversation + binary preference is preserved here. Refer to the original GitHub repo for full annotations. - The `score_*` fields are normalized to [0, 1] which loses the original scale granularity. The `source_dataset` field tells you the original convention. ## Legal & Licensing **Critical:** This dataset is a redistribution of 30 independently-licensed datasets. **Each dataset retains its original copyright and license**, and downloading users are individually responsible for complying with each one. ### What this means for you - Some datasets are **research-only** (CC-BY-NC, custom licenses). You may not use them for commercial purposes. - Some datasets are **gated** at the source -- the original publishers required users to accept terms before access. Even though this redistribution may not enforce that gating, **you remain bound by those original terms**. - Some datasets require **attribution** when you publish work derived from them. - The `source_dataset` column on every row identifies the origin -- use it to look up the applicable license. ### Per-config license summary See the table in [Datasets included](#datasets-included) above. Click each source link to read the full license on the original repository. ### Citation If you use this collection, please cite the **individual source datasets**, not this aggregator. Each source's HuggingFace page lists its preferred citation. We do not request a citation for the aggregation itself. ### No warranty This redistribution is provided "as is", without warranty of any kind. The maintainers of this aggregation make no representations about the accuracy, completeness, or fitness for any particular purpose of the underlying datasets. ### Removal requests If you are an author of one of the source datasets and want this aggregation modified or removed, please open an issue at https://github.com/viktor-shcherb/combined-chat-datasets.

提供机构：

viktoroo

5,000+

优质数据集

54 个

任务类型

进入经典数据集