OldKingMeister/lmsys-arena-processed-data

Name: OldKingMeister/lmsys-arena-processed-data
Creator: OldKingMeister
Published: 2026-03-07 11:20:55
License: No description available

Source: Hugging Face. Updated 2026-03-07. Indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/OldKingMeister/lmsys-arena-processed-data


Description:

---
language:
- en
tags:
- lmsys
- chatbot-arena
- preference-modeling
- reward-modeling
- kaggle
- conversational-ai
license: other
task_categories:
- text-classification
size_categories:
- 100K<n<1M
---

# LMSYS Chatbot Arena - Processed Data

This dataset contains processed data from the [LMSYS Chatbot Arena Competition](https://www.kaggle.com/competitions/lmsys-chatbot-arena) on Kaggle.

## Dataset Description

The task is **preference modeling** (also known as reward modeling): given a prompt and two responses (Response A and Response B), predict which response humans prefer.

### Files

| File | Size | Description |
|------|------|-------------|
| `train.csv` | 176 MB | Original training data with conversation pairs and winner labels |
| `prompt_a_prompt_b.csv` | 366 MB | Data with pre-processed prompt_a and prompt_b columns |
| `train_combined.csv` | 534 MB | Fully processed data with combined prompts and responses |
| `corpus.json` | 112 MB | Text corpus for TF-IDF processing |

### Data Format

All CSV files contain the following key columns:

- `id`: Sample identifier
- `model_a`, `model_b`: Names of the models being compared
- `prompt`: JSON array of conversation turns
- `response_a`, `response_b`: JSON array of responses from each model
- `winner_model_a`, `winner_model_b`, `winner_tie`: Binary labels indicating human preference

## Usage

### Loading the Data

```python
import pandas as pd

# Load original training data
train_df = pd.read_csv("train.csv")

# Load processed data with combined prompts
combined_df = pd.read_csv("train_combined.csv")
```

### Data Processing Pipeline

The data goes through several processing steps:

1. **Raw Data** (`train.csv`): Original conversations with multiple turns
2. **Prompt Split** (`prompt_a_prompt_b.csv`): Prompts formatted for comparison
3. **Combined** (`train_combined.csv`): Ready-to-use format for model training

### Example Data Structure

```python
{
    "id": 30192,
    "model_a": "gpt-4-1106-preview",
    "model_b": "gpt-4-0613",
    "prompt": ["Is it morally right to try to have a certain percentage...", "OK, does pineapple belong on a pizza?"],
    "response_a": ["The question of whether it is morally right...", "Ah, the age-old culinary conundrum..."],
    "response_b": ["As an AI, I don't have personal beliefs...", "As an AI, I don't eat..."],
    "winner_model_a": 1,
    "winner_model_b": 0,
    "winner_tie": 0
}
```

## Citation

```bibtex
@misc{lmsys-arena-2024,
  title={LMSYS Chatbot Arena Competition},
  howpublished={https://www.kaggle.com/competitions/lmsys-chatbot-arena},
  year={2024}
}
```

## Related Models

Trained models using this data are available on Hugging Face:

- [gemma-2b-lmsys-arena-final](https://huggingface.co/OldKingMeister/gemma-2b-lmsys-arena-final)
- [llama-3-8b-lmsys-arena-final](https://huggingface.co/OldKingMeister/llama-3-8b-lmsys-arena-final)
- [llama-3-8b-instruct-lmsys-arena-final](https://huggingface.co/OldKingMeister/llama-3-8b-instruct-lmsys-arena-final)
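
The data format described above stores `prompt`, `response_a`, and `response_b` as JSON arrays and the human preference as three binary columns. The following is a minimal sketch of how one row might be decoded into aligned conversation turns and a single class label. It assumes the array columns arrive as JSON strings (as they would when read from CSV) and that exactly one winner column is set per row. The helper name `decode_row`, the label encoding (0 = A wins, 1 = B wins, 2 = tie), and the placeholder texts are illustrative and not part of the dataset.

```python
import json

# Order of the one-hot preference columns; the index in this tuple becomes
# the class label (0 = model A wins, 1 = model B wins, 2 = tie).
LABEL_COLUMNS = ("winner_model_a", "winner_model_b", "winner_tie")


def decode_row(row):
    """Decode one row (a mapping of column name -> value).

    Hypothetical helper: parses the JSON array columns and collapses the
    three binary winner columns into a single integer label.
    """
    prompts = json.loads(row["prompt"])
    responses_a = json.loads(row["response_a"])
    responses_b = json.loads(row["response_b"])

    flags = [int(row[col]) for col in LABEL_COLUMNS]
    if sum(flags) != 1:
        raise ValueError(f"expected exactly one winner flag, got {flags}")
    label = flags.index(1)

    # Pair each prompt with the two competing responses for that turn.
    turns = list(zip(prompts, responses_a, responses_b))
    return {"id": row["id"], "turns": turns, "label": label}


# Placeholder row with the same shape as the example record (texts are made up).
example_row = {
    "id": 30192,
    "prompt": json.dumps(["first question", "second question"]),
    "response_a": json.dumps(["A answer 1", "A answer 2"]),
    "response_b": json.dumps(["B answer 1", "B answer 2"]),
    "winner_model_a": 1,
    "winner_model_b": 0,
    "winner_tie": 0,
}

decoded = decode_row(example_row)
print(decoded["label"], len(decoded["turns"]))
```

With pandas, the same helper could be applied row by row, for example via `train_df.apply(decode_row, axis=1)`, since each row supports lookup by column name.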

Provider:

OldKingMeister
