lmsys_chat_1m_clean_R1

Name: lmsys_chat_1m_clean_R1
Creator: maas
Published: 2026-01-02 16:29:11
License: 暂无描述

魔搭社区2026-01-02 更新2025-04-12 收录

下载链接：

https://modelscope.cn/datasets/oumi-ai/lmsys_chat_1m_clean_R1

下载链接

链接失效反馈

官方服务：

资源简介：

[![oumi logo](https://oumi.ai/logo_lockup_black.svg)](https://github.com/oumi-ai/oumi) [![Made with Oumi](https://badgen.net/badge/Made%20with/Oumi/%23085CFF?icon=https%3A%2F%2Foumi.ai%2Flogo_dark.svg)](https://github.com/oumi-ai/oumi) [![Documentation](https://img.shields.io/badge/Documentation-oumi-blue.svg)](https://oumi.ai/docs/en/latest/index.html) [![Blog](https://img.shields.io/badge/Blog-oumi-blue.svg)](https://oumi.ai/blog) [![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi) # oumi-ai/lmsys_chat_1m_clean_R1 **lmsys_chat_1m_clean_R1** is a text dataset designed to train Conversational Language Models with **DeepSeek-R1 level reasoning**. Prompts were pulled from [LMSYS](https://huggingface.co/datasets/lmsys/lmsys-chat-1m) and filtered to [lmsys_chat_1m_clean](https://huggingface.co/datasets/OpenLeecher/lmsys_chat_1m_clean), and responses were taken from **[DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1)** without additional filters present. We release **lmsys_chat_1m_clean_R1** to help enable the community to develop the best fully open reasoning model! [lmsys_chat_1m_clean](https://huggingface.co/datasets/OpenLeecher/lmsys_chat_1m_clean) queries with responses generated from [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) - **Curated by:** [Oumi AI](https://oumi.ai/) using Oumi inference on [Parasail](https://www.parasail.io/) - **Language(s) (NLP):** English - **License:** [Apache 2.0](https://opensource.org/license/apache-2-0) - **Demo:** [See the MiniMath notebook for a similar example](https://github.com/oumi-ai/oumi/blob/307436bd98706cb9ce7b0bbf31204770af2b7c8c/notebooks/Oumi%20-%20MiniMath-R1-1.5B.ipynb) ## Uses  Use this dataset for supervised fine-tuning of LLMs by including it into a training mixture for creating an R1-like model. ## Out-of-Scope Use  This dataset covers a broad coverage of use-cases documented in the [original dataset](https://huggingface.co/datasets/OpenLeecher/lmsys_chat_1m_clean), but is likely reflective of only one particular set of users (LMSYS Chatbot Arena submissions) ## Dataset Structure  ``` { # Unique conversation identifier, tied back to lmsys_chat_1m_clean samples "conversation_id": str, # The user turn/prompt "prompt": str, # The assistant (DeepSeek R1) response # Includes the thought trace which is wrapped in <think> and </think> tags "response": str, # Data formatted to user + assistant turns in chat format # Example: [{'role': 'user', 'content': ...}, {'role': 'assistant', 'content': ...}] "messages": list[dict[str, str]], # Metadata for sample "metadata": dict[str, ...], } ``` ## Dataset Creation ### Curation Rationale  To enable the community to develop a fully-open state-of-the-art Foundational Language Model, we've produced and released this dataset to serve as part of the foundation of reasoning data for the model. It was produced using the Oumi’s inference capabilities on Parasail. ### Source Data  Queries were sourced from [lmsys_chat_1m_clean](https://huggingface.co/datasets/OpenLeecher/lmsys_chat_1m_clean) which is data filtered from the original LMSYS Chat 1M dataset. #### Data Collection and Processing  * Responses were collected via Oumi's batch inference support for [Parasail](https://parasail.io/). * Samples which could not be parsed were discarded (<100). * All other samples include metadata indicating if they are complete or not (which was determined by whether or not a `</think>` token is present) #### Personal and Sensitive Information  Data is not known or likely to contain any personal, sensitive, or private information, but it is possible due to the nature of the data (submitted queries from LMSYS Chatbot Arena) ## Bias, Risks, and Limitations  1. The source prompts are from [lmsys_chat_1m_clean](https://huggingface.co/datasets/OpenLeecher/lmsys_chat_1m_clean) `conversations` column and may reflect any biases in the data filtration process. 2. Some prompts contained within may be adversarial or controversial in their queries or content. 3. The responses produced will likely be reflective of any biases or limitations produced by DeepSeek-R1. ## Citation  **BibTeX:** ``` @misc{lmsysChat1mCleanR12025, author = {Jeremiah Greer}, title = {lmsys_chat_1m_clean_R1 Dataset}, month = {February}, year = {2025}, url = {https://huggingface.co/datasets/oumi-ai/lmsys_chat_1m_clean_R1} } @software{oumi2025, author = {Oumi Community}, title = {Oumi: an Open, End-to-end Platform for Building Large Foundation Models}, month = {January}, year = {2025}, url = {https://github.com/oumi-ai/oumi} } ```

[![oumi logo](https://oumi.ai/logo_lockup_black.svg)](https://github.com/oumi-ai/oumi) [![Made with Oumi](https://badgen.net/badge/Made%20with/Oumi/%23085CFF?icon=https%3A%2F%2Foumi.ai%2Flogo_dark.svg)](https://github.com/oumi-ai/oumi) [![Documentation](https://img.shields.io/badge/Documentation-oumi-blue.svg)](https://oumi.ai/docs/en/latest/index.html) [![Blog](https://img.shields.io/badge/Blog-oumi-blue.svg)](https://oumi.ai/blog) [![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi) # oumi-ai/lmsys_chat_1m_clean_R1 **lmsys_chat_1m_clean_R1** 是一款专为训练具备**DeepSeek-R1级推理能力**的对话式语言模型设计的文本数据集。提示词源自[LMSYS](https://huggingface.co/datasets/lmsys/lmsys-chat-1m)，经筛选得到[lmsys_chat_1m_clean](https://huggingface.co/datasets/OpenLeecher/lmsys_chat_1m_clean)，回复则直接取自**[DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1)**，未施加额外筛选规则。我们发布**lmsys_chat_1m_clean_R1**，旨在助力社区研发最优的全开源推理模型！ [lmsys_chat_1m_clean](https://huggingface.co/datasets/OpenLeecher/lmsys_chat_1m_clean) 中的查询语句，其回复由[DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1)生成。 - **整理方**：[Oumi AI](https://oumi.ai/)，依托Parasail平台的Oumi推理能力完成整理 - **自然语言处理（NLP）所用语言**：英语 - **授权协议**：[Apache 2.0](https://opensource.org/license/apache-2.0) - **演示示例**：[可参考MiniMath笔记本以获取类似案例](https://github.com/oumi-ai/oumi/blob/307436bd98706cb9ce7b0bbf31204770af2b7c8c/notebooks/Oumi%20-%20MiniMath-R1-1.5B.ipynb) ## 适用场景  本数据集可被纳入训练混合集，用于大语言模型（Large Language Model）的监督微调，以研发具备类似R1推理能力的模型。 ## 不适用场景  本数据集覆盖了[原始数据集](https://huggingface.co/datasets/OpenLeecher/lmsys_chat_1m_clean)中记录的众多应用场景，但仅能反映一类特定用户群体的行为（即LMSYS聊天机器人竞技场的提交内容）。 ## 数据集结构  { # 唯一对话标识符，与lmsys_chat_1m_clean数据样本关联 "conversation_id": 字符串类型, # 用户轮次/提示词 "prompt": 字符串类型, # 助手（DeepSeek R1）的回复内容 # 包含包裹在<think>与</think>标签内的思维轨迹 "response": 字符串类型, # 按照对话格式整理的用户+助手轮次数据 # 示例：[{'role': 'user', 'content': ...}, {'role': 'assistant', 'content': ...}] "messages": 字典字符串列表类型, # 样本元数据 "metadata": 任意键值字典类型, } ## 数据集构建 ### 整理初衷  为助力社区研发全开源的顶尖基础大语言模型，我们制作并发布本数据集，以作为该模型推理训练数据的核心组成部分。本数据集依托Parasail平台的Oumi推理能力生成。 ### 源数据  查询语句源自[lmsys_chat_1m_clean](https://huggingface.co/datasets/OpenLeecher/lmsys_chat_1m_clean)，该数据集是从原始LMSYS Chat 1M数据集经过筛选后得到的。 #### 数据收集与处理  * 回复内容通过Oumi针对[Parasail](https://parasail.io/)的批量推理功能收集得到。 * 无法解析的样本已被丢弃（数量不足100个）。 * 其余所有样本均包含元数据，用于标记样本是否完整（判断依据为是否存在`</think>`标记）。 #### 个人与敏感信息  目前已知本数据集不包含任何个人、敏感或私密信息，但鉴于数据来源为LMSYS聊天机器人竞技场的提交查询，仍存在潜在风险。 ## 偏差、风险与局限性  1. 源提示词取自[lmsys_chat_1m_clean](https://huggingface.co/datasets/OpenLeecher/lmsys_chat_1m_clean)的`conversations`列，可能反映数据筛选过程中存在的各类偏差。 2. 部分提示词的查询或内容可能具有对抗性或争议性。 3. 生成的回复可能会反映DeepSeek-R1模型本身存在的偏差与局限性。 ## 引用说明  **BibTeX格式引用：** @misc{lmsysChat1mCleanR12025, author = {Jeremiah Greer}, title = {lmsys_chat_1m_clean_R1 Dataset}, month = {February}, year = {2025}, url = {https://huggingface.co/datasets/oumi-ai/lmsys_chat_1m_clean_R1} } @software{oumi2025, author = {Oumi Community}, title = {Oumi: an Open, End-to-end Platform for Building Large Foundation Models}, month = {January}, year = {2025}, url = {https://github.com/oumi-ai/oumi} }

提供机构：

maas

创建时间：

2025-04-09

5,000+

优质数据集

54 个

任务类型

进入经典数据集