lmsys_chat_1m_clean_R1
收藏魔搭社区2026-01-02 更新2025-04-12 收录
下载链接:
https://modelscope.cn/datasets/oumi-ai/lmsys_chat_1m_clean_R1
下载链接
链接失效反馈官方服务:
资源简介:
[](https://github.com/oumi-ai/oumi)
[](https://github.com/oumi-ai/oumi)
[](https://oumi.ai/docs/en/latest/index.html)
[](https://oumi.ai/blog)
[](https://discord.gg/oumi)
# oumi-ai/lmsys_chat_1m_clean_R1
**lmsys_chat_1m_clean_R1** is a text dataset designed to train Conversational Language Models with **DeepSeek-R1 level reasoning**.
Prompts were pulled from [LMSYS](https://huggingface.co/datasets/lmsys/lmsys-chat-1m) and filtered to [lmsys_chat_1m_clean](https://huggingface.co/datasets/OpenLeecher/lmsys_chat_1m_clean), and responses were taken from **[DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1)** without additional filters present.
We release **lmsys_chat_1m_clean_R1** to help enable the community to develop the best fully open reasoning model!
[lmsys_chat_1m_clean](https://huggingface.co/datasets/OpenLeecher/lmsys_chat_1m_clean) queries with responses generated from [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1)
- **Curated by:** [Oumi AI](https://oumi.ai/) using Oumi inference on [Parasail](https://www.parasail.io/)
- **Language(s) (NLP):** English
- **License:** [Apache 2.0](https://opensource.org/license/apache-2-0)
- **Demo:** [See the MiniMath notebook for a similar example](https://github.com/oumi-ai/oumi/blob/307436bd98706cb9ce7b0bbf31204770af2b7c8c/notebooks/Oumi%20-%20MiniMath-R1-1.5B.ipynb)
## Uses
<!-- This section describes suitable use cases for the dataset. -->
Use this dataset for supervised fine-tuning of LLMs by including it into a training mixture for creating an R1-like model.
## Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->
This dataset covers a broad coverage of use-cases documented in the [original dataset](https://huggingface.co/datasets/OpenLeecher/lmsys_chat_1m_clean), but is likely reflective of only one particular set of users (LMSYS Chatbot Arena submissions)
## Dataset Structure
<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->
```
{
# Unique conversation identifier, tied back to lmsys_chat_1m_clean samples
"conversation_id": str,
# The user turn/prompt
"prompt": str,
# The assistant (DeepSeek R1) response
# Includes the thought trace which is wrapped in <think> and </think> tags
"response": str,
# Data formatted to user + assistant turns in chat format
# Example: [{'role': 'user', 'content': ...}, {'role': 'assistant', 'content': ...}]
"messages": list[dict[str, str]],
# Metadata for sample
"metadata": dict[str, ...],
}
```
## Dataset Creation
### Curation Rationale
<!-- Motivation for the creation of this dataset. -->
To enable the community to develop a fully-open state-of-the-art Foundational Language Model, we've produced and released this dataset to serve as part of the foundation of reasoning data for the model. It was produced using the Oumi’s inference capabilities on Parasail.
### Source Data
<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->
Queries were sourced from [lmsys_chat_1m_clean](https://huggingface.co/datasets/OpenLeecher/lmsys_chat_1m_clean) which is data filtered from the original LMSYS Chat 1M dataset.
#### Data Collection and Processing
<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->
* Responses were collected via Oumi's batch inference support for [Parasail](https://parasail.io/).
* Samples which could not be parsed were discarded (<100).
* All other samples include metadata indicating if they are complete or not (which was determined by whether or not a `</think>` token is present)
#### Personal and Sensitive Information
<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->
Data is not known or likely to contain any personal, sensitive, or private information, but it is possible due to the nature of the data (submitted queries from LMSYS Chatbot Arena)
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
1. The source prompts are from [lmsys_chat_1m_clean](https://huggingface.co/datasets/OpenLeecher/lmsys_chat_1m_clean) `conversations` column and may reflect any biases in the data filtration process.
2. Some prompts contained within may be adversarial or controversial in their queries or content.
3. The responses produced will likely be reflective of any biases or limitations produced by DeepSeek-R1.
## Citation
<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->
**BibTeX:**
```
@misc{lmsysChat1mCleanR12025,
author = {Jeremiah Greer},
title = {lmsys_chat_1m_clean_R1 Dataset},
month = {February},
year = {2025},
url = {https://huggingface.co/datasets/oumi-ai/lmsys_chat_1m_clean_R1}
}
@software{oumi2025,
author = {Oumi Community},
title = {Oumi: an Open, End-to-end Platform for Building Large Foundation Models},
month = {January},
year = {2025},
url = {https://github.com/oumi-ai/oumi}
}
```
[](https://github.com/oumi-ai/oumi)
[](https://github.com/oumi-ai/oumi)
[](https://oumi.ai/docs/en/latest/index.html)
[](https://oumi.ai/blog)
[](https://discord.gg/oumi)
# oumi-ai/lmsys_chat_1m_clean_R1
**lmsys_chat_1m_clean_R1** 是一款专为训练具备**DeepSeek-R1级推理能力**的对话式语言模型设计的文本数据集。
提示词源自[LMSYS](https://huggingface.co/datasets/lmsys/lmsys-chat-1m),经筛选得到[lmsys_chat_1m_clean](https://huggingface.co/datasets/OpenLeecher/lmsys_chat_1m_clean),回复则直接取自**[DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1)**,未施加额外筛选规则。
我们发布**lmsys_chat_1m_clean_R1**,旨在助力社区研发最优的全开源推理模型!
[lmsys_chat_1m_clean](https://huggingface.co/datasets/OpenLeecher/lmsys_chat_1m_clean) 中的查询语句,其回复由[DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1)生成。
- **整理方**:[Oumi AI](https://oumi.ai/),依托Parasail平台的Oumi推理能力完成整理
- **自然语言处理(NLP)所用语言**:英语
- **授权协议**:[Apache 2.0](https://opensource.org/license/apache-2.0)
- **演示示例**:[可参考MiniMath笔记本以获取类似案例](https://github.com/oumi-ai/oumi/blob/307436bd98706cb9ce7b0bbf31204770af2b7c8c/notebooks/Oumi%20-%20MiniMath-R1-1.5B.ipynb)
## 适用场景
<!-- 本节描述该数据集的适用场景。 -->
本数据集可被纳入训练混合集,用于大语言模型(Large Language Model)的监督微调,以研发具备类似R1推理能力的模型。
## 不适用场景
<!-- 本节说明不当使用、恶意使用,以及该数据集无法良好适配的应用场景。 -->
本数据集覆盖了[原始数据集](https://huggingface.co/datasets/OpenLeecher/lmsys_chat_1m_clean)中记录的众多应用场景,但仅能反映一类特定用户群体的行为(即LMSYS聊天机器人竞技场的提交内容)。
## 数据集结构
<!-- 本节说明数据集字段,以及数据集划分标准、数据点间关联等额外结构信息。 -->
{
# 唯一对话标识符,与lmsys_chat_1m_clean数据样本关联
"conversation_id": 字符串类型,
# 用户轮次/提示词
"prompt": 字符串类型,
# 助手(DeepSeek R1)的回复内容
# 包含包裹在<think>与</think>标签内的思维轨迹
"response": 字符串类型,
# 按照对话格式整理的用户+助手轮次数据
# 示例:[{'role': 'user', 'content': ...}, {'role': 'assistant', 'content': ...}]
"messages": 字典字符串列表类型,
# 样本元数据
"metadata": 任意键值字典类型,
}
## 数据集构建
### 整理初衷
<!-- 本数据集的研发动机。 -->
为助力社区研发全开源的顶尖基础大语言模型,我们制作并发布本数据集,以作为该模型推理训练数据的核心组成部分。本数据集依托Parasail平台的Oumi推理能力生成。
### 源数据
<!-- 本节说明源数据的来源,例如新闻文本与标题、社交媒体帖文、翻译语句等。 -->
查询语句源自[lmsys_chat_1m_clean](https://huggingface.co/datasets/OpenLeecher/lmsys_chat_1m_clean),该数据集是从原始LMSYS Chat 1M数据集经过筛选后得到的。
#### 数据收集与处理
<!-- 本节说明数据收集与处理流程,例如数据筛选标准、过滤与归一化方法、所用工具与库等。 -->
* 回复内容通过Oumi针对[Parasail](https://parasail.io/)的批量推理功能收集得到。
* 无法解析的样本已被丢弃(数量不足100个)。
* 其余所有样本均包含元数据,用于标记样本是否完整(判断依据为是否存在`</think>`标记)。
#### 个人与敏感信息
<!-- 说明数据集是否包含可能被视为个人、敏感或私密的数据(例如暴露地址、唯一可识别的姓名或别名、种族或族裔起源、性取向、宗教信仰、政治观点、财务或健康数据等)。若已采取数据匿名化措施,请说明匿名化流程。 -->
目前已知本数据集不包含任何个人、敏感或私密信息,但鉴于数据来源为LMSYS聊天机器人竞技场的提交查询,仍存在潜在风险。
## 偏差、风险与局限性
<!-- 本节说明技术与社会技术层面的局限性。 -->
1. 源提示词取自[lmsys_chat_1m_clean](https://huggingface.co/datasets/OpenLeecher/lmsys_chat_1m_clean)的`conversations`列,可能反映数据筛选过程中存在的各类偏差。
2. 部分提示词的查询或内容可能具有对抗性或争议性。
3. 生成的回复可能会反映DeepSeek-R1模型本身存在的偏差与局限性。
## 引用说明
<!-- 若有介绍该数据集的论文或博文,请在此处附上APA与Bibtex格式的引用信息。 -->
**BibTeX格式引用:**
@misc{lmsysChat1mCleanR12025,
author = {Jeremiah Greer},
title = {lmsys_chat_1m_clean_R1 Dataset},
month = {February},
year = {2025},
url = {https://huggingface.co/datasets/oumi-ai/lmsys_chat_1m_clean_R1}
}
@software{oumi2025,
author = {Oumi Community},
title = {Oumi: an Open, End-to-end Platform for Building Large Foundation Models},
month = {January},
year = {2025},
url = {https://github.com/oumi-ai/oumi}
}
提供机构:
maas
创建时间:
2025-04-09



