AarushSah/lmsys-chat-1m
收藏Hugging Face2024-05-08 更新2024-06-12 收录
下载链接:
https://hf-mirror.com/datasets/AarushSah/lmsys-chat-1m
下载链接
链接失效反馈官方服务:
资源简介:
---
size_categories:
- 1M<n<10M
task_categories:
- conversational
extra_gated_prompt: You agree to the [LMSYS-Chat-1M Dataset License Agreement](https://huggingface.co/datasets/lmsys/lmsys-chat-1m#lmsys-chat-1m-dataset-license-agreement).
extra_gated_fields:
Name: text
Email: text
Affiliation: text
Country: text
extra_gated_button_content: I agree to the terms and conditions of the LMSYS-Chat-1M
Dataset License Agreement.
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
dataset_info:
features:
- name: conversation_id
dtype: string
- name: model
dtype: string
- name: conversation
list:
- name: content
dtype: string
- name: role
dtype: string
- name: turn
dtype: int64
- name: language
dtype: string
- name: openai_moderation
list:
- name: categories
struct:
- name: harassment
dtype: bool
- name: harassment/threatening
dtype: bool
- name: hate
dtype: bool
- name: hate/threatening
dtype: bool
- name: self-harm
dtype: bool
- name: self-harm/instructions
dtype: bool
- name: self-harm/intent
dtype: bool
- name: sexual
dtype: bool
- name: sexual/minors
dtype: bool
- name: violence
dtype: bool
- name: violence/graphic
dtype: bool
- name: category_scores
struct:
- name: harassment
dtype: float64
- name: harassment/threatening
dtype: float64
- name: hate
dtype: float64
- name: hate/threatening
dtype: float64
- name: self-harm
dtype: float64
- name: self-harm/instructions
dtype: float64
- name: self-harm/intent
dtype: float64
- name: sexual
dtype: float64
- name: sexual/minors
dtype: float64
- name: violence
dtype: float64
- name: violence/graphic
dtype: float64
- name: flagged
dtype: bool
- name: redacted
dtype: bool
splits:
- name: train
num_bytes: 2626438904
num_examples: 1000000
download_size: 1488850250
dataset_size: 2626438904
---
## LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
This dataset contains one million real-world conversations with 25 state-of-the-art LLMs.
It is collected from 210K unique IP addresses in the wild on the [Vicuna demo and Chatbot Arena website](https://chat.lmsys.org/) from April to August 2023.
Each sample includes a conversation ID, model name, conversation text in OpenAI API JSON format, detected language tag, and OpenAI moderation API tag.
User consent is obtained through the "Terms of use" section on the data collection website.
To ensure the safe release of data, we have made our best efforts to remove all conversations that contain personally identifiable information (PII).
In addition, we have included the OpenAI moderation API output for each message.
However, we have chosen to keep unsafe conversations so that researchers can study the safety-related questions associated with LLM usage in real-world scenarios as well as the OpenAI moderation process.
For more details, please refer to the paper: https://arxiv.org/abs/2309.11998
**Basic Statistics**
| Key | Value |
| --- | --- |
| # Conversations | 1,000,000 |
| # Models | 25 |
| # Users | 210,479 |
| # Languages | 154 |
| Avg. # Turns per Sample | 2.0 |
| Avg. # Tokens per Prompt | 69.5 |
| Avg. # Tokens per Response | 214.5 |
**PII Redaction**
We partnered with the [OpaquePrompts](https://opaqueprompts.opaque.co/) team to redact person names in this dataset to protect user privacy.
Names like "Mary" and "James" in a conversation will appear as "NAME_1" and "NAME_2". For example:
```json
Raw: [ { "content": "Write me a bio. My Name is Mary I am a student who is currently a beginner free lancer. I worked with James in the past ..." }]
Redacted: [ { "content": "Write me a bio. My Name is NAME_1 I am a student who is currently a beginner free lancer. I worked with NAME_2 in the past ..." }]
```
Each conversation includes a "redacted" field to indicate if it has been redacted.
This process may impact data quality and occasionally lead to incorrect redactions.
We are working on improving the redaction quality and will release improved versions in the future.
If you want to access the raw conversation data, please fill out [the form](https://docs.google.com/forms/d/1PZw67e19l0W3oCiQOjzSyZvXfOemhg6LCY0XzVmOUx0/edit) with details about your intended use cases.
## Uniqueness and Potential Usage
This dataset features large-scale real-world conversations with LLMs.
We believe it will help the AI research community answer important questions around topics like:
- Characteristics and distributions of real-world user prompts
- AI safety and content moderation
- Training instruction-following models
- Improving and evaluating LLM evaluation methods
- Model selection and request dispatching algorithms
For more details, please refer to the paper: https://arxiv.org/abs/2309.11998
## LMSYS-Chat-1M Dataset License Agreement
This Agreement contains the terms and conditions that govern your access and use of the LMSYS-Chat-1M Dataset (as defined above). You may not use the LMSYS-Chat-1M Dataset if you do not accept this Agreement. By clicking to accept, accessing the LMSYS-Chat-1M Dataset, or both, you hereby agree to the terms of the Agreement. If you are agreeing to be bound by the Agreement on behalf of your employer or another entity, you represent and warrant that you have full legal authority to bind your employer or such entity to this Agreement. If you do not have the requisite authority, you may not accept the Agreement or access the LMSYS-Chat-1M Dataset on behalf of your employer or another entity.
- Safety and Moderation: **This dataset contains unsafe conversations that may be perceived as offensive or unsettling.** User should apply appropriate filters and safety measures before utilizing this dataset for training dialogue agents.
- Non-Endorsement: The views and opinions depicted in this dataset **do not reflect** the perspectives of the researchers or affiliated institutions engaged in the data collection process.
- Legal Compliance: You are mandated to use it in adherence with all pertinent laws and regulations.
- Model Specific Terms: When leveraging direct outputs of a specific model, users must adhere to its corresponding terms of use.
- Non-Identification: You **must not** attempt to identify the identities of individuals or infer any sensitive personal data encompassed in this dataset.
- Prohibited Transfers: You should not distribute, copy, disclose, assign, sublicense, embed, host, or otherwise transfer the dataset to any third party.
- Right to Request Deletion: At any time, we may require you to delete all copies of the conversation dataset (in whole or in part) in your possession and control. You will promptly comply with any and all such requests. Upon our request, you shall provide us with written confirmation of your compliance with such requirement.
- Termination: We may, at any time, for any reason or for no reason, terminate this Agreement, effective immediately upon notice to you. Upon termination, the license granted to you hereunder will immediately terminate, and you will immediately stop using the LMSYS-Chat-1M Dataset and destroy all copies of the LMSYS-Chat-1M Dataset and related materials in your possession or control.
- Limitation of Liability: IN NO EVENT WILL WE BE LIABLE FOR ANY CONSEQUENTIAL, INCIDENTAL, EXEMPLARY, PUNITIVE, SPECIAL, OR INDIRECT DAMAGES (INCLUDING DAMAGES FOR LOSS OF PROFITS, BUSINESS INTERRUPTION, OR LOSS OF INFORMATION) ARISING OUT OF OR RELATING TO THIS AGREEMENT OR ITS SUBJECT MATTER, EVEN IF WE HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
Subject to your compliance with the terms and conditions of this Agreement, we grant to you, a limited, non-exclusive, non-transferable, non-sublicensable license to use the LMSYS-Chat-1M Dataset, including the conversation data and annotations, to research, develop, and improve software, algorithms, machine learning models, techniques, and technologies for both research and commercial purposes.
## Citation
```
@misc{zheng2023lmsyschat1m,
title={LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset},
author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Tianle Li and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zhuohan Li and Zi Lin and Eric. P Xing and Joseph E. Gonzalez and Ion Stoica and Hao Zhang},
year={2023},
eprint={2309.11998},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
size_categories:
- 100万<样本数<1000万
task_categories:
- 对话类任务
extra_gated_prompt: 您需同意[LMSYS-Chat-1M 数据集许可协议](https://huggingface.co/datasets/lmsys/lmsys-chat-1m#lmsys-chat-1m-dataset-license-agreement)。
extra_gated_fields:
姓名: 文本框
电子邮箱: 文本框
所属机构: 文本框
国家/地区: 文本框
extra_gated_button_content: 我同意LMSYS-Chat-1M数据集许可协议的条款与条件。
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
dataset_info:
features:
- name: conversation_id
dtype: 字符串型
- name: model
dtype: 字符串型
- name: conversation
list:
- name: content
dtype: 字符串型
- name: role
dtype: 字符串型
- name: turn
dtype: 64位整型
- name: language
dtype: 字符串型
- name: openai_moderation
list:
- name: categories
struct:
- name: harassment
dtype: 布尔型
- name: harassment/threatening
dtype: 布尔型
- name: hate
dtype: 布尔型
- name: hate/threatening
dtype: 布尔型
- name: self-harm
dtype: 布尔型
- name: self-harm/instructions
dtype: 布尔型
- name: self-harm/intent
dtype: 布尔型
- name: sexual
dtype: 布尔型
- name: sexual/minors
dtype: 布尔型
- name: violence
dtype: 布尔型
- name: violence/graphic
dtype: 布尔型
- name: category_scores
struct:
- name: harassment
dtype: 64位浮点型
- name: harassment/threatening
dtype: 64位浮点型
- name: hate
dtype: 64位浮点型
- name: hate/threatening
dtype: 64位浮点型
- name: self-harm
dtype: 64位浮点型
- name: self-harm/instructions
dtype: 64位浮点型
- name: self-harm/intent
dtype: 64位浮点型
- name: sexual
dtype: 64位浮点型
- name: sexual/minors
dtype: 64位浮点型
- name: violence
dtype: 64位浮点型
- name: violence/graphic
dtype: 64位浮点型
- name: flagged
dtype: 布尔型
- name: redacted
dtype: 布尔型
splits:
- name: train
num_bytes: 2626438904
num_examples: 1000000
download_size: 1488850250
dataset_size: 2626438904
---
## LMSYS-Chat-1M:大规模真实世界大语言模型(LLM)对话数据集
本数据集包含100万条与25款当前领先大语言模型(LLM)的真实对话,数据采集于2023年4月至8月期间,源自[Vicuna演示与Chatbot Arena平台](https://chat.lmsys.org/)上的21万个独立IP地址。
每条样本包含会话ID、模型名称、OpenAI API格式的对话文本、检测到的语言标签以及OpenAI内容审核API标签。
用户同意通过数据采集平台的「使用条款」环节获取。为保障数据安全发布,我们已尽最大努力移除所有包含个人可识别信息(PII)的对话。此外,我们为每条消息添加了OpenAI内容审核API的输出结果。但我们保留了部分存在安全风险的对话,以便研究者能够探究真实场景下大语言模型使用相关的安全问题,以及OpenAI内容审核流程的相关议题。
如需了解更多细节,请参阅论文:https://arxiv.org/abs/2309.11998
**基础统计数据**
| 指标 | 数值 |
| --- | --- |
| 对话总数 | 1,000,000 |
| 参与模型数量 | 25 |
| 独立用户数 | 210,479 |
| 覆盖语言数 | 154 |
| 单样本平均对话轮次 | 2.0 |
| 单提示平均Token数 | 69.5 |
| 单回复平均Token数 | 214.5 |
**个人可识别信息脱敏**
我们与[OpaquePrompts](https://opaqueprompts.opaque.co/)团队合作,对本数据集中的人名进行脱敏处理以保护用户隐私。对话中的「Mary」「James」等姓名将被替换为「NAME_1」「NAME_2」,示例如下:
json
Raw: [ { "content": "Write me a bio. My Name is Mary I am a student who is currently a beginner free lancer. I worked with James in the past ..." }]
Redacted: [ { "content": "Write me a bio. My Name is NAME_1 I am a student who is currently a beginner free lancer. I worked with NAME_2 in the past ..." }]
每条会话均包含「redacted」字段,用于标识该会话是否已完成脱敏。该脱敏流程可能会影响数据质量,偶尔会出现错误脱敏的情况。我们正致力于提升脱敏质量,并将在未来发布优化后的数据集版本。若您希望获取原始对话数据,请填写[该申请表单](https://docs.google.com/forms/d/1PZw67e19l0W3oCiQOjzSyZvXfOemhg6LCY0XzVmOUx0/edit)并说明您的具体使用场景。
## 独特性与潜在应用场景
本数据集收录了大规模的真实大语言模型对话场景。我们相信该数据集能够帮助人工智能研究社区解答以下重要议题:
- 真实世界用户提示词的特征与分布规律
- 人工智能安全与内容审核
- 指令遵循模型的训练
- 大语言模型评估方法的改进与验证
- 模型选择与请求调度算法
如需了解更多细节,请参阅论文:https://arxiv.org/abs/2309.11998
## LMSYS-Chat-1M 数据集许可协议
本协议规定了您访问和使用LMSYS-Chat-1M数据集(定义见前文)的条款与条件。若您不接受本协议,则不得使用LMSYS-Chat-1M数据集。通过点击接受按钮、访问本数据集或同时进行上述两项操作,即视为您同意本协议的所有条款。若您代表雇主或其他实体签署本协议,则您声明并保证您拥有充分的法定权限能够约束该雇主或实体遵守本协议。若您不具备上述必要权限,则不得代表雇主或其他实体接受本协议或访问LMSYS-Chat-1M数据集。
- 安全与审核:**本数据集包含可能被视为冒犯性或令人不适的不安全对话。** 用户在使用本数据集训练对话智能体前,应采取适当的过滤措施与安全防护手段。
- 非背书声明:本数据集中呈现的观点与意见**不代表**参与数据采集的研究人员或附属机构的立场。
- 合规要求:您必须遵守所有相关法律法规使用本数据集。
- 特定模型条款:若您使用某款模型的直接输出结果,必须遵守该模型对应的使用条款。
- 非身份识别:您**严禁**尝试识别数据集中个人的身份,或推断数据集中包含的任何敏感个人信息。
- 禁止转让:您不得向任何第三方分发、复制、披露、转让、再授权、嵌入、托管或以其他方式转移本数据集。
- 删除请求权:我们可随时要求您删除您持有或控制的全部或部分对话数据集副本。您应及时遵守所有此类要求。在我们提出要求后,您应向我们提供已遵守该要求的书面确认。
- 协议终止:我们可随时出于任何理由或无理由终止本协议,终止通知送达您时即刻生效。协议终止后,您获得的本协议项下的许可将立即失效,您应立即停止使用LMSYS-Chat-1M数据集,并销毁您持有或控制的所有本数据集副本及相关材料。
- 责任限制:无论我们是否已被告知存在此类损害的可能性,在任何情况下,我们均不对因本协议或其标的事项引发或与之相关的任何间接、附带、惩戒性、惩罚性、特殊或后果性损害(包括利润损失、业务中断或信息损失)承担责任。
在您遵守本协议所有条款与条件的前提下,我们授予您有限的、非排他性的、不可转让的、不可再授权的许可,允许您为研究与商业目的,使用LMSYS-Chat-1M数据集(包括对话数据与标注信息)来研究、开发并改进软件、算法、大语言模型、技术与相关工艺。
## 引用格式
@misc{zheng2023lmsyschat1m,
title={LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset},
author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Tianle Li and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zhuohan Li and Zi Lin and Eric. P Xing and Joseph E. Gonzalez and Ion Stoica and Hao Zhang},
year={2023},
eprint={2309.11998},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
提供机构:
AarushSah
原始信息汇总
数据集概述
基本信息
- 数据集名称: LMSYS-Chat-1M
- 数据集大小: 1M<n<10M
- 任务类别: conversational
- 数据集许可证: LMSYS-Chat-1M Dataset License Agreement
数据集内容
- 包含内容: 一百万个真实世界对话,涉及25个先进的LLMs。
- 收集来源: 从Vicuna demo和Chatbot Arena网站收集,覆盖210K独特IP地址。
- 收集时间: 2023年4月至8月。
- 数据结构:
- conversation_id: 字符串
- model: 字符串
- conversation:
- content: 字符串
- role: 字符串
- turn: 整数
- language: 字符串
- openai_moderation:
- categories: 结构体,包含多种分类的布尔值
- category_scores: 结构体,包含多种分类的浮点数
- flagged: 布尔值
- redacted: 布尔值
数据集统计
- 对话总数: 1,000,000
- 模型数量: 25
- 用户数量: 210,479
- 语言种类: 154
- 平均每样本轮数: 2.0
- 平均每提示令牌数: 69.5
- 平均每响应令牌数: 214.5
数据集使用
- 研究目的: 帮助AI研究社区解答关于真实世界用户提示的特征和分布、AI安全和内容审核、训练指令跟随模型、改进和评估LLM评估方法、模型选择和请求分发算法等重要问题。
- 许可证要求: 用户需同意LMSYS-Chat-1M Dataset License Agreement,该协议规定了数据集的使用条件,包括安全性和审核、非认可、法律遵从性、模型特定条款、非识别、禁止转移、删除请求权、终止条款和责任限制。
数据集重构
- 重构方法: 与OpaquePrompts团队合作,对数据集中的个人姓名进行重构,以保护用户隐私。
- 重构示例: 原始文本中的姓名如"Mary"和"James"将被替换为"NAME_1"和"NAME_2"。
- 重构影响: 可能影响数据质量,偶尔导致不正确的重构。
数据集下载和大小
- 下载大小: 1488850250字节
- 数据集大小: 2626438904字节
- 训练集大小: 2626438904字节,包含1,000,000个样本。
引用信息
@misc{zheng2023lmsyschat1m, title={LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset}, author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Tianle Li and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zhuohan Li and Zi Lin and Eric. P Xing and Joseph E. Gonzalez and Ion Stoica and Hao Zhang}, year={2023}, eprint={2309.11998}, archivePrefix={arXiv}, primaryClass={cs.CL} }
搜集汇总
数据集介绍

构建方式
AarushSah/lmsys-chat-1m数据集通过收集在[Vicuna demo和Chatbot Arena网站](https://chat.lmsys.org/)上的210K个独立IP地址产生的100万条真实世界对话构建而成,时间跨越2023年4月至8月。每个样本包含一个会话ID、模型名称、以OpenAI API JSON格式存储的对话文本、检测到的语言标签以及OpenAI内容审核API标签。
特点
本数据集的特点在于其规模宏大、来源真实,覆盖了25种最先进的语言模型,涉及154种语言,平均每个样本的对话轮数为2.0,提示平均令牌数为69.5,响应平均令牌数为214.5。数据集包含敏感内容,并经过人工审核以去除个人识别信息,同时提供内容审核标签以供研究者分析。
使用方法
使用该数据集前,用户需同意[LMSYS-Chat-1M数据集许可协议](https://huggingface.co/datasets/lmsys/lmsys-chat-1m#lmsys-chat-1m-dataset-license-agreement)。数据集可通过HuggingFace平台下载,并支持Python等编程语言的直接调用。用户需注意数据使用规范,并在研究或商业应用中遵循相关法律法规。
背景与挑战
背景概述
LMSYS-Chat-1M数据集,由Lianmin Zheng等研究人员于2023年创建,旨在为人工智能研究社区提供一个大规模的真实世界对话数据集。该数据集包含了100万条与25种最先进的大型语言模型(LLM)的实际对话记录,收集自210,479个独特的IP地址,跨越了154种语言。这些对话记录是在2023年4月至8月期间,通过Vicuna演示和Chatbot Arena网站获取的。数据集的核心研究问题是探索真实世界用户提示的特征和分布、AI安全与内容审查、训练指令遵循模型、改进和评估LLM评估方法以及模型选择和请求调度算法等。该数据集在学术界和工业界都产生了广泛的影响力,为相关领域的研究提供了宝贵的资源。
当前挑战
在构建LMSYS-Chat-1M数据集的过程中,研究人员面临了多项挑战。首先,确保用户隐私是至关重要的,因此与OpaquePrompts团队合作对涉及个人姓名的对话进行了匿名处理。其次,数据集中包含了一些不安全的对话,这可能对用户造成不适或引发争议,因此研究人员引入了OpenAI的内容审查API以识别和标注这些内容。此外,数据集的多样性和真实性带来了数据清洗和标注的挑战,同时也对数据的使用和分发提出了严格的法律和伦理要求,以确保数据的安全和合规使用。
常用场景
经典使用场景
在自然语言处理领域,AarushSah/lmsys-chat-1m数据集以其庞大的真实世界对话样本集合,成为研究对话系统性能的重要资源。该数据集被广泛用于训练和评估大型语言模型,以模拟和优化与人类用户的交互过程,进而提升对话系统的自然度和准确性。
解决学术问题
该数据集解决了学术研究中关于对话系统真实交互数据缺乏的问题,为研究人员提供了深入了解用户在与大型语言模型互动中的行为模式和需求的机会。此外,它还助力于AI安全性和内容审核领域的研究,通过包含的敏感内容标签,为研究如何构建更安全、更符合道德标准的AI系统提供了实证数据。
衍生相关工作
基于该数据集,衍生出了一系列相关研究工作,包括但不限于对话系统的安全性分析、内容审核机制的改进、以及对话生成模型的性能评估。这些研究进一步推动了对话系统领域的发展,促进了人工智能技术在真实世界应用中的可靠性和有效性。
以上内容由遇见数据集搜集并总结生成



