five

lmsys-chat-1m

收藏
魔搭社区2026-05-14 更新2024-05-15 收录
下载链接:
https://modelscope.cn/datasets/AI-ModelScope/lmsys-chat-1m
下载链接
链接失效反馈
官方服务:
资源简介:
## 示例代码 ```python from modelscope import MsDataset from modelscope.utils.constant import DownloadMode ds = MsDataset.load('AI-ModelScope/lmsys-chat-1m',subset_name='default', split='train', download_mode=DownloadMode.FORCE_REDOWNLOAD) print(next(iter(ds))) ``` ## LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset This dataset contains one million real-world conversations with 25 state-of-the-art LLMs. It is collected from 210K unique IP addresses in the wild on the [Vicuna demo and Chatbot Arena website](https://chat.lmsys.org/) from April to August 2023. Each sample includes a conversation ID, model name, conversation text in OpenAI API JSON format, detected language tag, and OpenAI moderation API tag. User consent is obtained through the "Terms of use" section on the data collection website. To ensure the safe release of data, we have made our best efforts to remove all conversations that contain personally identifiable information (PII). In addition, we have included the OpenAI moderation API output for each message. However, we have chosen to keep unsafe conversations so that researchers can study the safety-related questions associated with LLM usage in real-world scenarios as well as the OpenAI moderation process. For more details, please refer to the paper: https://arxiv.org/abs/2309.11998 **Basic Statistics** | Key | Value | | --- | --- | | # Conversations | 1,000,000 | | # Models | 25 | | # Users | 210,479 | | # Languages | 154 | | Avg. # Turns per Sample | 2.0 | | Avg. # Tokens per Prompt | 69.5 | | Avg. # Tokens per Response | 214.5 | **PII Redaction** We partnered with the [OpaquePrompts](https://opaqueprompts.opaque.co/) team to redact person names in this dataset to protect user privacy. Names like "Mary" and "James" in a conversation will appear as "NAME_1" and "NAME_2". For example: ```json Raw: [ { "content": "Write me a bio. My Name is Mary I am a student who is currently a beginner free lancer. I worked with James in the past ..." }] Redacted: [ { "content": "Write me a bio. My Name is NAME_1 I am a student who is currently a beginner free lancer. I worked with NAME_2 in the past ..." }] ``` Each conversation includes a "redacted" field to indicate if it has been redacted. This process may impact data quality and occasionally lead to incorrect redactions. We are working on improving the redaction quality and will release improved versions in the future. If you want to access the raw conversation data, please fill out [the form](https://docs.google.com/forms/d/1PZw67e19l0W3oCiQOjzSyZvXfOemhg6LCY0XzVmOUx0/edit) with details about your intended use cases. ## Uniqueness and Potential Usage This dataset features large-scale real-world conversations with LLMs. We believe it will help the AI research community answer important questions around topics like: - Characteristics and distributions of real-world user prompts - AI safety and content moderation - Training instruction-following models - Improving and evaluating LLM evaluation methods - Model selection and request dispatching algorithms For more details, please refer to the paper: https://arxiv.org/abs/2309.11998 ## LMSYS-Chat-1M Dataset License Agreement This Agreement contains the terms and conditions that govern your access and use of the LMSYS-Chat-1M Dataset (as defined above). You may not use the LMSYS-Chat-1M Dataset if you do not accept this Agreement. By clicking to accept, accessing the LMSYS-Chat-1M Dataset, or both, you hereby agree to the terms of the Agreement. If you are agreeing to be bound by the Agreement on behalf of your employer or another entity, you represent and warrant that you have full legal authority to bind your employer or such entity to this Agreement. If you do not have the requisite authority, you may not accept the Agreement or access the LMSYS-Chat-1M Dataset on behalf of your employer or another entity. - Safety and Moderation: **This dataset contains unsafe conversations that may be perceived as offensive or unsettling.** User should apply appropriate filters and safety measures before utilizing this dataset for training dialogue agents. - Non-Endorsement: The views and opinions depicted in this dataset **do not reflect** the perspectives of the researchers or affiliated institutions engaged in the data collection process. - Legal Compliance: You are mandated to use it in adherence with all pertinent laws and regulations. - Model Specific Terms: When leveraging direct outputs of a specific model, users must adhere to its corresponding terms of use. - Non-Identification: You **must not** attempt to identify the identities of individuals or infer any sensitive personal data encompassed in this dataset. - Prohibited Transfers: You should not distribute, copy, disclose, assign, sublicense, embed, host, or otherwise transfer the dataset to any third party. - Right to Request Deletion: At any time, we may require you to delete all copies of the conversation dataset (in whole or in part) in your possession and control. You will promptly comply with any and all such requests. Upon our request, you shall provide us with written confirmation of your compliance with such requirement. - Termination: We may, at any time, for any reason or for no reason, terminate this Agreement, effective immediately upon notice to you. Upon termination, the license granted to you hereunder will immediately terminate, and you will immediately stop using the LMSYS-Chat-1M Dataset and destroy all copies of the LMSYS-Chat-1M Dataset and related materials in your possession or control. - Limitation of Liability: IN NO EVENT WILL WE BE LIABLE FOR ANY CONSEQUENTIAL, INCIDENTAL, EXEMPLARY, PUNITIVE, SPECIAL, OR INDIRECT DAMAGES (INCLUDING DAMAGES FOR LOSS OF PROFITS, BUSINESS INTERRUPTION, OR LOSS OF INFORMATION) ARISING OUT OF OR RELATING TO THIS AGREEMENT OR ITS SUBJECT MATTER, EVEN IF WE HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Subject to your compliance with the terms and conditions of this Agreement, we grant to you, a limited, non-exclusive, non-transferable, non-sublicensable license to use the LMSYS-Chat-1M Dataset, including the conversation data and annotations, to research, develop, and improve software, algorithms, machine learning models, techniques, and technologies for both research and commercial purposes. ## Citation ``` @misc{zheng2023lmsyschat1m, title={LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset}, author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Tianle Li and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zhuohan Li and Zi Lin and Eric. P Xing and Joseph E. Gonzalez and Ion Stoica and Hao Zhang}, year={2023}, eprint={2309.11998}, archivePrefix={arXiv}, primaryClass={cs.CL} } ```

# LMSYS-Chat-1M:大规模真实世界大语言模型(LLM)对话数据集 本数据集包含100万条与25款顶尖大语言模型的真实对话。数据采集自2023年4月至8月期间,[Vicuna演示与Chatbot Arena网站](https://chat.lmsys.org/)上的210,479个独立IP地址。 每条样本包含对话ID、模型名称、OpenAI API格式的对话文本、检测到的语言标签,以及OpenAI内容审核API标签。 数据采集网站的“使用条款”板块已获取用户知情同意。为保障数据安全发布,我们已尽最大努力移除所有包含个人可识别信息(PII)的对话。此外,我们为每条消息添加了OpenAI内容审核API的输出结果。但我们保留了存在安全风险的对话,以便研究者能够研究真实场景下大语言模型使用相关的安全问题,以及OpenAI内容审核流程。本数据集未执行数据净化操作,因此可能包含主流基准测试中的测试题目。 更多细节请参考论文:https://arxiv.org/abs/2309.11998 **基本统计信息** | 指标 | 数值 | | --- | --- | | 对话总数 | 1,000,000 | | 模型总数 | 25 | | 用户总数 | 210,479 | | 覆盖语言数 | 154 | | 单样本平均对话轮次 | 2.0 | | 单提示平均Token数 | 69.5 | | 单回复平均Token数 | 214.5 | **个人可识别信息(PII)脱敏** 我们与[OpaquePrompts](https://opaqueprompts.opaque.co/)团队合作,对数据集中的人名进行脱敏处理以保护用户隐私。对话中的“Mary”“James”等姓名将分别显示为“NAME_1”“NAME_2”,示例如下: json Raw: [ { "content": "Write me a bio. My Name is Mary I am a student who is currently a beginner free lancer. I worked with James in the past ..." }] Redacted: [ { "content": "Write me a bio. My Name is NAME_1 I am a student who is currently a beginner free lancer. I worked with NAME_2 in the past ..." }] 每条对话均包含“redacted”字段,用于标识该对话是否已完成脱敏。该处理流程可能会影响数据质量,偶尔会出现错误脱敏的情况。我们正致力于提升脱敏质量,并将在未来发布优化版本。若您希望获取原始对话数据,请填写[此表单](https://docs.google.com/forms/d/1PZw67e19l0W3oCiQOjzSyZvXfOemhg6LCY0XzVmOUx0/edit)并说明您的使用场景详情。 ## 独特性与潜在应用场景 本数据集具备大规模真实世界大语言模型对话的特性。我们相信其将助力人工智能研究社区解答以下关键议题: - 真实世界用户提示词的特征与分布 - AI安全与内容审核 - 训练遵循指令的模型 - 改进与评估大语言模型的评估方法 - 模型选择与请求调度算法 更多细节请参考论文:https://arxiv.org/abs/2309.11998 ## LMSYS-Chat-1M数据集许可协议 本协议包含管理您访问和使用LMSYS-Chat-1M数据集(定义见上文)的条款与条件。若您不接受本协议,则不得使用LMSYS-Chat-1M数据集。通过点击接受、访问LMSYS-Chat-1M数据集,或同时进行以上两项操作,即视为您同意本协议条款。若您代表雇主或其他实体签署本协议,您声明并保证您拥有充分的法律权限可约束雇主或该实体遵守本协议。若您不具备相应权限,则不得代表雇主或其他实体接受本协议或访问LMSYS-Chat-1M数据集。 - 安全与审核:**本数据集包含可能被视为冒犯性或令人不适的不安全对话**。使用者在将该数据集用于训练对话AI智能体(AI Agent)前,应应用适当的过滤措施与安全手段。 - 不代表官方立场:本数据集中呈现的观点与意见**不反映**数据采集过程中参与的研究者或附属机构的立场。 - 合规要求:您必须遵守所有相关法律法规使用本数据集。 - 模型特定条款:当使用特定模型的直接输出时,使用者必须遵守该模型对应的使用条款。 - 不得识别身份:您**不得**尝试识别个人身份,或推断本数据集中包含的任何敏感个人信息。 - 禁止转让:您不得向任何第三方分发、复制、披露、转让、再授权、嵌入、托管或以其他方式转移本数据集。 - 要求删除的权利:在任何时候,我们可要求您删除所有您持有或控制的全部或部分对话数据集副本。您应及时遵守所有此类要求。在我们提出要求后,您需向我们提供书面确认,证明您已遵守该要求。 - 协议终止:我们可在任何时候,无论有无理由,向您发出通知后立即终止本协议。协议终止后,您获得的许可将立即失效,您应立即停止使用LMSYS-Chat-1M数据集,并销毁所有您持有或控制的LMSYS-Chat-1M数据集及相关材料的副本。 - 责任限制:在任何情况下,即使我们已被告知可能发生此类损害,我们均不对因本协议或其标的事项产生或与之相关的任何间接、附带、惩戒性、惩罚性、特殊或后果性损害(包括利润损失、业务中断或信息损失)承担责任。 在您遵守本协议条款与条件的前提下,我们授予您有限、非排他、不可转让、不可再授权的许可,允许您为研究和商业目的使用LMSYS-Chat-1M数据集(包括对话数据与标注),以研究、开发和改进用于软件、算法、机器学习模型、技术和相关工艺。 ## 引用 @misc{zheng2023lmsyschat1m, title={LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset}, author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Tianle Li and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zhuohan Li and Zi Lin and Eric. P Xing and Joseph E. Gonzalez and Ion Stoica and Hao Zhang}, year={2023}, eprint={2309.11998}, archivePrefix={arXiv}, primaryClass={cs.CL} }
提供机构:
maas
创建时间:
2023-12-05
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作