lmsys-chat-1m-synth
Source: ModelScope community (updated 2025-12-05, indexed 2025-12-06)
Download link:
https://modelscope.cn/datasets/tokyotech-llm/lmsys-chat-1m-synth
Resource description:
# LMSYS-Chat-1M-Synth: Japanese/English Synthetic Conversation Dataset Derived from LMSYS-Chat-1M
This repository contains a series of Japanese and English conversation datasets derived from LMSYS-Chat-1M.
- [Llama-3.1-LMSYS-Chat-1M-Synth](./README_llama.md)
  - Utilized in the post-training of [Llama-3.1-Swallow-8B-Instruct-v0.1](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1) and [Llama-3.1-Swallow-70B-Instruct-v0.1](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-70B-Instruct-v0.1)
- [Gemma-2-LMSYS-Chat-1M-Synth](./README_gemma2.md)
  - Utilized in the post-training of [Llama-3.1-Swallow-8B-Instruct-v0.3](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.3) and [Llama-3.1-Swallow-70B-Instruct-v0.3](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-70B-Instruct-v0.3)
- [Gemma-3-LMSYS-Chat-1M-Synth](./README_gemma3.md)
  - Utilized in the post-training of [Llama-3.1-Swallow-8B-Instruct-v0.5](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.5)
## Additional Materials
We distribute the Python scripts used to develop the dataset under the `./materials/` directory. The directory includes scripts for generating assistant responses and scoring preferences. These scripts are provided **as-is**, solely for the purpose of research reproducibility. We do not provide support for these scripts or take responsibility for their use.
## License Information - Dataset
We publish the synthesized portion of the dataset under a different license for each subset, as follows:
### User Instructions Translated into Japanese
The subsets `lmsys-chat-1m-first-turn-user-instructions-ja.jsonl.gz.gpg` and `lmsys-chat-1m-first-turn-user-instructions-ja+unsafe.jsonl.gz.gpg`, termed "Japanese Instructions," are distributed under the [LMSYS-Chat-1M Dataset License Agreement](https://huggingface.co/datasets/lmsys/lmsys-chat-1m). **To access the original dataset and obtain the decryption key for the Japanese Instructions, you must agree to the license and provide your contact information.** **Please note that the "Right to Request Deletion" clause from the LMSYS-Chat-1M Dataset License also applies to Japanese Instructions**: The original dataset authors retain the right to request you to delete all copies of the Japanese Instructions (in whole or in part) in your possession and control. You are required to comply with any and all such requests.
### Assistant Responses and Preference Scores
The subset `llama3.1-lmsys-chat-1m-synth-ja+en.jsonl.gz` is distributed under the [LLAMA 3.1 COMMUNITY LICENSE AGREEMENT](https://www.llama.com/llama3_1/license/).
The subsets `gemma2-lmsys-chat-1m-synth-ja+en.jsonl.gz` and `gemma3-lmsys-chat-1m-synth-ja+en.jsonl.gz` are distributed under the [GEMMA TERMS OF USE](https://ai.google.dev/gemma/terms).
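The `.jsonl.gz` suffix on these subsets suggests gzip-compressed JSON Lines, that is, one JSON object per line. The following is a minimal sketch for inspecting such a file under that assumption. The helper names `iter_jsonl_gz` and `peek_keys` are hypothetical and not part of the distributed scripts, and no field names are assumed because the record schema is documented in the per-subset READMEs rather than here.

```python
import gzip
import json
from typing import Any, Iterator, List


def iter_jsonl_gz(path: str) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of a gzip-compressed JSONL file."""
    # "rt" opens the gzip stream in text mode so lines are decoded as UTF-8.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def peek_keys(path: str, n: int = 3) -> List[List[str]]:
    """Return the sorted top-level keys of the first n records that are JSON objects."""
    keys: List[List[str]] = []
    for record in iter_jsonl_gz(path):
        if len(keys) >= n:
            break
        if isinstance(record, dict):
            keys.append(sorted(record.keys()))
    return keys
```

For example, calling `peek_keys("gemma2-lmsys-chat-1m-synth-ja+en.jsonl.gz")` on a locally downloaded copy (the local path is an assumption) lists the fields of the first few records without loading the whole file into memory.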
## License Information - Scripts
We distribute the Python and Shell scripts (located in the `./scripts/` or `./materials/` directories) under the Apache License, Version 2.0.
## Acknowledgments
This work was supported by a project from the Ministry of Education, Culture, Sports, Science, and Technology (MEXT) aiming at "establishment of research and development centers to ensure the transparency and reliability of generative AI models," along with other contributions.
We gratefully acknowledge Lianmin Zheng, the author of the original LMSYS-Chat-1M paper, for granting permission to distribute the LMSYS-Chat-1M-Synth-Ja-and-En dataset as a derivative work of the original dataset.
Provider:
maas
Created:
2025-10-12



