Hrinmayi/IRIS_flower_dataset
收藏Hugging Face2025-12-06 更新2025-12-20 收录
下载链接:
https://hf-mirror.com/datasets/Hrinmayi/IRIS_flower_dataset
下载链接
链接失效反馈官方服务:
资源简介:
---
license: cc-by-4.0
dataset_info:
features:
- name: id
dtype: string
- name: content
list:
- name: content
dtype: string
- name: role
dtype: string
- name: teacher_response
dtype: string
- name: category
dtype: string
- name: grounded
dtype: bool
- name: flaw
dtype: string
- name: agreement
dtype: bool
splits:
- name: train
num_bytes: 366402830
num_examples: 192014
- name: test
num_bytes: 927010
num_examples: 479
download_size: 204423827
dataset_size: 367329840
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
- split: test
path: data/test-*
---
# 🤖 LMSYS-Chat-GPT-5-Chat-Response
- The dataset used in [Black-Box On-Policy Distillation of Large Language Models](https://arxiv.org/abs/2511.10643) paper. Homepage at [here](https://ytianzhu.github.io/Generative-Adversarial-Distillation/).
- This dataset is an extension of the [LMSYS-Chat-1M-Clean](https://huggingface.co/datasets/OpenLeecher/lmsys_chat_1m_clean) corpus, specifically curated by collecting high-quality, non-refusal responses from the **GPT-5-Chat API**.
- The [LMSYS-Chat-1M](https://huggingface.co/datasets/lmsys/lmsys-chat-1m) dataset collects real-world user queries from the [Chatbot Arena](https://lmarena.ai/).
- There is **no** tool calls or reasoning in the GPT-5-Chat response.
## 💾 Dataset Structure
The dataset contains the following splits and columns:
| Split Name | Number of Examples | Description |
| :--- | :--- | :--- |
| `train` | Around 200,000 | Train set |
| `test` | Around 500 | Test set |
| Column Name | Data Type | Description |
| :--- | :--- | :--- |
| `content` | `string` | The original user prompt/question from the LMSYS-Chat dataset |
| `teacher_response` | `string` | The response generated by the GPT-5-Chat API |
## 📊 Diversity of Categories
The underlying LMSYS-Chat dataset contains a wide and realistic range of user intentions.
The categories present in the data include:
| Type of Task/Query | | | | |
| :--- | :--- | :--- | :--- | :--- |
| **Code** | `coding` | `debugging` | `translation` | |
| **Logic/Reasoning** | `logical reasoning` | `spatial reasoning` | `pattern recognition` | `debating` |
| **Instruction Following** | `instruction following` | `specific format writing` | `information extraction` | `summarization` |
| **Creative/Writing** | `creative writing` | `copywriting` | `roleplaying` | `text completion` |
| **Analysis** | `sentiment analysis` | `text comparison` | `text classification` | `explanation` |
| **General** | `question answering` | `free-form chat` | `trivia` | `brainstorming` |
| **Math & Planning** | `math` | `planning and scheduling` | | |
| **Editing/Correction** | `proofreading` | `paraphrasing` | `text manipulation` | |
| **Ethics** | `ethical reasoning` | | | |
| **Other** | `tutorial` | `question generation` | | |
## 📄 Citation
If you find this work useful, please cite our paper:
```bibtex
@article{ye2025blackboxonpolicydistillationlarge,
title={Black-Box On-Policy Distillation of Large Language Models},
author={Tianzhu Ye and Li Dong and Zewen Chi and Xun Wu and Shaohan Huang and Furu Wei},
journal={arXiv preprint arXiv:2511.10643},
year={2025},
url={https://arxiv.org/abs/2511.10643}
}
```
---
许可协议:cc-by-4.0
dataset_info:
特征:
- 名称:id
数据类型:字符串
- 名称:content
列表:
- 名称:content
数据类型:字符串
- 名称:role
数据类型:字符串
- 名称:teacher_response
数据类型:字符串
- 名称:category
数据类型:字符串
- 名称:grounded
数据类型:布尔值
- 名称:flaw
数据类型:字符串
- 名称:agreement
数据类型:布尔值
拆分:
- 名称:train
字节数:366402830
样本数:192014
- 名称:test
字节数:927010
样本数:479
# 🤖 LMSYS-Chat-GPT-5-Chat-Response
- 本数据集用于论文《大语言模型的黑盒在线策略蒸馏》(Black-Box On-Policy Distillation of Large Language Models),主页见[此处](https://ytianzhu.github.io/Generative-Adversarial-Distillation/)
- 本数据集是[LMSYS-Chat-1M-Clean](https://huggingface.co/datasets/OpenLeecher/lmsys_chat_1m_clean)语料库的扩展,通过从**GPT-5-Chat API**收集高质量、非拒绝式的回复精心构建而成
- [LMSYS-Chat-1M](https://huggingface.co/datasets/lmsys/lmsys-chat-1m)数据集从[Chatbot Arena](https://lmarena.ai/)收集真实世界的用户查询
- GPT-5-Chat的回复中**不包含**工具调用或推理内容
## 💾 数据集结构
本数据集包含以下拆分和列:
| 拆分名称 | 样本数量 | 描述 |
| :--- | :--- | :--- |
| `train` | 约200,000 | 训练集 |
| `test` | 约500 | 测试集 |
| 列名 | 数据类型 | 描述 |
| :--- | :--- | :--- |
| `content` | `string` | 来自LMSYS-Chat数据集的原始用户提示词/问题 |
| `teacher_response` | `string` | 由GPT-5-Chat API生成的回复 |
## 📊 类别多样性
基础LMSYS-Chat数据集涵盖了广泛且真实的用户意图。
数据中包含的类别如下:
| 任务/查询类型 | | | | |
| :--- | :--- | :--- | :--- | :--- |
| **代码类** | `编码` | `调试` | `翻译` | |
| **逻辑/推理类** | `逻辑推理` | `空间推理` | `模式识别` | `辩论` |
| **指令遵循类** | `指令遵循` | `特定格式写作` | `信息提取` | `摘要` |
| **创意/写作类** | `创意写作` | `文案撰写` | `角色扮演` | `文本补全` |
| **分析类** | `情感分析` | `文本比较` | `文本分类` | `解释` |
| **通用类** | `问答` | `自由对话` | `trivia知识问答` | `头脑风暴` |
| **数学与规划类** | `数学` | `规划与调度` | | |
| **编辑/修正类** | `校对` | `改写` | `文本处理` | |
| **伦理类** | `伦理推理` | | | |
| **其他类** | `教程` | `问题生成` | | |
## 📄 引用
若您发现本工作有用,请引用我们的论文:
bibtex
@article{ye2025blackboxonpolicydistillationlarge,
title={Black-Box On-Policy Distillation of Large Language Models},
author={Tianzhu Ye and Li Dong and Zewen Chi and Xun Wu and Shaohan Huang and Furu Wei},
journal={arXiv preprint arXiv:2511.10643},
year={2025},
url={https://arxiv.org/abs/2511.10643}
}
提供机构:
Hrinmayi



