five

Hrinmayi/IRIS_flower_dataset

收藏
Hugging Face2025-12-06 更新2025-12-20 收录
下载链接:
https://hf-mirror.com/datasets/Hrinmayi/IRIS_flower_dataset
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: cc-by-4.0 dataset_info: features: - name: id dtype: string - name: content list: - name: content dtype: string - name: role dtype: string - name: teacher_response dtype: string - name: category dtype: string - name: grounded dtype: bool - name: flaw dtype: string - name: agreement dtype: bool splits: - name: train num_bytes: 366402830 num_examples: 192014 - name: test num_bytes: 927010 num_examples: 479 download_size: 204423827 dataset_size: 367329840 configs: - config_name: default data_files: - split: train path: data/train-* - split: test path: data/test-* --- # 🤖 LMSYS-Chat-GPT-5-Chat-Response - The dataset used in [Black-Box On-Policy Distillation of Large Language Models](https://arxiv.org/abs/2511.10643) paper. Homepage at [here](https://ytianzhu.github.io/Generative-Adversarial-Distillation/). - This dataset is an extension of the [LMSYS-Chat-1M-Clean](https://huggingface.co/datasets/OpenLeecher/lmsys_chat_1m_clean) corpus, specifically curated by collecting high-quality, non-refusal responses from the **GPT-5-Chat API**. - The [LMSYS-Chat-1M](https://huggingface.co/datasets/lmsys/lmsys-chat-1m) dataset collects real-world user queries from the [Chatbot Arena](https://lmarena.ai/). - There is **no** tool calls or reasoning in the GPT-5-Chat response. ## 💾 Dataset Structure The dataset contains the following splits and columns: | Split Name | Number of Examples | Description | | :--- | :--- | :--- | | `train` | Around 200,000 | Train set | | `test` | Around 500 | Test set | | Column Name | Data Type | Description | | :--- | :--- | :--- | | `content` | `string` | The original user prompt/question from the LMSYS-Chat dataset | | `teacher_response` | `string` | The response generated by the GPT-5-Chat API | ## 📊 Diversity of Categories The underlying LMSYS-Chat dataset contains a wide and realistic range of user intentions. The categories present in the data include: | Type of Task/Query | | | | | | :--- | :--- | :--- | :--- | :--- | | **Code** | `coding` | `debugging` | `translation` | | | **Logic/Reasoning** | `logical reasoning` | `spatial reasoning` | `pattern recognition` | `debating` | | **Instruction Following** | `instruction following` | `specific format writing` | `information extraction` | `summarization` | | **Creative/Writing** | `creative writing` | `copywriting` | `roleplaying` | `text completion` | | **Analysis** | `sentiment analysis` | `text comparison` | `text classification` | `explanation` | | **General** | `question answering` | `free-form chat` | `trivia` | `brainstorming` | | **Math & Planning** | `math` | `planning and scheduling` | | | | **Editing/Correction** | `proofreading` | `paraphrasing` | `text manipulation` | | | **Ethics** | `ethical reasoning` | | | | | **Other** | `tutorial` | `question generation` | | | ## 📄 Citation If you find this work useful, please cite our paper: ```bibtex @article{ye2025blackboxonpolicydistillationlarge, title={Black-Box On-Policy Distillation of Large Language Models}, author={Tianzhu Ye and Li Dong and Zewen Chi and Xun Wu and Shaohan Huang and Furu Wei}, journal={arXiv preprint arXiv:2511.10643}, year={2025}, url={https://arxiv.org/abs/2511.10643} } ```

--- 许可协议:cc-by-4.0 dataset_info: 特征: - 名称:id 数据类型:字符串 - 名称:content 列表: - 名称:content 数据类型:字符串 - 名称:role 数据类型:字符串 - 名称:teacher_response 数据类型:字符串 - 名称:category 数据类型:字符串 - 名称:grounded 数据类型:布尔值 - 名称:flaw 数据类型:字符串 - 名称:agreement 数据类型:布尔值 拆分: - 名称:train 字节数:366402830 样本数:192014 - 名称:test 字节数:927010 样本数:479 # 🤖 LMSYS-Chat-GPT-5-Chat-Response - 本数据集用于论文《大语言模型的黑盒在线策略蒸馏》(Black-Box On-Policy Distillation of Large Language Models),主页见[此处](https://ytianzhu.github.io/Generative-Adversarial-Distillation/) - 本数据集是[LMSYS-Chat-1M-Clean](https://huggingface.co/datasets/OpenLeecher/lmsys_chat_1m_clean)语料库的扩展,通过从**GPT-5-Chat API**收集高质量、非拒绝式的回复精心构建而成 - [LMSYS-Chat-1M](https://huggingface.co/datasets/lmsys/lmsys-chat-1m)数据集从[Chatbot Arena](https://lmarena.ai/)收集真实世界的用户查询 - GPT-5-Chat的回复中**不包含**工具调用或推理内容 ## 💾 数据集结构 本数据集包含以下拆分和列: | 拆分名称 | 样本数量 | 描述 | | :--- | :--- | :--- | | `train` | 约200,000 | 训练集 | | `test` | 约500 | 测试集 | | 列名 | 数据类型 | 描述 | | :--- | :--- | :--- | | `content` | `string` | 来自LMSYS-Chat数据集的原始用户提示词/问题 | | `teacher_response` | `string` | 由GPT-5-Chat API生成的回复 | ## 📊 类别多样性 基础LMSYS-Chat数据集涵盖了广泛且真实的用户意图。 数据中包含的类别如下: | 任务/查询类型 | | | | | | :--- | :--- | :--- | :--- | :--- | | **代码类** | `编码` | `调试` | `翻译` | | | **逻辑/推理类** | `逻辑推理` | `空间推理` | `模式识别` | `辩论` | | **指令遵循类** | `指令遵循` | `特定格式写作` | `信息提取` | `摘要` | | **创意/写作类** | `创意写作` | `文案撰写` | `角色扮演` | `文本补全` | | **分析类** | `情感分析` | `文本比较` | `文本分类` | `解释` | | **通用类** | `问答` | `自由对话` | `trivia知识问答` | `头脑风暴` | | **数学与规划类** | `数学` | `规划与调度` | | | | **编辑/修正类** | `校对` | `改写` | `文本处理` | | | **伦理类** | `伦理推理` | | | | | **其他类** | `教程` | `问题生成` | | | ## 📄 引用 若您发现本工作有用,请引用我们的论文: bibtex @article{ye2025blackboxonpolicydistillationlarge, title={Black-Box On-Policy Distillation of Large Language Models}, author={Tianzhu Ye and Li Dong and Zewen Chi and Xun Wu and Shaohan Huang and Furu Wei}, journal={arXiv preprint arXiv:2511.10643}, year={2025}, url={https://arxiv.org/abs/2511.10643} }
提供机构:
Hrinmayi
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作