five

orca-agentinstruct-1M-v1-cleaned

收藏
魔搭社区2025-12-05 更新2024-11-23 收录
下载链接:
https://modelscope.cn/datasets/mlabonne/orca-agentinstruct-1M-v1-cleaned
下载链接
链接失效反馈
官方服务:
资源简介:
# 🐋 Orca-AgentInstruct-1M-v1-cleaned This is a cleaned version of the [microsoft/orca-agentinstruct-1M-v1](https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1) dataset released by Microsoft. > orca-agentinstruct-1M-v1 is a fully synthetic dataset using only raw text publicly available on the web as seed data. It is a subset of the full AgentInstruct dataset (~25M samples) that created Orca-3-Mistral. Compared to Mistral 7B Instruct, the authors claim 40% improvement on AGIEval, 19% improvement on MMLU, 54% improvement on GSM8K, 38% improvement on BBH and 45% improvement on AlpacaEval. Here's what I changed: 1. Splits are unified into one, with a new "split" column 2. Strings were converted into lists of dicts to ensure compatibility with most frameworks 3. Empty system prompts were removed so you don't get weird errors Data categories in the dataset: - creative_content - text_modification - struct2text_flow - rc - rag - text_extraction - mcq - follow_up - analytical_reasoning - fermi - fs_cot_flow - code_ - brain_teaser - text_classification - open_domain_q

# 🐋 Orca-AgentInstruct-1M-v1-cleaned 本数据集为微软(Microsoft)发布的[microsoft/orca-agentinstruct-1M-v1](https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1)数据集的清洗版本。 > orca-agentinstruct-1M-v1是一个全合成数据集,仅以互联网公开的原始文本作为种子数据。它是用于构建Orca-3-Mistral的完整AgentInstruct数据集(约2500万样本)的子集。相较于Mistral 7B Instruct,作者宣称该数据集可在AGIEval上实现40%的性能提升、MMLU上提升19%、GSM8K上提升54%、BBH上提升38%,以及AlpacaEval上提升45%。 本次清洗所做的调整如下: 1. 将所有数据划分统一为单一划分,并新增"split"列 2. 将字符串格式转换为字典列表格式,以兼容绝大多数主流框架 3. 移除了空系统提示词,以避免出现异常运行错误 数据集包含以下数据类别: - 创意内容(creative_content) - 文本修改(text_modification) - 结构化转文本流程(struct2text_flow) - 阅读理解(Reading Comprehension, RC) - 检索增强生成(Retrieval-Augmented Generation, RAG) - 文本抽取(text_extraction) - 多项选择题(Multiple Choice Question, MCQ) - 后续交互(follow_up) - 分析推理(analytical_reasoning) - 费米问题(fermi) - 少样本思维链流程(Few-Shot Chain-of-Thought Flow, fs_cot_flow) - 代码生成(code_) - 脑筋急转弯(brain_teaser) - 文本分类(text_classification) - 开放域问答(open_domain_q)
提供机构:
maas
创建时间:
2025-03-18
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作