orca-agentinstruct-1M-v1
收藏魔搭社区2026-04-11 更新2024-11-23 收录
下载链接:
https://modelscope.cn/datasets/AI-ModelScope/orca-agentinstruct-1M-v1
下载链接
链接失效反馈官方服务:
资源简介:
### Dataset Card
This dataset is a fully synthetic set of instruction pairs where both the prompts and the responses have been synthetically generated, using the AgentInstruct framework.
AgentInstruct is an extensible agentic framework for synthetic data generation.
This dataset contains ~1 million instruction pairs generated by the AgentInstruct, using only raw text content publicly avialble on the Web as seeds. The data covers different capabilities, such as text editing, creative writing, coding, reading comprehension, etc. The dataset can be used for instruction tuning of any base LLM.
Using the full AgentInstruct data (a superset of this dataset with ~25M instruction pairs), we post-trained Mistral-7b. When comparing the resulting model Orca-3-Mistral to Mistral-7b-Instruct (which uses the same base model), we observe significant improvements across many benchmarks. For example, 40% improvement on AGIEval, 19% improvement on MMLU, 54% improvement on GSM8K, 38% improvement on BBH and 45% improvement on AlpacaEval.
Please refer to AgentInstruct: Toward Generative Teaching with Agentic Flows for further details.
### Dataset Sources
**Repository:** microsoft/orca-agentinstruct-1M-v1
**Paper:** AgentInstruct: Toward Generative Teaching with Agentic Flows
### Direct Use
This is a general dataset created by the AgentInstruct methodology. It contains diverse synthetic data. This could be used for research on instruction tuning and synthetic data use and generaiton.
### Out-of-Scope Use
This dataset is being shared for research purposes. For training models to perform real-world tasks, we recommend further testing and validation where needed.
This dataset is not intended for use in educational systems or organizations, or for use in health systems.
### Dataset Structure
### Data Instances
A typical data entry in the dataset consists of a conversation i.e. sequence of question and its corresponding answer. Below is an example from the dataset:
```json
{
"messages": [
{
"role": "system",
"content": ""
},
{
"role": "user",
"content": "Argue the importance of having a strong understanding of physics and mechanics in the design of electron beam systems for semiconductor equipment."
},
{
"role": "assistant",
"content": "Having a strong understanding of physics and mechanics is crucial in the design of electron beam systems for semiconductor equipment. This is because electron beam systems rely heavily on the principles of physics and mechanics to function effectively and efficiently.\n\nFirstly, a strong understanding of physics is necessary to design the electron gun, which is the heart of the electron beam system. The electron gun generates and accelerates the electrons that are used to create the beam. The design of the electron gun must take into account the properties of electrons, such as their charge, mass, and velocity, as well as the principles of electromagnetism and quantum mechanics. Without a thorough understanding of these principles, it would be impossible to design an electron gun that can generate and accelerate electrons with the required energy and precision.\n\nSecondly, a strong understanding of mechanics is essential in the design of the electron beam system's optics. The optics are responsible for shaping and focusing the electron beam, which is critical for achieving high-resolution imaging and precise material processing. The design of the optics must take into account the principles of optics, such as refraction, reflection, and diffraction, as well as the mechanical properties of the components, such as their stiffness, thermal expansion, and vibration resistance. Without a thorough understanding of these principles, it would be impossible to design optics that can shape and focus the electron beam with the required precision and stability.\n\nIn conclusion, a strong understanding of physics and mechanics is essential in the design of electron beam systems for semiconductor equipment. Without this understanding, it would be impossible to design electron guns and optics that can generate and shape the electron beam with the required precision and stability. Therefore, designers of electron beam systems must have a solid foundation in physics and mechanics to ensure that their designs are effective and efficient."
},
{
"role": "user",
"content": "Could you perhaps rephrase your entire explanation using only haiku poetry? It would certainly give your answer a more artistic touch, don't you think?"
},
{
"role": "assistant",
"content": "Electron streams flow,\nPhysics guides their path and speed,\nSemiconduct's soul.\n\nElectron gun's heart,\nQuantum dance, charged particles,\nPrecision's fine art.\n\nOptics shape the beam,\nMechanics' firm hand guides light,\nFocused like a dream.\n\nIn the silicon dance,\nPhysics, mechanics embrace,\nAdvance our expanse.\n\nDesign's core must know,\nNature's laws to harness force,\nFor tech to grow, glow."
},
{
"role": "user",
"content": "Just to make sure I've got it right, your haikus suggest that the electron gun and optics are like artists, requiring precision and a deep understanding of natural laws to create their masterpiece?"
},
{
"role": "assistant",
"content": "Indeed, that's a poetic interpretation. The haikus suggest that the electron gun (the creator of the electron beam) and the optics (which shape and focus the beam) require a precise understanding of physics and mechanics, much like artists need a deep understanding of their tools and materials to create a masterpiece. The natural laws are the guidelines within which they must work to achieve the desired outcome in semiconductor equipment design."
},
{
"role": "user",
"content": "I'd like to understand more about the 'precision's fine art' you mentioned. Can you explain how precision is achieved in the design of the electron gun and optics?"
},
{
"role": "assistant",
"content": "Certainly! Achieving precision in the design of the electron gun and optics in electron beam systems involves several key factors:\n\n1. **Electron Gun Design:**\n - **Vacuum Environment:** The electron gun operates in a high vacuum to prevent electron scattering by air molecules, ensuring a precise and concentrated beam.\n - **Cathode Material:** The choice of cathode material affects the emission efficiency and stability of the electron beam. Materials like tungsten or lanthanum hexaboride are often used for their high melting points and low work functions.\n - **Anode Configuration:** The anode accelerates the electrons and its design, including the shape and voltage applied, determines the beam's energy and focus.\n - **Electromagnetic Lenses:** These lenses control the focus of the electron beam, and their design must account for spherical and chromatic aberrations to maintain precision.\n\n2. **Optics Design:**\n - **Electromagnetic Lenses:** Just like in the electron gun, these lenses in the optical column are used to focus and deflect the electron beam with high precision. The design and alignment of these lenses are critical for achieving the desired spot size and shape on the target.\n - **Stigmators:** These are used to correct for astigmatism in the electron beam, ensuring that the beam remains symmetrical and focused over a range of operating conditions.\n - **Apertures:** Placed at strategic points in the optical column, apertures help to shape the beam and eliminate electrons that stray from the desired path, improving resolution and reducing aberrations.\n - **Thermal Stability:** Precision in electron beam systems also depends on maintaining thermal stability to prevent drift or deformation of the optical components.\n\n3. **Control Systems:**\n - **Feedback Loops:** Precision is maintained through feedback systems that monitor the beam's characteristics and make real-time adjustments to the electron gun and optics to correct any deviations.\n - **Vibration Isolation:** The entire system must be isolated from external vibrations that could disturb the beam's path, often using pneumatic or magnetic isolation systems.\n\n4. **Material Science:**\n - **Material Properties:** The materials used in the construction of the electron gun and optics must be chosen for their stability under the operating conditions, including resistance to thermal expansion and electromagnetic interference.\n\n5. **Computational Modeling:**\n - **Simulation Software:** Advanced software is used to model the electron beam's behavior and interactions with the system's components, allowing for the optimization of the design before physical construction.\n\nIn summary, 'precision's fine art' in the context of electron beam systems refers to the meticulous design and engineering of the electron gun and optics, which are achieved through a deep understanding of physics and materials science, as well as the implementation of sophisticated control and feedback mechanisms."
}
],
}
```
### Data Fields
The dataset comprises the following fields:
messages: String representing a conversation in OpenAI format.
```json
[
{"role": "system", "content": "...."},
{"role": "user", "content": "...."},
{"role": "assistant", "content": "...."},
{"role": "user", "content": "...."},
{"role": "assistant", "content": "...."},
....
]
```
To read the conversation use `json.loads()`
### Data Splits
train
### Dataset Creation
### Source Data
Please refer to AgentInstruct: Toward Generative Teaching with Agentic Flows for further detail
### Data Collection and Processing
Please refer to AgentInstruct: Toward Generative Teaching with Agentic Flows for further details for details about the dataset construction.
### Who are the source data producers?
Microsoft
### Annotation process
We generate questions and answers using using Azure GPT-4.
### Personal and Sensitive Information
None
### Bias, Risks, and Limitations
• This dataset is in English.
• The dataset inherits the biases, errors, and omissions known to exist in data used for seed sources and models used for data generaiton.
• This dataset is not intended to represent any specific domain, and contains generic data. However, the AgentInstruct methodology, which was used to create this dataset, can be used to generate high-quality domain specific data, which can be used to fine-tune any existing model for a specific domain.
• The dataset is synthetically gnerated and hence may contain inaccuracies that do not accurately reflect real-world phenomena.
• The synthetic nature of this dataset may limit its ability to generalize to real-world cases.
• The data is intended for research and exoerumentation for model training and synthetic data generation.
### Citation
If you find this work useful in your method, you can cite the paper as below:
@misc{
title={ AgentInstruct: Toward Generative Teaching with Agentic Flows},
author={Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei-ge Chen, Olga Vrousgos, Corby Rosset, Fillipe Silva, Hamed Khanpour, Yash Lara, Ahmed Awadallah},
year={2024},
eprint={ 2407.03502},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Dataset Card Contact
orca-team@microsoft.com
### 数据集卡片
本数据集为完全合成的指令对集合,其提示词与回复均通过**AgentInstruct框架(AgentInstruct)**合成生成。AgentInstruct是一款可扩展的智能体框架,专用于合成数据生成。
本数据集包含约100万条由AgentInstruct生成的指令对,生成过程仅以互联网公开的原始文本内容作为种子数据。该数据集覆盖文本编辑、创意写作、代码编写、阅读理解等多元能力场景,可用于任意基础大语言模型(LLM)的指令微调。
借助完整的AgentInstruct数据集(本数据集的超集,包含约2500万条指令对),我们对Mistral-7b基座模型开展了后训练。将所得模型Orca-3-Mistral与采用相同基座模型的Mistral-7b-Instruct进行对比,可见其在多项基准测试中实现显著提升:例如在AGIEval上提升40%、MMLU上提升19%、GSM8K上提升54%、BBH上提升38%、AlpacaEval上提升45%。详细信息请参阅论文《AgentInstruct:面向智能体流的生成式教学(AgentInstruct: Toward Generative Teaching with Agentic Flows)》。
### 数据集来源
**代码仓库:** microsoft/orca-agentinstruct-1M-v1
**相关论文:** AgentInstruct: Toward Generative Teaching with Agentic Flows
### 直接使用场景
本数据集基于AgentInstruct方法构建,属于通用型数据集,包含多样化的合成数据,可用于指令微调、合成数据应用与生成相关的研究工作。
### 超出适用范围的使用场景
本数据集仅用于研究目的。若需将其用于训练模型以完成现实世界任务,建议根据实际需求开展额外的测试与验证工作。
本数据集不得应用于教育系统或组织机构,亦不得用于医疗健康体系。
### 数据集结构
#### 数据实例
本数据集中的典型数据条目为一段对话序列,即问题与对应回复的组合。以下为数据集中的一则示例:
json
{
"messages": [
{
"role": "system",
"content": ""
},
{
"role": "user",
"content": "请论述深入理解物理与力学知识在半导体设备电子束系统设计中的重要性。"
},
{
"role": "assistant",
"content": "深入理解物理与力学知识,对于半导体设备电子束系统的设计至关重要。这是因为电子束系统的高效稳定运行高度依赖物理与力学原理。
首先,物理知识是电子枪设计的核心基础。电子枪作为电子束系统的核心部件,负责产生并加速电子以形成电子束。其设计需充分考虑电子的电荷、质量、速度等属性,以及电磁学与量子力学原理。若缺乏对这些原理的深入掌握,便无法设计出可按要求精准产生并加速电子的电子枪。
其次,力学知识对电子束系统的光学部件设计不可或缺。光学部件负责塑形与聚焦电子束,这是实现高分辨率成像与精准材料加工的关键。光学部件的设计需兼顾光学原理(如折射、反射与衍射)以及组件的力学特性(如刚度、热膨胀系数与抗振性)。若未充分掌握这些原理,便无法设计出可按要求精准塑形与聚焦电子束的光学系统。
综上,深入理解物理与力学知识,是半导体设备电子束系统设计的必要条件。缺乏相关知识,便无法设计出可精准产生与塑形电子束的电子枪与光学系统。因此,电子束系统的设计者必须具备扎实的物理与力学基础,以确保设计的有效性与高效性。"
},
{
"role": "user",
"content": "能否仅用俳句的形式重述上述全部解释?这样能让回答更具艺术感,您觉得呢?"
},
{
"role": "assistant",
"content": "电子束流转,
物理引航速率途,
半导体之魂。
电子枪之芯,
量子舞起带电粒,
精准艺之魂。
光学束塑形,
力学掌灯导光路,
凝眸如幻梦。
硅基舞翩跹,
物理力学相携手,
拓界展新篇。
设计之根本,
谙熟自然驭力法,
科技向荣辉。"
},
{
"role": "user",
"content": "为确保理解无误,您的俳句是否意指电子枪与光学部件如同艺术家一般,需凭借精准操作与对自然法则的深刻理解来创作出杰作?"
},
{
"role": "assistant",
"content": "您的解读颇具诗意。这些俳句确实意在表明,电子枪(电子束的创造者)与光学系统(负责塑形与聚焦电子束)需要对物理与力学知识具备精准的掌握,正如艺术家需深入理解自身的工具与材料以创作出杰作一般。自然法则是他们在半导体设备设计中达成预期目标所必须遵循的准则。"
},
{
"role": "user",
"content": "我想进一步了解您提到的“精准艺之魂”,能否解释一下电子枪与光学部件的设计中,精准性是如何实现的?"
},
{
"role": "assistant",
"content": "当然可以!电子束系统中电子枪与光学部件的精准设计可通过以下若干关键维度实现:
1. **电子枪设计:**
- **真空环境:** 电子枪需运行于高真空环境中,以避免空气分子对电子的散射,确保电子束的精准性与集中度。
- **阴极材料:** 阴极材料的选择会影响电子束的发射效率与稳定性,通常选用钨或六硼化镧等材料,因其熔点高且逸出功低。
- **阳极结构:** 阳极负责加速电子,其形状与施加电压等设计参数决定了电子束的能量与聚焦效果。
- **电磁透镜:** 此类透镜用于控制电子束的聚焦,其设计需考虑球差与色差,以维持精准性。
2. **光学部件设计:**
- **电磁透镜:** 与电子枪中的透镜类似,光学柱中的电磁透镜用于高精度聚焦与偏转电子束。透镜的设计与对齐对实现靶材上的目标光斑尺寸与形状至关重要。
- **像散校正器:** 用于校正电子束的像散,确保电子束在多种运行条件下均保持对称与聚焦。
- **光阑:** 安装于光学柱的关键位置,用于塑形电子束并剔除偏离目标路径的电子,以提升分辨率并减少像差。
- **热稳定性:** 电子束系统的精准性还依赖于热稳定性,以防止光学组件发生漂移或变形。
3. **控制系统:**
- **反馈环路:** 通过监测电子束特性并实时调整电子枪与光学部件的参数以修正偏差,可维持精准性。
- **振动隔离:** 整个系统需与外部振动隔离,以防干扰电子束的路径,通常采用气动或磁隔离系统。
4. **材料科学:**
- **材料特性:** 电子枪与光学部件的选材需考虑其在运行条件下的稳定性,包括抗热膨胀性与抗电磁干扰性。
5. **计算建模:**
- **仿真软件:** 借助先进软件模拟电子束的行为及其与系统组件的相互作用,可在物理构建前优化设计方案。
综上,电子束系统中的“精准艺之魂”,指的是对电子枪与光学部件的精细化设计与工程实现,这依赖于对物理与材料科学的深刻理解,以及复杂控制与反馈机制的应用。"
}
],
}
### 数据字段
本数据集包含以下字段:
messages:采用OpenAI格式的对话字符串。
json
[
{"role": "system", "content": "...."},
{"role": "user", "content": "...."},
{"role": "assistant", "content": "...."},
{"role": "user", "content": "...."},
{"role": "assistant", "content": "...."},
....
]
可通过`json.loads()`读取该对话内容。
### 数据划分
训练集
### 数据集构建
#### 源数据
详细信息请参阅《AgentInstruct:面向智能体流的生成式教学》。
#### 数据收集与处理
有关数据集构建的详细信息,请参阅《AgentInstruct:面向智能体流的生成式教学》。
#### 源数据生产者
微软(Microsoft)
#### 标注流程
我们通过Azure GPT-4生成问题与回复。
#### 个人与敏感信息
无
#### 偏差、风险与局限性
• 本数据集采用英语编写。
• 本数据集继承了种子数据源与数据生成所用模型中已知存在的偏差、错误与遗漏。
• 本数据集无意代表任何特定领域,仅包含通用数据。不过,用于构建本数据集的AgentInstruct方法可用于生成高质量的领域专属数据,以用于特定领域的现有模型微调。
• 本数据集为合成生成,因此可能包含无法准确反映现实世界现象的不准确之处。
• 本数据集的合成特性可能限制其泛化至现实场景的能力。
• 本数据仅用于模型训练与合成数据生成相关的研究与实验。
### 引用
若您的研究中用到了本工作,可按如下方式引用该论文:
bibtex
@misc{
title={AgentInstruct: Toward Generative Teaching with Agentic Flows},
author={Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei-ge Chen, Olga Vrousgos, Corby Rosset, Fillipe Silva, Hamed Khanpour, Yash Lara, Ahmed Awadallah},
year={2024},
eprint={2407.03502},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
### 数据集卡片联系人
orca-team@microsoft.com
提供机构:
maas
创建时间:
2024-11-16



