openeurollm/orca-agentinstruct-1M-v1-decontaminated
Source: Hugging Face. Updated 2026-03-29; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/openeurollm/orca-agentinstruct-1M-v1-decontaminated
Resource description:
---
language:
- en
license: cdla-permissive-2.0
size_categories:
- 1M<n<10M
task_categories:
- question-answering
dataset_info:
features:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
splits:
- name: analytical_reasoning
num_bytes: 98028645
num_examples: 24983
- name: brain_teaser
num_bytes: 119629503
num_examples: 49962
- name: code_
num_bytes: 589940236
num_examples: 99931
- name: creative_content
num_bytes: 280573464
num_examples: 49961
- name: fermi
num_bytes: 75519245
num_examples: 24984
- name: follow_up
num_bytes: 849785042
num_examples: 98947
- name: fs_cot_flow
num_bytes: 103809090
num_examples: 24883
- name: mcq
num_bytes: 220938985
num_examples: 99873
- name: open_domain_qa
num_bytes: 588970010
num_examples: 271991
- name: rag
num_bytes: 409622336
num_examples: 49867
- name: rc
num_bytes: 274752766
num_examples: 49910
- name: struct2text_flow
num_bytes: 240292423
num_examples: 49955
- name: text_classification
num_bytes: 145309572
num_examples: 49994
- name: text_extraction
num_bytes: 297532567
num_examples: 49887
- name: text_modification
num_bytes: 334112924
num_examples: 49919
download_size: 4628816808
dataset_size: 4628816808
configs:
- config_name: default
data_files:
- split: analytical_reasoning
path: data/analytical_reasoning-*
- split: brain_teaser
path: data/brain_teaser-*
- split: code_
path: data/code_-*
- split: creative_content
path: data/creative_content-*
- split: fermi
path: data/fermi-*
- split: follow_up
path: data/follow_up-*
- split: fs_cot_flow
path: data/fs_cot_flow-*
- split: mcq
path: data/mcq-*
- split: open_domain_qa
path: data/open_domain_qa-*
- split: rag
path: data/rag-*
- split: rc
path: data/rc-*
- split: struct2text_flow
path: data/struct2text_flow-*
- split: text_classification
path: data/text_classification-*
- split: text_extraction
path: data/text_extraction-*
- split: text_modification
path: data/text_modification-*
decontamination:
source_dataset: microsoft/orca-agentinstruct-1M-v1
benchmarks:
- path: HuggingFaceH4/MATH-500
subset: default
split: test
- path: HuggingFaceH4/aime_2024
subset: default
split: train
- path: math-ai/aime25
subset: default
split: test
- path: math-ai/amc23
subset: default
split: test
- path: daman1209arora/jeebench
subset: default
split: test
- path: Idavidrein/gpqa
subset: gpqa_diamond
split: train
- path: ali-elganzory/livecodebench-code_generation_lite
subset: release_v6
split: test
- path: openai/openai_humaneval
subset: openai_humaneval
split: test
- path: google-research-datasets/mbpp
subset: full
split: train+test+validation+prompt
- path: google/IFEval
subset: default
split: train
- path: tatsu-lab/alpaca_eval
subset: alpaca_eval
split: eval
- path: lmarena-ai/arena-hard-auto
subset: default
split: train
contamination_stats:
- subset: default
split: creative_content
total: 50000
removed: 39
- subset: default
split: text_modification
total: 50000
removed: 81
- subset: default
split: struct2text_flow
total: 50000
removed: 45
- subset: default
split: rc
total: 50000
removed: 90
- subset: default
split: rag
total: 50000
removed: 133
- subset: default
split: text_extraction
total: 50000
removed: 113
- subset: default
split: mcq
total: 99986
removed: 113
- subset: default
split: follow_up
total: 279935
removed: 107
- subset: default
split: analytical_reasoning
total: 25000
removed: 17
- subset: default
split: fermi
total: 25000
removed: 16
- subset: default
split: fs_cot_flow
total: 25000
removed: 117
- subset: default
split: code_
total: 100000
removed: 69
- subset: default
split: brain_teaser
total: 50000
removed: 38
- subset: default
split: text_classification
total: 50000
removed: 6
- subset: default
split: open_domain_qa
total: 272370
removed: 379
---
## Decontamination
This dataset is a decontaminated version of [microsoft/orca-agentinstruct-1M-v1](https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1).
### Benchmarks used
- **MATH500**: `HuggingFaceH4/MATH-500` (subset=default, split=test)
- **AIME24**: `HuggingFaceH4/aime_2024` (subset=default, split=train)
- **AIME25**: `math-ai/aime25` (subset=default, split=test)
- **AMC23**: `math-ai/amc23` (subset=default, split=test)
- **JEEBench**: `daman1209arora/jeebench` (subset=default, split=test)
- **GPQADiamond**: `Idavidrein/gpqa` (subset=gpqa_diamond, split=train)
- **LiveCodeBench**: `ali-elganzory/livecodebench-code_generation_lite` (subset=release_v6, split=test)
- **HumanEval**: `openai/openai_humaneval` (subset=openai_humaneval, split=test)
- **MBPP**: `google-research-datasets/mbpp` (subset=full, split=train+test+validation+prompt)
- **IFEval**: `google/IFEval` (subset=default, split=train)
- **AlpacaEval**: `tatsu-lab/alpaca_eval` (subset=alpaca_eval, split=eval)
- **Arena-Hard-v2.0**: `lmarena-ai/arena-hard-auto` (subset=default, split=train) (data_files=['data/arena-hard-v2.0/question.jsonl'])
### Decontamination settings
<table>
<thead>
<tr><th>Parameter</th><th>Value</th></tr>
</thead>
<tbody>
<tr><td>N-gram size</td><td>8</td></tr>
<tr><td>Match threshold</td><td>0.5</td></tr>
</tbody>
</table>
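
The card does not spell out how these two settings are applied. One common reading, sketched here purely as an assumption, is: split each text into whitespace tokens, form 8-token n-grams, and flag a training document when at least half of a benchmark item's n-grams also occur in it. The function names (`ngrams`, `overlap_ratio`, `is_contaminated`), the lowercase/whitespace tokenization, and the direction of the ratio are illustrative choices, not the pipeline actually used to build this dataset.

```python
from typing import List, Set, Tuple

NGRAM_SIZE = 8          # "N-gram size" from the settings table
MATCH_THRESHOLD = 0.5   # "Match threshold" from the settings table


def ngrams(text: str, n: int = NGRAM_SIZE) -> Set[Tuple[str, ...]]:
    """Return the set of n-token windows of a lowercased, whitespace-split text."""
    tokens: List[str] = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(document: str, benchmark_item: str, n: int = NGRAM_SIZE) -> float:
    """Fraction of the benchmark item's n-grams that also appear in the document."""
    bench = ngrams(benchmark_item, n)
    if not bench:  # benchmark item shorter than n tokens: nothing to match
        return 0.0
    return len(bench & ngrams(document, n)) / len(bench)


def is_contaminated(document: str, benchmark_item: str) -> bool:
    """Assumed rule: flag the document when the ratio reaches the threshold."""
    return overlap_ratio(document, benchmark_item) >= MATCH_THRESHOLD
```

Under this reading, a document that quotes a benchmark question in full scores 1.0, while a document sharing only a short phrase stays well below the 0.5 threshold.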
### Split and benchmark details
<table>
<thead>
<tr>
<th>Subset</th>
<th>Split</th>
<th>Docs in split (dataset)</th>
<th>Benchmark</th>
<th>Contaminated (dataset)</th>
<th>Contamination rate (dataset)</th>
<th>Docs (benchmark)</th>
<th>Contaminated (benchmark)</th>
<th>Contamination rate (benchmark)</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="180">default</td>
<td rowspan="12">analytical_reasoning</td>
<td rowspan="12">25,000</td>
<td>MATH500</td>
<td>16</td>
<td>0.0640%</td>
<td>500</td>
<td>3</td>
<td>0.6000%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>0</td>
<td>0.0000%</td>
<td>974</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>IFEval</td>
<td>1</td>
<td>0.0040%</td>
<td>541</td>
<td>1</td>
<td>0.1848%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>0</td>
<td>0.0000%</td>
<td>805</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>0</td>
<td>0.0000%</td>
<td>750</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td rowspan="12">brain_teaser</td>
<td rowspan="12">50,000</td>
<td>MATH500</td>
<td>34</td>
<td>0.0680%</td>
<td>500</td>
<td>10</td>
<td>2.00%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>1</td>
<td>0.0020%</td>
<td>40</td>
<td>1</td>
<td>2.50%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>2</td>
<td>0.0040%</td>
<td>974</td>
<td>1</td>
<td>0.1027%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>1</td>
<td>0.0020%</td>
<td>805</td>
<td>1</td>
<td>0.1242%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>0</td>
<td>0.0000%</td>
<td>750</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td rowspan="12">code_</td>
<td rowspan="12">100,000</td>
<td>MATH500</td>
<td>28</td>
<td>0.0280%</td>
<td>500</td>
<td>5</td>
<td>1.00%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>8</td>
<td>0.0080%</td>
<td>164</td>
<td>5</td>
<td>3.05%</td>
</tr>
<tr>
<td>MBPP</td>
<td>30</td>
<td>0.0300%</td>
<td>974</td>
<td>17</td>
<td>1.75%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>2</td>
<td>0.0020%</td>
<td>805</td>
<td>2</td>
<td>0.2484%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>2</td>
<td>0.0020%</td>
<td>750</td>
<td>1</td>
<td>0.1333%</td>
</tr>
<tr>
<td rowspan="12">creative_content</td>
<td rowspan="12">50,000</td>
<td>MATH500</td>
<td>14</td>
<td>0.0280%</td>
<td>500</td>
<td>4</td>
<td>0.8000%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>1</td>
<td>0.0020%</td>
<td>164</td>
<td>2</td>
<td>1.22%</td>
</tr>
<tr>
<td>MBPP</td>
<td>4</td>
<td>0.0080%</td>
<td>974</td>
<td>4</td>
<td>0.4107%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>20</td>
<td>0.0400%</td>
<td>805</td>
<td>1</td>
<td>0.1242%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>0</td>
<td>0.0000%</td>
<td>750</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td rowspan="12">fermi</td>
<td rowspan="12">25,000</td>
<td>MATH500</td>
<td>10</td>
<td>0.0400%</td>
<td>500</td>
<td>2</td>
<td>0.4000%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>1</td>
<td>0.0040%</td>
<td>515</td>
<td>1</td>
<td>0.1942%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>5</td>
<td>0.0200%</td>
<td>974</td>
<td>2</td>
<td>0.2053%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>0</td>
<td>0.0000%</td>
<td>805</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>0</td>
<td>0.0000%</td>
<td>750</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td rowspan="12">follow_up</td>
<td rowspan="12">279,935</td>
<td>MATH500</td>
<td>51</td>
<td>0.0182%</td>
<td>500</td>
<td>13</td>
<td>2.60%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>2</td>
<td>0.0007%</td>
<td>515</td>
<td>2</td>
<td>0.3883%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>4</td>
<td>0.0014%</td>
<td>164</td>
<td>3</td>
<td>1.83%</td>
</tr>
<tr>
<td>MBPP</td>
<td>27</td>
<td>0.0096%</td>
<td>974</td>
<td>20</td>
<td>2.05%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>21</td>
<td>0.0075%</td>
<td>805</td>
<td>4</td>
<td>0.4969%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>2</td>
<td>0.0007%</td>
<td>750</td>
<td>1</td>
<td>0.1333%</td>
</tr>
<tr>
<td rowspan="12">fs_cot_flow</td>
<td rowspan="12">25,000</td>
<td>MATH500</td>
<td>66</td>
<td>0.2640%</td>
<td>500</td>
<td>17</td>
<td>3.40%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>1</td>
<td>0.0040%</td>
<td>40</td>
<td>1</td>
<td>2.50%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>2</td>
<td>0.0080%</td>
<td>515</td>
<td>1</td>
<td>0.1942%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>1</td>
<td>0.0040%</td>
<td>164</td>
<td>1</td>
<td>0.6098%</td>
</tr>
<tr>
<td>MBPP</td>
<td>14</td>
<td>0.0560%</td>
<td>974</td>
<td>8</td>
<td>0.8214%</td>
</tr>
<tr>
<td>IFEval</td>
<td>5</td>
<td>0.0200%</td>
<td>541</td>
<td>1</td>
<td>0.1848%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>25</td>
<td>0.1000%</td>
<td>805</td>
<td>7</td>
<td>0.8696%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>4</td>
<td>0.0160%</td>
<td>750</td>
<td>2</td>
<td>0.2667%</td>
</tr>
<tr>
<td rowspan="12">mcq</td>
<td rowspan="12">99,986</td>
<td>MATH500</td>
<td>82</td>
<td>0.0820%</td>
<td>500</td>
<td>18</td>
<td>3.60%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>1</td>
<td>0.0010%</td>
<td>515</td>
<td>1</td>
<td>0.1942%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>17</td>
<td>0.0170%</td>
<td>974</td>
<td>6</td>
<td>0.6160%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>7</td>
<td>0.0070%</td>
<td>805</td>
<td>5</td>
<td>0.6211%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>7</td>
<td>0.0070%</td>
<td>750</td>
<td>2</td>
<td>0.2667%</td>
</tr>
<tr>
<td rowspan="12">open_domain_qa</td>
<td rowspan="12">272,370</td>
<td>MATH500</td>
<td>283</td>
<td>0.1039%</td>
<td>500</td>
<td>71</td>
<td>14.20%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>3</td>
<td>0.0011%</td>
<td>30</td>
<td>1</td>
<td>3.33%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>2</td>
<td>0.0007%</td>
<td>515</td>
<td>2</td>
<td>0.3883%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>56</td>
<td>0.0206%</td>
<td>974</td>
<td>17</td>
<td>1.75%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>26</td>
<td>0.0095%</td>
<td>805</td>
<td>4</td>
<td>0.4969%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>9</td>
<td>0.0033%</td>
<td>750</td>
<td>2</td>
<td>0.2667%</td>
</tr>
<tr>
<td rowspan="12">rag</td>
<td rowspan="12">50,000</td>
<td>MATH500</td>
<td>61</td>
<td>0.1220%</td>
<td>500</td>
<td>13</td>
<td>2.60%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>1</td>
<td>0.0020%</td>
<td>30</td>
<td>1</td>
<td>3.33%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>2</td>
<td>0.0040%</td>
<td>164</td>
<td>3</td>
<td>1.83%</td>
</tr>
<tr>
<td>MBPP</td>
<td>40</td>
<td>0.0800%</td>
<td>974</td>
<td>31</td>
<td>3.18%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>25</td>
<td>0.0500%</td>
<td>805</td>
<td>7</td>
<td>0.8696%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>4</td>
<td>0.0080%</td>
<td>750</td>
<td>4</td>
<td>0.5333%</td>
</tr>
<tr>
<td rowspan="12">rc</td>
<td rowspan="12">50,000</td>
<td>MATH500</td>
<td>59</td>
<td>0.1180%</td>
<td>500</td>
<td>15</td>
<td>3.00%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>1</td>
<td>0.0020%</td>
<td>515</td>
<td>1</td>
<td>0.1942%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>16</td>
<td>0.0320%</td>
<td>974</td>
<td>6</td>
<td>0.6160%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>12</td>
<td>0.0240%</td>
<td>805</td>
<td>3</td>
<td>0.3727%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>2</td>
<td>0.0040%</td>
<td>750</td>
<td>1</td>
<td>0.1333%</td>
</tr>
<tr>
<td rowspan="12">struct2text_flow</td>
<td rowspan="12">50,000</td>
<td>MATH500</td>
<td>29</td>
<td>0.0580%</td>
<td>500</td>
<td>4</td>
<td>0.8000%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>1</td>
<td>0.0020%</td>
<td>974</td>
<td>2</td>
<td>0.2053%</td>
</tr>
<tr>
<td>IFEval</td>
<td>2</td>
<td>0.0040%</td>
<td>541</td>
<td>1</td>
<td>0.1848%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>13</td>
<td>0.0260%</td>
<td>805</td>
<td>1</td>
<td>0.1242%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>0</td>
<td>0.0000%</td>
<td>750</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td rowspan="12">text_classification</td>
<td rowspan="12">50,000</td>
<td>MATH500</td>
<td>3</td>
<td>0.0060%</td>
<td>500</td>
<td>2</td>
<td>0.4000%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>0</td>
<td>0.0000%</td>
<td>974</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>IFEval</td>
<td>0</td>
<td>0.0000%</td>
<td>541</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>3</td>
<td>0.0060%</td>
<td>805</td>
<td>1</td>
<td>0.1242%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>0</td>
<td>0.0000%</td>
<td>750</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td rowspan="12">text_extraction</td>
<td rowspan="12">50,000</td>
<td>MATH500</td>
<td>68</td>
<td>0.1360%</td>
<td>500</td>
<td>23</td>
<td>4.60%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>3</td>
<td>0.0060%</td>
<td>164</td>
<td>4</td>
<td>2.44%</td>
</tr>
<tr>
<td>MBPP</td>
<td>34</td>
<td>0.0680%</td>
<td>974</td>
<td>24</td>
<td>2.46%</td>
</tr>
<tr>
<td>IFEval</td>
<td>2</td>
<td>0.0040%</td>
<td>541</td>
<td>1</td>
<td>0.1848%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>4</td>
<td>0.0080%</td>
<td>805</td>
<td>3</td>
<td>0.3727%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>2</td>
<td>0.0040%</td>
<td>750</td>
<td>1</td>
<td>0.1333%</td>
</tr>
<tr>
<td rowspan="12">text_modification</td>
<td rowspan="12">50,000</td>
<td>MATH500</td>
<td>54</td>
<td>0.1080%</td>
<td>500</td>
<td>14</td>
<td>2.80%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AMC23</td>
<td>0</td>
<td>0.0000%</td>
<td>40</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>1</td>
<td>0.0020%</td>
<td>515</td>
<td>1</td>
<td>0.1942%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>0</td>
<td>0.0000%</td>
<td>198</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>0</td>
<td>0.0000%</td>
<td>1055</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>0</td>
<td>0.0000%</td>
<td>164</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>MBPP</td>
<td>8</td>
<td>0.0160%</td>
<td>974</td>
<td>7</td>
<td>0.7187%</td>
</tr>
<tr>
<td>IFEval</td>
<td>1</td>
<td>0.0020%</td>
<td>541</td>
<td>1</td>
<td>0.1848%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>12</td>
<td>0.0240%</td>
<td>805</td>
<td>1</td>
<td>0.1242%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>6</td>
<td>0.0120%</td>
<td>750</td>
<td>4</td>
<td>0.5333%</td>
</tr>
</tbody>
</table>
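
Both rate columns in the table are consistent with a plain ratio of contaminated items to total items, shown as a percentage. A small check, using the `analytical_reasoning` / MATH500 row as input (the helper name `rate_percent` is made up for this sketch):

```python
def rate_percent(contaminated: int, total: int) -> float:
    """Contaminated items as a percentage of all items."""
    return 100.0 * contaminated / total


# analytical_reasoning vs. MATH500, values copied from the table
dataset_rate = rate_percent(16, 25_000)   # contaminated docs / docs in split
benchmark_rate = rate_percent(3, 500)     # contaminated items / benchmark size

print(f"{dataset_rate:.4f}%")    # 0.0640%
print(f"{benchmark_rate:.4f}%")  # 0.6000%
```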
### Dataset summary
<table>
<thead>
<tr><th>Metric</th><th>Value</th></tr>
</thead>
<tbody>
<tr><td>Total documents in dataset</td><td>1,227,291</td></tr>
<tr><td>Contaminated documents (removed)</td><td>1,363</td></tr>
<tr><td>Documents after decontamination</td><td>1,225,928</td></tr>
<tr><td>Contamination rate (dataset)</td><td>0.1111%</td></tr>
</tbody>
</table>
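
The summary figures can be reproduced from the per-split `contamination_stats` entries in the metadata at the top of this card. The sketch below copies those (total, removed) pairs by hand; the dictionary name `STATS` is only for illustration.

```python
# (total, removed) per split, copied from contamination_stats in the card metadata
STATS = {
    "creative_content": (50_000, 39),
    "text_modification": (50_000, 81),
    "struct2text_flow": (50_000, 45),
    "rc": (50_000, 90),
    "rag": (50_000, 133),
    "text_extraction": (50_000, 113),
    "mcq": (99_986, 113),
    "follow_up": (279_935, 107),
    "analytical_reasoning": (25_000, 17),
    "fermi": (25_000, 16),
    "fs_cot_flow": (25_000, 117),
    "code_": (100_000, 69),
    "brain_teaser": (50_000, 38),
    "text_classification": (50_000, 6),
    "open_domain_qa": (272_370, 379),
}

total = sum(t for t, _ in STATS.values())
removed = sum(r for _, r in STATS.values())

print(total)                             # 1227291
print(removed)                           # 1363
print(total - removed)                   # 1225928
print(f"{100 * removed / total:.4f}%")   # 0.1111%
```

One thing this check surfaces: the `num_examples` values under `dataset_info` sum to 1,045,047 rather than 1,225,928. The whole gap sits in `follow_up` (279,935 documents in `contamination_stats` versus 98,947 examples in the split), and the card does not explain the difference.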
---
### Dataset Card
This dataset is a fully synthetic set of instruction pairs where both the prompts and the responses have been synthetically generated, using the AgentInstruct framework.
AgentInstruct is an extensible agentic framework for synthetic data generation.
This dataset contains ~1 million instruction pairs generated by AgentInstruct, using only raw text content publicly available on the Web as seeds. The data covers different capabilities, such as text editing, creative writing, coding, and reading comprehension. The dataset can be used for instruction tuning of any base LLM.
Using the full AgentInstruct data (a superset of this dataset with ~25M instruction pairs), we post-trained Mistral-7b. When comparing the resulting model Orca-3-Mistral to Mistral-7b-Instruct (which uses the same base model), we observe significant improvements across many benchmarks: for example, a 40% improvement on AGIEval, 19% on MMLU, 54% on GSM8K, 38% on BBH, and 45% on AlpacaEval.
Please refer to AgentInstruct: Toward Generative Teaching with Agentic Flows for further details.
### Dataset Sources
**Repository:** microsoft/orca-agentinstruct-1M-v1
**Paper:** AgentInstruct: Toward Generative Teaching with Agentic Flows
### Direct Use
This is a general dataset created with the AgentInstruct methodology. It contains diverse synthetic data and can be used for research on instruction tuning and on the use and generation of synthetic data.
### Out-of-Scope Use
This dataset is being shared for research purposes. For training models to perform real-world tasks, we recommend further testing and validation where needed.
This dataset is not intended for use in educational systems or organizations, or for use in health systems.
### Dataset Structure
### Data Instances
A typical data entry in the dataset consists of a conversation, i.e. a sequence of questions and their corresponding answers. Below is an example from the dataset:
```json
{
"messages": [
{
"role": "system",
"content": ""
},
{
"role": "user",
"content": "Argue the importance of having a strong understanding of physics and mechanics in the design of electron beam systems for semiconductor equipment."
},
{
"role": "assistant",
"content": "Having a strong understanding of physics and mechanics is crucial in the design of electron beam systems for semiconductor equipment. This is because electron beam systems rely heavily on the principles of physics and mechanics to function effectively and efficiently.\n\nFirstly, a strong understanding of physics is necessary to design the electron gun, which is the heart of the electron beam system. The electron gun generates and accelerates the electrons that are used to create the beam. The design of the electron gun must take into account the properties of electrons, such as their charge, mass, and velocity, as well as the principles of electromagnetism and quantum mechanics. Without a thorough understanding of these principles, it would be impossible to design an electron gun that can generate and accelerate electrons with the required energy and precision.\n\nSecondly, a strong understanding of mechanics is essential in the design of the electron beam system's optics. The optics are responsible for shaping and focusing the electron beam, which is critical for achieving high-resolution imaging and precise material processing. The design of the optics must take into account the principles of optics, such as refraction, reflection, and diffraction, as well as the mechanical properties of the components, such as their stiffness, thermal expansion, and vibration resistance. Without a thorough understanding of these principles, it would be impossible to design optics that can shape and focus the electron beam with the required precision and stability.\n\nIn conclusion, a strong understanding of physics and mechanics is essential in the design of electron beam systems for semiconductor equipment. Without this understanding, it would be impossible to design electron guns and optics that can generate and shape the electron beam with the required precision and stability. Therefore, designers of electron beam systems must have a solid foundation in physics and mechanics to ensure that their designs are effective and efficient."
},
{
"role": "user",
"content": "Could you perhaps rephrase your entire explanation using only haiku poetry? It would certainly give your answer a more artistic touch, don't you think?"
},
{
"role": "assistant",
"content": "Electron streams flow,\nPhysics guides their path and speed,\nSemiconduct's soul.\n\nElectron gun's heart,\nQuantum dance, charged particles,\nPrecision's fine art.\n\nOptics shape the beam,\nMechanics' firm hand guides light,\nFocused like a dream.\n\nIn the silicon dance,\nPhysics, mechanics embrace,\nAdvance our expanse.\n\nDesign's core must know,\nNature's laws to harness force,\nFor tech to grow, glow."
},
{
"role": "user",
"content": "Just to make sure I've got it right, your haikus suggest that the electron gun and optics are like artists, requiring precision and a deep understanding of natural laws to create their masterpiece?"
},
{
"role": "assistant",
"content": "Indeed, that's a poetic interpretation. The haikus suggest that the electron gun (the creator of the electron beam) and the optics (which shape and focus the beam) require a precise understanding of physics and mechanics, much like artists need a deep understanding of their tools and materials to create a masterpiece. The natural laws are the guidelines within which they must work to achieve the desired outcome in semiconductor equipment design."
},
{
"role": "user",
"content": "I'd like to understand more about the 'precision's fine art' you mentioned. Can you explain how precision is achieved in the design of the electron gun and optics?"
},
{
"role": "assistant",
"content": "Certainly! Achieving precision in the design of the electron gun and optics in electron beam systems involves several key factors:\n\n1. **Electron Gun Design:**\n - **Vacuum Environment:** The electron gun operates in a high vacuum to prevent electron scattering by air molecules, ensuring a precise and concentrated beam.\n - **Cathode Material:** The choice of cathode material affects the emission efficiency and stability of the electron beam. Materials like tungsten or lanthanum hexaboride are often used for their high melting points and low work functions.\n - **Anode Configuration:** The anode accelerates the electrons and its design, including the shape and voltage applied, determines the beam's energy and focus.\n - **Electromagnetic Lenses:** These lenses control the focus of the electron beam, and their design must account for spherical and chromatic aberrations to maintain precision.\n\n2. **Optics Design:**\n - **Electromagnetic Lenses:** Just like in the electron gun, these lenses in the optical column are used to focus and deflect the electron beam with high precision. The design and alignment of these lenses are critical for achieving the desired spot size and shape on the target.\n - **Stigmators:** These are used to correct for astigmatism in the electron beam, ensuring that the beam remains symmetrical and focused over a range of operating conditions.\n - **Apertures:** Placed at strategic points in the optical column, apertures help to shape the beam and eliminate electrons that stray from the desired path, improving resolution and reducing aberrations.\n - **Thermal Stability:** Precision in electron beam systems also depends on maintaining thermal stability to prevent drift or deformation of the optical components.\n\n3. **Control Systems:**\n - **Feedback Loops:** Precision is maintained through feedback systems that monitor the beam's characteristics and make real-time adjustments to the electron gun and optics to correct any deviations.\n - **Vibration Isolation:** The entire system must be isolated from external vibrations that could disturb the beam's path, often using pneumatic or magnetic isolation systems.\n\n4. **Material Science:**\n - **Material Properties:** The materials used in the construction of the electron gun and optics must be chosen for their stability under the operating conditions, including resistance to thermal expansion and electromagnetic interference.\n\n5. **Computational Modeling:**\n - **Simulation Software:** Advanced software is used to model the electron beam's behavior and interactions with the system's components, allowing for the optimization of the design before physical construction.\n\nIn summary, 'precision's fine art' in the context of electron beam systems refers to the meticulous design and engineering of the electron gun and optics, which are achieved through a deep understanding of physics and materials science, as well as the implementation of sophisticated control and feedback mechanisms."
}
],
}
```
### Data Fields
The dataset comprises the following fields:
messages: String representing a conversation in OpenAI format.
```json
[
{"role": "system", "content": "...."},
{"role": "user", "content": "...."},
{"role": "assistant", "content": "...."},
{"role": "user", "content": "...."},
{"role": "assistant", "content": "...."},
....
]
```
To read the conversation, use `json.loads()`.
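As a minimal sketch of that step: the helper below decodes `messages` with `json.loads()` when it is a JSON string and passes it through unchanged when it is already a list (the card metadata describes `messages` as a list of `content`/`role` items, so the value may arrive already decoded, depending on how the data is loaded). The function name `read_conversation` and the sample record are hypothetical.

```python
import json


def read_conversation(messages):
    """Return the conversation as a list of {"role", "content"} dicts.

    Accepts either a JSON-encoded string or an already-decoded list.
    """
    if isinstance(messages, str):
        messages = json.loads(messages)
    return [{"role": m["role"], "content": m["content"]} for m in messages]


# Hypothetical record, shaped like the format shown above.
raw = json.dumps([
    {"role": "system", "content": ""},
    {"role": "user", "content": "What is an electron gun?"},
    {"role": "assistant", "content": "It generates and accelerates electrons."},
])

conversation = read_conversation(raw)
for turn in conversation:
    print(f"{turn['role']}: {turn['content']}")
```

Calling the helper on its own output is a no-op, so it can be applied safely regardless of which form a loader returns.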
### Data Splits
train
### Dataset Creation
### Source Data
Please refer to the paper AgentInstruct: Toward Generative Teaching with Agentic Flows for further details.
### Data Collection and Processing
Please refer to the paper AgentInstruct: Toward Generative Teaching with Agentic Flows for details about the dataset construction.
### Who are the source data producers?
Microsoft
### Annotation process
We generate questions and answers using Azure GPT-4.
### Personal and Sensitive Information
None
### Bias, Risks, and Limitations
• This dataset is in English.
• The dataset inherits the biases, errors, and omissions known to exist in the data used for seed sources and in the models used for data generation.
• This dataset is not intended to represent any specific domain, and contains generic data. However, the AgentInstruct methodology, which was used to create this dataset, can be used to generate high-quality domain specific data, which can be used to fine-tune any existing model for a specific domain.
• The dataset is synthetically generated and hence may contain inaccuracies that do not reflect real-world phenomena.
• The synthetic nature of this dataset may limit its ability to generalize to real-world cases.
• The data is intended for research and experimentation in model training and synthetic data generation.
### Citation
If you find this work useful in your method, you can cite the paper as below:
@misc{mitra2024agentinstruct,
title={AgentInstruct: Toward Generative Teaching with Agentic Flows},
author={Arindam Mitra and Luciano Del Corro and Guoqing Zheng and Shweti Mahajan and Dany Rouhana and Andres Codas and Yadong Lu and Wei-ge Chen and Olga Vrousgos and Corby Rosset and Fillipe Silva and Hamed Khanpour and Yash Lara and Ahmed Awadallah},
year={2024},
eprint={2407.03502},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
### Dataset Card Contact
orca-team@microsoft.com
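## Basic Dataset Information
- Language: English
- License: CDLA-Permissive-2.0 (`cdla-permissive-2.0`)
- Size category: 1M < n < 10M samples
- Task category: question answering (`question-answering`)
### Dataset Details
#### Features
The data has a single `messages` field, a list of conversation turns. Each turn has two subfields:
- `content`: string, the text of the turn
- `role`: string, the role of the speaker
#### Data Splits
Details for each split:
| Split                  | Size (bytes)   | Samples  |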
|------------------------|----------------|----------|
| analytical_reasoning | 98028645 | 24983 |
| brain_teaser | 119629503 | 49962 |
| code_ | 589940236 | 99931 |
| creative_content | 280573464 | 49961 |
| fermi | 75519245 | 24984 |
| follow_up | 849785042 | 98947 |
| fs_cot_flow | 103809090 | 24883 |
| mcq | 220938985 | 99873 |
| open_domain_qa | 588970010 | 271991 |
| rag | 409622336 | 49867 |
| rc | 274752766 | 49910 |
| struct2text_flow | 240292423 | 49955 |
| text_classification | 145309572 | 49994 |
| text_extraction | 297532567 | 49887 |
| text_modification | 334112924 | 49919 |
Total download size: 4628816808 bytes. Total dataset size: 4628816808 bytes.
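The per-split figures can be cross-checked against the stated total with a few lines of Python. This is only an arithmetic sketch over the numbers copied from the table above; the variable names are illustrative.

```python
# (size in bytes, number of samples) per split, copied from the table above.
SPLITS = {
    "analytical_reasoning": (98028645, 24983),
    "brain_teaser": (119629503, 49962),
    "code_": (589940236, 99931),
    "creative_content": (280573464, 49961),
    "fermi": (75519245, 24984),
    "follow_up": (849785042, 98947),
    "fs_cot_flow": (103809090, 24883),
    "mcq": (220938985, 99873),
    "open_domain_qa": (588970010, 271991),
    "rag": (409622336, 49867),
    "rc": (274752766, 49910),
    "struct2text_flow": (240292423, 49955),
    "text_classification": (145309572, 49994),
    "text_extraction": (297532567, 49887),
    "text_modification": (334112924, 49919),
}
STATED_DATASET_SIZE = 4628816808  # total dataset size stated on the card

total_bytes = sum(size for size, _ in SPLITS.values())
total_samples = sum(count for _, count in SPLITS.values())
largest_split = max(SPLITS, key=lambda name: SPLITS[name][1])

print("bytes match stated total:", total_bytes == STATED_DATASET_SIZE)
print("samples across listed splits:", total_samples)
print("largest split by sample count:", largest_split)
```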
### Configuration
Data file paths under the default configuration (`default`):
- analytical_reasoning: data/analytical_reasoning-*
- brain_teaser: data/brain_teaser-*
- code_: data/code_-*
- creative_content: data/creative_content-*
- fermi: data/fermi-*
- follow_up: data/follow_up-*
- fs_cot_flow: data/fs_cot_flow-*
- mcq: data/mcq-*
- open_domain_qa: data/open_domain_qa-*
- rag: data/rag-*
- rc: data/rc-*
- struct2text_flow: data/struct2text_flow-*
- text_classification: data/text_classification-*
- text_extraction: data/text_extraction-*
- text_modification: data/text_modification-*
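The mapping above follows a single pattern, `data/<split>-*`. The sketch below encodes that pattern and shows one way a split might be loaded with the Hugging Face `datasets` library. The repository id is taken from this page's title and the helper names are hypothetical; `load_split` needs the `datasets` package and network access, so it is defined but not called here.

```python
REPO_ID = "openeurollm/orca-agentinstruct-1M-v1-decontaminated"  # from the page title

SPLIT_NAMES = [
    "analytical_reasoning", "brain_teaser", "code_", "creative_content",
    "fermi", "follow_up", "fs_cot_flow", "mcq", "open_domain_qa", "rag",
    "rc", "struct2text_flow", "text_classification", "text_extraction",
    "text_modification",
]


def data_file_pattern(split):
    """Return the data file glob listed for a split in the default config."""
    if split not in SPLIT_NAMES:
        raise ValueError(f"unknown split: {split!r}")
    return f"data/{split}-*"


def load_split(split):
    """Load one split from the Hub (requires `datasets` and network access)."""
    from datasets import load_dataset  # imported lazily so the sketch runs without it

    data_file_pattern(split)  # validates the split name before any download
    return load_dataset(REPO_ID, split=split)


print(data_file_pattern("code_"))
```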
## Decontamination Notes
This dataset is a decontaminated version of Microsoft's `microsoft/orca-agentinstruct-1M-v1`.
### Benchmarks Used
1. **MATH500**: `HuggingFaceH4/MATH-500` (subset=default, split=test)
2. **AIME24** (American Invitational Mathematics Examination 2024): `HuggingFaceH4/aime_2024` (subset=default, split=train)
3. **AIME25** (American Invitational Mathematics Examination 2025): `math-ai/aime25` (subset=default, split=test)
4. **AMC23** (American Mathematics Competitions 2023): `math-ai/amc23` (subset=default, split=test)
5. **JEEBench**: `daman1209arora/jeebench` (subset=default, split=test)
6. **GPQADiamond**: `Idavidrein/gpqa` (subset=gpqa_diamond, split=train)
7. **LiveCodeBench**: `ali-elganzory/livecodebench-code_generation_lite` (subset=release_v6, split=test)
8. **HumanEval**: `openai_humaneval` by OpenAI (`openai`) (subset=openai_humaneval, split=test)
9. **MBPP**: `mbpp` by Google Research Datasets (`google-research-datasets`) (subset=full, split=train+test+validation+prompt)
10. **IFEval**: `IFEval` by Google (`google`) (subset=default, split=train)
11. **AlpacaEval**: `tatsu-lab/alpaca_eval` (subset=alpaca_eval, split=eval)
12. **Arena-Hard-v2.0**: `lmarena-ai/arena-hard-auto` (subset=default, split=train)
### Decontamination Parameters
| Parameter       | Value |
|-----------------|-------|
| N-gram size     | 8     |
| Match threshold | 0.5   |
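The card gives only these two parameters and does not describe the matching procedure itself. The following is a minimal sketch of one common n-gram overlap check, not the pipeline actually used. It assumes whitespace tokenization, lowercasing, and that the threshold is the fraction of a benchmark item's 8-grams that also occur in a training sample. All function names and sample strings are hypothetical.

```python
def ngrams(text, n=8):
    """Return the set of word n-grams in `text` (lowercased, whitespace tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(sample, benchmark_item, n=8):
    """Fraction of the benchmark item's n-grams that also appear in the sample."""
    bench = ngrams(benchmark_item, n)
    if not bench:  # benchmark item shorter than n tokens: nothing to match
        return 0.0
    return len(bench & ngrams(sample, n)) / len(bench)


def is_contaminated(sample, benchmark_items, n=8, threshold=0.5):
    """Flag a sample if any benchmark item reaches the overlap threshold (assumed >=)."""
    return any(overlap_ratio(sample, item, n) >= threshold for item in benchmark_items)


# Hypothetical strings for illustration only.
bench = "Find the sum of all positive integers n such that n squared plus one is prime"
dirty = "Question: " + bench + " Show your work."
clean = "Describe how an electron gun focuses a beam inside a scanning electron microscope column"

print(is_contaminated(dirty, [bench]), is_contaminated(clean, [bench]))
```

Measuring overlap relative to the benchmark item, rather than the training sample, keeps long training conversations from diluting a verbatim benchmark question; other normalizations are equally plausible.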
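### Per-Split Benchmark Contamination Details
Contamination statistics for each split against each benchmark:
| Subset | Split | Samples in dataset | Benchmark | Contaminated samples in dataset | Dataset contamination rate | Samples in benchmark | Contaminated samples in benchmark | Benchmark contamination rate |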
|--------|------------------------|----------------|------------------|--------------------|--------------|------------------|----------------------|------------------|
| default| analytical_reasoning | 25000 | MATH500 | 16 | 0.0640% | 500 | 3 | 0.6000% |
| default| analytical_reasoning | 25000 | AIME24 | 0 | 0.0000% | 30 | 0 | 0.0000% |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
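## Dataset Summary
| Metric                                  | Value     |
|-----------------------------------------|-----------|
| Total samples in dataset                | 1,227,291 |
| Contaminated samples removed            | 1,363     |
| Samples remaining after decontamination | 1,225,928 |
| Overall dataset contamination rate      | 0.1111%   |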
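The percentages in the contamination and summary tables are plain ratios of contaminated samples to total samples, shown to four decimal places. A short sketch that reproduces the figures quoted above (the helper name is illustrative):

```python
def contamination_rate(contaminated, total):
    """Format contaminated/total as a percentage with four decimal places."""
    return f"{100 * contaminated / total:.4f}%"


# Rows quoted in the per-split table: analytical_reasoning vs. MATH500.
dataset_rate = contamination_rate(16, 25000)
benchmark_rate = contamination_rate(3, 500)

# Figures from the summary table.
total_samples = 1227291
removed = 1363
remaining = total_samples - removed
overall_rate = contamination_rate(removed, total_samples)

print(dataset_rate, benchmark_rate, remaining, overall_rate)
```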
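## Dataset Card
This dataset is a fully synthetic set of instruction pairs in which both the prompts and the responses were synthetically generated using the AgentInstruct framework. AgentInstruct is an extensible agentic framework for synthetic data generation.
This dataset contains about 1 million instruction pairs generated by AgentInstruct, using only raw text content publicly available on the web as seeds. The data covers a range of capabilities, including text editing, creative writing, coding, and reading comprehension, and can be used for instruction tuning of any base large language model (LLM).
Using the full AgentInstruct data (a superset of this dataset with about 25 million instruction pairs), we post-trained Mistral-7b. Comparing the resulting model, Orca-3-Mistral, with Mistral-7b-Instruct (which uses the same base model), we observed significant improvements across many benchmarks: for example, a 40% improvement on AGIEval, 19% on MMLU, 54% on GSM8K, 38% on BBH, and 45% on AlpacaEval. For more details, please refer to the paper AgentInstruct: Toward Generative Teaching with Agentic Flows.
### Dataset Sources
- **Repository**: `microsoft/orca-agentinstruct-1M-v1`
- **Paper**: AgentInstruct: Toward Generative Teaching with Agentic Flows
### Direct Use
This is a general-purpose dataset built with the AgentInstruct methodology. It contains diverse synthetic data and can be used for research on instruction tuning and on the use and generation of synthetic data.
### Out-of-Scope Use
This dataset is shared for research purposes only. If it is used to train models for real-world tasks, we recommend additional testing and validation as needed. This dataset is not to be used in educational institutions or systems, or in healthcare systems.
### Dataset Structure
#### Data Instances
A typical entry in the dataset is a conversation, that is, a sequence of questions and their corresponding responses. An example: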
```json
{
"messages": [
{
"role": "system",
"content": ""
},
{
"role": "user",
"content": "Argue the importance of having a strong understanding of physics and mechanics in the design of electron beam systems for semiconductor equipment."
},
{
"role": "assistant",
"content": "Having a strong understanding of physics and mechanics is crucial in the design of electron beam systems for semiconductor equipment. This is because electron beam systems rely heavily on the principles of physics and mechanics to function effectively and efficiently.\n\nFirstly, a strong understanding of physics is necessary to design the electron gun, which is the heart of the electron beam system. The electron gun generates and accelerates the electrons that are used to create the beam. The design of the electron gun must take into account the properties of electrons, such as their charge, mass, and velocity, as well as the principles of electromagnetism and quantum mechanics. Without a thorough understanding of these principles, it would be impossible to design an electron gun that can generate and accelerate electrons with the required energy and precision.\n\nSecondly, a strong understanding of mechanics is essential in the design of the electron beam system's optics. The optics are responsible for shaping and focusing the electron beam, which is critical for achieving high-resolution imaging and precise material processing. The design of the optics must take into account the principles of optics, such as refraction, reflection, and diffraction, as well as the mechanical properties of the components, such as their stiffness, thermal expansion, and vibration resistance. Without a thorough understanding of these principles, it would be impossible to design optics that can shape and focus the electron beam with the required precision and stability.\n\nIn conclusion, a strong understanding of physics and mechanics is essential in the design of electron beam systems for semiconductor equipment. Without this understanding, it would be impossible to design electron guns and optics that can generate and shape the electron beam with the required precision and stability. Therefore, designers of electron beam systems must have a solid foundation in physics and mechanics to ensure that their designs are effective and efficient."
}
]
}
```
#### Data Fields
The dataset comprises the following fields:
- `messages`: a string representing a conversation in OpenAI format, for example:
```json
[
{"role": "system", "content": "...."},
{"role": "user", "content": "...."},
{"role": "assistant", "content": "...."},
{"role": "user", "content": "...."},
{"role": "assistant", "content": "...."},
....
]
```
The conversation field can be parsed with `json.loads()`.
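Once parsed, a multi-turn conversation is often split into one training example per assistant reply for instruction tuning. The sketch below shows one such split, pairing each assistant turn with all turns that precede it. This is an illustrative convention, not a procedure described by the card, and the function name and sample data are hypothetical.

```python
import json


def to_training_pairs(messages):
    """Return (context, response) pairs, one per assistant turn.

    `context` is the list of all turns before the assistant reply.
    """
    if isinstance(messages, str):  # decode if stored as a JSON string
        messages = json.loads(messages)
    pairs = []
    history = []
    for turn in messages:
        if turn["role"] == "assistant":
            pairs.append((list(history), turn["content"]))
        history.append(turn)
    return pairs


# Hypothetical two-round conversation in the format shown above.
example = json.dumps([
    {"role": "system", "content": ""},
    {"role": "user", "content": "Q1"},
    {"role": "assistant", "content": "A1"},
    {"role": "user", "content": "Q2"},
    {"role": "assistant", "content": "A2"},
])

pairs = to_training_pairs(example)
for context, response in pairs:
    print(len(context), "context turns ->", response)
```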
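#### Data Splits
The dataset contains only a training split (`train`).
### Dataset Creation
#### Source Data
For details on the source data, please refer to the paper AgentInstruct: Toward Generative Teaching with Agentic Flows.
#### Data Collection and Processing
For details, please refer to the paper AgentInstruct: Toward Generative Teaching with Agentic Flows.
#### Source Data Producers
Microsoft
#### Annotation Process
We generate questions and answers using Azure GPT-4.
#### Personal and Sensitive Information
The dataset contains no personal or sensitive information.
### Bias, Risks, and Limitations
1. This dataset contains English content only.
2. This dataset inherits the biases, errors, and omissions already present in the seed data and in the models used for generation.
3. This dataset was not built for any specific domain and contains only generic data. However, the AgentInstruct methodology used to create it can generate high-quality domain-specific synthetic data for fine-tuning existing models on a specific domain.
4. This dataset is synthetically generated and may therefore contain inaccuracies that do not reflect real-world phenomena.
5. The synthetic nature of this dataset may limit its ability to generalize to real-world cases.
6. This dataset is intended only for research and experimentation in model training and synthetic data generation.
### Citation
If your research uses this dataset, please cite the following paper:
```bibtex
@misc{mitra2024agentinstruct,
title={AgentInstruct: Toward Generative Teaching with Agentic Flows},
author={Arindam Mitra and Luciano Del Corro and Guoqing Zheng and Shweti Mahajan and Dany Rouhana and Andres Codas and Yadong Lu and Wei-ge Chen and Olga Vrousgos and Corby Rosset and Fillipe Silva and Hamed Khanpour and Yash Lara and Ahmed Awadallah},
year={2024},
eprint={2407.03502},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
### Dataset Card Contact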
orca-team@microsoft.com
Provided by: openeurollm



