laion/nemotron-terminal-adapters_math
Source: Hugging Face. Updated 2026-04-13; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/laion/nemotron-terminal-adapters_math
Description:
---
license: cc-by-4.0
task_categories:
- question-answering
language:
- en
tags:
- code
- terminal
- agent
- trace
- sft
configs:
- config_name: default
  data_files:
  - split: train
    path: data.parquet
---
# nemotron-terminal-adapters_math
Per-source partition of [nvidia/Nemotron-Terminal-Corpus](https://huggingface.co/datasets/nvidia/Nemotron-Terminal-Corpus),
filtered to `source == "adapters_math"`. The `difficulty` column preserves the original
`easy` / `medium` / `mixed` split (`na` for the `dataset_adapters/*` files, which
did not carry a difficulty label).
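
A minimal loading sketch, assuming the Hugging Face `datasets` library is installed and the repository is reachable under the id in this card's title. The split name `train` comes from the front matter above; `load_partition` and `difficulty_counts` are hypothetical helper names introduced here for illustration only.

```python
from collections import Counter

# Repo id taken from this card's title (assumed to be the Hub id).
REPO_ID = "laion/nemotron-terminal-adapters_math"


def load_partition(repo_id: str = REPO_ID, split: str = "train"):
    """Load the single parquet split declared in the card's `configs` block."""
    # Imported lazily so the counting helper works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split)


def difficulty_counts(rows) -> Counter:
    """Count rows per `difficulty` value for any iterable of dict-like rows."""
    return Counter(row["difficulty"] for row in rows)
```

Calling `difficulty_counts(load_partition())` gives a quick sanity check. Since `adapters_math` rows come from a `dataset_adapters/*` file, the description above suggests that `na` should be the only value seen.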
Partitioning scheme:
- **adapters_{code,math,swe}** — rows from `dataset_adapters/{code,math,swe}.parquet`
- **{skill}** (e.g. `debugging`, `security`, …) — rows from
`synthetic_tasks/skill_based/{easy,medium,mixed}/{skill}/data_filtered.parquet`
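
The partitioning scheme above can be sketched as a mapping from partition name to upstream file paths. This is an illustrative reading of the two rules, not the script used to build the partitions; `upstream_files` is a hypothetical helper, and any name without the `adapters_` prefix is assumed to be a skill.

```python
DIFFICULTIES = ("easy", "medium", "mixed")
ADAPTERS = ("code", "math", "swe")


def upstream_files(partition: str) -> list[str]:
    """Return the upstream parquet paths that a partition name draws from."""
    prefix = "adapters_"
    if partition.startswith(prefix):
        name = partition[len(prefix):]
        if name not in ADAPTERS:
            raise ValueError(f"unknown adapters partition: {partition!r}")
        return [f"dataset_adapters/{name}.parquet"]
    # Anything else is treated as a skill name; this sketch does not
    # verify that the skill actually exists in the upstream corpus.
    return [
        f"synthetic_tasks/skill_based/{level}/{partition}/data_filtered.parquet"
        for level in DIFFICULTIES
    ]
```

For example, `upstream_files("adapters_math")` yields the single path `dataset_adapters/math.parquet`, while a skill such as `debugging` expands to one path per difficulty level.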
## Columns
Same as the source dataset (`conversations`, `agent`, `model`, `model_provider`,
`date`, `task`, `episode`, `run_id`, `trial_name`, `enable_thinking`) plus:
- `source` — the partition key (`"adapters_math"` throughout this repo)
- `difficulty` — `easy` / `medium` / `mixed` / `na`
- `original_source` — only present in `adapters_code`; preserves the original
`source` column value (`OpenCodeReasoning` or `synthetic`) from the upstream file.
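
The column list above can be turned into a small row check. This is a sketch under the assumption that rows are available as plain dictionaries; `expected_columns` and `check_row` are hypothetical names, and the column order used here is not guaranteed to match the parquet file.

```python
# Columns inherited from the source dataset, as listed in this card.
BASE_COLUMNS = (
    "conversations", "agent", "model", "model_provider", "date",
    "task", "episode", "run_id", "trial_name", "enable_thinking",
)
ADDED_COLUMNS = ("source", "difficulty")
DIFFICULTY_VALUES = {"easy", "medium", "mixed", "na"}


def expected_columns(partition: str) -> list[str]:
    """Columns a partition should expose according to the card."""
    columns = list(BASE_COLUMNS + ADDED_COLUMNS)
    if partition == "adapters_code":
        # Only this partition keeps the upstream `source` value.
        columns.append("original_source")
    return columns


def check_row(row: dict, partition: str = "adapters_math") -> list[str]:
    """Return a list of human-readable problems; empty means the row looks fine."""
    problems = []
    missing = [c for c in expected_columns(partition) if c not in row]
    if missing:
        problems.append(f"missing columns: {missing}")
    if row.get("source") != partition:
        problems.append(f"unexpected source: {row.get('source')!r}")
    if row.get("difficulty") not in DIFFICULTY_VALUES:
        problems.append(f"unexpected difficulty: {row.get('difficulty')!r}")
    return problems
```

Returning a list of problems rather than raising keeps the check usable in a loop over many rows, where collecting all issues is usually more helpful than stopping at the first.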
## Citation
```bibtex
@misc{pi2026dataengineeringscalingllm,
title={On Data Engineering for Scaling LLM Terminal Capabilities},
author={Renjie Pi and Grace Lam and Mohammad Shoeybi and Pooya Jannaty and Bryan Catanzaro and Wei Ping},
year={2026},
eprint={2602.21193},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2602.21193},
}
```
Original dataset license: CC-BY-4.0.
Provider:
laion



