locailabs/nemotron_terminal_filtered
Source: Hugging Face. Updated 2026-04-13; indexed 2026-05-10.
Download link:
https://hf-mirror.com/datasets/locailabs/nemotron_terminal_filtered
Resource description:
---
license: cc-by-4.0
task_categories:
- question-answering
language:
- en
tags:
- code
- terminal
- uncertainty-sampling
size_categories:
- 10K<n<100K
---
# Nemotron Terminal Filtered
An uncertainty-curated subset of NVIDIA's [Nemotron-Terminal-Corpus](https://huggingface.co/datasets/nvidia/Nemotron-Terminal-Corpus) (`dataset_adapters` split), selected for high information density when post-training [NVIDIA-Nemotron-3-Super-120B-A12B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16).
## Motivation
The full `dataset_adapters` split contains ~226k terminal execution trajectories. To curate a compact, high-value subset for post-training, we score each sample by how hard the model finds it, using the entropy of its output distribution as a proxy for uncertainty. The resulting **30,000 samples** are the tasks on which the model is most uncertain, and from which it therefore stands to learn the most.
All original columns from the NVIDIA dataset are preserved, with `conversations` renamed to `messages` for OpenAI chat format compatibility.
## Method
1. For each sample, we extract the system message and first user message as a prompt.
2. The model generates 32 tokens at temperature 0 (greedy decoding, reasoning enabled) and we collect the top-20 logprobs per token. The 32-token window captures the model's initial reasoning about the task.
3. **Entropy** is computed per sample: at each of the 32 tokens, the top-20 probabilities are renormalised to sum to 1 and their Shannon entropy is taken; the sample score is the mean of these values across the window. High entropy means the model spreads probability across many alternatives, so it is uncertain about what to produce.
4. Samples are ranked by entropy and the top 30,000 are selected.
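Steps 3 and 4 can be sketched as follows. This is a minimal illustration, not the code used to build the dataset: the function names are hypothetical, the input is assumed to be a list of top-k logprob lists (one per generated token), and entropy is measured in nats, which the card does not specify. The choice of logarithm base rescales every score by the same constant, so it does not change the ranking.

```python
import math

def topk_entropy(logprobs):
    """Shannon entropy (nats) of one token's top-k logprobs after renormalisation."""
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logprobs)
    weights = [math.exp(lp - m) for lp in logprobs]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Terms with p == 0 contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def sample_entropy(token_logprobs):
    """Mean per-token entropy over the generation window (32 tokens in the card)."""
    return sum(topk_entropy(lps) for lps in token_logprobs) / len(token_logprobs)

def select_most_uncertain(scores, n):
    """Return the ids of the n highest-entropy samples, highest first.

    `scores` is assumed to map a sample id to its entropy score.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n]
```

With this sketch, a token whose top-k mass sits almost entirely on one candidate scores near 0, while a uniform spread over k candidates scores `log(k)`, the maximum possible for that k.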
## Columns
| Column | Description |
|---|---|
| `messages` | Multi-turn chat messages (renamed from `conversations`) |
| `agent` | Agent identifier |
| `model` | Model used for trajectory generation |
| `model_provider` | Provider of the model |
| `date` | Trajectory generation date |
| `task` | Task description |
| `episode` | Episode identifier |
| `run_id` | Run identifier |
| `trial_name` | Trial name |
| `enable_thinking` | Whether thinking/reasoning was enabled during trajectory generation |
| `source` | Source dataset the trajectory was adapted from (null for some subsets) |
## Usage
```python
from datasets import load_dataset
ds = load_dataset("locailabs/nemotron_terminal_filtered", split="train")
```
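To see which part of a row corresponds to the scored prompt (the system message plus the first user message, as described under Method), a small helper along these lines may be useful. It is a sketch that assumes each entry in `messages` is a dict with `role` and `content` keys, as the OpenAI chat format implies; `scoring_prompt` is a hypothetical name, not part of the dataset or of any library.

```python
def scoring_prompt(messages):
    """Return the system message (if present) followed by the first user message.

    Assumes OpenAI-style dicts with a "role" key; later turns are dropped.
    """
    prompt = []
    system = next((m for m in messages if m["role"] == "system"), None)
    if system is not None:
        prompt.append(system)
    first_user = next((m for m in messages if m["role"] == "user"), None)
    if first_user is not None:
        prompt.append(first_user)
    return prompt
```

For a loaded dataset `ds`, `scoring_prompt(ds[0]["messages"])` would then give the prompt for the first row, provided the assumed message layout holds.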
## Source
This dataset is derived from:
> **Terminal-Corpus: Large-Scale SFT Dataset for Terminal Agents**
> NVIDIA — [nvidia/Nemotron-Terminal-Corpus](https://huggingface.co/datasets/nvidia/Nemotron-Terminal-Corpus)
Uncertainty scoring was performed against:
> [nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16)
```bibtex
@misc{pi2026dataengineeringscalingllm,
title={On Data Engineering for Scaling LLM Terminal Capabilities},
author={Renjie Pi and Grace Lam and Mohammad Shoeybi and Pooya Jannaty and Bryan Catanzaro and Wei Ping},
year={2026},
eprint={2602.21193},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2602.21193},
}
```
Provider:
locailabs



