
locailabs/nemotron_terminal_filtered

Hugging Face | Updated 2026-04-13 | Indexed 2026-05-10
Download link:
https://hf-mirror.com/datasets/locailabs/nemotron_terminal_filtered
Resource description:
---
license: cc-by-4.0
task_categories:
- question-answering
language:
- en
tags:
- code
- terminal
- uncertainty-sampling
size_categories:
- 10K<n<100K
---

# Nemotron Terminal Filtered

An uncertainty-curated subset of NVIDIA's [Nemotron-Terminal-Corpus](https://huggingface.co/datasets/nvidia/Nemotron-Terminal-Corpus) (`dataset_adapters` split), selected for high information density when post-training [NVIDIA-Nemotron-3-Super-120B-A12B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16).

## Motivation

The full `dataset_adapters` split contains ~226k terminal execution trajectories. To curate a compact, high-value subset for post-training, we score each sample by how hard the model finds it, using entropy as a proxy for uncertainty. The resulting **30,000 samples** represent the tasks where the model is most uncertain and therefore stands to learn the most.

All original columns from the NVIDIA dataset are preserved, with `conversations` renamed to `messages` for OpenAI chat format compatibility.

## Method

1. For each sample, we extract the system message and the first user message as a prompt.
2. The model generates 32 tokens at temperature 0 (greedy decoding, reasoning enabled), and we collect the top-20 logprobs per token. The 32-token window captures the model's initial reasoning about the task.
3. **Entropy** is computed per sample: the mean Shannon entropy of the renormalised top-20 distribution across the 32-token window. High entropy means the model spreads probability across many alternatives, so it is genuinely uncertain about what to produce.
4. Samples are ranked by entropy and the top 30,000 are selected.
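
The scoring and selection in steps 3 and 4 above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names (`topk_entropy`, `sample_entropy`, `select_most_uncertain`) are hypothetical, the entropy is measured in nats (the card does not state the unit), and the logprobs are assumed to arrive as one list of top-k values per generated token.

```python
import math
from typing import List, Mapping, Sequence


def topk_entropy(logprobs: Sequence[float]) -> float:
    """Shannon entropy (nats, an assumption) of one token's top-k logprobs.

    The top-k probabilities generally sum to less than 1, so they are
    renormalised to a proper distribution before the entropy is taken.
    """
    peak = max(logprobs)  # subtract the max for numerical stability
    weights = [math.exp(lp - peak) for lp in logprobs]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sample_entropy(token_logprobs: Sequence[Sequence[float]]) -> float:
    """Mean per-token entropy over the generated window (32 tokens in the card)."""
    if not token_logprobs:
        raise ValueError("need at least one generated token")
    return sum(topk_entropy(t) for t in token_logprobs) / len(token_logprobs)


def select_most_uncertain(scores: Mapping[str, float], n: int) -> List[str]:
    """Return the ids of the n highest-entropy samples, most uncertain first."""
    return sorted(scores, key=lambda sample_id: scores[sample_id], reverse=True)[:n]


if __name__ == "__main__":
    # Toy data: two generated tokens, each with three top-k logprobs.
    window = [
        [math.log(0.90), math.log(0.05), math.log(0.05)],  # confident token
        [math.log(0.34), math.log(0.33), math.log(0.33)],  # uncertain token
    ]
    print(sample_entropy(window))
    print(select_most_uncertain({"a": 0.1, "b": 0.9, "c": 0.5}, n=2))
```

With real data, each inner list would hold the 20 logprobs returned for one generated token, and `n` would be 30,000.
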

## Columns

| Column | Description |
|---|---|
| `messages` | Multi-turn chat messages (renamed from `conversations`) |
| `agent` | Agent identifier |
| `model` | Model used for trajectory generation |
| `model_provider` | Provider of the model |
| `date` | Trajectory generation date |
| `task` | Task description |
| `episode` | Episode identifier |
| `run_id` | Run identifier |
| `trial_name` | Trial name |
| `enable_thinking` | Whether thinking/reasoning was enabled during trajectory generation |
| `source` | Source dataset the trajectory was adapted from (null for some subsets) |

## Usage

```python
from datasets import load_dataset

ds = load_dataset("locailabs/nemotron_terminal_filtered", split="train")
```

## Source

This dataset is derived from:

> **Terminal-Corpus: Large-Scale SFT Dataset for Terminal Agents**
> NVIDIA, [nvidia/Nemotron-Terminal-Corpus](https://huggingface.co/datasets/nvidia/Nemotron-Terminal-Corpus)

Uncertainty scoring was performed against:

> [nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16)

```bibtex
@misc{pi2026dataengineeringscalingllm,
  title={On Data Engineering for Scaling LLM Terminal Capabilities},
  author={Renjie Pi and Grace Lam and Mohammad Shoeybi and Pooya Jannaty and Bryan Catanzaro and Wei Ping},
  year={2026},
  eprint={2602.21193},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.21193},
}
```
Provided by:
locailabs