AmanPriyanshu/reasoning-sft-Nemotron-Cascade-SFT-SWE-210K
收藏Hugging Face2026-02-27 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/AmanPriyanshu/reasoning-sft-Nemotron-Cascade-SFT-SWE-210K
下载链接
链接失效反馈官方服务:
资源简介:
---
license: cc-by-4.0
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- reasoning
- sft
- chain-of-thought
- code
- swe-bench
- software-engineering
pretty_name: reasoning-sft-Nemotron-Cascade-SFT-SWE-210K
size_categories:
- 100K<n<1M
---
# reasoning-sft-Nemotron-Cascade-SFT-SWE-210K
Converted version of [nvidia/Nemotron-Cascade-SFT-SWE](https://huggingface.co/datasets/nvidia/Nemotron-Cascade-SFT-SWE), filtered to `thinking=True` rows with exactly one valid `<think>...</think>` block.
## Format
Each row has three columns:
- `input` — list of dicts (conversation turns with `role` and `content`, system messages dropped, last assistant message removed)
- `response` — assistant response string including `<think>` reasoning block
- `domain` — `{category}_{source}` (e.g. `SWE Repair_SWE-Fixer-Train`)
## Stats
- Original rows: ~214K
- Converted (thinking=True, clean think tags): ~210K
## Usage
```
import random
import pyarrow.parquet as pq
from huggingface_hub import hf_hub_download
fpath = hf_hub_download(
repo_id="AmanPriyanshu/reasoning-sft-Nemotron-Cascade-SFT-SWE-210K",
repo_type="dataset",
filename="data_converted.parquet",
local_dir="./tmp_swe_peek"
)
pf = pq.ParquetFile(fpath)
rows = {"input": [], "response": [], "domain": []}
for batch in pf.iter_batches(batch_size=65_536):
d = batch.to_pydict()
rows["input"].extend(d["input"])
rows["response"].extend(d["response"])
rows["domain"].extend(d["domain"])
total = len(rows["input"])
for idx in random.sample(range(total), 3):
print(f"\n{'='*80}\nRow {idx:,} / {total:,} | domain: {rows['domain'][idx]}\n{'='*80}")
for msg in rows["input"][idx]:
print(f"\n [{msg['role']}]\n {msg['content'][:300]}{'...' if len(msg['content']) > 300 else ''}")
print(f"\n[response]\n{rows['response'][idx][:600]}{'...' if len(rows['response'][idx]) > 600 else ''}")
```
## License
[CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
## Credits
Original dataset: [nvidia/Nemotron-Cascade-SFT-SWE](https://huggingface.co/datasets/nvidia/Nemotron-Cascade-SFT-SWE) by NVIDIA
Responses generated by DeepSeek-R1-0528
提供机构:
AmanPriyanshu



