AmanPriyanshu/reasoning-sft-extract-0
Source: Hugging Face | Updated: 2026-03-10 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/AmanPriyanshu/reasoning-sft-extract-0
Description:
---
license: apache-2.0
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- reasoning
- sft
- chain-of-thought
- information-extraction
- structured-generation
- json-extraction
- synthetic
pretty_name: reasoning-sft-extract-0
size_categories:
- 100K<n<1M
---
# reasoning-sft-extract-0
280K rows converted from [HenriqueGodoy/extract-0](https://huggingface.co/datasets/HenriqueGodoy/extract-0) into a clean two-column SFT format with minimal `<think>` reasoning tags for structured JSON extraction training.
Multi-chunk `reference_text` arrays are resolved and joined so every row contains the full source document context.
## Format
Each row has two columns:
- **`input`** — JSON string encoding a list of dicts (conversation turns with `role` and `content`); system prompt + user turn containing both the extraction schema and document text
- **`response`** — string formatted as `<think>{brief reasoning}</think>\n{extracted JSON}`
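As a minimal sketch of how the `response` layout described above could be taken apart, the helper below splits a string into its reasoning and its extracted JSON. The name `split_response` is hypothetical, and the code assumes every response starts with a single `<think>` block followed by valid JSON, which the card implies but which is worth checking on real rows.

```python
import json
import re

# Matches "<think>...</think>" followed by the JSON payload.
THINK_RE = re.compile(r"^<think>(.*?)</think>\s*(.*)$", re.DOTALL)


def split_response(response: str):
    """Return (reasoning, extracted) from a response string (hypothetical helper)."""
    match = THINK_RE.match(response)
    if match is None:
        raise ValueError("response does not start with a <think> block")
    reasoning, payload = match.group(1), match.group(2)
    # Assumption: the text after </think> is a single valid JSON value.
    return reasoning.strip(), json.loads(payload)


example = (
    "<think>Scanning for matching entities.</think>\n"
    '{"title": "Revolutionizing Reinforcement Learning Framework '
    'for Diffusion Large Language Models"}'
)
reasoning, extracted = split_response(example)
print(reasoning)
print(extracted["title"])
```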
The `<think>` block uses 100 unique minimal reasoning variations (10×10 slot fill) to avoid degenerate repetition while keeping reasoning short, consistent with the system prompt instruction to keep chain-of-thought brief.
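To illustrate how a 10×10 slot fill yields 100 distinct strings, here is a hedged sketch. The two slot lists are invented for illustration: the card does not list the actual phrases, and only "Scanning for matching entities." appears in its example, so the real templates may be structured differently.

```python
import itertools

# Hypothetical slot values; the dataset's real phrases are not published in this card.
VERBS = [
    "Scanning for", "Looking for", "Searching for", "Checking for", "Identifying",
    "Locating", "Finding", "Extracting", "Collecting", "Picking out",
]
OBJECTS = [
    "matching entities", "schema fields", "relevant values", "required keys",
    "named items", "target attributes", "requested details", "listed properties",
    "key facts", "field values",
]

# Every (verb, object) pair gives one short reasoning sentence: 10 x 10 = 100.
variants = [f"{verb} {obj}." for verb, obj in itertools.product(VERBS, OBJECTS)]

print(len(variants), len(set(variants)))
print(variants[0])
```

Combining two small slots keeps each sentence short while avoiding a single repeated string across all rows.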
## Schema
```
input (JSON string):
[
  {"role": "system", "content": "You are a structured JSON extraction model..."},
  {"role": "user", "content": "### Extraction Schema:\n{...}\n\n### Document:\n{...}"}
]
response (plain string):
<think>Scanning for matching entities.</think>
{"title": "Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models"}
```
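The user turn shown above packs the schema and the document into one string. A rough sketch of separating them follows; `split_user_turn` is a hypothetical helper, and it assumes the two `###` headings appear exactly as in the schema listing, with a blank line before `### Document:`.

```python
# Marker strings copied from the schema listing; assumed to be literal in every row.
SCHEMA_MARKER = "### Extraction Schema:\n"
DOCUMENT_MARKER = "\n\n### Document:\n"


def split_user_turn(content: str):
    """Return (schema_text, document) from a user turn (hypothetical helper)."""
    if not content.startswith(SCHEMA_MARKER):
        raise ValueError("user turn does not start with the schema heading")
    body = content[len(SCHEMA_MARKER):]
    # Split on the first document heading so later text stays in the document.
    schema_text, sep, document = body.partition(DOCUMENT_MARKER)
    if not sep:
        raise ValueError("user turn has no document heading")
    return schema_text, document


demo = '### Extraction Schema:\n{"title": "string"}\n\n### Document:\nSome paper text.'
schema_text, document = split_user_turn(demo)
print(schema_text)
print(document)
```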
## Stats
| Stat | Value |
|------|-------|
| Rows | 280,128 |
| Resolve coverage | 100% (0 empty docs) |
| Source documents | arXiv, PubMed, Wikipedia, FDA |
| Median total chars | ~2,558 |
| P95 tokens (est.) | ~760 |
| Fits 2048 tok ctx | 100% |
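The card does not say how the token figures were estimated. As one possible way to sanity-check the context-fit row on your own copy, the sketch below uses a rough 4-characters-per-token heuristic; both that ratio and the helper names are assumptions, and a real tokenizer will give different counts.

```python
import math

# Assumption: roughly 4 characters per token; not the card's stated method.
CHARS_PER_TOKEN = 4


def estimate_tokens(input_json: str, response: str) -> int:
    """Crude token estimate for one row from its total character count."""
    return math.ceil((len(input_json) + len(response)) / CHARS_PER_TOKEN)


def fraction_fitting(rows, ctx: int = 2048) -> float:
    """Share of (input, response) pairs whose estimate fits in `ctx` tokens."""
    counts = [estimate_tokens(inp, resp) for inp, resp in rows]
    return sum(count <= ctx for count in counts) / len(counts)


# Synthetic rows for illustration only: one median-sized, one oversized.
sample_rows = [("a" * 2000, "b" * 558), ("a" * 9000, "b" * 100)]
print(estimate_tokens(*sample_rows[0]))
print(fraction_fitting(sample_rows))
```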
## Usage
```python
import json
import random

import pyarrow.parquet as pq
from huggingface_hub import hf_hub_download

# Download the parquet file from the dataset repo.
fpath = hf_hub_download(
    repo_id="AmanPriyanshu/reasoning-sft-extract-0",
    repo_type="dataset",
    filename="train.parquet",
    local_dir="./tmp_extract0_peek",
)

# Read the file in batches and collect both columns.
pf = pq.ParquetFile(fpath)
rows = {"input": [], "response": []}
for batch in pf.iter_batches(batch_size=65_536):
    d = batch.to_pydict()
    rows["input"].extend(d["input"])
    rows["response"].extend(d["response"])

# Print one random row: the decoded conversation turns, then the response.
idx = random.randint(0, len(rows["input"]) - 1)
msgs = json.loads(rows["input"][idx])
for m in msgs:
    print(f"[{m['role']}] {m['content'][:200]}...")
print(f"[response] {rows['response'][idx][:300]}...")
```
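For SFT pipelines that expect a single list of chat messages, one plausible conversion is to decode `input` and append `response` as the assistant turn. This is a sketch, not the author's training code; `to_chat_messages` is a hypothetical helper, and the `assistant` role name is an assumption about the target chat template.

```python
import json


def to_chat_messages(input_json: str, response: str):
    """Decode the `input` column and append `response` as an assistant turn."""
    messages = json.loads(input_json)
    # Assumption: the training framework uses the role name "assistant".
    messages.append({"role": "assistant", "content": response})
    return messages


# Synthetic row shaped like the schema in this card.
demo_input = json.dumps([
    {"role": "system", "content": "You are a structured JSON extraction model..."},
    {"role": "user", "content": "### Extraction Schema:\n{}\n\n### Document:\ntext"},
])
demo_response = '<think>Scanning for matching entities.</think>\n{"title": "x"}'

chat = to_chat_messages(demo_input, demo_response)
print([m["role"] for m in chat])
```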
## License
Apache 2.0
## Credits
Original dataset: [HenriqueGodoy/extract-0](https://huggingface.co/datasets/HenriqueGodoy/extract-0) by Henrique Godoy — [paper](https://arxiv.org/abs/2509.22906)
Provider:
AmanPriyanshu



