
AmanPriyanshu/reasoning-sft-extract-0

Source: Hugging Face. Updated: 2026-03-10. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/AmanPriyanshu/reasoning-sft-extract-0
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- reasoning
- sft
- chain-of-thought
- information-extraction
- structured-generation
- json-extraction
- synthetic
pretty_name: reasoning-sft-extract-0
size_categories:
- 100K<n<1M
---

# reasoning-sft-extract-0

280K rows converted from [HenriqueGodoy/extract-0](https://huggingface.co/datasets/HenriqueGodoy/extract-0) into a clean two-column SFT format with minimal `<think>` reasoning tags for structured JSON extraction training. Multi-chunk `reference_text` arrays are resolved and joined so every row contains the full source document context.

## Format

Each row has two columns:

- **`input`**: a JSON string encoding a list of dicts (conversation turns with `role` and `content`). It holds the system prompt plus a user turn containing both the extraction schema and the document text.
- **`response`**: a string formatted as `<think>{brief reasoning}</think>\n{extracted JSON}`.

The `<think>` block uses 100 unique minimal reasoning variations (10×10 slot fill) to avoid degenerate repetition while keeping reasoning short, consistent with the system prompt instruction to keep chain-of-thought brief.

## Schema

```
input (JSON string):
[
  {"role": "system", "content": "You are a structured JSON extraction model..."},
  {"role": "user", "content": "### Extraction Schema:\n{...}\n\n### Document:\n{...}"}
]

response (plain string):
<think>Scanning for matching entities.</think>
{"title": "Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models"}
```

## Stats

| Stat | Value |
|------|-------|
| Rows | 280,128 |
| Resolve coverage | 100% (0 empty docs) |
| Source documents | arXiv, PubMed, Wikipedia, FDA |
| Median total chars | ~2,558 |
| P95 tokens (est.) | ~760 |
| Fits 2048 tok ctx | 100% |

## Usage

```python
import json, random
import pyarrow.parquet as pq
from huggingface_hub import hf_hub_download

fpath = hf_hub_download(
    repo_id="AmanPriyanshu/reasoning-sft-extract-0",
    repo_type="dataset",
    filename="train.parquet",
    local_dir="./tmp_extract0_peek"
)
pf = pq.ParquetFile(fpath)
rows = {"input": [], "response": []}
for batch in pf.iter_batches(batch_size=65_536):
    d = batch.to_pydict()
    rows["input"].extend(d["input"])
    rows["response"].extend(d["response"])

idx = random.randint(0, len(rows["input"]) - 1)
msgs = json.loads(rows["input"][idx])
for m in msgs:
    print(f"[{m['role']}] {m['content'][:200]}...")
print(f"[response] {rows['response'][idx][:300]}...")
```

## License

Apache 2.0

## Credits

Original dataset: [HenriqueGodoy/extract-0](https://huggingface.co/datasets/HenriqueGodoy/extract-0) by Henrique Godoy ([paper](https://arxiv.org/abs/2509.22906))
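
The row layout described in the Format and Schema sections can be unpacked with a few lines of standard-library Python. The sketch below is illustrative only: the helper names `split_response` and `split_user_turn` and the example row are hypothetical and not part of the dataset. It assumes every `response` starts with a single `<think>...</think>` block followed by valid JSON, and that the user turn uses the `### Extraction Schema:` and `### Document:` markers exactly as shown in the schema.

```python
import json


def split_response(response: str):
    """Split a `response` string into (reasoning, extracted_json).

    Assumes the `<think>...</think>\n{json}` layout; raises otherwise.
    """
    open_tag, close_tag = "<think>", "</think>"
    if not response.startswith(open_tag) or close_tag not in response:
        raise ValueError("response does not follow the <think>...</think> layout")
    reasoning, _, payload = response[len(open_tag):].partition(close_tag)
    return reasoning.strip(), json.loads(payload.strip())


def split_user_turn(input_str: str):
    """Return (schema_text, document_text) from the `input` JSON string.

    Splits on the first "### Document:" marker, so a schema that itself
    contains that marker would be split incorrectly.
    """
    msgs = json.loads(input_str)
    user = next(m["content"] for m in msgs if m["role"] == "user")
    schema_part, _, document = user.partition("### Document:")
    schema = schema_part.replace("### Extraction Schema:", "", 1).strip()
    return schema, document.strip()


# Hypothetical row, made up to mirror the documented layout.
example_input = json.dumps([
    {"role": "system", "content": "You are a structured JSON extraction model."},
    {"role": "user", "content": '### Extraction Schema:\n{"title": "string"}\n\n'
                                "### Document:\nA short paper about RL."},
])
example_response = '<think>Scanning for matching entities.</think>\n{"title": "A short paper"}'

reasoning, extracted = split_response(example_response)
schema, document = split_user_turn(example_input)
print(reasoning)
print(extracted)
print(schema)
print(document)
```

Separating the reasoning from the JSON payload this way makes it possible to validate the extracted object on its own, for example when filtering rows or scoring model outputs.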
Provided by:
AmanPriyanshu