
AmanPriyanshu/reasoning-sft-OpenThoughts3-1.2M-450K

Source: Hugging Face. Updated 2026-02-27. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/AmanPriyanshu/reasoning-sft-OpenThoughts3-1.2M-450K
Description:
---
license: apache-2.0
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- reasoning
- sft
- chain-of-thought
- math
- code
- science
pretty_name: reasoning-sft-OpenThoughts3-1.2M-450K
size_categories:
- 100K<n<1M
---

# reasoning-sft-OpenThoughts3-1.2M-450K

Converted version of [open-thoughts/OpenThoughts3-1.2M](https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M), filtered to rows with exactly one valid `<think>...</think>` block. 750K rows were dropped due to missing or malformed think tags.

## Format

Each row has three columns:

- `input`: list of dicts (conversation turns with `role` and `content`; `human` → `user`, last gpt turn removed)
- `response`: gpt response string including `<think>` reasoning block
- `domain`: task domain (`math`, `code`, `science`)

## Domain Distribution

| Domain  | Count   | %      |
|---------|---------|--------|
| math    | 274,262 | 60.86% |
| science | 88,474  | 19.63% |
| code    | 87,874  | 19.50% |

## Usage

```
import random
import pyarrow.parquet as pq
from huggingface_hub import hf_hub_download

fpath = hf_hub_download(
    repo_id="AmanPriyanshu/reasoning-sft-OpenThoughts3-1.2M-450K",
    repo_type="dataset",
    filename="data_converted.parquet",
    local_dir="./tmp_openthoughts_peek"
)
pf = pq.ParquetFile(fpath)
rows = {"input": [], "response": [], "domain": []}
for batch in pf.iter_batches(batch_size=65_536):
    d = batch.to_pydict()
    rows["input"].extend(d["input"])
    rows["response"].extend(d["response"])
    rows["domain"].extend(d["domain"])

total = len(rows["input"])
for idx in random.sample(range(total), 3):
    print(f"\n{'='*80}\nRow {idx:,} / {total:,} | domain: {rows['domain'][idx]}\n{'='*80}")
    for msg in rows["input"][idx]:
        print(f"\n  [{msg['role']}]\n  {msg['content'][:300]}{'...' if len(msg['content']) > 300 else ''}")
    print(f"\n[response]\n{rows['response'][idx][:600]}{'...' if len(rows['response'][idx]) > 600 else ''}")
```

## License

Apache 2.0

## Credits

Original dataset: [open-thoughts/OpenThoughts3-1.2M](https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M)

Annotations generated with QwQ-32B
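
## Example: splitting the reasoning block

Because every `response` is described as containing exactly one `<think>...</think>` block, it may be useful to separate the reasoning from the final answer before training or evaluation. The following is a minimal sketch, not the author's conversion script: the helper name `split_response` is hypothetical, and the validity rule it applies (one opening tag, one closing tag, non-empty content) is an assumption, since the card does not spell out the exact filter.

```python
import re

# Assumption: a "valid" block is a single <think>...</think> pair with
# non-empty content. The dataset card does not state its exact rule.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_response(response: str):
    """Return (reasoning, answer), or None unless exactly one valid think block exists.

    Hypothetical helper for illustration only.
    """
    # Reject rows with missing or repeated tags.
    if response.count("<think>") != 1 or response.count("</think>") != 1:
        return None
    match = THINK_RE.search(response)
    # Reject tags in the wrong order or an empty reasoning block.
    if match is None or not match.group(1).strip():
        return None
    reasoning = match.group(1).strip()
    answer = response[match.end():].strip()  # text after the closing tag
    return reasoning, answer
```

Applied to a row loaded as in the Usage section, `split_response(rows["response"][idx])` would give the reasoning text and the text that follows it as two separate strings.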
Provider: AmanPriyanshu