hyunseoki/openthoughts3-dedup-index
Hugging Face | Updated 2026-04-17 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/hyunseoki/openthoughts3-dedup-index
---
license: odc-by
task_categories:
- text-generation
language:
- en
tags:
- reasoning
- math
- code
- science
- deduplication
- openthoughts
size_categories:
- 10K<n<100K
source_datasets:
- open-thoughts/OpenThoughts3-1.2M
pretty_name: OpenThoughts3 Dedup Index
---
# OpenThoughts3 Dedup Index
A deduplicated index over
[`open-thoughts/OpenThoughts3-1.2M`](https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M).
The upstream dataset repeats each problem statement about 18 times on
average (the same question paired with many solver trajectories). This index
keeps exactly one canonical record per unique problem, which makes uniform
random sampling of distinct questions straightforward.
## Summary
- **rows_scanned**: 1200000
- **unique_questions**: 65047
- **unique_with_gt_answer**: 45622
- **duplicate_ratio**: 18.45
- **domain_total_rows**:
  - `code`: 250000
  - `math`: 850000
  - `science`: 100000
- **domain_unique_questions**:
  - `code`: 5693
  - `math`: 53105
  - `science`: 6249
- **top_sources_by_unique**:
  - `ai2-adapt-dev/openmath-2-math`: 53105
  - `nvidia/OpenCodeReasoning`: 2007
  - `organic-chemistry-questions`: 3743
  - `stackexchange-physics`: 2506
  - `stackexchange_codegolf`: 3686
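
The duplicate ratios quoted in this card follow from the counts above: each ratio is total rows divided by unique questions. A short check, using only the numbers listed in this summary (the function and variable names are illustrative):

```python
# Counts copied from the summary above.
domain_total_rows = {"code": 250_000, "math": 850_000, "science": 100_000}
domain_unique_questions = {"code": 5_693, "math": 53_105, "science": 6_249}


def duplicate_ratios(total_rows, unique_questions):
    """Return rows per unique question for each domain, plus the overall ratio."""
    ratios = {d: total_rows[d] / unique_questions[d] for d in total_rows}
    ratios["overall"] = sum(total_rows.values()) / sum(unique_questions.values())
    return ratios


ratios = duplicate_ratios(domain_total_rows, domain_unique_questions)
for name, value in ratios.items():
    print(f"{name}: {value:.2f}")
```

This gives roughly 44 rows per unique question for `code`, about 16 for `math` and `science`, and 18.45 overall.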
## Schema
Each row of `openthoughts3_dedup.jsonl` has the following fields:
| Field | Type | Description |
|---|---|---|
| `hash` | str | MD5 of the normalized (whitespace-collapsed, lowercased) problem text |
| `problem` | str | The problem statement (the `human` turn of the upstream `conversations`) |
| `gt_answer` | str or null | `\boxed{...}` answer extracted from any matching upstream solver response (may be null for code-style problems without a boxed target) |
| `domain` | str | Upstream `domain` field: one of `math`, `code`, `science` |
| `source` | str | Upstream `source` field (e.g. `ai2-adapt-dev/openmath-2-math`, `stackexchange-physics`, `nvidia/OpenCodeReasoning`) |
| `difficulty` | str or null | Upstream `difficulty` value if present |
| `duplicate_count` | int | How many times this question appeared across the 1.2M source rows |
| `first_row_index` | int | Index within the upstream dataset of the first occurrence (for traceability) |
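
A minimal sketch of reading the file with the standard library, assuming it has been downloaded locally as `openthoughts3_dedup.jsonl`. The helper name `iter_rows` is illustrative and not part of any published tooling; the field names come from the table above.

```python
import json

# Field names taken from the schema table above.
EXPECTED_FIELDS = {
    "hash", "problem", "gt_answer", "domain", "source",
    "difficulty", "duplicate_count", "first_row_index",
}


def iter_rows(path):
    """Yield one dict per non-empty line of a JSONL file, checking the field set."""
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            missing = EXPECTED_FIELDS - row.keys()
            if missing:
                raise ValueError(
                    f"line {line_number}: missing fields {sorted(missing)}"
                )
            yield row
```

For example, `rows = list(iter_rows("openthoughts3_dedup.jsonl"))` loads the whole index into memory, which should be manageable at about 65k records.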
## Build
Produced by `scripts/build_openthoughts_dedup_index.py` in the
`memory_reasoning_split` research repo. The script streams all 1.2M
rows of the upstream dataset, MD5-hashes the normalized problem text,
keeps the first-seen record per hash, updates the cached `gt_answer`
when a later duplicate contains a boxed answer, and writes one JSONL
row per unique question plus a summary JSON.
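
The procedure described above can be sketched as follows. This is not the actual build script but an illustration under several assumptions: input rows are simplified dicts with hypothetical `problem` and `response` keys, normalization also strips leading and trailing whitespace, the last `\boxed{...}` in a response is taken as the answer, and `gt_answer` is only filled when it is still missing. Hashes from this sketch may therefore differ from the published `hash` column.

```python
import hashlib
import re


def normalize(text):
    """Collapse whitespace runs to single spaces, strip, and lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def problem_hash(text):
    """MD5 hex digest of the normalized problem text."""
    return hashlib.md5(normalize(text).encode("utf-8")).hexdigest()


def extract_boxed(response):
    """Return the contents of the last \\boxed{...} in response, or None.

    Braces are matched by counting depth, so nested braces survive.
    """
    marker = "\\boxed{"
    start = response.rfind(marker)
    if start == -1:
        return None
    begin = start + len(marker)
    depth = 1
    for i in range(begin, len(response)):
        if response[i] == "{":
            depth += 1
        elif response[i] == "}":
            depth -= 1
            if depth == 0:
                return response[begin:i]
    return None  # unbalanced braces


def build_index(rows):
    """Keep the first-seen record per hash and count duplicates."""
    index = {}
    for row_index, row in enumerate(rows):
        key = problem_hash(row["problem"])
        answer = extract_boxed(row.get("response", ""))
        record = index.get(key)
        if record is None:
            index[key] = {
                "hash": key,
                "problem": row["problem"],
                "gt_answer": answer,
                "duplicate_count": 1,
                "first_row_index": row_index,
            }
        else:
            record["duplicate_count"] += 1
            # Assumption: only fill the answer if none was cached yet.
            if record["gt_answer"] is None and answer is not None:
                record["gt_answer"] = answer
    return index
```

Because `rows` can be any iterable, the same function works over a streamed dataset without holding the upstream rows in memory; only the per-hash records are retained.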
## Intended use
Use this as the sampling pool when building self-distillation or
teacher-forcing reasoning datasets over OpenThoughts3. Uniform random
sampling over the raw 1.2M rows is dominated by intra-cluster duplicates,
especially in the `code` domain (about 44 rows per unique question).
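
A minimal sketch of drawing a uniform sample of distinct questions from the index, assuming `rows` is a list of index records already loaded as dicts with the schema fields. The function name and its parameters are illustrative.

```python
import random


def sample_questions(rows, k, seed=0, domain=None, require_answer=False):
    """Draw k distinct index records uniformly at random.

    domain: optionally restrict to one of "math", "code", "science".
    require_answer: if True, keep only records with a non-null gt_answer.
    """
    pool = [
        row for row in rows
        if (domain is None or row["domain"] == domain)
        and (not require_answer or row["gt_answer"] is not None)
    ]
    if k > len(pool):
        raise ValueError(f"requested {k} samples but only {len(pool)} available")
    # A seeded generator keeps the sample reproducible across runs.
    return random.Random(seed).sample(pool, k)
```

Since every record in the index is a unique question, sampling without replacement here gives each question the same probability, regardless of its `duplicate_count` in the upstream data.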
## License / Attribution
This index only stores problem statements and metadata derived from
OpenThoughts3. Please follow the upstream
[`open-thoughts/OpenThoughts3-1.2M`](https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M)
license terms.
Provider:
hyunseoki



