
hyunseoki/openthoughts3-dedup-index

Hugging Face | Updated 2026-04-17 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/hyunseoki/openthoughts3-dedup-index
Description:
---
license: odc-by
task_categories:
- text-generation
language:
- en
tags:
- reasoning
- math
- code
- science
- deduplication
- openthoughts
size_categories:
- 10K<n<100K
source_datasets:
- open-thoughts/OpenThoughts3-1.2M
pretty_name: OpenThoughts3 Dedup Index
---

# OpenThoughts3 Dedup Index

A deduplicated index over [`open-thoughts/OpenThoughts3-1.2M`](https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M). In the upstream dataset each problem statement appears about 18 times on average (the same question paired with many solver trajectories). This index keeps exactly one canonical record per unique problem, making uniform random sampling of distinct questions trivial.

## Summary

- **rows_scanned**: 1200000
- **unique_questions**: 65047
- **unique_with_gt_answer**: 45622
- **duplicate_ratio**: 18.45
- **domain_total_rows**:
  - `code`: 250000
  - `math`: 850000
  - `science`: 100000
- **domain_unique_questions**:
  - `code`: 5693
  - `math`: 53105
  - `science`: 6249
- **top_sources_by_unique**:
  - `ai2-adapt-dev/openmath-2-math`: 53105
  - `nvidia/OpenCodeReasoning`: 2007
  - `organic-chemistry-questions`: 3743
  - `stackexchange-physics`: 2506
  - `stackexchange_codegolf`: 3686

## Schema

Each row of `openthoughts3_dedup.jsonl` has the following fields:

| Field | Type | Description |
|---|---|---|
| `hash` | str | MD5 of the normalized (whitespace-collapsed, lowercased) problem text |
| `problem` | str | The problem statement (the `human` turn of the upstream `conversations`) |
| `gt_answer` | str or null | `\boxed{...}` answer extracted from any matching upstream solver response (may be null for code-style problems without a boxed target) |
| `domain` | str | Upstream `domain` field: one of `math`, `code`, `science` |
| `source` | str | Upstream `source` field (e.g. `ai2-adapt-dev/openmath-2-math`, `stackexchange-physics`, `nvidia/OpenCodeReasoning`) |
| `difficulty` | str or null | Upstream `difficulty` value if present |
| `duplicate_count` | int | How many times this question appeared across the 1.2M source rows |
| `first_row_index` | int | Index within the upstream dataset of the first occurrence (for traceability) |

## Build

Produced by `scripts/build_openthoughts_dedup_index.py` in the `memory_reasoning_split` research repo. The script streams the full 1.2M rows of the upstream dataset, MD5-hashes the normalized problem text, keeps the first-seen record per hash, updates the cached `gt_answer` if any later duplicate contained a boxed answer, and writes one JSONL row per unique question plus a summary JSON.

## Intended use

Use this as the sampling pool when building self-distillation or teacher-forcing reasoning datasets over OpenThoughts3. Uniform random sampling on the raw 1.2M file is dominated by intra-cluster duplicates, especially for the `code` split (44× duplicate ratio: 250000 rows over 5693 unique questions).

## License / Attribution

This index only stores problem statements and metadata derived from OpenThoughts3. Please follow the upstream [`open-thoughts/OpenThoughts3-1.2M`](https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M) license terms.
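The build procedure and intended use described in this card can be illustrated with a minimal sketch. It is not the actual `scripts/build_openthoughts_dedup_index.py`, which is not reproduced here. Several things are assumptions made for illustration: the input rows are already flattened to hypothetical `problem` and `response` keys (the upstream dataset stores these as `conversations` turns), the boxed-answer regex handles only non-nested braces, and a later boxed answer fills `gt_answer` only when it is still missing. The function names `normalize`, `problem_hash`, `extract_boxed`, `build_index`, and `sample_unique` are likewise hypothetical.

```python
import hashlib
import random
import re

# Assumption: a simple pattern that only matches \boxed{...} without nested braces.
_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def normalize(text: str) -> str:
    """Collapse whitespace runs and lowercase, as the `hash` field describes."""
    return " ".join(text.split()).lower()


def problem_hash(text: str) -> str:
    """MD5 hex digest of the normalized problem text."""
    return hashlib.md5(normalize(text).encode("utf-8")).hexdigest()


def extract_boxed(response: str):
    """Return the last boxed answer in a solver response, or None."""
    matches = _BOXED.findall(response)
    return matches[-1] if matches else None


def build_index(rows):
    """Keep the first-seen record per hash and count duplicates.

    `rows` is an iterable of dicts with hypothetical keys `problem`,
    `response`, `domain`, `source`, and optionally `difficulty`.
    """
    index = {}
    for i, row in enumerate(rows):
        h = problem_hash(row["problem"])
        answer = extract_boxed(row.get("response", ""))
        record = index.get(h)
        if record is None:
            index[h] = {
                "hash": h,
                "problem": row["problem"],
                "gt_answer": answer,
                "domain": row.get("domain"),
                "source": row.get("source"),
                "difficulty": row.get("difficulty"),
                "duplicate_count": 1,
                "first_row_index": i,
            }
        else:
            record["duplicate_count"] += 1
            # Assumption: only fill a missing answer; never overwrite one.
            if record["gt_answer"] is None and answer is not None:
                record["gt_answer"] = answer
    # Dicts preserve insertion order, so records come out in first-seen order.
    return list(index.values())


def sample_unique(records, k, seed=0):
    """Uniformly sample k distinct questions from the deduplicated records."""
    return random.Random(seed).sample(records, k)
```

Because every unique question contributes exactly one record, `sample_unique` draws each question with equal probability regardless of how many solver trajectories it had upstream.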
Provider:
hyunseoki