radna0/harmony-nemotron-cpu-artifacts
Source: Hugging Face. Updated 2026-01-09; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/radna0/harmony-nemotron-cpu-artifacts
Resource description:
---
language: [en]
tags: [harmony, nemotron, parquet, cpu-normalized, candidate-pools]
---
# Harmony CPU artifacts: Nemotron datasets (normalized + candidate pools)
This dataset repo is an **artifact store** produced on an EPYC CPU box. It contains:
- `normalized/` — CPU-normalized Parquet shards with a text-first Harmony format (`text`) plus `meta_*` and `quality_*` fields.
- `pools/` — candidate pool Parquet shards (subsets) for later GPU scoring (Modal NLL/PPL). **No GPU scoring has been run yet.**
- `reports/` — summary tables of counts per dataset/split/pool.
## Directory layout
- `normalized/<dataset_tag>/data/<split>/part-*.parquet`
- `normalized/<dataset_tag>/*__manifest.json`
- `normalized/<dataset_tag>/*__tools_catalog.json` (when present)
- `pools/<dataset_tag>/<pool_name>/<pool_name>__*.parquet`
- `pools/<dataset_tag>/pools_manifest.json`
Here, `<dataset_tag>` is the Hugging Face dataset name with `/` replaced by `__` (for example, `nvidia/Nemotron-Math-v2` becomes `nvidia__Nemotron-Math-v2`).
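The layout rules above can be turned into path patterns mechanically. The following is a minimal sketch, not part of the repo: the helper names `dataset_tag`, `normalized_glob`, and `pool_glob` are hypothetical and only restate the directory patterns listed in this section.

```python
def dataset_tag(hf_name: str) -> str:
    """Map a Hugging Face dataset name to its directory tag ('/' becomes '__')."""
    return hf_name.replace("/", "__")


def normalized_glob(hf_name: str, split: str) -> str:
    """Glob for the normalized Parquet shards of one split (hypothetical helper)."""
    return f"normalized/{dataset_tag(hf_name)}/data/{split}/part-*.parquet"


def pool_glob(hf_name: str, pool_name: str) -> str:
    """Glob for the Parquet shards of one candidate pool (hypothetical helper)."""
    return f"pools/{dataset_tag(hf_name)}/{pool_name}/{pool_name}__*.parquet"


# Example: patterns for the dataset used in the loading examples.
print(normalized_glob("nvidia/Nemotron-Math-v2", "high_part00"))
print(pool_glob("nvidia/Nemotron-Math-v2", "high_correctness"))
```

The resulting strings are relative to the repo root and can be passed as `data_files` when loading.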
## Loading examples
```python
from datasets import load_dataset

# Paths are relative to the repo root, so run this from a local copy of the repo.

# Load normalized shards for a split
normalized_paths = ["normalized/nvidia__Nemotron-Math-v2/data/high_part00/*.parquet"]
ds = load_dataset("parquet", data_files=normalized_paths, split="train")

# Load a candidate pool
pool_paths = ["pools/nvidia__Nemotron-Math-v2/high_correctness/*.parquet"]
pool = load_dataset("parquet", data_files=pool_paths, split="train")
```
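The example above reads from paths relative to the repo root. As an alternative, the sketch below loads shards from the Hub without cloning, by passing the repo id together with a repo-relative glob. It assumes the `datasets` library is installed and the repo is reachable. The helper name `load_from_hub` and the choice of streaming by default are illustrative assumptions, not part of the repo.

```python
REPO_ID = "radna0/harmony-nemotron-cpu-artifacts"


def load_from_hub(relative_glob: str, streaming: bool = True):
    """Load Parquet shards matching a repo-relative glob from the Hub.

    Hypothetical helper: `relative_glob` is a path pattern under the repo root,
    e.g. "pools/nvidia__Nemotron-Math-v2/high_correctness/*.parquet".
    """
    # Imported lazily so the helper can be defined without the library installed.
    from datasets import load_dataset

    # A plain string (or list) for data_files is assigned to the "train" split.
    return load_dataset(
        REPO_ID,
        data_files=relative_glob,
        split="train",
        streaming=streaming,
    )


# Usage (requires network access):
# pool = load_from_hub("pools/nvidia__Nemotron-Math-v2/high_correctness/*.parquet")
# first_row = next(iter(pool))
```

Streaming avoids downloading every shard up front, which may be useful for inspecting a few rows of a large split. Pass `streaming=False` to materialize the data locally instead.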
See `reports/cpu_full_run_summary.md` for totals.
Provider:
radna0



