jeremycochoy/contrastive-training-base-bundles
Source: Hugging Face. Updated 2026-04-18; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/jeremycochoy/contrastive-training-base-bundles
Resource description:
# base_mixed_v1
Pre-shuffled, pre-mixed training bundle for the Base tier (201.3M params)
of the contrastive forecasting model family.
- **Total rows:** 106,100,279
- **Number of shards:** 10,624 (9,985 in base_mixed_v1/, 639 in base_mixed_v1/overflow/)
- **Bundle size:** ~397 GB
## Per-source row counts
- 0 gift: 99,333,452 (93.6%)
- 1 wiki_hourly: 3,715,121 (3.5%)
- 2 wiki_daily: 1,990,244 (1.9%)
- 3 wiki_stl_residual: 530,731 (0.5%)
- 4 wiki_stl_seasonal: 371,512 (0.4%)
- 5 wiki_stl_trend: 159,219 (0.2%)
- 6 synthetic: 0 (0%)
Note: the synthetic source is absent from this build because its stage 1
directory was empty. The mix is 93.6% GIFT / 6.4% Wikimedia, all real data.
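The percentages above can be rechecked from the raw counts. The snippet below is a small sanity check, not part of the original build scripts; the variable names are illustrative.

```python
# Per-source row counts copied from the list above.
counts = {
    "gift": 99_333_452,
    "wiki_hourly": 3_715_121,
    "wiki_daily": 1_990_244,
    "wiki_stl_residual": 530_731,
    "wiki_stl_seasonal": 371_512,
    "wiki_stl_trend": 159_219,
    "synthetic": 0,
}

total = sum(counts.values())

# Share of each source, in percent, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

# Combined share of the five Wikimedia sub-sources.
wiki_rows = sum(n for name, n in counts.items() if name.startswith("wiki_"))
wiki_share = round(100 * wiki_rows / total, 1)

print(total, shares, wiki_share)
```

The counts sum to the stated total of 106,100,279 rows, and the rounded shares agree with the figures listed.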
## Schema
| Column | Type | Notes |
|---|---|---|
| series | list<float32>[1025] | Fixed-length window |
| source_id | uint8 | 0=gift, 1..5=wiki sub-sources, 6=synthetic |
| meta | string | Source-specific metadata |
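As a rough illustration of the schema, the sketch below maps source_id values to the source names listed earlier and checks that a decoded row has the expected shape. The helper name validate_row and the dict-based row representation are assumptions made for this example; the dataset itself does not ship such a helper, and the float32 element type is not checked here.

```python
# source_id -> source name, following the per-source list in this card.
SOURCE_NAMES = {
    0: "gift",
    1: "wiki_hourly",
    2: "wiki_daily",
    3: "wiki_stl_residual",
    4: "wiki_stl_seasonal",
    5: "wiki_stl_trend",
    6: "synthetic",
}

SERIES_LENGTH = 1025  # fixed window length from the schema table


def validate_row(row):
    """Return the source name if the row matches the schema, else raise ValueError.

    `row` is assumed to be a plain dict with keys "series", "source_id", "meta".
    """
    if len(row["series"]) != SERIES_LENGTH:
        raise ValueError(f"expected {SERIES_LENGTH} values, got {len(row['series'])}")
    if row["source_id"] not in SOURCE_NAMES:
        raise ValueError(f"unknown source_id {row['source_id']}")
    if not isinstance(row["meta"], str):
        raise ValueError("meta must be a string")
    return SOURCE_NAMES[row["source_id"]]
```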
## Layout
Shards are split across two directories because of the Hugging Face limit of 10,000 files per directory:
- base_mixed_v1/ (9,985 files)
- base_mixed_v1/overflow/ (639 files)
Both directories use the same schema, so glob both when loading.
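A minimal sketch of collecting shards from both directories is shown below. It assumes a local copy of the repository and assumes the shards are Parquet files matching `*.parquet`; the card does not state the file format or naming, so adjust the pattern as needed. The function name list_shards is hypothetical.

```python
from pathlib import Path


def list_shards(root, pattern="*.parquet"):
    """Return shard paths from base_mixed_v1/ followed by base_mixed_v1/overflow/.

    `pattern` is an assumption: the shard file extension is not documented here.
    """
    base = Path(root) / "base_mixed_v1"
    # Path.glob with a plain pattern is non-recursive, so the overflow
    # directory has to be globbed separately.
    main = sorted(base.glob(pattern))
    overflow = sorted((base / "overflow").glob(pattern))
    return [str(p) for p in main + overflow]
```

If the shards are indeed Parquet, the resulting list could be passed as `data_files` to `datasets.load_dataset("parquet", ...)`. Because the bundle is already globally shuffled, reading the two directories in sequence should not bias the sample order.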
## Shuffling
Globally shuffled via a 32-bucket two-pass shuffle. Every output shard is a
statistically uniform random sample of the entire input.
Generated by rnd/scripts/training_data_prep (PRs #194-#208).
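For readers unfamiliar with the technique, here is a generic in-memory sketch of a two-pass bucket shuffle. It is not the code from rnd/scripts/training_data_prep, which is not reproduced on this card; the function name and the seed parameter are illustrative. In a real pipeline, each bucket would be a file on disk, sized so that it fits in memory during the second pass.

```python
import random


def two_pass_shuffle(items, num_buckets=32, seed=0):
    """Shuffle `items` by scattering into buckets, then shuffling each bucket."""
    rng = random.Random(seed)
    buckets = [[] for _ in range(num_buckets)]

    # Pass 1: stream over the input once and send each item to a
    # uniformly random bucket.
    for item in items:
        buckets[rng.randrange(num_buckets)].append(item)

    # Pass 2: shuffle each bucket on its own and emit the buckets in order.
    shuffled = []
    for bucket in buckets:
        rng.shuffle(bucket)
        shuffled.extend(bucket)
    return shuffled
```

Only one bucket has to be held in memory at a time during the second pass, which is what makes the approach practical for inputs far larger than RAM.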
Provider:
jeremycochoy



