jeremycochoy/contrastive-training-base-bundles

Name: jeremycochoy/contrastive-training-base-bundles
Creator: jeremycochoy
Published: 2026-04-18 02:11:49
License: Not specified

Hugging Face | Updated 2026-04-18 | Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/jeremycochoy/contrastive-training-base-bundles

Description:

# base_mixed_v1

Pre-shuffled, pre-mixed training bundle for the Base tier (201.3M params) of the contrastive forecasting model family.

- **Total rows:** 106,100,279
- **Number of shards:** 10,624 (9,985 in `base_mixed_v1/`, 639 in `base_mixed_v1/overflow/`)
- **Bundle size:** ~397 GB

## Per-source row counts

- 0 gift: 99,333,452 (93.6%)
- 1 wiki_hourly: 3,715,121 (3.5%)
- 2 wiki_daily: 1,990,244 (1.9%)
- 3 wiki_stl_residual: 530,731 (0.5%)
- 4 wiki_stl_seasonal: 371,512 (0.4%)
- 5 wiki_stl_trend: 159,219 (0.2%)
- 6 synthetic: 0 (0%)

Note: synthetic is absent from this build (empty stage 1 dir). The mix is 93.6% GIFT / 6.4% Wikimedia, all real data.

## Schema

| Column | Type | Notes |
|---|---|---|
| series | list<float32>[1025] | Fixed-length window |
| source_id | uint8 | 0=gift, 1..5=wiki sub-sources, 6=synthetic |
| meta | string | Source-specific metadata |

## Layout

Shards are split across two directories because of the HF 10K-per-directory limit:

- `base_mixed_v1/` (9,985 files)
- `base_mixed_v1/overflow/` (639 files)

Both directories contain the same schema. Glob both directories when loading.

## Shuffling

Globally shuffled via a 32-bucket two-pass shuffle. Every output shard is a statistically uniform random sample of the entire input.

Generated by `rnd/scripts/training_data_prep` (PRs #194-#208).
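The Layout section asks readers to glob both shard directories. A minimal sketch of that step follows, using only the standard library. The helper name `list_shards` and the `*.parquet` pattern are assumptions: the listing does not state the shard file format or naming scheme, so adjust the pattern to match the actual files.

```python
from pathlib import Path


def list_shards(bundle_root, pattern="*.parquet"):
    """Collect shard paths from base_mixed_v1/ and base_mixed_v1/overflow/.

    `pattern` is an assumption; the dataset card does not name the
    shard file extension.
    """
    main_dir = Path(bundle_root) / "base_mixed_v1"
    overflow_dir = main_dir / "overflow"
    # Path.glob with a plain pattern is non-recursive, so the main
    # directory glob does not pick up files under overflow/.
    return sorted(main_dir.glob(pattern)) + sorted(overflow_dir.glob(pattern))
```

If the shards do turn out to be Parquet, the resulting list can be passed as strings to a loader such as `datasets.load_dataset("parquet", data_files=[str(p) for p in shards])`. Because the bundle is already globally shuffled, reading shards in any order should give a comparable mix.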
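The Shuffling section mentions a 32-bucket two-pass shuffle. The code in `rnd/scripts/training_data_prep` is not shown here, so the following is only a generic in-memory sketch of the idea, not the author's implementation. The function name, `shard_size`, and `seed` are hypothetical; a real pipeline at this scale would write each bucket to disk in pass 1 and load one bucket at a time in pass 2.

```python
import random


def two_pass_shuffle(rows, num_buckets=32, shard_size=4, seed=0):
    """Toy two-pass shuffle: scatter rows to buckets, then shuffle each bucket."""
    rng = random.Random(seed)

    # Pass 1: send every row to a uniformly random bucket.
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[rng.randrange(num_buckets)].append(row)

    # Pass 2: shuffle each bucket on its own and append it to the output.
    shuffled = []
    for bucket in buckets:
        rng.shuffle(bucket)
        shuffled.extend(bucket)

    # Cut the shuffled stream into fixed-size output shards.
    return [shuffled[i:i + shard_size] for i in range(0, len(shuffled), shard_size)]
```

Only one bucket needs to fit in memory during pass 2, which is why this pattern suits inputs much larger than RAM.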

Provider:

jeremycochoy
