
jeremycochoy/contrastive-training-base-bundles

Source: Hugging Face. Updated 2026-04-18; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/jeremycochoy/contrastive-training-base-bundles
Resource description:
# base_mixed_v1

Pre-shuffled, pre-mixed training bundle for the Base tier (201.3M params) of the contrastive forecasting model family.

- **Total rows:** 106,100,279
- **Number of shards:** 10,624 (9,985 in base_mixed_v1/, 639 in base_mixed_v1/overflow/)
- **Bundle size:** ~397 GB

## Per-source row counts

- 0 gift: 99,333,452 (93.6%)
- 1 wiki_hourly: 3,715,121 (3.5%)
- 2 wiki_daily: 1,990,244 (1.9%)
- 3 wiki_stl_residual: 530,731 (0.5%)
- 4 wiki_stl_seasonal: 371,512 (0.4%)
- 5 wiki_stl_trend: 159,219 (0.2%)
- 6 synthetic: 0 (0%)

Note: synthetic is absent from this build (empty stage 1 dir). The mix is 93.6% GIFT / 6.4% Wikimedia, all real data.

## Schema

| Column | Type | Notes |
|---|---|---|
| series | list<float32>[1025] | Fixed-length window |
| source_id | uint8 | 0=gift, 1..5=wiki sub-sources, 6=synthetic |
| meta | string | Source-specific metadata |

## Layout

Shards are split across two directories due to the HF 10K-per-directory limit:

- through (9,985 files)
- through (639 files)

Both directories contain the same schema. Glob both when loading:

## Shuffling

Globally shuffled via a 32-bucket two-pass shuffle. Every output shard is a statistically uniform random sample of the entire input.

Generated by rnd/scripts/training_data_prep (PRs #194-#208).
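The Layout section asks readers to glob both directories when loading. Below is a minimal sketch of that step, assuming the bundle has already been downloaded to a local directory. The description does not state the shard file format or naming scheme, so the `*.parquet` pattern is an assumption, and `list_shards` and `SOURCE_NAMES` are hypothetical names introduced here for illustration. The ID-to-name mapping is copied from the per-source row counts.

```python
from pathlib import Path

# source_id -> source name, as listed under "Per-source row counts".
SOURCE_NAMES = {
    0: "gift",
    1: "wiki_hourly",
    2: "wiki_daily",
    3: "wiki_stl_residual",
    4: "wiki_stl_seasonal",
    5: "wiki_stl_trend",
    6: "synthetic",  # empty in this build
}


def list_shards(root, pattern="*.parquet"):
    """Collect shard paths from base_mixed_v1/ and base_mixed_v1/overflow/.

    `pattern` is an assumption: adjust it to the real shard extension.
    Path.glob with a plain pattern is non-recursive, so the main
    directory and the overflow directory are globbed separately.
    """
    main_dir = Path(root) / "base_mixed_v1"
    overflow_dir = main_dir / "overflow"
    return sorted(main_dir.glob(pattern)) + sorted(overflow_dir.glob(pattern))
```

If the shards do turn out to be Parquet files, the resulting paths could be passed to a Parquet reader of choice. For the full bundle, the list should contain 10,624 entries (9,985 + 639).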
Provider:
jeremycochoy