
ServiceNow/CAF_7M

Hugging Face · Updated 2026-03-17 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/ServiceNow/CAF_7M
Description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: dataset_name
    dtype: string
  - name: series_idx
    dtype: int64
  - name: target_column
    dtype: string
  - name: start_idx
    dtype: int64
  - name: freq
    dtype: string
  - name: context
    dtype: string
  - name: past_timestamp
    list: string
  - name: future_timestamp
    list: string
  - name: difficulty
    dtype: string
  splits:
  - name: train
    num_bytes: 64149891856
    num_examples: 7433239
  - name: test
    num_bytes: 3453784
    num_examples: 904
  download_size: 10630403192
  dataset_size: 64153345640
license: apache-2.0
task_categories:
- time-series-forecasting
language:
- en
size_categories:
- 1M<n<10M
citation: |
  @misc{zheng2026overcomingmodalitygapcontextaided,
    title={Overcoming the Modality Gap in Context-Aided Forecasting},
    author={Vincent Zhihao Zheng and Étienne Marcotte and Arjun Ashok and Andrew Robert Williams and Lijun Sun and Alexandre Drouin and Valentina Zantedeschi},
    year={2026},
    eprint={2603.12451},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2603.12451},
  }
---

CAF-7M is a semi-synthetic dataset to train or to test Context-Aided Forecasting models. This dataset was introduced in [Overcoming the Modality Gap in Context-Aided Forecasting](https://arxiv.org/abs/2603.12451).

CAF-7M was created by augmenting time series data from [autogluon/chronos_datasets](https://huggingface.co/datasets/autogluon/chronos_datasets) with informative contexts generated using a quantized version of [Llama3.3-70B-Instruct](https://huggingface.co/RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16).

The dataset contains 7,433,239 training samples and 904 testing samples. The testing samples have been further filtered to only keep those where GPT 5.2 creates a more accurate stochastic forecast (as measured by the CRPS) with the generated context than without it.
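To make the test-set filter concrete, here is a minimal sketch of a sample-based CRPS comparison. It is an illustration only: the card does not state which CRPS estimator was used or how it was aggregated over the forecast window, so the energy-form estimator, the per-step averaging, and the names `crps_from_samples`, `mean_crps`, and `context_helps` are all assumptions.

```python
from statistics import fmean


def crps_from_samples(samples, observed):
    """Energy-form CRPS estimate for one time step: E|X - y| - 0.5 * E|X - X'|."""
    n = len(samples)
    # Mean absolute error between the forecast samples and the observation.
    accuracy = fmean(abs(s - observed) for s in samples)
    # Mean absolute difference over all pairs of forecast samples.
    spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return accuracy - 0.5 * spread


def mean_crps(sample_paths, future_target):
    """Average the per-step CRPS over a forecast window (assumed aggregation).

    sample_paths: list of sampled trajectories, each as long as future_target.
    """
    per_step = [
        crps_from_samples([path[t] for path in sample_paths], y)
        for t, y in enumerate(future_target)
    ]
    return fmean(per_step)


def context_helps(paths_with_context, paths_without_context, future_target):
    """Hypothetical filter: keep a window only if context lowers the CRPS."""
    with_ctx = mean_crps(paths_with_context, future_target)
    without_ctx = mean_crps(paths_without_context, future_target)
    return with_ctx < without_ctx
```

A lower CRPS means a better probabilistic forecast, so under these assumptions a test window would be kept when `context_helps(...)` returns `True`.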
Due to uncertainty in licenses, we do not redistribute the original time series data with this dataset. To recreate only the test portion of this dataset with both the time series windows and the associated context, you can run the following script:

```python
from datasets import load_dataset, Sequence, Value

ds = load_dataset("ServiceNow/CAF_7M", split="test")

# Pre-load all required source datasets
source_cache = {}
for name in ds.unique("dataset_name"):
    if name not in source_cache:
        source_cache[name] = load_dataset("autogluon/chronos_datasets", name, split="train")

# source_cache is a global, inherited by child processes via fork (Linux)
def add_targets(entry):
    series = source_cache[entry["dataset_name"]][entry["series_idx"]]
    start = entry["start_idx"]
    past_len = len(entry["past_timestamp"])
    future_len = len(entry["future_timestamp"])
    values = series[entry["target_column"]]
    return {
        "past_target": values[start : start + past_len],
        "future_target": values[start + past_len : start + past_len + future_len],
    }

new_features = ds.features.copy()
new_features["past_target"] = Sequence(Value("float64"))
new_features["future_target"] = Sequence(Value("float64"))
ds = ds.map(add_targets, num_proc=16, features=new_features)
ds.save_to_disk("CAF_7M_test_with_TS")
```

Recreating the full dataset is more time-consuming (48 hours on 16 CPUs in our test).
Here is the script to do so:

```python
from datasets import load_dataset, Sequence, Value

ds = load_dataset("ServiceNow/CAF_7M")

# Pre-load all required source datasets
source_cache = {}
for split in ds.keys():
    for name in ds[split].unique("dataset_name"):
        if name not in source_cache:
            source_cache[name] = load_dataset("autogluon/chronos_datasets", name, split="train")

# source_cache is a global, inherited by child processes via fork (Linux)
def add_targets(entry):
    series = source_cache[entry["dataset_name"]][entry["series_idx"]]
    start = entry["start_idx"]
    past_len = len(entry["past_timestamp"])
    future_len = len(entry["future_timestamp"])
    values = series[entry["target_column"]]
    return {
        "past_target": values[start : start + past_len],
        "future_target": values[start + past_len : start + past_len + future_len],
    }

for split in ds.keys():
    new_features = ds[split].features.copy()
    new_features["past_target"] = Sequence(Value("float64"))
    new_features["future_target"] = Sequence(Value("float64"))
    ds[split] = ds[split].map(add_targets, num_proc=16, features=new_features)

ds.save_to_disk("CAF_7M_with_TS")
```

The dataset contains the following features:

* `dataset_name`: The name of the dataset in [autogluon/chronos_datasets](https://huggingface.co/datasets/autogluon/chronos_datasets) used to generate the entry.
* `series_idx`: Which series in the dataset the window is taken from.
* `target_column`: For multivariate series, which dimension the window is taken from.
* `start_idx`: The position in the original time series of the first timestep of the window.
* `freq`: One of (`15T`, `30T`, `D`, `H`, `M`, `W-SUN`), indicating the frequency of the time series.
* `context`: The synthetically generated context, giving useful information to a forecasting model.
* `past_timestamp`: A list of timestamps for the historical portion of the window.
* `future_timestamp`: A list of timestamps for the forecast portion of the window.
* `difficulty`: Only set for the testing samples. Indicates whether the window is hard to forecast without context. If `HARD`, Chronos gets a MASE of more than 1.5 for this window. If `EASY`, Chronos gets a MASE of less than 1.5 for it.
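The `difficulty` label can be illustrated with a minimal sketch of a MASE computation followed by the 1.5 threshold. This is an assumption-laden example, not the authors' code: the card does not say which seasonality was used for the MASE scale (the default `season=1` here is a guess), it does not say how a value of exactly 1.5 is handled (treated as `EASY` below), and the helpers `mase` and `difficulty_label` are hypothetical names.

```python
def mase(past_target, future_target, forecast, season=1):
    """Mean Absolute Scaled Error of a point forecast.

    The scale is the in-sample mean absolute error of a seasonal naive
    forecast on the history; season=1 is an assumption, not from the card.
    """
    naive_errors = [
        abs(past_target[i] - past_target[i - season])
        for i in range(season, len(past_target))
    ]
    if not naive_errors:
        raise ValueError("history is too short for the chosen season")
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("history is constant; MASE scale is zero")
    mae = sum(abs(f - y) for f, y in zip(forecast, future_target)) / len(future_target)
    return mae / scale


def difficulty_label(mase_value, threshold=1.5):
    """Map a no-context MASE to the label scheme described above."""
    # The card leaves a MASE of exactly 1.5 unspecified; this sketch calls it EASY.
    return "HARD" if mase_value > threshold else "EASY"
```

For example, with a history of `[1, 2, 3, 4]` the scale is 1, so a forecast that misses the future values by 3 on average gets a MASE of 3 and would be labeled `HARD` under these assumptions.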
Provider:
ServiceNow