chronos_datasets
Source: ModelScope community (updated 2025-12-26, indexed 2025-08-02)
Download link: https://modelscope.cn/datasets/autogluon/chronos_datasets
Description:
# Chronos datasets
Time series datasets used for training and evaluation of the [Chronos](https://github.com/amazon-science/chronos-forecasting) forecasting models.
Note that some Chronos datasets (`ETTh`, `ETTm`, `brazilian_cities_temperature` and `spanish_energy_and_weather`) that rely on a custom builder script are available in the companion repo [`autogluon/chronos_datasets_extra`](https://huggingface.co/datasets/autogluon/chronos_datasets_extra).
See the [paper](https://arxiv.org/abs/2403.07815) for more information.
## Data format and usage
The recommended way to use these datasets is via https://github.com/autogluon/fev.
All datasets satisfy the following high-level schema:
- Each dataset row corresponds to a single (univariate or multivariate) time series.
- There exists one column with name `id` and type `string` that contains the unique identifier of each time series.
- There exists one column of type `Sequence` with dtype `timestamp[ms]`. This column contains the timestamps of the observations. Timestamps are guaranteed to have a regular frequency that can be obtained with [`pandas.infer_freq`](https://pandas.pydata.org/docs/reference/api/pandas.infer_freq.html).
- There exists at least one column of type `Sequence` with numeric (`float`, `double`, or `int`) dtype. These columns can be interpreted as target time series.
- For each row, all columns of type `Sequence` have same length.
- Remaining columns of types other than `Sequence` (e.g., `string` or `float`) can be interpreted as static covariates.
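The schema above can be checked row by row. The following is a minimal sketch, not part of the original dataset tooling: `check_row` is a hypothetical helper, and it assumes rows shaped like those returned after `ds.set_format("numpy")` (sequence columns as 1-D numpy arrays, static columns as scalars). The sample row is built by hand so the example runs without downloading anything.

```python
import numpy as np
import pandas as pd


def check_row(row: dict) -> str:
    """Check one row against the schema above and return its inferred frequency."""
    # The identifier must be a string.
    assert isinstance(row["id"], str), "`id` must be a string"

    # Sequence columns are the 1-D array-valued entries; the rest are static covariates.
    sequences = {k: v for k, v in row.items() if isinstance(v, np.ndarray) and v.ndim == 1}

    # Exactly one sequence column holds timestamps.
    time_cols = [k for k, v in sequences.items() if np.issubdtype(v.dtype, np.datetime64)]
    assert len(time_cols) == 1, "expected exactly one timestamp column"

    # At least one sequence column is numeric (a candidate target).
    targets = [k for k, v in sequences.items() if np.issubdtype(v.dtype, np.number)]
    assert targets, "expected at least one numeric sequence column"

    # All sequence columns in a row share the same length.
    assert len({len(v) for v in sequences.values()}) == 1, "sequence lengths differ"

    # A regular frequency means pandas can infer it (needs at least 3 timestamps).
    stamps = row[time_cols[0]].astype("datetime64[ns]")
    freq = pd.infer_freq(pd.DatetimeIndex(stamps))
    assert freq is not None, "timestamps are not regularly spaced"
    return freq


# Hypothetical daily series, mimicking the layout of an `m4_daily` row.
row = {
    "id": "T000000",
    "timestamp": np.datetime64("1994-03-01T12:00", "ms") + np.arange(10) * np.timedelta64(1, "D"),
    "target": np.linspace(1017.0, 1020.0, 10, dtype=np.float32),
    "category": "Macro",
}
print(check_row(row))
```

Detecting sequence columns by array type is a shortcut that only holds under the numpy format; inspecting `ds.features` is the more direct route when a loaded dataset is at hand.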
Datasets can be loaded using the 🤗 [`datasets`](https://huggingface.co/docs/datasets/en/index) library:
```python
import datasets
ds = datasets.load_dataset("autogluon/chronos_datasets", "m4_daily", split="train")
ds.set_format("numpy") # sequences returned as numpy arrays
```
> **NOTE:** The `train` split of all datasets contains the full time series and has no relation to the train/test split used in the Chronos paper.
Example entry in the `m4_daily` dataset:
```python
>>> ds[0]
{'id': 'T000000',
 'timestamp': array(['1994-03-01T12:00:00.000', '1994-03-02T12:00:00.000',
        '1994-03-03T12:00:00.000', ..., '1996-12-12T12:00:00.000',
        '1996-12-13T12:00:00.000', '1996-12-14T12:00:00.000'],
       dtype='datetime64[ms]'),
 'target': array([1017.1, 1019.3, 1017. , ..., 2071.4, 2083.8, 2080.6], dtype=float32),
 'category': 'Macro'}
```
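Because the `train` split holds each full series (see the note above), any hold-out used for evaluation has to be carved out by the user. The following is a minimal sketch that holds out the last `horizon` observations of one row. `split_last` and `horizon` are hypothetical names, the column names follow the `m4_daily` example, and this is not the train/test split used in the Chronos paper.

```python
def split_last(row: dict, horizon: int, sequence_columns=("timestamp", "target")):
    """Split one row into (past, future), holding out the last `horizon` steps.

    Static columns such as `id` and `category` are copied to both parts.
    """
    length = len(row[sequence_columns[0]])
    if not 0 < horizon < length:
        raise ValueError("horizon must be between 1 and series length - 1")
    past, future = dict(row), dict(row)
    for col in sequence_columns:
        past[col] = row[col][:-horizon]
        future[col] = row[col][-horizon:]
    return past, future


# Toy row with plain lists (integers stand in for timestamps);
# numpy arrays slice the same way.
row = {
    "id": "T000000",
    "timestamp": list(range(10)),
    "target": [float(i) for i in range(10)],
    "category": "Macro",
}
past, future = split_last(row, horizon=3)
print(len(past["target"]), len(future["target"]))
```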
## Changelog
- **v1.3.0 (2025-03-05)**: Fix incorrect timestamp frequency for `monash_hospital`
- **v1.2.0 (2025-01-03)**: Fix incorrect timestamp frequency for `dominick`
- **v1.1.0 (2024-11-14)**: Fix irregular timestamp frequency for `m4_quarterly`
- **v1.0.0 (2024-07-24)**: Initial release
### Converting to pandas
Data in this format can be converted to a long-format data frame with one row per observation:
```python
import pandas as pd


def to_pandas(ds: datasets.Dataset) -> pd.DataFrame:
    """Convert dataset to long data frame format."""
    sequence_columns = [col for col in ds.features if isinstance(ds.features[col], datasets.Sequence)]
    # ignore_index=True gives a fresh 0..n-1 index, as in the example output.
    return ds.to_pandas().explode(sequence_columns, ignore_index=True).infer_objects()
```
Example output:
```python
>>> print(to_pandas(ds).head())
        id           timestamp  target category
0  T000000 1994-03-01 12:00:00  1017.1    Macro
1  T000000 1994-03-02 12:00:00  1019.3    Macro
2  T000000 1994-03-03 12:00:00  1017.0    Macro
3  T000000 1994-03-04 12:00:00  1019.2    Macro
4  T000000 1994-03-05 12:00:00  1018.7    Macro
```
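To see what the conversion does without downloading a dataset, the same `explode` step can be tried on a small hand-built frame. This is an illustrative sketch: the `wide` frame is a made-up stand-in for `ds.to_pandas()`, and exploding several columns at once assumes pandas 1.3 or newer.

```python
import pandas as pd

# Stand-in for `ds.to_pandas()`: one row per series, list-valued sequence columns.
wide = pd.DataFrame(
    {
        "id": ["A", "B"],
        "timestamp": [
            list(pd.date_range("2024-01-01", periods=3, freq="D")),
            list(pd.date_range("2024-01-01", periods=2, freq="D")),
        ],
        "target": [[1.0, 2.0, 3.0], [10.0, 20.0]],
        "category": ["Macro", "Micro"],
    }
)

# Exploding both sequence columns together yields one row per observation.
# Static columns are repeated, and ignore_index=True assigns a fresh 0..n-1 index.
long = wide.explode(["timestamp", "target"], ignore_index=True).infer_objects()
print(long)
```

Series of different lengths are handled naturally, since each row is expanded by its own sequence length.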
### Dealing with large datasets
Some datasets, such as subsets of WeatherBench, are extremely large (~100GB). To work with them efficiently, we recommend either loading them from disk (files are downloaded to disk but are not all loaded into memory):
```python
ds = datasets.load_dataset("autogluon/chronos_datasets", "weatherbench_daily", keep_in_memory=False, split="train")
```
or, for the largest datasets like `weatherbench_hourly_temperature`, reading them in streaming format (chunks are downloaded one at a time):
```python
ds = datasets.load_dataset("autogluon/chronos_datasets", "weatherbench_hourly_temperature", streaming=True, split="train")
```
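A streaming dataset is consumed by iterating over it, so it pays to stop early when only a preview is needed. The following is a minimal sketch: `summarize_head` and `fake_stream` are hypothetical names, and `target_column="target"` is an assumption, since the actual column names depend on the config.

```python
from itertools import islice


def summarize_head(rows, n=3, target_column="target"):
    """Return (id, number of observations) for the first `n` rows of an iterable.

    Works with any iterable of dict-like rows, so at most `n` rows are
    pulled from the underlying stream.
    """
    return [(row["id"], len(row[target_column])) for row in islice(rows, n)]


# Stand-in generator so the sketch runs offline; with a streaming dataset
# one would pass the dataset object itself as `rows`.
def fake_stream():
    for i in range(1_000_000):
        yield {"id": f"series_{i}", "target": [0.0] * (i + 1)}


print(summarize_head(fake_stream(), n=3))
```

Using `islice` keeps the helper independent of the `datasets` API, so the same function works on lists, generators, and streamed datasets alike.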
## Chronos training corpus with TSMixup & KernelSynth
The corpus used to train the Chronos models can be loaded via the configs `training_corpus_tsmixup_10m` (10M TSMixup augmentations of real-world data) and `training_corpus_kernel_synth_1m` (1M synthetic time series generated with KernelSynth), e.g.,
```python
ds = datasets.load_dataset("autogluon/chronos_datasets", "training_corpus_tsmixup_10m", streaming=True, split="train")
```
Note that since data in the training corpus was obtained by combining various synthetic & real-world time series, the timestamps contain dummy values that have no connection to the original data.
## License
Different datasets available in this collection are distributed under different open source licenses. Please see `ds.info.license` and `ds.info.homepage` for each individual dataset.
## Citation
If you find these datasets useful for your research, please consider citing the associated paper:
```bibtex
@article{ansari2024chronos,
  author  = {Ansari, Abdul Fatir and Stella, Lorenzo and Turkmen, Caner and Zhang, Xiyuan and Mercado, Pedro and Shen, Huibin and Shchur, Oleksandr and Rangapuram, Syama Sundar and Pineda Arango, Sebastian and Kapoor, Shubham and Zschiegner, Jasper and Maddix, Danielle C. and Wang, Hao and Mahoney, Michael W. and Torkkola, Kari and Gordon Wilson, Andrew and Bohlke-Schneider, Michael and Wang, Yuyang},
  title   = {Chronos: Learning the Language of Time Series},
  journal = {arXiv preprint arXiv:2403.07815},
  year    = {2024}
}
```
Provider: maas
Created: 2025-07-30



