hq-bench/quito-corpus
Hugging Face · Updated 2026-03-30 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/hq-bench/quito-corpus
Description:
---
license: cc-by-4.0
task_categories:
- time-series-forecasting
language:
- en
tags:
- time-series
- forecasting
- application-traffic
- cloud-computing
- training-data
- single-provenance
pretty_name: "Quito: Billion-Scale Time Series Corpus"
size_categories:
- 1B<n<10B
configs:
  - config_name: hour
    data_files:
      - split: train
        path: v20260315/pretrain_hour-00001-of-00001.parquet
    description: >
      Hourly training corpus (1-hour granularity, Quito-Hour). 12,544 series, each with 15,356 time steps
      spanning 2021-11-18 to 2023-08-19. Total: 1.0B tokens.
  - config_name: min
    data_files:
      - split: train
        path: v20260315/pretrain_min-00001-of-00001.parquet
    description: >
      10-minute training corpus (10-min granularity, Quito-Min). 22,522 series, each with 5,904 time steps
      spanning 2023-07-10 to 2023-08-19. Total: 0.7B tokens.
---
# Quito
**Quito** is a billion-scale, single-provenance time series dataset of application-traffic workloads
collected from Alipay's production platform, spanning nine business verticals from finance and
e-commerce to infrastructure and IoT.
> 🌐 **Project Page:** [hq-bench.github.io/quito](https://hq-bench.github.io/quito/)
> 📄 **Paper:** [arXiv:2603.26017](https://arxiv.org/abs/2603.26017)
> 💻 **Code:** [github.com/alipay/quito](https://github.com/alipay/quito)
> 📊 **Benchmark Set:** [hq-bench/quitobench](https://huggingface.co/datasets/hq-bench/quitobench)
---
## Dataset Overview
| | `hour` config | `min` config |
|---|---|---|
| Granularity | 1 hour | 10 minutes |
| # Series | 12,544 | 22,522 |
| Series length | 15,356 steps | 5,904 steps |
| Date range | 2021-11-18 → 2023-08-19 | 2023-07-10 → 2023-08-19 |
| # Variates / series | 5 | 5 |
| Total tokens | 1.0 Billion | 0.7 Billion |
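
The token totals in the table are consistent with counting one token per scalar observation, that is, series × time steps × variates. That definition is an assumption made here for illustration; the card itself does not spell out how tokens are counted. A minimal sketch of the arithmetic:

```python
# Assumption: one "token" is one scalar observation
# (number of series x time steps per series x variates per series).
subsets = {
    "hour": {"n_series": 12_544, "n_steps": 15_356, "n_variates": 5},
    "min": {"n_series": 22_522, "n_steps": 5_904, "n_variates": 5},
}


def token_count(n_series: int, n_steps: int, n_variates: int) -> int:
    """Number of scalar observations in a fully populated subset."""
    return n_series * n_steps * n_variates


totals = {name: token_count(**shape) for name, shape in subsets.items()}
for name, total in totals.items():
    print(f"{name}: {total:,} values (~{total / 1e9:.1f}B)")
```

Under this assumption the products round to the 1.0 Billion and 0.7 Billion figures listed above.
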
The two subsets are drawn from **disjoint pools** of applications (no overlap in `item_id`s).
The differing start dates reflect the production system's tiered retention policy: hourly aggregates
are archived long-term, while 10-minute telemetry is retained for a shorter rolling window.
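
The disjointness claim can be checked after loading both configs by intersecting their `item_id` sets. The sketch below is one possible way to do that; `shared_item_ids` is a hypothetical helper name, and the toy frames are synthetic stand-ins rather than Quito data.

```python
import pandas as pd


def shared_item_ids(df_a: pd.DataFrame, df_b: pd.DataFrame) -> set:
    """Return the item_id values present in both frames (empty if disjoint)."""
    return set(df_a["item_id"].tolist()) & set(df_b["item_id"].tolist())


# Tiny synthetic stand-ins for the hour and min subsets (not real data).
toy_hour = pd.DataFrame({"item_id": [1, 1, 2]})
toy_min = pd.DataFrame({"item_id": [3, 4, 4]})

overlap = shared_item_ids(toy_hour, toy_min)
print(f"shared item_ids: {sorted(overlap)}")
```

With the real corpus, passing the two loaded DataFrames in place of the toy frames should yield an empty set if the subsets are disjoint as described.
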
---
## Schema
Each row represents one timestamp of one series (long/tidy format).
| Column | Type | Description |
|---|---|---|
| `item_id` | int64 | Unique series identifier |
| `date_time` | datetime64[ns] | UTC timestamp |
| `ind_1` … `ind_5` | float64 | Five anonymised traffic variates (NaN for missing) |
To reconstruct a single multivariate series: filter by `item_id` and sort by `date_time`.
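
The filter-and-sort step can be wrapped in a small helper. This is a minimal sketch on a synthetic long-format frame; `get_series` and `VARIATE_COLUMNS` are hypothetical names introduced for illustration, and only the column names come from the schema above.

```python
import numpy as np
import pandas as pd

# Column names from the schema: ind_1 ... ind_5
VARIATE_COLUMNS = [f"ind_{i}" for i in range(1, 6)]


def get_series(df: pd.DataFrame, item_id: int) -> pd.DataFrame:
    """Return one multivariate series indexed by time, oldest row first."""
    series = df.loc[df["item_id"] == item_id].sort_values("date_time")
    return series.set_index("date_time")[VARIATE_COLUMNS]


# Synthetic frame with two interleaved, unsorted series (not real data).
timestamps = pd.date_range("2023-07-10", periods=3, freq="10min")
toy = pd.DataFrame({
    "item_id": [7, 8, 7, 8, 7, 8],
    "date_time": [timestamps[2], timestamps[0], timestamps[0],
                  timestamps[1], timestamps[1], timestamps[2]],
    **{col: np.arange(6, dtype="float64") for col in VARIATE_COLUMNS},
})

series_7 = get_series(toy, 7)
values = series_7.to_numpy()  # array of shape (time steps, 5)
print(values.shape)
```

Indexing by `date_time` keeps the timestamps attached to the values, while `to_numpy()` gives a plain array when a model expects one.
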
---
## Quick Start
```python
from datasets import load_dataset
# Load hourly training corpus
ds_hour = load_dataset("hq-bench/quito-corpus", "hour")
df_hour = ds_hour["train"].to_pandas()
# Load 10-minute training corpus
ds_min = load_dataset("hq-bench/quito-corpus", "min")
df_min = ds_min["train"].to_pandas()
```
### Iterate over individual series
```python
# Iterate over the hourly corpus; use df_min for the 10-minute corpus
for item_id, series_df in df_hour.groupby("item_id"):
    series_df = series_df.sort_values("date_time")
    # series_df has columns: item_id, date_time, ind_1 … ind_5
    break  # remove to iterate all series
```
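
Because missing observations are stored as NaN, it can be useful to measure how much of each series is missing before training. The following is a sketch on synthetic data; `missing_fraction` is a hypothetical helper, not part of the Quito codebase.

```python
import numpy as np
import pandas as pd

VARIATE_COLUMNS = [f"ind_{i}" for i in range(1, 6)]


def missing_fraction(df: pd.DataFrame) -> pd.DataFrame:
    """Fraction of NaN values per series (rows) and per variate (columns)."""
    return df[VARIATE_COLUMNS].isna().groupby(df["item_id"]).mean()


# Synthetic frame: two series of two steps each (not real data).
toy = pd.DataFrame({
    "item_id": [1, 1, 2, 2],
    **{col: [1.0, 2.0, 3.0, 4.0] for col in VARIATE_COLUMNS},
})
toy.loc[0, "ind_1"] = np.nan  # mark one observation of series 1 as missing

report = missing_fraction(toy)
print(report)
```

The result has one row per `item_id`, so it can be used to drop or flag series whose missing share exceeds a threshold of your choosing.
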
---
## License
[CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
## Citation
```bibtex
@article{xue2026quitobench,
title = {{QuitoBench}: A High-Quality Open Time Series Forecasting Benchmark},
author = {Xue, Siqiao and Zhu, Zhaoyang and Zhang, Wei and
Cai, Rongyao and Wang, Rui and
Mu, Yixiang and Zhou, Fan and Li, Jianguo and Di, Peng and Yu, Hang},
journal = {arXiv preprint arXiv:2603.26017},
year = {2026},
url = {https://arxiv.org/abs/2603.26017}
}
```
Provider:
hq-bench



