rajaykumar12959/synthetic-cache-workloads

Source: Hugging Face · Updated 2025-12-07 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/rajaykumar12959/synthetic-cache-workloads
Description:
---
license: apache-2.0
task_categories:
- time-series-forecasting
- reinforcement-learning
- tabular-regression
tags:
- cache-replacement
- systems-ml
- memory-management
- locality
- lru
- lfu
pretty_name: Synthetic Cache Workloads
size_categories:
- 100K<n<1M
configs:
- config_name: random
  data_files: "random_trace.csv"
- config_name: zipf
  data_files: "zipf_trace.csv"
- config_name: web_traffic
  data_files: "web_traffic_trace.csv"
---

# Synthetic Cache Workloads (Oracle Traces)

## 1. Dataset Overview

This dataset contains synthetic memory access traces generated to facilitate research in **Machine Learning for Systems (SysML)**, specifically for **Cache Replacement Policies**.

Unlike standard raw trace logs (which only contain a list of accessed addresses), this dataset is **pre-processed with Feature Engineering and Oracle Labels**. It is designed to train Supervised Learning models or Reinforcement Learning agents to predict which cache blocks are "dead" and should be evicted.

### The "Oracle" Feature

The standout feature of this dataset is the `time_to_next_request` column.

* **What it is:** The calculated number of time steps until a specific memory block will be accessed again.
* **Why it matters:** This is the quantity that **Belady’s Optimal Algorithm (OPT)** uses to choose an eviction victim. OPT is the theoretical upper bound of cache performance.
* **Usage:** By using this column as a **training target (label)**, you can train ML models to imitate the optimal policy (Imitation Learning), effectively teaching the model to "predict the future."

---

## 2. Configurations (Workloads)

The dataset is split into three distinct configurations, representing different difficulty levels and locality patterns. You must specify the configuration name when loading the dataset.

| Configuration | Rows | Pattern Description | Key Challenge |
| :--- | :--- | :--- | :--- |
| **`random`** | ~18,240 | **Uniform Random Distribution**. Accesses are completely independent of past history. | **Baseline Check:** Sophisticated algorithms (LRU, LFU, ML) should perform no better than random eviction here. If your model finds a pattern, it is overfitting. |
| **`zipf`** | 100,000 | **Zipfian / Power-Law Distribution**. A small set of "hot" items accounts for the vast majority of accesses. | **Frequency Bias:** Tests the ability to identify globally popular items. The "hot" items are static. Ideal for **LFU** (Least Frequently Used) benchmarks. |
| **`web_traffic`** | 100,000 | **Bursty Temporal Locality**. Mimics web server traffic where items are accessed in bursts and popularity shifts over time (Concept Drift). | **Recency Bias:** Tests the ability to adapt to changing trends. Ideal for **LRU** (Least Recently Used) benchmarks. |

---

## 3. Data Dictionary

Each row in the dataset represents a single memory access request.

| Column Name | Data Type | Role | Description |
| :--- | :--- | :--- | :--- |
| `timestamp` | `int64` | Feature | The logical clock tick of the access request. |
| `item_id` | `int64` | Feature | The unique ID of the memory page/block being requested. |
| `time_since_last_access` | `int64` | **Input Feature** | **Recency:** How many time steps have passed since this item was last seen. High values = likely to be evicted by LRU. |
| `frequency_count` | `int64` | **Input Feature** | **Frequency:** The total number of times this item has been accessed so far. Low values = likely to be evicted by LFU. |
| `time_to_next_request` | `int64` | **Target (Label)** | **The Oracle:** The number of time steps until this item is needed again. <br>• **Low value:** Keep in cache (needed soon). <br>• **High value:** Evict (not needed for a long time). |

> **⚠️ CRITICAL WARNING:** The `time_to_next_request` column is strictly for **training** (calculating loss) or **offline evaluation**. You **must not** provide this feature to your model during inference, as it contains future information that a real CPU cache does not have.

---

## 4. How to Use

### Loading a Specific Workload

Since the dataset uses configurations, you must specify which subset you want to load.

```python
from datasets import load_dataset

# Load the "web_traffic" workload (best for testing LRU-like behavior)
dataset = load_dataset("rajaykumar12959/synthetic-cache-workloads", "web_traffic")

# Access the training data
data = dataset['train']

# Print the first row
print(data[0])
# Output example:
# {'timestamp': 1, 'item_id': 502, 'time_since_last_access': 0, 'frequency_count': 1, 'time_to_next_request': 12}
```

---

## 5. Dataset Creation and Methodology

### Generation Process

The traces were generated using synthetic probabilistic models to simulate distinct memory access patterns. The generation process involved two stages: **Trace Generation** and **Feature Engineering**.

1. **Trace Generation:**
    * **Random Workload:** Generated using a Uniform Distribution, where every memory block has an equal probability of being accessed ($P(x) = 1/N$).
    * **Zipfian Workload:** Generated using a Zipfian (Power-Law) distribution with a skew parameter ($a > 1$). This simulates environments where a minority of items receive the majority of traffic.
    * **Web Traffic Workload:** Generated to simulate Temporal Locality. This workload utilizes a probability distribution that favors recently accessed items, introducing "bursts" of activity and concept drift over time.
2. **Feature Engineering (The Oracle):** After generating the raw sequence of memory accesses, a post-processing script iterated through the trace to calculate the **Oracle Label** (`time_to_next_request`).
    * For every access at time $t$, the script scanned the future accesses ($t+1, t+2, ...$) to find the next occurrence of the same `item_id`.
    * The distance to this next occurrence represents the optimal eviction priority according to Belady’s algorithm.
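
The oracle step can be illustrated with a minimal sketch. This is not the author's post-processing script: instead of the forward scan described above, it walks the trace backwards once, which should yield the same labels in linear time. It assumes that list position stands in for the logical timestamp. The names `NO_REUSE`, `time_to_next_request()`, and `belady_hits()` are hypothetical, and the sentinel value is an assumption, since the actual "infinity" value used in the dataset is not stated.

```python
from typing import Dict, List

# Hypothetical stand-in for "infinity"; the dataset's actual sentinel is not documented here.
NO_REUSE = 10**9


def time_to_next_request(item_ids: List[int], sentinel: int = NO_REUSE) -> List[int]:
    """For each access, return the number of steps until the same item recurs."""
    next_seen: Dict[int, int] = {}  # item_id -> index of its nearest later access
    labels = [sentinel] * len(item_ids)
    # Walk backwards so the nearest future occurrence is already known.
    for t in range(len(item_ids) - 1, -1, -1):
        item = item_ids[t]
        if item in next_seen:
            labels[t] = next_seen[item] - t
        next_seen[item] = t
    return labels


def belady_hits(item_ids: List[int], labels: List[int], capacity: int) -> int:
    """Count cache hits when the resident item with the farthest next use is evicted."""
    next_use: Dict[int, int] = {}  # resident item_id -> absolute time of its next access
    hits = 0
    for t, (item, gap) in enumerate(zip(item_ids, labels)):
        if item in next_use:
            hits += 1
        elif len(next_use) >= capacity:
            # Evict the resident item that is needed furthest in the future.
            victim = max(next_use, key=next_use.get)
            del next_use[victim]
        next_use[item] = t + gap
    return hits


trace = [7, 3, 7, 5, 3, 7]
labels = time_to_next_request(trace)
hits = belady_hits(trace, labels, capacity=2)
print(labels)
print(hits)
```

On this toy trace the first three accesses receive gaps of 2, 3, and 3, the final occurrence of each item receives the sentinel, and a two-slot cache driven by the labels scores 2 hits. A learned policy can be compared against that figure in the same way.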
### Preprocessing

* **Cold Start Handling:** Initial accesses to items (where no history exists) have `time_since_last_access` and `frequency_count` initialized to 0.
* **Finite Horizon:** For the last occurrence of an item in the trace, the `time_to_next_request` is set to a sufficiently large integer (representing infinity) to indicate that the item will not be reused within the observation window.

---

## 6. Considerations and Limitations

While this dataset is designed to benchmark cache replacement policies, users should be aware of the following limitations:

* **Synthetic Nature:** The data is mathematically generated and does not capture hardware-level nuances such as bus latency, coherence traffic, or multi-core contention. It models *logical* page/block accesses only.
* **Stationarity:** The `zipf` workload is stationary (the popularity of items does not change). In contrast, the `web_traffic` workload exhibits non-stationary behavior (concept drift). Models trained solely on `zipf` may fail to generalize to environments where popularity changes rapidly.
* **Oracle Look-ahead:** The target variable `time_to_next_request` encodes perfect knowledge of the future. In a real-time deployment, this value is unavailable. This dataset is intended for **offline training** (Supervised Learning) or for establishing an upper-bound performance baseline.

---

## 7. License

This dataset is released under the **Apache License 2.0**. You are free to use, modify, and distribute this dataset for academic, commercial, or private use, provided that you include the original copyright notice and citation.

---

## 8. Citation

If you use this dataset in your research or technical reports, please cite it using the following BibTeX entry:

```bibtex
@misc{synthetic_cache_workloads,
  author = {rajaykumar12959},
  title = {Synthetic Cache Workloads: Oracle Traces for ML Cache Replacement},
  year = {2025},
  publisher = {Hugging Face},
  version = {1.0},
  howpublished = {\url{https://huggingface.co/datasets/rajaykumar12959/synthetic-cache-workloads}}
}
```
Provider:
rajaykumar12959