
honicky/hdfs-logs-encoded-blocks

Source: Hugging Face. Updated 2024-12-01; indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/honicky/hdfs-logs-encoded-blocks
Resource description:
---
language: en
tags:
- log-analysis
- hdfs
- anomaly-detection
license: mit
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: event_encoded
    dtype: string
  - name: tokenized_block
    sequence: int64
  - name: block_id
    dtype: string
  - name: label
    dtype: string
  - name: __index_level_0__
    dtype: int64
  splits:
  - name: train
    num_bytes: 1159074302
    num_examples: 460048
  - name: validation
    num_bytes: 145089712
    num_examples: 57506
  - name: test
    num_bytes: 144844752
    num_examples: 57507
  download_size: 173888975
  dataset_size: 1449008766
---

# HDFS Logs Train/Val/Test Splits

This dataset contains preprocessed HDFS log sequences split into train, validation, and test sets for anomaly detection tasks.

## Dataset Description

The dataset is derived from the HDFS log dataset, which contains system logs from a Hadoop Distributed File System (HDFS). Each sequence represents a block of log messages, labeled as either normal or anomalous. The dataset has been preprocessed using the Drain algorithm to extract structured fields and identify event types.

### Data Fields

- `block_id`: Unique identifier for each HDFS block, used to group log messages into blocks
- `event_encoded`: The preprocessed log sequence with event IDs and parameters
- `tokenized_block`: The tokenized log sequence, used for training
- `label`: Classification label ('Normal' or 'Anomaly')

### Data Splits

- Training set: 460,049 sequences (80%)
- Validation set: 57,506 sequences (10%)
- Test set: 57,506 sequences (10%)

The splits are stratified by the Label field to maintain class distribution across splits.

## Source Data

Original data source: https://zenodo.org/records/8196385/files/HDFS_v1.zip?download=1

## Preprocessing

We preprocess the logs using the Drain algorithm to extract structured fields and identify event types. We then encode the logs using a pretrained tokenizer and add special tokens to separate event types. This dataset should be immediately usable for training and testing models for log-based anomaly detection.

## Intended Uses

This dataset is designed for:

- Training log anomaly detection models
- Evaluating log sequence prediction models
- Benchmarking different approaches to log-based anomaly detection

See [honicky/pythia-14m-hdfs-logs](https://huggingface.co/honicky/pythia-14m-hdfs-logs) for an example model.

## Citation

If you use this dataset, please cite the original HDFS paper:

```bibtex
@inproceedings{xu2009detecting,
  title={Detecting large-scale system problems by mining console logs},
  author={Xu, Wei and Huang, Ling and Fox, Armando and Patterson, David and Jordan, Michael I},
  booktitle={Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles},
  pages={117--132},
  year={2009}
}
```
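
As a rough illustration of how the fields and splits described above might be used, the sketch below loads the dataset with the Hugging Face `datasets` library and counts labels. The helper names `load_splits` and `label_counts` are hypothetical, and the toy rows are made-up values that only mimic the documented fields; nothing here is taken from the dataset author's own code.

```python
from collections import Counter


def load_splits(repo_id="honicky/hdfs-logs-encoded-blocks"):
    """Download the dataset from the Hugging Face Hub (needs network access)."""
    # Imported lazily so the helper below also works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id)  # DatasetDict keyed by split name


def label_counts(rows):
    """Count how often each value of the `label` field occurs."""
    return Counter(row["label"] for row in rows)


# Toy rows that mimic the documented fields; the values are made up.
toy_rows = [
    {"block_id": "blk_1", "tokenized_block": [5, 9, 2], "label": "Normal"},
    {"block_id": "blk_2", "tokenized_block": [5, 7], "label": "Anomaly"},
    {"block_id": "blk_3", "tokenized_block": [5, 9, 9, 2], "label": "Normal"},
]
toy_counts = label_counts(toy_rows)
print(toy_counts)
```

With network access, something like `splits = load_splits()` followed by `label_counts(splits["validation"])` should give the class balance of the validation split, assuming the split and field names match the card.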

This dataset contains preprocessed HDFS log sequences split into train, validation, and test sets for anomaly detection tasks. The dataset is derived from the HDFS log dataset, which contains system logs from a Hadoop Distributed File System (HDFS). Each sequence represents a block of log messages, labeled as either normal or anomalous. The dataset has been preprocessed using the Drain algorithm to extract structured fields and identify event types. Data fields include block_id (unique identifier for each HDFS block), event_encoded (the preprocessed log sequence with event IDs and parameters), tokenized_block (the tokenized log sequence, used for training), and label (classification label, Normal or Anomaly). The dataset is split into a training set of 460,049 sequences (80%), a validation set of 57,506 sequences (10%), and a test set of 57,506 sequences (10%), stratified by the Label field to maintain class distribution.
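
The stratified 80/10/10 split mentioned above can be sketched in plain Python. This is a minimal illustration of the general technique on made-up toy labels, not the procedure the dataset author actually used; the function name `stratified_split`, the seed, and the rounding rule are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return train/validation/test index lists with per-label proportions kept."""
    # Group row indices by their label so each class is split separately.
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train, validation, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        validation += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # whatever remains goes to test
    return train, validation, test


# Toy labels (made up): 90 normal blocks and 10 anomalous ones.
labels = ["Normal"] * 90 + ["Anomaly"] * 10
train, validation, test = stratified_split(labels)
print(len(train), len(validation), len(test))
```

Splitting each class separately is what keeps the ratio of anomalous to normal blocks roughly equal across the three sets, which matters when anomalies are rare.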
Provider:
honicky
The content above was collected and summarized by 遇见数据集.