Paranal/parlogs-observations

Name: Paranal/parlogs-observations
Creator: Paranal
Published: 2024-01-09 21:58:38
License: lgpl-2.1

Source: Hugging Face. Updated 2024-01-09; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/Paranal/parlogs-observations


Resource description:

---
layout: default
title: Home
language:
- en
license:
- lgpl-2.1
pretty_name: parlogs
---

# Parlogs-Observations Dataset

## Dataset Summary

Parlogs-Observations is a dataset of Very Large Telescope (VLT) logs covering the template executions of the PIONIER, GRAVITY, and MATISSE instruments when they used the Auxiliary Telescopes (ATs). It also includes the logs of all VLTI subsystems and ATs. The dataset aggregates logs by instrument, time range, and subsystem, and contains template executions from 2019 in the VLTI infrastructure at Paranal. Each file is a single Parquet file (named with a `.parket` extension), which can be loaded, for example, with Pandas in Python. Parlogs-Observations is publicly available at 🤗 Hugging Face Dataset.

## Supported Tasks and Leaderboards

The `parlogs-observations` dataset is a resource for researchers and practitioners in astronomy, data analysis, and machine learning. It supports tasks aimed at understanding and operating the Very Large Telescope Interferometer (VLTI) infrastructure:

- **Anomaly Detection**: Users can identify unusual patterns or abnormal behavior in the log data that could indicate errors or bugs. This is crucial for the operational maintenance of the VLTI.
- **System Diagnosis**: The dataset allows for diagnosing system failures or performance issues. By analyzing error logs, trace logs, or event logs, researchers can pinpoint and address the root causes of operational issues.
- **Performance Monitoring**: Users can track and analyze the VLTI systems to understand resource usage, detect latency issues, or identify bottlenecks in the infrastructure.
- **Predictive Maintenance**: Analyzing trends and patterns in the log data helps foresee system failures or issues before they occur, so that interventions can be made in time.

## Overview

### Observations at Paranal

At Paranal, the Very Large Telescope (VLT) is one of the world's most advanced optical telescopes, consisting of four Unit Telescopes and four movable Auxiliary Telescopes. Astronomical observations are configured into Observation Blocks (OBs), each containing a sequence of Templates with parameters and scripts tailored to various scientific goals. Each template's execution follows a predictable behavior, allowing for detailed and systematic studies. The templates remain unchanged during a scientific period of six months; therefore, the templates referred to in the parlogs-observations dataset can be considered immutable source code.

### Machine Learning Techniques for parlogs-observations

Given the structured nature of the dataset, various machine learning techniques can be applied to extract insights and build models for the tasks mentioned above. Some of these techniques include:

- **Clustering Algorithms**: Such as K-means and hierarchical clustering to group similar log messages or events and identify nested patterns in log data.
- **Classification Algorithms**: Including Support Vector Machines (SVM), Random Forests, and Naive Bayes classifiers for categorizing log messages and detecting anomalies.
- **Sequence Analysis and Pattern Recognition**: Utilizing Hidden Markov Models (HMMs) and Frequent Pattern Mining to model sequences of log messages or events and discover common patterns in logs.
- **Anomaly Detection Techniques**: Applying Isolation Forest and other advanced methods to identify outliers and anomalies in log data.
- **Natural Language Processing (NLP) Techniques**: Leveraging Topic Modeling and Word Embeddings to uncover thematic structures in log messages and transform text into meaningful numerical representations.
- **Deep Learning Techniques**: Employing Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, Convolutional Neural Networks (CNNs), Graph Neural Networks (GNNs), Transformers, and Autoencoders for modeling and analysis of time-series log data.

## Data Structure and Naming Conventions

The dataset is organized into Parquet files that follow a structured naming convention, so that each file can be identified by instrument, time range, and subsystems:

```
{INSTRUMENT}-{TIME_RANGE}-{CONTENT}.parket
```

Where:

- `INSTRUMENT` can be PIONIER, GRAVITY, or MATISSE.
- `TIME_RANGE` is one of 1d, 1w, 1m, 6m.
- `CONTENT` can be meta, traces, traces-SUBSYSTEMS, or traces-TELESCOPES.

Example files:

- PIONIER-1w-meta.parket
- GRAVITY-1m-traces-SUBSYSTEMS.parket

The "meta" file includes information about the template execution, while "traces" files contain event logs.
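The naming convention above can be sketched in a few lines of Python. This is only an illustration: the helper names `build_filename` and `parse_filename` are hypothetical and not part of the dataset, and the allowed values are copied from the lists above.

```python
from itertools import product

# Allowed values, copied from the naming convention described above.
INSTRUMENTS = ("PIONIER", "GRAVITY", "MATISSE")
TIME_RANGES = ("1d", "1w", "1m", "6m")
CONTENTS = ("meta", "traces", "traces-SUBSYSTEMS", "traces-TELESCOPES")


def build_filename(instrument: str, time_range: str, content: str) -> str:
    """Return the file name for one instrument, time range, and content type."""
    if instrument not in INSTRUMENTS:
        raise ValueError(f"unknown instrument: {instrument!r}")
    if time_range not in TIME_RANGES:
        raise ValueError(f"unknown time range: {time_range!r}")
    if content not in CONTENTS:
        raise ValueError(f"unknown content: {content!r}")
    return f"{instrument}-{time_range}-{content}.parket"


def parse_filename(filename: str) -> tuple[str, str, str]:
    """Split a file name back into (instrument, time_range, content)."""
    stem = filename.removesuffix(".parket")
    # CONTENT may itself contain a hyphen, so split at most twice.
    instrument, time_range, content = stem.split("-", 2)
    return instrument, time_range, content


# Every combination of instrument, time range, and content type.
all_files = [
    build_filename(i, t, c)
    for i, t, c in product(INSTRUMENTS, TIME_RANGES, CONTENTS)
]
```

Splitting at most twice keeps `traces-SUBSYSTEMS` and `traces-TELESCOPES` intact as a single content value.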
The existing files are shown in the table below:

| GRAVITY | PIONIER | MATISSE |
|---------|---------|---------|
| GRAVITY-1d-meta.parket | PIONIER-1d-meta.parket | MATISSE-1d-meta.parket |
| GRAVITY-1d-traces-SUBSYSTEMS.parket | PIONIER-1d-traces-SUBSYSTEMS.parket | MATISSE-1d-traces-SUBSYSTEMS.parket |
| GRAVITY-1d-traces-TELESCOPES.parket | PIONIER-1d-traces-TELESCOPES.parket | MATISSE-1d-traces-TELESCOPES.parket |
| GRAVITY-1d-traces.parket | PIONIER-1d-traces.parket | MATISSE-1d-traces.parket |
| GRAVITY-1m-meta.parket | PIONIER-1m-meta.parket | MATISSE-1m-meta.parket |
| GRAVITY-1m-traces-SUBSYSTEMS.parket | PIONIER-1m-traces-SUBSYSTEMS.parket | MATISSE-1m-traces-SUBSYSTEMS.parket |
| GRAVITY-1m-traces-TELESCOPES.parket | PIONIER-1m-traces-TELESCOPES.parket | MATISSE-1m-traces-TELESCOPES.parket |
| GRAVITY-1m-traces.parket | PIONIER-1m-traces.parket | MATISSE-1m-traces.parket |
| GRAVITY-1w-meta.parket | PIONIER-1w-meta.parket | MATISSE-1w-meta.parket |
| GRAVITY-1w-traces-SUBSYSTEMS.parket | PIONIER-1w-traces-SUBSYSTEMS.parket | MATISSE-1w-traces-SUBSYSTEMS.parket |
| GRAVITY-1w-traces-TELESCOPES.parket | PIONIER-1w-traces-TELESCOPES.parket | MATISSE-1w-traces-TELESCOPES.parket |
| GRAVITY-1w-traces.parket | PIONIER-1w-traces.parket | MATISSE-1w-traces.parket |
| GRAVITY-6m-meta.parket | PIONIER-6m-meta.parket | MATISSE-6m-meta.parket |
| GRAVITY-6m-traces-SUBSYSTEMS.parket | PIONIER-6m-traces-SUBSYSTEMS.parket | MATISSE-6m-traces-SUBSYSTEMS.parket |
| GRAVITY-6m-traces-TELESCOPES.parket | PIONIER-6m-traces-TELESCOPES.parket | MATISSE-6m-traces-TELESCOPES.parket |
| GRAVITY-6m-traces.parket | PIONIER-6m-traces.parket | MATISSE-6m-traces.parket |

## Combining Files

Files from the same instrument and the same time range share the same `trace_id` values. For instance, in the files:

- PIONIER-1w-meta.parket
- PIONIER-1w-traces.parket

The trace_id=10 in the PIONIER-1w-traces.parket file corresponds to the id=10 in the meta file PIONIER-1w-meta.parket.

## Data Instances

A typical entry in the dataset might look like this:

```python
# File: PIONIER-1m-traces.parket
# Row: 12268
{
    "@timestamp": 1554253173950,
    "system": "PIONIER",
    "hostname": "wpnr",
    "loghost": "wpnr",
    "logtype": "LOG",
    "envname": "wpnr",
    "procname": "pnoControl",
    "procid": 208,
    "module": "boss",
    "keywname": "",
    "keywvalue": "",
    "keywmask": "",
    "logtext": "Executing START command ...",
    "trace_id": 49
}
```

## Data Fields

The dataset contains structured logs from software operations related to astronomical instruments. Each log entry records a specific action or event in the system. The fields are:

| Field | Description |
|-------|-------------|
| @timestamp | The timestamp of the log entry, in milliseconds since the Unix epoch. |
| system | The name of the system (e.g., PIONIER) from which the log entry originates. |
| hostname | The hostname of the machine where the log entry was generated. |
| loghost | The host of the logging system that generated the entry. |
| logtype | Type of the log entry (e.g., LOG, FEVT, ERR), indicating its nature such as general log, event, or error. |
| envname | The environment name where the log was generated, providing context for the log entry. |
| procname | The name of the process that generated the log entry. |
| procid | The process ID associated with the log entry. |
| module | The module from which the log entry originated, indicating the specific part of the system. |
| keywname | Name of any keyword associated with the log entry, if applicable. It is always paired with `keywvalue`. |
| keywvalue | Value of the keyword mentioned in `keywname`, if applicable. |
| keywmask | Mask or additional context for the keyword, if applicable. |
| logtext | The actual text of the log entry, providing detailed information about the event or action. |
| trace_id | Identifier of the template execution that the log entry belongs to; corresponds to `id` in the metadata table. |

## Dataset Metadata

Each Parquet file contains metadata regarding its contents, including details about the instrument used, the time range, and the types of logs stored. This is the format of a sample template execution in the metadata:

```python
# File: PIONIER-1m-meta.parket
# Row: 49
{
    "START": "2019-04-03 00:59:33.005000",
    "END": "2019-04-03 01:01:25.719000",
    "TIMEOUT": False,
    "system": "PIONIER",
    "procname": "bob_ins",
    "TPL_ID": "PIONIER_obs_calibrator",
    "ERROR": False,
    "Aborted": False,
    "SECONDS": 112.0,
    "TEL": "AT"
}
```

Where the fields are:

| Field | Comment |
|-------|---------|
| START | The start timestamp of the template execution, with millisecond precision |
| END | The end timestamp of the template execution, with millisecond precision |
| TIMEOUT | Indicates if the execution exceeded a predefined time limit |
| system | The name of the system used (e.g., PIONIER) |
| procname | The process name associated with the template execution |
| TPL_ID | The filename of the corresponding template file |
| ERROR | Indicates if there was an error during execution |
| Aborted | Indicates if the template execution was aborted (manually or because of an error) |
| SECONDS | The duration of the template execution in seconds |
| TEL | The class of telescope used in the observation; in this dataset it is only AT |

Together, these fields describe each template execution and how it ended, which gives the context needed to interpret the corresponding log entries.

## Loading Data

The dataset can be loaded using Python libraries like Pandas. Here's an example of how to load a file:

```python
import pandas as pd

# read_parquet needs a Parquet engine such as pyarrow to be installed
df = pd.read_parquet('PIONIER-1w-meta.parket')
```
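As a sketch of the join described under Combining Files, the snippet below attaches the metadata of each template execution to its log rows and converts `@timestamp` to a datetime. It rests on several assumptions that the dataset card does not confirm: the meta `id` is taken from an `id` column if one exists and otherwise from the DataFrame index, `@timestamp` and `START` are assumed to share one time reference, and the helper name `attach_metadata` is hypothetical. The two small frames are built from the sample rows shown above so the example runs without the files.

```python
import pandas as pd


def attach_metadata(traces: pd.DataFrame, meta: pd.DataFrame) -> pd.DataFrame:
    """Join each log row with the template execution it belongs to."""
    # Assumption: if meta has no "id" column, its index holds the id.
    if "id" not in meta.columns:
        meta = meta.rename_axis("id").reset_index()
    # Columns present in both tables (system, procname) get a "_meta"
    # suffix on the metadata side.
    merged = traces.merge(
        meta, left_on="trace_id", right_on="id",
        how="left", suffixes=("", "_meta"),
    )
    # @timestamp is in milliseconds since the Unix epoch.
    merged["time"] = pd.to_datetime(merged["@timestamp"], unit="ms")
    # Assumption: START uses the same time reference as @timestamp.
    start = pd.to_datetime(merged["START"])
    merged["seconds_into_template"] = (merged["time"] - start).dt.total_seconds()
    return merged


# Minimal frames built from the sample rows in this document.
traces = pd.DataFrame([{
    "@timestamp": 1554253173950, "system": "PIONIER",
    "procname": "pnoControl", "logtype": "LOG",
    "logtext": "Executing START command ...", "trace_id": 49,
}])
meta = pd.DataFrame([{
    "START": "2019-04-03 00:59:33.005000",
    "END": "2019-04-03 01:01:25.719000",
    "system": "PIONIER", "procname": "bob_ins",
    "TPL_ID": "PIONIER_obs_calibrator", "ERROR": False, "SECONDS": 112.0,
}], index=[49])

joined = attach_metadata(traces, meta)
print(joined[["time", "TPL_ID", "logtext", "seconds_into_template"]])
```

With the real data, `traces` and `meta` would instead come from `pd.read_parquet` on a matching pair of files, such as `PIONIER-1w-traces.parket` and `PIONIER-1w-meta.parket`.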

Provider:

Paranal

