petrrysavy/krebs

Name: petrrysavy/krebs
Creator: petrrysavy
Published: 2024-06-12 19:39:32
License: no description provided

Hugging Face · Updated 2024-06-12 · Indexed 2024-06-12

Download link:

https://hf-mirror.com/datasets/petrrysavy/krebs

Resource description:

---
license: cc-by-3.0
language:
- en
tags:
- Krebs cycle
- time-series
- causal learning
pretty_name: Krebs cycle dataset
---

# The Krebs cycle dataset

## Motivation

This dataset contains simulated time series that mimic the Krebs cycle. The datasets are intended for causal discovery from multivariate time series data: they provide ground-truth causal relationships and allow testing in multiple scenarios, including many short time series, few long time series, and relative rather than absolute concentrations.

The dataset was created at the Czech Technical University in Prague as part of the [CoDiet project](https://www.codiet.eu/), which focuses on the relationship between diet and non-communicable diseases.

The contents of this repository are also described in the following paper (TODO: we will provide a BibTeX reference once published):

```
Causal Learning in Biomedical Applications
Petr Ryšavý, Xiaoyu He, Jakub Mareček
```

## Dataset composition

There are four datasets, each differing in the type of time series. The basic characteristics are described in the table below. Each of the datasets is contained in one of the subdirectories of this repository.

| Dataset | N. features | Length | N. series | Initialization | Concentrations |
|---------|-------------|--------|-----------|----------------|----------------|
| KrebsN | 16 | 500 | 100 | Normal distribution | Absolute |
| Krebs3 | 16 | 500 | 120 | Excitation of three | Relative |
| KrebsL | 16 | 5000 | 10 | Normal distribution | Absolute |
| KrebsS | 16 | 5 | 10000 | Normal distribution | Absolute |

Each of the datasets was sampled using a simulator of the Krebs cycle. Individual compounds were created in a bounding box and spread throughout the box at random locations. In each time step, the molecules move in the box, and once a reaction can happen, the reactants are removed, and the product is created.
As a result, the concentrations of the particles change, resulting in one data point per time step of the simulator.

Each time series is stored in its own file, whose name combines the type of the series and the seed (timestamp) used for the dataset generation. Each row of a time series file contains the concentrations of one compound, with individual time steps separated by a tab character (`\t`). The files are therefore in TSV (tab-separated values) format and can be opened in any text editor or spreadsheet application.

The dataset contains no missing data. The randomness in the data comes from the initialization of the compounds' concentrations and from the random locations of the compounds in the bounding box. Although it is unlikely, the datasets can contain repeated time series. The datasets are self-contained.

If a train-test split is needed, the recommended split uses the first x % of the dataset for training and the remaining (100 - x) % for testing. The order in which individual time series are considered is given at the root of the repository.

The datasets do not contain any confidential, offensive, or similar type of data.

## Dataset collection

The dataset is simulated, meaning that the data were generated by a computer program. The simulator is based on the [Chemistry Engine repository](https://github.com/AugustNagro/Chemistry-Engine) by August Nagro. The code used to generate the dataset is available in the [krebsgenerator GitHub repository](https://github.com/petrrysavy/krebsgenerator/).

## Uses

The dataset is intended for testing and developing causal discovery algorithms.
From the time series, one would naturally ask whether higher levels of `FURMATE` in one time step imply higher levels of `MALATE` in the next step, in the ideal case leading to the discovery of the whole cycle of reactions. The usage is, however, not limited to causal discovery; it is also possible to predict concentrations at the next time step or to perform similar time-series analyses.

## Distribution

The dataset is available at [the Hugging Face repository](https://huggingface.co/datasets/petrrysavy/krebs/tree/main) under the CC-BY-3.0 license. The authors bear all responsibility in case of violation of rights.

To download the dataset, use

```
git clone git@hf.co:datasets/petrrysavy/krebs/
```

The project metadata is available in [JSON format](https://huggingface.co/api/datasets/petrrysavy/krebs/croissant).

## Example of Usages in Custom Projects

An example usage of the dataset can be found in the [krebsdynotears GitHub repository](https://github.com/petrrysavy/krebsdynotears). The repository shows how to use the dataset to evaluate [DyNoTears](https://arxiv.org/abs/2002.00498), a state-of-the-art method for dynamic Bayesian networks.
The repository also provides an example of how to load the data in Python:

```
import os
import pandas as pd

# krebsN.txt lists the file names of the individual time series, one per line
with open("krebsN.txt", "r") as order_file:
    names = [line.strip() for line in order_file if line.strip()]

# the time-series files themselves are in the krebsN subdirectory
paths = [os.path.join("krebsN", name) for name in names]
data = [pd.read_table(path, header=None, index_col=0).transpose() for path in paths]
# data now contains a list of pandas data frames, one per single time-series
# columns of the data frames are concentrations of one of the 16 compounds
# rows correspond to individual time-steps, sorted by increasing time
```

## Maintenance

With queries, requests, and errata concerning the dataset, please contact either Petr Ryšavý ([petr.rysavy@fel.cvut.cz](mailto:petr.rysavy@fel.cvut.cz)) or Jakub Mareček ([jakub.marecek@fel.cvut.cz](mailto:jakub.marecek@fel.cvut.cz)).

The authors of the repository are open to proposed changes and extensions of the dataset; the simplest way to propose one is to open a pull request on Hugging Face, which will be merged after validation. The history of the dataset can be seen in the [commit history](https://huggingface.co/datasets/petrrysavy/krebs/commits/main).
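
The recommended train-test split from the Dataset composition section can be sketched in a few lines. This is a minimal sketch, not the authors' code: the helper names `read_order` and `split_series` are hypothetical, the order file is assumed to list one series file name per line (as `krebsN.txt` appears to do in the loading example), and rounding the training count down is an assumption, since the text does not say how to round.

```python
def read_order(order_path):
    """Return series file names in the order listed in the order file.

    Assumption: one file name per line; blank lines are ignored.
    """
    with open(order_path, "r") as order_file:
        return [line.strip() for line in order_file if line.strip()]


def split_series(ordered_names, train_percent):
    """Put the first train_percent % of the series into train, the rest into test.

    train_percent is expected to be an integer between 0 and 100.
    """
    if not 0 <= train_percent <= 100:
        raise ValueError("train_percent must be between 0 and 100")
    # Integer arithmetic avoids floating-point surprises;
    # rounding down is an assumption, not something the dataset card states.
    n_train = len(ordered_names) * train_percent // 100
    return ordered_names[:n_train], ordered_names[n_train:]


# Stand-in names for illustration only; with the real data they would
# come from read_order("krebsN.txt").
example_names = ["series_{}.tsv".format(i) for i in range(10)]
train_names, test_names = split_series(example_names, 80)
```

Because the split follows the listed order instead of shuffling, it gives the same partition on every run, which keeps results comparable between methods.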

Provided by:

petrrysavy

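Summary of original information

Dataset overview

Dataset contents

Contains simulated time series data that mimic the Krebs cycle.

Dataset use

Mainly intended for causal discovery research.

License

Released under the CC-BY-3.0 license.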
