petrrysavy/krebs
Source: Hugging Face. Updated 2024-06-12; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/petrrysavy/krebs
Resource description:
---
license: cc-by-3.0
language:
- en
tags:
- Krebs cycle
- time-series
- causal learning
pretty_name: Krebs cycle dataset
---
# The Krebs cycle dataset
## Motivation
This dataset contains simulated time series that mimic the Krebs cycle.
The datasets are intended for causal discovery from multivariate
time series data: they provide ground-truth causal relationships and
allow testing in multiple scenarios, including many short time series,
few long time series, and relative rather than absolute concentrations.
The dataset was created at the Czech Technical University in Prague
as part of the [CoDiet project](https://www.codiet.eu/),
which focuses on the relationship between diet and non-communicable diseases.
The contents of this repository are also described in the following
paper (TODO: a BibTeX reference will be provided once the paper is published):
```
Causal Learning in Biomedical Applications
Petr Ryšavý, Xiaoyu He, Jakub Mareček
```
## Dataset composition
There are four datasets, each differing in the type of time series. The
basic characteristics are described in the table below. Each of the
datasets is contained in one of the subdirectories of this repository.
| Dataset | No. of features | Length | No. of series | Initialization | Concentrations |
|---------|-------------|--------|-----------|----------------|----------------|
| KrebsN | 16 | 500 | 100 | Normal distribution | Absolute |
| Krebs3 | 16 | 500 | 120 | Excitation of three | Relative |
| KrebsL | 16 | 5000 | 10 | Normal distribution | Absolute |
| KrebsS | 16 | 5 | 10000 | Normal distribution | Absolute |
Each of the datasets was sampled using a simulator of the Krebs cycle.
Individual compounds were created in a bounding box and spread throughout
the box at random locations. In each time step, the molecules move in
the box, and once a reaction can happen, the reactants are removed, and
the product is created. As a result, the concentrations of the particles
change, resulting in one data point per time step of the simulator.
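The mechanism described above can be illustrated with a toy sketch. This is not the authors' simulator: it assumes a single hypothetical reaction `A + B -> C`, a two-dimensional unit box, and made-up values for the reaction radius and step size. It only shows how moving particles and removing reactants yields one data point of counts per time step.

```python
import random


def simulate(n_a=50, n_b=50, steps=100, radius=0.05, step_size=0.02, seed=0):
    """Toy reaction A + B -> C in a unit box; returns per-step species counts."""
    rng = random.Random(seed)
    # Each particle is [species, x, y], placed uniformly at random in the box.
    particles = [[s, rng.random(), rng.random()]
                 for s in ["A"] * n_a + ["B"] * n_b]
    history = []
    for _ in range(steps):
        # 1) Move every particle by a small random step, clamped to the box.
        for p in particles:
            p[1] = min(1.0, max(0.0, p[1] + rng.uniform(-step_size, step_size)))
            p[2] = min(1.0, max(0.0, p[2] + rng.uniform(-step_size, step_size)))
        # 2) React: an A and a B closer than `radius` are replaced by one C.
        used = set()
        products = []
        for i, a in enumerate(particles):
            if a[0] != "A":
                continue
            for j, b in enumerate(particles):
                if b[0] != "B" or j in used:
                    continue
                if (a[1] - b[1]) ** 2 + (a[2] - b[2]) ** 2 <= radius ** 2:
                    used.update((i, j))
                    products.append(["C", (a[1] + b[1]) / 2, (a[2] + b[2]) / 2])
                    break
        particles = [p for k, p in enumerate(particles) if k not in used]
        particles += products
        # 3) Record one data point (the species counts) per time step.
        counts = {"A": 0, "B": 0, "C": 0}
        for p in particles:
            counts[p[0]] += 1
        history.append(counts)
    return history


history = simulate()
```

Each entry of `history` plays the role of one column in a time series file. The real generator, linked in the Dataset collection section, handles the full set of 16 compounds and their reactions.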
Each time series is stored in its own file, whose name encodes
the type of the series and the seed (timestamp) used for the
dataset generation. Each row of a time series file contains the
concentrations of one compound, with individual time steps separated
by a tab character `\t`. The files are therefore
in TSV (tab-separated values) format and can be opened in any text editor
or spreadsheet application.
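As a minimal sketch of the layout described above, the following parses one series using only the standard library. It assumes that the first field of each row is the compound name (consistent with `index_col=0` in the pandas example later in this card); the sample content and the helper name `read_series` are made up for illustration.

```python
import csv
import io


def read_series(handle):
    """Parse one series file: each row is a compound, each column a time step."""
    series = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip empty lines
        # Assumption: the first field is the compound name, the rest are values.
        name, values = row[0], row[1:]
        series[name] = [float(v) for v in values]
    return series


# Tiny made-up stand-in for a real file (values are illustrative only).
sample = "MALATE\t3\t4\t5\nFURMATE\t7\t6\t5\n"
series = read_series(io.StringIO(sample))

# Transpose to time-major order: one list of concentrations per time step.
names = list(series)
by_step = [[series[n][t] for n in names] for t in range(len(series[names[0]]))]
```

For a real file, the same function can be called as `with open(path, newline="") as f: series = read_series(f)`.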
The dataset contains no missing data. The randomness in the data
comes from the initialization of the compounds' concentrations and from the
random locations of the compounds in the bounding box. Although
unlikely, the datasets may contain repeated time series. The datasets
are self-contained.
If a train-test split is needed, the recommended
approach is to use the first x % of the dataset for training
and the remaining (100 - x) % for testing. The order in which individual
time series are considered is given at the root of the repository.
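The recommended split can be sketched as follows. The function name `split_series` and the file names are hypothetical; the only assumption is that the series names are already in the order given at the root of the repository.

```python
def split_series(ordered_names, train_percent):
    """Split an ordered list of series into the first x % and the rest."""
    if not 0 <= train_percent <= 100:
        raise ValueError("train_percent must be between 0 and 100")
    cut = int(len(ordered_names) * train_percent / 100)
    return ordered_names[:cut], ordered_names[cut:]


# Hypothetical file names standing in for the real ordered listing.
names = [f"series_{i}.tsv" for i in range(10)]
train, test = split_series(names, 80)
```

Splitting by position rather than at random keeps the split reproducible across users, because everyone relies on the same published order.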
The datasets do not contain any confidential, offensive, or similar type of data.
## Dataset collection
The dataset is simulated, meaning that the data were generated by a computer
program. The simulator is based on the
[Chemistry Engine repository](https://github.com/AugustNagro/Chemistry-Engine)
by August Nagro. You can
find the code used to generate the dataset in the
[GitHub repository at https://github.com/petrrysavy/krebsgenerator/](https://github.com/petrrysavy/krebsgenerator/).
## Uses
The dataset is intended for testing and developing causal discovery algorithms.
Given the time series, one would naturally ask whether higher levels
of `FURMATE` in one time step imply higher levels of `MALATE` in the next step,
ideally leading to the discovery of the whole cycle of reactions. The usage is,
however, not limited to causal discovery; it is also possible to predict concentrations
at the next time step or to perform similar time-series analyses.
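As a first look at the kind of question posed above, one can compare lagged correlations in both directions. The sketch below uses synthetic data, not the dataset itself: the series named `furmate` and `malate` are generated so that the second follows the first by one step. A lagged correlation is only a rough indicator and is not a causal discovery method.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def lagged_correlation(cause, effect, lag=1):
    """Correlate cause[t] with effect[t + lag]."""
    return pearson(cause[:-lag], effect[lag:])


# Synthetic stand-ins: malate at step t+1 depends on furmate at step t.
rng = random.Random(0)
furmate = [rng.gauss(0, 1) for _ in range(500)]
malate = [0.0] + [0.8 * f + rng.gauss(0, 0.3) for f in furmate[:-1]]

forward = lagged_correlation(furmate, malate)   # furmate[t] vs malate[t+1]
backward = lagged_correlation(malate, furmate)  # malate[t] vs furmate[t+1]
```

In this synthetic setup `forward` is large and `backward` is close to zero. On the real series, with 16 interacting compounds, such pairwise checks can be misleading, which is why dedicated causal discovery methods are evaluated on this dataset.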
## Distribution
The dataset is available at
[the HuggingFace repository at https://huggingface.co/datasets/petrrysavy/krebs/tree/main](https://huggingface.co/datasets/petrrysavy/krebs/tree/main).
The dataset is available under the CC-BY-3.0 license. The authors
bear all responsibility in case of violation of rights. To download the dataset, use
```
git clone git@hf.co:datasets/petrrysavy/krebs/
```
The project metadata in [JSON format can be found at https://huggingface.co/api/datasets/petrrysavy/krebs/croissant](https://huggingface.co/api/datasets/petrrysavy/krebs/croissant).
## Example of Usage in Custom Projects
An example usage of the dataset can be found in the [GitHub repository at https://github.com/petrrysavy/krebsdynotears](https://github.com/petrrysavy/krebsdynotears).
The repository shows how to use the dataset to evaluate [DyNoTears (see https://arxiv.org/abs/2002.00498)](https://arxiv.org/abs/2002.00498),
a state-of-the-art method for learning dynamic Bayesian networks. The repository also provides an example
of how to load the data in Python:
```
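import os
import pandas as pd

# Assumption: path points to the root of the cloned repository.
path = "."
# krebsN.txt lists the file names of the KrebsN series, one per line.
with open(os.path.join(path, "krebsN.txt"), "r") as file:
    lines = file.readlines()
files = [os.path.join("krebsN", line.strip()) for line in lines if line.strip()]
data = [pd.read_table(os.path.join(path, file), header=None, index_col=0).transpose() for file in files]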
# data now contains a list of pandas data frames, one per single time-series
# columns of the data frames are concentrations of one of the 16 compounds
# rows correspond to individual time-steps, sorted by increasing time
```
## Maintenance
For questions and requests about the dataset, please contact either
Petr Ryšavý [petr.rysavy@fel.cvut.cz](mailto:petr.rysavy@fel.cvut.cz)
or Jakub Mareček [jakub.marecek@fel.cvut.cz](mailto:jakub.marecek@fel.cvut.cz).
The authors of the repository are open to proposed changes and extensions
of the dataset; the simplest way to do so is to open a pull request in
HuggingFace, which will be merged after validation. The history of the dataset
can be seen in the
[commit history at https://huggingface.co/datasets/petrrysavy/krebs/commits/main](https://huggingface.co/datasets/petrrysavy/krebs/commits/main).
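Provided by:
petrrysavy
Summary of original information
Dataset overview
Dataset content
- Contains simulated time series data that mimic the Krebs cycle.
Dataset uses
- Mainly intended for causal discovery research.
License
- Released under the CC-BY-3.0 license.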



