Genomic_Benchmarks

Name: Genomic_Benchmarks
Creator: maas
Published: 2025-06-30 14:26:36
License: not specified

ModelScope community; updated 2025-06-30; indexed 2025-07-05

Download link:

https://modelscope.cn/datasets/lgq12697/Genomic_Benchmarks


Resource description:

# Source

From [ML-Bioinfo-CEITEC/genomic_benchmarks](https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks)

# Datasets

Each folder contains either one benchmark or a set of benchmarks. See [docs/](https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks/tree/main/docs/) for the code used to create these benchmarks.

### Naming conventions

* *dummy_...*: small datasets, used for testing purposes
* *demo_...*: medium-sized datasets, not necessarily biologically relevant or fully reproducible, used in demos

### Versioning

We recommend checking the version number when working with a dataset (i.e. not using the default `None`). The version should be set to 0 when the dataset is proposed, changed to 1 after initial curation, and then increased after every modification.

### Data format

Each benchmark should contain a `metadata.yaml` file in its main folder with the specification in YAML format, namely

* **the version** of the benchmark (0 = in development)
* **the classes** of genomic sequences; for each class we further need to specify
  - *url* with the reference
  - *type* of the reference (currently, only fa.gz is implemented)
  - *extra_processing*, a parameter that helps overcome some known issues with identifier matching

The main folder should also contain two folders, `train` and `test`. Both folders should contain gzipped CSV files, one for each class (named `class_name.csv.gz`).

The format of the gzipped CSV files closely resembles the BED format. The column names must be the following:

* **id**: id of a sequence
* **region**: chromosome/transcript/... to be matched with the reference
* **start**, **end**: genomic interval specification (0-based, i.e. same as in Python)
* **strand**: either '+' or '-'

### To contribute a new dataset

Create a new branch. Add the new subfolders to `datasets` and `docs`. The subfolder of `docs` should contain a description of the dataset in `README.md`. If the dataset comes with a paper, link the paper. If the dataset is not taken from a paper, make sure you have described and understand the biological process behind it.

If you have access to the `cloud_cache` folder on GDrive, upload your file there and update `CLOUD_CACHE` in [cloud_caching.py](https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks/blob/main/src/genomic_benchmarks/loc2seq/cloud_caching.py).

### To review a new dataset

Make sure you can run and reproduce the code. Check that you can download the actual sequences and/or create a data loader. Do you understand what is behind these data (either from the paper or the description)? Ask for clarification if needed.
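To make the interval CSV layout from the "Data format" section concrete, here is a minimal sketch in Python that parses one gzipped class file and cuts the matching subsequences out of a reference. It uses only the standard library and is not the package's own loader. The function names (`read_intervals`, `extract`, `make_toy_file`), the toy reference, and the toy rows are hypothetical. Two behaviours are assumptions rather than statements from the description above: `end` is treated as exclusive (as in a Python slice and in BED), and intervals on the '-' strand are reverse-complemented.

```python
import csv
import gzip
import io

# Column names required by the data format description.
REQUIRED_COLUMNS = ["id", "region", "start", "end", "strand"]

# Translation table used to complement a DNA string.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def read_intervals(gz_bytes):
    """Parse the bytes of a gzipped interval CSV into a list of dicts."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", newline="") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [name for name in REQUIRED_COLUMNS if name not in header]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        rows = []
        for row in reader:
            if row["strand"] not in ("+", "-"):
                raise ValueError(f"unexpected strand: {row['strand']!r}")
            # Coordinates arrive as text; convert them to integers.
            row["start"], row["end"] = int(row["start"]), int(row["end"])
            rows.append(row)
        return rows


def extract(reference, row):
    """Slice one interval out of the reference.

    Assumption: `end` is exclusive, and '-' strand intervals are
    returned as the reverse complement.
    """
    seq = reference[row["region"]][row["start"]:row["end"]]
    if row["strand"] == "-":
        seq = seq.translate(_COMPLEMENT)[::-1]
    return seq


def make_toy_file():
    """Build a tiny in-memory stand-in for a `class_name.csv.gz` file."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(REQUIRED_COLUMNS)
    writer.writerow([0, "chr1", 0, 4, "+"])
    writer.writerow([1, "chr1", 1, 5, "-"])
    return gzip.compress(buffer.getvalue().encode("utf-8"))


# Hypothetical reference: region name mapped to its sequence.
TOY_REFERENCE = {"chr1": "ACGTTGCA"}

if __name__ == "__main__":
    for interval in read_intervals(make_toy_file()):
        print(interval["id"], interval["strand"], extract(TOY_REFERENCE, interval))
```

For a real benchmark, the bytes would come from a file under `train` or `test`, and the reference would be loaded from the `fa.gz` file named in `metadata.yaml`.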


Provider:

maas

Created:

2025-06-30
