
Data from: A cost-effective blood DNA methylation-based age estimation method in domestic cats, Tsushima leopard cats (Prionailurus bengalensis euptilurus), and Panthera species, using targeted bisulfite sequencing and machine learning models|Bioinformatics dataset|Machine learning dataset

Mendeley Data, updated 2024-04-13, indexed 2024-06-27
Bioinformatics
Machine learning
Download link:
https://datadryad.org/stash/dataset/doi:10.5061/dryad.3r2280gn4
Description:
# Datasets

---

### Appendix S1–S4

The files include the methylation data, sample information, and predicted age of each target species/species group. The data in the files were used to build the age estimation models. 'domestic cat' in the filename means the file is for the domestic cat; 'leopard cat' means the Tsushima leopard cat; 'panthera' means the Panthera species (i.e., jaguar, leopard, lion, snow leopard, and tiger); and 'all' means all the samples from all species.

### Appendix S5

The file contains the CpG selection results for the best age estimation model of each species/species group, the frequency with which each CpG site was selected in elastic net feature selection, the correlation coefficient between the methylation rate and chronological age of each CpG site, and the NCBI sequence ID with position.

### CpG No renamed fulllist\_all felidae.csv

The file lists the CpGs that are present in at least one species.

### M%+sampleinfo\*.csv

These files are the versions of Appendix S1–S4 before the predicted age was added.

### indextable\_skf\_cor\*.csv

Raw results of feature selection (correlation-based).

### indextable\_skf\_loio\_ela\*.csv

Raw results of feature selection (elastic net-based, leave-one-individual-out cross-validation).

### indextable\_skf\_loso(\_raw)\_ela\*.csv

Raw results of feature selection (elastic net-based, leave-one-species-out cross-validation).

*P.S. Appendix S1–S5 are referred to in our paper. Other files were only used in the analysis.*

# Description of the data sets and file structures

### Appendix S1–S4, M%+sampleinfo\*.csv

* Column headers beginning with amp3\_, amp4\_, amp8\_, amp9\_, and bs38\_ are the names of CpG sites. These columns contain the methylation rates. The proximal genes and genomic positions can be found in Appendix S5 and CpG No renamed fulllist\_all felidae.csv.
* Health_condition_ed: health condition at the time of sampling (good, diseased).
* Health_condition (Appendix S2–S4, species other than domestic cats): raw health condition data.
* Health condition information in Appendix S1 (domestic cats):
  * Health_condition_Healthy (column K): healthy sample
  * Health_condition_CKD (column L): sample with chronic kidney disease
  * Health_condition_Diabetes (column M): sample with diabetes
  * Health_condition_Cancer (column N): sample with cancer
  * Health_condition_DigestiveDisease (column O): sample with digestive diseases
  * Health_condition_Others (column P): sample with other diseases
* Fold: the data were split into five folds (0–4) with similar age and species distributions using stratified k-fold.
* Age_class: age class of each sample.
* Predictedage_\*: age predicted through the methods below.

| Feature selection methods | Regression methods | Column name (after 'Predictedage\_') |
| --------------------------- | ------------------------ | ------------------------------------ |
| elastic net | only once | ela |
| elastic net | elastic net | ela\_ela |
| elastic net | SVMr | ela\_svmr |
| cor ≥ 0.5 | elastic net | cor0\_5\_ela |
| cor ≥ 0.7 | elastic net | cor0\_7\_ela |
| cor ≥ 0.5 | SVMr | cor0\_5\_svmr |
| cor ≥ 0.7 | SVMr | cos0\_7\_svmr |

* For Appendix S2 and M%+sampleinfo_leopardcat_paper_final_fold+ageclass.csv
  * 'Age_stage_at_time_of_protection' shows the age stage estimated by morphological methods when the individual was taken into protection.
  * 'Death_date' shows the date of death. No data here means the individual was still alive in 2023. This column was not used in the analysis.
  * Empty cells mean no data. Captive-born individuals have no data in 'Age_stage_at_time_of_protection'. Wild-born individuals have no data in 'Age', 'Health_condition_ed', 'Fold', and 'Age_class', which are only available for captive-born individuals of known age. The predicted epigenetic age was calculated using the best model only and is reported in 'Predictedage_ela_svmr'.
* For Appendix S3 and M%+sampleinfo_panthera_paper_final_fold+ageclass.csv, Appendix S4 and M%+sampleinfo_all_paper_final_fold+relative_ageclass.csv
  * 'Predictedage_\*_loso(_raw)' is the age predicted when the model was evaluated by leave-one-species-out cross-validation.
* For Appendix S4
  * 'Predictedage_\*' is the predicted relative age of each sample. 'Predictedage_\*_chronoloical age' is the predicted chronological age under the best models.
  * Empty cells mean no data. Health conditions were summarized differently for domestic cats and for the other species, so the health condition-related columns contain empty cells.

### Appendix S5, CpG No renamed fulllist\_all felidae.csv

* Columns E to M show whether each CpG site exists in each species group: 0 means the CpG does not exist in the species; 1 means it does. Panthera_spp. (column L) includes the species in columns G–K (i.e., jaguar, leopard, lion, snow leopard, and tiger). All_spp. (column M) includes all species.

### Appendix S5

* Green, yellow, orange, and red columns represent different levels of the correlation coefficient between the methylation rates of selected CpG sites and chronological age. White columns are CpG sites that were not selected. Grey columns are CpG sites that do not exist in the species group.
* Columns named "Features in the best model (correlation_coefficient)—Elastic net + SVMr (frequency ≥ 4 or 5)" show the correlation coefficient between chronological age and the methylation rates of the features (i.e., CpGs) used in the best models. Elastic net-based feature selection followed by regression using SVMr (Elastic net + SVMr) produced the best models for all species groups. For some species groups, the explanatory variables of the best model were the CpGs selected at least four times across the five training data sets (frequency ≥ 4); for others, they were the CpGs selected in all five training data sets (frequency ≥ 5).
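
The frequency rule described above (a CpG is kept when it was selected in at least four, or in all five, of the training data sets) can be sketched in Python as follows. This is a minimal illustration and not the authors' script: the function name `select_by_frequency`, the input layout (one set of selected CpG names per training data set), and the CpG names in the example are all assumptions made for the sketch.

```python
from collections import Counter
from typing import Iterable, List, Set


def select_by_frequency(selected_per_fold: Iterable[Set[str]], min_frequency: int) -> List[str]:
    """Return the CpG names selected in at least `min_frequency` folds, sorted by name."""
    counts = Counter()
    for selected in selected_per_fold:
        counts.update(set(selected))  # a CpG counts at most once per fold
    return sorted(name for name, n in counts.items() if n >= min_frequency)


# Hypothetical selections from five training data sets (names are made up).
folds = [
    {"amp3_1", "amp4_2", "bs38_7"},
    {"amp3_1", "amp4_2"},
    {"amp3_1", "amp4_2", "amp9_5"},
    {"amp3_1", "bs38_7"},
    {"amp3_1", "amp4_2"},
]

print(select_by_frequency(folds, min_frequency=4))  # ['amp3_1', 'amp4_2']
print(select_by_frequency(folds, min_frequency=5))  # ['amp3_1']
```

Here `min_frequency=4` corresponds to "frequency ≥ 4" and `min_frequency=5` to "frequency ≥ 5"; which threshold applies to which species group is recorded in Appendix S5, not in this sketch.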
# Code/Software

2023_Qi_etal_paper Rscript.R was run in R 4.3.1. 2023_Qi_etal_Pythonscript.py was run in Python 3.8.8.
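
The authors' scripts are not reproduced here. As a rough illustration of the file layout described above, the following standard-library Python sketch reads a sample table, treats empty cells as missing data, and splits rows into training and test sets using the 'Fold' column. The CpG prefixes, 'Age', and 'Fold' come from the description; the 'Sample' column, the specific CpG column names, the helper names `load_rows` and `split_by_fold`, and all values are hypothetical.

```python
import csv
import io

# CpG column prefixes named in the data description.
CPG_PREFIXES = ("amp3_", "amp4_", "amp8_", "amp9_", "bs38_")


def load_rows(handle):
    """Read rows, converting CpG and Age columns to floats and empty cells to None."""
    rows = []
    for raw in csv.DictReader(handle):
        row = {}
        for key, value in raw.items():
            if value == "":
                row[key] = None  # empty cells mean no data
            elif key.startswith(CPG_PREFIXES) or key == "Age":
                row[key] = float(value)
            else:
                row[key] = value
        rows.append(row)
    return rows


def split_by_fold(rows, test_fold):
    """Split rows into (train, test) by the 'Fold' column; rows without a fold are skipped."""
    train, test = [], []
    for row in rows:
        if row["Fold"] is None:
            continue
        (test if int(row["Fold"]) == test_fold else train).append(row)
    return train, test


# Tiny made-up table in the described layout (values are not real data).
example = io.StringIO(
    "Sample,Age,Fold,amp3_1,bs38_2\n"
    "a,1.5,0,0.10,0.80\n"
    "b,7.0,1,0.35,0.60\n"
    "c,,,0.50,0.40\n"
)
rows = load_rows(example)
train, test = split_by_fold(rows, test_fold=0)
print(len(train), len(test))  # 1 1
```

Row `c` mimics a sample with no known age or fold (as described for wild-born individuals in Appendix S2), so it is excluded from both splits while its methylation values remain available for prediction.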
Created: 2024-01-08