
GDC-SM: The GDC Schema Matching Benchmark

Source: Zenodo. Updated: 2025-05-06. Indexed: 2026-05-29.
Download link:
https://zenodo.org/doi/10.5281/zenodo.14963587
Resource description:
GDC-SM is a schema matching evaluation benchmark based on a real data harmonization scenario that is common in biomedical research: pooling datasets from multiple studies to increase the number of patients. It follows the study by Li et al. (2023), in which the authors (1) pooled data from ten studies (available as supplementary material in the original primary research papers) that cover multiple patient cohorts and cancer types, and (2) mapped each dataset to the GDC standard, a data standard set by the US National Cancer Institute's Genomic Data Commons (GDC) for cancer genomic data.

We obtained the datasets from the ten tumor-analysis studies (see the table below) and, with the help of biomedical experts, manually aligned and matched them to version 3.1.0 of the GDC standard.

The GDC standard dictionary describes data using a graph model that includes names and descriptions for nodes and attributes, as well as acceptable values for some attributes. However, for compatibility with other schema matching methods and benchmarks, we transformed the model into a relational schema that contains only column names and domain information (i.e., we disregard column descriptions). The result is a simplified "target" table that reflects the GDC format and lists the domain values of each column without repetition.

The benchmark includes 10 pairs of source and target tables. The source tables have between 16 and 179 columns and between 93 and 225 rows. The simplified GDC target schema is a single table with 736 columns. While some columns have only a few distinct values (e.g., binary yes/no attributes), others contain up to 4,478 distinct values.

The ground truth was manually curated by multiple annotators (at least three per matching pair), who used a mix of manual and automated methods to identify candidate matches (e.g., GDC search tools and bdi-kit).
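As a rough illustration of the target-table construction described above (one column per attribute, listing its domain values without repetition), the following minimal sketch may help. It is not the authors' code: the function names, the padding of shorter columns with empty cells, and the example attributes and values are all assumptions made for illustration, and the actual layout of the released target file may differ.

```python
import csv
import io
from itertools import zip_longest


def build_target_table(domains):
    """Return (header, rows) for a table whose columns list unique domain values.

    `domains` maps a column name to its values, possibly with repeats.
    The order of first appearance is preserved.
    """
    # dict.fromkeys keeps only the first occurrence of each value, in order.
    unique = {name: list(dict.fromkeys(values)) for name, values in domains.items()}
    header = list(unique)
    # Columns have different numbers of distinct values, so pad with "".
    rows = [list(row) for row in zip_longest(*unique.values(), fillvalue="")]
    return header, rows


def to_csv_text(header, rows):
    """Serialize the table to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical attributes and values, for illustration only.
example_domains = {
    "vital_status": ["Alive", "Dead", "Alive", "Unknown"],
    "gender": ["female", "male", "female"],
}
header, rows = build_target_table(example_domains)
print(to_csv_text(header, rows))
```

Padding with empty cells is only one way to store columns of unequal length in a single table; it keeps every column's distinct values visible to a matcher without repeating any of them.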
Given that the correctness of some matches is very challenging to determine even for bioinformatics experts (e.g., it may require reading the original papers or asking data producers), the final match decisions were made by consensus based on what users would expect from an algorithm given the limited context.

For more details, refer to our paper:

Yurong Liu, Eduardo Pena, Aécio Santos, Eden Wu, and Juliana Freire. "Magneto: Combining Small and Large Language Models for Schema Matching." To appear at PVLDB 2025. arXiv preprint arXiv:2412.08194.

For citations, please consider citing the paper above and this dataset as follows:

Santos, A., Wu, E., Lopez, R., Keegan, S., Pena, E., Liu, W., Liu, Y., Fenyo, D., & Freire, J. (2025). GDC-SM: The GDC Schema Matching Benchmark (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.14963588

How to use

The raw datasets used in this benchmark are sourced from publicly available studies under Creative Commons licenses. However, some of these licenses include the NoDerivatives (ND) restriction, which prevents others from redistributing modified versions of the material, including transformed or processed datasets. Therefore, we provide only the ground-truth annotations and information about how to reconstruct the data tables from their original data sources. We describe below how to recreate the CSV table files.

Generating data files from the publisher's original data

For convenience, we provide the gdc_download.py script, which automates the process of downloading, extracting, and preparing the data. It performs the following steps:

1. Download Data: Retrieves the dataset from the paper publisher's website (supplemental files distributed with the paper).
2. Extract Data: If the file is in Excel format, extracts the specified sheet into a dataframe.
3. Clean Data: Performs cleaning steps to ensure the datasets are the same as the ones used in the paper (e.g., for the Vasaikar.csv dataset, it removes the first and last rows).
4. Save as CSV: Saves the processed data as a .csv file in the input-tables folder.

To use it, install the dependencies listed in the requirements.txt file and run the script; it automatically handles the downloading, extraction, and processing of the data.

Included files

The dataset release includes the following files:

- gdc-sm-data.zip: Contains manually curated ground-truth mappings and a standardized target schema based on a real-world biomedical data harmonization task. It includes:
  - data/ground-truth/: Ten CSV files (Cao.csv, Clark.csv, Dou.csv, Gillette.csv, Huang.csv, Krug.csv, McDermott.csv, Satpathy.csv, Vasaikar.csv, Wang.csv), each containing mappings between column names from a source dataset and attributes in the GDC target schema. These mappings serve as the manually curated ground truth for the schema matching task.
  - data/target-tables/gdc_unique_columns_concat_values.csv: The GDC target schema formatted as a relational table. It lists all attributes from the GDC data dictionary along with their unique, concatenated domain values.
- gdc_download.py: A Python script that automates the download, extraction, and preparation of the source datasets directly from the original publishers. It handles Excel-to-CSV conversions, minor data cleaning steps, and file organization. When executed, this script creates two additional folders as output:
  - data/downloads/: Contains the original spreadsheet files downloaded from the paper publisher's website.
  - data/input-tables/: Contains tables, in CSV format, to be used as source tables for the schema matching task.
- papers_info.json: A metadata file listing key details for each source study, including publication titles, URLs to the papers, dataset links, and sheet names from which the benchmark datasets were extracted. Needed to run gdc_download.py.
- requirements.txt: Specifies the Python package dependencies needed to run gdc_download.py.

Source table information

The table below includes details for each study, such as relevant URLs for the original spreadsheet files used in the benchmark and the sheet names where the source tables appear. Each row corresponds to a study and provides the following information:

- First Author: Name of the study's first author.
- Paper Title: Title of the publication.
- Paper URL: Link to the paper.
- Dataset URL: Link to the supplementary file that contains the dataset extracted from the paper, given for convenience.
- Relevant File: Name of the file or table in the supplementary material.
- Sheet Name: Name of the sheet in the file that contains the data.

| First Author | Paper Title | Paper URL | Dataset URL | Sheet Name |
|---|---|---|---|---|
| Clark, D.J. | Integrated Proteogenomic Characterization of Clear Cell Renal Cell Carcinoma | https://doi.org/10.1016/j.cell.2019.10.007 | https://ars.els-cdn.com/content/image/1-s2.0-S0092867419311237-mmc1.xlsx | ccrcc_clinical_characteristics |
| Krug, K. | Proteogenomic Landscape of Breast Cancer Tumorigenesis and Targeted Therapy | https://doi.org/10.1016/j.cell.2020.10.036 | https://ars.els-cdn.com/content/image/1-s2.0-S0092867420314008-mmc1.xlsx | A) Metadata |
| Vasaikar, S. | Proteogenomic Analysis of Human Colon Cancer Reveals New Therapeutic Opportunities | https://doi.org/10.1016/j.cell.2019.03.030 | https://ars.els-cdn.com/content/image/1-s2.0-S0092867419302922-mmc1.xlsx | B-ClinicalData |
| Wang, L.-B. | Proteogenomic and metabolomic characterization of human glioblastoma | https://doi.org/10.1016/j.ccell.2021.01.006 | https://ars.els-cdn.com/content/image/1-s2.0-S1535610821000507-mmc2.xlsx | clinical_data |
| Huang, C. | Proteogenomic insights into the biology and treatment of HPV-negative head and neck squamous cell carcinoma | https://doi.org/10.1016/j.ccell.2020.12.007 | https://ars.els-cdn.com/content/image/1-s2.0-S1535610820306553-mmc2.xlsx | Meta table |
| Satpathy, S. | A proteogenomic portrait of lung squamous cell carcinoma | https://doi.org/10.1016/j.cell.2021.07.016 | https://ars.els-cdn.com/content/image/1-s2.0-S0092867421008576-mmc1.xlsx | Table S1A |
| Gillette, M.A. | Proteogenomic Characterization Reveals Therapeutic Vulnerabilities in Lung Adenocarcinoma | https://doi.org/10.1016/j.cell.2020.06.013 | https://ars.els-cdn.com/content/image/1-s2.0-S0092867420307443-mmc1.xlsx | Annotions_S1A |
| McDermott, J.E. | Proteogenomic Characterization of Ovarian HGSC Implicates Mitotic Kinases, Replication Stress in Observed Chromosomal Instability | https://doi.org/10.1016/j.xcrm.2020.100004 | https://ars.els-cdn.com/content/image/1-s2.0-S2666379120300045-mmc2.xlsx | BaselineClinicalData |
| Cao, L. | Proteogenomic characterization of pancreatic ductal adenocarcinoma | https://doi.org/10.1016/j.cell.2021.08.023 | https://ars.els-cdn.com/content/image/1-s2.0-S0092867421009971-mmc1.xlsx | Clinical_data |
| Dou, Y. | Proteogenomic Characterization of Endometrial Carcinoma | https://doi.org/10.1016/j.cell.2020.01.026 | https://ars.els-cdn.com/content/image/1-s2.0-S0092867420301070-mmc1.xlsx | UCEC_CPTAC3_meta_table_V2.1 |

Acknowledgements

This work was supported by NSF awards IIS-2106888 and OAC-2411221, the DARPA Automating Scientific Knowledge Extraction and Modeling (ASKEM) program (agreement No. HR0011262087), and the ARPA-H BDF program. The views, opinions, and findings expressed are those of the authors and should not be interpreted as representing the official views or policies of DARPA, ARPA-H, the U.S. Government, or NSF. We thank our collaborators, Dr. Tonia Korves and the MITRE team, for contributing their experiences in biomedical data integration and an initial proof of concept of this benchmark.
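The download, extract, clean, and save steps listed under "Generating data files from the publisher's original data" can be sketched roughly as follows. This is not the contents of gdc_download.py: every function name and parameter here is hypothetical, the folder names are taken from the description above, and the Excel step assumes pandas (with an Excel engine such as openpyxl) is installed. The real script should be preferred for reproducing the benchmark tables.

```python
import csv
import urllib.request
from pathlib import Path


def download(url, dest):
    """Download `url` to `dest` unless the file is already present."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


def extract_sheet(xlsx_path, sheet_name):
    """Read one sheet of an Excel file into a list of rows (lists of cells)."""
    # Imported lazily: pandas is a third-party package, and the other
    # helpers in this sketch only need the standard library.
    import pandas as pd

    frame = pd.read_excel(xlsx_path, sheet_name=sheet_name, header=None, dtype=str)
    return frame.fillna("").values.tolist()


def clean_rows(rows, drop_first=0, drop_last=0):
    """Drop the given number of leading and trailing rows."""
    end = max(len(rows) - drop_last, drop_first)
    return rows[drop_first:end]


def save_csv(rows, out_path):
    """Write rows to a CSV file, creating parent folders as needed."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
    return out_path


def prepare_table(url, sheet_name, name, root="data", drop_first=0, drop_last=0):
    """Run all four steps for one study and return the path of the CSV."""
    xlsx = download(url, Path(root) / "downloads" / f"{name}.xlsx")
    rows = clean_rows(extract_sheet(xlsx, sheet_name), drop_first, drop_last)
    return save_csv(rows, Path(root) / "input-tables" / f"{name}.csv")
```

Under these assumptions, a call such as `prepare_table(dataset_url, "B-ClinicalData", "Vasaikar", drop_first=1, drop_last=1)`, with `dataset_url` set to the Vasaikar entry in the table, would mirror the cleaning example mentioned above. Whether the publisher's server accepts a plain request, and which rows count as "first" and "last" in the original script, are not settled by this sketch.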
Provider:
Zenodo
Created:
2025-04-30