vida-nyu/magneto-gdc-synthetic
Source: Hugging Face. Updated 2025-12-17; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/vida-nyu/magneto-gdc-synthetic
Resource description:
This dataset contains synthetically generated training data used to fine-tune the Magneto schema retriever for schema matching tasks in the biomedical domain. The dataset includes 736 anchor columns from the GDC (Genomic Data Commons) target schema, each augmented with multiple synthetic variants to create a diverse training set for contrastive learning. The synthetic data was generated using two complementary augmentation strategies: LLM-based augmentation (generating semantically equivalent but syntactically diverse column variants) and structure-based augmentation (applying perturbations like character replacements, deletions, and value sampling). The dataset contains 4,416 synthetic column variants, categorized into original, exact, and semantic augmentation types. The dataset structure includes fields such as anchor column name, augmentation type, variant name, and domain values, and is designed for self-supervised contrastive learning tasks.
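The structure-based strategy described above (character replacements, deletions, and value sampling) can be illustrated with a minimal sketch. This is not the authors' generation code: the function names, the row keys `anchor`, `variant`, and `values`, and the sample column and values are all hypothetical, chosen only to show how one anchor column might yield several perturbed variants that serve as positive pairs for contrastive learning. The counts in the description work out to six variants per anchor on average (4,416 / 736 = 6).

```python
import random
import string


def replace_char(name, rng):
    """Replace one randomly chosen character with a random lowercase letter."""
    if not name:
        return name
    i = rng.randrange(len(name))
    return name[:i] + rng.choice(string.ascii_lowercase) + name[i + 1:]


def delete_char(name, rng):
    """Delete one randomly chosen character; names of length <= 1 are kept."""
    if len(name) <= 1:
        return name
    i = rng.randrange(len(name))
    return name[:i] + name[i + 1:]


def sample_values(values, k, rng):
    """Return a random subset of at most k domain values."""
    return rng.sample(values, min(k, len(values)))


def make_variants(anchor, values, n, seed=0):
    """Build n perturbed variants of one anchor column (hypothetical row layout)."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    rows = []
    for _ in range(n):
        perturb = rng.choice([replace_char, delete_char])
        rows.append({
            "anchor": anchor,                         # assumed key name
            "variant": perturb(anchor, rng),          # assumed key name
            "values": sample_values(values, 2, rng),  # assumed key name
        })
    return rows


def positive_pairs(rows):
    """Pair each variant with its anchor, as a contrastive loss would expect."""
    return [(row["anchor"], row["variant"]) for row in rows]


if __name__ == "__main__":
    # Illustrative column name and values, not taken from the dataset.
    demo = make_variants("primary_diagnosis", ["value_a", "value_b", "value_c"], n=3)
    for pair in positive_pairs(demo):
        print(pair)
```

In a training loop, variants that share an anchor would be treated as positives and variants of other anchors as negatives. The LLM-based strategy is not sketched here because it depends on a model and prompt that the description does not specify.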
Provider:
vida-nyu



