anonymous-042/flyaoc

Name: anonymous-042/flyaoc
Creator: anonymous-042
Published: 2026-04-30 15:49:25
License: No description provided

Hugging Face. Updated 2026-04-30. Indexed 2026-05-03.

Download link:

https://hf-mirror.com/datasets/anonymous-042/flyaoc


Resource description:

---
license: other
pretty_name: FlyAOC
task_categories:
- text-generation
- question-answering
- text-retrieval
language:
- en
tags:
- biology
- drosophila
- genetics
- scientific-literature
- ontology-curation
- benchmark
size_categories:
- 10K<n<100K
---

# FlyAOC: Agentic Ontology Curation Benchmark

FlyAOC is a benchmark for evaluating AI agents on end-to-end ontology curation from scientific literature. Given a *Drosophila melanogaster* gene symbol, systems search a corpus of full-text papers and produce structured annotations for gene function, expression, and synonyms.

This anonymous review package contains the benchmark inputs and labels, not model prediction dumps.

## Files

| File | Description |
|---|---|
| `corpus.jsonl` | 16,898 full-text PMC-OA articles converted from BioC JSON. |
| `pmc_license_manifest.jsonl` | Per-record provenance and license metadata extracted from the BioC-PMC source files. |
| `benchmark.jsonl` | The 100 benchmark genes, with FlyBase IDs, symbols, Gene Snapshot summaries, PMCID retrieval sets, and canonical verified labels for all three tasks. |
| `ground_truth_hidden.jsonl` | Hidden-term benchmark variant labels. |
| `hidden_go_terms.json` | The GO terms hidden for the specificity-gap setting. |
| `ontologies/go-basic.obo` | Gene Ontology source used for Task 1 term lookup and semantic evaluation. |
| `ontologies/fly_anatomy.obo` | FlyBase anatomy ontology source used for Task 2 anatomy lookup and semantic evaluation. |
| `ontologies/fly_development.obo` | FlyBase developmental stage ontology source used for Task 2 stage lookup and semantic evaluation. |
| `croissant.json` | Croissant metadata with core and minimal Responsible AI fields. |

## Data Schema

Each `corpus.jsonl` record contains:

- `pmcid`: PubMed Central identifier.
- `title`: article title.
- `abstract`: article abstract.
- `sections`: mapping from section type to paragraphs, using section keys such as `INTRO`, `METHODS`, `RESULTS`, `DISCUSS`, and `CONCL`.

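
The corpus record layout above can be flattened into a single text string for indexing or prompting. The following is a minimal sketch, not part of the benchmark tooling. It assumes that each value in `sections` is a list of paragraph strings and that the listed section keys are a sensible reading order; the helper name `record_to_text` and the toy record are invented for illustration.

```python
import json

# Section keys named in the schema; using them as a reading order is an assumption.
SECTION_ORDER = ("INTRO", "METHODS", "RESULTS", "DISCUSS", "CONCL")


def record_to_text(record):
    """Join title, abstract, and section paragraphs into one string."""
    parts = [record.get("title", ""), record.get("abstract", "")]
    sections = record.get("sections") or {}
    for key in SECTION_ORDER:
        # Assumed: each section key maps to a list of paragraph strings.
        paragraphs = sections.get(key) or []
        if paragraphs:
            parts.append(key)
            parts.extend(paragraphs)
    # Skip empty pieces so missing fields do not leave blank gaps.
    return "\n\n".join(part for part in parts if part)


# Invented toy record, not taken from the corpus.
toy_line = json.dumps({
    "pmcid": "PMC0000000",
    "title": "Toy title",
    "abstract": "Toy abstract.",
    "sections": {
        "INTRO": ["First paragraph."],
        "RESULTS": ["A result.", "Another result."],
    },
})
toy_text = record_to_text(json.loads(toy_line))
```

Section keys outside `SECTION_ORDER` are ignored by this sketch; a real pipeline may want to keep them.
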

Each `benchmark.jsonl` record contains one gene:

- `gene_id`, `gene_symbol`, `summary`, `pmcids`
- `task1_function`: Gene Ontology annotations with GO ID, qualifier, aspect, evidence reference, and corpus-grounding fields.
- `task2_expression`: expression annotations with anatomy/stage ontology IDs, assay metadata, evidence reference, and corpus-grounding fields.
- `task3_synonyms`: full-name and symbol synonyms with corpus-grounding fields.

## Intended Use

FlyAOC is intended for evaluating systems that retrieve and synthesize structured biological annotations from a large literature corpus. The primary use case is benchmark evaluation of curation agents under controlled retrieval budgets. The dataset is not intended to train production biomedical systems without additional validation by domain experts.

## Provenance and Annotation

The literature corpus was retrieved from the PubMed Central Open Access subset via the BioC-PMC API. Benchmark labels are derived from FlyBase release FB2025_04 and then annotated with corpus-grounding labels that indicate whether the supporting source is present in the provided corpus.

The included ontology files define the controlled vocabularies used by the benchmark tools and semantic evaluation. The hidden-term variant removes selected GO terms from ontology search to test whether systems can describe missing concepts when no suitable ontology term is available.

## License and Access

This package has mixed provenance and should not be treated as having a single blanket license.

- Literature records come from the PubMed Central Open Access subset. Article licenses vary by paper; see `pmc_license_manifest.jsonl` for per-record license metadata.
- FlyBase-derived benchmark labels and FlyBase ontology files are based on FlyBase data released under CC-BY 4.0.
- Gene Ontology files are released under CC-BY 4.0.
- Users are responsible for following the terms associated with each source record.


Users with stricter licensing requirements may use the PMCID manifest to re-fetch source articles from PMC directly.

## Responsible AI Notes

### Limitations

The benchmark covers 100 well-studied *Drosophila* genes and open-access literature available through PMC-OA. It does not represent all genes, all organisms, non-English literature, paywalled papers, unpublished curation evidence, or all valid biological annotations.

### Biases

The corpus reflects publication and open-access biases in the scientific record. Well-studied genes, English-language publications, and journals indexed in PMC-OA are overrepresented. FlyBase labels reflect expert curation priorities and may lag newer literature.

### Sensitive Information

The dataset contains scientific articles and biological database annotations. It is not designed to contain human-subject records, demographic attributes, or private personal information. Some source articles may include author names, affiliations, and acknowledgments as part of the public scholarly record.

### Social Impact

The benchmark may help improve tools that assist biological database curation and scientific literature review. Misuse risks include over-trusting automated annotations or deploying systems without expert review. FlyAOC should be used as an evaluation resource, not as a substitute for professional biological curation.

### Synthetic Data

The corpus and benchmark labels are not synthetic. Model-generated predictions are not included in this dataset package.

## Loading

```python
from datasets import load_dataset

corpus = load_dataset("json", data_files="corpus.jsonl")["train"]
benchmark = load_dataset("json", data_files="benchmark.jsonl")["train"]
```

For review, the intended hosted dataset path is:

```python
from datasets import load_dataset

corpus = load_dataset("anonymous-042/flyaoc", data_files="corpus.jsonl")["train"]
benchmark = load_dataset("anonymous-042/flyaoc", data_files="benchmark.jsonl")["train"]
```
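
Where the `datasets` library is not available, the same files can be read with the standard library alone. The sketch below joins each gene's `pmcids` retrieval set against the corpus by `pmcid`. It assumes that `pmcids` is a list of strings matching the corpus `pmcid` values; the function names and the toy lines are invented for illustration.

```python
import json


def parse_jsonl(lines):
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def retrieval_sets(corpus_lines, benchmark_lines):
    """Map each gene symbol to the corpus records named in its `pmcids`."""
    corpus_by_pmcid = {rec["pmcid"]: rec for rec in parse_jsonl(corpus_lines)}
    return {
        gene["gene_symbol"]: [
            corpus_by_pmcid[pmcid]
            for pmcid in gene.get("pmcids", [])
            if pmcid in corpus_by_pmcid  # skip IDs missing from the corpus
        ]
        for gene in parse_jsonl(benchmark_lines)
    }


# Invented toy lines standing in for the two files.
toy_corpus = [
    '{"pmcid": "PMC1", "title": "A"}',
    '{"pmcid": "PMC2", "title": "B"}',
    "",
]
toy_benchmark = ['{"gene_symbol": "toy-gene", "pmcids": ["PMC2", "PMC9"]}']
toy_sets = retrieval_sets(toy_corpus, toy_benchmark)
```

With the real files, open file handles can be passed directly, for example `retrieval_sets(open("corpus.jsonl", encoding="utf-8"), open("benchmark.jsonl", encoding="utf-8"))`. This holds the whole corpus in memory, which may be too heavy for full-text records on small machines.
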

FlyAOC is a benchmark for evaluating AI agents on end-to-end ontology curation from scientific literature. Given a Drosophila melanogaster gene symbol, systems search a corpus of full-text papers and produce structured annotations for gene function, expression, and synonyms. The dataset includes 16,898 full-text PMC-OA articles, detailed information on 100 benchmark genes, hidden-term benchmark variant labels, and Gene Ontology and FlyBase anatomy ontology source files for Task 1 and Task 2. The primary use case is benchmark evaluation of curation agents under controlled retrieval budgets, not for training production biomedical systems. The dataset is derived from the PubMed Central Open Access subset and FlyBase, with mixed provenance and varying license requirements.

Provider:

anonymous-042

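Collected summary

Dataset introduction

Construction

FlyAOC was built to address the need for automated functional annotation of Drosophila melanogaster genes. The literature corpus was retrieved from the PubMed Central Open Access subset through the BioC-PMC API: 16,898 full-text articles, converted from BioC JSON into structured JSON records. The benchmark labels are derived from FlyBase release FB2025_04 and cover 100 selected Drosophila genes, with Gene Ontology (GO) function annotations, expression patterns (using anatomy and developmental-stage ontologies), and synonyms. Each label also carries corpus-grounding fields that indicate whether its supporting source is present in the provided corpus, which keeps evaluation closed over the corpus. In addition, a hidden-term variant removes selected GO terms from ontology search to test whether systems can describe missing concepts when no suitable ontology term is available.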

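Features

FlyAOC's defining feature is that it benchmarks ontology curation end to end. Its dataset card lists three task categories (text generation, question answering, and text retrieval), and the benchmark evaluates how well AI agents retrieve, synthesize, and structure biological annotations from scientific literature. Corpus records are divided by section type (introduction, methods, results, discussion, and conclusion), and the package ships the ontology files used for semantic evaluation: the Gene Ontology plus the FlyBase anatomy and developmental-stage ontologies. The hidden-term variant deliberately hides selected GO terms so that systems must describe a concept when no suitable term is available, a setting this description presents as rare among existing benchmarks. The package also documents its mixed licensing and includes Responsible AI notes on limitations, biases, and intended use.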
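
One way to picture the hidden-term setting is an ontology search tool that skips a set of masked term IDs. The sketch below is an illustration under stated assumptions, not the benchmark's own tool: it assumes `hidden_go_terms.json` parses to a flat list of term IDs, the function `search_terms` is hypothetical, and the toy vocabulary uses placeholder IDs rather than real GO terms.

```python
def search_terms(query, terms, hidden_ids=frozenset()):
    """Case-insensitive substring search over term names, skipping hidden IDs."""
    needle = query.lower()
    return sorted(
        term_id
        for term_id, name in terms.items()
        if term_id not in hidden_ids and needle in name.lower()
    )


# Invented toy vocabulary; IDs and names are placeholders, not real GO terms.
toy_terms = {
    "TOY:0001": "signaling",
    "TOY:0002": "toy receptor signaling pathway",
    "TOY:0003": "toy kinase activity",
}
# Assumed shape: the hidden-term file parses to a flat list of term IDs.
toy_hidden = frozenset(["TOY:0002"])

full_hits = search_terms("signaling", toy_terms)
masked_hits = search_terms("signaling", toy_terms, toy_hidden)
```

In the masked search the more specific term no longer appears, so a system can only return the broader term or describe the missing concept in free text.
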

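Usage

Evaluation with FlyAOC starts by loading two core files, the corpus (corpus.jsonl) and the benchmark (benchmark.jsonl), with the Hugging Face Datasets library. Calling `load_dataset("json", data_files="corpus.jsonl")` and `load_dataset("json", data_files="benchmark.jsonl")` and selecting the `train` split gives the 16,898 full-text articles and the records for the 100 benchmark genes. For each gene symbol, a system retrieves relevant papers from the corpus and outputs structured annotations for function, expression, and synonyms in the format defined by the benchmark. The bundled OBO ontology files support semantic-similarity scoring of the predicted terms. The benchmark is designed for evaluation under controlled retrieval budgets; it should not be used to train production biomedical systems without additional expert validation, and is better suited as a yardstick for AI curation ability.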
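
The description does not say how a retrieval budget is measured. As a rough sketch, assuming the budget counts full-text reads per gene, a thin wrapper around the corpus can enforce it. The class names `BudgetedCorpus` and `BudgetExceeded` and the toy records are hypothetical.

```python
class BudgetExceeded(Exception):
    """Raised when an agent asks for more papers than the budget allows."""


class BudgetedCorpus:
    """Wrap a pmcid -> record mapping and count full-text reads."""

    def __init__(self, records_by_pmcid, max_reads):
        self._records = records_by_pmcid
        self.max_reads = max_reads
        self.reads_used = 0

    def read(self, pmcid):
        # Assumed budget unit: one charge per full-text read.
        if self.reads_used >= self.max_reads:
            raise BudgetExceeded(f"budget of {self.max_reads} reads is spent")
        record = self._records[pmcid]  # KeyError for unknown IDs, no charge
        self.reads_used += 1
        return record


# Invented toy records, not taken from the corpus.
toy_corpus = BudgetedCorpus(
    {"PMC1": {"title": "A"}, "PMC2": {"title": "B"}},
    max_reads=1,
)
first = toy_corpus.read("PMC1")
try:
    toy_corpus.read("PMC2")
    exhausted = False
except BudgetExceeded:
    exhausted = True
```

Creating a fresh wrapper per gene keeps budgets independent across the 100 benchmark genes.
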

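Background and challenges

Background overview

Automatically extracting structured knowledge from the scientific literature is a long-standing challenge in bioinformatics, and Gene Ontology (GO) curation is a central example. FlyAOC, published by an anonymous research team in 2026, is a benchmark for evaluating AI agents on end-to-end ontology curation. It focuses on gene function, expression pattern, and synonym annotations for the model organism Drosophila melanogaster. Its core research question is whether a system can, within a limited retrieval budget, accurately retrieve and synthesize structured biological annotations from an open-access corpus of 16,898 full-text articles. Because its labels come from FlyBase, FlyAOC offers a reproducible evaluation framework intended to support the development of automated literature-curation tools and to reduce the cost of manual curation.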

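Current challenges

The domain problem FlyAOC addresses is the limited automation of ontology curation from biological literature: manual expert curation cannot keep pace with the rapid growth of publications, so agents that can read complex scientific text and produce standardized annotations are needed. Building the benchmark involved several difficulties. First, although the corpus comes from the PMC Open Access subset, article licenses vary, so copyright compliance and provenance have to be tracked per record. Second, the labels come from existing FlyBase curation but need additional corpus-grounding labels to verify whether the supporting evidence is in the corpus, and some GO terms are deliberately hidden to test how systems describe a concept when the standard term is missing. Third, the function, expression, and synonym tasks span several ontologies, and evaluation requires semantic-similarity scoring in addition to exact matching, which tests retrieval and reasoning together.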
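
The description does not specify which semantic-similarity measure the benchmark uses. As one possible illustration, the sketch below reads `is_a` links from OBO `[Term]` stanzas and scores two terms by the Jaccard overlap of their ancestor sets. The choice of metric, the function names, and the toy ontology (with placeholder `TOY:` IDs) are assumptions made for this example.

```python
def parse_is_a(obo_text):
    """Return {term_id: set(parent_ids)} from `[Term]` stanzas in OBO text."""
    parents = {}
    current = None
    in_term = False
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only `[Term]` stanzas are read.
            in_term = line == "[Term]"
            current = None
        elif in_term and line.startswith("id: "):
            current = line[4:].strip()
            parents.setdefault(current, set())
        elif in_term and current and line.startswith("is_a: "):
            # Drop the trailing "! name" comment that OBO allows.
            parents[current].add(line[6:].split("!")[0].strip())
    return parents


def ancestors(term_id, parents):
    """All terms reachable via is_a, including the term itself."""
    seen = set()
    stack = [term_id]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def jaccard_similarity(term_a, term_b, parents):
    """Shared ancestors divided by all ancestors of either term."""
    anc_a, anc_b = ancestors(term_a, parents), ancestors(term_b, parents)
    return len(anc_a & anc_b) / len(anc_a | anc_b)


# Invented toy ontology; the IDs are placeholders, not real GO terms.
TOY_OBO = """\
format-version: 1.2

[Term]
id: TOY:0001
name: root

[Term]
id: TOY:0002
name: child a
is_a: TOY:0001 ! root

[Term]
id: TOY:0003
name: child b
is_a: TOY:0001 ! root

[Term]
id: TOY:0004
name: grandchild of a
is_a: TOY:0002 ! child a
"""

toy_parents = parse_is_a(TOY_OBO)
sibling_score = jaccard_similarity("TOY:0002", "TOY:0003", toy_parents)
lineage_score = jaccard_similarity("TOY:0004", "TOY:0002", toy_parents)
```

A parent and its child score higher than two siblings here, which is the behavior a partial-credit metric generally wants. The sketch ignores `relationship:` links such as `part_of`, obsolete terms, and trailing modifiers on `is_a` lines.
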

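Common scenarios

Classic use case

At the intersection of biomedical natural language processing and knowledge-graph construction, FlyAOC is designed as a benchmark for AI agents performing end-to-end ontology curation. Given a Drosophila gene symbol, a system retrieves information from a corpus of 16,898 full-text papers and completes three annotation tasks: Gene Ontology function annotation, expression annotation by anatomy and developmental stage, and synonym identification. This mirrors the workflow of professional biological database maintenance, in which scattered literature evidence is turned into structured, controlled-vocabulary annotations, and it provides a standardized test bed for literature-driven automated curation.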

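Derived related work

According to this listing, FlyAOC has prompted follow-up work on automated curation of biological literature. On the modeling side, researchers use it as an evaluation platform for adapting retrieval-augmented large language models to structured biological knowledge extraction, for example with multi-agent frameworks whose agents separately handle literature retrieval, evidence reasoning, and ontology term matching. On the methods side, the benchmark has led to research on in-context learning and domain-specific ontology alignment; its hidden-term variant in particular has motivated study of how AI systems produce new descriptions for concepts that have no term. The listing also mentions shared tasks and public leaderboards built around FlyAOC, which encourage annotation methods that combine semantic-similarity scoring with literature grounding.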

Recent research on the dataset