Knowledge Extraction Dataset for Pacific White Shrimp (Litopenaeus vannamei) Aquaculture

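Name: Knowledge Extraction Dataset for Pacific White Shrimp (Litopenaeus vannamei) Aquaculture (南美白对虾养殖领域知识抽取数据集)
Creator: Guangdong Ocean University (广东海洋大学)
Published: 2025-10-16
License: not specified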

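Guangdong Provincial Data Intellectual Property Evidence Deposit and Registration Platform (广东省数据知识产权存证登记平台), updated 2025-10-16, indexed 2026-04-17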

Download link:

https://data.gpic.gd.cn/dataStorage/credentialInfo.jhtml?no=20250744000016575

Resource overview:

VIE (the Knowledge Extraction Dataset for Pacific White Shrimp (Litopenaeus vannamei) Aquaculture) is the first Chinese-language knowledge extraction dataset dedicated to this domain. It was built from roughly the past ten years of Peking University Core and CSCD journal articles indexed in CNKI, together with professional books: 67 papers and 2 monographs, for a corpus of 385,000 characters. Construction was top-down and driven by a domain ontology, which defines 10 core entity types (research object, disease, feed, and others) and 10 semantic relations (regulation, symbiosis, influence, and others). Annotation was carried out under the guidance of aquaculture experts using the BIO scheme, with an inter-annotator agreement of Cohen's κ = 0.87, and yielded 12,814 entities and 5,498 relations. In a validation run with a BERT-BiLSTM-CRF model, named entity recognition (NER) reached an F1 score of 82.8%, which the providers cite as evidence of the dataset's validity and practical usefulness. According to the providers, the dataset fills a gap in Chinese knowledge extraction for aquaculture and supplies key data for applications such as smart fisheries, knowledge graph construction, and intelligent question answering for aquaculture.
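Because the annotations follow the BIO scheme, a consumer of the dataset will usually need to turn tag sequences back into entity spans. The listing does not describe the file format or the tag names, so the sketch below is only an illustration: it assumes character-level tokens, and the labels `DIS` (disease) and `OBJ` (research object), the helper `bio_to_spans`, and the example sentence are all hypothetical.

```python
from typing import List, Tuple

# (label, start, end_exclusive, text)
Span = Tuple[str, int, int, str]


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Span]:
    """Collect entity spans from a BIO tag sequence.

    A span opens at "B-X" and continues over following "I-X" tags with the
    same label X. Stray "I-" tags that do not continue an open span are ignored.
    """
    spans: List[Span] = []
    start, label = None, None
    # A trailing sentinel "O" closes a span that runs to the end of the sentence.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((label, start, i, "".join(tokens[start:i])))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


# Hypothetical example sentence, tagged one character at a time.
tokens = list("白斑综合征影响南美白对虾")
tags = ["B-DIS"] + ["I-DIS"] * 4 + ["O", "O"] + ["B-OBJ"] + ["I-OBJ"] * 4

print(bio_to_spans(tokens, tags))
# [('DIS', 0, 5, '白斑综合征'), ('OBJ', 7, 12, '南美白对虾')]
```

The end index is exclusive so that `tokens[start:end]` recovers the entity text directly.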

Provider:

Guangdong Ocean University

Created:

2025-10-16

Collected summary

Dataset introduction

Background and challenges

Background overview

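This is the first Chinese-language knowledge extraction dataset focused on Pacific white shrimp aquaculture. It was built from roughly the past ten years of Peking University Core and CSCD journal articles indexed in CNKI, together with professional books, and covers 67 papers and 2 monographs for a corpus of 385,000 characters. Construction was top-down and driven by a domain ontology that defines 10 core entity types and 10 semantic relations. Expert-guided annotation produced 12,814 entities and 5,498 relations, and named entity recognition reached an F1 score of 82.8%. The dataset fills a gap in Chinese knowledge extraction for aquaculture and can supply key data for applications such as smart fisheries, knowledge graph construction, and intelligent question answering for aquaculture.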
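The listing reports an NER F1 score but does not say how it was computed. As a point of reference only, the sketch below shows one common convention, strict entity-level scoring, in which a predicted entity counts as correct only when its label and both boundaries match a gold entity exactly. The function name `entity_prf`, the labels, and the example spans are hypothetical and are not taken from the dataset.

```python
from typing import Set, Tuple

# (label, start, end_exclusive)
Span = Tuple[str, int, int]


def entity_prf(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Strict entity-level precision, recall, and F1 over sets of spans."""
    tp = len(gold & pred)  # exact matches on label and both boundaries
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Hypothetical gold and predicted spans for one sentence.
gold = {("DIS", 0, 5), ("OBJ", 7, 12), ("FEED", 15, 17)}
pred = {("DIS", 0, 5), ("OBJ", 7, 11)}  # second span has a boundary error

p, r, f1 = entity_prf(gold, pred)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# precision=0.500 recall=0.333 f1=0.400
```

Under this convention a boundary error costs both a false positive and a false negative, so strict scores are typically lower than token-level ones.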

The content above was collected and summarized by 遇见数据集.