LULCC-KnowText - annotated entities for knowledge extraction on Land Use and Land Cover change

Name: LULCC-KnowText - annotated entities for knowledge extraction on Land Use and Land Cover change
Creator: CIRAD Dataverse
Published: 2026-04-02 12:25:40
License: No description available

Source: DataCite Commons. Updated: 2026-04-02. Indexed: 2026-04-25.

Download link:

https://dataverse.cirad.fr/citation?persistentId=doi:10.18167/DVN1/FRQGID

Description:

This dataset contains a subset of relevant sentences extracted from the dataset "LULCC-KnowText - annotated text segments for knowledge extraction on Land Use and Land Cover change" (https://doi.org/10.18167/DVN1/F0HLEH). These sentences were annotated at the entity level, including 16 types of entities. The dataset consists of 2 files:

- annotated_entities.jsonl: A JSON lines file in which each entry corresponds to a text segment extracted from a scientific document and enriched with manually annotated entities for knowledge extraction. The top-level keys include:
  - id_segment: Identifier of the text segment manually labelled. This identifier links the segment to its corresponding scientific article.
  - text: Raw text of the segment as it appears in the original scientific article.
  - entities: List of entities manually annotated within the text segment.

  Each element in the entities list represents a single annotated entity and contains the following fields:
  - id: Unique identifier of the entity annotation within the segment.
  - label: Semantic category assigned to the entity. Labels correspond to domain-specific concepts related to land use and land cover (e.g. LOC, LOC_LANDSCAPE, LULC, PRACTICE, CHANGE_UP, etc.).
  - start: Start character offset of the entity in the text field (inclusive).
  - end: End character offset of the entity in the text field (exclusive).
  - value: Text span corresponding exactly to the annotated entity.
- entity_annotation_guidelines.pdf: The annotation guidelines used to manually annotate the entities.
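The record layout described above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the helper names (`iter_segments`, `misaligned_entities`, `label_counts`) are hypothetical, and the sample record is made up for illustration and is not taken from the dataset. It also assumes that `start` and `end` count Python string characters (Unicode code points) and does not depend on the type of the `id` field, neither of which the description specifies.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_segments(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty JSON Lines row."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def misaligned_entities(segment: dict) -> list:
    """Return entities whose [start, end) slice of `text` differs from `value`."""
    text = segment["text"]
    return [
        entity
        for entity in segment["entities"]
        if text[entity["start"]:entity["end"]] != entity["value"]
    ]


def label_counts(segments: Iterable[dict]) -> Counter:
    """Count annotated entities per label across all segments."""
    return Counter(
        entity["label"] for segment in segments for entity in segment["entities"]
    )


# Made-up record that follows the documented keys; the text, identifiers and
# label assignments are illustrative assumptions, not real dataset content.
sample_line = json.dumps({
    "id_segment": "example-0001",
    "text": "Forest cover declined in the Amazon basin.",
    "entities": [
        {"id": 1, "label": "LULC", "start": 0, "end": 12, "value": "Forest cover"},
        {"id": 2, "label": "LOC", "start": 29, "end": 41, "value": "Amazon basin"},
    ],
})

segments = list(iter_segments([sample_line]))
print(misaligned_entities(segments[0]))
print(label_counts(segments))

# With the real file, the same helpers would be fed from disk, for example:
#   with open("annotated_entities.jsonl", encoding="utf-8") as handle:
#       segments = list(iter_segments(handle))
```

Checking that each slice matches `value` is a cheap way to confirm that the offsets are interpreted as intended before using them for training or evaluation.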


Provider:

CIRAD Dataverse

Created:

2026-03-30
