LULCC-KnowText - annotated entities for knowledge extraction on Land Use and Land Cover change
DataCite Commons | Updated 2026-04-02 | Indexed 2026-04-25
Download link:
https://dataverse.cirad.fr/citation?persistentId=doi:10.18167/DVN1/FRQGID
Resource description:
<p>This dataset contains a subset of relevant sentences extracted from the dataset "LULCC-KnowText - annotated text segments for knowledge extraction on Land Use and Land Cover change" (<a href="https://doi.org/10.18167/DVN1/F0HLEH">https://doi.org/10.18167/DVN1/F0HLEH</a>). These sentences were annotated at the entity level, using 16 entity types.</p>
<p>The dataset consists of 2 files:</p>
<ul>
<li><strong>annotated_entities.jsonl</strong>: A JSON Lines file in which each entry corresponds to a text segment extracted from a scientific document and enriched with manually annotated entities for knowledge extraction. The top-level keys are:
<ul> <li>id_segment: Identifier of the manually labelled text segment. This identifier links the segment to its corresponding scientific article.</li>
<li>text: Raw text of the segment as it appears in the original scientific article.</li>
<li>entities: List of entities manually annotated within the text segment.</li></ul>
Each element in the entities list represents a single annotated entity and contains the following fields:
<ul>
<li>id: Unique identifier of the entity annotation within the segment.</li>
<li>label: Semantic category assigned to the entity. Labels correspond to domain-specific concepts related to land use and land cover (e.g. LOC, LOC_LANDSCAPE, LULC, PRACTICE, CHANGE_UP).</li>
<li> start: Start character offset of the entity in the text field (inclusive).</li>
<li>end: End character offset of the entity in the text field (exclusive).</li>
<li>value: Text span corresponding exactly to the annotated entity.</li>
</ul>
</li>
<li><strong>entity_annotation_guidelines.pdf</strong>: The annotation guidelines used to manually annotate the entities.</li>
</ul>
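<p>The record layout described above can be read and sanity-checked with a few lines of Python. The following is a minimal sketch, not part of the dataset: the sample record, its identifier format, and the integer entity ids are invented for illustration, and the helper names (read_segments, check_offsets, label_counts) are hypothetical. It also assumes that start and end count Unicode code points, which is how Python string slicing works; the dataset description does not state the offset unit.</p>

```python
import io
import json
from collections import Counter

# Hypothetical sample record following the documented schema. The text,
# the id_segment format, and the integer ids are invented for illustration.
SAMPLE_JSONL = json.dumps({
    "id_segment": "doc1_seg3",
    "text": "Forest cover declined in the Amazon basin.",
    "entities": [
        {"id": 0, "label": "LULC", "start": 0, "end": 12, "value": "Forest cover"},
        {"id": 1, "label": "LOC", "start": 29, "end": 41, "value": "Amazon basin"},
    ],
}) + "\n"


def read_segments(lines):
    """Yield one parsed record per non-empty JSON Lines row."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def check_offsets(segment):
    """Return entities whose [start, end) slice of `text` differs from `value`."""
    text = segment["text"]
    return [
        entity
        for entity in segment["entities"]
        if text[entity["start"]:entity["end"]] != entity["value"]
    ]


def label_counts(segments):
    """Count how often each entity label occurs across all segments."""
    counts = Counter()
    for segment in segments:
        counts.update(entity["label"] for entity in segment["entities"])
    return counts


# Parse the in-memory sample; a real file object can be passed the same way.
segments = list(read_segments(io.StringIO(SAMPLE_JSONL)))
mismatches = [entity for seg in segments for entity in check_offsets(seg)]
counts = label_counts(segments)
```

<p>To run the same checks on the published file, one would pass an open file handle instead of the in-memory sample, for example open("annotated_entities.jsonl", encoding="utf-8"). A non-empty mismatches list would suggest that the offsets use a different unit (such as bytes or UTF-16 code units) than this sketch assumes.</p>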
Provider:
CIRAD Dataverse
Created:
2026-03-30



