
LULCC-KnowText - annotated entities for knowledge extraction on Land Use and Land Cover change

Source: DataCite Commons | Updated: 2026-04-02 | Indexed: 2026-04-25
Download link:
https://dataverse.cirad.fr/citation?persistentId=doi:10.18167/DVN1/FRQGID
Description:
This dataset contains a subset of relevant sentences extracted from the dataset "LULCC-KnowText - annotated text segments for knowledge extraction on Land Use and Land Cover change" (https://doi.org/10.18167/DVN1/F0HLEH). These sentences were annotated at the entity level, including 16 types of entities.

The dataset consists of 2 files:

- **annotated_entities.jsonl**: A JSON Lines file in which each entry corresponds to a text segment extracted from a scientific document and enriched with manually annotated entities for knowledge extraction. The top-level keys include:
  - id_segment: Identifier of the text segment manually labelled. This identifier links the segment to its corresponding scientific article.
  - text: Raw text of the segment as it appears in the original scientific article.
  - entities: List of entities manually annotated within the text segment.

  Each element in the entities list represents a single annotated entity and contains the following fields:
  - id: Unique identifier of the entity annotation within the segment.
  - label: Semantic category assigned to the entity. Labels correspond to domain-specific concepts related to land use and land cover (e.g. LOC, LOC_LANDSCAPE, LULC, PRACTICE, CHANGE_UP, etc.).
  - start: Start character offset of the entity in the text field (inclusive).
  - end: End character offset of the entity in the text field (exclusive).
  - value: Text span corresponding exactly to the annotated entity.
- **entity_annotation_guidelines.pdf**: The annotation guidelines used to manually annotate the entities.
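The record layout described above can be read and sanity-checked with a few lines of Python. The following is a minimal sketch, not the dataset authors' tooling: the helper names `iter_records` and `check_offsets` are hypothetical, and the sample record (its `id_segment` value, its text, and the integer `id` values) is invented for illustration. Only the field names and the inclusive-start, exclusive-end offset convention come from the description.

```python
import json


def iter_records(lines):
    """Yield one parsed record per non-empty JSON Lines row."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def check_offsets(record):
    """Return the entities whose text[start:end] slice differs from `value`.

    Relies on the documented convention: `start` inclusive, `end` exclusive,
    which matches Python slicing directly.
    """
    text = record["text"]
    return [
        ent for ent in record["entities"]
        if text[ent["start"]:ent["end"]] != ent["value"]
    ]


# Hypothetical record: identifier, text, and id values are made up for this sketch.
sample_line = json.dumps({
    "id_segment": "doc1_seg3",
    "text": "Forest cover declined in Madagascar.",
    "entities": [
        {"id": 1, "label": "LULC", "start": 0, "end": 12, "value": "Forest cover"},
        {"id": 2, "label": "LOC", "start": 25, "end": 35, "value": "Madagascar"},
    ],
})

records = list(iter_records([sample_line]))
mismatches = check_offsets(records[0])  # empty when every offset matches its value
```

To run the same check on the real file, pass an open file handle instead of the sample list, for example `iter_records(open("annotated_entities.jsonl", encoding="utf-8"))`, assuming the file has been downloaded to the working directory.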

Provider:
CIRAD Dataverse
Created:
2026-03-30