IT-ZBMED/Agriculture_NER_Dataset_for_FAIR_Metadata_Enrichment
Source: Hugging Face · Updated 2025-12-09 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/IT-ZBMED/Agriculture_NER_Dataset_for_FAIR_Metadata_Enrichment
Resource description:
---
license: cc-by-4.0
task_categories:
- token-classification
language:
- en
- de
tags:
- agriculture
- Named_entity_recognition
- NER
- LLM
- Crops
- soil
- time_statement
- location
pretty_name: agri_fair_metadata_ner
size_categories:
- 1K<n<10K
configs:
- config_name: doc_split
  data_files:
  - split: train
    path: doc_split/train-*
  - split: test
    path: doc_split/test-*
- config_name: sentence_split
  data_files:
  - split: train
    path: sentence_split/train-*
  - split: test
    path: sentence_split/test-*
dataset_info:
- config_name: doc_split
  features:
  - name: file_name
    dtype: string
  - name: Tokens
    sequence: string
  - name: ner_tags
    sequence: int64
  - name: Labels
    sequence: string
  - name: number_of_tokens
    dtype: int64
  - name: Language
    dtype: string
  - name: source
    dtype: string
  - name: Label_counts
    dtype: string
  - name: number_of_annotations
    dtype: int64
  - name: doi
    dtype: string
  splits:
  - name: train
    num_bytes: 1830788
    num_examples: 318
  - name: test
    num_bytes: 202721
    num_examples: 31
  download_size: 299399
  dataset_size: 2033509
- config_name: sentence_split
  features:
  - name: file_name
    dtype: string
  - name: Tokens
    sequence: string
  - name: ner_tags
    sequence: int64
  - name: Labels
    sequence: string
  - name: number_of_tokens
    dtype: int64
  - name: Language
    dtype: string
  - name: source
    dtype: string
  - name: Label_counts
    dtype: string
  - name: number_of_annotations
    dtype: int64
  - name: doi
    dtype: string
  - name: sentence_id
    dtype: string
  splits:
  - name: train
    num_bytes: 2137912
    num_examples: 2722
  - name: test
    num_bytes: 240463
    num_examples: 319
  download_size: 374603
  dataset_size: 2378375
---
# Dataset Card for A Manually Annotated Agricultural Dataset for AI-Based NER and FAIR Metadata Enrichment
Supported by [FAIRagro](https://fairagro.net/en/), the pilot use case “Increasing FAIRness of FAIRagro data through AI-supported metadata enrichment” creates a manually annotated text corpus designed to support Named Entity Recognition (NER) models in agricultural research. NER models can automate metadata extraction from unstructured text, such as dataset abstracts, thereby enabling metadata enrichment.
## Dataset Details
### Dataset Description
This dataset contains Named Entity Recognition (NER) annotations derived from curated **CAS XMI files exported from INCEpTION**.
The corpus is provided in **two complementary formats**, each optimized for different model architectures and evaluation scenarios:
1. **Document-level tokenized (file-based) CSV**
2. **Sentence-level tokenized CSV**
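The document-level and sentence-level formats correspond to the `doc_split` and `sentence_split` configurations declared in the card metadata. The following is a minimal sketch for loading one of them with the Hugging Face `datasets` library. It assumes `datasets` is installed and the Hub is reachable; the helper name `load_config` is illustrative and not part of the dataset.

```python
REPO_ID = "IT-ZBMED/Agriculture_NER_Dataset_for_FAIR_Metadata_Enrichment"
CONFIGS = ("doc_split", "sentence_split")  # config names from the card metadata


def load_config(config_name: str, split: str = "train"):
    """Load one split of one configuration (hypothetical helper)."""
    if config_name not in CONFIGS:
        raise ValueError(f"unknown config {config_name!r}; expected one of {CONFIGS}")
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split=split)
```

For example, `load_config("sentence_split", "test")` would request the sentence-level test split, which the card metadata lists with 319 examples.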
## Annotation Entities, Their Attributes, and Definitions
| **Entity** | **Attribute** | **Definition** |
|----------------|----------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Crop** | Crop species | The name of a taxonomic rank of a plant. This can either be a scientific name or a common name. Each mention of such a name is a distinct annotation. Consult taxonomies such as AGROVOC (https://agrovoc.fao.org/browse/agrovoc/en/) for reference. |
| **Crop** | Crop variety | The name of a specific variety of a plant. |
| **Soil** | Soil texture | Soil texture measures the proportion of sand, silt, and clay-sized particles in a soil sample. Annotate a soil texture if it is part of a soil texture classification, such as the USDA Soil Texture Classification (12 soil textures) or the textures from the Bodenkundliche Kartieranleitung. |
| **Soil** | Soil reference group | A categorization of soil groups following the WRB Reference Soil Group (RSG) definitions (https://inspire.ec.europa.eu/codelist/WRBReferenceSoilGroupValue). |
| **Soil** | Soil depth | Soil depth measures the depth from which a soil sample was taken. |
| **Soil** | Bulk density | The dry weight of soil divided by its volume. |
| **Soil** | pH value | Hydrogen ion concentration in a soil sample. |
| **Soil** | Organic carbon | Measurable components of soil organic matter in a soil sample. |
| **Soil** | Available nitrogen | Nitrogen that is present in a soil sample and available to plants. Only annotate explicit mentions of available nitrogen and ensure the reference is to soil nitrogen, not fertilizer nitrogen. |
| **Location** | Location name | The name of a location related to a dataset. These may include continents (e.g., “Europe”), countries (e.g., “Germany”), federal states (e.g., “Lower Saxony”), regions (e.g., “Kraichgau”), cities, villages, towns (e.g., “Quedlinburg”), or municipalities (e.g., “Grossbeeren”). |
| **Location** | Latitude | The north–south angular position of a location. Annotate the coordinate(s). |
| **Location** | Longitude | The west–east angular position of a location. Annotate the coordinate(s). |
| **Time statement** | Start time | A point in time when an event related to a dataset started (e.g., data collection). This may be a date, a season, or a combination. Annotate all relevant points if multiple events exist. If only one time point is known, use this property. |
| **Time statement** | End time | A point in time when an event related to a dataset ended (e.g., data collection). This may be a date, a season, or a combination. Annotate all relevant points if multiple events exist. |
| **Time statement** | Duration | A range between two time points. Use this property if start and end points are unknown. |
---
### Labels Mapping
#### Label2id
```json
{
"O": 0,
"B-soilReferenceGroup": 1,
"I-soilReferenceGroup": 2,
"B-soilOrganicCarbon": 3,
"I-soilOrganicCarbon": 4,
"B-soilTexture": 5,
"I-soilTexture": 6,
"B-startTime": 7,
"I-startTime": 8,
"B-endTime": 9,
"I-endTime": 10,
"B-city": 11,
"I-city": 12,
"B-duration": 13,
"I-duration": 14,
"B-cropSpecies": 15,
"I-cropSpecies": 16,
"B-soilAvailableNitrogen": 17,
"I-soilAvailableNitrogen": 18,
"B-soilDepth": 19,
"I-soilDepth": 20,
"B-region": 21,
"I-region": 22,
"B-country": 23,
"I-country": 24,
"B-longitude": 25,
"I-longitude": 26,
"B-latitude": 27,
"I-latitude": 28,
"B-cropVariety": 29,
"I-cropVariety": 30,
"B-soilPH": 31,
"I-soilPH": 32,
"B-soilBulkDensity": 33,
"I-soilBulkDensity": 34
}
```
#### id2label
```json
{
"0": "O",
"1": "B-soilReferenceGroup",
"2": "I-soilReferenceGroup",
"3": "B-soilOrganicCarbon",
"4": "I-soilOrganicCarbon",
"5": "B-soilTexture",
"6": "I-soilTexture",
"7": "B-startTime",
"8": "I-startTime",
"9": "B-endTime",
"10": "I-endTime",
"11": "B-city",
"12": "I-city",
"13": "B-duration",
"14": "I-duration",
"15": "B-cropSpecies",
"16": "I-cropSpecies",
"17": "B-soilAvailableNitrogen",
"18": "I-soilAvailableNitrogen",
"19": "B-soilDepth",
"20": "I-soilDepth",
"21": "B-region",
"22": "I-region",
"23": "B-country",
"24": "I-country",
"25": "B-longitude",
"26": "I-longitude",
"27": "B-latitude",
"28": "I-latitude",
"29": "B-cropVariety",
"30": "I-cropVariety",
"31": "B-soilPH",
"32": "I-soilPH",
"33": "B-soilBulkDensity",
"34": "I-soilBulkDensity"
}
```
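The two mappings are inverses of each other and follow a regular pattern: `O` is 0, and each entity type receives consecutive `B-`/`I-` ids. The sketch below rebuilds both mappings from the ordered list of entity types and decodes a list of `ner_tags`. The names `ENTITY_TYPES` and `decode_tags` are illustrative, not taken from the processing repository.

```python
# Entity types in the order used by the mapping above.
ENTITY_TYPES = [
    "soilReferenceGroup", "soilOrganicCarbon", "soilTexture", "startTime",
    "endTime", "city", "duration", "cropSpecies", "soilAvailableNitrogen",
    "soilDepth", "region", "country", "longitude", "latitude",
    "cropVariety", "soilPH", "soilBulkDensity",
]

# "O" is 0; each type gets consecutive B-/I- ids (1/2, 3/4, ...).
label2id = {"O": 0}
for index, entity_type in enumerate(ENTITY_TYPES):
    label2id[f"B-{entity_type}"] = 2 * index + 1
    label2id[f"I-{entity_type}"] = 2 * index + 2

# Integer keys here; the JSON version above uses string keys.
id2label = {tag_id: label for label, tag_id in label2id.items()}


def decode_tags(ner_tags):
    """Map a sequence of integer tag ids to BIO label strings."""
    return [id2label[tag_id] for tag_id in ner_tags]
```

Generating the mapping this way keeps `label2id` and `id2label` consistent by construction, which matters when passing them to a model configuration.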
### Dataset Versions
### 1️⃣ Document-Level Tokenized Format (File-Based)
Each row in this CSV corresponds to a **complete document**, tokenized using spaCy.
#### **Columns**
| Column | Description |
|-----------------------|------------------------------------------|
| **file_name** | Unique filename ID of the document |
| **Tokens** | List of tokens (words) |
| **Labels** | BIO labels aligned 1:1 with tokens |
| **ner_tags** | Integer mapping of labels for training |
| **number_of_tokens** | Total token count |
| **Language** | `"en"` or `"de"` |
| **source** | Origin repository (`BonaRes` or `OpenAgrar`) |
| **Label_counts**      | String representation of a `Counter` with per-label annotation frequencies |
| **number_of_annotations** | Total number of annotated entity spans |
| **doi**               | Document DOI (when available)            |
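Because `Labels` is aligned 1:1 with `Tokens`, entity spans can be recovered by grouping each `B-` tag with the `I-` tags of the same type that follow it. The sketch below shows one way to do this; it is not the authors' own counting code, `extract_spans` is a hypothetical helper, and the example sentence is invented for illustration.

```python
def extract_spans(tokens, labels):
    """Group BIO labels into (entity_type, start, end_exclusive, text) spans."""
    spans = []
    start, current = None, None
    for i, label in enumerate(labels):
        continues = label.startswith("I-") and current == label[2:]
        if not continues:
            # Close the open span, if any, before handling the new label.
            if current is not None:
                spans.append((current, start, i, " ".join(tokens[start:i])))
                start, current = None, None
            if label.startswith("B-"):
                start, current = i, label[2:]
            # An I- tag with no matching open span is ignored in this sketch.
    if current is not None:
        spans.append((current, start, len(tokens), " ".join(tokens[start:])))
    return spans


# Invented example row, not taken from the dataset.
tokens = ["Winter", "wheat", "was", "grown", "in", "Lower", "Saxony", "."]
labels = ["B-cropSpecies", "I-cropSpecies", "O", "O", "O", "B-region", "I-region", "O"]
spans = extract_spans(tokens, labels)
```

If `number_of_annotations` counts spans in this sense, it should equal `len(extract_spans(...))` for a row; that correspondence is an assumption worth checking against the data.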
#### **Intended Use**
- Document-level transformer models (Longformer, BigBird, etc.)
- Corpus statistics and label distribution analysis
- Document classification + NER pipelines
---
### 2️⃣ Sentence-Level Tokenized Format
Each row corresponds to **a single sentence**, preserving alignment with the original document.
#### **Sentence Identifier Format**
```text
fileID-sentenceIndex

Example: 73465-03
```
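To regroup sentences by document, the identifier can be split back into its two parts. This is a small sketch that assumes the sentence index is a decimal integer (zero-padded, as in the example) and that it follows the last hyphen; `parse_sentence_id` is an illustrative name.

```python
def parse_sentence_id(sentence_id: str):
    """Split 'fileID-sentenceIndex' into (file_id, sentence_index)."""
    # Split on the last hyphen so file IDs that contain hyphens still parse.
    file_id, _, index = sentence_id.rpartition("-")
    if not file_id or not index.isdigit():
        raise ValueError(f"unexpected sentence_id format: {sentence_id!r}")
    return file_id, int(index)
```

With the example above, `parse_sentence_id("73465-03")` returns `("73465", 3)`.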
#### **Columns**
Same columns as the document-level format, plus:
| Column | Description |
|-------------|------------------------------------------------------------|
| **sentence_id** | Sentence identifier combining file name and sentence index |
#### **Intended Use**
- Classical BERT-style NER (max length ≈512 tokens)
- Models with fixed-length input windows
- Fine-grained sentence-level training and evaluation
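BERT-style models operate on subword tokens, while `ner_tags` are given per word, so the tags usually need to be expanded before training. The sketch below follows a common convention: the first subword of each word keeps the tag, and remaining subwords and special tokens receive `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. It takes a `word_ids` list of the kind returned by `BatchEncoding.word_ids()` for fast tokenizers in `transformers`; the example values are hypothetical, and this is not necessarily how the authors trained their models.

```python
IGNORE_INDEX = -100  # label id skipped by PyTorch cross-entropy loss by default


def align_labels(word_ids, ner_tags):
    """Expand word-level tag ids to subword level.

    word_ids: one entry per subword giving the index of its source word,
              or None for special tokens such as [CLS] and [SEP].
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)        # special token
        elif word_id != previous:
            aligned.append(ner_tags[word_id])   # first subword keeps the tag
        else:
            aligned.append(IGNORE_INDEX)        # later subwords are masked
        previous = word_id
    return aligned


# Hypothetical example: three words, the second split into two subwords.
word_ids = [None, 0, 1, 1, 2, None]
ner_tags = [15, 16, 0]  # B-cropSpecies, I-cropSpecies, O
aligned = align_labels(word_ids, ner_tags)
# aligned == [-100, 15, 16, -100, 0, -100]
```

Masking the non-initial subwords keeps the evaluation at word level, which matches how the `Labels` column is defined.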
## Code Repository
All scripts used to preprocess the INCEpTION XMI files, generate the tokenized datasets,
convert annotations into BIO format, create JSON span annotations, and build the
HuggingFace-ready dataset version are openly available in the following GitHub repository:
🔗 **Dataset Processing Code Repository:**
https://github.com/fairagro/pilot-uc-textmining-metadata
The current version is v1.0.0.
The repository includes:
- Scripts for fetching metadata from the research data infrastructures
- CAS → BIO conversion scripts
- Sentence and document tokenization routines
- Gazetteer-based location normalization
- DOI mapping utilities
- Span annotation builder (HuggingFace-like JSON format)
- Dataset export pipeline (CSV, JSON, HuggingFace)
- Example configuration files
- A full reproducible workflow for regenerating the dataset
## Authors and Affiliations
| Name | ORCID | Affiliation |
|--------------------|---------------------|-------------------------------------------------------------------|
| **Abanoub Abdelmalak** | [0009-0001-0892-3614](https://orcid.org/0009-0001-0892-3614) | ZB MED – Information Centre for Life Sciences; University of Bonn |
| **Gabriel Schneider** | [0000-0001-6573-3115](https://orcid.org/0000-0001-6573-3115) | ZB MED – Information Centre for Life Sciences; University of Bonn |
| **Heike Riegler** | [0000-0002-1302-4533](https://orcid.org/0000-0002-1302-4533) | Julius Kühn-Institut |
| **Kristin Meier** | [0009-0003-1966-9679](https://orcid.org/0009-0003-1966-9679) | Leibniz Centre for Agricultural Landscape Research (ZALF) |
| **Xenia Specka** | [0000-0002-1890-0192](https://orcid.org/0000-0002-1890-0192) | Leibniz Centre for Agricultural Landscape Research (ZALF) |
| **Nikolai Svoboda** | [0000-0003-3860-4400](https://orcid.org/0000-0003-3860-4400) | Leibniz Centre for Agricultural Landscape Research (ZALF) |
| **Murtuza Husain** | [0009-0004-1496-5644](https://orcid.org/0009-0004-1496-5644) | ZB MED – Information Centre for Life Sciences; University of Bonn |
| **Juliane Fluck** | [0000-0003-1379-7023](https://orcid.org/0000-0003-1379-7023) | ZB MED – Information Centre for Life Sciences; University of Bonn |
## Licensing
The FAIRagro Metadata Enrichment NER Dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) License.
License URL:
https://creativecommons.org/licenses/by/4.0/
## Cite as:
```bibtex
@dataset{abdelmalak_fairagro_ner_2025,
  author    = {Abdelmalak, Abanoub and Schneider, Gabriel and Riegler, Heike and Meier, Kristin and Specka, Xenia and Svoboda, Nikolai and Husain, Murtuza and Fluck, Juliane},
  title     = {{FAIRagro NER Dataset: Increasing FAIRness of FAIRagro Data Through AI-Supported Metadata Enrichment}},
  year      = {2025},
  publisher = {Fachrepositorium Lebenswissenschaften (FRL)},
  doi       = {10.4126/FRL01-6526458},
  url       = {https://doi.org/10.4126/FRL01-6526458},
  note      = {Version 1.0}
}
```
Provider:
IT-ZBMED



