
IT-ZBMED/Agriculture_NER_Dataset_for_FAIR_Metadata_Enrichment

Source: Hugging Face. Updated 2025-12-09; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/IT-ZBMED/Agriculture_NER_Dataset_for_FAIR_Metadata_Enrichment
Resource description:
---
license: cc-by-4.0
task_categories:
- token-classification
language:
- en
- de
tags:
- agriculture
- Named_entity_recognition
- NER
- LLM
- Crops
- soil
- time_statement
- location
pretty_name: agri_fair_metadata_ner
size_categories:
- 1K<n<10K
configs:
- config_name: doc_split
  data_files:
  - split: train
    path: doc_split/train-*
  - split: test
    path: doc_split/test-*
- config_name: sentence_split
  data_files:
  - split: train
    path: sentence_split/train-*
  - split: test
    path: sentence_split/test-*
dataset_info:
- config_name: doc_split
  features:
  - name: file_name
    dtype: string
  - name: Tokens
    sequence: string
  - name: ner_tags
    sequence: int64
  - name: Labels
    sequence: string
  - name: number_of_tokens
    dtype: int64
  - name: Language
    dtype: string
  - name: source
    dtype: string
  - name: Label_counts
    dtype: string
  - name: number_of_annotations
    dtype: int64
  - name: doi
    dtype: string
  splits:
  - name: train
    num_bytes: 1830788
    num_examples: 318
  - name: test
    num_bytes: 202721
    num_examples: 31
  download_size: 299399
  dataset_size: 2033509
- config_name: sentence_split
  features:
  - name: file_name
    dtype: string
  - name: Tokens
    sequence: string
  - name: ner_tags
    sequence: int64
  - name: Labels
    sequence: string
  - name: number_of_tokens
    dtype: int64
  - name: Language
    dtype: string
  - name: source
    dtype: string
  - name: Label_counts
    dtype: string
  - name: number_of_annotations
    dtype: int64
  - name: doi
    dtype: string
  - name: sentence_id
    dtype: string
  splits:
  - name: train
    num_bytes: 2137912
    num_examples: 2722
  - name: test
    num_bytes: 240463
    num_examples: 319
  download_size: 374603
  dataset_size: 2378375
---

# Dataset Card for A Manually Annotated Agricultural Dataset for AI-Based NER and FAIR Metadata Enrichment

Supported by [FAIRagro](https://fairagro.net/en/), the pilot use case “Increasing FAIRness of FAIRagro data through AI-supported metadata enrichment” provides a manually annotated text corpus designed to support Named Entity Recognition (NER) models in agricultural research. NER models can automate metadata extraction from unstructured text, such as dataset abstracts, thereby enabling metadata enrichment.

## Dataset Details

### Dataset Description

This dataset contains Named Entity Recognition (NER) annotations derived from curated **CAS XMI files exported from INCEpTION**.

The corpus is provided in **two complementary formats**, each suited to different model architectures and evaluation scenarios:

1. **Document-level tokenized (file-based) CSV**
2. **Sentence-level tokenized CSV**

## Annotation Entities, Their Attributes, and Definitions

| **Entity** | **Attribute** | **Definition** |
|---|---|---|
| **Crop** | Crop species | The name of a taxonomic rank of a plant. This can either be a scientific name or a common name. Each mention of such a name is a distinct annotation. Consult taxonomies such as AGROVOC (https://agrovoc.fao.org/browse/agrovoc/en/) for reference. |
| **Crop** | Crop variety | The name of a specific variety of a plant. |
| **Soil** | Soil texture | Soil texture measures the proportion of sand, silt, and clay-sized particles in a soil sample. Annotate a soil texture if it is part of a soil texture classification, such as the USDA Soil Texture Classification (12 soil textures) or the textures from the Bodenkundliche Kartieranleitung. |
| **Soil** | Soil reference group | A categorization of soil groups following the WRB Reference Soil Group (RSG) definitions (https://inspire.ec.europa.eu/codelist/WRBReferenceSoilGroupValue). |
| **Soil** | Soil depth | Soil depth measures the depth from which a soil sample was taken. |
| **Soil** | Bulk density | The dry weight of soil divided by its volume. |
| **Soil** | pH value | Hydrogen ion concentration in a soil sample. |
| **Soil** | Organic carbon | Measurable components of soil organic matter in a soil sample. |
| **Soil** | Available nitrogen | Nitrogen that is present in a soil sample and available to plants. Only annotate explicit mentions of available nitrogen and ensure the reference is to soil nitrogen, not fertilizer nitrogen. |
| **Location** | Location name | The name of a location related to a dataset. These may include continents (e.g., “Europe”), countries (e.g., “Germany”), federal states (e.g., “Lower Saxony”), regions (e.g., “Kraichgau”), cities, villages, towns (e.g., “Quedlinburg”), or municipalities (e.g., “Grossbeeren”). |
| **Location** | Latitude | The north–south angular position of a location. Annotate the coordinate(s). |
| **Location** | Longitude | The west–east angular position of a location. Annotate the coordinate(s). |
| **Time statement** | Start time | A point in time when an event related to a dataset started (e.g., data collection). This may be a date, a season, or a combination. Annotate all relevant points if multiple events exist. If only one time point is known, use this property. |
| **Time statement** | End time | A point in time when an event related to a dataset ended (e.g., data collection). This may be a date, a season, or a combination. Annotate all relevant points if multiple events exist. |
| **Time statement** | Duration | A range between two time points. Use this property if start and end points are unknown. |

---

### Labels Mapping

#### Label2id

```json
{
  "O": 0,
  "B-soilReferenceGroup": 1, "I-soilReferenceGroup": 2,
  "B-soilOrganicCarbon": 3, "I-soilOrganicCarbon": 4,
  "B-soilTexture": 5, "I-soilTexture": 6,
  "B-startTime": 7, "I-startTime": 8,
  "B-endTime": 9, "I-endTime": 10,
  "B-city": 11, "I-city": 12,
  "B-duration": 13, "I-duration": 14,
  "B-cropSpecies": 15, "I-cropSpecies": 16,
  "B-soilAvailableNitrogen": 17, "I-soilAvailableNitrogen": 18,
  "B-soilDepth": 19, "I-soilDepth": 20,
  "B-region": 21, "I-region": 22,
  "B-country": 23, "I-country": 24,
  "B-longitude": 25, "I-longitude": 26,
  "B-latitude": 27, "I-latitude": 28,
  "B-cropVariety": 29, "I-cropVariety": 30,
  "B-soilPH": 31, "I-soilPH": 32,
  "B-soilBulkDensity": 33, "I-soilBulkDensity": 34
}
```

#### id2label

```json
{
  "0": "O",
  "1": "B-soilReferenceGroup", "2": "I-soilReferenceGroup",
  "3": "B-soilOrganicCarbon", "4": "I-soilOrganicCarbon",
  "5": "B-soilTexture", "6": "I-soilTexture",
  "7": "B-startTime", "8": "I-startTime",
  "9": "B-endTime", "10": "I-endTime",
  "11": "B-city", "12": "I-city",
  "13": "B-duration", "14": "I-duration",
  "15": "B-cropSpecies", "16": "I-cropSpecies",
  "17": "B-soilAvailableNitrogen", "18": "I-soilAvailableNitrogen",
  "19": "B-soilDepth", "20": "I-soilDepth",
  "21": "B-region", "22": "I-region",
  "23": "B-country", "24": "I-country",
  "25": "B-longitude", "26": "I-longitude",
  "27": "B-latitude", "28": "I-latitude",
  "29": "B-cropVariety", "30": "I-cropVariety",
  "31": "B-soilPH", "32": "I-soilPH",
  "33": "B-soilBulkDensity", "34": "I-soilBulkDensity"
}
```

### Dataset Versions

#### 1️⃣ Document-Level Tokenized Format (File-Based)

Each row in this CSV corresponds to a **complete document**, tokenized using spaCy.
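To illustrate how the integer tags relate to the BIO labels in the mapping above, here is a minimal sketch that rebuilds `id2label` from the entity names and groups a tagged token sequence into entity spans. The helper name `bio_to_spans`, the example sentence, and its tags are invented for illustration and are not taken from the corpus; the handling of a stray `I-` tag (it opens a new span) is one possible convention, not necessarily the one used by the authors.

```python
# Entity types in the order implied by the label mapping (ids 1..34).
ENTITY_TYPES = [
    "soilReferenceGroup", "soilOrganicCarbon", "soilTexture", "startTime",
    "endTime", "city", "duration", "cropSpecies", "soilAvailableNitrogen",
    "soilDepth", "region", "country", "longitude", "latitude",
    "cropVariety", "soilPH", "soilBulkDensity",
]

# Rebuild id2label: 0 is "O", then a B-/I- pair per entity type.
ID2LABEL = {0: "O"}
for i, name in enumerate(ENTITY_TYPES):
    ID2LABEL[2 * i + 1] = f"B-{name}"
    ID2LABEL[2 * i + 2] = f"I-{name}"
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def bio_to_spans(tokens, ner_tags):
    """Group aligned tokens and integer tags into (entity_type, text) spans.

    Hypothetical helper: an I- tag that does not continue the open span
    is treated as the start of a new span.
    """
    spans = []
    open_type = None
    for token, tag_id in zip(tokens, ner_tags):
        prefix, _, entity = ID2LABEL[tag_id].partition("-")
        if prefix == "I" and entity == open_type:
            spans[-1][1].append(token)          # continue the open span
        elif prefix in ("B", "I"):
            spans.append((entity, [token]))     # start a new span
            open_type = entity
        else:
            open_type = None                    # "O" closes any open span
    return [(entity, " ".join(toks)) for entity, toks in spans]


# Invented example row (not from the dataset).
tokens = ["Winter", "wheat", "was", "grown", "in", "Lower", "Saxony", "."]
ner_tags = [15, 16, 0, 0, 0, 21, 22, 0]
print(bio_to_spans(tokens, ner_tags))
```

The same function should apply to rows of either configuration, since both expose `Tokens` and `ner_tags` columns according to the schema in the card metadata.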
##### **Columns**

| Column | Description |
|---|---|
| **file_name** | Unique filename ID of the document |
| **Tokens** | List of tokens (words) |
| **Labels** | BIO labels aligned 1:1 with tokens |
| **ner_tags** | Integer mapping of labels for training |
| **number_of_tokens** | Total token count |
| **Language** | `"en"` or `"de"` |
| **source** | Origin repository (`BonaRes` or `OpenAgrar`) |
| **Label_counts** | `Counter` of annotation frequencies per label, stored as a string |
| **number_of_annotations** | Sum of all annotated entity spans |
| **doi** | Document DOI (when available) |

##### **Intended Use**

- Document-level transformer models (Longformer, BigBird, etc.)
- Corpus statistics and label distribution analysis
- Document classification + NER pipelines

---

#### 2️⃣ Sentence-Level Tokenized Format

Each row corresponds to **a single sentence**, preserving alignment with the original document.

##### **Sentence Identifier Format**

```text
fileID-sentenceIndex
Example: 73465-03
```

##### **Columns**

Same as the document-level format, plus:

| Column | Description |
|---|---|
| **sentence_id** | Sentence identifier combining file name and sentence index |

##### **Intended Use**

- Classical BERT-style NER (max length ≈512 tokens)
- Models with fixed-length input windows
- Fine-grained sentence-level training and evaluation

## Code Repository

All scripts used to preprocess the INCEpTION XMI files, generate the tokenized datasets, convert annotations into BIO format, create JSON span annotations, and build the Hugging Face-ready dataset version are openly available in the following GitHub repository:

🔗 **Dataset Processing Code Repository:** https://github.com/fairagro/pilot-uc-textmining-metadata

The current version is v1.0.0.

The repository includes:

- Scripts for fetching metadata from the research data infrastructures
- CAS → BIO conversion scripts
- Sentence and document tokenization routines
- Gazetteer-based location normalization
- DOI mapping utilities
- Span annotation builder (Hugging Face-like JSON format)
- Dataset export pipeline (CSV, JSON, Hugging Face)
- Example configuration files
- A full reproducible workflow for regenerating the dataset

## Authors and Affiliations

| Name | ORCID | Affiliation |
|---|---|---|
| **Abanoub Abdelmalak** | [0009-0001-0892-3614](https://orcid.org/0009-0001-0892-3614) | ZB MED – Information Centre for Life Sciences; University of Bonn |
| **Gabriel Schneider** | [0000-0001-6573-3115](https://orcid.org/0000-0001-6573-3115) | ZB MED – Information Centre for Life Sciences; University of Bonn |
| **Heike Riegler** | [0000-0002-1302-4533](https://orcid.org/0000-0002-1302-4533) | Julius Kühn-Institut |
| **Kristin Meier** | [0009-0003-1966-9679](https://orcid.org/0009-0003-1966-9679) | Leibniz Centre for Agricultural Landscape Research (ZALF) |
| **Xenia Specka** | [0000-0002-1890-0192](https://orcid.org/0000-0002-1890-0192) | Leibniz Centre for Agricultural Landscape Research (ZALF) |
| **Nikolai Svoboda** | [0000-0003-3860-4400](https://orcid.org/0000-0003-3860-4400) | Leibniz Centre for Agricultural Landscape Research (ZALF) |
| **Murtuza Husain** | [0009-0004-1496-5644](https://orcid.org/0009-0004-1496-5644) | ZB MED – Information Centre for Life Sciences; University of Bonn |
| **Juliane Fluck** | [0000-0003-1379-7023](https://orcid.org/0000-0003-1379-7023) | ZB MED – Information Centre for Life Sciences; University of Bonn |

## Licensing

The FAIRagro Metadata Enrichment NER Dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) License.

License URL: https://creativecommons.org/licenses/by/4.0/

## Cite as:

```bibtex
@dataset{abdelmalak_fairagro_ner_2025,
  author    = {Abdelmalak, Abanoub and Schneider, Gabriel and Riegler, Heike and Meier, Kristin and Specka, Xenia and Svoboda, Nikolai and Husain, Murtuza and Fluck, Juliane},
  title     = {{FAIRagro NER Dataset: Increasing FAIRness of FAIRagro Data Through AI-Supported Metadata Enrichment}},
  year      = {2025},
  publisher = {Fachrepositorium Lebenswissenschaften (FRL)},
  doi       = {10.4126/FRL01-6526458},
  url       = {https://doi.org/10.4126/FRL01-6526458},
  note      = {Version 1.0}
}
```
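As a usage note on the sentence-level configuration, the `fileID-sentenceIndex` identifier makes it possible to regroup sentences by their source document. The sketch below parses a `sentence_id` and concatenates sentence rows back into per-document token and tag sequences. The helper names and the toy rows are hypothetical; splitting on the last hyphen assumes the sentence index never contains a hyphen, and sorting assumes the index reflects sentence order, neither of which the card states explicitly.

```python
from collections import defaultdict


def parse_sentence_id(sentence_id):
    """Split 'fileID-sentenceIndex' into (file_id, index) on the last hyphen."""
    file_id, _, index = sentence_id.rpartition("-")
    if not file_id or not index.isdigit():
        raise ValueError(f"unexpected sentence_id: {sentence_id!r}")
    return file_id, int(index)


def regroup_by_document(rows):
    """Concatenate sentence rows into one token/tag sequence per file ID."""
    by_file = defaultdict(list)
    for row in rows:
        file_id, index = parse_sentence_id(row["sentence_id"])
        by_file[file_id].append((index, row))

    documents = {}
    for file_id, items in by_file.items():
        items.sort(key=lambda pair: pair[0])  # order sentences by index
        documents[file_id] = {
            "Tokens": [tok for _, row in items for tok in row["Tokens"]],
            "ner_tags": [tag for _, row in items for tag in row["ner_tags"]],
        }
    return documents


# Toy rows with invented content, shaped like the sentence_split schema.
rows = [
    {"sentence_id": "73465-02", "Tokens": ["in", "Germany", "."], "ner_tags": [0, 23, 0]},
    {"sentence_id": "73465-01", "Tokens": ["Maize", "trial"], "ner_tags": [15, 0]},
]
docs = regroup_by_document(rows)
print(docs["73465"]["Tokens"])
```

Whether the regrouped sequences match the `doc_split` rows token for token depends on how the original tokenization handled sentence boundaries, so treat the result as an approximation until checked against the document-level configuration.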
Provider:
IT-ZBMED