Jaymerry/itis-taxonomy-instruct-30k-v2-negatives
Source: Hugging Face. Updated 2026-03-26; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Jaymerry/itis-taxonomy-instruct-30k-v2-negatives
Dataset description:
---
language:
- en
license: cc0-1.0
pretty_name: ITIS Taxonomy Instruction Dataset with Negative Samples
size_categories:
- 10K<n<100K
task_categories:
- question-answering
- text-generation
tags:
- taxonomy
- biodiversity
- biology
- instruction-tuning
- alpaca
- itis
- negative-samples
- hallucination-reduction
---
# ITIS Taxonomy Instruction Dataset with Negative Samples
## Overview
The **ITIS Taxonomy Instruction Dataset with Negative Samples** is a structured instruction-response dataset derived from the public domain **Integrated Taxonomic Information System (ITIS)** database.
It was designed for fine-tuning large language models on taxonomy-oriented tasks such as rank identification, lineage reconstruction, parent taxon retrieval, taxonomic validity checks, and common name mapping.
Compared with the initial version, this release adds **synthetic negative samples** and **explicit unknown cases** to improve model robustness and reduce hallucinations when the queried taxon does not exist in ITIS.
The dataset supports offline taxonomic intelligence use cases, including biodiversity assistants, taxonomy QA systems, and structured scientific lookup tools.
## Data Source
This dataset is derived from:
**Integrated Taxonomic Information System (ITIS)**
https://www.itis.gov/
ITIS is a public domain taxonomic database maintained by U.S. federal agencies and partners.
All positive taxonomic records used in this dataset are sourced strictly from ITIS structured data.
Synthetic negative samples were generated programmatically from the positive instruction templates and do **not** introduce external biological knowledge sources.
## License
This dataset is released under:
**CC0 1.0 Universal (Public Domain Dedication)**
The original ITIS data are public domain.
This dataset is a structured transformation of those public domain records, with additional synthetic instruction examples generated for training purposes.
## What's New in This Version
This version extends the original ITIS instruction dataset by adding:
- synthetic negative samples
- explicit unknown taxon queries
- typo-based entity perturbations
- invented or mixed scientific names
These additions are intended to help fine-tuned models learn when to:
- answer correctly for known taxa
- avoid guessing when a taxon is unknown
- return a consistent fallback response for unsupported or invalid entities
This makes the dataset more suitable for real-world inference settings where user queries may include misspellings, noise, or nonexistent taxa.
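
As an illustration of the typo-based perturbations listed above, the following is a minimal sketch of how a misspelled negative sample could be produced from a known name. The card does not describe the actual generation code, so the function names (`perturb_name`, `make_negative`), the choice of edit operations, and the retry limit are all assumptions made for this example.

```python
import random


def perturb_name(name: str, rng: random.Random) -> str:
    """Apply one random character edit to `name` (hypothetical helper)."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    # Edit only alphabetic positions after the first character,
    # so the leading capital and the space between words are preserved.
    positions = [i for i in range(1, len(name)) if name[i].isalpha()]
    if not positions:
        return name
    pos = rng.choice(positions)
    op = rng.choice(["duplicate", "delete", "substitute"])
    if op == "duplicate":
        return name[:pos] + name[pos] + name[pos:]
    if op == "delete":
        return name[:pos] + name[pos + 1:]
    return name[:pos] + rng.choice(letters) + name[pos + 1:]


def make_negative(name: str, known_names: set[str], rng: random.Random,
                  max_tries: int = 20):
    """Return an Alpaca-style unknown record, or None if no unseen variant is found."""
    for _ in range(max_tries):
        candidate = perturb_name(name, rng)
        # Reject variants that are unchanged or that collide with a real name.
        if candidate != name and candidate not in known_names:
            return {
                "instruction": f"What taxonomic rank is {candidate}?",
                "input": "",
                "output": '{"status":"unknown"}',
            }
    return None
```

Checking each candidate against the set of known names matters here: a single-character edit can occasionally turn one valid name into another, which would mislabel a real taxon as unknown.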
## Dataset Structure
Each record follows an Alpaca-style instruction format:
```json
{
  "instruction": "What taxonomic rank is Panthera leo?",
  "input": "",
  "output": "Species"
}
```
For lineage tasks, the output may contain structured JSON:
```json
{
  "instruction": "Provide the full taxonomic classification (lineage) of Panthera leo.",
  "input": "",
  "output": "{\"tsn\":12345,\"scientific_name\":\"Panthera leo\",\"rank\":\"Species\",\"status\":\"valid\",\"lineage\":[...]}"
}
```
This version also contains negative or unknown examples such as:
```json
{
  "instruction": "What taxonomic rank is Panthera leoo?",
  "input": "",
  "output": "{\"status\":\"unknown\"}"
}
```
```json
{
  "instruction": "Provide the full taxonomic classification (lineage) of Xylophus imaginaryus.",
  "input": "",
  "output": "{\"status\":\"unknown\"}"
}
```
For validity checks, unknown or unsupported taxa may yield a plain `No` rather than the JSON unknown marker:
```json
{
  "instruction": "Is Xylophus imaginaryus a valid/accepted taxon in ITIS?",
  "input": "",
  "output": "No"
}
```
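
Because the `output` field can hold plain text, a JSON-encoded object, or the unknown marker, downstream code usually needs to normalize it before comparing answers. The sketch below shows one possible way to do that; the function name `parse_output` and the `kind` labels are assumptions for this example, not part of the dataset.

```python
import json


def parse_output(output: str) -> dict:
    """Classify an `output` string as unknown, JSON, or plain text."""
    text = output.strip()
    if text.startswith("{"):
        try:
            payload = json.loads(text)
        except json.JSONDecodeError:
            # Looks like JSON but does not parse: treat it as plain text.
            return {"kind": "text", "value": text}
        if isinstance(payload, dict):
            if payload.get("status") == "unknown":
                return {"kind": "unknown"}
            return {"kind": "json", "value": payload}
    return {"kind": "text", "value": text}
```

With this helper, `parse_output("Species")` is classified as text and `parse_output('{"status":"unknown"}')` as unknown. A binary `No` from a validity check stays plain text, so callers that want to count it as an abstention need to handle that case themselves.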
## Splits
The dataset contains three JSONL files:
- train.jsonl
- val.jsonl
- test.jsonl
Each file contains one JSON object per line.
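
A minimal sketch of reading one of these files with the Python standard library is shown below. The helper names (`read_jsonl`, `has_alpaca_fields`) are made up for this example, and the field check assumes every record carries the three string fields shown in the Dataset Structure section.

```python
import json
from pathlib import Path
from typing import Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)


def has_alpaca_fields(record: dict) -> bool:
    """Check that instruction, input, and output are all present as strings."""
    return all(
        isinstance(record.get(key), str)
        for key in ("instruction", "input", "output")
    )
```

Assuming `train.jsonl` has been downloaded to the working directory, `sum(has_alpaca_fields(r) for r in read_jsonl("train.jsonl"))` would count the well-formed records in the training split.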
## Supported Task Types
The dataset includes multiple taxonomy-oriented instruction types:
1. Rank Identification
2. Lineage Reconstruction
3. Validity Check
4. Parent Taxon Identification
5. Common Name ↔ Scientific Name Mapping
6. Taxonomic Comparison
7. Unknown Taxon Detection
## Intended Use
- fine-tuning large language models (LLMs)
- offline biodiversity assistants
- taxonomy QA systems
- scientific lookup tools
- hallucination reduction experiments
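
For hallucination reduction experiments, one simple measurement is how reliably a fine-tuned model returns the unknown marker when the reference answer is unknown. The sketch below computes precision and recall for that abstention behaviour. It is not an evaluation protocol defined by the dataset; the function names and the decision to count only the JSON marker (not a plain `No`) as an abstention are assumptions.

```python
UNKNOWN = '{"status":"unknown"}'


def is_unknown(text: str) -> bool:
    """True if `text` equals the unknown marker, ignoring whitespace."""
    return "".join(text.split()) == UNKNOWN


def abstention_metrics(pairs):
    """Score abstention over (reference_output, predicted_output) string pairs."""
    tp = fp = fn = 0
    for reference, prediction in pairs:
        ref_unknown = is_unknown(reference)
        pred_unknown = is_unknown(prediction)
        if ref_unknown and pred_unknown:
            tp += 1  # correctly abstained
        elif pred_unknown:
            fp += 1  # abstained on a known taxon
        elif ref_unknown:
            fn += 1  # answered for an unknown taxon (hallucination)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall, "hallucinated": fn}
```

Low recall means the model still invents answers for nonexistent taxa, while low precision means it has become overcautious and refuses taxa that ITIS does contain.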
## Data Generation Process
1. parsing the ITIS SQLite database
2. extracting taxonomic units and relationships
3. reconstructing lineage hierarchies
4. generating instruction-response pairs
5. adding synthetic negative samples
6. generating explicit unknown examples
7. splitting into train/validation/test
8. converting to Alpaca format
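
To make the lineage reconstruction step more concrete, the following is a minimal sketch of walking parent pointers from a taxon up to the root. It is not the maintainer's pipeline: the in-memory `units` mapping stands in for rows extracted from the ITIS database, the integer keys are placeholders rather than real ITIS TSNs, and the lineage is abbreviated to the main ranks.

```python
def build_lineage(tsn, units):
    """Return the root-first lineage of `tsn` by following parent pointers."""
    lineage = []
    seen = set()  # guards against cycles in malformed data
    current = tsn
    while current in units and current not in seen:
        seen.add(current)
        name, rank, parent = units[current]
        lineage.append({"rank": rank, "name": name})
        current = parent
    lineage.reverse()  # collected leaf-first, so flip to root-first
    return lineage


# Placeholder identifiers (NOT real ITIS TSNs); main ranks only.
# Each value is (name, rank, parent_id).
units = {
    1: ("Animalia", "Kingdom", None),
    2: ("Chordata", "Phylum", 1),
    3: ("Mammalia", "Class", 2),
    4: ("Carnivora", "Order", 3),
    5: ("Felidae", "Family", 4),
    6: ("Panthera", "Genus", 5),
    7: ("Panthera leo", "Species", 6),
}
```

Calling `build_lineage(7, units)` walks from the species entry up to the kingdom and returns the entries root-first. An identifier that is missing from `units` yields an empty list, which a generator could map to the unknown response.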
## Known Limitations
- English vernacular names only
- some taxa may lack vernacular names
- outputs reflect ITIS only
- negative samples are synthetic
- mixed response formats: some outputs are plain text, others are JSON strings
## Maintainer
Dataset prepared and structured by Jay Merry.
Provider:
Jaymerry



