
Jaymerry/itis-taxonomy-instruct-30k-v2-negatives

Source: Hugging Face. Updated 2026-03-26. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Jaymerry/itis-taxonomy-instruct-30k-v2-negatives
Description:
---
language:
- en
license: cc0-1.0
pretty_name: ITIS Taxonomy Instruction Dataset with Negative Samples
size_categories:
- 10K<n<100K
task_categories:
- question-answering
- text-generation
tags:
- taxonomy
- biodiversity
- biology
- instruction-tuning
- alpaca
- itis
- negative-samples
- hallucination-reduction
---

# ITIS Taxonomy Instruction Dataset with Negative Samples

## Overview

The **ITIS Taxonomy Instruction Dataset with Negative Samples** is a structured instruction-response dataset derived from the public domain **Integrated Taxonomic Information System (ITIS)** database.

It was designed for fine-tuning large language models on taxonomy-oriented tasks such as rank identification, lineage reconstruction, parent taxon retrieval, taxonomic validity checks, and common name mapping.

Compared with the initial version of the dataset, this release also includes **synthetic negative samples** and **explicit unknown cases** in order to improve model robustness and reduce hallucinations when the queried taxon does not exist in ITIS.

The dataset supports offline taxonomic intelligence use cases, including biodiversity assistants, taxonomy QA systems, and structured scientific lookup tools.

## Data Source

This dataset is derived from:

**Integrated Taxonomic Information System (ITIS)**
https://www.itis.gov/

ITIS is a public domain taxonomic database maintained by U.S. federal agencies and partners.

All positive taxonomic records used in this dataset are sourced strictly from ITIS structured data. Synthetic negative samples were generated programmatically from the positive instruction templates and do **not** introduce external biological knowledge sources.

## License

This dataset is released under:

**CC0 1.0 Universal (Public Domain Dedication)**

The original ITIS data are public domain. This dataset is a structured transformation of those public domain records, with additional synthetic instruction examples generated for training purposes.

## What's New in This Version

This version extends the original ITIS instruction dataset by adding:

- synthetic negative samples
- explicit unknown taxon queries
- typo-based entity perturbations
- invented or mixed scientific names

These additions are intended to help fine-tuned models learn when to:

- answer correctly for known taxa
- avoid guessing when a taxon is unknown
- return a consistent fallback response for unsupported or invalid entities

This makes the dataset more suitable for real-world inference settings where user queries may include misspellings, noise, or nonexistent taxa.

## Dataset Structure

Each record follows an Alpaca-style instruction format:

    {
      "instruction": "What taxonomic rank is Panthera leo?",
      "input": "",
      "output": "Species"
    }

For lineage tasks, the output may contain structured JSON:

    {
      "instruction": "Provide the full taxonomic classification (lineage) of Panthera leo.",
      "input": "",
      "output": "{\"tsn\":12345,\"scientific_name\":\"Panthera leo\",\"rank\":\"Species\",\"status\":\"valid\",\"lineage\":[...]}"
    }

This version also contains negative or unknown examples such as:

    {
      "instruction": "What taxonomic rank is Panthera leoo?",
      "input": "",
      "output": "{\"status\":\"unknown\"}"
    }

    {
      "instruction": "Provide the full taxonomic classification (lineage) of Xylophus imaginaryus.",
      "input": "",
      "output": "{\"status\":\"unknown\"}"
    }

For validity checks, unknown or unsupported taxa may yield simple binary answers:

    {
      "instruction": "Is Xylophus imaginaryus a valid/accepted taxon in ITIS?",
      "input": "",
      "output": "No"
    }

## Splits

The dataset contains three JSONL files:

- train.jsonl
- val.jsonl
- test.jsonl

Each file contains one JSON object per line.

## Supported Task Types

The dataset includes multiple taxonomy-oriented instruction types:

1. Rank Identification
2. Lineage Reconstruction
3. Validity Check
4. Parent Taxon Identification
5. Common Name ↔ Scientific Name Mapping
6. Taxonomic Comparison
7. Unknown Taxon Detection

## Intended Use

- fine-tuning large language models (LLMs)
- offline biodiversity assistants
- taxonomy QA systems
- scientific lookup tools
- hallucination reduction experiments

## Data Generation Process

1. parsing ITIS SQLite database
2. extracting taxonomic units and relationships
3. reconstructing lineage hierarchies
4. generating instruction-response pairs
5. adding synthetic negative samples
6. generating explicit unknown examples
7. splitting into train/validation/test
8. converting to Alpaca format

## Known Limitations

- English vernacular names only
- some taxa may lack vernacular names
- outputs reflect ITIS only
- negative samples are synthetic
- mixed response formats (text + JSON)

## Maintainer

Dataset prepared and structured by Jay Merry.
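
## Example: Reading Records with Mixed Outputs

Because the `output` field mixes plain text (for example `Species` or `No`) with JSON encoded as a string (for example `{"status":"unknown"}`), downstream code may want to normalize it before evaluation. The following is a minimal sketch, not part of the dataset's own tooling. The helper names `iter_records`, `parse_output`, and `is_unknown` are hypothetical, and the sample lines are built in memory from the examples shown in this card rather than read from the real files.

```python
import json
from typing import Any, Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one Alpaca-style record per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def parse_output(output: str) -> Any:
    """Decode the output if it is a JSON object, otherwise return the text."""
    text = output.strip()
    if text.startswith("{"):
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            # Not valid JSON after all (e.g. an elided example); keep raw text.
            return text
    return text


def is_unknown(output: str) -> bool:
    """True only for the JSON fallback response {"status": "unknown"}."""
    parsed = parse_output(output)
    return isinstance(parsed, dict) and parsed.get("status") == "unknown"


# In-memory stand-ins for two lines of a split file (assumed layout:
# one JSON object per line with "instruction", "input", "output").
sample_records = [
    {"instruction": "What taxonomic rank is Panthera leo?",
     "input": "", "output": "Species"},
    {"instruction": "What taxonomic rank is Panthera leoo?",
     "input": "", "output": json.dumps({"status": "unknown"})},
]
sample_lines = [json.dumps(record) for record in sample_records]

records = list(iter_records(sample_lines))
flags = [is_unknown(record["output"]) for record in records]
print(flags)  # [False, True]
```

To run the same logic on a real split, an open file handle can be passed in place of `sample_lines`, for example `open("train.jsonl", encoding="utf-8")`. Note that this sketch does not treat the plain-text `No` from validity checks as an unknown marker, since that answer is a binary response rather than the JSON fallback.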
Provider:
Jaymerry