
tekwhisperer/Cannabis_Science_Data

Source: Hugging Face. Updated 2026-03-31. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/tekwhisperer/Cannabis_Science_Data
Resource description:
---
license: apache-2.0
task_categories:
- question-answering
- text-generation
language:
- en
tags:
- chemistry
- biology
- medical
- cannabis
- research
- cannabisextraction
- plant
- extraction
- chemicalengineering
- synthetic-data
- scientific-qa
pretty_name: Cannabis-Science-Literature
size_categories:
- 100K<n<1M
---

# Cannabis Science Literature QA Dataset

This dataset contains **161,170 high-quality question-answer pairs** derived from over 400 peer-reviewed cannabis science research papers and textbooks. Created to advance AI research in cannabis science and medical applications, it provides a comprehensive resource for training language models on cannabis-related scientific knowledge.

## Dataset Details

### Dataset Description

This dataset was systematically generated from a curated collection of cannabis science literature using advanced NLP processing techniques. The source materials include peer-reviewed research papers, academic journals, and college-level textbooks covering cannabis chemistry, biology, pharmacology, extraction methods, and medical applications.

- **Curated by:** Kellan Finney
- **Funded by:** Eighth Revolution
- **Language(s) (NLP):** English
- **Total Q&A Pairs:** 161,170
- **Source Documents:** 400+ research papers and textbooks
- **License:** Apache 2.0

### Dataset Sources

- **Repository:** [https://github.com/KellanFinney/Canna_LoRA](https://github.com/KellanFinney/Canna_LoRA)
- **Source Papers:** [Cannabis Research Literature Collection](https://drive.google.com/drive/folders/1zOrIlrChpPteq7cmeNluCBA6tquwIvj9?usp=drive_link)

## Uses

### Direct Use

- **Training scientific Q&A models** for cannabis domain expertise
- **Fine-tuning language models** for cannabis and botanical applications
- **Research applications** in computational biology and chemistry
- **Educational chatbots** for cannabis science learning
- **Literature analysis** and knowledge synthesis tools

### Out-of-Scope Use

- **Medical advice or diagnosis** - This dataset is for research purposes only
- **Legal advice** regarding cannabis regulations or compliance
- **Commercial product claims** without proper validation and testing
- **Direct medical decision-making** without healthcare professional oversight

## Dataset Structure

The dataset is organized in JSON batch files, each containing Q&A pairs with associated metadata:

```json
{
  "paper_name": {
    "chunk_0": {
      "generated": [
        {
          "question": "What is the primary psychoactive compound in cannabis?",
          "answer": "Δ9-tetrahydrocannabinol (THC) is the primary psychoactive compound..."
        }
      ],
      "context": "Source text chunk from research paper...",
      "source_pdf": "cannabis_pharmacology_2023.pdf"
    }
  }
}
```

**File Organization:**

- Batch files: `science_training_batch_001.json` through `science_training_batch_XXX.json`
- Each batch contains 5 processed documents
- Total file size: ~2.5GB across all batches

## Dataset Creation

### Curation Rationale

This dataset addresses a critical gap in domain-specific training data for cannabis science.
Key motivations include:

- **Scientific accuracy**: Ensuring AI models have access to peer-reviewed cannabis research
- **Industry support**: Helping cannabis operators make informed, science-based decisions
- **Educational advancement**: Supporting research and education in cannabis science
- **Knowledge accessibility**: Making complex scientific literature more accessible through AI

### Source Data

- **400+ peer-reviewed research papers** from academic journals
- **College-level textbooks** on cannabis science and related fields
- **Academic publications** covering 2010-2024 research
- **Selection criteria**: Peer-reviewed, scientific rigor, relevance to cannabis research

#### Data Collection and Processing

1. **Document Processing**: Docling library for high-quality PDF conversion
2. **Intelligent Chunking**: HybridChunker for context-aware text segmentation
3. **Contextualization**: Each chunk enriched with surrounding document context
4. **Q&A Generation**: GPT-4o-mini with specialized prompts (5 pairs per chunk)
5. **Quality Control**: Structured JSON validation and rate-limited processing
6. **Parallel Processing**: 30 workers with 490 RPM rate limiting

#### Who are the source data producers?

- **Academic researchers** from universities and research institutions
- **Peer-reviewed journal publishers** in chemistry, biology, and medical fields
- **Scientific community members** specializing in cannabis research
- **Educational institutions** producing cannabis science curricula

## Bias, Risks, and Limitations

### Potential Biases

- **Academic bias**: Reflects published research perspectives and methodologies
- **Geographic bias**: Primarily Western/English-language research sources
- **Temporal bias**: Weighted toward more recent research (2015-2024)
- **Research focus bias**: May emphasize certain cannabis applications over others

### Risks and Limitations

- **Generated content accuracy**: AI-generated Q&A pairs may contain factual errors
- **Medical applications**: Not suitable for direct medical decision-making
- **Regulatory compliance**: Does not provide legal or regulatory guidance
- **Technical limitations**: Context window constraints during generation process
- **Model hallucinations**: Potential for GPT model to generate plausible but incorrect information
- **Coverage gaps**: Some specialized subtopics may be underrepresented

### Recommendations

Users should:

- Verify critical information against original sources
- Use for research and educational purposes only
- Consult healthcare professionals for medical applications
- Fact-check generated content for high-stakes applications

## Citation

If you use this dataset in your research, please cite:

```bibtex
@dataset{finney2025cannabis,
  title={Cannabis Science Literature QA Dataset: 161K Question-Answer Pairs from Peer-Reviewed Research},
  author={Kellan Finney},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/KellanF89/Cannabis_Science_Data}
}
```

## Dataset Card Authors

**Kellan Finney** - Dataset creation, curation, and processing pipeline development

## Dataset Card Contact

For questions, collaborations, or feedback, please reach out via [LinkedIn](https://www.linkedin.com/in/kellan-finney-m-s-861379a1).

---

*This dataset represents a significant advancement in making cannabis science knowledge accessible to AI systems, supporting both research progress and practical applications in the evolving cannabis industry.*
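
The nested layout described under Dataset Structure (paper, then chunk, then a `generated` list of pairs) can be flattened into one record per Q&A pair for training. The following is a minimal sketch, assuming every batch file parses to exactly that layout; the helper names `iter_qa_pairs` and `load_batch` are illustrative and not part of the dataset's repository.

```python
import json
from typing import Any, Dict, Iterator, List


def iter_qa_pairs(batch: Dict[str, Any]) -> Iterator[Dict[str, str]]:
    """Yield one flat record per Q&A pair from a parsed batch dictionary."""
    for paper_name, chunks in batch.items():
        for chunk_id, chunk in chunks.items():
            # "generated" holds the list of question/answer dictionaries.
            for pair in chunk.get("generated", []):
                yield {
                    "paper_name": paper_name,
                    "chunk_id": chunk_id,
                    "question": pair["question"],
                    "answer": pair["answer"],
                    "context": chunk.get("context", ""),
                    "source_pdf": chunk.get("source_pdf", ""),
                }


def load_batch(path: str) -> List[Dict[str, str]]:
    """Read one batch file (hypothetical local path) and flatten it."""
    with open(path, encoding="utf-8") as f:
        return list(iter_qa_pairs(json.load(f)))


# In-memory sample mirroring the structure shown in the dataset card.
sample_batch = {
    "paper_name": {
        "chunk_0": {
            "generated": [
                {
                    "question": "What is the primary psychoactive compound in cannabis?",
                    "answer": "Δ9-tetrahydrocannabinol (THC) is the primary psychoactive compound...",
                }
            ],
            "context": "Source text chunk from research paper...",
            "source_pdf": "cannabis_pharmacology_2023.pdf",
        }
    }
}

rows = list(iter_qa_pairs(sample_batch))
```

Keeping `context` and `source_pdf` on every row makes it easier to follow the card's recommendation to verify answers against the original sources.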
Provided by: tekwhisperer