
jmhb/pubmed_bioasq_2022

Source: Hugging Face. Updated 2026-01-28; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/jmhb/pubmed_bioasq_2022
Resource description:
---
language:
- en
license: cc0-1.0
size_categories:
- 10M<n<100M
task_categories:
- text-retrieval
- question-answering
pretty_name: PubMed BioASQ 2022 Corpus
tags:
- biomedical
- pubmed
- mesh
- information-retrieval
- qa
- medical
- scientific-literature
---

# PubMed BioASQ 2022 Corpus

This dataset contains the PubMed abstracts corpus from the BioASQ 2022 challenge, comprising approximately 23 million biomedical documents with MeSH (Medical Subject Headings) annotations.

## Purpose

This is a **convenience collection** of the PubMed corpus from the [BioASQ 2022 Challenge](http://bioasq.org/), reformatted for easier use in retrieval and QA systems. The original source is the BioASQ challenge data. We created this processed version in multiple formats (JSON, Parquet, JSONL) to support different use cases in our PaperSearchQA work.

**IMPORTANT**: The underlying PubMed abstracts and MeSH annotations are from the BioASQ 2022 challenge release. We have reformatted this data for convenience, but all content originates from BioASQ.

## Dataset Description

The corpus includes PubMed abstracts covering publications up to 2022, originally sourced from the [BioASQ Challenge](http://bioasq.org/). This is the foundational corpus used for the [PaperSearchQA](https://jmhb0.github.io/PaperSearchQA) benchmark for evaluating retrieval and question-answering systems in the biomedical domain.

### Key Statistics

- **Total Documents**: ~23 million PubMed abstracts
- **Coverage**: Publications up to 2022
- **Annotations**: MeSH terms (Medical Subject Headings)
- **Size**: ~63GB total (multiple formats)
- **Languages**: Primarily English

### Data Formats

The dataset is provided in three formats for different use cases:

1. **`data/allMeSH_2022.json`** (27GB) - Original BioASQ format
   - Complete JSON with all metadata
   - Contains: PMID, title, abstract, journal, publication year, MeSH terms
2. **`data/allMeSH_2022.parquet`** (13GB) - Optimized structured format
   - Apache Parquet for efficient analytics
   - Same fields as JSON, optimized for queries and filtering
   - Recommended for data analysis workflows
3. **`data/corpus/pubmed.jsonl`** (23GB) - Retrieval-optimized format
   - One document per line (JSONL)
   - Format: `{"_id": "PMID", "title": "...", "text": "...", "metadata": {...}}`
   - Optimized for information retrieval systems (BM25, dense retrievers)
   - Compatible with Pyserini, Elasticsearch, and other IR tools

### Auxiliary Files

- **`data/indices/allMeSH_2022_pmid_index.pkl`** (243MB) - PMID lookup index
- **`data/indices/allMeSH_2022_mesh_index.pkl`** (981MB) - MeSH term index

## Usage

### Loading the Parquet Format (Recommended for Analysis)

```python
import pandas as pd
from huggingface_hub import hf_hub_download

# Download the parquet file
file_path = hf_hub_download(
    repo_id="jmhb/pubmed_bioasq_2022",
    filename="data/allMeSH_2022.parquet",
    repo_type="dataset"
)

# Load into pandas (the file is ~13GB, so this needs substantial memory)
df = pd.read_parquet(file_path)

# Example: Filter by year
recent_papers = df[df['year'] >= 2020]

# Example: Search by MeSH term (substring match on the stringified term list)
covid_papers = df[df['meshMajor'].apply(lambda x: 'COVID-19' in str(x))]
```

### Loading the JSONL Format (for Retrieval)

```python
import json
from huggingface_hub import hf_hub_download

# Download the JSONL corpus
file_path = hf_hub_download(
    repo_id="jmhb/pubmed_bioasq_2022",
    filename="data/corpus/pubmed.jsonl",
    repo_type="dataset"
)

# Read documents (the file is ~23GB; collect only what you need if memory is limited)
documents = []
with open(file_path, 'r', encoding='utf-8') as f:
    for line in f:
        doc = json.loads(line)
        documents.append(doc)

# Example document structure:
# {
#   "_id": "12345678",
#   "title": "Example paper title",
#   "text": "Abstract text...",
#   "metadata": {
#     "journal": "Nature",
#     "year": 2020,
#     "meshMajor": ["COVID-19", "SARS-CoV-2"]
#   }
# }
```

### Using with Pyserini (BM25 Retrieval)

```python
import json
import os

from huggingface_hub import hf_hub_download
# Note: recent Pyserini releases expose this class as
# pyserini.search.lucene.LuceneSearcher; adjust the import for your version.
from pyserini.search import SimpleSearcher

# Download corpus
corpus_path = hf_hub_download(
    repo_id="jmhb/pubmed_bioasq_2022",
    filename="data/corpus/pubmed.jsonl",
    repo_type="dataset"
)

# Index with Pyserini (run once)
os.system(f"python -m pyserini.index -collection JsonCollection \
    -generator DefaultLuceneDocumentGenerator \
    -threads 8 \
    -input {os.path.dirname(corpus_path)} \
    -index ./pubmed_index \
    -storePositions -storeDocvectors -storeRaw")

# Search
searcher = SimpleSearcher('./pubmed_index')
hits = searcher.search('COVID-19 treatments', k=10)

for hit in hits:
    print(f"PMID: {hit.docid}")
    print(f"Score: {hit.score:.4f}")
    print(f"Title: {json.loads(hit.raw)['title']}\n")
```

### Loading Original JSON Format

```python
import json
from huggingface_hub import hf_hub_download

file_path = hf_hub_download(
    repo_id="jmhb/pubmed_bioasq_2022",
    filename="data/allMeSH_2022.json",
    repo_type="dataset"
)

# json.load reads the whole ~27GB file into memory at once
with open(file_path, 'r', encoding='utf-8') as f:
    data = json.load(f)

# Access documents
for article in data['articles'][:5]:
    print(f"PMID: {article['pmid']}")
    print(f"Title: {article['title']}")
    print(f"Year: {article['year']}")
    print(f"MeSH: {article.get('meshMajor', [])}\n")
```

## Data Fields

### Common Fields Across Formats

- **pmid** (string): PubMed unique identifier
- **title** (string): Article title
- **abstractText** (string): Article abstract
- **journal** (string): Journal name
- **year** (integer): Publication year
- **meshMajor** (list): Major MeSH descriptor terms
- **meshMinor** (list): Minor MeSH descriptor terms (if available)

### JSONL-Specific Fields

- **_id** (string): Document ID (same as PMID)
- **text** (string): Combined abstract text
- **metadata** (dict): Contains journal, year, MeSH terms

## Conversion Scripts

The repository includes scripts to regenerate the different formats:

- **`scripts/allMesh_to_parquet.py`**: Convert JSON to Parquet
- **`scripts/make_pubmed_corpus.py`**: Generate JSONL corpus from JSON

See the scripts for usage details.
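
As a rough illustration of how the common fields relate to the JSONL layout, the sketch below maps one BioASQ-style record to the retrieval-style structure. It is not the repository's `scripts/make_pubmed_corpus.py`: the function names `to_retrieval_doc` and `to_jsonl_line` are hypothetical, the field mapping is inferred only from the field descriptions in this card, and the sample record is made up.

```python
import json


def to_retrieval_doc(article: dict) -> dict:
    """Map one BioASQ-style record to the retrieval layout (assumed mapping)."""
    return {
        "_id": str(article["pmid"]),               # document ID is the PMID
        "title": article.get("title") or "",
        "text": article.get("abstractText") or "",  # empty if the abstract is missing
        "metadata": {
            "journal": article.get("journal"),
            "year": article.get("year"),
            "meshMajor": article.get("meshMajor") or [],
        },
    }


def to_jsonl_line(article: dict) -> str:
    """Serialize one converted record as a single JSONL line."""
    return json.dumps(to_retrieval_doc(article), ensure_ascii=False)


# Made-up record for illustration only; not taken from the corpus.
example = {
    "pmid": "12345678",
    "title": "Example paper title",
    "abstractText": "Abstract text...",
    "journal": "Nature",
    "year": 2020,
    "meshMajor": ["COVID-19", "SARS-CoV-2"],
}
line = to_jsonl_line(example)
print(line)
```

Records without an abstract are kept with an empty `text` field here; whether the actual conversion script keeps or drops such records is not stated in this card.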

## Source and Attribution

**Original Data Source**: [BioASQ 2022 Challenge](http://bioasq.org/)

### PubMed Data

PubMed abstracts are in the **public domain** and may be used freely. However, proper attribution is required:

- **Source**: U.S. National Library of Medicine (NLM)
- **Database**: PubMed / MEDLINE
- **Required Attribution**:
  - Cite the National Center for Biotechnology Information (NCBI) as the source
  - Provide a link to the original PubMed record when possible
  - Example: "Data from PubMed (NCBI, NLM, NIH)"

### Citation

The corpus structure follows the BioASQ 2022 challenge format. If using this dataset for research, you **must** cite the original BioASQ papers (the first two below). If you found this processed version valuable, please also consider citing PaperSearchQA:

```bibtex
@article{krithara2023bioasq,
  title={BioASQ-QA: A manually curated corpus for Biomedical Question Answering},
  author={Krithara, Anastasia and Nentidis, Anastasios and Bougiatiotis, Konstantinos and Paliouras, Georgios},
  journal={Scientific Data},
  volume={10},
  number={1},
  pages={170},
  year={2023},
  publisher={Nature Publishing Group UK London}
}

@article{tsatsaronis2015overview,
  title={An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition},
  author={Tsatsaronis, George and Balikas, Georgios and Malakasiotis, Prodromos and Partalas, Ioannis and Zschunke, Matthias and Alvers, Michael R and Weissenborn, Dirk and Krithara, Anastasia and Petridis, Sergios and Polychronopoulos, Dimitris and others},
  journal={BMC bioinformatics},
  volume={16},
  number={1},
  pages={138},
  year={2015},
  publisher={Springer}
}

@misc{burgess2026papersearchqalearningsearchreason,
  title={PaperSearchQA: Learning to Search and Reason over Scientific Papers with RLVR},
  author={James Burgess and Jan N. Hansen and Duo Peng and Yuhui Zhang and Alejandro Lozano and Min Woo Sun and Emma Lundberg and Serena Yeung-Levy},
  year={2026},
  eprint={2601.18207},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2601.18207},
}
```

## Dataset Creation

### Source Data

The corpus was created from the BioASQ 2022 challenge data release, which aggregates:

1. **PubMed Abstracts**: Retrieved via NCBI E-utilities API
2. **MeSH Annotations**: Medical Subject Headings assigned by NLM indexers
3. **Metadata**: Journal, publication dates, authors (where available)

### Processing Pipeline

1. **Download**: Raw BioASQ JSON (`allMeSH_2022.json`)
2. **Parquet Conversion**: Optimized tabular format using Apache Arrow
3. **JSONL Corpus**: Reformatted for retrieval systems with normalized structure
4. **Index Creation**: Built PMID and MeSH term lookup indices for fast access

### Quality Notes

- Some abstracts may be missing or incomplete (pre-2000 papers)
- MeSH terms are professionally annotated by NLM indexers
- Non-English abstracts are included but uncommon
- Retracted papers may still be present in the corpus

## Maintenance and Updates

This dataset represents a **snapshot from 2022** and will not receive updates. For more recent PubMed data, please use:

- The official PubMed API (E-utilities)
- PubMed FTP baseline files
- Future BioASQ releases

## Links

- **BioASQ Challenge**: [http://bioasq.org/](http://bioasq.org/)
- **PaperSearchQA Project**: [https://jmhb0.github.io/PaperSearchQA](https://jmhb0.github.io/PaperSearchQA)
- **Code Repository**: [GitHub](https://github.com/jmhb0/PaperSearchQA)
- **PubMed**: [https://pubmed.ncbi.nlm.nih.gov/](https://pubmed.ncbi.nlm.nih.gov/)
- **MeSH Database**: [https://www.ncbi.nlm.nih.gov/mesh](https://www.ncbi.nlm.nih.gov/mesh)

## Acknowledgments

- **BioASQ Team** for organizing the challenge and providing the corpus
- **NLM/NCBI** for maintaining PubMed and MeSH databases
- **PubMed Contributors** for making biomedical literature openly accessible
Provided by:
jmhb