Download link:

https://modelscope.cn/datasets/NeuML/pubmed-hmpv

Resource description:

# PubMed HMPV Articles

_Current as of January 7, 2025_

This dataset is metadata (id, publication date, title, link) from PubMed articles related to HMPV. It was created using [paperetl](https://github.com/neuml/paperetl) and the [PubMed Baseline](https://pubmed.ncbi.nlm.nih.gov/download/).

The 37 million articles were filtered to match either of the following criteria.

- MeSH code = [D029121](https://meshb-prev.nlm.nih.gov/record/ui?ui=D029121)
- Keyword of `HMPV` in either the `title` or `abstract`

## Retrieve article abstracts

The full article abstracts can be retrieved via the [PubMed API](https://www.nlm.nih.gov/dataguide/eutilities/utilities.html#efetch). This method accepts batches of PubMed IDs.

Alternatively, the dataset can be recreated using the following steps and loading the abstracts into the dataset (see step 5).

## Download and build

The following steps recreate this dataset.

1. Create the following directories and files

```bash
mkdir -p pubmed/config pubmed/data

echo "D029121" > pubmed/config/codes
echo "HMPV" > pubmed/config/keywords
```

2. Install `paperetl` and download `PubMed Baseline + Updates` into `pubmed/data`.

```bash
pip install paperetl datasets
```

3. Parse the PubMed dataset into article metadata

```bash
python -m paperetl.file pubmed/data pubmed/articles pubmed/config
```

4. Export to dataset

```python
from datasets import Dataset

ds = Dataset.from_sql(
    ("SELECT id id, published published, title title, reference reference FROM articles "
     "ORDER BY published DESC"),
    f"sqlite:///pubmed/articles/articles.sqlite"
)
ds.to_csv(f"pubmed-hmpv/articles.csv")
```

5. _Optional_ Export to dataset with all fields

paperetl parses all metadata and article abstracts. If you'd like to create a local dataset with the abstracts, run the following instead of step 4.

```python
import sqlite3
import uuid

from datasets import Dataset

class Export:
    def __init__(self, dbfile):
        # Load database
        self.connection = sqlite3.connect(dbfile)
        self.connection.row_factory = sqlite3.Row

    def __call__(self):
        # Create cursors
        cursor1 = self.connection.cursor()
        cursor2 = self.connection.cursor()

        # Get article metadata
        cursor1.execute("SELECT * FROM articles ORDER BY id")
        for row in cursor1:
            # Get abstract text
            cursor2.execute(
                "SELECT text FROM sections WHERE article = ? and name != 'TITLE' ORDER BY id",
                [row[0]]
            )
            abstract = " ".join(r["text"] for r in cursor2)

            # Combine into single record and yield
            row = {**row, **{"abstract": abstract}}
            yield {k.lower(): v for k, v in row.items()}

    def __reduce__(self):
        return (pickle, (str(uuid.uuid4()),))

    def pickle(self, *args, **kwargs):
        raise AssertionError("Generator pickling workaround")

# Path to database
export = Export("pubmed/articles/articles.sqlite")
ds = Dataset.from_generator(export)
ds = ds.sort("published", reverse=True)
ds.to_csv("pubmed-hmpv-full/articles.csv")
```
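The batch retrieval mentioned under "Retrieve article abstracts" could be sketched as follows. This is a minimal illustration, not part of the original dataset tooling: it only builds E-utilities `efetch` request URLs for batches of PubMed IDs and does not send them. The helper names `batches` and `efetch_urls` are hypothetical, and the batch size of 200 is an assumption chosen to keep GET URLs short; the ids would typically come from the `id` column of `articles.csv`.

```python
from urllib.parse import urlencode

# NCBI E-utilities efetch endpoint
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def batches(ids, size=200):
    """Yield successive lists of at most `size` ids (size is an assumed default)."""
    ids = list(ids)
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def efetch_urls(ids, size=200):
    """Build one efetch URL per batch of PubMed IDs (hypothetical helper)."""
    for batch in batches(ids, size):
        params = {
            "db": "pubmed",
            "id": ",".join(str(i) for i in batch),
            "rettype": "abstract",  # request abstracts
            "retmode": "text",      # plain text output
        }
        yield f"{EFETCH_URL}?{urlencode(params)}"

# Example: three ids split into batches of two
urls = list(efetch_urls([1, 2, 3], size=2))
```

Each URL can then be fetched with any HTTP client. Building the URLs separately from sending them makes it easier to add rate limiting or an API key later without touching the batching logic.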


Use cases: