pubmed-hmpv
Source: ModelScope community. Updated 2025-12-04; indexed 2025-01-11.
Download link:
https://modelscope.cn/datasets/NeuML/pubmed-hmpv
Description:
# PubMed HMPV Articles
_Current as of January 7, 2025_
This dataset contains metadata (id, publication date, title, link) for PubMed articles related to human metapneumovirus (HMPV). It was created using [paperetl](https://github.com/neuml/paperetl) and the [PubMed Baseline](https://pubmed.ncbi.nlm.nih.gov/download/).
The 37 million source articles were filtered down to those matching either of the following criteria.
- MeSH code = [D029121](https://meshb-prev.nlm.nih.gov/record/ui?ui=D029121)
- Keyword of `HMPV` in either the `title` or `abstract`
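The two criteria above amount to a simple OR filter. The following is a minimal sketch of that logic, not paperetl's actual implementation: the function name `matches_hmpv`, the dictionary layout with a `mesh` list, and the case-insensitive substring match are all assumptions made for illustration.

```python
def matches_hmpv(article, codes=("D029121",), keywords=("HMPV",)):
    """Return True if a (hypothetical) article dict meets either criterion."""
    # Criterion 1: one of the article's MeSH codes is in the configured list
    if set(article.get("mesh") or []) & set(codes):
        return True

    # Criterion 2: a keyword appears in the title or the abstract
    # (case-insensitive substring match is an assumption of this sketch)
    text = f"{article.get('title') or ''} {article.get('abstract') or ''}".lower()
    return any(keyword.lower() in text for keyword in keywords)
```

An article tagged with `D029121` passes even if `HMPV` never appears in its text, and an untagged article passes as soon as the keyword appears in its title or abstract.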
## Retrieve article abstracts
The full article abstracts can be retrieved via the [PubMed API](https://www.nlm.nih.gov/dataguide/eutilities/utilities.html#efetch). This method accepts batches of PubMed IDs.
Alternatively, the dataset can be recreated using the following steps and loading the abstracts into the dataset (see step 5).
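The API route can be sketched as follows, assuming the standard E-utilities `efetch` endpoint with `db=pubmed` and `retmode=xml`. The helper names (`efetch_url`, `parse_abstracts`, `fetch_abstracts`), the batch size of 200, and the pause between requests are choices made for this sketch, not part of the dataset's tooling.

```python
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(pmids):
    """Build an efetch URL for a batch of PubMed IDs."""
    params = {"db": "pubmed", "id": ",".join(str(p) for p in pmids), "retmode": "xml"}
    return f"{EFETCH}?{urllib.parse.urlencode(params)}"


def parse_abstracts(xml_text):
    """Map each PMID in an efetch XML response to its abstract text."""
    root = ET.fromstring(xml_text)
    abstracts = {}
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        # Structured abstracts have several AbstractText elements; join them in order
        nodes = article.findall("MedlineCitation/Article/Abstract/AbstractText")
        parts = ["".join(node.itertext()).strip() for node in nodes]
        abstracts[pmid] = " ".join(part for part in parts if part)
    return abstracts


def fetch_abstracts(pmids, batch_size=200, pause=0.4):
    """Fetch abstracts in batches, pausing between requests."""
    results = {}
    for start in range(0, len(pmids), batch_size):
        batch = pmids[start:start + batch_size]
        with urllib.request.urlopen(efetch_url(batch)) as response:
            results.update(parse_abstracts(response.read()))
        time.sleep(pause)
    return results
```

Calling `fetch_abstracts(ids)` with the `id` column of this dataset returns a dictionary keyed by PubMed ID. Articles without an abstract map to an empty string.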
## Download and build
The following steps recreate this dataset.
1. Create the following directories and files
```bash
mkdir -p pubmed/config pubmed/data
echo "D029121" > pubmed/config/codes
echo "HMPV" > pubmed/config/keywords
```
2. Install `paperetl` and download `PubMed Baseline + Updates` into `pubmed/data`.
```bash
pip install paperetl datasets
```
3. Parse the PubMed dataset into article metadata
```bash
python -m paperetl.file pubmed/data pubmed/articles pubmed/config
```
4. Export to dataset
```python
from datasets import Dataset

# Select the metadata columns, newest articles first
ds = Dataset.from_sql(
    ("SELECT id id, published published, title title, reference reference FROM articles "
     "ORDER BY published DESC"),
    "sqlite:///pubmed/articles/articles.sqlite"
)

# Write the metadata to CSV
ds.to_csv("pubmed-hmpv/articles.csv")
```
5. _Optional_ Export to dataset with all fields
paperetl parses all metadata and article abstracts. If you'd like to create a local dataset with the abstracts, run the following instead of step 4.
```python
import sqlite3
import uuid

from datasets import Dataset


class Export:
    def __init__(self, dbfile):
        # Load database
        self.connection = sqlite3.connect(dbfile)
        self.connection.row_factory = sqlite3.Row

    def __call__(self):
        # Create cursors
        cursor1 = self.connection.cursor()
        cursor2 = self.connection.cursor()

        # Get article metadata
        cursor1.execute("SELECT * FROM articles ORDER BY id")
        for row in cursor1:
            # Get abstract text
            cursor2.execute(
                "SELECT text FROM sections WHERE article = ? and name != 'TITLE' ORDER BY id",
                [row[0]]
            )
            abstract = " ".join(r["text"] for r in cursor2)

            # Combine into single record and yield
            row = {**row, **{"abstract": abstract}}
            yield {k.lower(): v for k, v in row.items()}

    def __reduce__(self):
        # The sqlite3 connection cannot be pickled, so reduce to a unique placeholder instead
        return (pickle, (str(uuid.uuid4()),))


def pickle(*args, **kwargs):
    # Defined at module level so the name resolves inside __reduce__; never meant to be called
    raise AssertionError("Generator pickling workaround")


# Path to database
export = Export("pubmed/articles/articles.sqlite")

ds = Dataset.from_generator(export)
ds = ds.sort("published", reverse=True)
ds.to_csv("pubmed-hmpv-full/articles.csv")
```
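Once either export has run, the CSV can be sanity-checked without any extra dependencies. The sketch below tallies articles per publication year. It assumes the `published` column starts with a four-digit year, which is not confirmed by the source, and the helper names `load_rows` and `articles_per_year` are hypothetical.

```python
import csv
from collections import Counter


def load_rows(path="pubmed-hmpv/articles.csv"):
    """Read the exported CSV into a list of dictionaries."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def articles_per_year(rows):
    """Count rows per year, skipping rows whose date lacks a leading 4-digit year."""
    counts = Counter()
    for row in rows:
        published = (row.get("published") or "").strip()
        # Assumption: dates look like "YYYY-..." so the first four characters are the year
        if len(published) >= 4 and published[:4].isdigit():
            counts[published[:4]] += 1
    return counts
```

For example, `articles_per_year(load_rows()).most_common(5)` lists the five years with the most matching articles.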
Provider: maas
Created: 2025-01-08



