Tokenized product information for centrally approved medicines within EU (extracted May 3, 2022)

Name: Tokenized product information for centrally approved medicines within EU (extracted May 3, 2022)
Creator: Uppsala University
Published: 2026-03-23 10:17:31
License: No description available

DataCite Commons. Updated: 2026-03-23. Indexed: 2025-04-17.

Download link:

https://researchdata.se/catalogue/dataset/2022-157-1/1

Description:

The text corpus was compiled on May 3, 2022, by scripted download of all available English-language product information files for all centrally approved medicinal products within the EU from the European Medicines Agency website. The Package Leaflet (PL) and Summary of Product Characteristics (SmPC) documents for each medicinal product were used, excluding the additional documents that exist for medicinal products with more than one strength or pharmaceutical preparation. Text was extracted from the PDF files using the pdfplumber package (version 0.6.1) in Python 3.8.10, omitting page numbering, headers, and footers. Line breaks and special characters (other than punctuation) were removed, and punctuation was added where it was missing (for example, after headings) to avoid false aggregation of sentences. All paragraphs were tokenized at the sentence level using the Natural Language Toolkit (NLTK) version 3.7 tokenizer. This database contains sentence-level tokenized product information from all centrally approved medicinal products within the EU (May 3, 2022), including Summary of Product Characteristics (SmPC) and Package Leaflet (PL) documents. A total of 1258 medicinal products were initially included, of which 5 were subsequently excluded due to document compatibility issues, leaving 1253. From these, approximately 783,000 sentences were extracted from the PL and SmPC documents.
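The cleaning and sentence-splitting steps described above can be illustrated with a minimal sketch. It is not the dataset authors' script: the function names are hypothetical, the input is assumed to be text already extracted from a PDF with blank lines separating blocks (headings and paragraphs), and a simple regular-expression splitter stands in for the NLTK tokenizer so that the example runs with the standard library alone.

```python
import re
from typing import Callable, List

# Characters treated as "already punctuated" at the end of a block (an assumption).
TERMINAL_PUNCTUATION = ".!?:;"


def normalize_blocks(raw: str) -> str:
    """Remove line breaks and punctuate blocks that lack end punctuation.

    Assumption: blocks (headings, paragraphs) are separated by blank lines.
    Without the added period, a heading would merge with the sentence after it.
    """
    cleaned = []
    for block in re.split(r"\n\s*\n", raw):
        text = " ".join(block.split())  # drop line breaks, collapse whitespace
        if not text:
            continue
        if text[-1] not in TERMINAL_PUNCTUATION:
            text += "."  # e.g. a heading such as "4.1 Therapeutic indications"
        cleaned.append(text)
    return " ".join(cleaned)


def naive_sentence_split(text: str) -> List[str]:
    """Stand-in splitter: break after '.', '!' or '?' followed by whitespace.

    This is NOT the NLTK tokenizer; it will mis-split abbreviations like "e.g. x".
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def tokenize_sentences(
    text: str, tokenizer: Callable[[str], List[str]] = naive_sentence_split
) -> List[str]:
    """Split cleaned text into sentences with a pluggable tokenizer."""
    return [s.strip() for s in tokenizer(text) if s.strip()]


if __name__ == "__main__":
    sample = (
        "4.1 Therapeutic indications\n\n"
        "Drug X is indicated for\nadults. Do not exceed the dose."
    )
    for sentence in tokenize_sentences(normalize_blocks(sample)):
        print(sentence)
```

If NLTK and its Punkt sentence model are installed, `nltk.tokenize.sent_tokenize` can be passed as the `tokenizer` argument in place of the stand-in splitter; whether the original pipeline used that exact function is not stated in the description.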

Provider:

Uppsala University

Created:

2022-09-29