Regulatory Signal Dataset: FDA Accelerated Approval Documents
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://data.mendeley.com/datasets/xpr2yxmcnx
Resource summary:
This dataset contains a collection of regulatory documents from the U.S. Food and Drug Administration (FDA) for drugs granted Accelerated Approval. It focuses on the modern regulatory review paradigm where information is distributed across multiple document types. This initial phase of data collection includes raw document files and their extracted text content for a broad set of applications, serving as a foundation for subsequent detailed signal extraction and analysis.
Description
This Phase 1 dataset comprises **224 application-level document sets** for drugs approved via the FDA's Accelerated Approval (AA) pathway. The application IDs were automatically extracted from the official FDA document "CDER Drug and Biologic Accelerated Approvals Based on a Surrogate Endpoint As of December 31, 2024".
Documents were then retrieved from the FDA's Drugs@FDA database, focusing on publicly available files.
Each document set is centered on a specific Application Number (e.g., BLA 761381) and contains the following key documents and their raw text content:
Document Types Included (Downloaded and Text Extracted):
* **Approval Letters:** Official regulatory decision documents (PDF).
* **Draft Labeling (USPI):** Product indication and efficacy claims (PDF).
* **Summary Review:** Often contains statistical information and context (CFM/HTML).
* **180-Day Progress Report:** Dedicated document for postmarketing requirement updates (CFM/HTML).
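The dataset description does not say how text is extracted from the CFM/HTML pages. As a minimal sketch of one possible approach, the snippet below strips tags with Python's standard-library `html.parser`. The class name `_TextCollector` and the function `html_to_text` are hypothetical names used only for illustration, and PDF documents would need a separate PDF library that is not shown here.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents.

    Hypothetical helper: not taken from the dataset's own script.
    """

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML page, one text node per line."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

The result could then be written next to the source file as a `.txt` file, matching the raw text content described in this dataset.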
Extracted Information:
* **Raw Text Content:** For each downloaded document, its full text content is extracted and saved as a `.txt` file. This raw text serves as the primary input for subsequent, more sophisticated signal extraction.
* **Basic Statistical Signals (from combined text):** The current script performs a basic extraction of confidence intervals, p-values, and mentions of statistical methods from the combined text of all documents for an application.
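The basic signal extraction described above could be approximated with regular expressions. The sketch below is not the dataset's actual script: the patterns, the `METHOD_TERMS` list, and the function name `extract_basic_signals` are all assumptions chosen for illustration, and they will miss many phrasings found in real review documents.

```python
import re

# Assumed patterns: simple forms such as "p < 0.001" or "p-value = 0.03".
P_VALUE_RE = re.compile(
    r"\bp\s*(?:-\s*value)?\s*([<>=≤])\s*(0?\.\d+)", re.IGNORECASE
)
# Assumed patterns: forms such as "95% CI: 35.2, 55.1" or "95% CI (1.2 to 3.4)".
CI_RE = re.compile(
    r"(\d{2}(?:\.\d+)?)\s*%\s*(?:CI|confidence\s+interval)\s*[:,]?\s*[\(\[]?\s*"
    r"(-?\d+(?:\.\d+)?)\s*(?:,|to|-|–)\s*(-?\d+(?:\.\d+)?)",
    re.IGNORECASE,
)
# Hypothetical, non-exhaustive list of method names to look for.
METHOD_TERMS = (
    "Kaplan-Meier",
    "Cox proportional hazards",
    "log-rank",
    "Clopper-Pearson",
)


def extract_basic_signals(text: str) -> dict:
    """Extract p-values, confidence intervals, and method mentions from text."""
    p_values = [
        {"operator": op, "value": float(val)}
        for op, val in P_VALUE_RE.findall(text)
    ]
    intervals = [
        {"level": float(level), "lower": float(lo), "upper": float(hi)}
        for level, lo, hi in CI_RE.findall(text)
    ]
    lowered = text.lower()
    methods = [term for term in METHOD_TERMS if term.lower() in lowered]
    return {
        "p_values": p_values,
        "confidence_intervals": intervals,
        "methods": methods,
    }
```

Applied to the combined text of one application, the returned dictionary can be serialized with `json.dumps` to produce a per-application signal file.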
Purpose and Potential Applications
This dataset, in its current form, is a starting point for research into automated regulatory intelligence. Its primary purpose is to provide a machine-readable corpus of FDA Accelerated Approval documents. The raw data supports:
* Developing Natural Language Processing (NLP) models to extract detailed regulatory signals.
* Training machine learning (ML) or quantum classification algorithms to predict Accelerated Approval outcomes.
* Analyzing the evolution of regulatory language and requirements over time.
The dataset is particularly valuable for researchers in clinical informatics, regulatory science, and health policy who seek to build transparent, auditable models for analyzing FDA decision-making.
Access and Usage
Data Format: The dataset is organized into a directory (`fda_documents_v2`) where each application's documents (PDFs, CFMs/HTMLs) and their corresponding extracted `.txt` files are stored. The extracted basic signals are stored in `.json` files.
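A short loader can illustrate how such a directory might be read. The exact layout is not specified here, so this sketch assumes one subdirectory per application under `fda_documents_v2`, with the `.txt` and `.json` files placed directly inside it. The function name `load_corpus` is hypothetical, and the paths should be adjusted if the real layout differs.

```python
import json
from pathlib import Path


def load_corpus(root):
    """Load extracted text and signal files, keyed by application directory.

    Assumes <root>/<application>/*.txt and <root>/<application>/*.json,
    which is a guess at the layout rather than a documented structure.
    """
    corpus = {}
    for app_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        texts = {
            f.stem: f.read_text(encoding="utf-8", errors="replace")
            for f in sorted(app_dir.glob("*.txt"))
        }
        signals = {
            f.stem: json.loads(f.read_text(encoding="utf-8"))
            for f in sorted(app_dir.glob("*.json"))
        }
        corpus[app_dir.name] = {"texts": texts, "signals": signals}
    return corpus
```

For example, `load_corpus("fda_documents_v2")` would return a dictionary whose keys are the application directory names, each holding the raw text and any extracted signals for that application.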
Created:
2025-08-26



