Nigeria-Health-data-OCR-pipeline/African-Medical-Records

Source: Hugging Face. Updated 2026-03-26; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Nigeria-Health-data-OCR-pipeline/African-Medical-Records
Description:

---
license: cc-by-4.0
task_categories:
- feature-extraction
language:
- en
tags:
- medical
---

# African Medical Records (AMR): Nigerian Handwritten Medical Records Dataset

**A Benchmark Dataset for OCR, Clinical Text Extraction, and Healthcare Insight Modeling**

---

## Overview

African Medical Records (AMR) is a large-scale, growing dataset designed to support **optical character recognition (OCR)**, **medical text extraction**, and **healthcare data analysis** from handwritten clinical records.

The dataset consists of **paired handwritten medical notes and their corresponding digital ground truth**, enabling robust benchmarking and training of AI systems for real-world healthcare environments.

This initiative is driven by a distributed network of **107+ contributors (and growing)** across **40 universities in Nigeria**, ensuring diversity in handwriting styles, formats, and regional medical practices.

> The current batch features **13 contributors** from this volunteer pool, with additional contributors submitting datasets on a **rolling basis**.

Our long-term vision is to build **the largest African medical handwriting dataset**, expanding across regions in Nigeria and eventually across Africa.
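
Because each handwritten note is paired with a digital ground truth, OCR output can be scored against it with character error rate (CER) and word error rate (WER), the two metrics this card lists for benchmarking. The following is a minimal sketch, assuming plain-string transcripts and a standard Levenshtein edit distance; the function names are hypothetical, and the AMR project may normalize text differently before scoring.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(truth, ocr):
    """Character error rate: character edits divided by ground-truth length."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(truth, ocr) / len(truth)


def wer(truth, ocr):
    """Word error rate: word edits divided by ground-truth word count."""
    ref_words = truth.split()
    if not ref_words:
        raise ValueError("ground truth must contain at least one word")
    return edit_distance(ref_words, ocr.split()) / len(ref_words)
```

For instance, `wer("take two tablets daily", "take two tablet daily")` gives `0.25`, since one of the four reference words differs. The sample strings are made up for illustration and are not taken from the dataset.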

---

## Project Goal

The primary goal of AMR is to:

* Build a **realistic and reproducible OCR benchmark** for handwritten medical notes
* Enable **accurate digitization of healthcare records** in low-resource settings
* Support development of **edge-deployable AI systems** for healthcare environments
* Provide a dataset that reflects **clinical usefulness, not just text accuracy**

---

## Dataset Structure

Each contribution follows a standardized format:

### Document Types

Volunteers generate **4 types of medical records**, each with **2 versions**:

| Specialty        | Record Type        |
| ---------------- | ------------------ |
| General Medicine | Patient Visit Note |
| Nursing          | Vital Signs Chart  |
| Pharmacy         | Prescription Note  |
| Laboratory       | Lab Request Form   |

Each record includes:

* **Synthetic Handwritten Version**
* **Digital Ground Truth Version**

This results in:

> **8 documents per contributor**

---

## Dataset Features

* Diverse **handwriting styles across regions and institutions**
* Realistic **clinical abbreviations and formatting**
* Paired **image-to-text ground truth alignment**
* Structured to support:
  * OCR benchmarking (CER, WER)
  * Field-level extraction (drug names, dosage, vitals)
  * Abbreviation recognition
  * Clinical reasoning pipelines

---

## File Naming Convention

To ensure traceability and structure, files follow standardized naming.
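
The naming patterns defined in the two subsections that follow can be checked mechanically. The sketch below is illustrative only: the regular expressions are inferred from the examples in this card (`1-001-syn`, `SUNN-PATIENT-001`), and the reading of the leading `S`/`T` as synthetic versus ground truth, along with the helper names `parse_name` and `find_unpaired`, are assumptions rather than part of an official AMR tool.

```python
import re
from collections import defaultdict

# Assumed patterns, inferred from the examples in this card (not an official spec).
VOLUNTEER_RE = re.compile(r"^(?P<volunteer>\d+)-(?P<doc>\d+)-(?P<kind>syn|truth)$")
PATIENT_RE = re.compile(r"^(?P<prefix>[ST])(?P<site>[A-Z]+)-PATIENT-(?P<num>\d+)$")


def parse_name(stem):
    """Return (pair_key, kind) for a file stem, or None if no pattern matches.

    kind is "syn" or "truth"; pair_key is shared by both halves of a pair.
    """
    m = VOLUNTEER_RE.match(stem)
    if m:
        return ("volunteer", m["volunteer"], m["doc"]), m["kind"]
    m = PATIENT_RE.match(stem)
    if m:
        # Assumption: leading "S" marks synthetic, leading "T" marks ground truth.
        kind = "syn" if m["prefix"] == "S" else "truth"
        return ("patient", m["site"], m["num"]), kind
    return None


def find_unpaired(stems):
    """List pair keys that are missing the synthetic or the ground-truth file."""
    seen = defaultdict(set)
    for stem in stems:
        parsed = parse_name(stem)
        if parsed is not None:
            key, kind = parsed
            seen[key].add(kind)
    return sorted(key for key, kinds in seen.items() if kinds != {"syn", "truth"})
```

Calling `find_unpaired` on a list of file stems (file names without extensions) returns the keys whose synthetic or ground-truth half is missing, which is one way to enforce the rule that every synthetic record has a matching ground truth.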

### Volunteer-Based Naming

```
[VolunteerNumber]-[DocumentNumber]-[TYPE]
```

Example:

```
1-001-syn
1-001-truth
```

---

### Patient-Based Dataset Naming (Recommended)

To improve dataset organization and scalability:

```
[DatasetCode]-PATIENT-[Number]
```

Examples:

* **SUNN-PATIENT-001** → Synthetic dataset from University of Nigeria Nsukka
* **SUIL-PATIENT-001** → Synthetic dataset from University of Ilorin
* **TUNN-PATIENT-001** → Ground truth dataset from University of Nigeria Nsukka

Example sequence:

```
SUNN-PATIENT-001
TUNN-PATIENT-001
SUNN-PATIENT-002
TUNN-PATIENT-002
```

Each synthetic record must have a **corresponding ground truth pair**.

---

## Data Collection Standards

### Document Requirements

* A4 plain white paper
* Blue or black ink
* Natural handwriting (no printed text)
* Proper margins and full-page visibility

### Scan Requirements

**Acceptable:**

* Well-lit images
* Camera directly above the paper
* Fully visible and cropped page
* Clear and readable text

**Not Acceptable:**

* Blurry or dark images
* Tilted or angled captures
* Shadows or obstructions
* Incomplete or poorly cropped scans

---

## Use Cases

This dataset is designed for:

### AI Researchers

* Benchmarking OCR and VLM models
* Evaluating handwriting recognition in low-resource settings
* Training models for structured medical extraction

### Health-Tech Developers

* Building EHR digitization systems
* Developing edge AI for clinics and rural hospitals

### Policy Makers & Public Health Analysts

* Understanding **patterns between diseases and regions**
* Informing **health infrastructure planning and prioritization**
* Supporting data-driven healthcare decisions

---

## Dataset Growth Vision

AMR is a **living dataset**.

* New batches will be added continuously
* Coverage will expand across **regions in Nigeria**
* Long-term goal: scale to **pan-African healthcare datasets**

---

## Ethical Statement

> The records in this dataset are synthetic. Any similarities to medical conditions of members of the public are intended, and no data was obtained from any medical institution.

* No real patient data is included
* All records are **fictional and generated for research purposes**
* Contributors are instructed to avoid any identifiable information

---

## Intended Use

This dataset is for **research purposes only**.

> It should **not be used for clinical diagnosis, treatment, or real-world medical decision-making**.

---

## License

Recommended: **Creative Commons Attribution 4.0 (CC BY 4.0)**

This allows broad usage while ensuring proper attribution to the AMR project and its contributors.

---

## Citation

```bibtex
@dataset{amr_2026,
  title={African Medical Records (AMR): Nigerian Handwritten Medical Records Dataset},
  author={AMR Contributors},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/Nigeria-Health-data-OCR-pipeline/African-Medical-Records}
}
```

---

## Contributors

This dataset is made possible by a distributed network of contributors.

* **107+ contributors (and growing)**
* **40 universities across Nigeria**

### Volunteers

See `CONTRIBUTORS.md` for the full list.

---

## Acknowledgements

We acknowledge all student contributors, institutions, and collaborators supporting the AMR initiative.

---

## Final Note

AMR is not just a dataset; it is **infrastructure for African AI in healthcare**.

Provider:
Nigeria-Health-data-OCR-pipeline