
NationalLibraryOfScotland/medical-history-of-british-india

Source: Hugging Face. Updated 2025-08-13; indexed 2026-01-03.
Download link:
https://hf-mirror.com/datasets/NationalLibraryOfScotland/medical-history-of-british-india
Resource description:
---
dataset_info:
  features:
  - name: document_id
    dtype: string
  - name: page_number
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  - name: alto_xml
    dtype: string
  - name: has_image
    dtype: bool
  - name: has_alto
    dtype: bool
  splits:
  - name: train
    num_bytes: 17746410153.036
    num_examples: 120903
  download_size: 12404732133
  dataset_size: 17746410153.036
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc0-1.0
task_categories:
- text-generation
language:
- en
tags:
- lam
- glam
- history
pretty_name: 'A Medical History of British India Dataset '
size_categories:
- 10K<n<100K
---

# A Medical History of British India Dataset

## Dataset Description

This dataset contains digitised official publications documenting medical research and public health in British India from 1850 to 1950. The collection represents a crucial period in medical history, capturing the transition from humoral to biochemical medical traditions and documenting major breakthroughs in bacteriology, parasitology, and vaccine development. These documents provide invaluable insights into colonial medical surveillance systems and the evolution of public health policies in British India.

### Dataset Summary

- **Source**: [National Library of Scotland - A Medical History of British India](https://data.nls.uk/data/digitised-collections/a-medical-history-of-british-india/)
- **Time Period**: 1850-1950
- **Format**: Image-text pairs with hand-corrected OCR output
- **Processing**: Converted from ALTO XML and JPG images to Hugging Face dataset format
- **Contents**: Official medical reports, disease histories, maps, and statistics from British India
- **Size**: 117,022 ALTO XML files, 120,903 image files, 22.5 million words
- **DOI**: https://doi.org/10.34812/2w0t-3f08

## Dataset Structure

### Data Fields

Each record in the dataset contains the following fields:

- `document_id` (string): Unique identifier for the medical document
- `page_number` (int): Sequential page number within the document
- `file_identifier` (string): Original file identifier from the NLS dataset
- `image` (image): Scanned image of the document page
- `text` (string): OCR-extracted text from the page
- `alto_xml` (string): Raw ALTO XML containing detailed OCR information
- `has_image` (bool): Whether the page has an associated image file
- `has_alto` (bool): Whether the page has associated ALTO OCR data
- `document_metadata` (string): Full metadata description from inventory
- `has_metadata` (bool): Whether metadata is available for this document
- `topic` (string): Medical topic or subject of the document
- `year` (string): Year of publication (extracted from metadata)
- `reference` (string): Document reference code (e.g., "IP/HA.2")
- `disease_focus` (string): Primary disease discussed (if applicable)

### Data Statistics

- **Total documents**: 468 unique medical reports and publications
- **Page distribution**: ~97% of pages have both images and OCR text
- **Geographic coverage**: Multiple regions across British India
- **Notable medical figures**: Documents the work of Sir Ronald Ross and other prominent researchers

## Usage

### Loading the Dataset

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("NationalLibraryOfScotland/medical-history-of-british-india")

# Access the data. The front-matter feature list omits `topic`, `year`, and
# `disease_focus`, so .get() is used here to avoid a KeyError if they are absent.
for example in dataset['train']:
    print(f"Document: {example['document_id']}")
    print(f"Topic: {example.get('topic')}")
    print(f"Year: {example.get('year')}")
    print(f"Disease focus: {example.get('disease_focus')}")
    print(f"Text preview: {(example.get('text') or '')[:200]}...")
    break
```

## Dataset Creation

### Source Data

The original data comes from the National Library of Scotland's comprehensive digitisation project of official medical publications from British India. These documents include:

- Annual medical reports from various provinces
- Special investigations into disease outbreaks
- Statistical compilations of mortality and morbidity
- Research papers on tropical diseases
- Maps showing disease distribution

### Processing Pipeline

1. **Original Format**: The source data consists of:
   - High-resolution JPG scans of medical documents
   - ALTO XML files containing OCR output with cleaned-up text
   - METS metadata files with page ordering information
   - Inventory CSV with detailed document metadata

2. **Conversion Process**: Using the custom `convert_india_papers.py` script:
   - Parsed METS XML files to maintain correct page ordering
   - Extracted medical metadata including disease focus, year, and topic
   - Paired image files with their corresponding ALTO XML
   - Preserved all structural and descriptive metadata
   - Added specialised fields for medical history research

## Considerations for Using this Data

### Historical and Cultural Context

These documents represent colonial-era medical perspectives and should be understood within their historical context:

- Terminology reflects period-specific medical understanding
- Documents may contain colonial-era biases and perspectives
- Geographic names and administrative divisions are from the British colonial period
- Medical theories and treatments described may be outdated

### OCR Quality

The OCR quality is high because the output was hand-corrected, but it varies with:

- Original document preservation state
- Complexity of medical terminology
- Presence of statistical tables and charts
- Mixed languages (English with local terms)

### Dataset context

This item was digitised as part of a project, carried out in 2005 and in 2008-2012, to digitise a collection of official publications from the National Library of Scotland’s India Papers collection. The items range from short reports to multi-volume histories and relate to disease, public health and medical research between 1850 and 1920. The collection was digitised/microfilmed by the National Library of Scotland at its George IV Bridge studio, and by two contractors (UK Archiving, Edinburgh; and Capita Total Document Solutions, Bicester, Oxfordshire). Large fold-outs were digitised by the National Library of Scotland at its George IV Bridge studio, funded by the National Library of Scotland. The OCR was carried out and cleaned up by AEL Data, Chennai, India and Datamatics, Chennai and Mumbai, India.

The collection was digitised as part of the National Library of Scotland’s objective to reach all of Scotland’s communities and to encourage collaborative partnerships. The digitisation and OCR were funded by The Wellcome Trust’s Research Resources in Medical History grant scheme, across five separate grants.

## Additional Information

### Licensing

This dataset is in the **public domain** and free of known copyright restrictions. Users should comply with the National Library of Scotland's terms of use when using this dataset.

### Citation

If you use this dataset, please cite the source dataset:

```bibtex
@misc{nls_india_medical,
  title={A Medical History of British India},
  author={National Library of Scotland},
  year={2019},
  publisher={National Library of Scotland},
  doi={10.34812/2w0t-3f08},
  howpublished={\url{https://data.nls.uk/data/digitised-collections/a-medical-history-of-british-india/}},
}
```

### Acknowledgments

Dataset converted to Hugging Face format by [davanstrien](https://huggingface.co/davanstrien)
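
### Example: reading the `alto_xml` field

The card describes `alto_xml` as raw ALTO XML and flags each page with `has_image` and `has_alto`. The following is a minimal sketch, not part of the original conversion script, of how one might keep only complete pages and rebuild line-level text from the ALTO markup. It assumes the standard ALTO element names `TextLine` and `String` with a `CONTENT` attribute. The helper names `alto_lines` and `complete_pages`, the sample XML, and the `doc-a`/`doc-b` identifiers are hypothetical and stand in for real dataset rows.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip any '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(alto_xml: str) -> list[str]:
    """Return one string per ALTO TextLine, joining the CONTENT of its String elements."""
    root = ET.fromstring(alto_xml)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        words = [
            child.attrib.get("CONTENT", "")
            for child in elem.iter()
            if local_name(child.tag) == "String"
        ]
        if words:
            lines.append(" ".join(words))
    return lines


def complete_pages(records):
    """Yield only records flagged as having both an image and ALTO data."""
    for record in records:
        if record.get("has_image") and record.get("has_alto"):
            yield record


# Synthetic sample standing in for dataset rows (assumption: not real data).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="ANNUAL"/><SP/><String CONTENT="REPORT"/></TextLine>
    <TextLine><String CONTENT="1897"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

sample_rows = [
    {"document_id": "doc-a", "has_image": True, "has_alto": True, "alto_xml": SAMPLE_ALTO},
    {"document_id": "doc-b", "has_image": True, "has_alto": False, "alto_xml": ""},
]

for row in complete_pages(sample_rows):
    print(row["document_id"], alto_lines(row["alto_xml"]))
```

With real data, the same two helpers could be applied to rows from `dataset['train']`. Filtering on `has_alto` first avoids passing an empty string to the XML parser, which would raise a parse error.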
Provider:
NationalLibraryOfScotland