
Comprehensive Personal Health Report Dataset

DataCite Commons | Updated 2026-03-06 | Indexed 2026-05-05
Download link:
https://www.scidb.cn/detail?dataSetId=3b92b36bb51b4942823652f43ff07e9a
Resource description:
The original corpus of this dataset is derived from real personal health examination reports from a hospital in Zhanjiang, Guangdong. The research team collected it collaboratively in strict compliance with medical data anonymization standards. The dataset is built on an initial set of 5,420 real question-and-answer pairs covering examination and diagnosis results.

During data processing and generation, the team used AI-based methods to scale up the data. Because the original examination reports tend to follow templates, the team deployed a BERT model, chosen for its bidirectional semantic understanding, to perform sequence labeling and entity extraction. The environment was Python 3.11, PyTorch 2.3.1, and an NVIDIA V100 GPU. To preserve medical semantics, a manually guided segmentation strategy was introduced: rather than using fixed-size blocks, the text is split according to the intervals over which disease indicators are distributed, which avoids separating examination items from the diagnostic reasoning that depends on them. With this strategy, the number of records grew from the initial 5,420 to 15,419. The ratio of original examination reports to numerically augmented question-and-answer pairs was set to 1:2, which the authors describe as optimal, with the aim of improving both the model's numerical reasoning and its long-text coherence.

The dataset is stored in a structured table format, and the raw data contains 352,454 records in total.
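The description does not say how the variable-length segmentation is implemented. As a minimal sketch of the general idea only, the following Python packs whole examination records into chunks under a soft character budget and never splits a record. The function name `chunk_records`, the `max_chars` parameter, and the greedy packing rule are all assumptions made for illustration; the team's manually guided strategy uses indicator distribution intervals that are not specified here.

```python
def chunk_records(records, max_chars=200):
    """Greedily pack whole records into chunks of roughly max_chars characters.

    Hypothetical helper: a record is never split, so an examination item
    stays together with its own conclusion. A record longer than max_chars
    becomes a chunk on its own.
    """
    chunks, current, size = [], [], 0
    for rec in records:
        # Account for the newline that joins this record to the previous one.
        added = len(rec) + (1 if current else 0)
        if current and size + added > max_chars:
            # Close the current chunk at a record boundary.
            chunks.append("\n".join(current))
            current, size = [], 0
            added = len(rec)
        current.append(rec)
        size += added
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Chunk sizes vary with the records they contain, which is the property the description contrasts with fixed-size blocking.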
The column labels in the data table have the following meanings. The "medical examination number" is the de-identified identifier of the examinee. The "examination item name" covers specific medical items such as skin, thyroid, and fasting blood glucose. "ExamItemVa" records the quantitative measurement value (such as 14.0 × 10⁹/L) or qualitative description (such as "normal") of each item. "ExamItemCc" is the preliminary clinical conclusion for a single examination item. The "overall inspection conclusion" is the final generated comprehensive diagnostic text. Measurement units follow clinical conventions, including but not limited to mmol/L (blood glucose indicators), 10⁹/L (white blood cell count), % (neutrophil ratio), and kPa (pulse oxygen partial pressure).

Regarding data quality and distribution, the "ExamItemVa" field contains some NULL values, mainly for physical examination items such as skin, limbs, and joints. A NULL here means the item has only a qualitative judgment and no quantitative value, which is consistent with clinical practice rather than a data defect. In addition, because the original medical text is unstructured, the semantic parsing of some complex descriptions may deviate slightly. The dataset has nevertheless undergone factual verification through an adaptive retrieval-augmentation mechanism intended to ensure the accuracy and scientific soundness of the data overall.
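Since "ExamItemVa" mixes numeric readings, qualitative text, and NULLs, a consumer of the table will probably want to separate the three cases before any numeric analysis. The sketch below shows one possible way to do that with the Python standard library. The sample rows are invented for illustration and are not taken from the dataset, the key `ExamItemName` is a hypothetical stand-in for the "examination item name" column, and treating any value that starts with a number as quantitative is an assumption.

```python
import re
from collections import Counter

# Illustrative rows only; "ExamItemName" is a hypothetical column key.
ROWS = [
    {"ExamItemName": "white blood cell count", "ExamItemVa": "14.0 × 10⁹/L"},
    {"ExamItemName": "fasting blood glucose", "ExamItemVa": "5.6 mmol/L"},
    {"ExamItemName": "thyroid", "ExamItemVa": "normal"},
    {"ExamItemName": "skin", "ExamItemVa": None},
]

# Assumption: a quantitative value begins with an optionally signed number.
_LEADING_NUMBER = re.compile(r"^\s*[-+]?\d+(?:\.\d+)?")


def classify_value(value):
    """Return 'missing', 'quantitative', or 'qualitative' for an ExamItemVa cell."""
    if value is None or not value.strip() or value.strip().upper() == "NULL":
        return "missing"
    if _LEADING_NUMBER.match(value):
        return "quantitative"
    return "qualitative"


def summarize(rows):
    """Count how many rows fall into each ExamItemVa category."""
    return Counter(classify_value(row.get("ExamItemVa")) for row in rows)
```

Calling `summarize(ROWS)` returns a `Counter` keyed by category, which makes it easy to confirm that the missing values cluster in physical examination items as described.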
Provider:
Science Data Bank
Created:
2026-03-06