Data for ICH (Qingyang sachet) knowledge graph
Source: Science Data Bank | Updated: 2024-12-06 | Indexed: 2026-04-23
Download link:
https://www.scidb.cn/detail?dataSetId=363988e18a624388bf0afae7746689a9
Description:
This dataset contains textual and image data on Qingyang sachets, together with the annotated data used to construct the Qingyang sachet knowledge graph. Physical documents were first digitized with optical character recognition (OCR), which converted the scanned images into text. Text extraction used PP-ChatOCRv3, developed by Baidu PaddlePaddle. This tool combines large language models (LLMs) with OCR technology and is trained on Chinese-language data to improve recognition accuracy for Chinese text. After recognition, the textual data were stored in a database. The data were then preprocessed to reduce noise and interference and to enhance the contrast and overall quality of the images. Data cleaning involved deduplication, standardizing format differences (e.g., date formats), and filtering out irrelevant records according to predefined criteria. Errors encountered during preprocessing were corrected manually with the help of the OCR software's built-in text correction function, which improved the accuracy of the foundational data. Finally, data on the patterns, types, and content of Qingyang sachets were extracted manually and saved as two-dimensional tables. Encyclopedic and official website resources were gathered directly through web scraping, then cleaned and stored.

In addition to the textual data, a field survey was conducted in Qingyang City, Gansu Province, from May 2023 to June 2024. During this period, we visited the seven counties and one district of Qingyang and used professional equipment to capture original images of the sachets. In the image preprocessing stage, we manually filtered out low-quality images based on the completeness and clarity of the patterns. Photoshop was then used to extract individual pattern samples from the original images, with the pattern boundaries serving as the cropping area.
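The text-cleaning steps named in the description (deduplication, date standardization, and filtering of irrelevant records) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the record fields `text` and `date`, the list of date layouts, and the keyword-based relevance filter are all hypothetical, since the description does not specify the actual schema or criteria.

```python
from datetime import datetime

# Hypothetical date layouts; the description does not list the real ones.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y.%m.%d", "%Y年%m月%d日")


def normalize_date(raw):
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if no layout matches."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_records(records, keyword="香包"):
    """Standardize dates, drop irrelevant records, and remove duplicates.

    `keyword` stands in for the unspecified "predefined criteria":
    a record is kept only if its text mentions the keyword.
    """
    seen = set()
    cleaned = []
    for rec in records:
        # Collapse runs of whitespace so near-identical OCR lines compare equal.
        text = " ".join(rec.get("text", "").split())
        if keyword not in text:
            continue  # assumed relevance filter
        date = normalize_date(rec.get("date", ""))
        key = (text, date)
        if key in seen:
            continue  # exact duplicate after normalization
        seen.add(key)
        cleaned.append({"text": text, "date": date})
    return cleaned
```

Normalizing whitespace and dates before deduplicating means that two records differing only in formatting collapse into one, which matches the order of concerns in the description.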
Provider:
Lanzhou University
Created:
2024-12-06



