anonymousubmission/data_continual_pretraining
Source: Hugging Face | Updated: 2025-10-24 | Indexed: 2025-11-15
Download link:
https://hf-mirror.com/datasets/anonymousubmission/data_continual_pretraining
Description:
The dataset consists of two parts: clinical and scientific. The clinical part contains text data from the emergency department. The scientific part contains scientific text data from multiple sources: common crawl medical data, drug instructions, Wikipedia, the e3c database, Web Hose AZ, theses, PubMed, supplement descriptions, medical websites, UniPD theses, and other sources. Each part lists its features, such as text chunk, source file, chunk type, language, ID, cleaning status and reason, and word count.
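The features listed in the description can be pictured as one record per text chunk. The following is a minimal Python sketch of such a record and of a simple filter over it. Every field name (`text_chunk`, `cleaning_status`, and so on), the status values `"kept"` and `"removed"`, and the whitespace-based word count are assumptions made for illustration; the actual column names and cleaning rules of the dataset are not given here and may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    """One record mirroring the features named in the description.

    All field names are hypothetical; the real columns may be named differently.
    """
    id: str
    text_chunk: str
    source_file: str
    chunk_type: str
    language: str
    cleaning_status: str            # assumed values: "kept" or "removed"
    cleaning_reason: Optional[str]  # why the status was assigned, if recorded
    n_words: int


def count_words(text: str) -> int:
    """Whitespace-based word count (an assumption; the dataset's rule is not documented)."""
    return len(text.split())


def make_chunk(id: str, text: str, source_file: str, chunk_type: str,
               language: str, cleaning_status: str = "kept",
               cleaning_reason: Optional[str] = None) -> Chunk:
    """Build a record and fill in its word count from the text."""
    return Chunk(id, text, source_file, chunk_type, language,
                 cleaning_status, cleaning_reason, count_words(text))


def keep_for_pretraining(chunks: List[Chunk], min_words: int = 5) -> List[Chunk]:
    """Keep records that passed cleaning and contain at least `min_words` words."""
    return [c for c in chunks
            if c.cleaning_status == "kept" and c.n_words >= min_words]
```

A filter of this kind would be one way to select usable chunks before continual pretraining; the `min_words` threshold is arbitrary and only illustrates how the word-count feature might be used.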
Provider:
anonymousubmission



