Kornimate/medical-research-clean
Source: Hugging Face | Updated: 2025-11-27 | Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/Kornimate/medical-research-clean
Description:
---
license: mit
dataset_info:
  features:
  - name: nct_id
    dtype: string
  - name: brief_title_clean
    dtype: string
  - name: brief_summary_clean
    dtype: string
  - name: detailed_description_clean
    dtype: string
  - name: eligibility_criteria_clean
    dtype: string
  - name: keywords_clean
    dtype: string
  - name: mesh_terms_clean
    dtype: string
  - name: condition_browse_module_clean
    dtype: string
  - name: intervention_browse_module_clean
    dtype: string
  - name: conditions
    list: string
  - name: interventions
    dtype: 'null'
  - name: combined_text
    dtype: string
  - name: text_len
    dtype: int64
  splits:
  - name: train
    num_bytes: 3499875492
    num_examples: 479038
  download_size: 1707293695
  dataset_size: 3499875492
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
language:
- en
---
This dataset is a reformatted version of the ```louisbrulenaudet/clinical-trials``` dataset.
It contains the following features:
- (#1) ```nct_id``` is the unique ID of each study
- (#2) ```brief_title_clean``` is the *cleaned* version of the original ```brief_title``` feature, in plain text format
- (#3) ```brief_summary_clean``` is the *cleaned* version of the original ```brief_summary``` feature, in plain text format
- (#4) ```detailed_description_clean``` is the *cleaned* version of the original ```detailed_description``` feature, in plain text format
- (#5) ```eligibility_criteria_clean``` is the *cleaned* version of the original ```eligibility_criteria``` feature, in plain text format
- (#6) ```keywords_clean``` is the *normalized* version of the original ```keywords``` feature, in plain text format
- (#7) ```mesh_terms_clean``` is the *cleaned* version of the original ```mesh_terms``` feature, in plain text format
- (#8) ```condition_browse_module_clean``` is the *cleaned* version of the original ```condition_browse_module``` feature, in plain text format
- (#9) ```intervention_browse_module_clean``` is the *cleaned* version of the original ```intervention_browse_module``` feature, in plain text format
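- (#10) ```conditions``` is the *cleaned* version of the original ```conditions``` feature, stored as a list of plain-text strings
- (#11) ```interventions``` is the *cleaned* version of the original ```interventions``` feature, in plain text format (the metadata above declares this column's dtype as ```'null'```)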
- (#12) ```combined_text``` is the concatenation of **#1 - #8**, with stopwords removed and the remaining words lemmatized
- (#13) ```text_len``` is the length of the text in **#12**
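
To inspect these features, a minimal sketch along the following lines may be enough. It assumes the Hugging Face `datasets` library is installed and that the repository is reachable; `iter_rows` and `preview` are hypothetical helper names introduced here for illustration, not part of the dataset.

```python
from typing import Any, Dict, Iterator

# Repository ID taken from the page title.
DATASET_ID = "Kornimate/medical-research-clean"


def iter_rows(limit: int = 3) -> Iterator[Dict[str, Any]]:
    """Stream the first `limit` rows of the train split without a full download."""
    # Imported lazily so that `preview` below works without the library installed.
    from datasets import load_dataset

    stream = load_dataset(DATASET_ID, split="train", streaming=True)
    for i, row in enumerate(stream):
        if i >= limit:
            break
        yield row


def preview(row: Dict[str, Any], width: int = 80) -> Dict[str, Any]:
    """Return a compact view of one record: ID, title, length, truncated text."""
    return {
        "nct_id": row["nct_id"],
        "title": row["brief_title_clean"],
        "text_len": row["text_len"],
        "combined_text": row["combined_text"][:width],
    }
```

Calling `preview(row)` on each row yielded by `iter_rows()` would print a short summary per study. Streaming is used here because the download is roughly 1.7 GB according to the metadata.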
The term *cleaned* means the following transformations:
- if it was originally a plain text feature: HTML tags were removed, the text was casefolded, and trailing whitespace was removed
- if it was originally a structured text feature: only the keys ```["meshes","browseLeaves","browseBranches","ancestors","conditions","interventions"]``` were kept, only existing and valid terms were retained, the text content was lowercased and stripped/trimmed, and the terms were rejoined with a space delimiter
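
The two cleaning paths could look roughly like the sketch below. This is not the author's code: the regex-based tag removal, the assumption that structured entries are lists of strings or of dicts carrying a `term` or `name` field, and the function names `clean_plain_text` and `clean_structured` are all illustrative guesses.

```python
import re
from typing import Any, List

# Naive HTML tag pattern (assumption: the source does not say how tags were removed).
_TAG_RE = re.compile(r"<[^>]+>")

# Keys listed in the description above.
STRUCTURED_KEYS = [
    "meshes", "browseLeaves", "browseBranches",
    "ancestors", "conditions", "interventions",
]


def clean_plain_text(text: Any) -> str:
    """Remove HTML tags, casefold, and drop trailing whitespace."""
    if not isinstance(text, str):
        return ""
    return _TAG_RE.sub("", text).casefold().rstrip()


def clean_structured(module: Any) -> str:
    """Collect valid terms under the selected keys and join them with spaces."""
    if not isinstance(module, dict):
        return ""
    terms: List[str] = []
    for key in STRUCTURED_KEYS:
        items = module.get(key)
        if not isinstance(items, list):
            continue  # key missing or not a list: nothing to collect
        for item in items:
            # Assumed layout: either a bare string or a dict with "term"/"name".
            if isinstance(item, dict):
                value = item.get("term") or item.get("name")
            else:
                value = item
            if isinstance(value, str) and value.strip():
                terms.append(value.lower().strip())
    return " ".join(terms)
```

Entries that are missing, empty, or not strings are skipped, which is one possible reading of "existing and valid terms".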
The term *normalized* means that the content was split on whitespace, casefolded, trimmed/stripped, and rejoined with a space delimiter.
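
As a rough illustration of that normalization (a sketch only; `normalize` is a hypothetical name, and the original pipeline may differ in details):

```python
def normalize(text: str) -> str:
    """Split on whitespace, casefold and strip each token, rejoin with spaces."""
    tokens = (text or "").split()
    return " ".join(token.casefold().strip() for token in tokens)


# Runs of whitespace collapse to single spaces, and casefolding is more
# aggressive than lowercasing (for example, "ß" becomes "ss").
print(normalize("  Heart   FAILURE\tStraße "))  # heart failure strasse
```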
Provider:
Kornimate



