ThaiLLM/med-articles
Source: Hugging Face. Updated 2026-04-21; indexed 2026-05-10.
Download link:
https://hf-mirror.com/datasets/ThaiLLM/med-articles
Description:
---
dataset_info:
  features:
  - name: aid
    dtype: string
  - name: markdown
    dtype: string
  - name: metadata
    struct:
    - name: aid
      dtype: string
    - name: author
      dtype: string
    - name: post_date
      dtype: string
    - name: publish_date
      dtype: string
    - name: published_date
      dtype: string
    - name: tag
      dtype: string
    - name: title
      dtype: string
    - name: topic
      dtype: string
    - name: update_date
      dtype: string
    - name: url
      dtype: string
  - name: source_id
    dtype: string
  splits:
  - name: train
    num_bytes: 217560902
    num_examples: 9081
  download_size: 103049091
  dataset_size: 217560902
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc0-1.0
language:
- th
tags:
- medical
size_categories:
- 1K<n<10K
---
# ThaiLLM Dataset: Medical Articles
This dataset contains Thai medical articles scraped from various online sources. The articles are intended for building Retrieval-Augmented Generation (RAG) systems in the medical domain.
In the ThaiLLM medical project, the articles are then used to generate "facts" (our proposed special case of text chunking) for downstream RAG applications.
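As a rough illustration of the record layout declared in the card's metadata, the following Python sketch loads the train split and builds a compact view of one record. It assumes the Hugging Face `datasets` library is installed and that `source_id` is a top-level field next to `aid`, `markdown`, and `metadata`. The helper names `load_articles` and `summarize` are hypothetical and are not part of the dataset.

```python
from typing import Any, Dict

# Metadata keys as declared in the card's `dataset_info` block.
METADATA_KEYS = (
    "aid", "author", "post_date", "publish_date", "published_date",
    "tag", "title", "topic", "update_date", "url",
)


def load_articles():
    """Load the train split from the Hugging Face Hub (needs network access)."""
    # Imported lazily so that `summarize` works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("ThaiLLM/med-articles", split="train")


def summarize(record: Dict[str, Any], preview_chars: int = 80) -> Dict[str, Any]:
    """Return a compact view of one record for quick inspection.

    Assumes `source_id` sits at the top level of the record.
    """
    metadata = record.get("metadata") or {}
    return {
        "aid": record["aid"],
        "source_id": record["source_id"],
        "title": metadata.get("title"),
        "url": metadata.get("url"),
        "preview": record["markdown"][:preview_chars],
    }
```

With network access, `summarize(load_articles()[0])` would show the first article's identifiers, title, URL, and the opening characters of its Markdown body.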
## Background
ThaiLLM is a government-funded project that aims to train a Thai Large Language Model (LLM) and a medical-domain variant of it (we refer to the medically adapted ThaiLLM as Med-ThaiLLM). One of the core features of Med-ThaiLLM is support for RAG over medical articles, so our scope also includes curating RAG instruction-following data.
This dataset serves as the source from which facts are extracted for use in downstream RAG applications.
## Data Sources
These articles were scraped from the following sources:
| Name | Source ID | URL |
|------|-----------|-----|
| WebMD | `webmd_disease` | https://www.webmd.com |
| Haamor | `haamor_article` | https://www.haamor.com |
| Bumrungrad | `bumrungrad_article` | https://www.bumrungrad.com |
| Synphaet | `synpad_article` | https://www.synphaet.co.th |
| HDmall | `hdmall_faq` | https://hdmall.co.th |
| Bangkok Mental Health Hospital | `bangkok_mental_hospital_article` | https://bangkokmentalhealthhospital.com |
| Canesten | `canesten_article` | https://www.canesten.co.th |
| Medscape | `medscape_drug_disease` | https://emedicine.medscape.com |
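The source IDs in the table above can be used to group or filter articles. The sketch below is one possible way to do this in Python. It assumes each record is a dict with a top-level `source_id` string, as the card's metadata suggests. The names `SOURCE_URLS`, `count_by_source`, and `unknown_sources` are hypothetical helpers, not part of the dataset.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List

# Source IDs and site URLs copied from the table above.
SOURCE_URLS: Dict[str, str] = {
    "webmd_disease": "https://www.webmd.com",
    "haamor_article": "https://www.haamor.com",
    "bumrungrad_article": "https://www.bumrungrad.com",
    "synpad_article": "https://www.synphaet.co.th",
    "hdmall_faq": "https://hdmall.co.th",
    "bangkok_mental_hospital_article": "https://bangkokmentalhealthhospital.com",
    "canesten_article": "https://www.canesten.co.th",
    "medscape_drug_disease": "https://emedicine.medscape.com",
}


def count_by_source(records: Iterable[Dict[str, Any]]) -> Counter:
    """Count how many records carry each `source_id`."""
    return Counter(record["source_id"] for record in records)


def unknown_sources(records: Iterable[Dict[str, Any]]) -> List[str]:
    """List source IDs that appear in the data but not in the table."""
    seen = {record["source_id"] for record in records}
    return sorted(seen - set(SOURCE_URLS))
```

Running `unknown_sources` over the loaded split is a quick check that the table and the data agree. A non-empty result would point to a source ID the table does not list.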
## License
This dataset is released under CC0 because the data was scraped from the internet without requesting official permission from the sources.
Note, however, that the [facts extracted from these articles](https://huggingface.co/datasets/ThaiLLM/med-facts) are released under the MIT License.
## Acknowledgement
We sincerely appreciate the generous support of the Ministry of Digital Economy and Society, whose funding made this project possible. We are also grateful for the invaluable collaboration with VISTEC and the Big Data Institute (BDI), which was crucial in bringing this project to fruition.
Provider:
ThaiLLM



