ThaiLLM/med-facts
Source: Hugging Face. Updated 2025-07-23; indexed 2026-05-10.
Download link:
https://hf-mirror.com/datasets/ThaiLLM/med-facts
Description:
---
license: mit
dataset_info:
  features:
  - name: fact_id
    dtype: string
  - name: text
    dtype: string
  - name: validation
    struct:
    - name: grounded
      dtype: bool
    - name: subfacts
      list:
      - name: supporting_lines
        list: string
      - name: text
        dtype: string
  - name: source_id
    dtype: string
  splits:
  - name: train
    num_bytes: 108880814
    num_examples: 83237
  download_size: 45112187
  dataset_size: 108880814
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# ThaiLLM Dataset: Medical Facts
This dataset contains the facts extracted from [medical articles scraped online](https://huggingface.co/datasets/ThaiLLM/med-articles).
The facts were extracted using `o4-mini` and then validated using `o4-mini` with a different prompt.
We also provide [another dataset that assesses the validity of our fact extraction pipeline](https://huggingface.co/datasets/ThaiLLM/med-fact-verification).
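
As a minimal sketch of how records with the features declared in the card metadata might be handled, the helpers below filter on `validation.grounded` and flatten the `supporting_lines` of each subfact. The nesting is an assumption read from the feature list (in particular, that `source_id` is a top-level field), the helper names are hypothetical, and the sample rows are invented for illustration rather than taken from the dataset. The card does not say whether the released split still contains rows with `grounded` set to false.

```python
from typing import Any, Dict, Iterable, List


def grounded_only(records: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep records whose validation marks the fact as grounded."""
    return [
        r for r in records
        if (r.get("validation") or {}).get("grounded") is True
    ]


def supporting_lines(record: Dict[str, Any]) -> List[str]:
    """Flatten the supporting lines of every subfact in one record."""
    validation = record.get("validation") or {}
    lines: List[str] = []
    for subfact in validation.get("subfacts") or []:
        lines.extend(subfact.get("supporting_lines") or [])
    return lines


# Invented sample rows (NOT real dataset content), shaped like the declared features.
sample = [
    {
        "fact_id": "demo-1",
        "text": "placeholder fact",
        "validation": {
            "grounded": True,
            "subfacts": [
                {"text": "part one", "supporting_lines": ["line a", "line b"]},
                {"text": "part two", "supporting_lines": ["line c"]},
            ],
        },
        "source_id": "article-1",
    },
    {
        "fact_id": "demo-2",
        "text": "another placeholder fact",
        "validation": {"grounded": False, "subfacts": []},
        "source_id": "article-1",
    },
]

kept = grounded_only(sample)
print([r["fact_id"] for r in kept])
print(supporting_lines(sample[0]))
```

With the `datasets` library installed, the same helpers should apply to rows obtained from `load_dataset("ThaiLLM/med-facts", split="train")`, provided the rows match the assumed nesting.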
## Fact Extraction Process
Given a scraped article (see the [source articles dataset](https://huggingface.co/datasets/ThaiLLM/med-articles)), we extract facts using the following procedure:
1. Prompt `o4-mini` with the article to extract 4-5 facts from it.
2. Given the extracted facts from (1) and the source article, prompt `o4-mini` again with a different prompt and remove any fact that the LLM flags as not grounded in the article. The goal is to remove facts that are hallucinated or unsupported by the source. (We measure the reliability of this `o4-mini` verification step against human judgment in [this dataset](https://huggingface.co/datasets/ThaiLLM/med-fact-verification).)
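
The two steps above can be sketched as an extract-then-filter function. This is only an outline of the control flow: the actual prompts and model calls are not given in this card, so `extract_fn` and `is_grounded_fn` are hypothetical callables standing in for the two `o4-mini` prompts, and the toy implementations below use plain string operations purely so the example runs.

```python
from typing import Callable, List


def extract_and_filter(
    article: str,
    extract_fn: Callable[[str], List[str]],
    is_grounded_fn: Callable[[str, str], bool],
) -> List[str]:
    """Step 1: extract candidate facts. Step 2: drop facts flagged as not grounded."""
    candidates = extract_fn(article)
    return [fact for fact in candidates if is_grounded_fn(article, fact)]


# Toy stand-ins for the two LLM prompts (assumptions, not the authors' method).
def toy_extract(article: str) -> List[str]:
    # Treat each sentence as a candidate fact, then add one unsupported
    # statement to imitate a hallucinated extraction.
    sentences = [s.strip() for s in article.split(".") if s.strip()]
    return sentences + ["The article was published in 1990"]


def toy_is_grounded(article: str, fact: str) -> bool:
    # A real verifier would prompt a model; here a fact counts as grounded
    # only if it appears verbatim in the article.
    return fact in article


article = "Fever is a common symptom. Rest and fluids help recovery."
facts = extract_and_filter(article, toy_extract, toy_is_grounded)
print(facts)
```

Passing the two steps in as callables keeps the filtering logic independent of any particular model client, so the stand-ins can be swapped for real prompt calls without changing the function.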
## License
This dataset is provided under the MIT License.
## Acknowledgement
We sincerely appreciate the generous support of the Ministry of Digital Economy and Society, whose funding made this project possible. We are also grateful for the invaluable collaboration with VISTEC and the Big Data Institute (BDI), which was crucial in bringing this project to fruition.
Provider:
ThaiLLM



