abualmun/MENST
Hugging Face · Updated 2026-03-28 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/abualmun/MENST
Resource description:
---
license: mit
task_categories:
- question-answering
language:
- en
tags:
- medical
- biology
pretty_name: Menstrual Education kNowledge for Support and Training
size_categories:
- 10K<n<100K
---
# Menstrual Education kNowledge for Support and Training (MENST)
MENST is a curated dataset for menstrual health education and support. It is intended as a foundation for fine-tuning language models on question-answering and conversational tasks about menstrual health.
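A minimal sketch of fetching the dataset with the Hugging Face `datasets` library, assuming the repository name shown in the page header (`abualmun/MENST`) and that the library is installed. The helper name `load_menst` is hypothetical, and this card does not document the split or column names, so inspect the returned object before relying on either.

```python
REPO_ID = "abualmun/MENST"  # repository name taken from the page header


def load_menst(repo_id: str = REPO_ID):
    """Download the dataset from the Hugging Face Hub.

    Requires `pip install datasets` and network access. The import is done
    lazily so this module can be imported without the dependency installed.
    """
    from datasets import load_dataset

    # Without a `split` argument, load_dataset returns every available split.
    return load_dataset(repo_id)
```

Calling `load_menst()` and printing the result lists the splits and features actually published, which is the safest way to discover the real column names.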
🔗 Cite us
```
Adhikary P, Motiyani I, Oke G, Joshi M, Pathak K, Singh S, Chakraborty T
Menstrual Health Education Using a Specialized Large Language Model in India: Development and Evaluation Study of MenstLLaMA
J Med Internet Res 2025;27:e71977
URL: https://www.jmir.org/2025/1/e71977
DOI: 10.2196/71977
```
## Dataset Details
### Sources
The MENST dataset was compiled from a variety of reputable sources, including:
- Health information portals
- Medical institutions
- Government websites
- Global organizations
- Educational platforms
FAQs and question-answer pairs were extracted from official medical documents, curated, and enriched using large language models (see Augmentation). Specifically, we incorporated the Menstrual Health Awareness Dataset, which contains 562 QA pairs, and annotated those pairs with metadata to ensure relevance and structure.
### Augmentation
To broaden the dataset's coverage and depth, GPT-4 and Gemini 1.5 Pro were prompted to generate additional QA pairs from relevant menstrual health documents. Domain experts validated the generated content for accuracy, cultural relevance, and empathetic tone.
### Metadata Creation
To streamline data management and provide a detailed catalog of menstrual health topics, metadata was created for every document. Each record includes:
- **Document ID**: Unique identifier starting with ‘D’ (unstructured documents) or ‘F’ (FAQs).
- **Document Name**: Title or heading of the document.
- **Source**: Name of the organization or website.
- **Link**: URL to the document.
- **Keywords**: Keywords describing the questions the document addresses.
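The document-level fields above can be sketched as a small record type. This is an illustration only: the attribute names, and the assumption that an ID is a prefix letter followed by further characters, are not taken from the released files.

```python
from dataclasses import dataclass
from typing import Tuple

# Prefix-to-type mapping, taken from the Document ID description above.
DOCUMENT_TYPES = {"D": "unstructured", "F": "faq"}


@dataclass(frozen=True)
class DocumentMetadata:
    """One catalog entry; attribute names are illustrative, not official."""

    document_id: str
    document_name: str
    source: str
    link: str
    keywords: Tuple[str, ...] = ()

    @property
    def document_type(self) -> str:
        """Classify the document by the first letter of its ID."""
        prefix = self.document_id[:1].upper()
        if prefix not in DOCUMENT_TYPES:
            raise ValueError(f"unexpected document ID prefix: {self.document_id!r}")
        return DOCUMENT_TYPES[prefix]
```

For example, a record with the hypothetical ID `"F3"` reports `document_type == "faq"`, while an ID with any other prefix raises `ValueError` instead of being silently accepted.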
**Metadata Schema for Question-Answer Pairs** (Table 2):
- **Document ID**: Identifier for the source document.
- **Question**: Specific question related to the topic.
- **Answer**: Corresponding answer.
- **Age Group**: Targeted demographic (adolescents, young adults, adults, older adults).
- **Region**: Geographical focus (rural, urban, both).
- **Keywords**: Tags for content (e.g., Medication, Therapy).
- **LLM Used**: Model utilized for post-processing.
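A hedged sketch of how one QA record could be validated against this schema. The allowed values for Age Group and Region come from the list above; the attribute names, the lowercase normalization, and the assumption that each pair carries exactly one age group are illustrative choices, not documented behavior.

```python
from dataclasses import dataclass, field
from typing import List

# Allowed values as listed in the schema above.
AGE_GROUPS = {"adolescents", "young adults", "adults", "older adults"}
REGIONS = {"rural", "urban", "both"}


@dataclass
class QAPair:
    """One question-answer record; attribute names are hypothetical."""

    document_id: str
    question: str
    answer: str
    age_group: str
    region: str
    keywords: List[str] = field(default_factory=list)
    llm_used: str = ""

    def __post_init__(self) -> None:
        # Normalize case and whitespace before checking against the allowed sets.
        self.age_group = self.age_group.strip().lower()
        self.region = self.region.strip().lower()
        if self.age_group not in AGE_GROUPS:
            raise ValueError(f"unknown age group: {self.age_group!r}")
        if self.region not in REGIONS:
            raise ValueError(f"unknown region: {self.region!r}")
```

Validating at construction time means a malformed row fails when it is loaded, not later during fine-tuning.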
### Taxonomy
A taxonomy was developed in collaboration with gynecologists to categorize topics. Primary categories include:
- **Anatomy**
- **Normal Menstruation** (e.g., Menarche, Menopause, Normal Flow)
- **Abnormal Menstruation** (e.g., PCOS, PMS, Irregular Periods)
- **Pregnancy**
- **Lifestyle**
- **Support**
- **Society**
Each category is subdivided into detailed subtopics, ensuring comprehensive coverage.
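The primary categories can be represented as a simple mapping. In this sketch only the subtopics named above are filled in; the empty lists stand for subtopics this card does not enumerate, and the helper name `category_of` is hypothetical.

```python
from typing import Dict, List, Optional

# Primary categories from the taxonomy above, with only the example
# subtopics that this card names explicitly.
TAXONOMY: Dict[str, List[str]] = {
    "Anatomy": [],
    "Normal Menstruation": ["Menarche", "Menopause", "Normal Flow"],
    "Abnormal Menstruation": ["PCOS", "PMS", "Irregular Periods"],
    "Pregnancy": [],
    "Lifestyle": [],
    "Support": [],
    "Society": [],
}


def category_of(subtopic: str) -> Optional[str]:
    """Return the primary category containing `subtopic`, or None if unlisted."""
    wanted = subtopic.strip().casefold()
    for category, subtopics in TAXONOMY.items():
        if wanted in (name.casefold() for name in subtopics):
            return category
    return None
```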
### Question-Answer Pair Creation
The dataset comprises 117 documents:
- 14 FAQ documents (Set-1) from medical portals serve as a gold test set.
- 103 unstructured documents were processed using GPT-4 and Gemini 1.5 Pro to generate QA pairs, which were validated by domain experts.
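Because FAQ documents carry an ‘F’ prefix and unstructured documents a ‘D’ prefix, the gold test set can in principle be separated from the generated pairs by document ID. The sketch below assumes each record is a dictionary with a `document_id` key, which is a hypothetical field name.

```python
from typing import Dict, Iterable, List, Tuple

# Document counts stated above.
FAQ_DOCUMENTS = 14
UNSTRUCTURED_DOCUMENTS = 103
TOTAL_DOCUMENTS = FAQ_DOCUMENTS + UNSTRUCTURED_DOCUMENTS  # 117

Record = Dict[str, str]


def split_by_document_type(pairs: Iterable[Record]) -> Tuple[List[Record], List[Record]]:
    """Split QA records into (gold_test, generated) by document ID prefix."""
    gold_test: List[Record] = []
    generated: List[Record] = []
    for pair in pairs:
        prefix = pair["document_id"][:1].upper()
        if prefix == "F":
            gold_test.append(pair)  # FAQ documents form the gold test set
        elif prefix == "D":
            generated.append(pair)  # LLM-generated, expert-validated pairs
        else:
            raise ValueError(f"unexpected document ID: {pair['document_id']!r}")
    return gold_test, generated
```

Keeping the split keyed on the document ID, not on individual rows, avoids leaking questions from a gold FAQ document into the training portion.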
Together, these components make MENST a foundation for machine learning applications in menstrual health education and support.
Provider:
abualmun



