med-synth-questions-qwen3-235b-a22b-2507
收藏魔搭社区2025-12-05 更新2025-12-06 收录
下载链接:
https://modelscope.cn/datasets/openmed-community/med-synth-questions-qwen3-235b-a22b-2507
下载链接
链接失效反馈官方服务:
资源简介:
# openmed-community/med-synth-questions-qwen3-235b-a22b-2507
## What is this?
**Med Synth Questions — Qwen3-235B-A22B-2507** is an instruction-only dataset of **104,335 English medical questions** generated with [Qwen/Qwen3-235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507) via OpenRouter.
Questions were created from **proprietary medical documents authored by physicians** in the **Medcases** application. The dataset contains **questions only**-no source passages or proprietary text are included.
- **Split:** `train` (104,335 rows)
- **Schema:** `{'input': str, 'generation_settings': dict, 'timestamp': str}`
- **License:** **CC0-1.0** (public-domain dedication)
- **Provenance note:** Source materials are proprietary to MedIT Solutions / Medcases; they are *not* redistributed here.
---
## Dataset structure
```json
DatasetDict({
train: Dataset({
features: ['input', 'generation_settings', 'timestamp'],
num_rows: 104335
})
})
````
**Features**
- `input` *(string)* — a single, self-contained medical question.
- `generation_settings` *(dict)* — structured metadata typically including:
- `model` (e.g., `"qwen/qwen3-235b-a22b-2507"`),
- `provider` (e.g., `"openrouter"`),
- request parameters (e.g., `max_tokens`, `num_questions_requested`, `num_questions_generated`).
- `timestamp` *(string)* — ISO-8601 creation time.
**Example**
```json
{
"input": "Hey, can you walk me through how the patient’s smoking history played into the diagnosis of a palate tumor?",
"generation_settings": {
"max_tokens": 4096,
"model": "qwen/qwen3-235b-a22b-2507",
"num_questions_generated": 5,
"num_questions_requested": 5,
"provider": "openrouter"
},
"timestamp": "2025-08-17T18:43:27.659300"
}
````
---
## Intended uses
* **Instruction-only fine-tuning scaffolds** (pair with your own answer-generation pipeline).
* **RAG/eval** — as a bank of domain-specific queries for retrieval and QA evaluation.
* **Question-generation research** — analyze prompt styles, difficulty, and topic coverage.
### Out-of-scope / caveats
* **No answers** are provided; downstream users should generate or annotate answers.
* Questions are derived from clinician-authored materials but may reflect **biases, gaps, or outdated info**; validate before use.
* **Not medical advice.** Do not use for clinical decision-making.
---
## How to load
```python
from datasets import load_dataset
ds = load_dataset("openmed-community/med-synth-questions-qwen3-235b-a22b-2507", split="train")
row = ds[0]
print(row["input"])
print(row["generation_settings"])
print(row["timestamp"])
```
---
## Licensing & responsible use
* **Dataset license:** **CC0-1.0** (public-domain dedication). Downstream users may copy, modify, and redistribute. Please acknowledge the source when feasible.
* **Provenance:** Underlying *source* documents are proprietary to MedIT Solutions / Medcases and are **not** included.
* **Model & provider terms:** Questions were generated with **Qwen3** served via **OpenRouter**. This dataset itself does not grant additional rights to model weights or hosted endpoints.
---
## Provenance & credit
* **Source environment:** [Medcases.io](https://medcases.io) (virtual-patient / medical-education platform) by [MedIT Solutions](https://meditsolutions.pl).
* **Generator model:** `Qwen/Qwen3-235B-A22B-Instruct-2507` via OpenRouter.
* **Curation:** openmed-community.
---
## Changelog
* **2025-08-17** — Initial release (`train`, 104,335 questions).
---
## Disclaimer
This resource is provided **for research and educational use**. It is **not** a source of medical advice. Always follow relevant laws, ethics, platform/model terms, and institutional review requirements. Use responsibly.
---
## Reproduce
To reproduce or adapt the pipeline, see our open-source [Synthetic Questions Generation tool](https://github.com/mkurman/synthetic-questions-generation)
# openmed-community/med-synth-questions-qwen3-235b-a22b-2507
## 本数据集是什么?
**Med Synth Questions — Qwen3-235B-A22B-2507** 是一个仅包含指令的数据集,涵盖104335条英文医疗问题,由[Qwen/Qwen3-235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507)通过OpenRouter平台生成。
本数据集的问题源自Medcases应用中由医师撰写的专有医疗文档,且仅包含问题本身,未附带源文本段落或专有内容。
- **数据集拆分:** `train`(共104335条数据行)
- **数据结构(Schema):** `{'input': str, 'generation_settings': dict, 'timestamp': str}`
- **授权协议:** **CC0-1.0**(公共领域贡献协议)
- **来源说明:** 源素材归MedIT Solutions / Medcases所有,本数据集未附带此类源素材。
---
## 数据集结构
json
DatasetDict({
train: Dataset({
features: ['input', 'generation_settings', 'timestamp'],
num_rows: 104335
})
})
## 字段说明
- `input`(字符串类型):单条独立完整的医疗问题。
- `generation_settings`(字典类型):结构化元数据,通常包含以下内容:
- `model`(例如:`"qwen/qwen3-235b-a22b-2507"`):生成所用的模型
- `provider`(例如:`"openrouter"`):模型服务提供商
- 请求参数(例如:`max_tokens`、`num_questions_requested`、`num_questions_generated`)
- `timestamp`(字符串类型):ISO-8601格式的创建时间戳。
## 示例
json
{
"input": "Hey, can you walk me through how the patient’s smoking history played into the diagnosis of a palate tumor?",
"generation_settings": {
"max_tokens": 4096,
"model": "qwen/qwen3-235b-a22b-2507",
"num_questions_generated": 5,
"num_questions_requested": 5,
"provider": "openrouter"
},
"timestamp": "2025-08-17T18:43:27.659300"
}
---
## 预期用途
1. **仅用于指令微调基座**(可搭配自定义的答案生成流水线使用)。
2. **检索增强生成(Retrieval-Augmented Generation,简称RAG)与模型评估**:作为领域专属查询库,用于检索任务与问答系统评估。
3. **问题生成研究**:用于分析提示词风格、问题难度与主题覆盖范围。
### 适用范围限制与注意事项
1. **本数据集未附带答案**,下游使用者需自行生成或标注答案。
2. 问题源自临床医师撰写的素材,但可能存在偏差、信息缺口或过时内容,使用前请自行验证。
3. **本数据集不提供医疗建议**,不得用于临床决策。
---
## 加载方式
python
from datasets import load_dataset
ds = load_dataset("openmed-community/med-synth-questions-qwen3-235b-a22b-2507", split="train")
row = ds[0]
print(row["input"])
print(row["generation_settings"])
print(row["timestamp"])
---
## 授权协议与合规使用
1. **数据集授权:** **CC0-1.0**(公共领域贡献协议)。下游使用者可复制、修改并重新分发本数据集,若可行请注明原来源。
2. **来源说明:** 底层源文档归MedIT Solutions / Medcases所有,未随本数据集一同发布。
3. **模型与服务条款:** 本数据集的问题由通过OpenRouter部署的**Qwen3**模型生成,本数据集本身不赋予任何针对模型权重或托管端点的额外权利。
---
## 来源与致谢
1. **来源平台:** [Medcases.io](https://medcases.io)(虚拟患者/医学教育平台),由[MedIT Solutions](https://meditsolutions.pl)开发。
2. **生成模型:** 通过OpenRouter调用的`Qwen/Qwen3-235B-A22B-Instruct-2507`。
3. **数据集整理:** openmed-community。
---
## 更新日志
* **2025-08-17**:首次发布(仅包含`train`拆分,共104335条问题)。
---
## 免责声明
本资源仅用于**研究与教育用途**,不构成医疗建议。请始终遵守相关法律法规、伦理准则、平台与模型服务条款以及机构审查要求,合规且负责任地使用本数据集。
---
## 复现方法
若需复现或改造本数据集生成流程,请参考我们开源的[合成问题生成工具](https://github.com/mkurman/synthetic-questions-generation)
提供机构:
maas
创建时间:
2025-09-03
搜集汇总
数据集介绍

背景与挑战
背景概述
该数据集是一个包含104,335条英文医学问题的指令数据集,由Qwen3-235B模型通过OpenRouter生成,问题基于Medcases平台的医生撰写材料。数据集仅包含问题,不提供答案或源文本,采用CC0-1.0许可,适用于指令微调、检索评估和问题生成研究。
以上内容由遇见数据集搜集并总结生成



