medrxiv-clustering-p2p
ModelScope community. Updated 2025-11-14; indexed 2025-05-10.
Download link:
https://modelscope.cn/datasets/MTEB/medrxiv-clustering-p2p
Resource description:
<!-- adapted from https://github.com/huggingface/huggingface_hub/blob/v0.30.2/src/huggingface_hub/templates/datasetcard_template.md -->
<div align="center" style="padding: 40px 20px; background-color: white; border-radius: 12px; box-shadow: 0 2px 10px rgba(0, 0, 0, 0.05); max-width: 600px; margin: 0 auto;">
<h1 style="font-size: 3.5rem; color: #1a1a1a; margin: 0 0 20px 0; letter-spacing: 2px; font-weight: 700;">MedrxivClusteringP2P.v2</h1>
<div style="font-size: 1.5rem; color: #4a4a4a; margin-bottom: 5px; font-weight: 300;">An <a href="https://github.com/embeddings-benchmark/mteb" style="color: #2c5282; font-weight: 600; text-decoration: none;" onmouseover="this.style.textDecoration='underline'" onmouseout="this.style.textDecoration='none'">MTEB</a> dataset</div>
<div style="font-size: 0.9rem; color: #2c5282; margin-top: 10px;">Massive Text Embedding Benchmark</div>
</div>
Clustering of titles and abstracts from medRxiv across 51 categories.
| | |
|---------------|---------------------------------------------|
| Task category | t2c |
| Domains | Academic, Medical, Written |
| Reference | https://api.medrxiv.org/ |
## How to evaluate on this task
You can evaluate an embedding model on this dataset using the following code:
```python
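import mteb

# Select the task by name; the `tasks=` keyword keeps the list from being read as another filter.
tasks = mteb.get_tasks(tasks=["MedrxivClusteringP2P.v2"])
evaluator = mteb.MTEB(tasks)

# YOUR_MODEL is a placeholder: replace it with the name of the model to evaluate.
model = mteb.get_model(YOUR_MODEL)
evaluator.run(model)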
```
<!-- Datasets want link to arxiv in readme to autolink dataset with paper -->
To learn more about how to run models on `mteb` tasks, check out the [GitHub repository](https://github.com/embeddings-benchmark/mteb).
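Clustering tasks of this kind are typically scored by comparing predicted cluster assignments with the category labels, and V-measure (the harmonic mean of homogeneity and completeness) is a common choice for that. The following is a minimal, standard-library sketch of the metric, included only to make the scoring idea concrete. It is not the `mteb` implementation, and the function names and toy labels are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a sequence of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) for two equal-length label sequences."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items()
    )


def v_measure(true_labels, cluster_ids):
    """Harmonic mean of homogeneity and completeness, in [0, 1]."""
    h_true = entropy(true_labels)
    h_pred = entropy(cluster_ids)
    # Homogeneity: each cluster contains members of a single category.
    homogeneity = (
        1.0 if h_true == 0
        else 1.0 - conditional_entropy(true_labels, cluster_ids) / h_true
    )
    # Completeness: all members of a category land in the same cluster.
    completeness = (
        1.0 if h_pred == 0
        else 1.0 - conditional_entropy(cluster_ids, true_labels) / h_pred
    )
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Toy example (hypothetical labels): cluster ids are arbitrary, so a relabeled
# but otherwise perfect clustering still gets the maximum score.
categories = ["epidemiology", "epidemiology", "oncology", "oncology"]
print(v_measure(categories, [1, 1, 0, 0]))
```

Because the score depends only on how items are grouped, not on the cluster ids themselves, it suits a setting where a model's clusters have no inherent correspondence to the 51 category names.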
## Citation
If you use this dataset, please cite the dataset as well as [mteb](https://github.com/embeddings-benchmark/mteb), as this dataset likely includes additional processing as a part of the [MMTEB Contribution](https://github.com/embeddings-benchmark/mteb/tree/main/docs/mmteb).
```bibtex
@article{enevoldsen2025mmtebmassivemultilingualtext,
title={MMTEB: Massive Multilingual Text Embedding Benchmark},
author={Kenneth Enevoldsen and Isaac Chung and Imene Kerboua and Márton Kardos and Ashwin Mathur and David Stap and Jay Gala and Wissam Siblini and Dominik Krzemiński and Genta Indra Winata and Saba Sturua and Saiteja Utpala and Mathieu Ciancone and Marion Schaeffer and Gabriel Sequeira and Diganta Misra and Shreeya Dhakal and Jonathan Rystrøm and Roman Solomatin and Ömer Çağatan and Akash Kundu and Martin Bernstorff and Shitao Xiao and Akshita Sukhlecha and Bhavish Pahwa and Rafał Poświata and Kranthi Kiran GV and Shawon Ashraf and Daniel Auras and Björn Plüster and Jan Philipp Harries and Loïc Magne and Isabelle Mohr and Mariya Hendriksen and Dawei Zhu and Hippolyte Gisserot-Boukhlef and Tom Aarsen and Jan Kostkan and Konrad Wojtasik and Taemin Lee and Marek Šuppa and Crystina Zhang and Roberta Rocca and Mohammed Hamdy and Andrianos Michail and John Yang and Manuel Faysse and Aleksei Vatolin and Nandan Thakur and Manan Dey and Dipam Vasani and Pranjal Chitale and Simone Tedeschi and Nguyen Tai and Artem Snegirev and Michael Günther and Mengzhou Xia and Weijia Shi and Xing Han Lù and Jordan Clive and Gayatri Krishnakumar and Anna Maksimova and Silvan Wehrli and Maria Tikhonova and Henil Panchal and Aleksandr Abramov and Malte Ostendorff and Zheng Liu and Simon Clematide and Lester James Miranda and Alena Fenogenova and Guangyu Song and Ruqiya Bin Safi and Wen-Ding Li and Alessia Borghini and Federico Cassano and Hongjin Su and Jimmy Lin and Howard Yen and Lasse Hansen and Sara Hooker and Chenghao Xiao and Vaibhav Adlakha and Orion Weller and Siva Reddy and Niklas Muennighoff},
publisher = {arXiv},
journal={arXiv preprint arXiv:2502.13595},
year={2025},
url={https://arxiv.org/abs/2502.13595},
doi = {10.48550/arXiv.2502.13595},
}
@article{muennighoff2022mteb,
author = {Muennighoff, Niklas and Tazi, Nouamane and Magne, Lo{\"\i}c and Reimers, Nils},
title = {MTEB: Massive Text Embedding Benchmark},
publisher = {arXiv},
journal={arXiv preprint arXiv:2210.07316},
year = {2022},
url = {https://arxiv.org/abs/2210.07316},
doi = {10.48550/ARXIV.2210.07316},
}
```
## Dataset Statistics
<details>
<summary> Dataset Statistics</summary>
The JSON below contains the descriptive statistics for the task. They can also be obtained using:
```python
import mteb
task = mteb.get_task("MedrxivClusteringP2P.v2")
desc_stats = task.metadata.descriptive_stats
```
```json
{
"test": {
"num_samples": 37500,
"number_of_characters": 74294927,
"min_text_length": 148,
"average_text_length": 1981.1980533333333,
"max_text_length": 38759,
"min_labels_per_text": 6,
"average_labels_per_text": 1.0,
"max_labels_per_text": 8830,
"unique_labels": 51,
"labels": {
"epidemiology": {
"count": 6656
},
"public and global health": {
"count": 3595
},
"oncology": {
"count": 845
},
"allergy and immunology": {
"count": 464
},
"orthopedics": {
"count": 104
},
"health informatics": {
"count": 1107
},
"occupational and environmental health": {
"count": 415
},
"infectious diseases": {
"count": 8830
},
"genetic and genomic medicine": {
"count": 1918
},
"health policy": {
"count": 527
},
"gastroenterology": {
"count": 343
},
"radiology and imaging": {
"count": 541
},
"pain medicine": {
"count": 121
},
"neurology": {
"count": 1773
},
"primary care research": {
"count": 232
},
"rheumatology": {
"count": 189
},
"endocrinology": {
"count": 419
},
"hematology": {
"count": 202
},
"addiction medicine": {
"count": 178
},
"pediatrics": {
"count": 589
},
"cardiovascular medicine": {
"count": 855
},
"obstetrics and gynecology": {
"count": 373
},
"health systems and quality improvement": {
"count": 491
},
"nephrology": {
"count": 241
},
"respiratory medicine": {
"count": 482
},
"geriatric medicine": {
"count": 169
},
"dentistry and oral medicine": {
"count": 159
},
"psychiatry and clinical psychology": {
"count": 1781
},
"nutrition": {
"count": 240
},
"intensive care and critical care medicine": {
"count": 368
},
"rehabilitation medicine and physical therapy": {
"count": 322
},
"otolaryngology": {
"count": 166
},
"nursing": {
"count": 93
},
"transplantation": {
"count": 118
},
"health economics": {
"count": 327
},
"sports medicine": {
"count": 180
},
"hiv aids": {
"count": 363
},
"dermatology": {
"count": 98
},
"pathology": {
"count": 223
},
"emergency medicine": {
"count": 191
},
"pharmacology and therapeutics": {
"count": 221
},
"ophthalmology": {
"count": 220
},
"medical ethics": {
"count": 46
},
"palliative medicine": {
"count": 45
},
"sexual and reproductive health": {
"count": 156
},
"medical education": {
"count": 203
},
"surgery": {
"count": 162
},
"urology": {
"count": 65
},
"anesthesia": {
"count": 72
},
"toxicology": {
"count": 16
},
"forensic medicine": {
"count": 6
}
}
}
}
```
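The label counts above are heavily skewed, which is worth keeping in mind when interpreting clustering scores. As a small sketch, the snippet below computes the share of a few categories using an excerpt of the counts copied from the statistics; the variable and function names are illustrative assumptions. Note also that the reported `min_labels_per_text` (6) and `max_labels_per_text` (8830) coincide with the smallest and largest per-label counts, so they appear to describe texts per label rather than labels per text. That reading is inferred from the numbers here, not from documented behavior.

```python
# Excerpt of the per-label counts from the statistics above (not the full table).
label_counts = {
    "infectious diseases": 8830,
    "epidemiology": 6656,
    "public and global health": 3595,
    "toxicology": 16,
    "forensic medicine": 6,
}
num_samples = 37500  # "num_samples" from the statistics above


def label_shares(counts, total):
    """Return each label's fraction of the total number of samples."""
    return {label: count / total for label, count in counts.items()}


shares = label_shares(label_counts, num_samples)
top_three = sum(sorted(label_counts.values(), reverse=True)[:3])

for label, share in shares.items():
    print(f"{label}: {share:.2%}")
print(f"three largest categories: {top_three / num_samples:.2%} of all samples")
```

The three largest categories together account for just over half of the 37,500 samples, while the smallest category has only 6.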
</details>
---
*This dataset card was automatically generated using [MTEB](https://github.com/embeddings-benchmark/mteb)*
Provider:
maas
Created:
2024-09-06



