JMasr/balidea-snomed-dataset
Hugging Face · updated 2026-04-10 · indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/JMasr/balidea-snomed-dataset
---
language:
- es
pretty_name: CU2 SNOMED Benchmark
task_categories:
- token-classification
- text-classification
tags:
- clinical-nlp
- snomed-ct
- benchmarking
---
# CU2 SNOMED Benchmark
## Dataset Summary
This dataset packages the public portion of the CU2 benchmark for Spanish clinical text and SNOMED CT coding.
Public tasks:
- `mention_to_snomed`
- `document_to_snomed`
Public splits:
- `train`
- `validation`
Hidden from the public dataset:
- benchmark `test` gold labels
## Corpus Profile
- Unique SNOMED codes across the processed corpus: 8128
- Unique documents: 1000
- `mention_to_snomed` rows: 26480
- `document_to_snomed` rows: 1000
The benchmark has a long-tail label distribution in both tasks:
- `mention_to_snomed`: 4686 codes appear once; 7291 appear at most 5 times; the top 10 codes account for 8.08% of all mention-level code assignments
- `document_to_snomed`: 5148 codes appear once; 7461 appear at most 5 times; the top 10 codes account for 6.86% of all document-level code assignments
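
The long-tail figures above (codes seen once, codes seen at most 5 times, top-10 share) can be recomputed from any flat list of code assignments. The following is a minimal sketch, not the benchmark's own profiling script; the function name `long_tail_profile` and the toy codes are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def long_tail_profile(code_assignments: Iterable[str], top_k: int = 10) -> Dict[str, float]:
    """Summarize how skewed a flat list of code assignments is."""
    counts = Counter(code_assignments)
    total = sum(counts.values())
    top_total = sum(n for _, n in counts.most_common(top_k))
    return {
        "unique_codes": len(counts),
        "codes_seen_once": sum(1 for n in counts.values() if n == 1),
        "codes_seen_at_most_5": sum(1 for n in counts.values() if n <= 5),
        "top_k_share": top_total / total if total else 0.0,
    }


# Toy input with made-up codes: one list entry per code assignment.
toy_assignments = ["A"] * 6 + ["B"] * 2 + ["C", "D"]
profile = long_tail_profile(toy_assignments, top_k=1)
```

For `mention_to_snomed`, each row contributes one assignment; for `document_to_snomed`, each code in a row's `gold_snomed_codes` would contribute one.
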
Mention surface forms are short on average but varied:
- Unique mention surface forms: 15741
- Average words per mention surface form: 3.23
- Median words per mention surface form: 2.0
Most common mention lengths:
- `1` word(s): annotations=8345, unique_surface_forms=2040
- `2` word(s): annotations=6846, unique_surface_forms=3708
- `3` word(s): annotations=3023, unique_surface_forms=2284
- `4` word(s): annotations=2311, unique_surface_forms=1993
- `5` word(s): annotations=1740, unique_surface_forms=1598
- `6` word(s): annotations=1192, unique_surface_forms=1148
- `7` word(s): annotations=826, unique_surface_forms=806
- `8` word(s): annotations=629, unique_surface_forms=612
- `9` word(s): annotations=426, unique_surface_forms=418
- `10` word(s): annotations=299, unique_surface_forms=297
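
A per-length table like the one above can be built by counting annotations and distinct surface forms for each word count. The sketch below assumes whitespace tokenization and no case normalization; the card does not state how words were counted, so both are assumptions, and `mention_length_table` is a hypothetical helper name.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def mention_length_table(mentions: Iterable[str]) -> Dict[int, Tuple[int, int]]:
    """Map word count -> (annotations, unique_surface_forms)."""
    annotations = Counter()
    unique_forms = defaultdict(set)
    for text in mentions:
        n_words = len(text.split())  # assumption: whitespace-separated words
        annotations[n_words] += 1
        unique_forms[n_words].add(text)
    return {n: (annotations[n], len(unique_forms[n])) for n in sorted(annotations)}


# Toy mentions, not taken from the dataset.
toy_mentions = [
    "dolor",
    "dolor",
    "fiebre",
    "dolor abdominal",
    "dolor abdominal",
    "infarto agudo de miocardio",
]
table = mention_length_table(toy_mentions)
```
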
## Tasks
### mention_to_snomed
Strict span-and-code prediction: for each localized mention, predict its exact character span and its SNOMED CT code(s).
Fields:
- `doc_id`
- `mention_text`
- `start_char`
- `end_char`
- `gold_snomed_codes`
Primary metrics:
- `precision_strict`
- `recall_strict`
- `f1_strict`
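
Per the Limitations section, strict scoring requires exact span and exact code-set agreement. One way to compute the three metrics under that definition is sketched below. This is an illustration, not the repository's evaluation code: the `Mention` tuple and `strict_prf` function are hypothetical names, and treating duplicate predictions as one is an assumption.

```python
from typing import Dict, FrozenSet, Iterable, NamedTuple


class Mention(NamedTuple):
    doc_id: str
    start_char: int
    end_char: int
    codes: FrozenSet[str]


def strict_prf(gold: Iterable[Mention], pred: Iterable[Mention]) -> Dict[str, float]:
    """A prediction counts only if doc, span, and code set all match exactly."""
    gold_set, pred_set = set(gold), set(pred)  # assumption: duplicates collapse
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision_strict": precision, "recall_strict": recall, "f1_strict": f1}


# Toy example: the second prediction has the right code but is one character short.
gold = [
    Mention("d1", 0, 5, frozenset({"111"})),
    Mention("d1", 10, 20, frozenset({"222"})),
]
pred = [
    Mention("d1", 0, 5, frozenset({"111"})),
    Mention("d1", 10, 19, frozenset({"222"})),
]
scores = strict_prf(gold, pred)
```

Here only the first prediction is a true positive, so precision, recall, and F1 are all 0.5.
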
### document_to_snomed
Document-level SNOMED coding as a set prediction task.
Fields:
- `doc_id`
- `text`
- `gold_snomed_codes`
Primary metrics:
- `precision_micro`
- `recall_micro`
- `f1_micro`
Secondary metric:
- `subset_accuracy`
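
As a rough sketch of how these set-based metrics are commonly defined: micro scores pool true positives, false positives, and false negatives over all documents, while subset accuracy counts documents whose predicted set equals the gold set exactly. The function name `document_metrics` is hypothetical, and ignoring predictions for documents absent from the gold data is an assumption, not confirmed behavior of the benchmark's scorer.

```python
from typing import Dict, Set


def document_metrics(gold: Dict[str, Set[str]], pred: Dict[str, Set[str]]) -> Dict[str, float]:
    """Micro P/R/F1 over pooled code decisions, plus exact-set accuracy per document."""
    tp = fp = fn = exact = 0
    for doc_id, gold_codes in gold.items():
        pred_codes = pred.get(doc_id, set())  # missing prediction = empty set
        tp += len(gold_codes & pred_codes)
        fp += len(pred_codes - gold_codes)
        fn += len(gold_codes - pred_codes)
        exact += int(gold_codes == pred_codes)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return {
        "precision_micro": precision,
        "recall_micro": recall,
        "f1_micro": 2 * precision * recall / denom if denom else 0.0,
        "subset_accuracy": exact / len(gold) if gold else 0.0,
    }


# Toy example with made-up codes.
gold_docs = {"d1": {"1", "2"}, "d2": {"3"}}
pred_docs = {"d1": {"1", "4"}, "d2": {"3"}}
doc_scores = document_metrics(gold_docs, pred_docs)
```

In the toy example there are 2 true positives, 1 false positive, and 1 false negative, and one of the two documents matches exactly.
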
## Split Policy
- Official raw `test` remains hidden and is not included in this public HF export.
- Official raw `train` is partitioned deterministically into benchmark `train` and `validation`.
- Validation derivation rule: a document from official `train` is assigned to `validation` when `md5(doc_id) % 100 < 10`, and to benchmark `train` otherwise.
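
The validation rule can be sketched in a few lines. The card gives only the expression `md5(doc_id) % 100 < 10`; how the digest is turned into an integer is not stated, so the UTF-8 encoding and the reading of the full hex digest as one integer below are assumptions. The function name `benchmark_split` and the example IDs are hypothetical.

```python
import hashlib


def benchmark_split(doc_id: str, validation_percent: int = 10) -> str:
    """Assign a document from official train to benchmark train or validation."""
    # Assumption: UTF-8 bytes, full 128-bit hex digest interpreted as an integer.
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "validation" if bucket < validation_percent else "train"


# Hypothetical document IDs; real IDs come from the `doc_id` field.
example_ids = [f"doc-{i:05d}" for i in range(10000)]
assignments = {doc_id: benchmark_split(doc_id) for doc_id in example_ids}
validation_fraction = sum(s == "validation" for s in assignments.values()) / len(assignments)
```

Because the split depends only on `doc_id`, every mention of a document lands in the same split as its document-level row. The hash only targets roughly 10% on average, which is consistent with the observed 92 of 750 documents in `validation`.
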
## Public Export Contents
- `document_to_snomed`: train=658, validation=92
- `mention_to_snomed`: train=15527, validation=2193
## Processed Dataset Statistics
### `mention_to_snomed`
- Full processed rows: 26480
- Unique documents: 1000
- Unique SNOMED codes: 8128
- Mean codes per row: 1.000
- Median codes per row: 1.0
Split coverage across the processed benchmark:
- `train`: rows=15527, docs=658, unique_codes=5622
- `validation`: rows=2193, docs=92, unique_codes=1345
- `test`: rows=8760, docs=250, unique_codes=3803
Most frequent mention-level codes:
- `5880005`: count=291 share=1.10%
- `387713003`: count=281 share=1.06%
- `64882008`: count=281 share=1.06%
- `22253000`: count=203 share=0.77%
- `300848003`: count=201 share=0.76%
- `77477000`: count=198 share=0.75%
- `84387000`: count=186 share=0.70%
- `417163006`: count=185 share=0.70%
- `252416005`: count=165 share=0.62%
- `419099009`: count=149 share=0.56%
### `document_to_snomed`
- Full processed rows: 1000
- Unique documents: 1000
- Unique SNOMED codes: 8128
- Mean codes per row: 22.298
- Median codes per row: 19.0
Split coverage across the processed benchmark:
- `train`: rows=658, docs=658, unique_codes=5622
- `validation`: rows=92, docs=92, unique_codes=1345
- `test`: rows=250, docs=250, unique_codes=3803
Most frequent document-level codes:
- `5880005`: count=247 share=1.11%
- `64882008`: count=188 share=0.84%
- `387713003`: count=167 share=0.75%
- `84387000`: count=165 share=0.74%
- `22253000`: count=133 share=0.60%
- `300848003`: count=131 share=0.59%
- `419099009`: count=130 share=0.58%
- `252416005`: count=128 share=0.57%
- `77477000`: count=122 share=0.55%
- `38341003`: count=118 share=0.53%
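
The reported shares can be cross-checked from the counts in this card. The card does not state the denominators; the sketch below assumes that mention-level shares divide by the 26480 mention rows and that document-level shares divide by the total number of document-level code assignments, inferred as 1000 rows × 22.298 mean codes per row = 22298.

```python
TOTAL_MENTION_ROWS = 26480
DOCUMENT_ROWS = 1000
MEAN_CODES_PER_DOCUMENT = 22.298
# Assumption: total document-level assignments = rows * mean codes per row.
total_document_assignments = round(DOCUMENT_ROWS * MEAN_CODES_PER_DOCUMENT)

# Top-10 counts copied from the two "Most frequent ... codes" lists.
mention_top10 = [291, 281, 281, 203, 201, 198, 186, 185, 165, 149]
document_top10 = [247, 188, 167, 165, 133, 131, 130, 128, 122, 118]


def share(count: int, total: int) -> str:
    """Format count/total as a percentage with two decimals."""
    return f"{100 * count / total:.2f}%"


mention_shares = [share(c, TOTAL_MENTION_ROWS) for c in mention_top10]
document_shares = [share(c, total_document_assignments) for c in document_top10]
mention_top10_share = share(sum(mention_top10), TOTAL_MENTION_ROWS)
document_top10_share = share(sum(document_top10), total_document_assignments)
```

Under these assumptions the per-code shares and the top-10 totals (8.08% and 6.86%) match the values reported in the Corpus Profile and in the lists above.
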
## Source Provenance
The benchmark is built from:
- DisTEMIST
- SympTEMIST
- MedProcNER
Current source contribution in the processed benchmark:
- `distemist`: 7431 mention records across 832 documents
- `medprocner`: 8193 mention records across 499 documents
- `symptemist`: 10856 mention records across 990 documents
All three resources come from the Barcelona Supercomputing Center TEMU shared-task ecosystem and are fused into benchmark task schemas after deterministic normalization.
## Licensing Notes
- The local SympTEMIST package includes a `CC BY 4.0` license.
- The local MedProcNER package includes a `CC BY 4.0` license.
- DisTEMIST provenance is tracked from the local `distemist_zenodo` package root. If your release workflow requires it, verify the redistribution terms of the original distribution before pushing publicly.
## Hidden Test Policy
This public dataset intentionally excludes benchmark `test` gold labels.
For leaderboard scoring, run predictions for the hidden test set through the repository's private/local evaluation workflow.
## Limitations
- The public HF dataset is intended for reproducible training and validation, not for blind final scoring.
- The source corpora often annotate the same underlying documents, so benchmark splits are assigned at document level.
- Strict mention scoring requires exact span and exact SNOMED code-set agreement.
Provider:
JMasr



