JMasr/balidea-snomed-dataset

Name: JMasr/balidea-snomed-dataset
Creator: JMasr
Published: 2026-04-10 07:39:40
License: No description available

Source: Hugging Face. Updated 2026-04-10; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/JMasr/balidea-snomed-dataset

Description:

---
language:
- es
pretty_name: CU2 SNOMED Benchmark
task_categories:
- token-classification
- text-classification
tags:
- clinical-nlp
- snomed-ct
- benchmarking
---

# CU2 SNOMED Benchmark

## Dataset Summary

This dataset packages the public portion of the CU2 benchmark for Spanish clinical text and SNOMED CT coding.

Public tasks:

- `mention_to_snomed`
- `document_to_snomed`

Public splits:

- `train`
- `validation`

Hidden from the public dataset:

- benchmark `test` gold labels

## Corpus Profile

- Unique SNOMED codes across the processed corpus: 8128
- Unique documents: 1000
- `mention_to_snomed` rows: 26480
- `document_to_snomed` rows: 1000

The benchmark has a long-tail label distribution in both tasks:

- `mention_to_snomed`: 4686 codes appear once; 7291 appear at most 5 times; the top 10 codes account for 8.08% of all mention-level code assignments
- `document_to_snomed`: 5148 codes appear once; 7461 appear at most 5 times; the top 10 codes account for 6.86% of all document-level code assignments

Mention surface forms are short on average but varied:

- Unique mention surface forms: 15741
- Average words per mention surface form: 3.23
- Median words per mention surface form: 2.0

Most common mention lengths:

- `1` word(s): annotations=8345, unique_surface_forms=2040
- `2` word(s): annotations=6846, unique_surface_forms=3708
- `3` word(s): annotations=3023, unique_surface_forms=2284
- `4` word(s): annotations=2311, unique_surface_forms=1993
- `5` word(s): annotations=1740, unique_surface_forms=1598
- `6` word(s): annotations=1192, unique_surface_forms=1148
- `7` word(s): annotations=826, unique_surface_forms=806
- `8` word(s): annotations=629, unique_surface_forms=612
- `9` word(s): annotations=426, unique_surface_forms=418
- `10` word(s): annotations=299, unique_surface_forms=297

## Tasks

### mention_to_snomed

Strict span-and-code prediction over localized mentions.

Fields:

- `doc_id`
- `mention_text`
- `start_char`
- `end_char`
- `gold_snomed_codes`

Primary metrics:

- `precision_strict`
- `recall_strict`
- `f1_strict`

### document_to_snomed

Document-level SNOMED coding as a set prediction task.

Fields:

- `doc_id`
- `text`
- `gold_snomed_codes`

Primary metrics:

- `precision_micro`
- `recall_micro`
- `f1_micro`

Secondary metric:

- `subset_accuracy`

## Split Policy

- Official raw `test` remains hidden and is not included in this public HF export.
- Official raw `train` is partitioned deterministically into benchmark `train` and `validation`.
- Validation derivation rule: `Derived deterministically from official train by doc_id using md5(doc_id) % 100 < 10.`

## Public Export Contents

- `document_to_snomed`: train=658, validation=92
- `mention_to_snomed`: train=15527, validation=2193

## Processed Dataset Statistics

### `mention_to_snomed`

- Full processed rows: 26480
- Unique documents: 1000
- Unique SNOMED codes: 8128
- Mean codes per row: 1.000
- Median codes per row: 1.0

Split coverage across the processed benchmark:

- `train`: rows=15527, docs=658, unique_codes=5622
- `validation`: rows=2193, docs=92, unique_codes=1345
- `test`: rows=8760, docs=250, unique_codes=3803

Most frequent mention-level codes:

- `5880005`: count=291 share=1.10%
- `387713003`: count=281 share=1.06%
- `64882008`: count=281 share=1.06%
- `22253000`: count=203 share=0.77%
- `300848003`: count=201 share=0.76%
- `77477000`: count=198 share=0.75%
- `84387000`: count=186 share=0.70%
- `417163006`: count=185 share=0.70%
- `252416005`: count=165 share=0.62%
- `419099009`: count=149 share=0.56%

### `document_to_snomed`

- Full processed rows: 1000
- Unique documents: 1000
- Unique SNOMED codes: 8128
- Mean codes per row: 22.298
- Median codes per row: 19.0

Split coverage across the processed benchmark:

- `train`: rows=658, docs=658, unique_codes=5622
- `validation`: rows=92, docs=92, unique_codes=1345
- `test`: rows=250, docs=250, unique_codes=3803

Most frequent document-level codes:

- `5880005`: count=247 share=1.11%
- `64882008`: count=188 share=0.84%
- `387713003`: count=167 share=0.75%
- `84387000`: count=165 share=0.74%
- `22253000`: count=133 share=0.60%
- `300848003`: count=131 share=0.59%
- `419099009`: count=130 share=0.58%
- `252416005`: count=128 share=0.57%
- `77477000`: count=122 share=0.55%
- `38341003`: count=118 share=0.53%

## Source Provenance

The benchmark is built from:

- DisTEMIST
- SympTEMIST
- MedProcNER

Current source contribution in the processed benchmark:

- `distemist`: 7431 mention records across 832 documents
- `medprocner`: 8193 mention records across 499 documents
- `symptemist`: 10856 mention records across 990 documents

All three resources come from the Barcelona Supercomputing Center TEMU shared-task ecosystem and are fused into benchmark task schemas after deterministic normalization.

## Licensing Notes

- SympTEMIST local package includes `CC BY 4.0`
- MedProcNER local package includes `CC BY 4.0`
- DisTEMIST provenance is tracked from the local `distemist_zenodo` package root; verify redistribution terms from the original distribution before public push if required by your release workflow

## Hidden Test Policy

This public dataset intentionally excludes benchmark `test` gold labels. Use the repository's private/local evaluation workflow for real leaderboard scoring on hidden test predictions.

## Limitations

- The public HF dataset is intended for reproducible training and validation, not for blind final scoring.
- Source corpora share the same underlying document collection in many cases; benchmark split logic is handled at document level.
- Strict mention scoring requires exact span and exact SNOMED code-set agreement.
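
As a rough illustration of what strict mention scoring implies, the sketch below counts a prediction as correct only when `doc_id`, `start_char`, `end_char`, and the full code set all match a gold row. It is not the benchmark's official scorer: the prediction field name `pred_snomed_codes`, the representation of codes as lists of strings, and the treatment of duplicate rows (collapsed by the set) are assumptions made for this example.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

# One mention: (doc_id, start_char, end_char, frozen set of SNOMED codes)
Mention = Tuple[str, int, int, FrozenSet[str]]


def to_mention_set(rows: Iterable[dict], code_field: str) -> Set[Mention]:
    """Convert rows into hashable tuples so exact matches can be found by set intersection."""
    return {
        (
            row["doc_id"],
            int(row["start_char"]),
            int(row["end_char"]),
            frozenset(str(code) for code in row[code_field]),
        )
        for row in rows
    }


def strict_prf(
    gold_rows: Iterable[dict],
    pred_rows: Iterable[dict],
    pred_code_field: str = "pred_snomed_codes",  # hypothetical field name
) -> Dict[str, float]:
    """Strict precision/recall/F1: span and code set must both match exactly."""
    gold = to_mention_set(gold_rows, "gold_snomed_codes")
    pred = to_mention_set(pred_rows, pred_code_field)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {
        "precision_strict": precision,
        "recall_strict": recall,
        "f1_strict": f1,
    }


# Toy rows (made up for illustration, not taken from the dataset).
gold = [
    {"doc_id": "d1", "start_char": 0, "end_char": 5, "gold_snomed_codes": ["22253000"]},
    {"doc_id": "d1", "start_char": 10, "end_char": 18, "gold_snomed_codes": ["5880005"]},
]
pred = [
    # Exact span and code: counts as a true positive.
    {"doc_id": "d1", "start_char": 0, "end_char": 5, "pred_snomed_codes": ["22253000"]},
    # Right code, span end off by one character: counts as an error.
    {"doc_id": "d1", "start_char": 10, "end_char": 17, "pred_snomed_codes": ["5880005"]},
]

scores = strict_prf(gold, pred)
print(scores)  # precision, recall, and F1 are all 0.5 for this toy input
```

The second prediction shows why the metric is called strict: a one-character boundary error produces both a false positive and a false negative, even though the code is correct.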
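
The validation derivation rule quoted under Split Policy, `md5(doc_id) % 100 < 10`, can be sketched as follows. The card does not say how the MD5 digest is turned into an integer or how `doc_id` is encoded, so this version assumes UTF-8 encoding and interprets the full hexadecimal digest as one integer; the function name `assign_split` and the document IDs are hypothetical, and the assignments may not reproduce the published split exactly.

```python
import hashlib


def assign_split(doc_id: str, validation_percent: int = 10) -> str:
    """Assign a document to 'train' or 'validation' from a hash of its ID.

    Assumption: the digest is read as a single base-16 integer and reduced
    modulo 100; the benchmark's own pipeline may convert it differently.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "validation" if bucket < validation_percent else "train"


# Hypothetical document IDs, used only to show the behaviour.
doc_ids = [f"doc-{i:04d}" for i in range(1000)]
splits = {doc_id: assign_split(doc_id) for doc_id in doc_ids}
n_validation = sum(1 for split in splits.values() if split == "validation")
print(f"validation documents: {n_validation} of {len(doc_ids)}")
```

Because the assignment depends only on `doc_id`, every mention row of a document lands in the same split as the document itself, which matches the note that split logic is handled at document level. A rule of this form targets roughly 10% of documents; the card reports 92 of the 750 public documents in validation.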

Provider:

JMasr
