Word Representations for Clinical Danish
Source: Figshare. Updated: 2020-05-27. Indexed: 2026-04-08.
Download link:
https://figshare.com/articles/Word_Representations_for_Clinical_Danish/12377858/1
Description:
Word embeddings and word clusters for Clinical Danish, derived from the heavily anonymised E4C resource (https://doi.org/10.1177/1460458216647760) and presented here as statistical aggregate data over those records. The vocabulary contains 382737 words, and the vectors have 100 dimensions. The clusters were generated using Generalised Brown clustering with a=2500 and a minimum count of 3; coarser clusters can be generated rapidly from the included mergefile (see https://github.com/sean-chester/generalised-brown/blob/master/cluster_generator/cluster.py). A data statement is included.
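As a rough illustration of why coarser clusters are cheap to derive from a recorded merge sequence, the sketch below replays merges with a union-find structure until a target number of clusters remains. It is not the linked cluster.py and does not parse the dataset's mergefile, whose format is not described here. The function name `coarsen`, the in-memory list of `(a, b)` merge pairs, and the toy tokens are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def coarsen(
    items: Sequence[str],
    merges: Sequence[Tuple[str, str]],
    target: int,
) -> List[List[str]]:
    """Replay recorded merges, in order, until `target` clusters remain.

    `items` are the starting clusters (hypothetical labels) and `merges`
    is the ordered list of pairs that were merged. Both are assumed
    inputs, not the real mergefile layout.
    """
    # Each item starts as the representative of its own cluster.
    parent: Dict[str, str] = {x: x for x in items}

    def find(x: str) -> str:
        # Follow parent links to the representative, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    remaining = len(parent)
    for a, b in merges:
        if remaining <= target:
            break  # Coarse enough; later merges are ignored.
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra
            remaining -= 1

    # Group items by representative, preserving input order.
    clusters: Dict[str, List[str]] = {}
    for x in items:
        clusters.setdefault(find(x), []).append(x)
    return list(clusters.values())


if __name__ == "__main__":
    toy_items = ["w1", "w2", "w3", "w4"]
    toy_merges = [("w1", "w2"), ("w3", "w4"), ("w1", "w3")]
    print(coarsen(toy_items, toy_merges, target=2))
```

Because the merge order is fixed once clustering has run, any coarser granularity only needs a single pass over the recorded merges and no re-clustering of the underlying text.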
Created:
2020-05-27



