balacoon/en_us_abbreviations
Hugging Face · updated 2022-10-05 · indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/balacoon/en_us_abbreviations
Description:
# en-US abbreviations
This dataset contains examples of abbreviations and regular words.
There are two subsets:
- `wiki` - the more accurate, manually annotated subset. Collected
from abbreviations in wiki and words in CMUdict.
- `kestrel` - tokens that are automatically annotated by Google
text normalization into the **PLAIN** and **LETTERS** semiotic
classes. Less accurate, but bigger. Each file additionally contains the
token frequency (how often the token appeared) in a second column, for possible filtering.
More info on how the dataset was collected: [blog](http://balacoon.com/blog/en_us_abbreviation_detection/#difficult-to-pronounce)
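
The frequency column in the `kestrel` files can be used to drop rare, likely noisy tokens. The snippet below is a minimal sketch of that filtering. It assumes each line holds a token followed by a whitespace-separated integer count; the exact separator is not stated in this card, and the function name `filter_by_frequency` and the sample rows are made up for illustration.

```python
from typing import Iterable, List, Tuple


def filter_by_frequency(lines: Iterable[str], min_count: int) -> List[Tuple[str, int]]:
    """Keep (token, count) pairs whose count is at least min_count."""
    kept = []
    for line in lines:
        # Assumed layout: token in the first column, frequency in the second.
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        try:
            count = int(fields[1])
        except ValueError:
            continue  # skip lines whose second column is not an integer
        if count >= min_count:
            kept.append((fields[0], count))
    return kept


# Hypothetical rows, not taken from the dataset.
sample = ["NASA\t120", "hello\t5400", "xqz\t1", ""]
frequent = filter_by_frequency(sample, min_count=10)
print(frequent)  # [('NASA', 120), ('hello', 5400)]
```

A suitable threshold depends on the corpus size, so it is worth inspecting the count distribution before settling on `min_count`.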
Provider:
balacoon
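Summary of original information
Dataset overview
Dataset name
en-US abbreviations
Dataset contents
This dataset contains examples of abbreviations and regular words.
Dataset subsets
- wiki
  - Source: abbreviations from Wikipedia and words from CMUdict, manually annotated.
  - Characteristics: more accurate, manually annotated.
- kestrel
  - Source: automatically annotated by Google text normalization into the PLAIN and LETTERS semiotic classes.
  - Characteristics: less accurate, but larger. Files additionally include each token's frequency in the second column, for possible filtering.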



