
Word Sense Change Testset

Source: NIAID Data Ecosystem, indexed 2026-03-11
Download link:
https://zenodo.org/records/495572
Description:
Overview

This testset consists of 23 terms that have experienced word sense change during the past centuries. The main changes for each term were found using Wikipedia, dictionary.com and the Oxford English Dictionary. We consider major changes in usage as well as changes to sense. In cases where multiple (fine-grained) senses were available, we opted to accept the widest sense. E.g., for the term rock we consider a music sense without any distinction between different types of rock music, because our dataset is unlikely to have fine-grained sense differentiations. If a clear time point cannot be pinpointed, we choose the earliest possible. For comparison purposes we also chose a set of 11 terms that have experienced minimal change during the investigated period, i.e., stable terms.

Supplementary material

1. testset.txt
Contains a list of all terms and the different change types for each term, with a short description of the sense and change.

2. Files of the kind "TERM.txt"
The header gives the term, the clustering coefficient, the similarity threshold and the similarity measure that were used. A path starts with "Path:". A unit starts with "UNIT:", and the numbers that follow give first the number of years that the unit spans and then a list of all years that the internal clusters stem from. E.g., "UNIT: 83 1785, 1787, 1790, 1793, 1798, 1801, 1823, 1867," spans 83 years and consists of clusters from the years 1785, 1787, 1790, etc. Indentation shows the tree structure: more indentation means a lower-level branch in the tree. As an example, in AEROPLANE.txt the unit "UNIT: 23 1908, 1909, 1910, 1911, 1914, 1918, 1930, 1908" is the root node, and it is related to "UNIT: 27 1916, 1919, 1924, 1932, 1942, 1916".

Interesting findings

The longest units and paths are found for stable terms, e.g., newspaper. These are statistically significantly longer than the average units and paths for terms that later evolve. Newspaper has a unit that spans 145 years, and its first path spans from 1852 to 2007.

FLIGHT.txt

For the term flight we find that the first unit captures a name, Flight & Robson, who were organ builders. The second unit (in its own path) represents the flight over a hurdle:
UNIT: 28 1868, 1869, 1870, 1877, 1885, 1889, 1890, 1892, 1893, 1894, 1895
There is a unit (in its own path) that represents the flight of a cricket ball:
UNIT: 29 1938, 1957, 1966
Finally, the last path represents flight as a means of transportation, in particular for holidays, starting with:
UNIT: 19 1962, 1970, 1973, 1980

TAPE.txt

The first path for tape is related to sewing tape. A second path, starting with
UNIT: 38 1970, 1974, 2007
takes up the musical tape. The last path ends in the same units as the second path and is also related to the musical tape. The music tape and the sewing tape should be related because of their shape, but we cannot find any relation, as there are few or no overlapping terms.
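
The "UNIT:" line format described under Supplementary material can be read with a few lines of Python. The following is only a minimal sketch, not the authors' tooling: the names `Unit`, `parse_unit` and `span_matches` are hypothetical, the depth is assumed to be the count of leading whitespace characters, and the inclusive span rule (last year minus first year plus one) is inferred from the examples quoted above rather than stated in the description.

```python
import re
from typing import NamedTuple, Tuple


class Unit(NamedTuple):
    """One parsed "UNIT:" line (hypothetical helper type)."""
    span: int                # number of years the unit spans, as written in the file
    years: Tuple[int, ...]   # years the internal clusters stem from, in file order
    depth: int               # leading whitespace count; assumed to encode tree level


def parse_unit(line: str) -> Unit:
    """Parse a line such as 'UNIT: 29 1938, 1957, 1966'."""
    stripped = line.lstrip()
    if not stripped.startswith("UNIT:"):
        raise ValueError("not a UNIT line: %r" % line)
    depth = len(line) - len(stripped)
    # Collect every integer after the "UNIT:" marker; the first is the span.
    numbers = [int(n) for n in re.findall(r"\d+", stripped[len("UNIT:"):])]
    if not numbers:
        raise ValueError("UNIT line without numbers: %r" % line)
    return Unit(span=numbers[0], years=tuple(numbers[1:]), depth=depth)


def span_matches(unit: Unit) -> bool:
    """Check the assumed rule: span == latest year - earliest year + 1."""
    if not unit.years:
        return False
    return max(unit.years) - min(unit.years) + 1 == unit.span
```

For instance, `parse_unit("UNIT: 29 1938, 1957, 1966")` yields a span of 29 and the years 1938, 1957 and 1966, and `span_matches` confirms that 1966 - 1938 + 1 equals 29. Taking the minimum and maximum, instead of the first and last entries, keeps the check valid for lines where the year list is not strictly ascending, as in the AEROPLANE.txt examples.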
Created:
2020-01-24