NYTWIT
Source: arXiv | Updated: 2020-10-24 | Indexed: 2024-06-21
Download link:
https://github.com/yuvalpinter/nytwit
Resource description:
NYTWIT is a dataset created by researchers at the Georgia Institute of Technology. It contains 2,500 novel English words published in *The New York Times* between November 2017 and March 2019. Each word is manually annotated with a novelty category, such as lexical derivation, dialectal variation, blending, or compounding. The words were collected automatically in real time by a Twitter bot, and each entry carries its publication date and document identifier so that the surrounding context can be extracted. The dataset gives linguists and NLP practitioners a real-world setting for studying how new words emerge, and it targets a common problem for pre-trained models: handling words they have never seen.
Provided by:
Georgia Institute of Technology
Created:
2020-03-07



