socialmediaie/TweetNERD
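TweetNERD - End to End Entity Linking Benchmark for Tweets
Dataset overview
- Name: TweetNERD
- Description: A benchmark dataset for entity linking on tweets, containing more than 340K tweets from 2010 to 2021.
- Domain: Natural language processing (NLP)
- Task: Named entity recognition and disambiguation (NERD)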
Dataset configurations
- Configuration names:
  - default
  - paper
default configuration
- Data files:
  - train: train.public.merged.json
  - validation: valid.public.merged.json
  - test: test.public.merged.json
  - academic: Academic.public.merged.json
  - ood: OOD.public.merged.json
paper configuration
- Data files:
  - parts: part_*.public.merged.json
  - academic: Academic.public.merged.json
  - ood: OOD.public.merged.json
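The two configurations can be summarized as a mapping from split name to file name. The following is a minimal sketch; `CONFIG_DATA_FILES` and `data_files_for` are hypothetical names introduced here for illustration and are not part of the dataset or of any library.

```python
from __future__ import annotations

# Split-to-file mapping transcribed from the configuration lists in this card.
# CONFIG_DATA_FILES and data_files_for are illustrative names (assumptions).
CONFIG_DATA_FILES = {
    "default": {
        "train": "train.public.merged.json",
        "validation": "valid.public.merged.json",
        "test": "test.public.merged.json",
        "academic": "Academic.public.merged.json",
        "ood": "OOD.public.merged.json",
    },
    "paper": {
        # The glob pattern is kept as written; it is not expanded here.
        "parts": "part_*.public.merged.json",
        "academic": "Academic.public.merged.json",
        "ood": "OOD.public.merged.json",
    },
}


def data_files_for(config: str) -> dict[str, str]:
    """Return a copy of the split-to-file mapping for one configuration."""
    try:
        return dict(CONFIG_DATA_FILES[config])
    except KeyError:
        known = sorted(CONFIG_DATA_FILES)
        raise ValueError(f"unknown configuration {config!r}; expected one of {known}") from None


if __name__ == "__main__":
    for split, path in data_files_for("default").items():
        print(f"{split}: {path}")
```

If the repository ID at the top of this card is hosted on the Hugging Face Hub, a call such as `datasets.load_dataset("socialmediaie/TweetNERD", "paper")` would presumably select a configuration by name. That is an assumption about hosting, not something this card states.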
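Data format
- File type: tab-separated values (TSV)
- Columns:
  - tweet_id: tweet ID (string)
  - phrase: entity phrase (string; NO_PHRASE when missing)
  - start: start offset of the phrase in the text (integer; -1 when missing)
  - end: end offset of the phrase in the text (integer; -1 when missing)
  - entityId: entity ID (string; NO_ENTITY when missing)
  - score: number of annotators who agreed on the phrase, start, end, and entityId values (integer; -1 when missing)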
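The column layout and missing-value sentinels can be handled with a small parser. This is a minimal sketch under stated assumptions: columns appear in the order listed in this card, and files have no header row unless `has_header=True` is passed. The names `Annotation` and `parse_rows`, and the sample rows, are invented for illustration and are not taken from the dataset.

```python
from __future__ import annotations

import csv
import io
from dataclasses import dataclass
from typing import Iterator, Optional, TextIO

# Column order as listed in this card (assumed to match the file layout).
COLUMNS = ["tweet_id", "phrase", "start", "end", "entityId", "score"]


@dataclass(frozen=True)
class Annotation:
    tweet_id: str
    phrase: Optional[str]      # None when the file holds NO_PHRASE
    start: Optional[int]       # None when the file holds -1
    end: Optional[int]         # None when the file holds -1
    entity_id: Optional[str]   # None when the file holds NO_ENTITY
    score: Optional[int]       # None when the file holds -1


def _int_or_none(value: str) -> Optional[int]:
    """Convert an integer field, mapping the -1 sentinel to None."""
    number = int(value)
    return None if number == -1 else number


def parse_rows(handle: TextIO, has_header: bool = False) -> Iterator[Annotation]:
    """Yield one Annotation per non-empty tab-separated row."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the header row if the caller says there is one
    for row in reader:
        if not row:
            continue
        record = dict(zip(COLUMNS, row))
        yield Annotation(
            tweet_id=record["tweet_id"],  # kept as a string, as the card specifies
            phrase=None if record["phrase"] == "NO_PHRASE" else record["phrase"],
            start=_int_or_none(record["start"]),
            end=_int_or_none(record["end"]),
            entity_id=None if record["entityId"] == "NO_ENTITY" else record["entityId"],
            score=_int_or_none(record["score"]),
        )


# Invented sample rows (not real dataset content).
SAMPLE = (
    "1000000000000000001\tParis\t10\t15\tEXAMPLE_ENTITY\t3\n"
    "1000000000000000002\tNO_PHRASE\t-1\t-1\tNO_ENTITY\t-1\n"
)

if __name__ == "__main__":
    for annotation in parse_rows(io.StringIO(SAMPLE)):
        print(annotation)
```

Mapping the sentinels to `None` keeps downstream code from mistaking `-1` for a real offset or annotator count. Tweet IDs stay as strings because the card types them that way.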
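Dataset statistics
- Splits:
  - OOD: 34102 rows, 25000 unique tweets, 20937 populated tweets
  - Academic: 51685 rows, 30119 unique tweets, 28694 populated tweets
  - part_0 to part_11: 11830 to 32033 rows, 10000 to 24972 unique tweets, 6633 to 20583 populated tweets
  - train: 349252 rows, 255490 unique tweets, 207278 populated tweets
  - valid: 6822 rows, 5000 unique tweets, 4128 populated tweets
  - test: 34129 rows, 25000 unique tweets, 20274 populated tweets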
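The three figures reported per split can be recomputed from the rows of a file. The sketch below assumes that a "populated" tweet is one with at least one row whose phrase is not `NO_PHRASE`; the card does not define the term, so treat that reading as an assumption. `SplitStats` and `split_stats` are hypothetical names, and the sample pairs are invented.

```python
from __future__ import annotations

from typing import Iterable, NamedTuple, Tuple


class SplitStats(NamedTuple):
    rows: int
    unique_tweets: int
    populated_tweets: int


def split_stats(pairs: Iterable[Tuple[str, str]]) -> SplitStats:
    """Count rows, distinct tweet IDs, and tweets with a non-missing phrase.

    `pairs` holds (tweet_id, phrase) values taken from each row of a split.
    """
    n_rows = 0
    seen = set()
    populated = set()
    for tweet_id, phrase in pairs:
        n_rows += 1
        seen.add(tweet_id)
        # Assumption: NO_PHRASE marks a tweet row that carries no annotation.
        if phrase != "NO_PHRASE":
            populated.add(tweet_id)
    return SplitStats(n_rows, len(seen), len(populated))


# Invented sample: t1 has two phrases, t2 has none, t3 has one.
SAMPLE_PAIRS = [
    ("t1", "Paris"),
    ("t1", "France"),
    ("t2", "NO_PHRASE"),
    ("t3", "NASA"),
]

if __name__ == "__main__":
    print(split_stats(SAMPLE_PAIRS))
```

Under the same assumption, the difference between unique and populated tweets is the number of tweets without any annotated phrase; for the valid split that would be 5000 - 4128 = 872.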
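Data usage policy
- Use of this dataset is subject to you obtaining lawful access to the Twitter API, which requires you to agree to the Developer Terms, Policies, and Agreements.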
Citation
@inproceedings{TweetNERD,
  doi = {10.48550/ARXIV.2210.08129},
  url = {https://arxiv.org/abs/2210.08129},
  author = {Mishra, Shubhanshu and Saini, Aman and Makki, Raheleh and Mehta, Sneha and Haghighi, Aria and Mollahosseini, Ali},
  keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), Information Retrieval (cs.IR), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences, I.2.7, 68T50, 68T07},
  title = {{TweetNERD} -- {End to End Entity Linking Benchmark for Tweets}},
  publisher = {arXiv},
  year = {2022},
  booktitle = "Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 2 (NeurIPS Datasets and Benchmarks 2022)",
  copyright = {Creative Commons Attribution 4.0 International}
}

@dataset{mishra_shubhanshu_2022_6617192,
  author = {Mishra, Shubhanshu and Saini, Aman and Makki, Raheleh and Mehta, Sneha and Haghighi, Aria and Mollahosseini, Ali},
  title = {{TweetNERD - End to End Entity Linking Benchmark for Tweets}},
  month = jun,
  year = 2022,
  note = {{Data usage policy Use of this dataset is subject to you obtaining lawful access to the [Twitter API](https://developer.twitter.com/en/docs/twitter-api), which requires you to agree to the [Developer Terms Policies and Agreements](https://developer.twitter.com/en/developer-terms/).}},
  publisher = {Zenodo},
  version = {0.0.0},
  doi = {10.5281/zenodo.6617192},
  url = {https://doi.org/10.5281/zenodo.6617192}
}



