TweetNERD - End to End Entity Linking Benchmark for Tweets
Indexed by NIAID Data Ecosystem on 2026-03-14
Download link:
https://zenodo.org/record/6617191
Description:
This is the dataset described in the paper "TweetNERD - End to End Entity Linking Benchmark for Tweets", accepted to the Datasets and Benchmarks Track of the Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS).
Named Entity Recognition and Disambiguation (NERD) systems are foundational for information retrieval, question answering, event detection, and other natural language processing (NLP) applications. We introduce TweetNERD, a dataset of 340K+ Tweets across 2010-2021, for benchmarking NERD systems on Tweets. This is the largest and most temporally diverse open-source benchmark dataset for NERD on Tweets and can be used to facilitate research in this area.
UPDATE: The new version contains an additional ~125K Tweets leading to a total dataset size of ~465K Tweets.
The TweetNERD dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
The license only applies to the data files present in this dataset. See Data usage policy below.
Check out more details at https://github.com/twitter-research/TweetNERD
Usage
We provide the dataset split across the following tab-separated files:
OOD.public.tsv: OOD split of the data in the paper.
Academic.public.tsv: Academic split of the data described in the paper.
part_*.public.tsv: Remaining data split into parts in no particular order.
Each file is tab-separated and has the following format:
tweet_id              phrase      start   end   entityId    score
22                    twttr       20      25    Q918        3
21                    twttr       20      25    Q918        3
1457198399032287235   Diwali      30      38    Q10244      3
1232456079247736833   NO_PHRASE   -1      -1    NO_ENTITY   -1
For Tweets that have no entity, the column values for phrase, start, end, entityId, and score are set to NO_PHRASE, -1, -1, NO_ENTITY, and -1 respectively.
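The file format above can be read with a few lines of standard-library Python. The following is a minimal sketch, not the authors' loader: it assumes each file starts with a header row naming the six columns, and the function name `read_annotations` and the output dictionary keys are hypothetical. Tweets marked with the NO_PHRASE / NO_ENTITY placeholder are kept, but with an empty mention list.

```python
import csv
import io
from collections import defaultdict


def read_annotations(handle):
    """Group annotation rows by tweet_id.

    Assumes a header row (tweet_id, phrase, start, end, entityId, score).
    Tweets without any entity map to an empty list.
    """
    by_tweet = defaultdict(list)
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Looking up the key registers the tweet even if it has no mentions.
        mentions = by_tweet[row["tweet_id"]]
        if row["phrase"] == "NO_PHRASE" and row["entityId"] == "NO_ENTITY":
            continue
        mentions.append(
            {
                "phrase": row["phrase"],
                "start": int(row["start"]),
                "end": int(row["end"]),
                "entity_id": row["entityId"],
                "score": int(row["score"]),
            }
        )
    return dict(by_tweet)


# Two rows copied from the example table, held in memory instead of a file.
SAMPLE = (
    "tweet_id\tphrase\tstart\tend\tentityId\tscore\n"
    "1457198399032287235\tDiwali\t30\t38\tQ10244\t3\n"
    "1232456079247736833\tNO_PHRASE\t-1\t-1\tNO_ENTITY\t-1\n"
)
annotations = read_annotations(io.StringIO(SAMPLE))
```

For a real file, pass an open handle instead, for example `open("OOD.public.tsv", encoding="utf-8", newline="")`. The tweet_id is kept as a string, matching the column type given in the dataset description.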
Description of file columns is as follows:
Column     Type     Missing Value   Description
tweet_id   string   (none)          ID of the Tweet
phrase     string   NO_PHRASE       entity phrase
start      int      -1              start offset of the phrase in the text using UTF-16BE encoding
end        int      -1              end offset of the phrase in the text using UTF-16BE encoding
entityId   string   NO_ENTITY       Entity ID. If not missing can be NOT FOUND, AMBIGUOUS, or Wikidata ID of format Q{numbers}, e.g. Q918
score      int      -1              Number of annotators who agreed on the phrase, start, end, entityId information
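Because start and end are UTF-16 offsets, they can differ from Python string indices whenever the Tweet contains characters outside the Basic Multilingual Plane, such as most emoji. The sketch below shows one way to recover a phrase from the Tweet text. It assumes the offsets count UTF-16 code units and that end is exclusive (consistent with the "twttr" row, where end - start equals the phrase length); the helper name `slice_utf16` and the sample text are made up for illustration.

```python
def slice_utf16(text, start, end):
    """Return the substring of `text` covered by UTF-16 code-unit offsets.

    Assumption: `end` is exclusive. Returns None for the -1 missing value.
    """
    if start < 0 or end < 0:
        return None
    encoded = text.encode("utf-16-be")  # two bytes per UTF-16 code unit
    return encoded[2 * start : 2 * end].decode("utf-16-be")


# Invented text: the emoji occupies two UTF-16 code units but one Python index.
text = "\U0001F600 hi twttr"
phrase = slice_utf16(text, 6, 11)
naive = text[6:11]  # plain slicing is shifted by one position here
```

In this example `phrase` is "twttr" while the naive slice is "wttr", which is why the conversion matters for Tweets with emoji.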
To use the dataset, retrieve the text of each Tweet through the Twitter API using the tweet_id column (see the Data usage policy section below).
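Since the Tweet text has to be fetched by ID, a common first step is to deduplicate the tweet_id values and group them into request-sized batches. The sketch below only prepares the batches and makes no network calls. The default batch size of 100 is an assumption about the lookup limit of the Twitter API; check the current API documentation before relying on it. The function name `batch_tweet_ids` is hypothetical.

```python
def batch_tweet_ids(tweet_ids, batch_size=100):
    """Deduplicate IDs (keeping first-seen order) and split them into batches."""
    unique = list(dict.fromkeys(tweet_ids))
    return [unique[i : i + batch_size] for i in range(0, len(unique), batch_size)]


# IDs as they might appear in the tweet_id column, including a repeat.
ids = ["22", "21", "22"] + [str(n) for n in range(100, 350)]
batches = batch_tweet_ids(ids)
```

Each batch can then be passed to whichever API client you have lawful access to; Tweets that have been deleted or made private will simply be missing from the response.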
Data stats
Split      Number of Rows   Number of unique tweets
OOD        34102            25000
Academic   51685            30119
part_0     11830            10000
part_1     35681            25799
part_2     34256            25000
part_3     36478            25000
part_4     37518            24999
part_5     36626            25000
part_6     34001            24984
part_7     34125            24981
part_8     32556            25000
part_9     32657            25000
part_10    32442            25000
part_11    32033            24972
part_12    76559            25000
part_13    67240            24920
part_14    67745            25000
part_15    67652            25000
part_16    65739            25000
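The two statistics in this table can be recomputed for any downloaded file as a sanity check. This is a minimal sketch that assumes a header row is present and is not counted as a data row; the function name `split_stats` is hypothetical, and whether the published row counts follow the same convention is not stated in the source.

```python
import csv
import io


def split_stats(handle):
    """Return (number of data rows, number of unique tweet_ids) for one TSV file."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = 0
    tweet_ids = set()
    for row in reader:
        rows += 1
        tweet_ids.add(row["tweet_id"])
    return rows, len(tweet_ids)


# Small in-memory sample: one tweet with two mentions (the second is invented)
# and one tweet without entities.
SAMPLE_ROWS = (
    "tweet_id\tphrase\tstart\tend\tentityId\tscore\n"
    "22\ttwttr\t20\t25\tQ918\t3\n"
    "22\texample\t0\t7\tAMBIGUOUS\t2\n"
    "1232456079247736833\tNO_PHRASE\t-1\t-1\tNO_ENTITY\t-1\n"
)
stats = split_stats(io.StringIO(SAMPLE_ROWS))
```

Rows exceed unique Tweets whenever a Tweet carries more than one annotated phrase, which is why the first column of the table is always at least as large as the second.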
Data usage policy
Use of this dataset is subject to you obtaining lawful access to the Twitter API, which requires you to agree to the Developer Terms Policies and Agreements.
Please cite the following if you use TweetNERD in your paper:
@dataset{TweetNERD_Zenodo_2022_6617192,
author = {Mishra, Shubhanshu and Saini, Aman and Makki, Raheleh and Mehta, Sneha and Haghighi, Aria and Mollahosseini, Ali},
title = {{TweetNERD - End to End Entity Linking Benchmark for Tweets}},
month = jun,
year = 2022,
note = {{Data usage policy Use of this dataset is subject to you obtaining lawful access to the [Twitter API](https://developer.twitter.com/en/docs/twitter-api), which requires you to agree to the [Developer Terms Policies and Agreements](https://developer.twitter.com/en/developer-terms/).}},
publisher = {Zenodo},
version = {0.0.0},
doi = {10.5281/zenodo.6617192},
url = {https://doi.org/10.5281/zenodo.6617192}
}
@inproceedings{TweetNERDNeurips2022,
author = {Mishra, Shubhanshu and Saini, Aman and Makki, Raheleh and Mehta, Sneha and Haghighi, Aria and Mollahosseini, Ali},
booktitle = {Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks},
pages = {},
title = {TweetNERD - End to End Entity Linking Benchmark for Tweets},
volume = {2},
year = {2022},
eprint = {arXiv:2210.08129},
doi = {10.48550/arXiv.2210.08129}
}
Created: 2023-02-03



