strombergnlp/twitter_pos
Dataset Overview
Dataset Name
- Name: Twitter Part-of-speech
- Alias: twitter-pos
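Dataset Properties
- Language: English (bcp47:en)
- License: CC-BY-4.0
- Multilinguality: monolingual
- Size: 10K<n<100K
- Source: original data
- Task category: part-of-speech tagging
- Task ID: part-of-speech
- Papers with Code ID: ritter-pos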
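Dataset Contents
- Description: The dataset comprises two sub-datasets for part-of-speech tagging of English tweets.
  - Ritter: provides training, development, and test sets.
  - Foster: provides development and test sets.
- Data instances: Each instance contains an ID, tokens (a list of words), and pos_tags (a list of part-of-speech tags).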
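As a rough illustration of the instance layout described above, the sketch below models one record as a small Python dataclass and checks that tokens and tags line up one to one. It assumes a string ID, a list of string tokens, and a list of integer tag IDs. The class name `PosInstance`, the label list `EXAMPLE_LABELS`, and the sample tweet are hypothetical and are not taken from the dataset.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Hypothetical label inventory, for illustration only. The real mapping
# from integer IDs to tag names is defined by the dataset itself.
EXAMPLE_LABELS = ["NN", "VB", "DT", "USR", "HT"]


@dataclass
class PosInstance:
    """One tagged tweet: an ID, its tokens, and one integer tag per token."""

    id: str
    tokens: List[str]
    pos_tags: List[int]

    def __post_init__(self) -> None:
        # Every token needs exactly one tag.
        if len(self.tokens) != len(self.pos_tags):
            raise ValueError(
                f"instance {self.id}: {len(self.tokens)} tokens "
                f"but {len(self.pos_tags)} tags"
            )

    def labelled(self, labels: Sequence[str]) -> List[Tuple[str, str]]:
        """Pair each token with the tag name its integer ID points to."""
        return [(tok, labels[tag]) for tok, tag in zip(self.tokens, self.pos_tags)]


# Made-up example record, not drawn from the dataset.
example = PosInstance(
    id="0",
    tokens=["@user", "loves", "the", "#nlp", "course"],
    pos_tags=[3, 1, 2, 4, 0],
)
print(example.labelled(EXAMPLE_LABELS))
```

Validating the length match at construction time catches misaligned records early, before they reach a tagger or an evaluation script.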
Dataset Structure
- Data fields:
  - id: a string.
  - tokens: a list of strings.
  - pos_tags: a list of integers, each representing a part-of-speech tag.
- Data splits:
  - Ritter:
    - Training set: 10652 tokens, 551 sentences
    - Development set: 2242 tokens, 118 sentences
    - Test set: 2291 tokens, 118 sentences
  - Foster:
    - Development set: 2998 tokens, 270 sentences
    - Test set: 2841 tokens, 250 sentences
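The split sizes listed above can be summarised with a few lines of Python. This is only a sketch: the `SPLITS` dictionary hard-codes the figures from this card, and the helper names `totals` and `mean_tokens_per_sentence` are illustrative rather than part of any dataset API.

```python
# (tokens, sentences) per split, copied from the figures on this card.
SPLITS = {
    "ritter": {
        "train": (10652, 551),
        "dev": (2242, 118),
        "test": (2291, 118),
    },
    "foster": {
        "dev": (2998, 270),
        "test": (2841, 250),
    },
}


def totals(subset: str) -> tuple:
    """Sum tokens and sentences over every split of one sub-dataset."""
    tokens = sum(t for t, _ in SPLITS[subset].values())
    sentences = sum(s for _, s in SPLITS[subset].values())
    return tokens, sentences


def mean_tokens_per_sentence(subset: str, split: str) -> float:
    """Average sentence length, in tokens, for a single split."""
    tokens, sentences = SPLITS[subset][split]
    return tokens / sentences


for name in SPLITS:
    tok, sent = totals(name)
    print(f"{name}: {tok} tokens in {sent} sentences")
```

By these figures, Ritter totals 15185 tokens in 787 sentences and Foster totals 5839 tokens in 520 sentences.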
Supported Tasks and Leaderboards
- Task: part-of-speech tagging
- Leaderboard: part-of-speech tagging on Ritter
Citation Information
@inproceedings{ritter2011named, title={Named entity recognition in tweets: an experimental study}, author={Ritter, Alan and Clark, Sam and Etzioni, Oren and others}, booktitle={Proceedings of the 2011 conference on empirical methods in natural language processing}, pages={1524--1534}, year={2011} }
@inproceedings{foster2011hardtoparse, title={# hardtoparse: POS Tagging and Parsing the Twitterverse}, author={Foster, Jennifer and Cetinoglu, Ozlem and Wagner, Joachim and Le Roux, Joseph and Hogan, Stephen and Nivre, Joakim and Hogan, Deirdre and Van Genabith, Josef}, booktitle={Workshops at the Twenty-Fifth AAAI Conference on Artificial Intelligence}, year={2011} }
@inproceedings{derczynski2013twitter, title={Twitter part-of-speech tagging for all: Overcoming sparse and noisy data}, author={Derczynski, Leon and Ritter, Alan and Clark, Sam and Bontcheva, Kalina}, booktitle={Proceedings of the international conference recent advances in natural language processing ranlp 2013}, pages={198--206}, year={2013} }



