
tweet_qa

ModelScope community. Updated 2025-08-07, indexed 2025-08-09.
Download link:
https://modelscope.cn/datasets/Virgo-Internal/tweet_qa
# Dataset Card for TweetQA

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [TweetQA homepage](https://tweetqa.github.io/)
- **Repository:**
- **Paper:** [TWEETQA: A Social Media Focused Question Answering Dataset](https://arxiv.org/abs/1907.06292)
- **Leaderboard:** [TweetQA Leaderboard](https://tweetqa.github.io/)
- **Point of Contact:** [Wenhan Xiong](xwhan@cs.ucsb.edu)

### Dataset Summary

As social media becomes an increasingly popular channel for reporting news and real-time events, developing automated question answering systems is critical to the effectiveness of many applications that rely on real-time knowledge. While previous question answering (QA) datasets have concentrated on formal text such as news and Wikipedia, TweetQA is presented as the first large-scale dataset for QA over social media data. To make sure the tweets are meaningful and contain interesting information, the authors gather tweets that journalists used to write news articles.
Then human annotators are asked to write questions and answers about these tweets. Unlike other QA datasets such as SQuAD, in which the answers are extractive, the answers here are allowed to be abstractive. The task requires a model to read a short tweet and a question and output a text phrase (which does not need to appear in the tweet) as the answer.

### Supported Tasks and Leaderboards

- `question-answering`: The dataset can be used to train a model for Open-Domain Question Answering, where the task is to answer the given questions for a tweet. Performance is measured by comparing the model answers to the annotated ground truth and calculating the BLEU-1, Meteor, and ROUGE-L scores. This task has an active leaderboard, which can be found [here](https://tweetqa.github.io/) and ranks models based on [BLEU-1](https://huggingface.co/metrics/blue), [Meteor](https://huggingface.co/metrics/meteor) and [ROUGE-L](https://huggingface.co/metrics/rouge).

### Languages

English.

## Dataset Structure

### Data Instances

Sample data:

```
{
  "Question": "who is the tallest host?",
  "Answer": ["sam bee", "sam bee"],
  "Tweet": "Don't believe @ConanOBrien's height lies. Sam Bee is the tallest host in late night. #alternativefacts\u2014 Full Frontal (@FullFrontalSamB) January 22, 2017",
  "qid": "3554ee17d86b678be34c4dc2c04e334f"
}
```

The test split doesn't include answers, so its `Answer` field is an empty list.

### Data Fields

- `Question`: a question based on information from a tweet
- `Answer`: list of possible answers from the tweet
- `Tweet`: source tweet
- `qid`: question id

### Data Splits

The dataset is split into train, validation, and test sets. The train set contains 10692 examples, the validation set 1086 examples, and the test set 1979 examples.
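To make the record layout and the multi-reference scoring concrete, here is a minimal sketch in plain Python. It assumes whitespace tokenization, lowercasing, an unweighted F1 form of ROUGE-L, and taking the best score over the references in `Answer`; the official leaderboard script may tokenize and aggregate differently. The function names are illustrative and not part of any TweetQA tooling.

```python
from typing import List

# Split sizes as stated in this card.
SPLIT_SIZES = {"train": 10692, "validation": 1086, "test": 1979}

# The sample record from the card (test-split records have an empty "Answer" list).
EXAMPLE = {
    "Question": "who is the tallest host?",
    "Answer": ["sam bee", "sam bee"],
    "qid": "3554ee17d86b678be34c4dc2c04e334f",
}


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L as an unweighted F1 over whitespace tokens (an assumption)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_rouge_l(prediction: str, references: List[str]) -> float:
    """Score a prediction against every reference and keep the best match."""
    return max((rouge_l_f1(prediction, r) for r in references), default=0.0)


if __name__ == "__main__":
    print(best_rouge_l("sam bee", EXAMPLE["Answer"]))
    print(sum(SPLIT_SIZES.values()))
```

The three split sizes sum to 13757, which matches the total number of crowdsourced question-answer pairs reported in the annotation section.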
## Dataset Creation

### Curation Rationale

As social media becomes an increasingly popular channel for reporting news and real-time events, developing automated question answering systems is critical to the effectiveness of many applications that rely on real-time knowledge. While previous question answering (QA) datasets have concentrated on formal text such as news and Wikipedia, TweetQA is presented as the first large-scale dataset for QA over social media data. To make sure the tweets are meaningful and contain interesting information, the authors gather tweets that journalists used to write news articles. Then human annotators are asked to write questions and answers about these tweets. Unlike other QA datasets such as SQuAD, in which the answers are extractive, the answers here are allowed to be abstractive. The task requires a model to read a short tweet and a question and output a text phrase (which does not need to appear in the tweet) as the answer.

### Source Data

#### Initial Data Collection and Normalization

The authors look into the archived snapshots of two major news websites (CNN, NBC) and extract the tweet blocks that are embedded in the news articles. To get enough data, they first extract the URLs of all section pages (e.g. World, Politics, Money, Tech) from the snapshot of each home page and then crawl all articles with tweets from these section pages.

They then filter out the tweets that rely heavily on attached media to convey information. For this they use a state-of-the-art semantic role labeling model trained on CoNLL-2005 (He et al., 2017) to analyze the predicate-argument structure of the tweets collected from news articles, and keep only the tweets with more than two labeled arguments. This filtering process also automatically removes most of the short tweets. Of the tweets collected from CNN, 22.8% were filtered out via semantic role labeling; of the tweets from NBC, 24.1% were filtered out.
#### Who are the source language producers?

Twitter users.

### Annotations

#### Annotation process

Amazon Mechanical Turk workers were used to collect question-answer pairs for the filtered tweets. For each Human Intelligence Task (HIT), the authors ask the worker to read three tweets and write two question-answer pairs for each tweet. To ensure quality, they require the workers to be located in major English-speaking countries (i.e. Canada, US, and UK) and to have an acceptance rate larger than 95%.

Since the authors use tweets as context, a lot of important information is contained in hashtags or even emojis. Instead of showing only the text to the workers, they use JavaScript to embed the whole tweet directly into each HIT. This gives workers the same experience as reading tweets in a web browser and helps them compose better questions.

To avoid trivial questions that can be answered by superficial text matching, as well as overly challenging questions that require background knowledge, the authors explicitly state the following items in the HIT instructions for question writing:

- No yes-no questions should be asked.
- The question should have at least five words.
- Videos, images or inserted links should not be considered.
- No background knowledge should be required to answer the question.

To help the workers follow the instructions, they also include a representative example showing both good and bad questions or answers. As for the answers, since the context they consider is shorter than the context of previous datasets, they do not restrict the answers to be in the tweet; otherwise, the task could potentially be simplified to a classification problem. The workers are allowed to write their answers in their own words, but the authors require the answers to be brief and directly inferable from the tweets.
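The two mechanically checkable question-writing rules above (no yes-no questions, at least five words) can be sketched as a small check. This is an illustration only: the list of auxiliary verbs used to guess whether a question is yes-no is an assumption made here, not the authors' method, and the function name is hypothetical.

```python
import string
from typing import List

MIN_QUESTION_WORDS = 5

# Heuristic (assumption): questions opening with an auxiliary verb are usually yes-no.
YES_NO_STARTERS = {
    "is", "are", "was", "were", "do", "does", "did", "can", "could",
    "will", "would", "should", "has", "have", "had",
}


def question_rule_violations(question: str) -> List[str]:
    """Return the instruction rules that a question appears to break."""
    words = question.split()
    violations = []
    if len(words) < MIN_QUESTION_WORDS:
        violations.append("fewer than five words")
    first = words[0].lower().strip(string.punctuation) if words else ""
    if first in YES_NO_STARTERS:
        violations.append("looks like a yes-no question")
    return violations


if __name__ == "__main__":
    for q in ["who is the tallest host?", "is sam bee tall?"]:
        print(q, question_rule_violations(q))
```

The remaining two rules (ignoring attached media and requiring no background knowledge) need human judgment and are not covered by this sketch.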
After they retrieve the QA pairs from all HITs, they conduct further post-filtering to remove the pairs from workers who obviously did not follow the instructions. They remove QA pairs with yes/no answers, and questions with fewer than five words are also filtered out. This process filtered out 13% of the QA pairs. The dataset now includes 10,898 articles, 17,794 tweets, and 13,757 crowdsourced question-answer pairs. All QA pairs were written by 492 individual workers.

#### Who are the annotators?

Amazon Mechanical Turk workers.

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

From the paper:

> It is also worth noting that the data collected from social media can not only capture events and developments in real-time but also capture individual opinions and thus requires reasoning related to the authorship of the content as is illustrated in Table 1.

> Specifically, a significant amount of questions require certain reasoning skills that are specific to social media data:
>
> - Understanding authorship: Since tweets are highly personal, it is critical to understand how questions/tweets related to the authors.
> - Oral English & Tweet English: Tweets are often oral and informal. QA over tweets requires the understanding of common oral English. Our TWEETQA also requires understanding some tweet-specific English, like conversation-style English.
> - Understanding of user IDs & hashtags: Tweets often contains user IDs and hashtags, which are single special tokens. Understanding these special tokens is important to answer person- or event-related questions.
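As a small illustration of the last point, user IDs and hashtags can be surfaced as separate tokens before a model sees the tweet. The sketch below uses regular expressions that assume ASCII handles of at most 15 characters and ASCII hashtags; these patterns are a simplification chosen here and are not taken from the paper.

```python
import re
from typing import Dict, List

# Assumed patterns: "@" or "#" not preceded by a word character,
# followed by letters, digits, or underscores.
MENTION_RE = re.compile(r"(?<![A-Za-z0-9_])@([A-Za-z0-9_]{1,15})")
HASHTAG_RE = re.compile(r"(?<![A-Za-z0-9_])#([A-Za-z0-9_]+)")


def special_tokens(tweet: str) -> Dict[str, List[str]]:
    """Collect the user IDs and hashtags that appear in a tweet."""
    return {
        "mentions": MENTION_RE.findall(tweet),
        "hashtags": HASHTAG_RE.findall(tweet),
    }


# The sample tweet from the Data Instances section.
SAMPLE_TWEET = (
    "Don't believe @ConanOBrien's height lies. Sam Bee is the tallest host "
    "in late night. #alternativefacts\u2014 Full Frontal (@FullFrontalSamB) "
    "January 22, 2017"
)

if __name__ == "__main__":
    print(special_tokens(SAMPLE_TWEET))
```

On the sample tweet this yields the mentions `ConanOBrien` and `FullFrontalSamB` and the hashtag `alternativefacts`, which is the kind of token a model has to link to "Sam Bee" or "Conan O'Brien" when answering person-related questions.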
### Other Known Limitations

[More Information Needed]

## Additional Information

The annotated answers are validated by the authors as follows.

For the purposes of human performance evaluation and inter-annotator agreement checking, the authors launch a different set of HITs that ask workers to answer the questions in the test and development sets. The workers are shown the tweet blocks as well as the questions collected in the previous step. At this step, workers are allowed to label a question as “NA” if they think it is not answerable. The authors find that 3.1% of the questions are labeled as unanswerable by the workers (for SQuAD, the ratio is 2.6%).

Since the answers collected at this step and at the previous step are written by different workers, the answers can take different text forms even if they are semantically equal to each other. For example, one answer can be “Hillary Clinton” while the other is “@HillaryClinton”. As it is not straightforward to calculate the overall agreement automatically, the authors manually check the agreement on a subset of 200 random samples from the development set and ask an independent human moderator to verify the result. It turns out that 90% of the answer pairs are semantically equivalent, 2% are partially equivalent (one of them is incomplete), and 8% are totally inconsistent. The answers collected at this step are also used to measure human performance. 59 individual workers participated in this process.

### Dataset Curators

Xiong, Wenhan and Wu, Jiawei and Wang, Hong and Kulkarni, Vivek and Yu, Mo and Guo, Xiaoxiao and Chang, Shiyu and Wang, William Yang.

### Licensing Information

CC BY-SA 4.0.
### Citation Information

```
@inproceedings{xiong2019tweetqa,
  title={TweetQA: A Social Media Focused Question Answering Dataset},
  author={Xiong, Wenhan and Wu, Jiawei and Wang, Hong and Kulkarni, Vivek and Yu, Mo and Guo, Xiaoxiao and Chang, Shiyu and Wang, William Yang},
  booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
  year={2019}
}
```

### Contributions

Thanks to [@anaerobeth](https://github.com/anaerobeth) for adding this dataset.

Provider:
maas
Created:
2025-08-07