Complex Twitter Dataset

arXiv | Updated 2022-12-18 | Indexed 2024-06-21

Download link:

https://1drv.ms/u/s!AroHb-W6OAlavK4begsDsMALfE?e=c8f2XX


Resource overview:

This study introduces a new dataset, the Complex Twitter Dataset, created by Yu Wang and Hongxia Jin. It contains 885 long tweets averaging 33.2 tokens each, considerably longer than the utterances in traditional spoken language understanding (SLU) datasets. The tweets come mainly from the official Twitter accounts of police and fire departments in 23 major U.S. cities and cover event types such as fires, crimes, traffic accidents, and natural disasters. The dataset was built to address the weakness of existing models on long sentences and complex semantic structures, in particular their difficulty recognizing and handling out-of-distribution patterns and out-of-vocabulary tokens. Intended applications include improving how personal AI assistants and chatbots understand and respond to complex natural language queries.

Provided by:

Not specified

Created:

2022-12-18

Collected Summary

Dataset Introduction

Background and Challenges

Background Overview

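The Complex Twitter Dataset contains 885 long tweets averaging 33.2 tokens each, drawn from the official Twitter accounts of police and fire departments in 23 major U.S. cities and covering a range of event types. It is intended to address the shortcomings of existing models on long sentences and complex semantic structures, especially out-of-distribution patterns and out-of-vocabulary tokens. Applications include improving the natural language understanding and response capabilities of personal AI assistants and chatbots.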

The content above was collected and summarized by 遇见数据集.
