
The UMass Global English on Twitter Dataset

Source: SSH Open MarketPlace. Updated 2021-07-22; indexed 2024-08-03.
Download link:
https://marketplace.sshopencloud.eu/dataset/GogVsK
Description:
Identifying the language a tweet is written in can be difficult. Tweets are very short, and they often include code-switching, where the user mixes two or more languages, or names borrowed from a different language. This dataset contains 10,502 tweets, randomly sampled from all publicly available geotagged Twitter messages and sent from 130 different countries. Each tweet is annotated for whether it is in English or not, whether it contains code-switching, whether its language is ambiguous (including ambiguity caused by names from a different language), and whether it was generated automatically.

The file all_annotated.tsv contains the dataset of 10,502 tweets used in the paper. Text is encoded as UTF-8. The column headings (also given in the .tsv file) are: tweet ID, ISO country code, tweet date, tweet text, definitely English, ambiguous, definitely not English, code-switched, ambiguous due to named entities, and automatically generated tweets.
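
As a rough illustration of how the file described above might be read, here is a minimal Python sketch. It assumes only what the description states: a tab-separated file with a header row, encoded as UTF-8. The helper names load_rows and count_by are hypothetical. The exact header strings and the values stored in the label columns are not given here, so the sketch takes the column name as a parameter instead of guessing it. Quote handling is switched off on the assumption that tweet text may contain unescaped quotation marks; a file that uses quoted multi-line fields would need different settings.

```python
import csv
from collections import Counter


def load_rows(path):
    """Read a tab-separated file with a header row into a list of dicts.

    Assumption: one record per line, no quoted multi-line fields.
    """
    with open(path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps quotation marks inside tweet text as literal characters.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)


def count_by(rows, column):
    """Count how often each value occurs in the given column."""
    return Counter(row[column] for row in rows)
```

For example, rows = load_rows("all_annotated.tsv") followed by print(list(rows[0].keys())) would show the actual header strings, and any one of them could then be passed to count_by(rows, ...) to tabulate the values in that column.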
Created:
2021-07-22