Emoji Gestures in English Tweets: California
Source: NIAID Data Ecosystem (indexed 2026-03-13)
Download link:
https://zenodo.org/record/5800197
Description:
The dataset consists of 479,193 English-language tweets, each containing one of 31 gesture emoji (different hand configurations) or one of its skin tone variants (e.g. 🙏🙏🏿🙏🏾🙏🏽🙏🏼🙏🏻). The tweets were posted within 250 km of San Jose, CA, or within 200 km of Los Angeles, CA, during May-August 2021. The dataset can be used to investigate how English-speaking Twitter users in California use gesture emoji. The following Python libraries were used for collecting and preprocessing the tweets: tweepy, re, preprocessor, emoji, regex, string, nltk.
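For analyses that group the skin tone variants of a gesture together, it can help to split each emoji into its base character and its modifier. The following is a minimal sketch, not the authors' code. It relies only on the Unicode fact that the five skin tone modifiers occupy U+1F3FB through U+1F3FF; the helper names `split_skin_tone` and `count_variants` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# The five Fitzpatrick skin tone modifiers: U+1F3FB (light) to U+1F3FF (dark).
SKIN_TONE_MODIFIERS = {chr(cp) for cp in range(0x1F3FB, 0x1F3FF + 1)}


def split_skin_tone(emoji_text: str) -> Tuple[str, Optional[str]]:
    """Split one emoji into (base emoji, skin tone modifier or None).

    Only the first modifier is reported; sequences carrying several
    modifiers (e.g. some multi-person emoji) are not handled specially.
    """
    modifier = next((ch for ch in emoji_text if ch in SKIN_TONE_MODIFIERS), None)
    base = "".join(ch for ch in emoji_text if ch not in SKIN_TONE_MODIFIERS)
    return base, modifier


def count_variants(emoji_list: Iterable[str]) -> Counter:
    """Count occurrences of each (base emoji, modifier) pair."""
    return Counter(split_skin_tone(e) for e in emoji_list)


if __name__ == "__main__":
    sample = ["🙏", "🙏🏿", "🙏🏿", "🙏🏻"]
    for (base, tone), n in count_variants(sample).items():
        print(base, tone or "default", n)
```

Keeping the modifier as a separate value makes it possible to aggregate either by gesture (ignoring tone) or by tone (ignoring gesture) without re-parsing the emoji strings.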
The dataset contains 11 columns:
preprocessed: preprocessed text of the tweet (output of the 4 preprocessing steps)
all_emoji: list of all emoji in the tweet
hashtags: list of all hashtags in the tweet
user_encoded: encoded Twitter user name, built from the first 3 characters of the user name and the first 3 characters of the user's location
location_encoded: location of the user, one of "los_angeles", "san_diego", "san_jose", "san_francisco", "fresno", "long_beach", "sacramento", "oakland", "bakersfield", "anaheim", or "other"
mention_present: whether the tweet contains mentions
url_present: whether the tweet contains a URL
preprocess_tweet: preprocessing step 1, tokenizing mentions, URLs, and hashtags
lowercase_tweet: preprocessing step 2, lowercasing
remove_punct_tweet: preprocessing step 3, removing punctuation
tokenize_tweet: preprocessing step 4, tokenizing
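The four preprocessing steps named in the column list can be sketched roughly as follows. This is an illustration using only the standard library, not the project's actual pipeline (which lists preprocessor and nltk among its dependencies). The regular expressions, the placeholder words "URL", "MENTION", and "HASHTAG", and the whitespace tokenizer are all assumptions; the function names simply borrow the column names for readability.

```python
import re
import string

# Assumed patterns for Twitter entities; the real pipeline may differ.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")

# Translation table that deletes ASCII punctuation (emoji are untouched).
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess_tweet(text: str) -> str:
    """Step 1: replace URLs, mentions, and hashtags with placeholder words."""
    text = URL_RE.sub("URL", text)
    text = MENTION_RE.sub("MENTION", text)
    return HASHTAG_RE.sub("HASHTAG", text)


def lowercase_tweet(text: str) -> str:
    """Step 2: lowercase the text."""
    return text.lower()


def remove_punct_tweet(text: str) -> str:
    """Step 3: remove ASCII punctuation."""
    return text.translate(PUNCT_TABLE)


def tokenize_tweet(text: str) -> list:
    """Step 4: split on whitespace (a stand-in for a real tokenizer)."""
    return text.split()


def preprocess(text: str) -> list:
    """Run the four steps in order."""
    return tokenize_tweet(remove_punct_tweet(lowercase_tweet(preprocess_tweet(text))))


if __name__ == "__main__":
    print(preprocess("Thanks @Bob! 🙏🏽 #blessed https://t.co/abc"))
```

Because lowercasing runs after step 1, the placeholders end up as the plain tokens "url", "mention", and "hashtag". An emoji written directly against a word (no space) stays attached to it under whitespace splitting, which a dedicated tweet tokenizer would likely handle better.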
Further information on the research project is available at: https://github.com/mzhukovaucsb/emoji_gestures/
Created:
2022-05-18



