
kaiimran/malaysian-tweets-from-2022-04-17-until-2022-09-03

Source: Hugging Face · Updated 2024-06-08 · Indexed 2024-06-29
Download link:
https://hf-mirror.com/datasets/kaiimran/malaysian-tweets-from-2022-04-17-until-2022-09-03
Description:
---
language:
- ms
pretty_name: malay-tweets
---

---

## Malaysian Tweets Dataset (from 17 April 2022 until 3 September 2022)

### Original Source

The dataset was originally sourced from [Mesolitica](https://huggingface.co/datasets/mesolitica/snapshot-twitter-2022-09-03). The author split it into 12 files in JSON Lines format, each containing JSON objects separated by newlines. Each file is approximately 3.9 GB.

### Reasons for modifying the original dataset

- To make it more accessible for most users (students, university lecturers, beginner hobbyists, junior data analysts/BI analysts) who are more familiar with CSV files. Many users may not know what the JSON Lines format is or how to work with it.
- To reduce analysis time by stripping away less useful attributes and reordering the columns for readability.
- To enable processing on ordinary laptops with limited RAM and CPU by reducing the file size from around 3.9 GB each to 106 MB each. The reduction comes from removing less useful attributes and from the format itself: CSV writes the column names once, whereas JSON Lines repeats every key in every row.

### Description

This dataset consists of Twitter data collected between April 17, 2022, and September 3, 2022. Due to changes in the Twitter API (following Elon Musk's acquisition of Twitter and its rebranding to X), further snapshots are no longer feasible.

- **Timestamp Range:** 2022-04-17T16:30:07.000Z to 2022-09-03T09:23:52.000Z
- **Total Rows:** 7,075,025
- **Files:** Split across 12 CSV files

### Data Columns

Each CSV file contains the following columns:

- **tweet_created_at:** Time when the tweet was created
- **tweet_timestamp_ms:** Timestamp in milliseconds
- **tweet_id_str:** Unique identifier for the tweet
- **tweet_lang:** Language of the tweet (based on Twitter's tagging, which is often inaccurate; Malay is often tagged as Indonesian, `in`)
- **tweet_text:** Text content of the tweet
- **tweet_possibly_sensitive:** Indicates if the tweet might contain sensitive content (as determined by Twitter)
- **user_id_str:** Unique identifier for the user
- **user_screen_name:** Account username of the tweet poster
- **user_followers_count:** Number of followers the user has

#### Example Row

```
Mon Aug 01 13:45:07 +0000 2022, 1659361507465, 1.5541009186644582e+18, in, Kedah main macam butoh sekarang, FALSE, 953474382.0, Muddloyy, 1207.0
```

### Language and Region

The tweets are predominantly in Malay and posted by users in Malaysia. The data was filtered based on the following geographic coordinates:

```
locations=[
    99.8568959909, 0.8232449017,
    119.5213933664, 7.2037547089,
]
```

### Some rows are malformed, so remove the rows with empty values first:

```
import pandas as pd
import numpy as np

# Read every column as a string first so malformed rows do not break type inference
dtype_spec = {
    'tweet_created_at': str,
    'tweet_timestamp_ms': str,
    'tweet_id_str': str,
    'tweet_lang': str,
    'tweet_text': str,
    'tweet_possibly_sensitive': str,
    'user_id_str': str,
    'user_screen_name': str,
    'user_followers_count': str
}

# Load the CSV file with the specified data types
df = pd.read_csv('extracted_data0.csv', dtype=dtype_spec)
print(df.dtypes)

# Replace empty strings with NaN
df.replace('', np.nan, inplace=True)

# Drop rows where any column is empty (NaN)
df = df.dropna()

# Numeric fields can be float-like strings (e.g. '1207.0' in the example row),
# so parse them as floats before casting to integers
df['tweet_timestamp_ms'] = df['tweet_timestamp_ms'].astype(float).astype('int64')
df['user_followers_count'] = df['user_followers_count'].astype(float).astype('int64')

# astype(bool) would turn every non-empty string, including 'FALSE', into True,
# so compare against the literal 'TRUE' instead
df['tweet_possibly_sensitive'] = (
    df['tweet_possibly_sensitive'].str.strip().str.upper() == 'TRUE'
)

# Display the data types to verify
print(df.dtypes)
```
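The size argument under "Reasons for modifying the original dataset" (CSV writes the column names once, JSON Lines repeats every key in every row) can be illustrated with a small comparison. This is only a sketch: the rows are made up, only three of the dataset's column names are used, and the helper names `jsonl_size` and `csv_size` are hypothetical. The ratio it prints says nothing about the actual 3.9 GB to 106 MB reduction, which also came from dropping attributes.

```python
import csv
import io
import json

# Made-up rows that reuse three of the dataset's column names
rows = [
    {"tweet_lang": "in", "user_screen_name": f"user{i}", "user_followers_count": i}
    for i in range(1000)
]

def jsonl_size(rows):
    """Bytes needed when each row is one JSON object per line (keys repeated)."""
    return sum(len(json.dumps(r).encode("utf-8")) + 1 for r in rows)

def csv_size(rows):
    """Bytes needed when rows are written as CSV with a single header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return len(buf.getvalue().encode("utf-8"))

print("JSON Lines bytes:", jsonl_size(rows))
print("CSV bytes:", csv_size(rows))
```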
Provider:
kaiimran
Summary of Original Information

Malaysian Tweets Dataset (from 17 April 2022 until 3 September 2022)

Dataset Overview

  • Timestamp range: 2022-04-17T16:30:07.000Z to 2022-09-03T09:23:52.000Z
  • Total rows: 7,075,025
  • File format: 12 CSV files
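Because the rows are spread over 12 files, a total such as the row count above can be checked without loading everything into memory. The sketch below streams each file with the standard `csv` module. The helper name `count_rows` is hypothetical, and the file-name pattern is an assumption extrapolated from `extracted_data0.csv`, the only file name the dataset card mentions.

```python
import csv

def count_rows(paths):
    """Count data rows (excluding each file's header) across CSV files, streaming row by row."""
    total = 0
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            next(reader, None)  # skip the header line
            total += sum(1 for _ in reader)
    return total

# Assumed naming pattern; adjust to the actual file names after downloading
paths = [f"extracted_data{i}.csv" for i in range(12)]
# total = count_rows(paths)  # uncomment once the files are available locally
```

Using `csv.reader` rather than counting raw lines matters here because tweet text can contain line breaks inside quoted fields.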

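Data Columns

Each CSV file contains the following columns:

  • tweet_created_at: time when the tweet was created
  • tweet_timestamp_ms: timestamp in milliseconds
  • tweet_id_str: unique identifier for the tweet
  • tweet_lang: language of the tweet (based on Twitter's tagging, which may be inaccurate; Malay is often tagged as Indonesian)
  • tweet_text: text content of the tweet
  • tweet_possibly_sensitive: whether the tweet might contain sensitive content (as determined by Twitter)
  • user_id_str: unique identifier for the user
  • user_screen_name: username of the tweet poster
  • user_followers_count: number of followers the user has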
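Since the `tweet_lang` tag is described as unreliable, it may help to look at how the tags are distributed before filtering on them. The following is a minimal sketch using only the standard library. The helper name `count_languages` is hypothetical and the sample rows are made up; only the column names come from the dataset card.

```python
import csv
import io
from collections import Counter

def count_languages(csv_file):
    """Tally the tweet_lang column of an open CSV file that has a header row."""
    reader = csv.DictReader(csv_file)
    return Counter(row["tweet_lang"] for row in reader)

# Tiny made-up sample that reuses two of the dataset's column names
sample = io.StringIO(
    "tweet_lang,tweet_text\n"
    "in,contoh satu\n"
    "in,contoh dua\n"
    "en,example three\n"
)
counts = count_languages(sample)
print(counts.most_common())
```

For a real file, pass an open file handle (for example `open('extracted_data0.csv', newline='', encoding='utf-8')`) instead of the in-memory sample.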

Example Row

Mon Aug 01 13:45:07 +0000 2022, 1659361507465, 1.5541009186644582e+18, in, Kedah main macam butoh sekarang, FALSE, 953474382.0, Muddloyy, 1207.0
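The first two fields of the example row carry the same instant in two forms. As a quick sanity check, the sketch below parses both with the standard library. The `strptime` format string is inferred from this single example row, not taken from any documentation quoted in the dataset card, so treat it as an assumption.

```python
from datetime import datetime, timezone

# Values copied from the example row
created_at = "Mon Aug 01 13:45:07 +0000 2022"
timestamp_ms = 1659361507465

# Parse the textual timestamp; %z reads the '+0000' UTC offset
parsed = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")

# Convert the millisecond epoch value to a timezone-aware UTC datetime
from_ms = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)

print(parsed.isoformat())
# The two values agree once sub-second precision is dropped
print(from_ms.replace(microsecond=0) == parsed)
```

The third field, `1.5541009186644582e+18`, is the tweet ID rendered in floating-point notation, so its low-order digits are not recoverable from this row.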

Language and Region

  • Primary language: Malay

  • Primary region: Malaysia

  • Geographic coordinate filter:

    locations=[ 99.8568959909, 0.8232449017, 119.5213933664, 7.2037547089, ]
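The four numbers read naturally as a bounding box. The sketch below assumes the order is west longitude, south latitude, east longitude, north latitude; the dataset card does not state the order, so this is an assumption, although the values are consistent with it. The helper name `in_bounding_box` is hypothetical and the city coordinates are approximate.

```python
# Bounding box values copied from the dataset card
locations = [99.8568959909, 0.8232449017, 119.5213933664, 7.2037547089]

# Assumed order: west longitude, south latitude, east longitude, north latitude
west, south, east, north = locations

def in_bounding_box(lon, lat):
    """Return True if the point lies inside the rectangle (edges included)."""
    return west <= lon <= east and south <= lat <= north

# Approximate (longitude, latitude) pairs, for illustration only
kuala_lumpur = in_bounding_box(101.69, 3.14)
jakarta = in_bounding_box(106.85, -6.21)
singapore = in_bounding_box(103.82, 1.35)
print(kuala_lumpur, jakarta, singapore)
```

Under that assumption the rectangle also covers neighbouring areas such as Singapore, so "posted by users in Malaysia" is best read as an approximation.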

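Data Preprocessing

  • Handling empty values: removing rows that contain empty values is recommended.

    import pandas as pd
    import numpy as np

    # Define the data types (read every column as a string first)
    dtype_spec = {
        'tweet_created_at': str,
        'tweet_timestamp_ms': str,
        'tweet_id_str': str,
        'tweet_lang': str,
        'tweet_text': str,
        'tweet_possibly_sensitive': str,
        'user_id_str': str,
        'user_screen_name': str,
        'user_followers_count': str
    }

    # Load the CSV file
    df = pd.read_csv('extracted_data0.csv', dtype=dtype_spec)

    # Replace empty strings with NaN
    df.replace('', np.nan, inplace=True)

    # Drop rows that contain empty values
    df = df.dropna()

    # Redefine the data types; numeric fields can be float-like strings
    # (e.g. '1207.0'), so parse them as floats before casting to integers
    df['tweet_timestamp_ms'] = df['tweet_timestamp_ms'].astype(float).astype('int64')
    df['user_followers_count'] = df['user_followers_count'].astype(float).astype('int64')

    # astype(bool) would map the string 'FALSE' to True, so compare explicitly
    df['tweet_possibly_sensitive'] = (
        df['tweet_possibly_sensitive'].str.strip().str.upper() == 'TRUE'
    )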
