
Annotated The dUCk Tweets Dataset

Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/876tc4dkts
Description:
This dataset consists of unique annotated tweets in three language categories: English-Malay code-switching, pure English, and pure Malay. It was built from raw_tweets_012019_to_062020.csv on Kaggle (Carlson, 2020), a collection of users’ tweets about the Malaysian brand ‘The dUCk Group’, which was founded by Vivy Yusof and sells scarves, bags, cosmetics, stationery, and Home & Living products. During preparation, duplicated, invalid, and unusable data rows were removed. Each tweet is annotated with a language category (“ENG” for pure English, “BM” for pure Malay, and “ENG-BM” for code-switching) and with a sentiment value (0 for neutral, 1 for positive, and -1 for negative).

The dataset contains the following sub-folders:
1) Full Training Dataset: A full set of annotated pure English, pure Malay, and English-Malay code-switching tweets about ‘The dUCk Group’ brand, which can be used to train machine learning models. The tweets are provided in both CSV and XML format as 'full_training_dataset.csv' and 'full_training_dataset.xml'.
2) Full Testing Dataset: A full set of annotated pure English, pure Malay, and English-Malay code-switching tweets about ‘The dUCk Group’ brand, which can be used to test the performance of learning models. The tweets are provided in both CSV and XML format as 'full_testing_dataset.csv' and 'full_testing_dataset.xml'.
3) Code-Switching Training Dataset: Only annotated English-Malay code-switching tweets about ‘The dUCk Group’ brand, for training learning models. The tweets are provided in both CSV and XML format as 'eng_malay_training_dataset.csv' and 'eng_malay_training_dataset.xml'.
4) Code-Switching Testing Dataset: Only annotated English-Malay code-switching tweets about ‘The dUCk Group’ brand, which can be used to evaluate the performance of learning models. The tweets are provided in both CSV and XML format as 'eng_malay_testing_dataset.csv' and 'eng_malay_testing_dataset.xml'.

*Note:
'Language' column: the language category the tweet belongs to
'TweetText' column: the whole tweet
'TweetSentiment' column: the sentiment value of the tweet (0, 1, or -1)
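As a rough illustration of how the annotated columns might be consumed, the sketch below parses CSV rows with the three column names given in the note and tallies them by language category. The sample rows are invented for illustration and are not taken from the dataset; the helper names (`load_tweets`, `count_by`) are hypothetical, and the sketch assumes the CSV files carry a header row with exactly these column names.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows, invented for illustration only.
# They are NOT taken from the dataset; only the column names come from its note.
SAMPLE_CSV = """Language,TweetText,TweetSentiment
ENG,I love my new scarf,1
BM,Tudung ini cantik,1
ENG-BM,Tak sabar nak pakai my new scarf,1
ENG,Delivery was late again,-1
BM,Harga biasa sahaja,0
"""

# Sentiment encoding as described in the dataset description.
SENTIMENT_LABELS = {-1: "negative", 0: "neutral", 1: "positive"}


def load_tweets(handle):
    """Read annotated tweets from a CSV file handle (assumes a header row)."""
    rows = []
    for row in csv.DictReader(handle):
        sentiment = int(row["TweetSentiment"])
        if sentiment not in SENTIMENT_LABELS:
            raise ValueError(f"unexpected sentiment value: {sentiment}")
        rows.append(
            {
                "Language": row["Language"],
                "TweetText": row["TweetText"],
                "TweetSentiment": sentiment,
            }
        )
    return rows


def count_by(rows, key):
    """Count rows per distinct value of the given column."""
    return Counter(row[key] for row in rows)


tweets = load_tweets(io.StringIO(SAMPLE_CSV))
language_counts = count_by(tweets, "Language")
sentiment_counts = count_by(tweets, "TweetSentiment")

# Keep only the code-switching tweets, mirroring the ENG-BM sub-folders.
code_switched = [t for t in tweets if t["Language"] == "ENG-BM"]

print(language_counts)
print({SENTIMENT_LABELS[k]: v for k, v in sentiment_counts.items()})
```

To try this on a downloaded file, the `io.StringIO(SAMPLE_CSV)` argument could be swapped for something like `open("full_training_dataset.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption and may need adjusting.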
Created:
2021-08-13