Roman Urdu Word Variations and Normalized Sentiment Review Dataset (RUWV-NSR)

Source: NIAID Data Ecosystem (indexed 2026-05-02)

Download link:

https://data.mendeley.com/datasets/v5jfhsvtmd

Description:

We have developed two unique Roman Urdu datasets, translated into English. The first dataset covers Roman Urdu words and their spelling variations. It is structured as an Excel file with five columns labeled "Var-1" to "Var-5," which hold up to five Roman Urdu spelling variations of each word, followed by a final column, "common," which contains the most frequently used spelling of that word. In total, this dataset includes 5,244 unique Roman Urdu words, or 19,527 words when their variations are counted.

The second dataset contains Roman Urdu reviews, each labeled with a sentiment. Roman Urdu spelling on the web varies widely because users often invent their own spellings, so we have normalized the spelling of words across these reviews. This dataset is the first of its kind and contains the largest collection of Roman Urdu reviews, 28,090 in total, categorized into five sentiment classes. It is particularly valuable for analyzing Roman Urdu content such as online product reviews and Roman Urdu articles, which are becoming increasingly common, and it offers significant potential for sentiment analysis and language processing applications.
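As a rough illustration of how the variations table could be used to normalize review text, the sketch below builds a lookup from each spelling in "Var-1" to "Var-5" to the value in "common" and applies it token by token. It is not the authors' normalization procedure. The function names and the two sample rows are invented for illustration and are not taken from the dataset; the rows are assumed to arrive as dictionaries keyed by the column names described above.

```python
import re

# Column names as given in the dataset description.
VARIANT_COLUMNS = ["Var-1", "Var-2", "Var-3", "Var-4", "Var-5"]


def build_lookup(rows):
    """Map every spelling variation (lowercased) to its 'common' spelling."""
    lookup = {}
    for row in rows:
        common = row.get("common")
        if not isinstance(common, str) or not common.strip():
            continue  # skip rows without a usable common spelling
        common = common.strip().lower()
        lookup[common] = common
        for column in VARIANT_COLUMNS:
            variant = row.get(column)
            # Empty cells may be None, "", or a non-string placeholder.
            if isinstance(variant, str) and variant.strip():
                lookup.setdefault(variant.strip().lower(), common)
    return lookup


def normalize(text, lookup):
    """Replace each alphabetic token with its common spelling when known."""
    def replace(match):
        token = match.group(0)
        return lookup.get(token.lower(), token)

    return re.sub(r"[A-Za-z]+", replace, text)


# Hypothetical rows, invented for this example (not real dataset entries).
sample_rows = [
    {"Var-1": "acha", "Var-2": "achha", "Var-3": "accha",
     "Var-4": None, "Var-5": None, "common": "acha"},
    {"Var-1": "nahi", "Var-2": "nahin", "Var-3": "nhi",
     "Var-4": "", "Var-5": "", "common": "nahi"},
]

lookup = build_lookup(sample_rows)
print(normalize("Product achha nhi hai", lookup))  # prints: Product acha nahi hai
```

Tokens that do not appear in the lookup are left unchanged, so words outside the variations table pass through as written. If two rows list the same variation, this sketch keeps the first mapping it sees; a real pipeline would need an explicit rule for such conflicts.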

Created:

2024-10-07
