
Roman Urdu Word Variations and Normalized Sentiment Review Dataset (RUWV-NSR)

Source: DataCite Commons. Updated: 2025-05-01. Indexed: 2025-04-16.
Download link:
https://data.mendeley.com/datasets/v5jfhsvtmd
Resource description:
We have developed two unique Roman Urdu datasets, translated into English. The first dataset covers Roman Urdu words and their spelling variations. It is structured as an Excel file with five columns labeled "Var-1" to "Var-5," which hold up to five Roman Urdu spelling variations for each word, followed by a final column, "common," which contains the most frequently used spelling of that word. In total, this dataset includes 5,244 unique Roman Urdu words, which, together with their variations, amount to 19,527 words.

The second dataset contains Roman Urdu reviews, each labeled with a sentiment. Because Roman Urdu spelling on the web varies widely, with users often creating their own spellings, we have normalized the spelling of words across these reviews. This dataset is the first of its kind and contains the largest collection of Roman Urdu reviews, with a total of 28,090 reviews categorized into five sentiment classes.

This dataset is particularly valuable for analyzing Roman Urdu content such as online product reviews or Roman Urdu articles, which are becoming increasingly common, and it offers significant potential for sentiment analysis and language processing applications.
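The column layout described above lends itself to a simple variant-to-common lookup. The following is a minimal sketch of how such a lookup might be built and applied to a review, not the authors' actual normalization procedure. The column names "Var-1" to "Var-5" and "common" come from the description; the function names and the sample rows are hypothetical, and the sample spellings are made up for illustration rather than taken from the dataset.

```python
import re

# Column names as given in the dataset description.
VARIANT_COLUMNS = [f"Var-{i}" for i in range(1, 6)]
COMMON_COLUMN = "common"


def build_lookup(rows):
    """Map every lower-cased spelling variant to its row's 'common' spelling."""
    lookup = {}
    for row in rows:
        common = (row.get(COMMON_COLUMN) or "").strip().lower()
        if not common:
            continue  # skip rows without a common spelling
        lookup.setdefault(common, common)
        for col in VARIANT_COLUMNS:
            variant = (row.get(col) or "").strip().lower()
            if variant:
                # Assumption: if a variant appears in several rows, the first row wins.
                lookup.setdefault(variant, common)
    return lookup


def normalize(text, lookup):
    """Replace each alphabetic token with its common spelling when one is known."""
    return re.sub(
        r"[A-Za-z]+",
        lambda m: lookup.get(m.group(0).lower(), m.group(0)),
        text,
    )


# Illustrative rows only: these spellings are invented for the sketch.
sample_rows = [
    {"Var-1": "acha", "Var-2": "achha", "Var-3": "accha",
     "Var-4": "", "Var-5": "", "common": "acha"},
    {"Var-1": "nahi", "Var-2": "nahin", "Var-3": "nai",
     "Var-4": None, "Var-5": None, "common": "nahi"},
]

lookup = build_lookup(sample_rows)
normalized = normalize("Yeh phone achha nahin hai", lookup)
print(normalized)
```

With the real Excel file, the rows could presumably be loaded with a spreadsheet reader such as pandas and passed to `build_lookup` as a list of dictionaries, after replacing empty cells with empty strings. The file name and sheet layout are not stated in the description, so that step is left out here.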

Provider:
Mendeley Data
Created:
2024-03-13