community-datasets/roman_urdu
Source: Hugging Face. Updated 2024-06-24. Indexed 2024-06-15.
Download link:
https://hf-mirror.com/datasets/community-datasets/roman_urdu
Resource summary:
A Roman Urdu dataset for text classification. It contains Urdu sentences written in Latin script, each paired with a sentiment label (Positive, Negative, or Neutral). The dataset falls in the 10K to 100K size category and is monolingual (Urdu). The annotations were crowdsourced, but details of the annotation process, data sources, data collection, and normalization are not provided. The citation information references a related paper and the UCI Machine Learning Repository.
Provider:
community-datasets
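Original information summary
Roman Urdu Dataset overview
Dataset description
Dataset summary
- Language: Urdu
- License: unknown
- Multilinguality: monolingual
- Size category: 10K<n<100K
- Source datasets: original
- Task category: text classification
- Task ID: sentiment classification
- Dataset ID: roman-urdu-data-set
- Dataset name: Roman Urdu Dataset
Dataset structure
Data instances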
Wah je wah,Positive,
Data fields
- sentence: a piece of Urdu text, stored as a string.
- sentiment: the sentiment label, stored as a class label with the values Positive, Negative, and Neutral.
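The data instance and field descriptions above suggest a simple comma-separated layout. As a minimal sketch, assuming the raw rows are CSV lines with the sentence first, the label second, and an optional trailing empty field (as in the sample row), one row could be parsed like this. The `parse_row` helper and the `LABELS` set are illustrative names, not part of the dataset.

```python
import csv

# Label values listed in the field description above.
LABELS = {"Positive", "Negative", "Neutral"}

def parse_row(line):
    """Split one raw line into (sentence, sentiment).

    Assumption: the line is CSV with the sentence in the first field and
    the label in the second; any further (empty) fields are ignored.
    """
    fields = next(csv.reader([line]))
    sentence, sentiment = fields[0].strip(), fields[1].strip()
    if sentiment not in LABELS:
        raise ValueError(f"unexpected sentiment label: {sentiment!r}")
    return sentence, sentiment

print(parse_row("Wah je wah,Positive,"))  # -> ('Wah je wah', 'Positive')
```

Using the `csv` module rather than `str.split(",")` keeps quoted sentences that contain commas intact.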
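Data splits
- train: the training set, with 20229 examples occupying 1633411 bytes.
Dataset creation
Dataset information
- Features:
  - sentence: string
  - sentiment: class label with the values Positive, Negative, and Neutral
- Splits:
  - train: 20229 examples, 1633411 bytes
- Download size: 1060033 bytes
- Dataset size: 1633411 bytes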
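The size figures above can be put in perspective with a little arithmetic. The sketch below uses only the numbers reported on this page; the variable names are illustrative. It works out to roughly 81 bytes per example, with the download at about 65% of the stated dataset size.

```python
# Figures copied from the dataset information above.
NUM_EXAMPLES = 20229
DATASET_BYTES = 1633411
DOWNLOAD_BYTES = 1060033

# Average storage per example in the train split.
avg_bytes_per_example = DATASET_BYTES / NUM_EXAMPLES

# Download size as a fraction of the reported dataset size.
download_ratio = DOWNLOAD_BYTES / DATASET_BYTES

print(f"{avg_bytes_per_example:.1f} bytes per example")
print(f"download is {download_ratio:.0%} of the dataset size")
```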
Configuration
- Default configuration:
  - Data files:
    - train: path data/train-*
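With a single default configuration and one train split, loading should need no extra arguments. The following is a sketch rather than a verified recipe: it assumes the Hugging Face `datasets` library is installed, that the repository ID matches the page URL above, and that `sentiment` is stored as a class label with integer IDs. `load_train_split` and `sentiment_counts` are hypothetical helper names.

```python
from collections import Counter

# Repository ID taken from the download link on this page.
REPO_ID = "community-datasets/roman_urdu"

def load_train_split():
    """Fetch the train split (needs `pip install datasets` and network access)."""
    from datasets import load_dataset  # imported lazily so this file loads without it
    return load_dataset(REPO_ID, split="train")

def sentiment_counts(rows, label_names):
    """Count examples per label name.

    Assumption: each row is a dict whose `sentiment` value is an integer
    index into `label_names`.
    """
    return Counter(label_names[row["sentiment"]] for row in rows)

# Example usage (requires network access):
#   train = load_train_split()
#   names = train.features["sentiment"].names
#   print(train.num_rows)
#   print(sentiment_counts(train, names))
```

Reading the label names from the loaded features, instead of hard-coding their order, avoids guessing which integer corresponds to which sentiment.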
Citation information
@InProceedings{Sharf:2018, title = "Performing Natural Language Processing on Roman Urdu Datasets", authors = "Zareen Sharf and Saif Ur Rahman", booktitle = "International Journal of Computer Science and Network Security", volume = "18", number = "1", pages = "141-148", year = "2018" }
@misc{Dua:2019, author = "Dua, Dheeru and Graff, Casey", year = "2017", title = "{UCI} Machine Learning Repository", url = "http://archive.ics.uci.edu/ml", institution = "University of California, Irvine, School of Information and Computer Sciences" }



