
Urdu Human and AI text Dataset (UHAT)

Source: IEEE (indexed 2026-04-17)
Download link:
https://ieee-dataport.org/documents/urdu-human-and-ai-text-dataset-uhat
Resource description:
Dataset Overview

This dataset is designed for Urdu text classification, specifically for distinguishing between human-written and AI-generated text. It is a balanced dataset containing a total of 3,600 text samples.

- Human-Written Texts: 1,800 samples written by humans, labeled 0.
- AI-Generated Texts: 1,800 samples generated by AI models, labeled 1.

Data Sources

The human-written texts were collected from a diverse range of reputable Urdu sources, including:

- Urdu literature (stories, novels)
- rekhta.org
- BBC Urdu
- VOA Urdu
- Urdu Wikipedia
- News articles

The AI-generated texts were created by rewriting the human-written texts using the following models:

- GPT
- Gemini
- Kimi

Data Quality & Characteristics

The EDA notebook reveals several key aspects of the data's quality and structure:

- No Missing Values: The dataset is clean, with no null values or empty strings.
- Duplicates: There are a total of 8 duplicate entries in the combined dataset.
- Text Length: Human-written texts tend to be longer on average, with a mean length of approximately 889 characters, compared to 748 for AI-generated texts. The maximum length of a human text is 12,754 characters, while for AI it is 6,762.
- Word Count: Similar to character length, human-written texts have a higher average word count (around 198 words) than AI-generated texts (around 165 words). The longest human text has 3,016 words, and the longest AI text has 1,464 words.

Further analysis sections listed:

- Text Feature Analysis: Character Distribution, Word-level Analysis, N-Gram Analysis
- Text Complexity Analysis: Vocabulary Analysis
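The length, word-count, and duplicate figures quoted in the description could be recomputed along the following lines. This is only a minimal sketch, not the author's EDA notebook: the function name `summarize`, the (text, label) pair layout, whitespace-based word splitting, and counting a duplicate as every repeat of a text after its first occurrence are all assumptions made for illustration, so the results may differ slightly from the published numbers.

```python
from collections import Counter
from statistics import mean


def summarize(samples):
    """Per-class length statistics plus a duplicate count.

    `samples` is an iterable of (text, label) pairs, where label 0 means
    human-written and 1 means AI-generated (labels as in the description;
    the pair layout itself is an assumption).
    """
    samples = list(samples)
    stats = {}
    for label in (0, 1):
        texts = [text for text, lab in samples if lab == label]
        if not texts:
            continue
        char_lens = [len(text) for text in texts]
        # Assumption: words are whitespace-separated tokens.
        word_counts = [len(text.split()) for text in texts]
        stats[label] = {
            "count": len(texts),
            "mean_chars": mean(char_lens),
            "max_chars": max(char_lens),
            "mean_words": mean(word_counts),
            "max_words": max(word_counts),
        }
    # Assumption: a duplicate is any repeat of a text after its first occurrence.
    text_counts = Counter(text for text, _ in samples)
    duplicates = sum(n - 1 for n in text_counts.values() if n > 1)
    return stats, duplicates


if __name__ == "__main__":
    # Toy rows for illustration only; these are not taken from the dataset.
    toy = [
        ("یہ ایک مثال ہے", 0),
        ("یہ ایک مثال ہے", 0),
        ("مصنوعی متن", 1),
    ]
    toy_stats, toy_duplicates = summarize(toy)
    print(toy_stats)
    print("duplicates:", toy_duplicates)
```

To apply it to the real data, the downloaded file would first need to be read into (text, label) pairs; the file format and column names are not stated in the description, so that loading step is left out here.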
Provided by:
Muhammad Ammar