Urdu Human and AI text Dataset (UHAT)

Name: Urdu Human and AI text Dataset (UHAT)
Creator: Muhammad Ammar
License: No description provided

Source: IEEE (indexed 2026-04-17)

Download link:

https://ieee-dataport.org/documents/urdu-human-and-ai-text-dataset-uhat

Resource description:

Dataset Overview

This dataset is designed for Urdu text classification, specifically for distinguishing between human-written and AI-generated text. It is a balanced dataset containing a total of 3,600 text samples.

- Human-Written Texts: 1,800 samples written by humans. These are labeled with 0.
- AI-Generated Texts: 1,800 samples generated by AI models. These are labeled with 1.

Data Sources

The human-written texts were collected from a diverse range of reputable Urdu sources, including:

- Urdu literature (stories, novels)
- rekhta.org
- BBC Urdu
- VOA Urdu
- Urdu Wikipedia
- News articles

The AI-generated texts were created by rewriting the human-written texts using the following models:

- GPT
- Gemini
- Kimi

Data Quality & Characteristics

The EDA notebook reveals several key aspects of the data's quality and structure:

- No Missing Values: The dataset is clean, with no null values or empty strings.
- Duplicates: There are a total of 8 duplicate entries in the combined dataset.
- Text Length: Human-written texts tend to be longer on average, with a mean length of approximately 889 characters, compared to 748 for AI-generated texts. The maximum length of a human text is 12,754 characters, while for AI it is 6,762.
- Word Count: Similar to character length, human-written texts have a higher average word count (around 198 words) than AI-generated texts (around 165 words). The longest human text has 3,016 words, and the longest AI text has 1,464 words.

The notebook also includes a text feature analysis (character distribution, word-level analysis, n-gram analysis) and a text complexity analysis (vocabulary analysis).
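The quality checks described in this overview (missing values, duplicates, character and word lengths per label) could be reproduced roughly as in the sketch below. It is not the author's EDA notebook: the column names `text` and `label`, the CSV layout, whitespace-based word counting, and counting duplicates on the text alone are all assumptions, and the sample rows are made up for illustration.

```python
import csv
from collections import Counter
from statistics import mean


def load_rows(path, text_col="text", label_col="label"):
    """Read (text, label) pairs from a CSV file.

    The column names are assumptions; adjust them to the actual file.
    """
    with open(path, newline="", encoding="utf-8") as f:
        return [(row[text_col], int(row[label_col])) for row in csv.DictReader(f)]


def summarize(rows):
    """Compute summary statistics of the kind the description reports."""
    texts = [text for text, _ in rows]
    # Extra copies beyond the first occurrence of each distinct text.
    duplicates = sum(n - 1 for n in Counter(texts).values())
    # Texts that are empty or contain only whitespace.
    empty = sum(1 for text in texts if not text.strip())

    per_label = {}
    for label in sorted({label for _, label in rows}):
        subset = [text for text, lab in rows if lab == label]
        chars = [len(t) for t in subset]
        words = [len(t.split()) for t in subset]  # whitespace tokenization
        per_label[label] = {
            "count": len(subset),
            "mean_chars": mean(chars),
            "max_chars": max(chars),
            "mean_words": mean(words),
            "max_words": max(words),
        }
    return {"duplicates": duplicates, "empty": empty, "per_label": per_label}


# Tiny made-up sample (0 = human-written, 1 = AI-generated), not real UHAT data.
sample = [
    ("یہ ایک مثال ہے", 0),
    ("یہ ایک مثال ہے", 0),
    ("یہ دوسری مثال", 1),
]
stats = summarize(sample)
```

With the real data, `load_rows` would be pointed at the downloaded file and its result passed to `summarize`. The reported figures may differ slightly from this sketch if the notebook tokenizes words or counts duplicates differently.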

Provided by:

Muhammad Ammar
