Omarrran/Sentence_wise_urdu_text_dataset
Source: Hugging Face. Updated 2024-11-22; indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/Omarrran/Sentence_wise_urdu_text_dataset
Description:
Sentence_wise_urdu_text_dataset is an Urdu text dataset intended for text classification, text generation, translation, and zero-shot classification. The file is 5.29 MB, UTF-8 encoded, and contains 3,136,348 characters (2,472,408 excluding spaces), 69,743 lines, and 666,907 words, with a vocabulary size of 29,888. The dataset also reports linguistic statistics, including mean and median word length, average paragraph length, the number of hapax legomena (words appearing once) and dis legomena (words appearing twice), and line-length statistics.
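The statistics listed above can be recomputed from any plain-text copy of the corpus. The following is a minimal sketch, not the dataset author's method. The function name `corpus_stats` is hypothetical. It assumes words are whitespace-separated tokens and that "characters excluding spaces" excludes all whitespace, including newlines. The listing does not say how the published figures were tokenized, so the results may differ from them.

```python
from collections import Counter
from statistics import mean, median


def corpus_stats(text: str) -> dict:
    """Compute basic corpus statistics for a UTF-8 decoded string.

    Assumption: a "word" is any whitespace-separated token, and
    "characters without spaces" excludes every whitespace character.
    """
    lines = text.splitlines()
    words = text.split()
    freq = Counter(words)
    word_lengths = [len(w) for w in words]
    return {
        "characters": len(text),
        "characters_no_spaces": sum(1 for ch in text if not ch.isspace()),
        "lines": len(lines),
        "words": len(words),
        "vocabulary": len(freq),
        # Words that occur exactly once / exactly twice in the text.
        "hapax_legomena": sum(1 for c in freq.values() if c == 1),
        "dis_legomena": sum(1 for c in freq.values() if c == 2),
        "mean_word_length": mean(word_lengths) if word_lengths else 0.0,
        "median_word_length": median(word_lengths) if word_lengths else 0.0,
    }


if __name__ == "__main__":
    # Two short illustrative Urdu sentences, one per line (not taken from the dataset).
    sample = "یہ ایک جملہ ہے\nیہ دوسرا جملہ ہے\n"
    for name, value in corpus_stats(sample).items():
        print(f"{name}: {value}")
```

To run it on the real corpus, read the downloaded file with `open(path, encoding="utf-8").read()` and pass the resulting string to `corpus_stats`. The file path depends on how the dataset was downloaded and is not specified here.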
Provider:
Omarrran



