pidgin-human-corpus
Source: Hugging Face | Updated: 2026-03-13 | Indexed: 2026-04-23
Download link:
https://huggingface.co/datasets/msmaje/pidgin-human-corpus
Description:
This dataset contains 251,338 multilingual text samples, split into a training set (201,070 samples), a validation set (25,134 samples), and a test set (25,134 samples), which is roughly an 80/10/10 split. Each sample has 8 structured fields: original text content (text), language code (language), language name (lang_name), data source (source), classification label (label), quality score (quality_score), word count (word_count), and collection date (collection_date). The dataset totals 219 MB, and the compressed download is approximately 90 MB. It is suited to NLP tasks such as multilingual text classification, language identification, and text quality assessment; the annotated quality scores and language metadata add further dimensions for research on multilingual text processing.
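The split sizes and field list above can be sketched in Python as follows. This is a minimal illustration, not the dataset's official schema. The listing gives field names only, so the field types, the split names "train", "validation", and "test", and the helpers `split_fractions`, `filter_by_quality`, and `load_corpus` are all assumptions. The listing does not state the scale of quality_score, so any threshold passed to `filter_by_quality` is hypothetical.

```python
from typing import Dict, List, TypedDict

# Split sizes as stated in the description (split names are an assumption).
SPLIT_SIZES: Dict[str, int] = {
    "train": 201_070,
    "validation": 25_134,
    "test": 25_134,
}


class Sample(TypedDict):
    """One record with the 8 listed fields; the types are assumptions."""
    text: str
    language: str
    lang_name: str
    source: str
    label: str
    quality_score: float
    word_count: int
    collection_date: str


def split_fractions(sizes: Dict[str, int]) -> Dict[str, float]:
    """Return each split's share of the total number of samples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def filter_by_quality(samples: List[Sample], min_score: float) -> List[Sample]:
    """Keep samples whose quality_score is at least min_score."""
    return [s for s in samples if s["quality_score"] >= min_score]


def load_corpus():
    """Load the dataset from the Hugging Face Hub.

    Requires the third-party `datasets` package and network access, so the
    import is kept local and nothing is downloaded unless this is called.
    """
    from datasets import load_dataset
    return load_dataset("msmaje/pidgin-human-corpus")
```

Calling `split_fractions(SPLIT_SIZES)` confirms that the three splits sum to 251,338 samples and that the training set holds about 80% of them. `load_corpus()` is the only function that touches the network.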
Created:
2026-03-07



