pidgin-human-corpus

Source: Hugging Face. Updated 2026-03-13. Indexed 2026-04-23.
Download link:
https://huggingface.co/datasets/msmaje/pidgin-human-corpus
Description:
This dataset contains 251,338 multilingual text samples, split into a training set (201,070 examples), a validation set (25,134 examples), and a test set (25,134 examples). Each sample has eight structured fields: original text content (text), language code (language), language name (lang_name), data source (source), classification label (label), quality score (quality_score), word count (word_count), and collection date (collection_date). The dataset totals 219 MB, and the compressed download is about 90 MB. It is suited to NLP tasks such as multilingual text classification, language identification, and text quality assessment; the annotated quality scores and language metadata provide additional dimensions for research on multilingual text processing.
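The schema and split sizes in the description can be sanity-checked with a short sketch like the one below. The field names and counts come from the description; the split keys (`train`, `validation`, `test`), the helper names, and every value in the sample record are assumptions made for illustration, since the listing does not state field types or show real rows.

```python
# Field names and split sizes are taken from the dataset description.
EXPECTED_FIELDS = (
    "text", "language", "lang_name", "source",
    "label", "quality_score", "word_count", "collection_date",
)

# Split keys are assumed names; only the counts come from the description.
SPLIT_SIZES = {"train": 201_070, "validation": 25_134, "test": 25_134}
TOTAL_SAMPLES = 251_338


def missing_fields(record):
    """Return the expected field names absent from a record, in schema order."""
    return [name for name in EXPECTED_FIELDS if name not in record]


def split_fractions(sizes):
    """Return each split's share of the total, rounded to three decimals."""
    total = sum(sizes.values())
    return {name: round(count / total, 3) for name, count in sizes.items()}


# Hypothetical record: all values and their types are invented for illustration.
sample = {
    "text": "How you dey?",
    "language": "pcm",
    "lang_name": "Nigerian Pidgin",
    "source": "example-source",
    "label": "human",
    "quality_score": 0.9,
    "word_count": 3,
    "collection_date": "2026-03-01",
}

if __name__ == "__main__":
    print("missing fields:", missing_fields(sample))
    print("splits sum to total:", sum(SPLIT_SIZES.values()) == TOTAL_SAMPLES)
    print("split fractions:", split_fractions(SPLIT_SIZES))
```

The split counts work out to roughly an 80/10/10 partition. With the Hugging Face `datasets` library installed, the corpus itself could presumably be fetched with `load_dataset("msmaje/pidgin-human-corpus")`, after which `missing_fields` could be applied to a row of a split; the exact split names and column types would need to be confirmed against the dataset page.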
Created:
2026-03-07