
PleIAs/CommonLingua-Train

Source: Hugging Face. Updated 2026-04-28. Indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/PleIAs/CommonLingua-Train
Resource overview:
CommonLingua-Train is the training dataset for PleIAs/CommonLingua, a byte-level language identification model covering 334 languages. It contains 2.48 million paragraphs, sourced exclusively from Wikipedia and other open-licensed and public-domain corpora extracted from Common Corpus. The dataset was developed iteratively: Wikipedia sources were filtered, coverage was extended to non-encyclopedic sources and formats, resources for low-resource languages (especially African languages) were added, and frequent language confusions were sampled in a targeted way. The schema includes columns for text, language label, source, identifier, title, collection, license, open type, creator, and date. The core data comes from Wikipedia, with additional material from OpenAlex and other multilingual subsets. Because the dataset aggregates open-licensed and public-domain corpora, use and redistribution must respect those terms and preserve proper attribution.
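The ten-column schema described above can be sketched as a small record type. This is a minimal illustration, not the dataset's confirmed layout: the field names (`language`, `open_type`, and so on), the `train` split name, the language codes, and the sample rows are all assumptions made for the example. Only the column descriptions come from the overview.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class Paragraph:
    """One row, mirroring the ten columns listed in the overview.

    Field names are hypothetical; check the dataset card for the real ones.
    """
    text: str
    language: str
    source: str
    identifier: str
    title: Optional[str] = None
    collection: Optional[str] = None
    license: Optional[str] = None
    open_type: Optional[str] = None
    creator: Optional[str] = None
    date: Optional[str] = None


def count_by_language(rows: Iterable[Paragraph]) -> Counter:
    """Count paragraphs per language label."""
    return Counter(row.language for row in rows)


def stream_rows(repo_id: str = "PleIAs/CommonLingua-Train",
                split: str = "train") -> Iterator[dict]:
    """Lazily iterate over the hosted dataset.

    Requires the third-party `datasets` package and network access.
    The split name is an assumption. Not called in this sketch.
    """
    from datasets import load_dataset
    yield from load_dataset(repo_id, split=split, streaming=True)


# Made-up rows, used only to exercise the helper offline.
sample = [
    Paragraph("Bonjour le monde.", "fra", "wikipedia", "ex-1"),
    Paragraph("Habari ya dunia.", "swa", "wikipedia", "ex-2"),
    Paragraph("Un second paragraphe.", "fra", "openalex", "ex-3"),
]

counts = count_by_language(sample)
print(counts.most_common())
```

Keeping the attribution-related fields (license, creator, source) on every record makes it straightforward to carry attribution along when rows are filtered or redistributed.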
Provider:
PleIAs