common-pile/raw_v.01_parquet
Source: Hugging Face | Updated: 2025-07-16 | Indexed: 2025-08-09
Download link:
https://hf-mirror.com/datasets/common-pile/raw_v.01_parquet
Description:
Common Pile v0.1 is an 8 TB collection of public domain and openly licensed text. This dataset consolidates all the raw corpora from the Common Pile v0.1 Raw Data collection into a single repository, converted to Apache Parquet format. It retains the three original columns (document ID, text, source) and adds one extra column for quick length-based filtering. Each corpus remains subject to its own license.
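As a rough illustration of what length-based filtering could look like, here is a minimal sketch in plain Python. It works on rows that have already been read into dictionaries. The column names (`id`, `text`, `source`, `text_length`) are assumptions made for the example, since the listing does not state the actual names, and the unit of the length value (characters, bytes, or something else) is likewise not specified here.

```python
from typing import Iterable, Iterator

# Hypothetical name for the extra length column; check the dataset schema
# for the real one before using this against actual files.
LENGTH_COLUMN = "text_length"


def filter_by_length(
    rows: Iterable[dict], min_len: int, max_len: int
) -> Iterator[dict]:
    """Yield rows whose precomputed length lies in [min_len, max_len].

    Only the length column is inspected, so the (potentially large)
    text field never has to be measured at filter time.
    """
    for row in rows:
        if min_len <= row[LENGTH_COLUMN] <= max_len:
            yield row


# Tiny in-memory stand-in for rows read from a Parquet file.
sample_rows = [
    {"id": "a", "text": "short", "source": "demo", "text_length": 5},
    {"id": "b", "text": "a somewhat longer document", "source": "demo", "text_length": 26},
    {"id": "c", "text": "", "source": "demo", "text_length": 0},
]

kept = list(filter_by_length(sample_rows, min_len=1, max_len=10))
kept_ids = [row["id"] for row in kept]
print(kept_ids)
```

With real Parquet files, the same idea is usually more efficient when the length column is read first and the text column is loaded only for the rows that pass the filter.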
Provider:
common-pile



