common-pile/raw_v0.1_parquet
Source: Hugging Face. Updated: 2025-07-16. Indexed: 2025-08-30.
Download link:
https://hf-mirror.com/datasets/common-pile/raw_v0.1_parquet
Description:
The Common Pile v0.1 dataset gathers every raw corpus from the Common Pile v0.1 Raw Data collection, converted to Apache Parquet format and consolidated into a single repository. The data is otherwise unfiltered and unmodified. The only changes are the format conversion from JSON to Parquet, the consolidation of multiple repositories into one, and the addition of a `len_category` column for quick filtering by text length.
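As a rough illustration of length-based filtering on the `len_category` column, here is a minimal Python sketch. Several things in it are assumptions rather than facts from the listing: the category labels (`"short"`, `"long"`), the `text` column name, the `train` split, and the helper names `filter_by_len_category` and `stream_rows` are all hypothetical. Check the dataset card for the actual values before relying on any of them.

```python
from typing import Any, Dict, Iterable, Iterator


def filter_by_len_category(
    rows: Iterable[Dict[str, Any]], wanted: str
) -> Iterator[Dict[str, Any]]:
    """Yield only the rows whose `len_category` equals `wanted`."""
    for row in rows:
        if row.get("len_category") == wanted:
            yield row


def stream_rows(repo_id: str = "common-pile/raw_v0.1_parquet", split: str = "train"):
    """Stream rows from the Hub without downloading the whole repository.

    Assumes the `datasets` library is installed and that the repository
    exposes a split with this name (an assumption; the layout may require
    a config name or explicit data files instead).
    """
    # Imported lazily so the filter helper above works without `datasets`.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


# In-memory demo rows; the "short"/"long" labels are hypothetical placeholders.
sample = [
    {"text": "hi", "len_category": "short"},
    {"text": "a much longer document body", "len_category": "long"},
    {"text": "ok", "len_category": "short"},
]

short_rows = list(filter_by_len_category(sample, "short"))
```

Against the real data, the same helper could be applied lazily, for example `filter_by_len_category(stream_rows(), "short")`, so that rows outside the chosen category are skipped as they stream in.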
Provider:
common-pile



