timaeus/dsir-pile-13m-filtered-no-github-or-dm_mathematics
Source: Hugging Face. Updated 2025-06-26. Indexed 2025-07-05.
Download link:
https://hf-mirror.com/datasets/timaeus/dsir-pile-13m-filtered-no-github-or-dm_mathematics
Resource description:
This is a filtered text dataset derived from the original DSIR Pile 13M dataset by removing every row whose metadata contains Github or DM_mathematics. The filtered dataset contains approximately 12,782,200 examples and is uploaded in Parquet format as 64 batch files, each holding at most 200,000 examples. The dataset was generated in a memory-efficient way.
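The filtering and batching described above can be sketched roughly as follows. This is only an illustration, not the provider's actual script: the `meta` field name, the substring match against its string form, and the helper names `keep` and `filtered_batches` are all assumptions. Only the two excluded names and the 200,000-row cap come from the description.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List

BATCH_SIZE = 200_000  # per-batch cap stated in the description
EXCLUDED = ("Github", "DM_mathematics")  # names stated in the description


def keep(row: Dict) -> bool:
    """Return True if the row's metadata mentions none of the excluded names.

    Assumption: each row carries a "meta" field; we match on its string form.
    """
    meta = str(row.get("meta", ""))
    return not any(name in meta for name in EXCLUDED)


def filtered_batches(
    rows: Iterable[Dict], batch_size: int = BATCH_SIZE
) -> Iterator[List[Dict]]:
    """Lazily filter rows and yield lists of at most batch_size rows.

    Only one batch is held in memory at a time, which is one simple way
    to keep memory use bounded on a multi-million-row source.
    """
    kept = (row for row in rows if keep(row))
    while True:
        batch = list(islice(kept, batch_size))
        if not batch:
            return
        yield batch
```

Each yielded batch could then be written to its own Parquet file. If the row count were exactly 12,782,200, a 200,000-row cap would give 63 full batches plus one of 182,200 rows, which is consistent with the 64 files mentioned above.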
Provider:
timaeus



