EleutherAI/filtering-annealing-mix_20250226-011545
Source: Hugging Face. Updated 2025-03-02. Indexed 2025-04-08.
Download link:
https://hf-mirror.com/datasets/EleutherAI/filtering-annealing-mix_20250226-011545
Resource description:
This is a text training dataset. Each sample includes an id, the text content, its source, metadata, a token count, a target-filter flag, a word-filter result with its metadata, a BERT-filter result with its metadata, and a combined-filter result. The training split is 389,019,625,333 bytes in size and contains 88,961,637 samples. The dataset provides a default configuration, and the training data is read from the paths specified in that configuration.
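To put the quoted size in perspective, the sketch below works out the average sample size from the two figures in the description, and shows one way the training split might be streamed instead of downloaded in full. It assumes the Hugging Face `datasets` library is installed and that the repository is still publicly reachable under the id taken from the download link; `peek` is a hypothetical helper name, not part of the dataset or the library.

```python
from itertools import islice

# Figures quoted in the resource description.
TOTAL_BYTES = 389_019_625_333
NUM_SAMPLES = 88_961_637

# Repository id taken from the download link (assumed to still be reachable).
REPO_ID = "EleutherAI/filtering-annealing-mix_20250226-011545"

# Simple arithmetic on the quoted figures.
avg_bytes_per_sample = TOTAL_BYTES / NUM_SAMPLES
size_gib = TOTAL_BYTES / 2**30


def peek(n=3):
    """Hypothetical helper: stream the first n training rows lazily."""
    # Imported here so the arithmetic above runs without the library installed.
    from datasets import load_dataset

    # streaming=True iterates over the split without fetching it all up front.
    stream = load_dataset(REPO_ID, split="train", streaming=True)
    return list(islice(stream, n))


if __name__ == "__main__":
    print(f"average sample size: {avg_bytes_per_sample:.0f} bytes")
    print(f"total size: {size_gib:.1f} GiB")
```

Calling `peek()` requires network access. Streaming is suggested here only because the split is several hundred gigabytes; inspecting the keys of a returned row is a safer way to learn the exact column names than assuming them from the description.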
Provider:
EleutherAI



