geodesic-research/sfm-pretraining-mix-ai-filtering-results
Source: Hugging Face. Updated 2025-12-17. Indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/geodesic-research/sfm-pretraining-mix-ai-filtering-results
Description:
This dataset is used for alignment filtering tasks and contains training data. Each record has the following features: id, word_filter (boolean), word_filter_metadata (a struct containing keywords and reason), bert_filter (boolean), bert_filter_metadata (a struct containing highest_score, lowest_score, and mean_score), and combined_filter (boolean). The dataset contains 405,836,046 training examples with a total size of 37,089,752,801 bytes (about 37.1 GB).
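The record layout described above can be illustrated with a minimal Python sketch that tallies the three boolean filter columns. The column names come from the description; everything else is an assumption made for illustration: the helper `count_flags`, the sample rows, and the value types (string ids, a list of keyword strings, float scores). The sample values are invented and say nothing about how combined_filter is actually derived.

```python
from collections import Counter
from typing import Any, Dict, Iterable

# Boolean columns named in the dataset description.
FLAG_COLUMNS = ("word_filter", "bert_filter", "combined_filter")


def count_flags(rows: Iterable[Dict[str, Any]]) -> Counter:
    """Count the rows seen and how many have each boolean filter set to True."""
    counts = Counter({"rows": 0, **{name: 0 for name in FLAG_COLUMNS}})
    for row in rows:
        counts["rows"] += 1
        for name in FLAG_COLUMNS:
            if row[name]:
                counts[name] += 1
    return counts


# Hypothetical rows that follow the described schema. All values are made up;
# the id format, keyword list, and score types are assumptions.
sample_rows = [
    {
        "id": "doc-0",
        "word_filter": True,
        "word_filter_metadata": {"keywords": ["example"], "reason": "keyword match"},
        "bert_filter": False,
        "bert_filter_metadata": {"highest_score": 0.40, "lowest_score": 0.10, "mean_score": 0.25},
        "combined_filter": True,
    },
    {
        "id": "doc-1",
        "word_filter": False,
        "word_filter_metadata": {"keywords": [], "reason": ""},
        "bert_filter": False,
        "bert_filter_metadata": {"highest_score": 0.20, "lowest_score": 0.05, "mean_score": 0.10},
        "combined_filter": False,
    },
    {
        "id": "doc-2",
        "word_filter": False,
        "word_filter_metadata": {"keywords": [], "reason": ""},
        "bert_filter": True,
        "bert_filter_metadata": {"highest_score": 0.95, "lowest_score": 0.60, "mean_score": 0.80},
        "combined_filter": True,
    },
]

summary = count_flags(sample_rows)
print(dict(summary))
```

Because the full dataset has several hundred million rows, a single-pass counter like this avoids holding records in memory. To run it on the real data, the rows could likely be streamed with the Hugging Face `datasets` library (`load_dataset(..., split="train", streaming=True)`); that step is left out here because it needs network access.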
Provider:
geodesic-research



