
geodesic-research/sfm-pretraining-mix-ai-filtering-results

Source: Hugging Face. Updated 2025-12-17. Indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/geodesic-research/sfm-pretraining-mix-ai-filtering-results
Description:
This dataset is intended for alignment filtering tasks and contains training data. Each record has the following features: id; word_filter (boolean); word_filter_metadata (a struct containing keywords and reason); bert_filter (boolean); bert_filter_metadata (a struct containing highest_score, lowest_score, and mean_score); and combined_filter (boolean). The dataset contains 405,836,046 training examples with a total size of 37,089,752,801 bytes (about 37.1 GB).
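The record layout described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the listing gives the field names and says which fields are booleans or structs, but it does not give the type of id, keywords, reason, or the score fields, so the str, list-of-str, and float types below are guesses. The rows in SAMPLE are made up for the example, and the listing does not say how combined_filter is derived from the other two flags, so the sketch only counts the flags and does not recompute them.

```python
from typing import Dict, Iterable, List, TypedDict


class WordFilterMetadata(TypedDict):
    keywords: List[str]  # assumed type; the listing only names the field
    reason: str          # assumed type


class BertFilterMetadata(TypedDict):
    highest_score: float  # assumed type
    lowest_score: float   # assumed type
    mean_score: float     # assumed type


class FilterRecord(TypedDict):
    id: str  # assumed type
    word_filter: bool
    word_filter_metadata: WordFilterMetadata
    bert_filter: bool
    bert_filter_metadata: BertFilterMetadata
    combined_filter: bool


FLAG_FIELDS = ("word_filter", "bert_filter", "combined_filter")


def count_flags(records: Iterable[FilterRecord]) -> Dict[str, int]:
    """Count the rows seen and how many have each boolean flag set."""
    counts = {"total": 0, **{name: 0 for name in FLAG_FIELDS}}
    for record in records:
        counts["total"] += 1
        for name in FLAG_FIELDS:
            if record[name]:
                counts[name] += 1
    return counts


# Hypothetical rows, invented for illustration; not taken from the dataset.
SAMPLE: List[FilterRecord] = [
    {
        "id": "example-0",
        "word_filter": True,
        "word_filter_metadata": {"keywords": ["placeholder"], "reason": "keyword match"},
        "bert_filter": True,
        "bert_filter_metadata": {"highest_score": 0.9, "lowest_score": 0.2, "mean_score": 0.6},
        "combined_filter": True,
    },
    {
        "id": "example-1",
        "word_filter": False,
        "word_filter_metadata": {"keywords": [], "reason": ""},
        "bert_filter": False,
        "bert_filter_metadata": {"highest_score": 0.1, "lowest_score": 0.0, "mean_score": 0.05},
        "combined_filter": False,
    },
]

# Figures quoted in the listing.
NUM_EXAMPLES = 405_836_046
TOTAL_BYTES = 37_089_752_801
BYTES_PER_EXAMPLE = TOTAL_BYTES / NUM_EXAMPLES

if __name__ == "__main__":
    print(count_flags(SAMPLE))
    print(f"average bytes per example: {BYTES_PER_EXAMPLE:.1f}")
```

Because count_flags accepts any iterable of records, the same function could be fed rows streamed from the real dataset instead of the hand-written sample, provided the actual field types match the assumptions above.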
Provider:
geodesic-research