Stratified Dataset of 26M Files
Source: arXiv. Indexed 2025-09-30.
Download link:
https://github.com/google/magika
Description:
This dataset covers 113 distinct content types and serves as both a benchmark and a training resource for deep learning-based content type detectors. It places particular emphasis on balanced representation of each content type across diverse contexts. The dataset contains 26 million files, and its target task is content type detection.
Provided by:
Authors of the paper



