Chrisyichuan/screenshot-training-natural-unfiltered-v2
Source: Hugging Face. Updated 2026-04-23; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/Chrisyichuan/screenshot-training-natural-unfiltered-v2
Description:
This is an unfiltered image-to-text dataset of 50,000 query/positive-chunk pairs. Each pair records the query, answer, source sentence, source type, subject, chunk path, URL, title, chunk index, and tiles directory. It is intended mainly for training and evaluating models. Unlike the filtered version, the page-level and post-generation natural-question regex filters have not been applied. The images are stored as 50 tar shards and can be extracted with the provided script. Source types are mostly prose, and the most common subjects are sports, entertainment, and history. The pairs were generated with gemini-3.1-flash-lite-preview on Vertex AI, from a source pool drawn from the full Wikipedia kiwix tiles index.
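The description says the images ship as tar shards that are unpacked with a script provided by the dataset. That script is not reproduced here, so the following is only a minimal sketch of what shard extraction might look like using Python's standard `tarfile` module. The function name `extract_shards`, the `*.tar` file pattern, and the directory layout are all assumptions made for illustration, not details taken from the dataset.

```python
import tarfile
from pathlib import Path


def extract_shards(shard_dir, out_dir, pattern="*.tar"):
    """Extract every tar shard found in shard_dir into out_dir.

    Hypothetical helper: the shard naming pattern and the directory
    layout are assumptions, not the dataset's documented structure.
    Returns the number of shards processed.
    """
    out_root = Path(out_dir).resolve()
    out_root.mkdir(parents=True, exist_ok=True)

    # Use the "data" extraction filter when this Python version supports it.
    extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}

    shards = sorted(Path(shard_dir).glob(pattern))
    for shard in shards:
        with tarfile.open(shard) as tar:
            for member in tar.getmembers():
                target = (out_root / member.name).resolve()
                # Skip any member whose path would land outside out_dir.
                if not target.is_relative_to(out_root):
                    continue
                tar.extract(member, out_root, **extra)
    return len(shards)
```

A call such as `extract_shards("shards", "tiles")` would unpack every shard in a local `shards` folder into `tiles`; both folder names are placeholders. When the dataset's own extraction script is available, it should be preferred, since it presumably knows the real shard names and target layout.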
Provider:
Chrisyichuan



