
tencent/Penguin-Recap-I

Source: Hugging Face. Updated 2026-03-19. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/tencent/Penguin-Recap-I
Resource description:
---
configs:
- config_name: datacomp_coyo_penguin
  default: true
  data_files:
  - split: train
    path: data/datacomp_coyo_penguin/*.jsonl.gz
- config_name: sa1b_penguin
  data_files:
  - split: train
    path: data/sa1b_penguin/*.jsonl.gz
- config_name: openimages_penguin
  data_files:
  - split: train
    path: data/openimages_penguin/*.jsonl.gz
tags:
- multimodal
- image-text
- metadata-only
size_categories:
- 10M<n<100M
---

# Penguin-Recap-I

Penguin-Recap-I publishes recap metadata only. The repository does not contain image binaries.

## Included subsets

| subset | collection | local source roots | expected records |
| --- | --- | --- | ---: |
| `datacomp_coyo_penguin` | DataComp + COYO Penguin recap | `datamultimodal/IMAGE/datacomp_1b, datamultimodal/IMAGE/coyo_700m` | 57,618,155 |
| `sa1b_penguin` | SA-1B Penguin recap | `datamultimodal/IMAGE/SA-1B` | 9,254,501 |
| `openimages_penguin` | OpenImages Penguin recap | `datamultimodal/IMAGE/openimages` | 1,709,646 |

Expected total records: **68,582,302**

## Media access policy

- `openimages_penguin`: keeps the relative image path and filename only. Users should obtain the image files from the official OpenImages release.
- `sa1b_penguin`: keeps the relative image path and filename only. Users should obtain the image files from the official SA-1B release.
- `datacomp_coyo_penguin`: stores the original image URL extracted from the sidecar JSON file next to each local image.

## Image download resources

- OpenDataLab OpenImagesV6: https://opendatalab.com/OpenDataLab/OpenImagesV6/tree/main/raw
- OpenDataLab SA-1B: https://opendatalab.com/OpenDataLab/SA-1B/tree/main/raw
- Official Segment Anything release: https://ai.meta.com/datasets/segment-anything/
- Official OpenImages index: https://storage.googleapis.com/openimages/web/index.html

For `openimages_penguin` and `sa1b_penguin`, use the exported `image_name`, `image_names`, `image`, and `image_refs` fields to map each row back to the corresponding original image file.
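As an illustration of mapping a row back to a local image, here is a minimal sketch. It assumes the official OpenImages or SA-1B files have already been downloaded under some local directory, and that this directory mirrors the relative paths stored in `image`. That layout is an assumption, not something the dataset card guarantees. The helper name `resolve_local_image` and the `image_root` parameter are hypothetical.

```python
from pathlib import Path
from typing import Optional


def resolve_local_image(row: dict, image_root: Path) -> Optional[Path]:
    """Return the local file for a row's first image reference, or None.

    Tries the relative path in `image` first, then falls back to the
    basename in `image_name` placed directly under `image_root`.
    """
    candidates = []
    if row.get("image"):
        # Assumption: the local download mirrors the exported relative path.
        candidates.append(image_root / row["image"])
    if row.get("image_name"):
        # Fallback for a flat download directory containing only basenames.
        candidates.append(image_root / row["image_name"])

    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```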
For `datacomp_coyo_penguin`, each JSON entry includes `url` / `urls`, which can be used to download the image directly.

## Repository layout

- `data/<subset>/*.jsonl.gz`: metadata shards used by the dataset viewer
- `manifest/files.jsonl`: shard-level example counts and byte estimates
- `manifest/skipped.jsonl`: skipped samples and the reason
- `manifest/build_stats.json`: end-of-run summary

## Row schema

Each row contains the normalized metadata below:

- `sample_key`: stable public sample id
- `subset`: Hugging Face subset/config id
- `source`: source id
- `original_id`: original annotation id, normalized to string
- `image`: first relative image reference from the annotation
- `image_refs`: full list of relative image references
- `image_name`: first image basename
- `url`: first URL for DataComp/COYO rows, otherwise `null`
- `conversations`: full conversation list from the annotation
- `prompt` / `response`: first human and first gpt turns
- `annotation_metadata`: remaining annotation fields that were not promoted

## Loading

```python
from datasets import load_dataset

datacomp = load_dataset(
    "tencent/Penguin-Recap-I",
    "datacomp_coyo_penguin",
    split="train",
    streaming=True,
)
sample = next(iter(datacomp))
print(sample["url"])

sa1b = load_dataset(
    "tencent/Penguin-Recap-I",
    "sa1b_penguin",
    split="train",
    streaming=True,
)
sample = next(iter(sa1b))
print(sample["image_name"])

openimages = load_dataset(
    "tencent/Penguin-Recap-I",
    "openimages_penguin",
    split="train",
    streaming=True,
)
sample = next(iter(openimages))
print(sample["conversations"][0]["value"])
```

## Citation

```bibtex
@article{Penguin-VL,
  title={Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders},
  author={Boqiang Zhang and Lei Ke and Ruihan Yang and Qi Gao and Tianyuan Qu and Rossell Chen and Dong Yu and Leoweiliang},
  journal={arXiv preprint arXiv:2603.06569},
  year={2026}
}
```
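As a supplement to the repository layout and row schema described above, the sketch below reads one metadata shard without the `datasets` library. It assumes a `data/<subset>/*.jsonl.gz` shard has already been downloaded to a local path and holds one JSON object per line, as the `.jsonl.gz` extension suggests. The function names `iter_shard_rows` and `count_by_subset` are hypothetical helpers, not part of the dataset.

```python
import gzip
import json
from collections import Counter
from pathlib import Path
from typing import Iterator


def iter_shard_rows(shard_path: Path) -> Iterator[dict]:
    """Yield one metadata row per non-empty line of a gzip-compressed JSONL shard."""
    with gzip.open(shard_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_by_subset(shard_path: Path) -> Counter:
    """Tally rows by their `subset` field; rows without the field count under None."""
    return Counter(row.get("subset") for row in iter_shard_rows(shard_path))
```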
Provider:
tencent