EliMC/coco-captions-pt-br

Name: EliMC/coco-captions-pt-br
Creator: EliMC
Published: 2025-12-05 15:46:49
License: no description provided

Source: Hugging Face (updated 2025-12-05, indexed 2025-12-20)

Download link:

https://hf-mirror.com/datasets/EliMC/coco-captions-pt-br


Resource overview:

---
language:
- pt
size_categories:
- 100K<n<1M
task_categories:
- text-to-image
- image-to-text
- text-generation
pretty_name: COCO Captions Portuguese Translation
dataset_info:
  features:
  - name: image
    dtype: image
  - name: caption
    sequence: string
  - name: url
    dtype: string
  - name: filepath
    dtype: string
  - name: filename
    dtype: string
  - name: sentids
    sequence: int64
  - name: imgid
    dtype: int64
  - name: split
    dtype: string
  - name: cocoid
    dtype: int64
  splits:
  - name: train
    num_bytes: 4284853468.21
    num_examples: 82783
  - name: test
    num_bytes: 258794470
    num_examples: 5000
  - name: validation
    num_bytes: 259062182
    num_examples: 5000
  - name: restval
    num_bytes: 1587879327.48
    num_examples: 30504
  download_size: 6358581380
  dataset_size: 6390589447.690001
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
  - split: validation
    path: data/validation-*
  - split: restval
    path: data/restval-*
license: mit
---

# 🎉 COCO Captions Dataset Translation for Portuguese Image Captioning

## 💾 Dataset Summary

COCO Captions Portuguese Translation is a multimodal dataset for Portuguese image captioning. It contains 123,287 images, each paired with five descriptive captions. The captions were originally written in English by human annotators and then translated into Portuguese with the Google Translator API.

## 🧑‍💻 How to Get Started with the Dataset

```python
from datasets import load_dataset

# Loads every split (train, test, validation, restval) of the dataset.
dataset = load_dataset('laicsiifes/coco-captions-pt-br')
```

## ✍️ Languages

The image descriptions in the dataset are in Portuguese.

## 🧱 Dataset Structure

### 📝 Data Instances

An example looks like this:

```
{
    'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x480>,
    'caption': [
        'Um restaurante possui mesas e cadeiras modernas de madeira.',
        'Uma longa mesa de restaurante com cadeiras de vime com encosto arredondado.',
        'uma longa mesa com uma planta em cima cercada por cadeiras de madeira',
        'Uma longa mesa com um arranjo de flores no meio para reuniões',
        'Uma mesa é adornada com cadeiras de madeira com detalhes em azul.'
    ],
    'url': 'http://images.cocodataset.org/train2014/COCO_train2014_000000057870.jpg',
    'filepath': 'train2014',
    'filename': 'COCO_train2014_000000057870.jpg',
    'sentids': [787980, 789366, 789888, 791316, 794853],
    'imgid': 40504,
    'split': 'train',
    'cocoid': 57870
}
```

### 🗃️ Data Fields

Each data instance has the following fields:

- `image`: a `PIL.Image.Image` object containing the image.
- `caption`: a `list` of `str` containing the 5 captions for the image.
- `url`: a `str` containing the URL of the original image.
- `filepath`: a `str` containing the path to the image file.
- `filename`: a `str` containing the name of the image file.
- `sentids`: a `list` of `int` containing the identification number of each caption, in caption order.
- `imgid`: an `int` containing the image identification number.
- `split`: a `str` containing the data split. Its value is one of `train`, `val`, `restval` or `test`.
- `cocoid`: an `int` containing the example identifier in the COCO dataset.

### ✂️ Data Splits

The dataset is partitioned using the Karpathy splitting approach for image captioning ([Karpathy and Fei-Fei, 2015](https://arxiv.org/pdf/1412.2306)). For training, the `train` and `restval` splits are combined into a single training split with 113,287 examples.

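The merge of `train` and `restval` described above, and the pairing of `caption` with `sentids`, might be sketched as follows. This is a minimal illustration, not the authors' own code: the helper names `merged_training_size`, `caption_pairs` and `build_training_split` are hypothetical, and `build_training_split` assumes its argument is the `DatasetDict` returned by `load_dataset`.

```python
from typing import Any, Dict, List, Tuple

# Split sizes as listed in the dataset card metadata.
SPLIT_SIZES = {"train": 82_783, "restval": 30_504, "validation": 5_000, "test": 5_000}


def merged_training_size(split_sizes: Dict[str, int]) -> int:
    """Number of examples after merging `train` and `restval`."""
    return split_sizes["train"] + split_sizes["restval"]


def caption_pairs(example: Dict[str, Any]) -> List[Tuple[int, str]]:
    """Pair each caption with its sentence id (hypothetical helper)."""
    captions, sentids = example["caption"], example["sentids"]
    if len(captions) != len(sentids):
        raise ValueError("caption and sentids must have the same length")
    return list(zip(sentids, captions))


def build_training_split(dataset_dict):
    """Concatenate `train` and `restval` into one split (hypothetical helper).

    Assumes `dataset_dict` is the DatasetDict returned by `load_dataset`.
    """
    # Imported lazily so the helpers above work without the `datasets` package.
    from datasets import concatenate_datasets

    return concatenate_datasets([dataset_dict["train"], dataset_dict["restval"]])
```

With the dataset loaded as in the getting-started snippet, `build_training_split(dataset)` should yield a single `Dataset` whose length matches `merged_training_size(SPLIT_SIZES)`, and `caption_pairs(dataset['train'][0])` should return five `(sentid, caption)` tuples.
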
|Split|Samples|Average Caption Length (Words)|
|:-----------:|:-----:|:--------:|
|Train|82,783|10.3 ± 2.7|
|RestVal|30,504|10.3 ± 2.7|
|Validation|5,000|10.3 ± 2.7|
|Test|5,000|10.3 ± 2.7|
|Total|123,287|10.3 ± 2.7|

## 📋 BibTeX entry and citation info

```bibtex
@misc{bromonschenkel2024cocopt,
  title = {COCO Captions Dataset Translation for Portuguese Image Captioning},
  author = {Bromonschenkel, Gabriel and Oliveira, Hil{\'a}rio and Paix{\~a}o, Thiago M.},
  howpublished = {\url{https://huggingface.co/datasets/laicsiifes/coco-captions-pt-br}},
  publisher = {Hugging Face},
  year = {2024}
}
```

