EliMC/coco-captions-pt-br
Source: Hugging Face. Updated: 2025-12-05. Indexed: 2025-12-20.
Download link: https://hf-mirror.com/datasets/EliMC/coco-captions-pt-br
Resource description:
---
language:
- pt
size_categories:
- 100K<n<1M
task_categories:
- text-to-image
- image-to-text
- text-generation
pretty_name: COCO Captions Portuguese Translation
dataset_info:
  features:
  - name: image
    dtype: image
  - name: caption
    sequence: string
  - name: url
    dtype: string
  - name: filepath
    dtype: string
  - name: filename
    dtype: string
  - name: sentids
    sequence: int64
  - name: imgid
    dtype: int64
  - name: split
    dtype: string
  - name: cocoid
    dtype: int64
  splits:
  - name: train
    num_bytes: 4284853468.21
    num_examples: 82783
  - name: test
    num_bytes: 258794470
    num_examples: 5000
  - name: validation
    num_bytes: 259062182
    num_examples: 5000
  - name: restval
    num_bytes: 1587879327.48
    num_examples: 30504
  download_size: 6358581380
  dataset_size: 6390589447.690001
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
  - split: validation
    path: data/validation-*
  - split: restval
    path: data/restval-*
license: mit
---
# 🎉 COCO Captions Dataset Translation for Portuguese Image Captioning
## 💾 Dataset Summary
COCO Captions Portuguese Translation is a multimodal dataset for Portuguese image captioning. It contains 123,287 images, each paired with five descriptive captions. The captions were originally written in English by human annotators and were then translated into Portuguese with the Google Translator API.
## 🧑‍💻 How to Get Started with the Dataset
```python
from datasets import load_dataset

# Download (or reuse from the local cache) all splits of the dataset.
dataset = load_dataset('laicsiifes/coco-captions-pt-br')
```
## ✍️ Languages
The image descriptions in the dataset are in Portuguese.
## 🧱 Dataset Structure
### 📝 Data Instances
A data instance looks like this:
```
{
    'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x480>,
    'caption': [
        'Um restaurante possui mesas e cadeiras modernas de madeira.',
        'Uma longa mesa de restaurante com cadeiras de vime com encosto arredondado.',
        'uma longa mesa com uma planta em cima cercada por cadeiras de madeira',
        'Uma longa mesa com um arranjo de flores no meio para reuniões',
        'Uma mesa é adornada com cadeiras de madeira com detalhes em azul.'
    ],
    'url': 'http://images.cocodataset.org/train2014/COCO_train2014_000000057870.jpg',
    'filepath': 'train2014',
    'filename': 'COCO_train2014_000000057870.jpg',
    'sentids': [787980, 789366, 789888, 791316, 794853],
    'imgid': 40504,
    'split': 'train',
    'cocoid': 57870
}
```
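The metadata fields in the instance above appear to be mutually consistent: the `url` is the COCO image host followed by `filepath` and `filename`, and the `filename` embeds `cocoid` zero-padded to 12 digits. The sketch below checks that relationship. It assumes the pattern seen in this single example also holds for other records, and the helper names `expected_filename` and `expected_url` are hypothetical, not part of the dataset or of any library.

```python
# Metadata copied from the example instance above (the image itself is omitted).
record = {
    "url": "http://images.cocodataset.org/train2014/COCO_train2014_000000057870.jpg",
    "filepath": "train2014",
    "filename": "COCO_train2014_000000057870.jpg",
    "cocoid": 57870,
}

# Host taken from the example URL; assumed to be the same for every record.
COCO_IMAGE_HOST = "http://images.cocodataset.org"


def expected_filename(filepath: str, cocoid: int) -> str:
    """Rebuild the file name, assuming 'COCO_<filepath>_<12-digit cocoid>.jpg'."""
    return f"COCO_{filepath}_{cocoid:012d}.jpg"


def expected_url(filepath: str, filename: str) -> str:
    """Rebuild the image URL, assuming '<host>/<filepath>/<filename>'."""
    return f"{COCO_IMAGE_HOST}/{filepath}/{filename}"


filename_ok = expected_filename(record["filepath"], record["cocoid"]) == record["filename"]
url_ok = expected_url(record["filepath"], record["filename"]) == record["url"]
print(filename_ok, url_ok)
```

A check like this can be useful as a sanity test after loading, for example to flag records whose metadata was altered during a copy or re-upload.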
### 🗃️ Data Fields
The data instances have the following fields:
- `image`: a `PIL.Image.Image` object containing the image.
- `caption`: a `list` of `str` containing the 5 captions for the image.
- `url`: a `str` containing the URL of the original image.
- `filepath`: a `str` containing the path to the image file.
- `filename`: a `str` containing the name of the image file.
- `sentids`: a `list` of `int` containing the ordered identification numbers of the captions.
- `imgid`: an `int` containing the image identification number.
- `split`: a `str` containing the data split, one of `train`, `val`, `restval`, or `test`.
- `cocoid`: an `int` containing the example identifier in the COCO dataset.
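Because `caption` and `sentids` are parallel lists, many captioning pipelines flatten each record into one row per caption. The following is a minimal sketch of that step. It assumes that `sentids[i]` identifies `caption[i]` (the card describes the identifiers as ordered, but does not state the alignment explicitly), and `iter_caption_pairs` is a hypothetical helper name.

```python
from typing import Dict, Iterator, List, Tuple


def iter_caption_pairs(record: Dict) -> Iterator[Tuple[int, int, str]]:
    """Yield (cocoid, sentid, caption) for each caption of one record."""
    captions: List[str] = record["caption"]
    sentids: List[int] = record["sentids"]
    # Assumption: the two lists are aligned by position.
    if len(captions) != len(sentids):
        raise ValueError("caption and sentids must have the same length")
    for sentid, caption in zip(sentids, captions):
        yield record["cocoid"], sentid, caption


# Fields copied from the example instance in this card (image omitted).
example = {
    "caption": [
        "Um restaurante possui mesas e cadeiras modernas de madeira.",
        "Uma longa mesa de restaurante com cadeiras de vime com encosto arredondado.",
        "uma longa mesa com uma planta em cima cercada por cadeiras de madeira",
        "Uma longa mesa com um arranjo de flores no meio para reuniões",
        "Uma mesa é adornada com cadeiras de madeira com detalhes em azul.",
    ],
    "sentids": [787980, 789366, 789888, 791316, 794853],
    "cocoid": 57870,
}

pairs = list(iter_caption_pairs(example))
for cocoid, sentid, caption in pairs:
    print(cocoid, sentid, caption)
```

Keeping `cocoid` in every row makes it possible to regroup the captions by image later, for instance when an evaluation metric expects all references for one image together.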
### ✂️ Data Splits
The dataset is partitioned using the Karpathy splitting approach for image captioning
([Karpathy and Fei-Fei, 2015](https://arxiv.org/pdf/1412.2306)). For training, the `train` and `restval` splits
are combined into a single training split with 113,287 examples.
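One way to build the merged training split is sketched below. It first verifies the arithmetic from the per-split example counts listed in this card, then defines a loader that concatenates `train` and `restval` with `concatenate_datasets` from the `datasets` library. The function name `load_merged_training_split` is hypothetical, and calling it downloads both splits, so it is defined here but not invoked.

```python
# Example counts per split, copied from this dataset card.
SPLIT_SIZES = {"train": 82_783, "restval": 30_504, "validation": 5_000, "test": 5_000}

merged_training_size = SPLIT_SIZES["train"] + SPLIT_SIZES["restval"]
total_size = sum(SPLIT_SIZES.values())
print(merged_training_size, total_size)


def load_merged_training_split(repo_id: str = "laicsiifes/coco-captions-pt-br"):
    """Load `train` and `restval` and concatenate them into one Dataset."""
    # Imported lazily so the arithmetic above runs without `datasets` installed.
    from datasets import concatenate_datasets, load_dataset

    train = load_dataset(repo_id, split="train")
    restval = load_dataset(repo_id, split="restval")
    return concatenate_datasets([train, restval])
```

Calling `load_merged_training_split()` should return a single `Dataset` whose length matches `merged_training_size`, provided the hosted splits match the counts in this card.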
|Split|Samples|Average Caption Length (Words)|
|:-----------:|:-----:|:--------:|
|Train|82,783|10.3 ± 2.7|
|RestVal|30,504|10.3 ± 2.7|
|Validation|5,000|10.3 ± 2.7|
|Test|5,000|10.3 ± 2.7|
|Total|123,287|10.3 ± 2.7|
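The card does not say how the average caption length was computed. The sketch below shows one plausible procedure, assuming that words are whitespace-separated tokens and that the spread is a population standard deviation; both choices are assumptions, so the result may differ slightly from the figures in the table. It is applied only to the five captions of the example instance, not to the full dataset.

```python
import statistics
from typing import Iterable, List, Tuple


def caption_lengths(captions: Iterable[str]) -> List[int]:
    """Count whitespace-separated tokens in each caption."""
    return [len(caption.split()) for caption in captions]


def mean_and_std(lengths: List[int]) -> Tuple[float, float]:
    """Return the mean and the population standard deviation of the lengths."""
    return statistics.fmean(lengths), statistics.pstdev(lengths)


# Captions copied from the example instance in this card.
example_captions = [
    "Um restaurante possui mesas e cadeiras modernas de madeira.",
    "Uma longa mesa de restaurante com cadeiras de vime com encosto arredondado.",
    "uma longa mesa com uma planta em cima cercada por cadeiras de madeira",
    "Uma longa mesa com um arranjo de flores no meio para reuniões",
    "Uma mesa é adornada com cadeiras de madeira com detalhes em azul.",
]

lengths = caption_lengths(example_captions)
mean_len, std_len = mean_and_std(lengths)
print(lengths)
print(f"{mean_len:.1f} ± {std_len:.1f}")
```

To reproduce a per-split figure, the same two functions could be applied to every caption of every record in that split; a different tokenizer or a sample standard deviation would change the numbers somewhat.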
## 📋 BibTeX entry and citation info
```bibtex
@misc{bromonschenkel2024cocopt,
  title = {COCO Captions Dataset Translation for Portuguese Image Captioning},
  author = {Bromonschenkel, Gabriel and Oliveira, Hil{\'a}rio and Paix{\~a}o, Thiago M.},
  howpublished = {\url{https://huggingface.co/datasets/laicsiifes/coco-captions-pt-br}},
  publisher = {Hugging Face},
  year = {2024}
}
```
Provider: EliMC



