csarron/4m-img-caps

Name: csarron/4m-img-caps
Creator: csarron
Published: 2022-03-28 18:50:53
License: No description available

Hugging Face, updated 2022-03-28, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/csarron/4m-img-caps


Description:

See [read_pyarrow.py](https://gist.github.com/csarron/df712e53c9e0dcaad4eb6843e7a3d51c#file-read_pyarrow-py) for how to read one pyarrow file. Example PyTorch dataset:

```python
import glob
import random

import pyarrow as pa
from torch.utils.data import Dataset


class ImageCaptionArrowDataset(Dataset):
    def __init__(
        self,
        dataset_file,
        tokenizer,
    ):
        # dataset_file is a glob pattern; memory-map every matching Arrow file
        data = [pa.ipc.open_file(pa.memory_map(f, "rb")).read_all() for f in glob.glob(dataset_file)]
        self.data = pa.concat_tables(data)
        self.tokenizer = tokenizer
        # do other initialization, like init image preprocessing fn,

    def __getitem__(self, index):
        # item_id = self.data["id"][index].as_py()
        text = self.data["text"][index].as_py()  # get text
        if isinstance(text, list):
            text = random.choice(text)
        img_bytes = self.data["image"][index].as_py()  # get image bytes

        # do some processing with image and text, return the features
        # img_feat = self.image_bytes_to_tensor(img_bytes)
        # inputs = self.tokenizer(
        #     text,
        #     padding="max_length",
        #     max_length=self.max_text_len,
        #     truncation=True,
        #     return_token_type_ids=True,
        #     return_attention_mask=True,
        #     add_special_tokens=True,
        #     return_tensors="pt",
        # )
        # input_ids = inputs.input_ids.squeeze(0)
        # attention_mask = inputs.attention_mask.squeeze(0)
        # return {
        #     # "item_ids": item_id,
        #     "text_ids": input_ids,
        #     "input_ids": input_ids,
        #     "text_masks": attention_mask,
        #     "pixel_values": img_feat,
        # }

    def __len__(self):
        return len(self.data)
```


Provider:

csarron

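Summary of original information

Dataset overview

Loading the dataset

The dataset files are loaded with the PyArrow library.
Each file is read with pyarrow.ipc.open_file and pyarrow.memory_map, and the resulting tables are merged with pyarrow.concat_tables.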

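Dataset structure

The dataset contains the following fields:
- id: the item ID.
- text: the text content; it may be a list, in which case one element is chosen at random.
- image: the image data, stored as bytes.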
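
Because the text field can be either a single string or a list of captions, code that reads a row usually has to normalize it first. The sketch below shows one way to do that; `pick_caption` is a hypothetical helper, and the sample row is invented for illustration.

```python
import random


def pick_caption(text, rng=random):
    """Return one caption from a text field that may be a str or a list of str."""
    if isinstance(text, list):
        if not text:
            raise ValueError("row has an empty caption list")
        return rng.choice(text)
    return text


rng = random.Random(0)  # seeded only so the sketch is reproducible
row = {
    "id": "example-0",
    "text": ["a dog on a beach", "a dog running on sand"],
    "image": b"\x00\x01",  # placeholder bytes, not a real image
}
caption = pick_caption(row["text"], rng)
single = pick_caption("a single caption", rng)
print(caption, "|", single)
```

Passing the random generator as an argument keeps the choice reproducible in tests while still allowing the module-level `random` in training code.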

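Data processing

Image data is converted to a tensor by the image_bytes_to_tensor method, which the example references but does not define.
Text data is processed by the tokenizer, which pads, truncates, and adds special tokens, and finally returns the input IDs and the attention mask.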
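
To make the padding and truncation step concrete, here is a toy, dependency-free stand-in for what a tokenizer call with `padding="max_length"` and `truncation=True` produces. It is not the tokenizer's implementation: `pad_and_truncate` is a hypothetical function, and the special-token and padding IDs (101, 102, 0) are arbitrary placeholders.

```python
def pad_and_truncate(token_ids, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Toy illustration of fixed-length encoding with special tokens."""
    if max_length < 2:
        raise ValueError("max_length must leave room for two special tokens")
    # Reserve two positions for the special tokens, then truncate the body.
    body = list(token_ids)[: max_length - 2]
    input_ids = [cls_id] + body + [sep_id]
    # Real tokens get mask 1; padding positions get mask 0.
    attention_mask = [1] * len(input_ids)
    pad_count = max_length - len(input_ids)
    input_ids += [pad_id] * pad_count
    attention_mask += [0] * pad_count
    return input_ids, attention_mask


short_ids, short_mask = pad_and_truncate([7, 8, 9], max_length=8)
long_ids, long_mask = pad_and_truncate(range(10, 30), max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every output has exactly `max_length` entries, which is what lets a batch of captions be stacked into one tensor.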

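Dataset operations

The __getitem__ method returns the item at a given index, including its text and image data.
The __len__ method returns the total number of items in the dataset.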
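
These two methods are all a map-style dataset needs. The sketch below uses a plain Python class over in-memory rows, with no torch or pyarrow dependency, to show how they are used; `InMemoryCaptionDataset` and its rows are hypothetical and stand in for the Arrow-backed class.

```python
class InMemoryCaptionDataset:
    """Plain-Python stand-in for a map-style dataset."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __getitem__(self, index):
        row = self.rows[index]  # raises IndexError past the end
        return {"text": row["text"], "image": row["image"]}

    def __len__(self):
        return len(self.rows)


dataset = InMemoryCaptionDataset([
    {"id": "0", "text": "first caption", "image": b"\x00"},
    {"id": "1", "text": "second caption", "image": b"\x01"},
])
# len() and integer indexing are the only operations a sampler relies on.
texts = [dataset[i]["text"] for i in range(len(dataset))]
print(len(dataset), texts)
```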
