dinhanhx/google-wit-vi

Name: dinhanhx/google-wit-vi
Creator: dinhanhx
Published: 2023-11-29 01:45:05
License: no description available

Hugging Face · Updated 2023-11-29 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/dinhanhx/google-wit-vi


Resource description:

---
license: cc
task_categories:
- image-to-text
task_ids:
- image-captioning
language:
- vi
size_categories:
- 100M<n<1B
pretty_name: Google WIT Vietnamese
---

# Google WIT Vietnamese

This data repo contains data extracted from [Google WIT](https://github.com/google-research-datasets/wit/blob/main/DATA.md). All of the extracted data is in Vietnamese.

Given that `x` is a data point in the original (OG) dataset, with keys following the OG `field_name`, the filter criterion is

```python
criteria = lambda x: x.get("language", "") == "vi" and x.get("caption_reference_description", "")
```

## Text-related details

All `.tsv.gz` files follow the OG data files in terms of file names and file structures.

### Train split

`wit_v1.train.*.tsv.gz`

Train data length of each file (not including the header),

```
17690
17756
17810
17724
17619
17494
17624
17696
17777
17562
```

Total 176752

### Validation split

`wit_v1.val.*.tsv.gz`

Val data length of each file (not including the header),

```
292
273
275
320
306
```

Total 1466

### Test split

`wit_v1.test.*.tsv.gz`

Test data length of each file (not including the header),

```
215
202
201
201
229
```

Total 1048

## Image-related details

### Image URL only

`*.image_url_list.txt` files are simply lists of the image URLs from the `*.tsv.gz` files.

Image URL length of each file (train, val, test, all)

```
157281
1271
900
159452
```

Google Research has made sure that no two sets share the exact same images.

### Downloaded Images

⚠ Please, for the love of the gods, read this section carefully.

For `all.index.fmt_id.image_url_list.tsv`, from left to right, without headers, the columns are `index`, `fmt_id`, `image_url`. It maps `image_url` (in `all.image_url_list.txt`) to `fmt_id`. It is used for downloading images.

`fmt_id` is:
- used to name images (with proper image extensions) in `images/`.
- `index` but filled with 6 zeros

Downloading took less than 36 hours with:
- 90 Mbps
- Processor Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz 1.99 GHz
- No asynchronous downloading

For `fail.index.fmt_id.status.image_url_list.tsv`, from left to right, without headers, the columns are `index`, `fmt_id`, `status`, `image_url`. It tracks image URLs that were inaccessible during downloading.

3367 image URLs returned 404 (`status` values). In other words, we were able to download 97.88839275% of the images.

The `images/` folder takes up the following disk space:
- 215 GB (uncompressed)
- 209 GB (compressed)

We use Pillow to open each image to make sure that the downloaded images are usable. We also log all faulty files in `corrupted_image_list.json`. There are fewer than 70 such image files.

In `corrupted_image_list.json`, each item in the list has the keys `file_name` and `error`. `file_name` is `fmt_id` with the extension but without `images/`. The errors are either:
- files exceed the Pillow default limit
- files are truncated

To actually load those files, the following code can be used to change Pillow's behavior

```python
from PIL import Image, ImageFile

# For very big image files
Image.MAX_IMAGE_PIXELS = None

# For truncated image files
ImageFile.LOAD_TRUNCATED_IMAGES = True
```

Zip the `images/` folder,

```bash
zip -r images.zip images/
zip images.zip --out spanned_images.zip -s 40g
```

https://superuser.com/questions/336219/how-do-i-split-a-zip-file-into-multiple-segments

Unzip the `spanned_images.*` files,

```bash
zip -s 0 spanned_images.zip --out images.zip
unzip images.zip
```

https://unix.stackexchange.com/questions/40480/how-to-unzip-a-multipart-spanned-zip-on-linux
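As a minimal sketch of how the headerless mapping file might be consumed, the following reads tab-separated `index`, `fmt_id`, `image_url` rows into a dictionary keyed by `image_url`. The function name `load_url_to_fmt_id`, the sample URLs, and the placeholder `fmt_id` values are hypothetical; only the column order comes from the description of `all.index.fmt_id.image_url_list.tsv`.

```python
import csv
import io


def load_url_to_fmt_id(tsv_file):
    """Map image_url -> fmt_id from a headerless TSV whose columns are
    index, fmt_id, image_url (hypothetical helper, not part of the repo)."""
    mapping = {}
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed lines
        _index, fmt_id, image_url = row
        mapping[image_url] = fmt_id
    return mapping


# Made-up sample rows; the fmt_id values are placeholders, not real ids.
sample = io.StringIO(
    "0\tfmt_a\thttps://example.org/a.jpg\n"
    "1\tfmt_b\thttps://example.org/b.png\n"
)
url_to_fmt_id = load_url_to_fmt_id(sample)
```

With the real file, the same function could be called on `open(path, newline="", encoding="utf-8")`, assuming the file is UTF-8 encoded.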

Provider:

dinhanhx

Summary of original information

Google WIT Vietnamese

Dataset overview

License: cc
Task categories:
- image-to-text
Task IDs:
- image-captioning
Languages:
- vi
Dataset size:
- 100M<n<1B
Dataset name: Google WIT Vietnamese

Dataset details

Data source: Vietnamese data extracted from Google WIT.
Filter criteria: `criteria = lambda x: x.get("language", "") == "vi" and x.get("caption_reference_description", "")`
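The filter can be illustrated on a few rows. This is a sketch only: `keep_row` is a hypothetical name, the sample rows are made up, and the only parts taken from the source are the two field names and the condition itself. Unlike the original lambda, which returns the caption string when it matches, this version wraps the caption check in `bool()` so the result is always `True` or `False`.

```python
def keep_row(row):
    """Return True when the row is Vietnamese and has a non-empty
    reference caption (same condition as the quoted criteria)."""
    return row.get("language", "") == "vi" and bool(
        row.get("caption_reference_description", "")
    )


# Made-up rows shaped like WIT records; only the two relevant fields are shown.
rows = [
    {"language": "vi", "caption_reference_description": "Một bức ảnh"},
    {"language": "vi", "caption_reference_description": ""},
    {"language": "en", "caption_reference_description": "A photo"},
    {"language": "vi"},
]
kept = [row for row in rows if keep_row(row)]
```

Only the first row passes: the second has an empty caption, the third is not Vietnamese, and the fourth has no caption field at all.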

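Text-related details

File format: all .tsv.gz files follow the original data files in terms of file names and file structures.
Train split:
- File name: wit_v1.train.*.tsv.gz
- Data length of each file:
  
  17690 17756 17810 17724 17619 17494 17624 17696 17777 17562
- Total length: 176752
Validation split:
- File name: wit_v1.val.*.tsv.gz
- Data length of each file:
  
  292 273 275 320 306
- Total length: 1466
Test split:
- File name: wit_v1.test.*.tsv.gz
- Data length of each file:
  
  215 202 201 201 229
- Total length: 1048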
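The per-file counts can be checked against the stated totals with a short sketch. The variable names are illustrative; the numbers are copied from the split listings.

```python
# Per-file row counts (headers excluded), copied from the split listings.
train_counts = [17690, 17756, 17810, 17724, 17619, 17494, 17624, 17696, 17777, 17562]
val_counts = [292, 273, 275, 320, 306]
test_counts = [215, 202, 201, 201, 229]

totals = {
    "train": sum(train_counts),
    "val": sum(val_counts),
    "test": sum(test_counts),
}
```

The sums agree with the reported totals of 176752, 1466, and 1048.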

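Image-related details

Image URL lists:
- File name: *.image_url_list.txt
- Image URL length of each file (train, val, test, all):
  
  157281 1271 900 159452
- The sets are guaranteed not to share the exact same images.
Downloaded images:
- File name: all.index.fmt_id.image_url_list.tsv
- Columns: index, fmt_id, image_url
- fmt_id is used to name the images and is mapped to image_url.
- Image URLs that failed to download are recorded in fail.index.fmt_id.status.image_url_list.tsv, with columns: index, fmt_id, status, image_url.
- 3367 image URLs returned status 404; the download success rate is 97.88839275%.
- Disk space taken by the images folder:
  - Uncompressed: 215 GB
  - Compressed: 209 GB
- Pillow is used to check that the images are usable; faulty files are logged in corrupted_image_list.json, where each item has the keys file_name and error.
- Code for handling very large or truncated images: `from PIL import Image, ImageFile`, then `Image.MAX_IMAGE_PIXELS = None` and `ImageFile.LOAD_TRUNCATED_IMAGES = True`
- Zipping the images folder: `zip -r images.zip images/`, then `zip images.zip --out spanned_images.zip -s 40g`
- Unzipping the images folder: `zip -s 0 spanned_images.zip --out images.zip`, then `unzip images.zip`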
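The stated success rate follows from the URL counts. The sketch below uses illustrative variable names and assumes the 3367 URLs that returned 404 are the only failed downloads, which is consistent with the reported percentage.

```python
# URL counts per list, copied from the image URL listing.
train_urls, val_urls, test_urls = 157281, 1271, 900
total_urls = train_urls + val_urls + test_urls  # matches the "all" count

failed_urls = 3367  # URLs that returned 404
success_pct = (1 - failed_urls / total_urls) * 100
```

The three split counts sum to 159452, and `success_pct` rounds to the reported 97.88839275.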
