
dinhanhx/google-wit-vi

Hugging Face, updated 2023-11-29, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/dinhanhx/google-wit-vi
Description:
---
license: cc
task_categories:
- image-to-text
task_ids:
- image-captioning
language:
- vi
size_categories:
- 100M<n<1B
pretty_name: Google WIT Vietnamese
---

# Google WIT Vietnamese

This data repository contains data extracted from [Google WIT](https://github.com/google-research-datasets/wit/blob/main/DATA.md). All of the extracted data is in Vietnamese.

Given a data point `x` in the original (OG) dataset, whose keys follow the OG `field_name` values, the filter criterion is

```python
criteria = lambda x: x.get("language", "") == "vi" and x.get("caption_reference_description", "")
```

## Text-related details

All `.tsv.gz` files follow the OG data files in terms of file names and file structure.

### Train split

`wit_v1.train.*.tsv.gz`

Number of rows in each train file (not including the header),

```
17690
17756
17810
17724
17619
17494
17624
17696
17777
17562
```

Total 176752

### Validation split

`wit_v1.val.*.tsv.gz`

Number of rows in each validation file (not including the header),

```
292
273
275
320
306
```

Total 1466

### Test split

`wit_v1.test.*.tsv.gz`

Number of rows in each test file (not including the header),

```
215
202
201
201
229
```

Total 1048

## Image-related details

### Image URL only

The `*.image_url_list.txt` files are simply lists of the image URLs from the `*.tsv.gz` files.

Number of image URLs in each file (train, val, test, all)

```
157281
1271
900
159452
```

Google Research has made sure that no two sets share the exact same images.

### Downloaded Images

⚠ Please for the love of the gods, read this section carefully.

For `all.index.fmt_id.image_url_list.tsv`, from left to right, without headers, the columns are `index`, `fmt_id`, `image_url`. The file maps each `image_url` (in `all.image_url_list.txt`) to a `fmt_id`, and it was used for downloading the images.

`fmt_id` is:
- used to name images (with proper image extensions) in `images/`.
- the `index`, zero-padded to 6 digits

Downloading took less than 36 hours with:
- 90 Mbps
- Processor Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz 1.99 GHz
- No asynchronous downloading

For `fail.index.fmt_id.status.image_url_list.tsv`, from left to right, without headers, the columns are `index`, `fmt_id`, `status`, `image_url`. The file tracks image URLs that were inaccessible during downloading.

3367 image URLs returned 404 (the `status` value). In other words, we were able to download 97.88839275% of the images.

The `images/` folder takes up:
- 215 GB (uncompressed)
- 209 GB (compressed)

We use Pillow to open each image to make sure that the downloaded images are usable. We also log all faulty files in `corrupted_image_list.json`. There are fewer than 70 such image files.

Each item in `corrupted_image_list.json` has the keys `file_name` and `error`. `file_name` is the `fmt_id` with its extension but without the `images/` prefix. The errors are either:
- files exceed the Pillow default size limit
- files are truncated

To actually load those files, the following code can be used to change Pillow's behavior

```python
from PIL import Image, ImageFile

# For very big image files
Image.MAX_IMAGE_PIXELS = None

# For truncated image files
ImageFile.LOAD_TRUNCATED_IMAGES = True
```

Zip the `images/` folder,

```bash
zip -r images.zip images/
zip images.zip --out spanned_images.zip -s 40g
```

https://superuser.com/questions/336219/how-do-i-split-a-zip-file-into-multiple-segments

Unzip the `spanned_images.*` files,

```bash
zip -s 0 spanned_images.zip --out images.zip
unzip images.zip
```

https://unix.stackexchange.com/questions/40480/how-to-unzip-a-multipart-spanned-zip-on-linux
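The `index`, `fmt_id`, `image_url` mapping described above can be loaded with the standard library. The following is a minimal sketch, not the author's download script: the function name `load_url_to_fmt_id` is hypothetical, and the sample rows (including the 6-digit `fmt_id` values and the `example.org` URLs) are made up for illustration.

```python
import csv
import io


def load_url_to_fmt_id(tsv_file):
    """Build an image_url -> fmt_id dict from a headerless TSV.

    Assumes three tab-separated columns per row: index, fmt_id, image_url.
    """
    # QUOTE_NONE keeps quote characters inside URLs from being interpreted.
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    mapping = {}
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed lines
        _index, fmt_id, image_url = row
        mapping[image_url] = fmt_id
    return mapping


# In-memory stand-in for all.index.fmt_id.image_url_list.tsv (made-up rows).
sample = io.StringIO(
    "0\t000000\thttps://example.org/a.jpg\n"
    "1\t000001\thttps://example.org/b.png\n"
)
url_to_fmt_id = load_url_to_fmt_id(sample)
print(url_to_fmt_id["https://example.org/b.png"])
```

For the real file, the same function could be called on `open("all.index.fmt_id.image_url_list.tsv", newline="", encoding="utf-8")`. The image extension is not part of `fmt_id`, so locating the file in `images/` still requires matching on the name without its extension.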
Provider:
dinhanhx
Summary of the original information

Google WIT Vietnamese

Dataset overview

  • License: cc
  • Task categories:
    • image-to-text
  • Task IDs:
    • image-captioning
  • Language:
    • vi
  • Dataset size:
    • 100M<n<1B
  • Dataset name: Google WIT Vietnamese

Dataset details

  • Data source: Vietnamese-language data extracted from Google WIT.
  • Filter criterion (Python): `criteria = lambda x: x.get("language", "") == "vi" and x.get("caption_reference_description", "")`
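To make the filter concrete, here is a minimal sketch of applying it to tab-separated rows with a header. Only the `language` and `caption_reference_description` field names come from the source; the helper name `filter_rows` and the sample rows (including the `page_url` values) are assumptions made for illustration.

```python
import csv
import io


def criteria(x):
    # Same condition as the lambda above: Vietnamese rows with a non-empty
    # reference caption. The result is used for its truthiness.
    return x.get("language", "") == "vi" and x.get("caption_reference_description", "")


def filter_rows(tsv_file):
    """Return the rows (as dicts keyed by header name) that pass the filter."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if criteria(row)]


# Made-up rows standing in for one decompressed WIT .tsv.gz file.
sample = io.StringIO(
    "language\tpage_url\tcaption_reference_description\n"
    "vi\thttps://example.org/1\tMột bức ảnh\n"
    "vi\thttps://example.org/2\t\n"
    "en\thttps://example.org/3\tA photo\n"
)
kept = filter_rows(sample)
print(len(kept))
```

A real file would be opened with something like `gzip.open(path, "rt", encoding="utf-8", newline="")` before being passed to `filter_rows`.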

Text-related details

  • File format: all .tsv.gz files follow the original data files in file names and file structure.
  • Train split:
    • File name: wit_v1.train.*.tsv.gz

    • Rows per file:

      17690 17756 17810 17724 17619 17494 17624 17696 17777 17562

    • Total: 176752

  • Validation split:
    • File name: wit_v1.val.*.tsv.gz

    • Rows per file:

      292 273 275 320 306

    • Total: 1466

  • Test split:
    • File name: wit_v1.test.*.tsv.gz

    • Rows per file:

      215 202 201 201 229

    • Total: 1048
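The per-file row counts above can be checked against the stated totals with a few lines of arithmetic. This sketch only re-adds the numbers listed in this document; the variable names are arbitrary.

```python
# Per-file row counts copied from the lists above (headers excluded).
split_counts = {
    "train": [17690, 17756, 17810, 17724, 17619, 17494, 17624, 17696, 17777, 17562],
    "val": [292, 273, 275, 320, 306],
    "test": [215, 202, 201, 201, 229],
}

# Sum each split and compare with the totals quoted in the text.
totals = {name: sum(counts) for name, counts in split_counts.items()}
print(totals)  # {'train': 176752, 'val': 1466, 'test': 1048}
```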

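Image-related details

  • Image URL lists:
    • File name: *.image_url_list.txt

    • Number of image URLs (train, val, test, all):

      157281 1271 900 159452

    • No split shares the exact same images with another.

  • Downloaded images:
    • File name: all.index.fmt_id.image_url_list.tsv

    • Columns: index, fmt_id, image_url

    • fmt_id is used to name the images and maps to image_url.

    • Image URLs that failed to download are recorded in fail.index.fmt_id.status.image_url_list.tsv, with columns: index, fmt_id, status, image_url

    • 3367 image URLs returned status 404; the download success rate is 97.88839275%.

    • Disk space taken by the images folder:

      • Uncompressed: 215 GB
      • Compressed: 209 GB
    • Pillow is used to check that the images are usable; faulty files are logged in corrupted_image_list.json, whose items have the keys file_name and error.

    • Code for loading oversized or truncated images (Python): after `from PIL import Image, ImageFile`, set `Image.MAX_IMAGE_PIXELS = None` and `ImageFile.LOAD_TRUNCATED_IMAGES = True`.

    • Zipping the images folder (bash): `zip -r images.zip images/`, then `zip images.zip --out spanned_images.zip -s 40g`

    • Unzipping the images folder (bash): `zip -s 0 spanned_images.zip --out images.zip`, then `unzip images.zip`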
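The quoted success rate follows from the URL counts and the number of 404 responses listed above. A short arithmetic check, assuming the 3367 failures are measured against all 159452 URLs:

```python
# URL counts per split and the number of 404 responses, as listed above.
url_counts = {"train": 157281, "val": 1271, "test": 900}
all_urls = 159452
failed_404 = 3367

# The "all" list should be the sum of the three splits.
assert sum(url_counts.values()) == all_urls

# Share of URLs that did not return 404, as a percentage.
success_rate = (all_urls - failed_404) / all_urls * 100
print(f"{success_rate:.8f}%")
```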
