CCRss/wit_filtered_kz_ru_en

Name: CCRss/wit_filtered_kz_ru_en
Creator: CCRss
Published: 2024-09-04 05:35:09
License: No description available

Source: Hugging Face. Updated 2024-09-04; indexed 2024-12-14.

Download link:

https://hf-mirror.com/datasets/CCRss/wit_filtered_kz_ru_en


Overview:

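This dataset is a filtered version of the WIT dataset, focused on Kazakh, Russian, and English. It contains 7,371,471 images; the main languages are English (74.48%), Russian (21.03%), Turkish (3.70%), and Kazakh (0.79%). Files are in Parquet format. The dataset is split into train, test, and validation sets, each consisting of multiple Parquet files. The average image size is 1787 x 1502 pixels, the average aspect ratio is 1.24, and the most common resolution is 640x480. Caption statistics include average word counts for the Alt, Attribution, and Reference descriptions. The main file types are JPEG, PNG, SVG, GIF, and TIFF. The images come from a variety of Wikipedia pages, and the dataset includes metadata such as page URL, image URL, page title, and section title.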
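The language shares quoted above can be turned into rough per-language image counts. The following is a minimal sketch that only uses the total and the percentages stated in this listing; the function and variable names are illustrative, and the results are approximations because the published percentages are rounded to two decimals.

```python
# Figures taken from the listing; names below are illustrative only.
TOTAL_IMAGES = 7_371_471
LANGUAGE_SHARE_PERCENT = {
    "English": 74.48,
    "Russian": 21.03,
    "Turkish": 3.70,
    "Kazakh": 0.79,
}


def approximate_counts(total, shares_percent):
    """Convert percentage shares into approximate absolute counts."""
    return {
        language: round(total * percent / 100)
        for language, percent in shares_percent.items()
    }


counts = approximate_counts(TOTAL_IMAGES, LANGUAGE_SHARE_PERCENT)
for language, count in counts.items():
    # Counts are estimates derived from rounded percentages.
    print(f"{language}: ~{count:,} images")
```

Even at under one percent of the total, the Kazakh portion still amounts to tens of thousands of images under this estimate.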

This is a filtered version of the Wikipedia-based Image Text (WIT) dataset, focusing on Kazakh, Russian, and English. The dataset contains 7,371,471 images, with text primarily in English, Russian, and Turkish, and is stored in Parquet format. The dataset card reports image statistics (average dimensions, aspect ratio, and the most common resolution), caption statistics (average word count per caption type), and file-type statistics (the proportion of each image format). The data is divided into train, test, and validation sets, each containing multiple Parquet files. The card's usage section gives an example of how to load the Parquet files. Its notes state that this is a subset of the original WIT dataset, with images sourced from various Wikipedia pages and with metadata such as page URLs, image URLs, page titles, and section titles.
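Since each split is spread over several Parquet files, loading usually means collecting the shards of one split and concatenating them. The sketch below is not the dataset card's own usage example. It assumes the shards have been downloaded locally and that each file name starts with its split name (for example a hypothetical `train-00000-of-00010.parquet`), which this listing does not confirm. The helper names are made up for illustration, and `load_split` needs pandas with a Parquet engine such as pyarrow installed.

```python
from collections import defaultdict
from pathlib import Path

# Split names as described in the listing; the file-name prefix rule is an assumption.
SPLITS = ("train", "test", "validation")


def group_shards_by_split(paths):
    """Group Parquet shard paths by the split name their file name starts with."""
    grouped = defaultdict(list)
    for path in map(Path, paths):
        if path.suffix != ".parquet":
            continue  # skip README files and other non-Parquet entries
        for split in SPLITS:
            if path.name.startswith(split):
                grouped[split].append(path)
                break
    # Sort so shards are read in a stable order.
    return {split: sorted(files) for split, files in grouped.items()}


def load_split(shard_paths):
    """Read every shard with pandas and concatenate them into one DataFrame."""
    import pandas as pd  # imported lazily; requires a Parquet engine such as pyarrow

    frames = [pd.read_parquet(path) for path in shard_paths]
    return pd.concat(frames, ignore_index=True)


# Hypothetical usage, assuming the files live under ./wit_filtered_kz_ru_en:
# shards = group_shards_by_split(Path("wit_filtered_kz_ru_en").rglob("*.parquet"))
# train_df = load_split(shards["train"])
```

Given the image count, reading a whole split into memory at once may be impractical; reading one shard at a time is a reasonable alternative.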

Provided by:

CCRss
