test-big-dataset

Name: test-big-dataset
Creator: maas
Published: 2025-10-09 16:26:03
License: No description available

ModelScope community: updated 2025-10-09, indexed 2025-03-15

Download link:

https://modelscope.cn/datasets/huggingface/test-big-dataset

Resource description:

# Dataset Card for Danish WIT

## Dataset Description

- **Repository:**
- **Point of Contact:** [Dan Saattrup Nielsen](mailto:dan.nielsen@alexandra.dk)
- **Size of downloaded dataset files:** 7.5 GB
- **Size of the generated dataset:** 7.8 GB
- **Total amount of disk used:** 15.3 GB

### Dataset Summary

Google presented the Wikipedia Image Text (WIT) dataset in [July 2021](https://dl.acm.org/doi/abs/10.1145/3404835.3463257), a dataset which contains scraped images from Wikipedia along with their descriptions.

WikiMedia released WIT-Base in [September 2021](https://techblog.wikimedia.org/2021/09/09/the-wikipedia-image-caption-matching-challenge-and-a-huge-release-of-image-data-for-research/), a modified version of WIT from which they removed images with empty "reference descriptions", images in which a person's face covers more than 10% of the image surface, and inappropriate images that are candidates for deletion.

This dataset is the Danish portion of the WIT-Base dataset, consisting of roughly 160,000 images with associated Danish descriptions. We release the dataset under the [CC BY-SA 4.0 license](https://creativecommons.org/licenses/by-sa/4.0/), in accordance with WIT-Base's [identical license](https://huggingface.co/datasets/wikimedia/wit_base#licensing-information).

### Supported Tasks and Leaderboards

The intended tasks for this dataset are training machine learning models for caption generation, zero-shot image classification and text-image search.

No leaderboard is active at this point.

### Languages

The dataset is available in Danish (`da`).

## Dataset Structure

### Data Instances

- **Size of downloaded dataset files:** 7.5 GB
- **Size of the generated dataset:** 7.8 GB
- **Total amount of disk used:** 15.3 GB

An example from the `train` split looks as follows.

```
{
  "image": {
    "bytes": b"\xff\xd8\xff\xe0\x00\x10JFIF...",
    "path": None
  },
  "image_url": "https://upload.wikimedia.org/wikipedia/commons/4/45/Bispen_-_inside.jpg",
  "embedding": [2.8568285, 2.9562542, 0.33794892, 8.753725, ...],
  "metadata_url": "http://commons.wikimedia.org/wiki/File:Bispen_-_inside.jpg",
  "original_height": 3161,
  "original_width": 2316,
  "mime_type": "image/jpeg",
  "caption_attribution_description": "Kulturhuset Bispen set indefra. Biblioteket er til venstre",
  "page_url": "https://da.wikipedia.org/wiki/Bispen",
  "attribution_passes_lang_id": True,
  "caption_alt_text_description": None,
  "caption_reference_description": "Bispen set indefra fra 1. sal, hvor ....",
  "caption_title_and_reference_description": "Bispen [SEP] Bispen set indefra ...",
  "context_page_description": "Bispen er navnet på det offentlige kulturhus i ...",
  "context_section_description": "Bispen er navnet på det offentlige kulturhus i ...",
  "hierarchical_section_title": "Bispen",
  "is_main_image": True,
  "page_changed_recently": True,
  "page_title": "Bispen",
  "section_title": None
}
```

### Data Fields

The data fields are the same among all splits.

- `image`: a `dict` feature.
- `image_url`: a `str` feature.
- `embedding`: a `list` feature.
- `metadata_url`: a `str` feature.
- `original_height`: an `int` or `NaN` feature.
- `original_width`: an `int` or `NaN` feature.
- `mime_type`: a `str` or `None` feature.
- `caption_attribution_description`: a `str` or `None` feature.
- `page_url`: a `str` feature.
- `attribution_passes_lang_id`: a `bool` or `None` feature.
- `caption_alt_text_description`: a `str` or `None` feature.
- `caption_reference_description`: a `str` or `None` feature.
- `caption_title_and_reference_description`: a `str` or `None` feature.
- `context_page_description`: a `str` or `None` feature.
- `context_section_description`: a `str` or `None` feature.
- `hierarchical_section_title`: a `str` feature.
- `is_main_image`: a `bool` or `None` feature.
- `page_changed_recently`: a `bool` or `None` feature.
- `page_title`: a `str` feature.
- `section_title`: a `str` or `None` feature.

### Data Splits

Roughly 2.60% of the WIT-Base dataset comes from the Danish Wikipedia. We have split the resulting 168,740 samples into a training set, validation set and testing set of the following sizes:

| split   | samples |
|---------|--------:|
| train   | 167,460 |
| val     |     256 |
| test    |   1,024 |

## Dataset Creation

### Curation Rationale

Extracting the Danish portion of the WIT-Base dataset is quite cumbersome, especially as the full dataset takes up 333 GB of disk space. Danish-WIT is curated purely to make the Danish portion easier to work with.

### Source Data

The original data was collected from WikiMedia's [WIT-Base](https://huggingface.co/datasets/wikimedia/wit_base) dataset, which in turn comes from Google's [WIT](https://huggingface.co/datasets/google/wit) dataset.

## Additional Information

### Dataset Curators

[Dan Saattrup Nielsen](https://saattrupdan.github.io/) from [The Alexandra Institute](https://alexandra.dk/) curated this dataset.

### Licensing Information

The dataset is licensed under the [CC BY-SA 4.0 license](https://creativecommons.org/licenses/by-sa/4.0/).
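As a rough illustration of how the fields and split sizes described in this card might be handled, the sketch below works on a plain Python `dict` shaped like the example instance. It does not load the dataset itself. The helper names (`split_shares`, `first_caption`, `looks_like_jpeg`), the caption preference order, and the trimmed `record` are assumptions made for illustration; in particular, the reference caption is set to `None` here only to exercise the fallback.

```python
from typing import Any, Dict, Optional

# Split sizes as stated in the Data Splits table.
SPLIT_SIZES = {"train": 167_460, "val": 256, "test": 1_024}

# Nullable caption fields, in a hypothetical order of preference.
CAPTION_FIELDS = (
    "caption_reference_description",
    "caption_attribution_description",
    "caption_alt_text_description",
)


def split_shares(sizes: Dict[str, int]) -> Dict[str, float]:
    """Return each split's share of the total sample count, in percent."""
    total = sum(sizes.values())
    return {name: 100 * count / total for name, count in sizes.items()}


def first_caption(record: Dict[str, Any]) -> Optional[str]:
    """Return the first non-empty caption among CAPTION_FIELDS, or None."""
    for field in CAPTION_FIELDS:
        value = record.get(field)
        if value:
            return value
    return None


def looks_like_jpeg(record: Dict[str, Any]) -> bool:
    """Check for the JPEG start-of-image marker (FF D8) plus the next marker prefix (FF)."""
    data = record["image"]["bytes"]
    return data is not None and data[:3] == b"\xff\xd8\xff"


# Hypothetical trimmed record modeled on the example instance; the reference
# caption is None here (unlike the card's example) to show the fallback.
record = {
    "image": {"bytes": b"\xff\xd8\xff\xe0\x00\x10JFIF", "path": None},
    "mime_type": "image/jpeg",
    "caption_reference_description": None,
    "caption_attribution_description": (
        "Kulturhuset Bispen set indefra. Biblioteket er til venstre"
    ),
    "caption_alt_text_description": None,
}

shares = split_shares(SPLIT_SIZES)
print(sum(SPLIT_SIZES.values()))   # total number of samples across splits
print(round(shares["train"], 2))   # training share in percent
print(first_caption(record))
print(looks_like_jpeg(record))
```

Because many caption fields are `str` or `None`, a fallback of this kind avoids dropping samples that lack one particular caption type; the preference order is a design choice and should be adapted to the task.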


Provider:

maas

Created:

2025-03-12

Collection summary

Dataset introduction