
test-big-dataset

ModelScope community. Updated 2025-10-09. Indexed 2025-03-15.
Download link:
https://modelscope.cn/datasets/huggingface/test-big-dataset
Description:
# Dataset Card for Danish WIT

## Dataset Description

- **Repository:**
- **Point of Contact:** [Dan Saattrup Nielsen](mailto:dan.nielsen@alexandra.dk)
- **Size of downloaded dataset files:** 7.5 GB
- **Size of the generated dataset:** 7.8 GB
- **Total amount of disk used:** 15.3 GB

### Dataset Summary

Google presented the Wikipedia Image Text (WIT) dataset in [July 2021](https://dl.acm.org/doi/abs/10.1145/3404835.3463257). It contains images scraped from Wikipedia along with their descriptions.

WikiMedia released WIT-Base in [September 2021](https://techblog.wikimedia.org/2021/09/09/the-wikipedia-image-caption-matching-challenge-and-a-huge-release-of-image-data-for-research/). WIT-Base is a modified version of WIT that excludes images with empty "reference descriptions", images where a person's face covers more than 10% of the image surface, and inappropriate images that are candidates for deletion.

This dataset is the Danish portion of the WIT-Base dataset, consisting of roughly 160,000 images with associated Danish descriptions. We release the dataset under the [CC BY-SA 4.0 license](https://creativecommons.org/licenses/by-sa/4.0/), in accordance with WIT-Base's [identical license](https://huggingface.co/datasets/wikimedia/wit_base#licensing-information).

### Supported Tasks and Leaderboards

The intended tasks for this dataset are training machine learning models for caption generation, zero-shot image classification and text-image search.

No leaderboard is active at this point.

### Languages

The dataset is available in Danish (`da`).

## Dataset Structure

### Data Instances

- **Size of downloaded dataset files:** 7.5 GB
- **Size of the generated dataset:** 7.8 GB
- **Total amount of disk used:** 15.3 GB

An example from the `train` split looks as follows.

```
{
    "image": {
        "bytes": b"\xff\xd8\xff\xe0\x00\x10JFIF...",
        "path": None
    },
    "image_url": "https://upload.wikimedia.org/wikipedia/commons/4/45/Bispen_-_inside.jpg",
    "embedding": [2.8568285, 2.9562542, 0.33794892, 8.753725, ...],
    "metadata_url": "http://commons.wikimedia.org/wiki/File:Bispen_-_inside.jpg",
    "original_height": 3161,
    "original_width": 2316,
    "mime_type": "image/jpeg",
    "caption_attribution_description": "Kulturhuset Bispen set indefra. Biblioteket er til venstre",
    "page_url": "https://da.wikipedia.org/wiki/Bispen",
    "attribution_passes_lang_id": True,
    "caption_alt_text_description": None,
    "caption_reference_description": "Bispen set indefra fra 1. sal, hvor ....",
    "caption_title_and_reference_description": "Bispen [SEP] Bispen set indefra ...",
    "context_page_description": "Bispen er navnet på det offentlige kulturhus i ...",
    "context_section_description": "Bispen er navnet på det offentlige kulturhus i ...",
    "hierarchical_section_title": "Bispen",
    "is_main_image": True,
    "page_changed_recently": True,
    "page_title": "Bispen",
    "section_title": None
}
```

### Data Fields

The data fields are the same among all splits.

- `image`: a `dict` feature.
- `image_url`: a `str` feature.
- `embedding`: a `list` feature.
- `metadata_url`: a `str` feature.
- `original_height`: an `int` or `NaN` feature.
- `original_width`: an `int` or `NaN` feature.
- `mime_type`: a `str` or `None` feature.
- `caption_attribution_description`: a `str` or `None` feature.
- `page_url`: a `str` feature.
- `attribution_passes_lang_id`: a `bool` or `None` feature.
- `caption_alt_text_description`: a `str` or `None` feature.
- `caption_reference_description`: a `str` or `None` feature.
- `caption_title_and_reference_description`: a `str` or `None` feature.
- `context_page_description`: a `str` or `None` feature.
- `context_section_description`: a `str` or `None` feature.
- `hierarchical_section_title`: a `str` feature.
- `is_main_image`: a `bool` or `None` feature.
- `page_changed_recently`: a `bool` or `None` feature.
- `page_title`: a `str` feature.
- `section_title`: a `str` or `None` feature.

### Data Splits

Roughly 2.60% of the WIT-Base dataset comes from the Danish Wikipedia. We have split the resulting 168,740 samples into a training set, validation set and testing set of the following sizes:

| split   | samples |
|---------|--------:|
| train   | 167,460 |
| val     |     256 |
| test    |   1,024 |

## Dataset Creation

### Curation Rationale

Extracting the Danish portion of the WIT-Base dataset is cumbersome, especially as the full dataset takes up 333 GB of disk space. Danish-WIT was curated purely to make the Danish portion easier to work with.

### Source Data

The original data was collected from WikiMedia's [WIT-Base](https://huggingface.co/datasets/wikimedia/wit_base) dataset, which in turn comes from Google's [WIT](https://huggingface.co/datasets/google/wit) dataset.

## Additional Information

### Dataset Curators

[Dan Saattrup Nielsen](https://saattrupdan.github.io/) from [The Alexandra Institute](https://alexandra.dk/) curated this dataset.

### Licensing Information

The dataset is licensed under the [CC BY-SA 4.0 license](https://creativecommons.org/licenses/by-sa/4.0/).
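
As an illustration of the Data Fields list and the Data Splits table in this card, the following is a minimal sketch of how a record could be checked against the documented field types and how the split sizes add up. The names `FIELD_TYPES`, `SPLIT_SIZES` and `validate_record` are hypothetical helpers introduced here and are not part of the dataset or of any library. The sketch assumes that `NaN` heights and widths arrive as Python `float` values, so any `float` is accepted for those two fields.

```python
from typing import Any

NoneType = type(None)

# Field name -> accepted Python types, transcribed from the card's field list.
# Assumption: "NaN" is represented as a float, so float is allowed for sizes.
FIELD_TYPES: dict[str, tuple[type, ...]] = {
    "image": (dict,),
    "image_url": (str,),
    "embedding": (list,),
    "metadata_url": (str,),
    "original_height": (int, float),
    "original_width": (int, float),
    "mime_type": (str, NoneType),
    "caption_attribution_description": (str, NoneType),
    "page_url": (str,),
    "attribution_passes_lang_id": (bool, NoneType),
    "caption_alt_text_description": (str, NoneType),
    "caption_reference_description": (str, NoneType),
    "caption_title_and_reference_description": (str, NoneType),
    "context_page_description": (str, NoneType),
    "context_section_description": (str, NoneType),
    "hierarchical_section_title": (str,),
    "is_main_image": (bool, NoneType),
    "page_changed_recently": (bool, NoneType),
    "page_title": (str,),
    "section_title": (str, NoneType),
}

# Split sizes as listed in the card's split table.
SPLIT_SIZES: dict[str, int] = {"train": 167_460, "val": 256, "test": 1_024}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems: list[str] = []
    for name, allowed in FIELD_TYPES.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], allowed):
            found = type(record[name]).__name__
            problems.append(f"{name}: unexpected type {found}")
    return problems


# The three splits together account for the 168,740 samples stated in the card.
total_samples = sum(SPLIT_SIZES.values())
assert total_samples == 168_740
```

Calling `validate_record` on a record loaded from any split returns an empty list when every documented field is present with an accepted type. The check is deliberately loose: it ignores extra fields and does not inspect the contents of `image` or `embedding`.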

Provider:
maas
Created:
2025-03-12