test-big-dataset
ModelScope community. Updated 2025-10-09; indexed 2025-03-15.
Download link:
https://modelscope.cn/datasets/huggingface/test-big-dataset
Resource description:
# Dataset Card for Danish WIT
## Dataset Description
- **Repository:**
- **Point of Contact:** [Dan Saattrup Nielsen](mailto:dan.nielsen@alexandra.dk)
- **Size of downloaded dataset files:** 7.5 GB
- **Size of the generated dataset:** 7.8 GB
- **Total amount of disk used:** 15.3 GB
### Dataset Summary
Google presented the Wikipedia Image Text (WIT) dataset in [July
2021](https://dl.acm.org/doi/abs/10.1145/3404835.3463257). It contains images scraped
from Wikipedia along with their descriptions. WikiMedia released WIT-Base in [September
2021](https://techblog.wikimedia.org/2021/09/09/the-wikipedia-image-caption-matching-challenge-and-a-huge-release-of-image-data-for-research/),
a modified version of WIT that excludes images with empty "reference descriptions",
images where a person's face covers more than 10% of the image surface, and
inappropriate images that are candidates for deletion. This dataset is the Danish
portion of the WIT-Base dataset, consisting of 168,740 images with associated Danish
descriptions. We release the dataset under the [CC BY-SA 4.0
license](https://creativecommons.org/licenses/by-sa/4.0/), the same license that
[WIT-Base
uses](https://huggingface.co/datasets/wikimedia/wit_base#licensing-information).
### Supported Tasks and Leaderboards
Training machine learning models for caption generation, zero-shot image classification
and text-image search are the intended tasks for this dataset. No leaderboard is active
at this point.
### Languages
The dataset is available in Danish (`da`).
## Dataset Structure
### Data Instances
An example from the `train` split looks as follows.
```python
{
    "image": {
        "bytes": b"\xff\xd8\xff\xe0\x00\x10JFIF...",
        "path": None
    },
    "image_url": "https://upload.wikimedia.org/wikipedia/commons/4/45/Bispen_-_inside.jpg",
    "embedding": [2.8568285, 2.9562542, 0.33794892, 8.753725, ...],
    "metadata_url": "http://commons.wikimedia.org/wiki/File:Bispen_-_inside.jpg",
    "original_height": 3161,
    "original_width": 2316,
    "mime_type": "image/jpeg",
    "caption_attribution_description": "Kulturhuset Bispen set indefra. Biblioteket er til venstre",
    "page_url": "https://da.wikipedia.org/wiki/Bispen",
    "attribution_passes_lang_id": True,
    "caption_alt_text_description": None,
    "caption_reference_description": "Bispen set indefra fra 1. sal, hvor ....",
    "caption_title_and_reference_description": "Bispen [SEP] Bispen set indefra ...",
    "context_page_description": "Bispen er navnet på det offentlige kulturhus i ...",
    "context_section_description": "Bispen er navnet på det offentlige kulturhus i ...",
    "hierarchical_section_title": "Bispen",
    "is_main_image": True,
    "page_changed_recently": True,
    "page_title": "Bispen",
    "section_title": None
}
```
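The example record suggests two small conventions that a consumer may want to handle: the combined caption joins the page title and the reference description with a `[SEP]` marker, and the image bytes start with the JPEG start-of-image marker. The following is a minimal sketch under those assumptions; the helper names `split_title_and_reference` and `looks_like_jpeg` are hypothetical and not part of the dataset or any library.

```python
# Assumption: the marker is the literal string "[SEP]", as seen in the
# example record's caption_title_and_reference_description field.
SEPARATOR = "[SEP]"


def split_title_and_reference(text):
    """Split a combined caption into (title, reference_description)."""
    if text is None:
        # The field is documented as `str` or `None`.
        return None, None
    title, sep, reference = text.partition(SEPARATOR)
    if not sep:
        # No marker found: treat the whole string as the title.
        return text.strip(), None
    return title.strip(), reference.strip()


def looks_like_jpeg(image_bytes):
    """Check for the JPEG start-of-image marker (FF D8 FF)."""
    return image_bytes[:3] == b"\xff\xd8\xff"
```

Applied to the example above, `split_title_and_reference("Bispen [SEP] Bispen set indefra ...")` separates the title `Bispen` from the truncated reference description. Other MIME types would need their own signature check, since `mime_type` is not restricted to `image/jpeg`.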
### Data Fields
The data fields are the same among all splits.
- `image`: a `dict` feature.
- `image_url`: a `str` feature.
- `embedding`: a `list` feature.
- `metadata_url`: a `str` feature.
- `original_height`: an `int` or `NaN` feature.
- `original_width`: an `int` or `NaN` feature.
- `mime_type`: a `str` or `None` feature.
- `caption_attribution_description`: a `str` or `None` feature.
- `page_url`: a `str` feature.
- `attribution_passes_lang_id`: a `bool` or `None` feature.
- `caption_alt_text_description`: a `str` or `None` feature.
- `caption_reference_description`: a `str` or `None` feature.
- `caption_title_and_reference_description`: a `str` or `None` feature.
- `context_page_description`: a `str` or `None` feature.
- `context_section_description`: a `str` or `None` feature.
- `hierarchical_section_title`: a `str` feature.
- `is_main_image`: a `bool` or `None` feature.
- `page_changed_recently`: a `bool` or `None` feature.
- `page_title`: a `str` feature.
- `section_title`: a `str` or `None` feature.
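The field list above can be turned into a lightweight type check for a single record. This is only a sketch: the grouping constants and the function `find_type_errors` are hypothetical names introduced here, and the check assumes that a missing optional key is equivalent to `None`.

```python
import math

# Field groups transcribed from the field list; the group names are illustrative.
REQUIRED_STR = ("image_url", "metadata_url", "page_url",
                "hierarchical_section_title", "page_title")
OPTIONAL_STR = ("mime_type", "caption_attribution_description",
                "caption_alt_text_description", "caption_reference_description",
                "caption_title_and_reference_description",
                "context_page_description", "context_section_description",
                "section_title")
OPTIONAL_BOOL = ("attribution_passes_lang_id", "is_main_image",
                 "page_changed_recently")
INT_OR_NAN = ("original_height", "original_width")


def _is_int_or_nan(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool):
        return False
    if isinstance(value, int):
        return True
    return isinstance(value, float) and math.isnan(value)


def find_type_errors(record):
    """Return the names of fields whose values do not match the field list."""
    errors = []
    if not isinstance(record.get("image"), dict):
        errors.append("image")
    if not isinstance(record.get("embedding"), list):
        errors.append("embedding")
    for name in REQUIRED_STR:
        if not isinstance(record.get(name), str):
            errors.append(name)
    for name in OPTIONAL_STR:
        value = record.get(name)
        if value is not None and not isinstance(value, str):
            errors.append(name)
    for name in OPTIONAL_BOOL:
        value = record.get(name)
        if value is not None and not isinstance(value, bool):
            errors.append(name)
    for name in INT_OR_NAN:
        if not _is_int_or_nan(record.get(name)):
            errors.append(name)
    return errors
```

An empty result means the record is consistent with the documented types; it says nothing about whether the values themselves are meaningful.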
### Data Splits
Roughly 2.60% of the WIT-Base dataset comes from the Danish Wikipedia. We have split
the resulting 168,740 samples into a training set, validation set and testing set of
the following sizes:
| split | samples |
|---------|--------:|
| train | 167,460 |
| val | 256 |
| test | 1,024 |
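As a quick sanity check on the table, the split sizes can be summed and converted to shares of the whole. The sketch below also derives the WIT-Base size implied by the 2.60% figure; that number is an estimate computed here from a rounded percentage, not a value stated by the source.

```python
# Split sizes copied from the table above.
SPLITS = {"train": 167_460, "val": 256, "test": 1_024}

total = sum(SPLITS.values())
assert total == 168_740  # matches the sample count stated in the text

# Share of each split relative to the full Danish subset.
shares = {name: count / total for name, count in SPLITS.items()}
for name, share in shares.items():
    print(f"{name:>5}: {SPLITS[name]:>7,} samples ({share:.2%})")

# Rough estimate only: 2.60% is rounded, so the implied size is approximate.
implied_wit_base_size = total / 0.026
print(f"implied WIT-Base size: ~{implied_wit_base_size / 1e6:.1f} million samples")
```

The training split holds about 99.2% of the samples, so the validation and test sets are small fixed-size holdouts rather than proportional splits.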
## Dataset Creation
### Curation Rationale
Extracting the Danish portion of the WIT-Base dataset is cumbersome, especially as the
full dataset takes up 333 GB of disk space. Danish-WIT was curated purely to make the
Danish portion easier to work with.
### Source Data
The original data was collected from WikiMedia's
[WIT-Base](https://huggingface.co/datasets/wikimedia/wit_base) dataset, which in turn
comes from Google's [WIT](https://huggingface.co/datasets/google/wit) dataset.
## Additional Information
### Dataset Curators
[Dan Saattrup Nielsen](https://saattrupdan.github.io/) from [The Alexandra
Institute](https://alexandra.dk/) curated this dataset.
### Licensing Information
The dataset is licensed under the [CC BY-SA 4.0
license](https://creativecommons.org/licenses/by-sa/4.0/).
Provider:
maas
Created:
2025-03-12



