encord-team/E-MM1-100M
Hugging Face · Updated 2025-11-19 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/encord-team/E-MM1-100M
---
license: odc-by
language:
- en
size_categories:
- 100M<n<1B
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/nn_*.parquet"
---
# Dataset Card for E-MM1-100M

<div style="display: flex; justify-content: space-between; gap: 5px;">
<!-- <a href="todohttps://arxiv.org/abs/YYMM.NNNNN" target="_blank" rel="noreferrer" style="text-decoration:none; ">
<img src="https://img.shields.io/badge/arXiv-YYMM.NNNNN-b31b1b.svg?logo=arxiv" alt="arXiv Paper" style="vertical-align:middle;">
</a> -->
<!-- <a href="https://colab.research.google.com/github/encord-team/ebind/blob/main/misc/demo.ipynb" target="_blank" rel="noreferrer" style="text-decoration:none; ">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" style="vertical-align:middle;">
</a> -->
<a href="https://huggingface.co/encord-team/ebind-full" target="_blank" rel="noreferrer" style="text-decoration:none; ">
<img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue" alt="Hugging Face Models" style="vertical-align:middle;">
</a>
<a href="https://huggingface.co/datasets/encord-team/E-MM1-100M" target="_blank" rel="noreferrer" style="text-decoration:none; ">
<img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Datasets-blue" alt="Hugging Face Datasets" style="vertical-align:middle;">
</a>
<a href="https://e-mm1.github.io" target="_blank" rel="noreferrer" style="text-decoration:none; ">
<img src="https://img.shields.io/badge/Project%20Page-blue?logo=github" alt="Project Page" style="vertical-align:middle;">
</a>
<div style="flex:1"></div>
<a href="https://encord.com/blog/how-we-built-multimodal-dataset-emm1/" target="_blank" rel="noreferrer" style="text-decoration:none; ">
<img src="https://img.shields.io/badge/%F0%9F%93%96-Blog-blue" alt="Blog" style="vertical-align:middle;">
</a>
<a href="https://twitter.com/encord_team" target="_blank" rel="noreferrer" style="text-decoration:none; ">
<img alt="Twitter Follow" src="https://img.shields.io/twitter/follow/encord_team?label=%40encord_team&style=social" style="vertical-align: middle">
</a>
<span style="text-decoration:none;"><img alt="PRs Welcome" src="https://img.shields.io/badge/PRs-Welcome-blue" style="vertical-align: middle;"></span>
</div>
## Dataset Summary
**E-MM1-100M** is a large-scale multimodal dataset of 100M+ data groups, pairing data from five modalities: audio, image, video, point cloud, and text.
Each group is a 5-tuple consisting of a caption and one item from each of the four other modalities.
The data and captions are sourced from [public data sources](https://github.com/encord-team/E-MM1/blob/main/SOURCE_DATASETS.md).
The dataset was created to advance work on joint embeddings for multimodal applications like cross-modal retrieval. <br>
To visually explore the dataset, please visit our [E-MM1 Explorer](https://data.encord.com/e-mm1/explorer).
## Dataset Structure
The data is structured in nearest-neighbor (NN) shards. The `data` directory stores the `i`th nearest neighbor of every caption in `./data/nn_{i:02d}.parquet`.
For example, the third nearest neighbor of every caption is in `./data/nn_03.parquet`.
You can use this layout to load, for example, only the first nearest neighbor of each caption:
```python
from datasets import load_dataset
# Full dataset (it's 15GB, so streaming is probably best)
ds = load_dataset("encord-team/E-MM1-100M", split="train", streaming=True)
# Load just the nearest neighbor
ds = load_dataset("encord-team/E-MM1-100M", split="train", data_files=["data/nn_01.parquet"])
```
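
Building on the shard naming above, here is a minimal sketch for loading only the first `k` nearest-neighbor shards. The helper names `nn_shard_paths` and `load_top_k` are hypothetical and not part of the dataset tooling. The sketch also assumes shards are numbered from `01` to `16`, in line with the top-16 retrieval described in this card.

```python
def nn_shard_paths(k: int, max_k: int = 16) -> list[str]:
    """Return the parquet paths for the 1st through k-th nearest-neighbor shards."""
    # Assumption: shards are 1-indexed (nn_01 is the nearest neighbor).
    if not 1 <= k <= max_k:
        raise ValueError(f"k must be between 1 and {max_k}, got {k}")
    return [f"data/nn_{i:02d}.parquet" for i in range(1, k + 1)]


def load_top_k(k: int):
    """Load only the first k shards. This downloads data, so call it deliberately."""
    # Imported lazily so the path helper works without the `datasets` package.
    from datasets import load_dataset

    return load_dataset(
        "encord-team/E-MM1-100M",
        split="train",
        data_files=nn_shard_paths(k),
    )


print(nn_shard_paths(3))
# ['data/nn_01.parquet', 'data/nn_02.parquet', 'data/nn_03.parquet']
```

Keeping the path construction separate from the download makes it easy to check which files will be fetched before any data is transferred.
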
> [!NOTE] The raw underlying data for this dataset can be downloaded by following the [instructions](https://github.com/encord-team/E-MM1/blob/main/DOWNLOAD.md) on our GitHub page.
## Dataset Splits
We provide two data splits:
- (this dataset) **E-MM1-100M (automated)**: very large, built via nearest-neighbor retrieval for pre-training applications.
- **[E-MM1-1M](https://huggingface.co/datasets/encord-team/E-MM1-1M/) (annotated)**: validated with high-quality, human-verified annotations for post-training applications.

The **E-MM1-100M** split contains the large-scale dataset built with nearest-neighbor retrieval.
For each of ~6.7M captions, we retrieved the top-16 nearest neighbors across all modalities, resulting in roughly 1B multimodal connections or 100M groups.
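
The counts above can be sanity-checked with a few lines of arithmetic. This sketch assumes that a "connection" is a pairwise link between two items in the same 5-tuple, which this card does not state explicitly. The variable names are illustrative only.

```python
from math import comb

captions = 6_700_000  # approximate number of captions, as stated in the card
top_k = 16            # nearest neighbors retrieved per caption
modalities = 5        # audio, image, video, point cloud, text

# One 5-tuple per (caption, neighbor rank) combination.
groups = captions * top_k

# Assumption: connections are counted as unordered pairs within a group.
pairs_per_group = comb(modalities, 2)
connections = groups * pairs_per_group

print(f"{groups:,} groups, {connections:,} connections")
# 107,200,000 groups, 1,072,000,000 connections
```

Under that reading, the figures line up with the "100M groups" and "roughly 1B connections" quoted above.
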
## Data Schema
```json
{
"caption": "(String) Caption of the audio, image, points, text, or video.",
"dataset_license_audio": "(String) The license of the dataset that the audio belongs to. !! This is not the license of the audio file !!",
"dataset_license_image": "(String) The license of the dataset that the image belongs to. !! This is not the license of the image file !!",
"dataset_license_points": "(String) The license of the dataset that the points belong to.",
"dataset_license_text": "(String) The license of the dataset that the text belongs to.",
"dataset_license_video": "(String) The license of the dataset that the video belongs to. !! This is not the license of the video file !!",
"encord_audio_id": "(Int64) Unique ID of the audio file (and segment).",
"encord_image_id": "(Int64) Unique ID of the image file.",
"encord_points_id": "(Int64) Unique ID of the points file.",
"encord_text_id": "(Int64) Unique ID of the text file.",
"encord_video_id": "(Int64) Unique ID of the video file (and segment).",
"end_time_audio": "(Int64) End time of the audio segment in seconds.",
"end_time_video": "(Int64) End time of the video segment in seconds.",
"file_id_audio": "(String) Audio identifier from the source dataset.",
"file_id_image": "(String) Image identifier from the source dataset.",
"file_id_points": "(String) 3D object identifier from the source dataset.",
"file_id_video": "(String) Video identifier from the source dataset.",
"file_name_audio": "(String) Filename of the audio file if downloaded with download script.",
"file_name_image": "(String) Filename of the image file if downloaded with download script.",
"file_name_points": "(String) Filename of the points file if downloaded with download script.",
"file_name_video": "(String) Filename of the video file if downloaded with download script.",
"nn_index": "(Int64) All items in the row are the `nn_index`-th nearest neighbors to the caption.",
"save_folder_audio": "(String) Folder name of the audio file if downloaded with download script.",
"save_folder_image": "(String) Folder name of the image file if downloaded with download script.",
"save_folder_points": "(String) Folder name of the points file if downloaded with download script.",
"save_folder_video": "(String) Folder name of the video file if downloaded with download script.",
"source_dataset_audio": "(String) Source dataset of the audio file.",
"source_dataset_image": "(String) Source dataset of the image file.",
"source_dataset_points": "(String) Source dataset of the points file.",
"source_dataset_text": "(String) Source dataset of the text file.",
"source_dataset_video": "(String) Source dataset of the video file.",
"start_time_audio": "(Int64) Start time of the audio segment in seconds.",
"start_time_video": "(Int64) Start time of the video segment in seconds.",
"youtube_id_audio": "(String) YouTube ID of the audio file.",
"youtube_id_video": "(String) YouTube ID of the video file."
}
```
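
As an illustration of how the `save_folder_*` and `file_name_*` fields might be combined, the sketch below builds a local path for one modality of a row. It assumes the download script stores files as `<root>/<save_folder>/<file_name>`, which this card does not confirm. The `local_path` helper, the `downloads` root, and the example row values are all hypothetical.

```python
from pathlib import Path
from typing import Optional

# Modalities that have save_folder_* and file_name_* fields in the schema.
MODALITIES = ("audio", "image", "points", "video")


def local_path(row: dict, modality: str, root: str = "downloads") -> Optional[Path]:
    """Build <root>/<save_folder_*>/<file_name_*> for one modality of a row.

    Assumption: the download script uses this directory layout.
    Returns None when either field is missing or empty for the row.
    """
    if modality not in MODALITIES:
        raise ValueError(f"unknown modality: {modality!r}")
    folder = row.get(f"save_folder_{modality}")
    name = row.get(f"file_name_{modality}")
    if not folder or not name:
        return None
    return Path(root) / folder / name


# Made-up row for illustration; real values come from the parquet shards.
example_row = {
    "caption": "a dog barking in a park",
    "nn_index": 1,
    "save_folder_audio": "audio_folder",
    "file_name_audio": "clip_0001.wav",
}

print(local_path(example_row, "audio"))  # path under downloads/
print(local_path(example_row, "image"))  # None, since the row has no image fields
```

Returning `None` for missing fields lets callers skip modalities that have not been downloaded without raising an error.
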
## Additional Information
### Usage Documentation
Please find more detailed usage instructions on [GitHub](https://github.com/encord-team/E-MM1).
We have also created an [interactive demonstration](https://data.encord.com) in Encord for visual exploration of a subset of the dataset, where you can also find dataset tutorials.
### Contact
For any questions about the creation of and applications for the dataset, please contact our team at [ml@encord.com](mailto:ml@encord.com).
### Citation Information
```
@misc{broadbent2025ebindpracticalapproachspace,
title={{EBind}: a practical approach to space binding},
author={Jim Broadbent and Felix Cohen and Frederik Hvilshøj and Eric Landau and Eren Sasoglu},
year={2025},
eprint={2511.14229},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2511.14229},
}
```
Provider: encord-team



