encord-team/E-MM1-100M

Name: encord-team/E-MM1-100M
Creator: encord-team
Published: 2025-11-19 19:59:06
License: no description available

Hugging Face · Updated 2025-11-19 · Indexed 2025-12-20

Download link:

https://hf-mirror.com/datasets/encord-team/E-MM1-100M


Description:

---
license: odc-by
language:
- en
size_categories:
- 100M<n<1B
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/nn_*.parquet"
---

# Dataset Card for E-MM1-100M

![emm1](https://cdn-uploads.huggingface.co/production/uploads/62a0da842e30aaf94ebaaa12/wTt7DkIhOr-VvqW9w54VT.png)

<div style="display: flex; justify-content: space-between; gap: 5px;">
  <a href="https://huggingface.co/encord-team/ebind-full" target="_blank" rel="noreferrer" style="text-decoration:none; "> <img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue" alt="Hugging Face Models" style="vertical-align:middle;"> </a>
  <a href="https://huggingface.co/datasets/encord-team/E-MM1-100M" target="_blank" rel="noreferrer" style="text-decoration:none; "> <img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Datasets-blue" alt="Hugging Face Datasets" style="vertical-align:middle;"> </a>
  <a href="https://e-mm1.github.io" target="_blank" rel="noreferrer" style="text-decoration:none; "> <img src="https://img.shields.io/badge/Project%20Page-blue?logo=github" alt="Blog" style="vertical-align:middle;"> </a>
  <div style="flex:1"></div>
  <a href="https://encord.com/blog/how-we-built-multimodal-dataset-emm1/" target="_blank" rel="noreferrer" style="text-decoration:none; "> <img src="https://img.shields.io/badge/%F0%9F%93%96-Blog-blue" alt="Blog" style="vertical-align:middle;"> </a>
  <a href="https://twitter.com/encord_team" target="_blank" rel="noreferrer" style="text-decoration:none; "> <img alt="Twitter Follow" src="https://img.shields.io/twitter/follow/encord_team?label=%40encord_team&style=social" style="vertical-align: middle"> </a>
  <span style="text-decoration:none;"><img alt="PRs Welcome" src="https://img.shields.io/badge/PRs-Welcome-blue" style="vertical-align: middle;"></span>
</div>

## Dataset Summary

**E-MM1-100M** is a large-scale multimodal dataset of 100M+ data groups spanning five modalities: audio, image, video, point cloud, and text. Each group is a 5-tuple: a caption and one item from each of the four other modalities. The data and captions are sourced from [public data sources](https://github.com/encord-team/E-MM1/blob/main/SOURCE_DATASETS.md). The dataset was created to advance work on joint embeddings for multimodal applications such as cross-modal retrieval.

To explore the dataset visually, visit our [E-MM1 Explorer](https://data.encord.com/e-mm1/explorer).

## Dataset Structure

The data is organized in nearest-neighbor (NN) shards. The `data` directory stores the `i`th nearest neighbors of every caption in `./data/nn_{i:02d}.parquet`. For example, the third nearest neighbor of every caption is in `./data/nn_03.parquet`. You can use this layout to load, for example, only the nearest neighbor of each caption:

```python
from datasets import load_dataset

# Full dataset (about 15 GB, so streaming is probably best)
ds = load_dataset("encord-team/E-MM1-100M", split="train", streaming=True)

# Load just the nearest neighbor of each caption
ds = load_dataset("encord-team/E-MM1-100M", split="train", data_files=["data/nn_01.parquet"])
```

> [!NOTE]
> The raw underlying data for this dataset can be downloaded by following the [instructions](https://github.com/encord-team/E-MM1/blob/main/DOWNLOAD.md) on our GitHub page.

## Dataset Splits

We provide two data splits:

- (this dataset) **E-MM1-100M (automated)**: very large, built via nearest-neighbour retrieval for pre-training applications.
- **[E-MM1-1M](https://huggingface.co/datasets/encord-team/E-MM1-1M/) (annotated)**: validated with high-quality, human-verified annotations for post-training applications.

The **E-MM1-100M** split contains the large-scale dataset built with nearest-neighbour retrieval. For each of ~6.7M captions, we retrieved the top-16 nearest neighbours across all modalities, resulting in roughly 1B multimodal connections or 100M groups.

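The shard naming described under Dataset Structure, together with the counts quoted above, can be sketched in a few lines. This is a minimal illustration, not part of the official tooling: `nn_shard_paths` is a hypothetical helper name, and the assumption that the shards run from `nn_01` to `nn_16` is inferred from the top-16 retrieval rather than stated in the card.

```python
def nn_shard_paths(k: int, total: int = 16) -> list[str]:
    """Build the parquet paths holding the 1st..k-th nearest neighbors.

    Assumption: shards are numbered 01..total (1-based), following the
    `data/nn_{i:02d}.parquet` pattern given in the card.
    """
    if not 1 <= k <= total:
        raise ValueError(f"k must be between 1 and {total}, got {k}")
    return [f"data/nn_{i:02d}.parquet" for i in range(1, k + 1)]


# Rough size check using the figures quoted in the card:
# ~6.7M captions times 16 neighbors each gives the "100M groups" scale.
approx_groups = 6_700_000 * 16

if __name__ == "__main__":
    print(nn_shard_paths(3))
    print(f"{approx_groups:,} groups (approximate)")
```

The returned list could then be passed as `data_files=nn_shard_paths(3)` to `load_dataset` to fetch only the three closest neighbors of each caption.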
## Data Schema

```json
{
  "caption": "(String) Caption of the audio, image, points, text, or video.",
  "dataset_license_audio": "(String) The license of the dataset that the audio belongs to. !! This is not the license of the audio file !!",
  "dataset_license_image": "(String) The license of the dataset that the image belongs to. !! This is not the license of the image file !!",
  "dataset_license_points": "(String) The license of the dataset that the points belong to.",
  "dataset_license_text": "(String) The license of the dataset that the text belongs to.",
  "dataset_license_video": "(String) The license of the dataset that the video belongs to. !! This is not the license of the video file !!",
  "encord_audio_id": "(Int64) Unique ID of the audio file (and segment).",
  "encord_image_id": "(Int64) Unique ID of the image file.",
  "encord_points_id": "(Int64) Unique ID of the points file.",
  "encord_text_id": "(Int64) Unique ID of the text file.",
  "encord_video_id": "(Int64) Unique ID of the video file (and segment).",
  "end_time_audio": "(Int64) End time of the audio segment in seconds.",
  "end_time_video": "(Int64) End time of the video segment in seconds.",
  "file_id_audio": "(String) Audio identifier from the source dataset.",
  "file_id_image": "(String) Image identifier from the source dataset.",
  "file_id_points": "(String) 3D object identifier from the source dataset.",
  "file_id_video": "(String) Video identifier from the source dataset.",
  "file_name_audio": "(String) Filename of the audio file if downloaded with download script.",
  "file_name_image": "(String) Filename of the image file if downloaded with download script.",
  "file_name_points": "(String) Filename of the points file if downloaded with download script.",
  "file_name_video": "(String) Filename of the video file if downloaded with download script.",
  "nn_index": "(Int64) All items in the row are the `nn_index`-th nearest neighbors to the caption.",
  "save_folder_audio": "(String) Folder name of the audio file if downloaded with download script.",
  "save_folder_image": "(String) Folder name of the image file if downloaded with download script.",
  "save_folder_points": "(String) Folder name of the points file if downloaded with download script.",
  "save_folder_video": "(String) Folder name of the video file if downloaded with download script.",
  "source_dataset_audio": "(String) Source dataset of the audio file.",
  "source_dataset_image": "(String) Source dataset of the image file.",
  "source_dataset_points": "(String) Source dataset of the points file.",
  "source_dataset_text": "(String) Source dataset of the text file.",
  "source_dataset_video": "(String) Source dataset of the video file.",
  "start_time_audio": "(Int64) Start time of the audio segment in seconds.",
  "start_time_video": "(Int64) Start time of the video segment in seconds.",
  "youtube_id_audio": "(String) YouTube ID of the audio file.",
  "youtube_id_video": "(String) YouTube ID of the video file."
}
```

## Additional Information

### Usage Documentation

Please find more detailed usage instructions on [GitHub](https://github.com/encord-team/E-MM1). We have also created an [interactive demonstration](https://data.encord.com) in Encord for visual exploration of a subset of the dataset, where you can find dataset tutorials as well.

### Contact

For any questions about the creation of and applications for the dataset, please contact our team at [ml@encord.com](mailto:ml@encord.com).

### Citation Information

```
@misc{broadbent2025ebindpracticalapproachspace,
  title={{EBind}: a practical approach to space binding},
  author={Jim Broadbent and Felix Cohen and Frederik Hvilshøj and Eric Landau and Eren Sasoglu},
  year={2025},
  eprint={2511.14229},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2511.14229},
}
```
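### Example: grouping schema columns by modality

As a supplement to the Data Schema section, the flat column names can be regrouped per modality by looking for the modality token (`audio`, `image`, `points`, `text`, `video`) in each column name. This is a minimal sketch under stated assumptions: `split_by_modality` is a hypothetical helper, and the sample row uses made-up values purely for illustration, not real dataset content.

```python
MODALITIES = ("audio", "image", "points", "text", "video")


def split_by_modality(row: dict) -> tuple[dict, dict]:
    """Split one flat row into per-modality fields and shared fields.

    A column such as `encord_audio_id` becomes grouped["audio"]["encord_id"];
    columns without a modality token (e.g. `caption`, `nn_index`) are shared.
    """
    grouped = {m: {} for m in MODALITIES}
    shared = {}
    for key, value in row.items():
        parts = key.split("_")
        hits = [p for p in parts if p in MODALITIES]
        if len(hits) == 1:
            # Drop the modality token to get a modality-independent field name.
            field = "_".join(p for p in parts if p != hits[0])
            grouped[hits[0]][field] = value
        else:
            shared[key] = value
    return grouped, shared


if __name__ == "__main__":
    # Hypothetical row with invented values, covering only a few columns.
    sample = {
        "caption": "a dog barking",
        "nn_index": 1,
        "encord_audio_id": 7,
        "file_name_audio": "a.wav",
        "source_dataset_image": "example-source",
    }
    grouped, shared = split_by_modality(sample)
    print(shared)
    print(grouped["audio"])
```

Grouping this way keeps the original column names recoverable, since each field name is the column name with the modality token removed.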

Provider:

encord-team
