lance-format/openvid-lance

Name: lance-format/openvid-lance
Creator: lance-format
Published: 2026-01-31 10:58:16
License: No description available

Source: Hugging Face. Updated 2026-01-31. Indexed 2026-04-05.

Download link:

https://hf-mirror.com/datasets/lance-format/openvid-lance



Description:

---
license: cc-by-4.0
task_categories:
- text-to-video
- video-classification
language:
- en
tags:
- text-to-video
- video-search
pretty_name: openvid-lance
size_categories:
- 100K<n<1M
---

![](https://huggingface.co/datasets/nkp37/OpenVid-1M/resolve/main/OpenVid-1M.png)

# OpenVid Dataset (Lance Format)

Lance format version of the [OpenVid dataset](https://huggingface.co/datasets/nkp37/OpenVid-1M) with **937,957 high-quality videos** stored with inline video blobs, embeddings, and rich metadata.

## Why Lance?

Lance is an open-source format designed for multimodal AI data, offering significant advantages over traditional formats for modern AI workloads.

- **Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation.
- **Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search.
- **Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time.
- **Versatile Querying**: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes.

## Key Features

The OpenVid dataset is stored in Lance format with inline video blobs, video embeddings, and rich metadata.

- **Videos stored inline as blobs**: No external files to manage
- **Efficient column access**: Load metadata without touching video data
- **Prebuilt indices available**: IVF_PQ index for similarity search, FTS index on captions
- **Fast random access**: Read any video instantly by index
- **Hugging Face integration**: Load directly from the Hub

## Quick Start

### Load with `datasets.load_dataset`

```python
import datasets

hf_ds = datasets.load_dataset(
    "lance-format/openvid-lance",
    split="train",
    streaming=True,
)

# Take first three rows and print captions
for row in hf_ds.take(3):
    print(row["caption"])
```

### Load with Lance

Use Lance for ANN search, retrieving specific blob bytes, or advanced indexing, while still pointing at the dataset on the Hub:

```python
import lance

lance_ds = lance.dataset("hf://datasets/lance-format/openvid-lance/data/train.lance")

# Fetch the blob for one row and read its bytes
blob_file = lance_ds.take_blobs("video_blob", ids=[0])[0]
video_bytes = blob_file.read()
```

### Load with LanceDB

These tables can also be consumed by [LanceDB](https://docs.lancedb.com/), the multimodal lakehouse for AI (built on top of Lance). LanceDB provides several convenience APIs for search, index creation, and data updates on top of the Lance format.

```python
import lancedb

db = lancedb.connect("hf://datasets/lance-format/openvid-lance/data")
tbl = db.open_table("train")
print(f"LanceDB table opened with {len(tbl)} videos")
```

## Blob API

Lance stores videos as **inline blobs** - binary data embedded directly in the dataset. This provides:

- **Single source of truth** - Videos and metadata together in one dataset
- **Lazy loading** - Videos only loaded when you explicitly request them
- **Efficient storage** - Optimized encoding for large binary data

```python
import lance

# Same dataset path as in the Quick Start
ds = lance.dataset("hf://datasets/lance-format/openvid-lance/data/train.lance")

# 1. Browse metadata without loading video data
metadata = ds.scanner(
    columns=["caption", "aesthetic_score"],  # No video_blob column!
    filter="aesthetic_score >= 4.5",
    limit=10
).to_table().to_pylist()

# 2. User selects video to watch
selected_index = 3

# 3. Load only that video blob
blob_file = ds.take_blobs("video_blob", ids=[selected_index])[0]
video_bytes = blob_file.read()

# 4. Save to disk
with open("video.mp4", "wb") as f:
    f.write(video_bytes)
```

> **⚠️ Hugging Face Streaming Note**
>
> When streaming from Hugging Face (as shown above), some operations use minimal parameters to avoid rate limits:
> - `nprobes=1` for vector search (lowest value)
> - Column selection to reduce I/O
>
> **You may still hit rate limits on Hugging Face's free tier.** For best performance and to avoid rate limits, **download the dataset locally**:
>
> ```bash
> # Download once
> huggingface-cli download lance-format/openvid-lance --repo-type dataset --local-dir ./openvid
>
> # Then load locally (in Python):
> #   ds = lance.dataset("./openvid/data/train.lance")
> ```
>
> Streaming is recommended only for quick exploration and testing.

## Usage Examples

### 1. Browse metadata quickly (no video loading)

```python
# Load only metadata without heavy video blobs
scanner = ds.scanner(
    columns=["caption", "aesthetic_score", "motion_score"],
    limit=10
)
videos = scanner.to_table().to_pylist()

for video in videos:
    print(f"{video['caption']} - Quality: {video['aesthetic_score']:.2f}")
```

### 2. Export videos from blobs

To work with a subset of the data, retrieve specific videos and export them to files on your local machine.

```python
# Load specific videos by index
indices = [0, 100, 500]
blob_files = ds.take_blobs("video_blob", ids=indices)

# Save to disk
for i, blob_file in enumerate(blob_files):
    with open(f"video_{i}.mp4", "wb") as f:
        f.write(blob_file.read())
```

### 3. Open inline videos with PyAV and run seeks directly on the blob file

With seeks, you can jump to specific timestamps within a blob and decode frames from there, as the example below shows.

```python
import av

selected_index = 123
blob_file = ds.take_blobs("video_blob", ids=[selected_index])[0]

with av.open(blob_file) as container:
    stream = container.streams.video[0]

    for seconds in (0.0, 1.0, 2.5):
        # Convert seconds to the stream's time base, then seek to the
        # nearest keyframe at or before the target
        target_pts = int(seconds / stream.time_base)
        container.seek(target_pts, stream=stream)

        # Decode forward until the first frame at or after the requested time
        frame = None
        for candidate in container.decode(stream):
            if candidate.time is None:
                continue
            frame = candidate
            if frame.time >= seconds:
                break

        if frame is None:
            # Nothing was decoded after the seek
            continue

        print(
            f"Seek {seconds:.1f}s -> {frame.width}x{frame.height} "
            f"(pts={frame.pts}, time={frame.time:.2f}s)"
        )
```

### 4. Inspect existing indices

You can inspect the prebuilt indices on the dataset:

```python
import lance

# Open the dataset
dataset = lance.dataset("hf://datasets/lance-format/openvid-lance/data/train.lance")

# List all indices
indices = dataset.list_indices()
print(indices)
```

### 5. Create a new index

While this dataset comes with prebuilt indices, you can also create your own custom indices if needed. The example below creates a vector index on the `embedding` column.

```python
# ds is a local Lance dataset
ds.create_index(
    "embedding",
    index_type="IVF_PQ",
    num_partitions=256,
    num_sub_vectors=64,  # must evenly divide the embedding dimension
    replace=True,
)
```

### 6. Vector Similarity Search

```python
import pyarrow as pa

# Use the embedding of the first video as the query vector
ref_video = ds.take([0], columns=["embedding"]).to_pylist()[0]
query_vector = pa.array([ref_video['embedding']], type=pa.list_(pa.float32(), 1024))

# Find similar videos
results = ds.scanner(
    nearest={
        "column": "embedding",
        "q": query_vector[0],
        "k": 5,
        "nprobes": 1,
        "refine_factor": 1
    }
).to_table().to_pylist()

for video in results[1:]:  # Skip first (query itself)
    print(video['caption'])
```

### 7. Full-Text Search

```python
# Search captions using FTS index
results = ds.scanner(
    full_text_query="sunset beach",
    columns=["caption", "aesthetic_score"],
    limit=10,
    fast_search=True
).to_table().to_pylist()

for video in results:
    print(f"{video['caption']} - {video['aesthetic_score']:.2f}")
```

## Dataset Evolution

Lance supports flexible schema and data evolution ([docs](https://lance.org/guide/data_evolution/?h=evol)). You can add or drop columns, backfill with SQL or Python, rename fields, or change data types without rewriting the whole dataset. In practice this lets you:

- Introduce fresh metadata (moderation labels, embeddings, quality scores) as new signals become available.
- Add new columns to existing datasets without re-exporting terabytes of video.
- Adjust column names or shrink storage (e.g., cast embeddings to float16) while keeping previous snapshots queryable for reproducibility.

```python
import lance
import pyarrow as pa
import numpy as np

base = pa.table({"id": pa.array([1, 2, 3])})
dataset = lance.write_dataset(base, "openvid_evolution", mode="overwrite")

# 1. Grow the schema instantly (metadata-only)
dataset.add_columns(pa.field("quality_bucket", pa.string()))

# 2. Backfill with SQL expressions or constants
dataset.add_columns({"status": "'active'"})

# 3. Generate rich columns via Python batch UDFs
@lance.batch_udf()
def random_embedding(batch):
    arr = np.random.rand(batch.num_rows, 128).astype("float32")
    return pa.RecordBatch.from_arrays(
        [pa.FixedSizeListArray.from_arrays(arr.ravel(), 128)],
        names=["embedding"],
    )

dataset.add_columns(random_embedding)

# 4. Bring in offline annotations with merge
labels = pa.table({
    "id": pa.array([1, 2, 3]),
    "label": pa.array(["horse", "rabbit", "cat"]),
})
dataset.merge(labels, "id")

# 5. Rename or cast columns as needs change
dataset.alter_columns({"path": "quality_bucket", "name": "quality_tier"})
dataset.alter_columns({"path": "embedding", "data_type": pa.list_(pa.float16(), 128)})
```

These operations are automatically versioned, so prior experiments can still point to earlier versions while OpenVid keeps evolving.

## LanceDB

LanceDB users can run search queries on the dataset with the examples below.

### LanceDB Vector Similarity Search

```python
import lancedb

db = lancedb.connect("hf://datasets/lance-format/openvid-lance/data")
tbl = db.open_table("train")

# Get a video to use as a query (an empty search returns plain rows)
ref_video = tbl.search().limit(1).select(["embedding", "caption"]).to_pandas().to_dict('records')[0]
query_embedding = ref_video["embedding"]

results = tbl.search(query_embedding, vector_column_name="embedding") \
    .metric("L2") \
    .nprobes(1) \
    .limit(5) \
    .to_list()

for video in results[1:]:  # Skip first (query itself)
    print(f"{video['caption'][:60]}...")
```

### LanceDB Full-Text Search

```python
import lancedb

db = lancedb.connect("hf://datasets/lance-format/openvid-lance/data")
tbl = db.open_table("train")

results = tbl.search("sunset beach") \
    .select(["caption", "aesthetic_score"]) \
    .limit(10) \
    .to_list()

for video in results:
    print(f"{video['caption']} - {video['aesthetic_score']:.2f}")
```

## Citation

@article{nan2024openvid,
  title={OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation},
  author={Nan, Kepan and Xie, Rui and Zhou, Penghao and Fan, Tiehan and Yang, Zhenheng and Chen, Zhijie and Li, Xiang and Yang, Jian and Tai, Ying},
  journal={arXiv preprint arXiv:2407.02371},
  year={2024}
}

## License

Please check the original OpenVid dataset license for usage terms.
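
## Appendix: Choosing `num_sub_vectors`

The index-creation example passes `num_sub_vectors` to `create_index`. Product quantization splits each vector into that many equal-length sub-vectors, so the value needs to divide the embedding dimension evenly. The sketch below lists the values that satisfy this. It assumes the 1024-dimensional embeddings implied by the `pa.list_(pa.float32(), 1024)` query type in the vector search example; `valid_num_sub_vectors` is a hypothetical helper written for illustration, not part of Lance.

```python
def valid_num_sub_vectors(dim: int) -> list[int]:
    """Return every value that splits a dim-dimensional vector into equal sub-vectors."""
    if dim <= 0:
        raise ValueError("dim must be positive")
    return [m for m in range(1, dim + 1) if dim % m == 0]


# Assumption: dimension taken from the query type in the vector search example
EMBEDDING_DIM = 1024

candidates = valid_num_sub_vectors(EMBEDDING_DIM)
print(candidates)
```

For a dimension of 1024 the candidates are the powers of two up to 1024. Larger values generally keep more detail per vector at the cost of a bigger index, so treat the choice as a tuning knob and check recall on your own queries.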

Provided by:

lance-format
