DeepInfra/variable-length-embedding-benchmark

Name: DeepInfra/variable-length-embedding-benchmark
Creator: DeepInfra
Published: 2026-01-15 17:43:18
License: MIT

Source: Hugging Face. Updated: 2026-01-15. Indexed: 2026-05-10.

Download link:

https://hf-mirror.com/datasets/DeepInfra/variable-length-embedding-benchmark

Description:

---
license: mit
task_categories:
- feature-extraction
- sentence-similarity
- text-retrieval
dataset_info:
  features:
  - name: text
    dtype: string
  - name: length_category
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 373887763
    num_examples: 8000
  download_size: 210093799
  dataset_size: 373887763
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
tags:
- embedding
- benchmark
- long-context
- deepinfra
- rag
- wikitext
language:
- en
size_categories:
- 1K<n<10K
---

# Variable Length Embedding Benchmark (VLEB)

## Dataset Summary

**VLEB (Variable Length Embedding Benchmark)** is a specialized dataset designed to evaluate the performance, latency, and stability of **embedding models and rerankers** across a wide spectrum of context lengths.

Unlike standard datasets that focus on short passages, VLEB provides a balanced distribution of text ranging from standard RAG chunks to maximum-context documents (up to 32k tokens). It is constructed from `wikitext-103-raw-v1` using a **smart-clipping strategy** that preserves semantic integrity without splitting words.

This benchmark is essential for:

- **Length Generalization:** Testing whether models maintain semantic understanding as context grows.
- **RAG Profiling:** Measuring encoding latency and memory usage for each length bin.
- **"Lost-in-the-Middle" Analysis:** Evaluating retrieval degradation in long-context windows.

## Data Structure

The dataset consists of **8,000 samples**, strictly balanced across 4 length categories (2,000 samples per bin). Token counts are calculated using the `Qwen/Qwen2.5-7B-Instruct` tokenizer.

| Category | Token Range (Qwen) | Typical Use Case |
| :--- | :--- | :--- |
| **Short** | 512 - 2,048 | Standard RAG chunks, abstracts, news snippets. |
| **Medium** | 2,048 - 8,192 | Full articles, technical reports, single-file code. |
| **Long** | 8,192 - 16,384 | Multiple papers, book chapters, long legal contracts. |
| **Very Long** | 16,384 - 32,000 | Entire books, massive documentation, stress testing context limits. |

## Usage

```python
from datasets import load_dataset

# Load the train split (the only split) as a single Dataset,
# so that integer indexing such as dataset[0] works below.
dataset = load_dataset("DeepInfra/variable-length-embedding-benchmark", split="train")

# Filter for specific length requirements
short_contexts = dataset.filter(lambda x: x["length_category"] == "Short")
very_long_contexts = dataset.filter(lambda x: x["length_category"] == "Very Long")

print(f"Sample Text ({very_long_contexts[0]['length_category']}):")
print(very_long_contexts[0]["text"][:200] + "...")
```

## Construction Methodology

1. **Source:** The dataset is derived from the `wikitext-103-raw-v1` corpus.
2. **Stream Buffering:** The raw text was processed as a continuous stream rather than isolated lines.
3. **Smart Clipping:** A buffer system accumulated tokens until a target length (randomly selected within bin ranges) was met. The text was then clipped at the exact token boundary and decoded back to string, ensuring **no words are split** and the text remains natural.
4. **Validation:** All samples were re-tokenized to ensure they strictly fall within their assigned bin limits.

## Citation

If you use this dataset for benchmarking, please cite:

```bibtex
@misc{vleb_2026,
  author = {DeepInfra Engineering Team},
  title = {Variable Length Embedding Benchmark (VLEB)},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/DeepInfra/variable-length-embedding-benchmark}}
}
```
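The buffering, clipping, and validation steps listed under Construction Methodology can be illustrated with a minimal sketch. This is not the authors' build script: the function name `clip_stream`, its parameters, and the choice to carry leftover tokens into the next sample are all assumptions made for illustration. A whitespace tokenizer stands in for `Qwen/Qwen2.5-7B-Instruct` so the example runs without downloads, and the toy bin limits (8 to 16 tokens) are far smaller than the real ones.

```python
import random
from typing import Callable, Iterable, Iterator, List


def clip_stream(
    lines: Iterable[str],
    encode: Callable[[str], List[str]],
    decode: Callable[[List[str]], str],
    lo: int,
    hi: int,
    rng: random.Random,
) -> Iterator[str]:
    """Yield text samples whose token count lies in [lo, hi].

    Hypothetical helper: accumulates tokens from a continuous stream,
    clips at a randomly chosen target length, and re-validates the result.
    """
    buffer: List[str] = []
    target = rng.randint(lo, hi)  # random target inside the bin (inclusive)
    for line in lines:
        buffer.extend(encode(line))  # treat the input as one continuous stream
        while len(buffer) >= target:
            sample = decode(buffer[:target])  # clip on a token boundary
            buffer = buffer[target:]  # assumption: leftover tokens are reused
            # Validation: re-tokenize and keep only samples still in range.
            # With a real subword tokenizer, decoding and re-encoding may not
            # round-trip exactly, which is why this check matters.
            if lo <= len(encode(sample)) <= hi:
                yield sample
            target = rng.randint(lo, hi)


# Toy demo: whitespace "tokens" and a tiny bin of 8 to 16 tokens.
toy_lines = ["alpha beta gamma delta"] * 50  # 200 tokens in total
samples = list(
    clip_stream(toy_lines, str.split, " ".join, lo=8, hi=16, rng=random.Random(0))
)
print(len(samples), [len(s.split()) for s in samples[:3]])
```

To approximate the real pipeline, `encode` and `decode` would be replaced by the corresponding methods of the Qwen tokenizer and `lo`/`hi` by the limits from the Data Structure table; whether the original build discarded or reused leftover tokens is not stated in the card.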

Provider:

DeepInfra
