makneeee/cohere_medium_1m

Name: makneeee/cohere_medium_1m
Creator: makneeee
Published: 2026-02-21 00:39:48
License: No description available

Hugging Face · Updated 2026-02-21 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/makneeee/cohere_medium_1m


Resource description:

---
license: other
task_categories:
- feature-extraction
tags:
- vector-search
- diskann
- nearest-neighbor
- benchmark
- vectordbbench
pretty_name: "Cohere Medium 1M - Sharded DiskANN Indices"
size_categories:
- 100K<n<1M
---

# Cohere Medium 1M - Sharded DiskANN Indices

Pre-built DiskANN indices for the Cohere Medium 1M dataset from VectorDBBench, sharded for distributed vector search.

## Dataset Info

- **Source**: VectorDBBench (Cohere)
- **Vectors**: 1,000,000
- **Dimensions**: 768
- **Data type**: float32
- **Queries**: 10,000
- **Distance**: L2

## DiskANN Parameters

- **R** (graph degree): 16, 32, 64
- **L** (build beam width): 100
- **PQ bytes**: 192

## Shard Configurations

- **shard_3**: 3 shards x ~333,333 vectors
- **shard_5**: 5 shards x ~200,000 vectors
- **shard_7**: 7 shards x ~142,857 vectors
- **shard_10**: 10 shards x ~100,000 vectors

## Index Variants (per shard directory)

- R=16: `cohere_medium_1m_16_100_192.shard*_disk.index`
- R=32: `cohere_medium_1m_32_100_192.shard*_disk.index`
- R=64: `cohere_medium_1m_64_100_192.shard*_disk.index`

## File Structure

```
fbin/
  base.fbin                                                         # Base vectors (float32)
  queries.fbin                                                      # Query vectors (float32)
parquet/
  train_*.parquet                                                   # Original VectorDBBench parquet
  test.parquet                                                      # Original queries parquet
diskann/
  gt_100.fbin                                                       # Ground truth (100-NN)
  shard_N/                                                          # N-shard configuration
    cohere_medium_1m_base.shardX.fbin                               # Shard base data
    cohere_medium_1m_R_100_192.shardX_disk.index                    # DiskANN disk index
    cohere_medium_1m_R_100_192.shardX_disk.index_512_none.indices   # MinIO graph indices
    cohere_medium_1m_R_100_192.shardX_disk.index_base_none.vectors  # MinIO vector data
    cohere_medium_1m_R_100_192.shardX_pq_pivots.bin                 # PQ pivot data
    cohere_medium_1m_R_100_192.shardX_pq_compressed.bin             # PQ compressed data
    cohere_medium_1m_R_100_192.shardX_sample_data.bin               # Sample data
    cohere_medium_1m_R_100_192.shardX_sample_ids.bin                # Sample IDs
```

Where R is one of 16, 32, 64 and X is the shard index.
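The naming scheme above can be turned into a small path builder, which may help when fetching every shard of one configuration. This is a minimal sketch, not part of the dataset's tooling: the helper names `disk_index_path` and `all_disk_index_paths` are hypothetical, and the assumption that shard indices run from 0 to N-1 is inferred only from the `shard0` file name that appears in this README.

```python
# Values taken from the README: shard configurations and DiskANN parameters.
SHARD_COUNTS = (3, 5, 7, 10)   # shard_3, shard_5, shard_7, shard_10
GRAPH_DEGREES = (16, 32, 64)   # R (graph degree)
BUILD_L = 100                  # L (build beam width)
PQ_BYTES = 192                 # PQ bytes


def disk_index_path(n_shards: int, r: int, shard: int) -> str:
    """Return the repo-relative path of one DiskANN disk index file.

    Assumption: shard indices are 0-based (0 .. n_shards - 1).
    """
    if n_shards not in SHARD_COUNTS:
        raise ValueError(f"unknown shard configuration: {n_shards}")
    if r not in GRAPH_DEGREES:
        raise ValueError(f"unknown graph degree R: {r}")
    if not 0 <= shard < n_shards:
        raise ValueError(f"shard index {shard} out of range for {n_shards} shards")
    return (
        f"diskann/shard_{n_shards}/"
        f"cohere_medium_1m_{r}_{BUILD_L}_{PQ_BYTES}.shard{shard}_disk.index"
    )


def all_disk_index_paths(n_shards: int, r: int) -> list[str]:
    """Return the disk index paths for every shard of one configuration."""
    return [disk_index_path(n_shards, r, s) for s in range(n_shards)]
```

Each returned string can be passed as the `filename` argument of `hf_hub_download`. If a file exceeds the 5 GB chunking threshold, the repository may hold `.partNNNN` pieces instead of the plain name, so a lookup can still fail for large indices.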
### Chunked Files

Files larger than 5 GB are split into chunks for upload:

- `*.part0000`, `*.part0001`, etc.

To reassemble: `cat file.part0000 file.part0001 ... > file`

## Usage

### Download with huggingface_hub

```python
from huggingface_hub import hf_hub_download

# Download a specific shard file
index = hf_hub_download(
    repo_id="makneeee/cohere_medium_1m",
    filename="diskann/shard_10/cohere_medium_1m_64_100_192.shard0_disk.index",
    repo_type="dataset"
)
```

### Download with git-lfs

```bash
git lfs install
git clone https://huggingface.co/datasets/makneeee/cohere_medium_1m
```

## License

Same as source dataset (VectorDBBench).
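
## Example: Reassembling Chunked Files in Python

Where `cat` is not available (for example on Windows), the chunk reassembly described under Chunked Files could be done in Python. The following is a minimal sketch under stated assumptions: the function name `reassemble_chunks` is hypothetical, and it assumes all parts sit next to each other in one directory and use the zero-padded `.partNNNN` suffix, so that lexicographic order equals chunk order.

```python
import shutil
from pathlib import Path


def reassemble_chunks(target: Path) -> Path:
    """Concatenate target.part0000, target.part0001, ... into target.

    Assumption: suffixes are zero-padded, so sorting by name gives
    the correct chunk order.
    """
    prefix = target.name + ".part"
    parts = sorted(
        p for p in target.parent.iterdir()
        if p.is_file() and p.name.startswith(prefix)
    )
    if not parts:
        raise FileNotFoundError(f"no chunks found for {target}")
    with open(target, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream the chunk so multi-GB parts are never held in memory.
                shutil.copyfileobj(src, out)
    return target
```

Calling `reassemble_chunks(Path("some_dir/file"))` writes `some_dir/file` from its parts and leaves the part files in place, so they can be deleted after the result has been checked.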

Provided by:

makneeee
