
makneeee/cohere_medium_1m

Source: Hugging Face. Updated 2026-02-21. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/makneeee/cohere_medium_1m
Description:
---
license: other
task_categories:
- feature-extraction
tags:
- vector-search
- diskann
- nearest-neighbor
- benchmark
- vectordbbench
pretty_name: "Cohere Medium 1M - Sharded DiskANN Indices"
size_categories:
- 100K<n<1M
---

# Cohere Medium 1M - Sharded DiskANN Indices

Pre-built DiskANN indices for the Cohere Medium 1M dataset from VectorDBBench, sharded for distributed vector search.

## Dataset Info

- **Source**: VectorDBBench (Cohere)
- **Vectors**: 1,000,000
- **Dimensions**: 768
- **Data type**: float32
- **Queries**: 10,000
- **Distance**: L2

## DiskANN Parameters

- **R** (graph degree): 16, 32, 64
- **L** (build beam width): 100
- **PQ bytes**: 192

## Shard Configurations

- **shard_3**: 3 shards x ~333,333 vectors
- **shard_5**: 5 shards x ~200,000 vectors
- **shard_7**: 7 shards x ~142,857 vectors
- **shard_10**: 10 shards x ~100,000 vectors

## Index Variants (per shard directory)

- R=16: `cohere_medium_1m_16_100_192.shard*_disk.index`
- R=32: `cohere_medium_1m_32_100_192.shard*_disk.index`
- R=64: `cohere_medium_1m_64_100_192.shard*_disk.index`

## File Structure

```
fbin/
  base.fbin                # Base vectors (float32)
  queries.fbin             # Query vectors (float32)
parquet/
  train_*.parquet          # Original VectorDBBench parquet
  test.parquet             # Original queries parquet
diskann/
  gt_100.fbin              # Ground truth (100-NN)
  shard_N/                 # N-shard configuration
    cohere_medium_1m_base.shardX.fbin                                # Shard base data
    cohere_medium_1m_R_100_192.shardX_disk.index                     # DiskANN disk index
    cohere_medium_1m_R_100_192.shardX_disk.index_512_none.indices    # MinIO graph indices
    cohere_medium_1m_R_100_192.shardX_disk.index_base_none.vectors   # MinIO vector data
    cohere_medium_1m_R_100_192.shardX_pq_pivots.bin                  # PQ pivot data
    cohere_medium_1m_R_100_192.shardX_pq_compressed.bin              # PQ compressed data
    cohere_medium_1m_R_100_192.shardX_sample_data.bin                # Sample data
    cohere_medium_1m_R_100_192.shardX_sample_ids.bin                 # Sample IDs
```

Where R is one of 16, 32, 64 and X is the shard index.
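The `.fbin` files listed in the file structure can be sanity-checked before use. The sketch below assumes the layout commonly used by DiskANN tooling: an 8-byte header of two little-endian int32 values (vector count, dimension) followed by row-major float32 data. This README does not state the layout, so treat it as an assumption; `read_fbin_header` and `read_vector` are hypothetical helper names, not part of any library.

```python
import os
import struct
import sys
from array import array

# Assumed layout: int32 num_vectors, int32 dim, then num_vectors * dim float32 values.
HEADER_BYTES = 8
FLOAT_BYTES = 4


def read_fbin_header(path):
    """Return (num_vectors, dim) and check that the file size matches the header."""
    with open(path, "rb") as f:
        num_vectors, dim = struct.unpack("<ii", f.read(HEADER_BYTES))
    expected = HEADER_BYTES + num_vectors * dim * FLOAT_BYTES
    actual = os.path.getsize(path)
    if actual != expected:
        raise ValueError(f"{path}: expected {expected} bytes, found {actual}")
    return num_vectors, dim


def read_vector(path, index):
    """Read a single vector by row index without loading the whole file."""
    num_vectors, dim = read_fbin_header(path)
    if not 0 <= index < num_vectors:
        raise IndexError(f"index {index} out of range for {num_vectors} vectors")
    vec = array("f")
    with open(path, "rb") as f:
        f.seek(HEADER_BYTES + index * dim * FLOAT_BYTES)
        vec.fromfile(f, dim)
    if sys.byteorder != "little":
        vec.byteswap()  # file data is assumed little-endian
    return vec.tolist()
```

Under that assumption, `fbin/base.fbin` with 1,000,000 vectors of 768 dimensions would be 8 + 1,000,000 x 768 x 4 = 3,072,000,008 bytes, so a size mismatch is a quick hint that a download or reassembly is incomplete.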
### Chunked Files

Files larger than 5 GB are split into chunks for upload:

- `*.part0000`, `*.part0001`, etc.

To reassemble: `cat file.part0000 file.part0001 ... > file`

## Usage

### Download with huggingface_hub

```python
from huggingface_hub import hf_hub_download

# Download a specific shard file
index = hf_hub_download(
    repo_id="makneeee/cohere_medium_1m",
    filename="diskann/shard_10/cohere_medium_1m_64_100_192.shard0_disk.index",
    repo_type="dataset"
)
```

### Download with git-lfs

```bash
git lfs install
git clone https://huggingface.co/datasets/makneeee/cohere_medium_1m
```

## License

Same as source dataset (VectorDBBench).
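The `cat` reassembly step described under Chunked Files can also be done in Python, for example on systems without a POSIX shell. This is a minimal sketch: `reassemble` is a hypothetical helper, and it assumes the part suffixes are exactly four zero-padded digits as in `*.part0000`, so lexicographic sorting gives the correct order.

```python
import glob
import shutil


def reassemble(target):
    """Concatenate target.part0000, target.part0001, ... into target.

    Returns the list of part paths in the order they were written.
    """
    # glob.escape guards against special characters in the target path.
    pattern = glob.escape(target) + ".part[0-9][0-9][0-9][0-9]"
    parts = sorted(glob.glob(pattern))
    if not parts:
        raise FileNotFoundError(f"no chunk files found for {target}")
    with open(target, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream in blocks so multi-gigabyte chunks are not held in memory.
                shutil.copyfileobj(src, out)
    return parts
```

Calling `reassemble("file")` in the directory holding the chunks is intended to produce the same bytes as the `cat` command above.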
Provider:
makneeee