makneeee/bioasq_large_10m
Hosted on Hugging Face · Updated 2026-02-18 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/makneeee/bioasq_large_10m
Description:
---
license: other
task_categories:
- feature-extraction
tags:
- vector-search
- diskann
- nearest-neighbor
- benchmark
- vectordbbench
pretty_name: "BioASQ Large 10M - Sharded DiskANN Indices"
size_categories:
- 1M<n<10M
---
# BioASQ Large 10M - Sharded DiskANN Indices
Pre-built DiskANN indices for the BioASQ Large 10M dataset from VectorDBBench, sharded for distributed vector search.
## Dataset Info
- **Source**: VectorDBBench (BioASQ)
- **Vectors**: 10,000,000
- **Dimensions**: 1024
- **Data type**: float32
- **Queries**: 10,000
- **Distance**: L2
## DiskANN Parameters
- **R** (graph degree): 16, 32, 64
- **L** (build beam width): 100
- **PQ bytes**: 256
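As a rough size check, the figures listed above can be turned into storage estimates. This is a minimal sketch that assumes "PQ bytes" means the number of bytes each vector occupies after product quantization; it ignores file headers, the graph itself, and the PQ pivot tables, so real files will be somewhat larger.

```python
# Back-of-the-envelope storage estimate from the dataset card figures.
# Assumption: PQ_BYTES is the compressed size of one vector in bytes.
NUM_VECTORS = 10_000_000
DIM = 1024
BYTES_PER_FLOAT32 = 4
PQ_BYTES = 256

raw_bytes_per_vector = DIM * BYTES_PER_FLOAT32      # full-precision vector
raw_total_bytes = NUM_VECTORS * raw_bytes_per_vector
pq_total_bytes = NUM_VECTORS * PQ_BYTES
compression_ratio = raw_bytes_per_vector / PQ_BYTES

print(f"raw vector:    {raw_bytes_per_vector} bytes")
print(f"raw base data: {raw_total_bytes / 1e9:.2f} GB")
print(f"PQ codes:      {pq_total_bytes / 1e9:.2f} GB")
print(f"compression:   {compression_ratio:.0f}x")
```

Under these assumptions the full-precision base data is about 41 GB, while the PQ codes for all vectors take about 2.6 GB.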
## Shard Configurations
- **shard_3**: 3 shards x ~3,333,333 vectors
- **shard_5**: 5 shards x ~2,000,000 vectors
- **shard_7**: 7 shards x ~1,428,571 vectors
- **shard_10**: 10 shards x ~1,000,000 vectors
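The approximate shard sizes above follow from dividing 10,000,000 vectors as evenly as possible. The sketch below reproduces that arithmetic. The helper `shard_sizes` is hypothetical, and placing the leftover vectors in the first shards is an assumption: the card does not say how the remainder is actually distributed.

```python
def shard_sizes(total: int, num_shards: int) -> list[int]:
    """Split `total` items into near-equal shard sizes.

    Assumption: the first `remainder` shards each take one extra item.
    The real shards in this dataset may distribute the remainder differently.
    """
    base, remainder = divmod(total, num_shards)
    return [base + 1 if i < remainder else base for i in range(num_shards)]


for n in (3, 5, 7, 10):
    sizes = shard_sizes(10_000_000, n)
    print(f"shard_{n}: min={min(sizes):,} max={max(sizes):,} total={sum(sizes):,}")
```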
## Index Variants (per shard directory)
- R=16: `bioasq_large_10m_16_100_256.shard*_disk.index`
- R=32: `bioasq_large_10m_32_100_256.shard*_disk.index`
- R=64: `bioasq_large_10m_64_100_256.shard*_disk.index`
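The naming pattern above can be expanded programmatically when a script needs the path of one specific index. This is a small sketch: `disk_index_path` is a hypothetical helper, the `diskann/shard_N/` prefix is taken from the file structure in this card, and zero-based shard numbering is an assumption inferred from the `shard0` file name used in the download example.

```python
VALID_R = (16, 32, 64)
VALID_SHARD_COUNTS = (3, 5, 7, 10)


def disk_index_path(num_shards: int, r: int, shard: int) -> str:
    """Build the repository path of one DiskANN disk index file.

    Assumption: shards are numbered from 0 to num_shards - 1.
    """
    if num_shards not in VALID_SHARD_COUNTS:
        raise ValueError(f"unknown shard configuration: {num_shards}")
    if r not in VALID_R:
        raise ValueError(f"unknown graph degree R: {r}")
    if not 0 <= shard < num_shards:
        raise ValueError(f"shard index {shard} out of range for {num_shards} shards")
    # 100 is the build beam width L and 256 the PQ bytes, fixed for all variants.
    return (
        f"diskann/shard_{num_shards}/"
        f"bioasq_large_10m_{r}_100_256.shard{shard}_disk.index"
    )


print(disk_index_path(10, 64, 0))
```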
## File Structure
```
fbin/
  base.fbin                 # Base vectors (float32)
  queries.fbin              # Query vectors (float32)
parquet/
  train_*.parquet           # Original VectorDBBench parquet
  test.parquet              # Original queries parquet
diskann/
  gt_100.fbin               # Ground truth (100-NN)
  shard_N/                  # N-shard configuration
    bioasq_large_10m_base.shardX.fbin                               # Shard base data
    bioasq_large_10m_R_100_256.shardX_disk.index                    # DiskANN disk index
    bioasq_large_10m_R_100_256.shardX_disk.index_512_none.indices   # MinIO graph indices
    bioasq_large_10m_R_100_256.shardX_disk.index_base_none.vectors  # MinIO vector data
    bioasq_large_10m_R_100_256.shardX_pq_pivots.bin                 # PQ pivot data
    bioasq_large_10m_R_100_256.shardX_pq_compressed.bin             # PQ compressed data
    bioasq_large_10m_R_100_256.shardX_sample_data.bin               # Sample data
    bioasq_large_10m_R_100_256.shardX_sample_ids.bin                # Sample IDs
```
Where R is one of 16, 32, 64 and X is the shard index.
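To inspect the `.fbin` files without extra dependencies, a reader along the following lines may help. It assumes the layout commonly used by DiskANN tooling: two little-endian int32 values (number of vectors, dimensions) followed by row-major float32 data. The card does not spell out this layout, so check the header values against the figures in Dataset Info before relying on it. Both function names are hypothetical.

```python
import struct

HEADER_BYTES = 8  # two int32 values: number of vectors, dimensions


def read_fbin_header(path: str) -> tuple[int, int]:
    """Return (num_vectors, dim) from the assumed 8-byte header."""
    with open(path, "rb") as f:
        num_vectors, dim = struct.unpack("<ii", f.read(HEADER_BYTES))
    return num_vectors, dim


def read_fbin_vector(path: str, index: int) -> list[float]:
    """Read a single float32 vector by row index without loading the file."""
    with open(path, "rb") as f:
        num_vectors, dim = struct.unpack("<ii", f.read(HEADER_BYTES))
        if not 0 <= index < num_vectors:
            raise IndexError(f"vector {index} out of range (0..{num_vectors - 1})")
        # Rows are stored back to back, 4 bytes per float32 component.
        f.seek(HEADER_BYTES + index * dim * 4)
        return list(struct.unpack(f"<{dim}f", f.read(dim * 4)))
```

For example, `read_fbin_header("fbin/base.fbin")` should report 10,000,000 vectors of 1024 dimensions if the assumed layout holds.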
### Chunked Files
Files larger than 5 GB are split into chunks for upload:
- `*.part0000`, `*.part0001`, etc.
To reassemble: `cat file.part0000 file.part0001 ... > file`
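On systems without `cat`, or when reassembly is part of a Python pipeline, the same concatenation can be sketched as below. The helper `reassemble` is hypothetical. It relies on the zero-padded four-digit suffixes shown above so that lexicographic sorting matches chunk order, and it assumes all parts sit in the same directory as the target file.

```python
import shutil
from pathlib import Path


def reassemble(target: str) -> Path:
    """Concatenate `<target>.part0000`, `<target>.part0001`, ... into `<target>`."""
    target_path = Path(target)
    # Zero-padded suffixes mean a plain sort yields the correct chunk order.
    parts = sorted(target_path.parent.glob(target_path.name + ".part*"))
    if not parts:
        raise FileNotFoundError(f"no chunk files found for {target_path}")
    with open(target_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream in 16 MiB blocks so large chunks never sit fully in memory.
                shutil.copyfileobj(src, out, length=16 * 1024 * 1024)
    return target_path
```

The chunk files are left in place, so enough free disk space for a second copy of the data is needed until they are deleted.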
## Usage
### Download with huggingface_hub
```python
from huggingface_hub import hf_hub_download

# Download a specific shard file (R=64, shard 0 of the 10-shard configuration).
# hf_hub_download returns the local path of the cached file.
index = hf_hub_download(
    repo_id="makneeee/bioasq_large_10m",
    filename="diskann/shard_10/bioasq_large_10m_64_100_256.shard0_disk.index",
    repo_type="dataset",
)
```
### Download with git-lfs
```bash
git lfs install
git clone https://huggingface.co/datasets/makneeee/bioasq_large_10m
```
## License
Same as source dataset (VectorDBBench).
Provider:
makneeee



