NotHotTryHard/wikipedia-en-harrier-270m-emb
Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/NotHotTryHard/wikipedia-en-harrier-270m-emb
Resource description:
---
dataset_info:
  features:
  - name: chunk_id
    dtype: int64
  - name: article_title
    dtype: string
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
      length: 384
license: cc-by-sa-4.0
task_categories:
- feature-extraction
- text-retrieval
tags:
- wikipedia
- embeddings
- dense-retrieval
- fact-checking
- faiss
- harrier
language:
- en
size_categories:
- 10M<n<100M
pretty_name: Wikipedia EN Chunks + Harrier 270M Embeddings
---
# Wikipedia EN Chunks + Harrier 270M Embeddings
Pre-computed dense embeddings for **23.7M English Wikipedia chunks** using [microsoft/harrier-oss-v1-270m](https://huggingface.co/microsoft/harrier-oss-v1-270m) (384-dim).
## Related Datasets
| Dataset | Description |
|---|---|
| [NotHotTryHard/wikipedia-en-harrier-270m-emb](https://huggingface.co/datasets/NotHotTryHard/wikipedia-en-harrier-270m-emb) | This dataset: chunks embedded with the **Harrier 270M** model |
| [NotHotTryHard/wikipedia-en-harrier-0.6b-emb](https://huggingface.co/datasets/NotHotTryHard/wikipedia-en-harrier-0.6b-emb) | Same chunks, embedded with the larger **Harrier 0.6B** model |
## Dataset Details
### Source
- **Wikipedia dump**: [wikimedia/wikipedia 20231101.en](https://huggingface.co/datasets/wikimedia/wikipedia) (6.4M articles)
- **Chunking**: 200-word sliding window, 50-word overlap, min 50 characters
- **Total chunks**: ~23,758,035
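
The chunking parameters above can be illustrated with a short sketch. This is not the author's `chunk.py`; the function name `chunk_words`, the whitespace tokenization, and the handling of the final window are assumptions made for illustration only.

```python
def chunk_words(text, window=200, overlap=50, min_chars=50):
    """Split text into overlapping word windows (hypothetical re-creation)."""
    words = text.split()          # assumption: words are whitespace-separated
    step = window - overlap       # each new chunk advances by 150 words
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + window])
        if len(chunk) >= min_chars:   # drop chunks shorter than 50 characters
            chunks.append(chunk)
        if start + window >= len(words):
            break                     # this window already reached the end
    return chunks


# A 500-word toy article yields windows starting at words 0, 150 and 300.
toy = " ".join(f"w{i}" for i in range(500))
for c in chunk_words(toy):
    print(c.split()[0], "...", c.split()[-1])
```

With these settings, consecutive chunks share 50 words, so a sentence that straddles a chunk boundary is more likely to appear whole in at least one chunk.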
### Embeddings
- **Model**: `microsoft/harrier-oss-v1-270m`
- **Dimension**: 384
- **Normalization**: L2-normalized
- **Precision**: float32
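
Because the vectors are L2-normalized, the inner product of two embeddings equals their cosine similarity. The sketch below checks this on random stand-in vectors (not real Harrier embeddings); the helper name `l2_normalize` is hypothetical.

```python
import numpy as np


def l2_normalize(vectors):
    """Scale each row to unit Euclidean length."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms


rng = np.random.default_rng(0)
raw = rng.normal(size=(4, 384)).astype(np.float32)  # stand-in data only

unit = l2_normalize(raw)

# Cosine similarity computed from the raw (unnormalized) vectors.
raw_norms = np.linalg.norm(raw, axis=1)
cosine = (raw @ raw.T) / np.outer(raw_norms, raw_norms)

# Plain inner product of the normalized vectors.
inner = unit @ unit.T

print(np.allclose(inner, cosine, atol=1e-5))
```

This is why an inner-product index can be used for cosine-similarity search over this dataset without any further preprocessing of the stored vectors.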
### Schema
| Column | Type | Description |
|---|---|---|
| `chunk_id` | int64 | Unique chunk identifier (sequential) |
| `article_title` | string | Wikipedia article title |
| `text` | string | Chunk text (~200 words) |
| `embedding` | list[float32] x 384 | L2-normalized dense vector |
### Storage
- **Format**: Parquet shards with ZSTD compression
- **Naming**: `data/train-XXXXX-of-NNNNN.parquet`
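
As a small illustration of the naming pattern, the helper below builds a shard path from a shard index and a shard count. It assumes that `XXXXX` and `NNNNN` are five-digit, zero-padded numbers; the function name `shard_path` and the shard count used in the example are hypothetical.

```python
def shard_path(index, num_shards):
    """Build a shard path following data/train-XXXXX-of-NNNNN.parquet."""
    # Assumption: both fields are zero-padded to five digits.
    return f"data/train-{index:05d}-of-{num_shards:05d}.parquet"


# The shard count 48 is a made-up value for illustration.
print(shard_path(0, 48))
```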
## Usage
```python
from datasets import load_dataset

# The repository ID must match this dataset's name on the Hub.
ds = load_dataset("NotHotTryHard/wikipedia-en-harrier-270m-emb", split="train")
print(ds[0])
# {'chunk_id': 0, 'article_title': 'Anarchism', 'text': '...', 'embedding': [0.012, ...]}
```
### Building a FAISS Index
```python
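import numpy as np
import faiss

# Loads every vector into RAM: 23,758,035 x 384 float32 values is about 36.5 GB.
embeddings = np.array(ds["embedding"], dtype=np.float32)
# The vectors are L2-normalized, so inner product equals cosine similarity.
index = faiss.IndexFlatIP(384)
index.add(embeddings)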
```
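
To make clear what a flat inner-product index computes, here is a NumPy-only sketch of exact top-k search over a toy corpus. It is a stand-in for illustration, not FAISS itself; the function name `top_k_inner_product` and the random corpus are assumptions. With a real FAISS index, the equivalent call would be `index.search(queries, k)`, where the queries are embedded with the same model as the corpus.

```python
import numpy as np


def top_k_inner_product(query, embeddings, k=5):
    """Return indices and scores of the k rows with the highest inner product."""
    scores = embeddings @ query                 # one score per stored vector
    k = min(k, len(scores))
    top = np.argpartition(-scores, k - 1)[:k]   # unordered top-k candidates
    top = top[np.argsort(-scores[top])]         # sort candidates by score, descending
    return top, scores[top]


# Toy corpus of random unit vectors standing in for real embeddings.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(100, 384)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# Querying with a stored vector should return that vector first.
ids, sims = top_k_inner_product(corpus[7], corpus, k=3)
print(ids, sims)
```

A brute-force scan like this is exact but linear in the number of vectors; for the full dataset an approximate FAISS index type may be a better fit, depending on latency and memory constraints.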
## Pipeline
```
Wikipedia 20231101.en (6.4M articles)
-> chunk.py (200w window, 50w overlap)
-> 23.7M chunks in SQLite
-> embed.py (harrier-oss-v1-270m, parallel GPU shards)
-> export_parquet.py -> this dataset
```
## License
Wikipedia content is licensed under [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Provider:
NotHotTryHard



