
NotHotTryHard/wikipedia-en-harrier-270m-emb

Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/NotHotTryHard/wikipedia-en-harrier-270m-emb
Resource description:
---
dataset_info:
  features:
  - name: chunk_id
    dtype: int64
  - name: article_title
    dtype: string
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
      length: 384
license: cc-by-sa-4.0
task_categories:
- feature-extraction
- text-retrieval
tags:
- wikipedia
- embeddings
- dense-retrieval
- fact-checking
- faiss
- harrier
language:
- en
size_categories:
- 10M<n<100M
pretty_name: Wikipedia EN Chunks + Harrier 270M Embeddings
---

# Wikipedia EN Chunks + Harrier 270M Embeddings

Pre-computed dense embeddings for **23.7M English Wikipedia chunks** using [microsoft/harrier-oss-v1-270m](https://huggingface.co/microsoft/harrier-oss-v1-270m) (384-dim).

## Related Datasets

| Dataset | Description |
|---|---|
| [NotHotTryHard/wikipedia-en-harrier-270m-emb](https://huggingface.co/datasets/NotHotTryHard/wikipedia-en-harrier-270m-emb) | This dataset: chunks embedded with the **Harrier 270M** model |
| [NotHotTryHard/wikipedia-en-harrier-0.6b-emb](https://huggingface.co/datasets/NotHotTryHard/wikipedia-en-harrier-0.6b-emb) | Same chunks, embedded with the larger **Harrier 0.6B** model |

## Dataset Details

### Source

- **Wikipedia dump**: [wikimedia/wikipedia 20231101.en](https://huggingface.co/datasets/wikimedia/wikipedia) (6.4M articles)
- **Chunking**: 200-word sliding window, 50-word overlap, min 50 characters
- **Total chunks**: ~23,758,035

### Embeddings

- **Model**: `microsoft/harrier-oss-v1-270m`
- **Dimension**: 384
- **Normalization**: L2-normalized
- **Precision**: float32

### Schema

| Column | Type | Description |
|---|---|---|
| `chunk_id` | int64 | Unique chunk identifier (sequential) |
| `article_title` | string | Wikipedia article title |
| `text` | string | Chunk text (~200 words) |
| `embedding` | list[float32] x 384 | L2-normalized dense vector |

### Storage

- **Format**: Parquet shards with ZSTD compression
- **Naming**: `data/train-XXXXX-of-NNNNN.parquet`

## Usage

```python
from datasets import load_dataset

# The repository ID matches the dataset URL above.
ds = load_dataset("NotHotTryHard/wikipedia-en-harrier-270m-emb", split="train")
print(ds[0])
# {'chunk_id': 0, 'article_title': 'Anarchism', 'text': '...', 'embedding': [0.012, ...]}
```

### Building a FAISS Index

```python
import numpy as np
import faiss

# `ds` is the dataset loaded in the Usage example above.
# Materializing every vector at once needs roughly 36 GB of RAM
# (about 23.7M rows x 384 dims x 4 bytes).
embeddings = np.array(ds["embedding"], dtype=np.float32)

# Vectors are L2-normalized, so inner product equals cosine similarity.
index = faiss.IndexFlatIP(384)
index.add(embeddings)
```

## Pipeline

```
Wikipedia 20231101.en (6.4M articles)
  -> chunk.py (200w window, 50w overlap)
  -> 23.7M chunks in SQLite
  -> embed.py (harrier-oss-v1-270m, parallel GPU shards)
  -> export_parquet.py
  -> this dataset
```

## License

Wikipedia content is under [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/).
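
The chunking parameters listed under Source (200-word window, 50-word overlap, 50-character minimum) imply a stride of 150 words between window starts. The `chunk.py` script itself is not part of this card, so the following is only a minimal sketch of how such a sliding window might work. The function name `chunk_words`, the whitespace tokenization, and the handling of the final window are all assumptions, not the author's implementation.

```python
def chunk_words(text, window=200, overlap=50, min_chars=50):
    """Split text into overlapping word windows (hypothetical helper)."""
    words = text.split()          # assumption: words are whitespace-delimited
    stride = window - overlap     # 150 words between consecutive window starts
    chunks = []
    for start in range(0, len(words), stride):
        chunk = " ".join(words[start:start + window])
        if len(chunk) >= min_chars:   # drop fragments shorter than the minimum
            chunks.append(chunk)
        if start + window >= len(words):
            break                     # the last window already reached the end
    return chunks


# A 500-word synthetic article yields windows starting at words 0, 150, and 300.
sample = " ".join(f"w{i}" for i in range(500))
sample_chunks = chunk_words(sample)
```

With these defaults, consecutive chunks share their boundary 50 words, so a sentence that straddles a window edge still appears whole in at least one chunk as long as it is shorter than the overlap.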
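
Because the embeddings are L2-normalized, the inner product that `faiss.IndexFlatIP` computes is the same as cosine similarity. The sketch below illustrates that ranking with plain NumPy on random stand-in vectors rather than the real dataset. The helper `top_k_inner_product` and the synthetic data are illustrative assumptions, and a real query vector would have to come from the same `microsoft/harrier-oss-v1-270m` model.

```python
import numpy as np


def top_k_inner_product(query, embeddings, k=5):
    """Return (indices, scores) of the k largest inner products (hypothetical helper)."""
    scores = embeddings @ query                    # one dot product per row
    k = min(k, len(scores))
    top = np.argpartition(-scores, k - 1)[:k]      # unordered top-k candidates
    top = top[np.argsort(-scores[top])]            # sort candidates by descending score
    return top, scores[top]


# Stand-in data: 1,000 random 384-dim vectors, L2-normalized like the dataset's.
rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 384)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

# Querying with a stored vector should return that vector first, with score ~1.0.
idx, scores = top_k_inner_product(vectors[42], vectors, k=3)
```

The returned indices correspond to row positions, which in this dataset would map to `chunk_id` only if rows are kept in their original sequential order.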
Provider:
NotHotTryHard