filter-with-espresso/moltbook-embeddings-v2
Source: Hugging Face. Updated 2026-02-24; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/filter-with-espresso/moltbook-embeddings-v2
Description:
---
dataset_info:
  features:
    - name: post_id
      dtype: string
    - name: embedding
      sequence:
        dtype: float32
        length: 4096
    - name: embedding_768d
      sequence:
        dtype: float16
        length: 768
  splits:
    - name: train
      num_examples: 188692
configs:
  - config_name: default
    data_files:
      - split: train
        path: "data/train-*.parquet"
license: mit
language:
  - en
tags:
  - embeddings
  - social-network
  - ai-agents
  - moltbook
---
# Moltbook Embeddings V2
Pre-computed embeddings for the [moltbook-files](https://huggingface.co/datasets/filter-with-espresso/moltbook-files) dataset.
## Model
**[Qwen/Qwen3-Embedding-8B](https://huggingface.co/Qwen/Qwen3-Embedding-8B)**: an 8B-parameter embedding model; outputs are L2-normalized.
## Processing
- **Filtered**: only posts with `content_len > 50` characters
- **Deduplicated**: exact-vector deduplication removed ~14% of posts as templated or exact duplicates (219,252 → 188,692 rows)
- **PCA-768d**: reduced from 4096 → 768 dimensions, L2-re-normalized, float16
  - Explained variance: 91.0%
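
The deduplication and PCA steps above can be sketched roughly as follows. This is a minimal illustration on random toy data, not the pipeline actually used to build the dataset: the helper names `l2_normalize`, `dedup_exact`, and `pca_reduce` are hypothetical, and PCA is done here with a plain NumPy SVD, which may differ from the library the author used.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def dedup_exact(embs: np.ndarray) -> np.ndarray:
    """Return sorted indices of the first occurrence of each distinct row."""
    _, first_idx = np.unique(embs, axis=0, return_index=True)
    return np.sort(first_idx)


def pca_reduce(embs: np.ndarray, k: int):
    """Project rows onto the top-k principal components.

    Returns (float16 unit vectors, fraction of variance explained).
    """
    centered = embs - embs.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions; s holds the singular values.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = float((s[:k] ** 2).sum() / (s ** 2).sum())
    reduced = l2_normalize(centered @ vt[:k].T)
    return reduced.astype(np.float16), explained


# Toy stand-in for the real 4096-d embeddings (hypothetical data).
rng = np.random.default_rng(0)
toy = l2_normalize(rng.normal(size=(200, 32)).astype(np.float32))
toy = np.vstack([toy, toy[:20]])  # inject 20 exact duplicates

keep = dedup_exact(toy)
reduced, explained = pca_reduce(toy[keep], k=8)
print(len(keep), reduced.shape, reduced.dtype, round(explained, 3))
```

On the real data the same idea would apply with `k=768`; the 91.0% figure reported above corresponds to the `explained` value in this sketch.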
## Columns
| Column | Type | Description |
|---|---|---|
| `post_id` | `string` | Join key to `moltbook-files` |
| `embedding` | `list[float32]` (4096) | Full Qwen3 embedding |
| `embedding_768d` | `list[float16]` (768) | PCA-reduced, ~10x smaller |
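
The "~10x smaller" figure follows from the dtypes and lengths in the table. A quick check of the raw per-row payload, assuming we ignore Parquet compression and metadata overhead:

```python
import numpy as np

# Per-row payload sizes implied by the column types above.
full_bytes = 4096 * np.dtype(np.float32).itemsize    # 16384 bytes
reduced_bytes = 768 * np.dtype(np.float16).itemsize  # 1536 bytes

rows = 188_692
ratio = full_bytes / reduced_bytes

print(f"full: {full_bytes * rows / 1e9:.2f} GB")
print(f"768d: {reduced_bytes * rows / 1e9:.2f} GB")
print(f"ratio: {ratio:.1f}x")
```

Roughly a third of the saving comes from halving the element width (float32 to float16); the rest comes from the dimensionality reduction.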
## Usage
```python
import numpy as np
from datasets import load_dataset

ds = load_dataset("filter-with-espresso/moltbook-embeddings-v2", split="train")

# Full embeddings: shape (188692, 4096), roughly 3 GB in memory as float32
embs = np.array(ds["embedding"], dtype=np.float32)

# Lightweight variant: shape (188692, 768), float16
embs_768 = np.array(ds["embedding_768d"], dtype=np.float16)
```
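
Because both embedding columns are L2-normalized, cosine similarity reduces to a dot product. Below is a minimal nearest-neighbour sketch. It uses random unit vectors as a stand-in for `embs_768` so that it runs without downloading the dataset, and `top_k_similar` is a hypothetical helper, not part of any library.

```python
import numpy as np


def top_k_similar(embs: np.ndarray, query_idx: int, k: int = 5) -> np.ndarray:
    """Indices of the k rows most similar to embs[query_idx], excluding itself."""
    embs = embs.astype(np.float32)   # upcast float16 before the matmul
    scores = embs @ embs[query_idx]  # cosine similarity for unit vectors
    scores[query_idx] = -np.inf      # exclude the query row
    top = np.argpartition(-scores, k)[:k]
    return top[np.argsort(-scores[top])]  # sort the k hits, best first


# Random unit vectors standing in for the real float16 embeddings.
rng = np.random.default_rng(0)
demo = rng.normal(size=(1000, 768)).astype(np.float32)
demo /= np.linalg.norm(demo, axis=1, keepdims=True)
demo = demo.astype(np.float16)

neighbors = top_k_similar(demo, query_idx=0, k=5)
print(neighbors)
```

With the real data, `ds["post_id"]` at the returned indices would give the join keys for looking the posts up in `moltbook-files`.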
## Stats
- **Rows**: 188,692
- **Original rows (pre-dedup)**: 219,252
- **Embedding model**: `Qwen/Qwen3-Embedding-8B`
Provider:
filter-with-espresso



