filter-with-espresso/moltbook-embeddings-v2

Name: filter-with-espresso/moltbook-embeddings-v2
Creator: filter-with-espresso
Published: 2026-02-24 11:02:48
License: MIT

Source: Hugging Face | Updated: 2026-02-24 | Indexed: 2026-03-29

Download link:

https://hf-mirror.com/datasets/filter-with-espresso/moltbook-embeddings-v2

Description:

---
dataset_info:
  features:
  - name: post_id
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
      length: 4096
  - name: embedding_768d
    sequence:
      dtype: float16
      length: 768
  splits:
  - name: train
    num_examples: 188692
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/train-*.parquet"
license: mit
language:
- en
tags:
- embeddings
- social-network
- ai-agents
- moltbook
---

# Moltbook Embeddings V2

Pre-computed embeddings for the [moltbook-files](https://huggingface.co/datasets/filter-with-espresso/moltbook-files) dataset.

## Model

**[Qwen/Qwen3-Embedding-8B](https://huggingface.co/Qwen/Qwen3-Embedding-8B)**: 8B-parameter embedding model, L2-normalized outputs.

## Processing

- **Filtered**: only posts with `content_len > 50` characters
- **Deduplicated**: exact vector dedup removed ~14% templated/duplicate posts
- **PCA-768d**: reduced from 4096 → 768 dimensions, L2-re-normalized, float16
  - Explained variance: 91.0%

## Columns

| Column | Type | Description |
|---|---|---|
| `post_id` | `string` | Join key to `moltbook-files` |
| `embedding` | `list[float32]` (4096) | Full Qwen3 embedding |
| `embedding_768d` | `list[float16]` (768) | PCA-reduced, ~10x smaller |

## Usage

```python
import numpy as np
from datasets import load_dataset

ds = load_dataset("filter-with-espresso/moltbook-embeddings-v2", split="train")

# Full embeddings
embs = np.array(ds["embedding"])

# Lightweight variant
embs_768 = np.array(ds["embedding_768d"], dtype=np.float16)
```

## Stats

- **Rows**: 188,692
- **Original rows (pre-dedup)**: 219,252
- **Embedding model**: `Qwen/Qwen3-Embedding-8B`
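
The dataset card lists its processing steps (exact vector dedup, PCA reduction, L2 re-normalization, float16 cast) without showing how they fit together. The following is a minimal sketch of one way such a pipeline could look, not the author's actual code. It runs on small synthetic vectors rather than the real data, uses a plain NumPy SVD for the PCA step, and all function names (`l2_normalize`, `dedup_exact`, `pca_reduce`, `top_k`) are hypothetical. The sizes (64 → 16 dimensions) are placeholders for 4096 → 768.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm (all-zero rows are left unchanged)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.where(norms == 0, 1.0, norms)


def dedup_exact(x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Drop exact duplicate rows; return the kept rows and their original indices."""
    _, first_idx = np.unique(x, axis=0, return_index=True)
    keep = np.sort(first_idx)  # restore original row order
    return x[keep], keep


def pca_reduce(x: np.ndarray, n_components: int) -> tuple[np.ndarray, float]:
    """Project rows onto the top principal axes; also return the explained-variance ratio."""
    centered = x - x.mean(axis=0, keepdims=True)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    explained = float((s[:n_components] ** 2).sum() / (s ** 2).sum())
    return projected, explained


def top_k(query: np.ndarray, matrix: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k rows most similar to query (dot product = cosine for unit vectors)."""
    scores = matrix.astype(np.float32) @ query.astype(np.float32)
    return np.argsort(-scores)[:k]


# Synthetic stand-in data (NOT the real dataset): 200 unit vectors plus 20 exact repeats.
rng = np.random.default_rng(0)
base = l2_normalize(rng.standard_normal((200, 64)).astype(np.float32))
full = np.vstack([base, base[:20]])

unique, kept_idx = dedup_exact(full)                    # exact vector dedup
reduced, explained = pca_reduce(unique, n_components=16)  # PCA reduction
reduced = l2_normalize(reduced).astype(np.float16)      # re-normalize, then cast

neighbors = top_k(reduced[0], reduced, k=5)             # row 0 should rank itself first
```

Because both embedding columns are described as L2-normalized, a dot product is enough for cosine-similarity search, which is what `top_k` relies on. Casting to float32 before the matrix product avoids accumulating rounding error in float16.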

Provider:

filter-with-espresso
