
filter-with-espresso/moltbook-embeddings-v2

Source: Hugging Face. Updated 2026-02-24. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/filter-with-espresso/moltbook-embeddings-v2
Dataset description:
---
dataset_info:
  features:
  - name: post_id
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
      length: 4096
  - name: embedding_768d
    sequence:
      dtype: float16
      length: 768
  splits:
  - name: train
    num_examples: 188692
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/train-*.parquet"
license: mit
language:
- en
tags:
- embeddings
- social-network
- ai-agents
- moltbook
---

# Moltbook Embeddings V2

Pre-computed embeddings for the [moltbook-files](https://huggingface.co/datasets/filter-with-espresso/moltbook-files) dataset.

## Model

**[Qwen/Qwen3-Embedding-8B](https://huggingface.co/Qwen/Qwen3-Embedding-8B)**: 8B-parameter embedding model, L2-normalized outputs.

## Processing

- **Filtered**: only posts with `content_len > 50` characters
- **Deduplicated**: exact vector dedup removed ~14% templated/duplicate posts
- **PCA-768d**: reduced from 4096 → 768 dimensions, L2-re-normalized, float16
  - Explained variance: 91.0%

## Columns

| Column | Type | Description |
|---|---|---|
| `post_id` | `string` | Join key to `moltbook-files` |
| `embedding` | `list[float32]` (4096) | Full Qwen3 embedding |
| `embedding_768d` | `list[float16]` (768) | PCA-reduced, ~10x smaller |

## Usage

```python
import numpy as np
from datasets import load_dataset

ds = load_dataset("filter-with-espresso/moltbook-embeddings-v2", split="train")

# Full embeddings
embs = np.array(ds["embedding"])

# Lightweight variant
embs_768 = np.array(ds["embedding_768d"], dtype=np.float16)
```

## Stats

- **Rows**: 188,692
- **Original rows (pre-dedup)**: 219,252
- **Embedding model**: `Qwen/Qwen3-Embedding-8B`
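
The Processing steps listed in the dataset card (exact vector dedup, PCA reduction, L2 re-normalization, float16 cast) can be illustrated with a minimal NumPy sketch. This is not the dataset author's pipeline: the card does not say which tooling was used, so the mean-centering, the SVD-based PCA, and the helper names `dedup_exact`, `pca_reduce`, and `top_k_cosine` are all assumptions made for illustration. The toy data is synthetic (64 dimensions reduced to 16) so the block runs without downloading anything.

```python
import numpy as np


def dedup_exact(embs: np.ndarray) -> np.ndarray:
    """Return sorted indices of the first occurrence of each distinct row."""
    _, first_idx = np.unique(embs, axis=0, return_index=True)
    return np.sort(first_idx)


def pca_reduce(embs: np.ndarray, n_components: int) -> tuple[np.ndarray, float]:
    """Project onto the top principal components, re-normalize, cast to float16.

    Assumption: PCA is computed on mean-centered vectors via SVD.
    Returns the reduced vectors and the fraction of variance they explain.
    """
    x = embs.astype(np.float64)
    x = x - x.mean(axis=0)
    _, s, vt = np.linalg.svd(x, full_matrices=False)
    reduced = x @ vt[:n_components].T
    norms = np.linalg.norm(reduced, axis=1, keepdims=True)
    reduced = reduced / np.maximum(norms, 1e-12)  # L2 re-normalization
    explained = float((s[:n_components] ** 2).sum() / (s ** 2).sum())
    return reduced.astype(np.float16), explained


def top_k_cosine(query: np.ndarray, embs: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k rows most similar to `query`.

    For unit-length vectors the dot product equals cosine similarity.
    Inputs are upcast to float32 so float16 storage does not hurt the matmul.
    """
    scores = embs.astype(np.float32) @ query.astype(np.float32)
    return np.argsort(-scores)[:k]


# Synthetic stand-in for the real embeddings (hypothetical toy data).
rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 64)).astype(np.float32)
toy /= np.linalg.norm(toy, axis=1, keepdims=True)
toy = np.vstack([toy, toy[:20]])  # append 20 exact duplicates

keep = dedup_exact(toy)
unique = toy[keep]
reduced, explained = pca_reduce(unique, n_components=16)
neighbors = top_k_cosine(reduced[0], reduced, k=3)

print(unique.shape, reduced.shape, reduced.dtype)
print(f"explained variance: {explained:.3f}")
print("nearest neighbors of row 0:", neighbors)
```

With the real dataset, the same `top_k_cosine` helper could be applied directly to the `embedding_768d` column, since the card states those vectors are already L2-re-normalized; the matching `post_id` values can then be used to look up posts in `moltbook-files`.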
Provider:
filter-with-espresso