igriv/dire-arxiv-bge-small-embeddings
Source: Hugging Face. Updated 2026-04-28; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/igriv/dire-arxiv-bge-small-embeddings
---
license: cc-by-4.0
language:
- en
size_categories:
- 100K<n<1M
task_categories:
- feature-extraction
tags:
- arxiv
- embeddings
- bge
- dimensionality-reduction
- topology
- dire
- umap
pretty_name: DiRe arXiv BGE-small paper-level embeddings
configs:
- config_name: default
  data_files:
  - split: train
    path: embeddings/part-*.parquet
- config_name: metadata
  data_files:
  - split: train
    path: metadata.parquet
- config_name: layouts_2d
  data_files:
  - split: train
    path: layouts/layouts_2d.parquet
- config_name: layouts_3d
  data_files:
  - split: train
    path: layouts/layouts_3d.parquet
---
# DiRe arXiv BGE-small paper-level embeddings
Paper-level mean-pooled embeddings for **723,457 arXiv papers**, produced with
[`BAAI/bge-small-en-v1.5`](https://huggingface.co/BAAI/bge-small-en-v1.5),
together with arXiv descriptive metadata, 2-d and 3-d DiRe / UMAP layouts of
the full corpus, and provenance manifests.
This is the dataset underlying the arXiv experiment in:
> Kolpakov, A. and Rivin, I. *DiRe: Topology-Faithful Dimensionality Reduction.*
> (PNAS submission, 2026.) Igor Rivin: [ORCID 0000-0001-9302-2169](https://orcid.org/0000-0001-9302-2169).
The pipeline is: 130 M LaTeX-derived chunks → unit-normalized BGE-small chunk
embeddings → mean-pool per paper (one 384-d vector per paper) → DiRe / UMAP
projection. The released embeddings are the **mean-pooled, un-renormalized**
vectors (norms roughly 0.83–1.0; the mean of unit vectors has norm at most 1).
Downstream code typically L2-renormalizes them before further use.
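The renormalization step mentioned above can be sketched as follows. This is a minimal illustration, not the authors' code: the helper name `l2_renormalize` and the `eps` guard are assumptions, and the toy data is random rather than real BGE output.

```python
import numpy as np

def l2_renormalize(X: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row of X to unit L2 norm (hypothetical helper).

    `eps` only guards against division by zero for all-zero rows.
    """
    X = np.asarray(X, dtype=np.float32)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)

# Toy stand-in for one paper: five unit-norm "chunk" vectors, then mean-pooling.
rng = np.random.default_rng(0)
chunks = l2_renormalize(rng.normal(size=(5, 384)))
paper_vec = chunks.mean(axis=0, keepdims=True)   # norm <= 1, like the released rows
paper_unit = l2_renormalize(paper_vec)           # unit norm, ready for cosine search
```

Applied to the full matrix, `l2_renormalize(X)` returns an array of the same shape with unit-norm rows.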
## What's in the release
| Config | File(s) | Rows | Schema |
|---|---|---:|---|
| `default` | `embeddings/part-*.parquet` (8 shards, ~1.03 GB) | 723,457 | `arxiv_id: string`, `embedding: fixed_size_list<float32, 384>` |
| `metadata` | `metadata.parquet` | 723,457 | `arxiv_id, title, primary_category, categories: list<string>, n_chunks` |
| `layouts_2d` | `layouts/layouts_2d.parquet` | 7,234,570 | `arxiv_id, method ∈ {dire, umap}, n_neighbors ∈ {8,16,32,64,128}, x, y` |
| `layouts_3d` | `layouts/layouts_3d.parquet` | 1,446,914 | `arxiv_id, method, n_neighbors=16, x, y, z` |
Provenance lives in `provenance/`:
- `embedding_config.json` — model, pooling, normalization, embedding dim
- `software_versions.json` — Python / library versions used at build time
- `file_hashes.json` — SHA-256 of every released file
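A downloaded copy can be checked against the hash manifest along these lines. The layout of `file_hashes.json` is not documented here, so this sketch assumes a flat JSON object mapping each relative file path to its hex SHA-256 digest; adjust the parsing if the real file is structured differently. The function names are hypothetical.

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path, block_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large shards need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()

def verify_release(root: Path,
                   manifest_name: str = "provenance/file_hashes.json") -> dict:
    """Return {relative_path: matches} for every entry in the manifest.

    ASSUMPTION: the manifest is a flat {"relative/path": "hex digest"} mapping.
    """
    manifest = json.loads((root / manifest_name).read_text())
    return {rel: sha256_of(root / rel) == expected.lower()
            for rel, expected in manifest.items()}
```

Calling `verify_release(Path("path/to/local/copy"))` returns a dictionary in which any `False` value marks a file whose contents differ from the manifest.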
## Quick start
```python
from datasets import load_dataset
import numpy as np

# 384-d paper embeddings (default config)
ds = load_dataset("igriv/dire-arxiv-bge-small-embeddings", split="train")
X = np.vstack(ds["embedding"]).astype("float32")  # (723457, 384)
ids = np.asarray(ds["arxiv_id"])

# arXiv descriptive metadata
meta = load_dataset("igriv/dire-arxiv-bge-small-embeddings",
                    name="metadata", split="train").to_pandas()

# Pre-computed 2-d layouts (filter to the canonical setting used in the paper)
layouts = load_dataset("igriv/dire-arxiv-bge-small-embeddings",
                       name="layouts_2d", split="train").to_pandas()
canonical = layouts[(layouts.method == "dire") & (layouts.n_neighbors == 16)]
```
A minimal loader is also included as `load_example.py`.
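The card does not state whether the `metadata` rows come in the same order as the embedding rows, so it may be safer to align them explicitly on `arxiv_id`. The helper below is a hypothetical sketch of that step; it assumes only that `arxiv_id` is a column of the metadata frame, as listed in the schema.

```python
import pandas as pd

def align_metadata(ids, meta: pd.DataFrame) -> pd.DataFrame:
    """Return the rows of `meta` reordered to match `ids` (hypothetical helper).

    Raises ValueError if arxiv_id is duplicated in `meta`, and KeyError if
    any requested id is missing from it.
    """
    indexed = meta.set_index("arxiv_id")
    if not indexed.index.is_unique:
        raise ValueError("arxiv_id values in metadata are not unique")
    return indexed.loc[list(ids)].reset_index()
```

With the quick-start variables, `align_metadata(ids, meta)` gives a frame whose row `i` describes the paper embedded in `X[i]`.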
## Embedding details
| Property | Value |
|---|---|
| Model | `BAAI/bge-small-en-v1.5` |
| Embedding dim | 384 |
| Source text | LaTeX-derived chunks (paper body + abstract) |
| Number of chunks pooled | 130,028,430 across 723,457 papers (~180/paper avg) |
| Pooling | mean over chunk vectors (each chunk vector is BGE's L2-unit output) |
| Released dtype | `float32` |
| Released normalization | **un-normalized mean-pool** (typical norm 0.83–1.0); re-normalize to L2 unit before use if you want cosine ↔ Euclidean equivalence |
| Per-paper chunk count | available as `n_chunks` in `metadata` |
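The cosine ↔ Euclidean equivalence in the table rests on the identity ‖u − v‖² = 2 − 2·cos(u, v) for unit vectors; for un-normalized vectors the squared distance is ‖u‖² + ‖v‖² − 2·u·v, so the two rankings can differ. A small numerical check, using random vectors as a stand-in for real embeddings (the function names are illustrative only):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between u and v (independent of their norms)."""
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def squared_distance_after_renorm(u: np.ndarray, v: np.ndarray) -> float:
    """Squared Euclidean distance between u and v after scaling both to unit norm."""
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return float(((u - v) ** 2).sum())

rng = np.random.default_rng(1)
u, v = rng.normal(size=(2, 384))
lhs = squared_distance_after_renorm(u, v)
rhs = 2.0 - 2.0 * cosine_similarity(u, v)   # equals lhs up to rounding error
```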
## Layouts
The 2-d layouts are the full corpus projected down to 2-d for both DiRe (force-directed,
GPU; [dire-rapids](https://github.com/sashakolpakov/dire-rapids)) and cuML UMAP. The
sweep over `n_neighbors ∈ {8, 16, 32, 64, 128}` is included so readers can reproduce
every figure in the paper directly. The canonical setting is `n_neighbors=16, method=dire`.
3-d layouts are provided at `n_neighbors=16` for both methods.
Layout coordinates are in raw method output units (arbitrary; not standardized).
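Because the layout tables are in long format and the coordinates are in arbitrary units, a typical first step is to pick one `(method, n_neighbors)` setting and rescale it before comparing or plotting. The sketch below assumes the 2-d schema listed above; the helper names and the choice of scale (unit RMS radius, one common factor for both axes so the aspect ratio is kept) are illustrative, not the paper's convention.

```python
import numpy as np
import pandas as pd

def select_layout(layouts: pd.DataFrame, method: str = "dire",
                  n_neighbors: int = 16) -> pd.DataFrame:
    """Extract one 2-d layout as an (x, y) frame indexed by arxiv_id."""
    mask = (layouts["method"] == method) & (layouts["n_neighbors"] == n_neighbors)
    return layouts.loc[mask, ["arxiv_id", "x", "y"]].set_index("arxiv_id")

def standardize(coords: pd.DataFrame) -> pd.DataFrame:
    """Center the layout and divide by its RMS radius (single global scale)."""
    centered = coords - coords.mean(axis=0)
    scale = float(np.sqrt((centered.to_numpy() ** 2).sum(axis=1).mean()))
    return centered / scale
```

For example, `standardize(select_layout(layouts, "umap", 32))` yields a centered UMAP layout with unit RMS radius, which can be overlaid on a DiRe layout treated the same way.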
## Reproducing the embeddings
The pooled embeddings here are the only thing required to reproduce the paper's
arXiv experiments (DiRe / UMAP layouts, kNN-preservation evaluation, Betti-curve
DTW, island-ness diagnostics). The underlying chunk-level embeddings are
**not** redistributed — they can be regenerated from arXiv source text using the
public BGE model and the chunking pipeline at <https://github.com/sashakolpakov/dire-rapids-arxiv>.
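As an illustration of what a kNN-preservation evaluation can look like, the sketch below computes the mean fraction of each point's k nearest neighbours in the embedding space that are also among its k nearest neighbours in the layout. This is one common definition and is an assumption here; the paper's exact metric may differ. The brute-force distance matrix is only practical on a subsample, not on all 723,457 papers.

```python
import numpy as np

def knn_indices(X: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k nearest neighbours of each row (Euclidean, self excluded).

    Brute force: builds the full n x n distance matrix, so use a subsample.
    """
    X = np.asarray(X, dtype=np.float64)
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)          # a point is not its own neighbour
    return np.argsort(d2, axis=1)[:, :k]

def knn_preservation(X_high: np.ndarray, X_low: np.ndarray, k: int = 10) -> float:
    """Mean overlap between k-NN sets in X_high and X_low, in [0, 1]."""
    hi = knn_indices(X_high, k)
    lo = knn_indices(X_low, k)
    overlap = [len(set(a) & set(b)) for a, b in zip(hi, lo)]
    return float(np.mean(overlap)) / k
```

A typical call would pass a random subsample of renormalized embedding rows as `X_high` and the matching layout coordinates as `X_low`.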
## Source-text policy
This release contains **only**:
- arXiv descriptive metadata (titles, identifiers, categories, chunk counts) — arXiv lists this as CC0-licensed descriptive metadata
- numerical research artifacts (mean-pooled embeddings, layouts) produced by the authors
It does **not** contain raw arXiv text, PDFs, TeX source, chunk text, or anything
otherwise covered by the licenses of the original papers. Users who want the
full text of any individual paper should retrieve it directly from arXiv,
subject to that paper's license.
## Licensing
- The numerical research artifacts in this release (embeddings, layouts) are
released under [**CC-BY 4.0**](https://creativecommons.org/licenses/by/4.0/).
- arXiv descriptive metadata fields included here are CC0 per
[arXiv API terms of use](https://info.arxiv.org/help/api/tou.html).
- This is not legal advice; users are responsible for compliance with the
licenses of any arXiv paper they retrieve separately from this release.
## Citation
If you use this dataset, please cite both the paper and the dataset:
```bibtex
@article{kolpakov2026dire,
author = {Kolpakov, Alexander and Rivin, Igor},
title = {DiRe: Topology-Faithful Dimensionality Reduction},
journal = {Proceedings of the National Academy of Sciences (submitted)},
year = {2026}
}
@dataset{kolpakov_rivin_dire_arxiv_2026,
author = {Kolpakov, Alexander and Rivin, Igor},
title = {{arXiv paper embeddings for DiRe topology-faithful dimensionality reduction}},
year = {2026},
publisher = {Zenodo},
doi = {10.5281/zenodo.19837856}
}
```
(The Zenodo DOI `10.5281/zenodo.19837856` is the *concept DOI*: it always
resolves to the latest version of this dataset. The version-specific DOI
for v1 is `10.5281/zenodo.19837857`. Cite the concept DOI in publications.
This Hugging Face copy is a convenience mirror.)
## Versioning
- **v1** (2026-04): Initial release. Pooled embeddings + minimal arXiv metadata + full layout sweep.
Future versions may add: enriched metadata (abstracts, authors, dates, license), chunk-level embeddings, additional reducer baselines.
Provider:
igriv



