
igriv/dire-arxiv-bge-small-embeddings

Source: Hugging Face. Updated 2026-04-28; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/igriv/dire-arxiv-bge-small-embeddings
Resource description:
---
license: cc-by-4.0
language:
- en
size_categories:
- 100K<n<1M
task_categories:
- feature-extraction
tags:
- arxiv
- embeddings
- bge
- dimensionality-reduction
- topology
- dire
- umap
pretty_name: DiRe arXiv BGE-small paper-level embeddings
configs:
- config_name: default
  data_files:
  - split: train
    path: embeddings/part-*.parquet
- config_name: metadata
  data_files:
  - split: train
    path: metadata.parquet
- config_name: layouts_2d
  data_files:
  - split: train
    path: layouts/layouts_2d.parquet
- config_name: layouts_3d
  data_files:
  - split: train
    path: layouts/layouts_3d.parquet
---

# DiRe arXiv BGE-small paper-level embeddings

Paper-level mean-pooled embeddings for **723,457 arXiv papers**, produced with [`BAAI/bge-small-en-v1.5`](https://huggingface.co/BAAI/bge-small-en-v1.5), together with arXiv descriptive metadata, 2-d and 3-d DiRe / UMAP layouts of the full corpus, and provenance manifests.

This is the dataset underlying the arXiv experiment in:

> Kolpakov, A. and Rivin, I. *DiRe: Topology-Faithful Dimensionality Reduction.*
> (PNAS submission, 2026.)

Igor Rivin: [ORCID 0000-0001-9302-2169](https://orcid.org/0000-0001-9302-2169).

The pipeline is: 130 M LaTeX-derived chunks → unit-normalized BGE-small chunk embeddings → mean-pool per paper (one 384-d vector per paper) → DiRe / UMAP projection.

The released embeddings are the **mean-pooled, un-renormalized** vectors (norms ~0.83–1.0). Downstream code typically L2-renormalizes them before further use.
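The renormalization step mentioned above can be sketched with plain NumPy. This is only an illustration on synthetic data: the helper name `l2_renormalize` and the toy arrays are assumptions, not part of the release, and the synthetic norms will not match the 0.83–1.0 range of the real vectors.

```python
import numpy as np


def l2_renormalize(X: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale every row of X to unit L2 norm (hypothetical helper)."""
    X = np.asarray(X, dtype=np.float32)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Guard against division by zero for an all-zero row.
    return X / np.maximum(norms, eps)


# Synthetic stand-in for the pipeline: 5 "papers", 20 unit-norm "chunks" each.
rng = np.random.default_rng(0)
chunks = rng.normal(size=(5, 20, 384)).astype(np.float32)
chunks /= np.linalg.norm(chunks, axis=2, keepdims=True)

# Mean-pooling unit vectors gives vectors whose norm is at most 1.
X_demo = chunks.mean(axis=1)
X_unit = l2_renormalize(X_demo)

# For unit vectors, squared Euclidean distance equals 2 - 2 * cosine similarity.
a, b = X_unit[0], X_unit[1]
sq_dist = float(np.sum((a - b) ** 2))
cos_sim = float(a @ b)
```

After renormalization, ranking neighbours by Euclidean distance and ranking them by cosine similarity give the same order, which is usually why the step is applied before nearest-neighbour search.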
## What's in the release

| Config | File(s) | Rows | Schema |
|---|---|---:|---|
| `default` | `embeddings/part-*.parquet` (8 shards, ~1.03 GB) | 723,457 | `arxiv_id: string`, `embedding: fixed_size_list<float32, 384>` |
| `metadata` | `metadata.parquet` | 723,457 | `arxiv_id, title, primary_category, categories: list<string>, n_chunks` |
| `layouts_2d` | `layouts/layouts_2d.parquet` | 7,234,570 | `arxiv_id, method ∈ {dire, umap}, n_neighbors ∈ {8,16,32,64,128}, x, y` |
| `layouts_3d` | `layouts/layouts_3d.parquet` | 1,446,914 | `arxiv_id, method, n_neighbors=16, x, y, z` |

Provenance lives in `provenance/`:

- `embedding_config.json`: model, pooling, normalization, embedding dim
- `software_versions.json`: Python / library versions used at build time
- `file_hashes.json`: SHA-256 of every released file

## Quick start

```python
from datasets import load_dataset
import numpy as np

# 384-d paper embeddings (default config)
ds = load_dataset("igriv/dire-arxiv-bge-small-embeddings", split="train")
X = np.vstack(ds["embedding"]).astype("float32")  # (723457, 384)
ids = np.asarray(ds["arxiv_id"])

# arXiv descriptive metadata
meta = load_dataset("igriv/dire-arxiv-bge-small-embeddings",
                    name="metadata", split="train").to_pandas()

# Pre-computed 2-d layouts (filter to the canonical setting used in the paper)
layouts = load_dataset("igriv/dire-arxiv-bge-small-embeddings",
                       name="layouts_2d", split="train").to_pandas()
canonical = layouts[(layouts.method == "dire") & (layouts.n_neighbors == 16)]
```

A minimal loader is also included as `load_example.py`.
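Continuing from the Quick start, a common next step is to line up metadata rows with the rows of the embedding matrix. The card does not say whether the configs share a row order, so the sketch below joins on `arxiv_id` rather than assuming one. The identifiers, category values, and the 4-d toy matrix are made up for illustration; the names `ids`, `X`, and `meta` mirror the Quick start.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for `ids`, `X`, and `meta` from the Quick start.
ids = np.array(["2401.00001", "2401.00002", "2401.00003"])
X = np.arange(12, dtype=np.float32).reshape(3, 4)  # 4-d here; 384-d in the release
meta = pd.DataFrame({
    "arxiv_id": ["2401.00003", "2401.00001", "2401.00002"],
    "primary_category": ["math.GT", "cs.LG", "cs.LG"],
    "n_chunks": [210, 150, 95],
})

# A left merge keeps the order of the left-hand keys, so row i of
# `meta_aligned` describes X[i]. `validate` raises if an id is duplicated.
meta_aligned = pd.DataFrame({"arxiv_id": ids}).merge(
    meta, on="arxiv_id", how="left", validate="one_to_one"
)

# A boolean mask over the aligned metadata then selects embedding rows.
mask = (meta_aligned["primary_category"] == "cs.LG").to_numpy()
X_cs_lg = X[mask]
```

The same mask can be applied to any other array that shares the row order of `X`, such as a renormalized copy of the embeddings.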
## Embedding details

| Property | Value |
|---|---|
| Model | `BAAI/bge-small-en-v1.5` |
| Embedding dim | 384 |
| Source text | LaTeX-derived chunks (paper body + abstract) |
| Number of chunks pooled | 130,028,430 across 723,457 papers (~180/paper avg) |
| Pooling | mean over chunk vectors (each chunk vector is BGE's L2-unit output) |
| Released dtype | `float32` |
| Released normalization | **un-normalized mean-pool** (typical norm 0.83–1.0); re-normalize to L2 unit before use if you want cosine ↔ Euclidean equivalence |
| Per-paper chunk count | available as `n_chunks` in `metadata` |

## Layouts

The 2-d layouts are the full corpus projected down to 2-d for both DiRe (force-directed, GPU; [dire-rapids](https://github.com/sashakolpakov/dire-rapids)) and cuML UMAP. The sweep over `n_neighbors ∈ {8, 16, 32, 64, 128}` is included so readers can reproduce every figure in the paper directly. The canonical setting is `n_neighbors=16, method=dire`.

3-d layouts are provided at `n_neighbors=16` for both methods.

Layout coordinates are in raw method output units (arbitrary; not standardized).

## Reproducing the embeddings

The pooled embeddings here are the only thing required to reproduce the paper's arXiv experiments (DiRe / UMAP layouts, kNN-preservation evaluation, Betti-curve DTW, island-ness diagnostics). The underlying chunk-level embeddings are **not** redistributed; they can be regenerated from arXiv source text using the public BGE model and the chunking pipeline at <https://github.com/sashakolpakov/dire-rapids-arxiv>.

## Source-text policy

This release contains **only**:

- arXiv descriptive metadata (titles, identifiers, categories, chunk counts); arXiv lists this as CC0-licensed descriptive metadata
- numerical research artifacts (mean-pooled embeddings, layouts) produced by the authors

It does **not** contain raw arXiv text, PDFs, TeX source, chunk text, or anything otherwise covered by the licenses of the original papers.
Users who want the full text of any individual paper should retrieve it directly from arXiv, subject to that paper's license.

## Licensing

- The numerical research artifacts in this release (embeddings, layouts) are released under [**CC-BY 4.0**](https://creativecommons.org/licenses/by/4.0/).
- arXiv descriptive metadata fields included here are CC0 per [arXiv API terms of use](https://info.arxiv.org/help/api/tou.html).
- This is not legal advice; users are responsible for compliance with the licenses of any arXiv paper they retrieve separately from this release.

## Citation

If you use this dataset, please cite both the paper and the dataset:

```bibtex
@article{kolpakov2026dire,
  author  = {Kolpakov, Alexander and Rivin, Igor},
  title   = {DiRe: Topology-Faithful Dimensionality Reduction},
  journal = {Proceedings of the National Academy of Sciences (submitted)},
  year    = {2026}
}

@dataset{kolpakov_rivin_dire_arxiv_2026,
  author    = {Kolpakov, Alexander and Rivin, Igor},
  title     = {{arXiv paper embeddings for DiRe topology-faithful dimensionality reduction}},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19837856}
}
```

(The Zenodo DOI `10.5281/zenodo.19837856` is the *concept DOI*: it always resolves to the latest version of this dataset. The version-specific DOI for v1 is `10.5281/zenodo.19837857`. Cite the concept DOI in publications. This Hugging Face copy is a convenience mirror.)

## Versioning

- **v1** (2026-04): Initial release. Pooled embeddings + minimal arXiv metadata + full layout sweep.

Future versions may add: enriched metadata (abstracts, authors, dates, license), chunk-level embeddings, additional reducer baselines.
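Because `provenance/file_hashes.json` records a SHA-256 digest for every released file, a downloaded copy (from this mirror or from Zenodo) can be checked locally. The sketch below uses only the Python standard library. The helper names `sha256_of_file` and `matches_digest` are hypothetical, and since the card does not describe the JSON layout, looking up the expected digest in that file is left to the reader.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def matches_digest(path, expected_hex):
    """Compare a file's digest with an expected hex string, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Tiny self-check on a temporary file standing in for a downloaded shard.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.bin"
    demo.write_bytes(b"abc")
    demo_digest = sha256_of_file(demo)
    demo_ok = matches_digest(demo, demo_digest.upper())
```

Reading in fixed-size blocks keeps memory use flat even for the larger parquet shards.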
Provider:
igriv