tsanghasona/athena-corpus

Name: tsanghasona/athena-corpus
Creator: tsanghasona
Published: 2026-02-24 03:09:32
License: no description provided

Source: Hugging Face. Updated 2026-02-24. Indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/tsanghasona/athena-corpus

Resource description:

---
license: mit
task_categories:
- text-retrieval
- feature-extraction
language:
- en
tags:
- philosophy
- literature
- religion
- semantic-search
- embeddings
- public-domain
size_categories:
- 100K<n<1M
---

# Athena Corpus

Pre-built data for [Athena](https://github.com/tsangha/athena), a semantic search engine for philosophy, literature, religion, and intellectual history.

## Contents

| File | Size | Description |
|------|------|-------------|
| `chunks.parquet` | 610 MB | 727k passages with text + metadata (author, work, tradition, era, genre, chapter, etc.) |
| `embeddings_bf16.npy` | 1.1 GB | 727k x 768 float16 vectors, L2-normalized ([jina-embeddings-v5-text-nano](https://huggingface.co/jinaai/jina-embeddings-v5-text-nano)) |
| `text/` | 1.6 GB | 5,012 cleaned source texts (plain text, one file per work) |

Rows in `chunks.parquet` and `embeddings_bf16.npy` are aligned by index -- row 0 in the parquet corresponds to row 0 in the embedding matrix.

## Corpus scale

- **4,980 works** from **2,310 authors** across **336 intellectual traditions**
- **727k chunks** (~640-768 tokens each)
- Spans ancient philosophy, sacred texts, literature, poetry, political theory, science, and more
- All texts are public domain

## Usage

### With Athena (full search server)

```bash
git clone https://github.com/tsangha/athena.git && cd athena
uv venv --python 3.11 && source .venv/bin/activate && uv pip install -e .

# Download this dataset
pip install huggingface_hub
hf download tsanghasona/athena-corpus --repo-type dataset --local-dir data/

# Export query encoder model
pip install torch transformers onnxsim
python embedder-rs/scripts/export_onnx.py --output-dir model/
python -m onnxsim model/model.onnx model/model_simplified.onnx

# Start server
uv run python -m uvicorn server.main:app --port 3003
```

### Standalone (just the data)

```python
import polars as pl
import numpy as np

# Load chunks with metadata and text
df = pl.read_parquet("chunks.parquet")
print(df.columns)
# ['text', 'chunk_id', 'author', 'work', 'text_type', 'tradition',
#  'era', 'genre', 'chapter', 'poem_title', 'chunk_index', ...]

# Load embeddings (memory-mapped for large files)
embeddings = np.load("embeddings_bf16.npy", mmap_mode="r")
print(embeddings.shape)  # (726986, 768)
print(embeddings.dtype)  # float16
```

## Embedding model

Embeddings were generated with [jina-embeddings-v5-text-nano](https://huggingface.co/jinaai/jina-embeddings-v5-text-nano):

- 768 dimensions, float16, L2-normalized
- Asymmetric: document chunks use the default prefix; queries should use `"Query: "` prefix
- Generated with Athena's Rust embedder (see repo for source)

## Chunking

Texts are split into ~640-768 token passages using strategy-appropriate chunking:

| Strategy | Used for | Examples |
|----------|----------|----------|
| `structured` | Numbered sections | Spinoza's Ethics, Aquinas's Summa |
| `discursive` | Flowing prose | Nietzsche, Plato's dialogues |
| `literary` | Chapter-aware paragraph merge | Dostoevsky, Homer, Tolstoy |
| `poetic` | Poem-boundary detection | Browning, Yeats, Heine |
| `annotation` | Markdown header split | Academic notes |

The `text_type` column in the parquet indicates which strategy was used for each chunk.

## License

MIT -- see the [Athena repo](https://github.com/tsangha/athena) for details. All source texts are public domain.
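
## Example: similarity lookup over the embeddings

Because the vectors are L2-normalized and aligned row-for-row with `chunks.parquet`, a dot product against the embedding matrix gives cosine similarity, and the best-scoring row indices point straight at the matching passages. The following is a minimal sketch of that idea, not Athena's own search code: the helper name `top_k_similar` and its `batch_rows` parameter are hypothetical, and the demo uses a small random matrix as a stand-in for `embeddings_bf16.npy` so that it runs without the download. It uses an existing row as the query, which sidesteps query encoding entirely.

```python
import numpy as np


def top_k_similar(embeddings, query, k=5, batch_rows=100_000):
    """Return (row_indices, scores) for the k rows most similar to `query`.

    Assumes every row of `embeddings` and the `query` vector are
    L2-normalized, so the dot product equals cosine similarity.
    Rows are processed in batches so a memory-mapped float16 matrix
    is never converted to float32 all at once.
    """
    query = np.asarray(query, dtype=np.float32)
    n_rows = embeddings.shape[0]
    scores = np.empty(n_rows, dtype=np.float32)
    for start in range(0, n_rows, batch_rows):
        block = np.asarray(embeddings[start:start + batch_rows], dtype=np.float32)
        scores[start:start + block.shape[0]] = block @ query
    k = min(k, n_rows)
    # argpartition finds the k best rows without sorting the whole array.
    top = np.argpartition(-scores, k - 1)[:k]
    # Sort only those k rows by descending score.
    top = top[np.argsort(-scores[top])]
    return top, scores[top]


# Synthetic stand-in for the real matrix: 1,000 normalized float16 rows.
rng = np.random.default_rng(0)
demo = rng.normal(size=(1000, 768)).astype(np.float32)
demo /= np.linalg.norm(demo, axis=1, keepdims=True)
demo = demo.astype(np.float16)

# Use row 42 as the query; it should rank itself first.
idx, sims = top_k_similar(demo, demo[42], k=5)
print(idx, sims)
```

With the real files, the same function could be called on the memory-mapped array from `np.load("embeddings_bf16.npy", mmap_mode="r")`, and the returned indices used to select the corresponding rows of the `chunks.parquet` DataFrame. For free-text queries, the query would first need to be embedded with the same model and the `"Query: "` prefix described under "Embedding model".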

Provided by:

tsanghasona
