
tsanghasona/athena-corpus

Source: Hugging Face · Updated 2026-02-24 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/tsanghasona/athena-corpus
Description:
---
license: mit
task_categories:
- text-retrieval
- feature-extraction
language:
- en
tags:
- philosophy
- literature
- religion
- semantic-search
- embeddings
- public-domain
size_categories:
- 100K<n<1M
---

# Athena Corpus

Pre-built data for [Athena](https://github.com/tsangha/athena), a semantic search engine for philosophy, literature, religion, and intellectual history.

## Contents

| File | Size | Description |
|------|------|-------------|
| `chunks.parquet` | 610 MB | 727k passages with text + metadata (author, work, tradition, era, genre, chapter, etc.) |
| `embeddings_bf16.npy` | 1.1 GB | 727k x 768 float16 vectors, L2-normalized ([jina-embeddings-v5-text-nano](https://huggingface.co/jinaai/jina-embeddings-v5-text-nano)) |
| `text/` | 1.6 GB | 5,012 cleaned source texts (plain text, one file per work) |

Rows in `chunks.parquet` and `embeddings_bf16.npy` are aligned by index -- row 0 in the parquet corresponds to row 0 in the embedding matrix.

## Corpus scale

- **4,980 works** from **2,310 authors** across **336 intellectual traditions**
- **727k chunks** (~640-768 tokens each)
- Spans ancient philosophy, sacred texts, literature, poetry, political theory, science, and more
- All texts are public domain

## Usage

### With Athena (full search server)

```bash
git clone https://github.com/tsangha/athena.git && cd athena
uv venv --python 3.11 && source .venv/bin/activate && uv pip install -e .

# Download this dataset
pip install huggingface_hub
hf download tsanghasona/athena-corpus --repo-type dataset --local-dir data/

# Export query encoder model
pip install torch transformers onnxsim
python embedder-rs/scripts/export_onnx.py --output-dir model/
python -m onnxsim model/model.onnx model/model_simplified.onnx

# Start server
uv run python -m uvicorn server.main:app --port 3003
```

### Standalone (just the data)

```python
import polars as pl
import numpy as np

# Load chunks with metadata and text
df = pl.read_parquet("chunks.parquet")
print(df.columns)
# ['text', 'chunk_id', 'author', 'work', 'text_type', 'tradition',
#  'era', 'genre', 'chapter', 'poem_title', 'chunk_index', ...]

# Load embeddings (memory-mapped for large files)
embeddings = np.load("embeddings_bf16.npy", mmap_mode="r")
print(embeddings.shape)  # (726986, 768)
print(embeddings.dtype)  # float16
```

## Embedding model

Embeddings were generated with [jina-embeddings-v5-text-nano](https://huggingface.co/jinaai/jina-embeddings-v5-text-nano):

- 768 dimensions, float16, L2-normalized
- Asymmetric: document chunks use the default prefix; queries should use `"Query: "` prefix
- Generated with Athena's Rust embedder (see repo for source)

## Chunking

Texts are split into ~640-768 token passages using strategy-appropriate chunking:

| Strategy | Used for | Examples |
|----------|----------|----------|
| `structured` | Numbered sections | Spinoza's Ethics, Aquinas's Summa |
| `discursive` | Flowing prose | Nietzsche, Plato's dialogues |
| `literary` | Chapter-aware paragraph merge | Dostoevsky, Homer, Tolstoy |
| `poetic` | Poem-boundary detection | Browning, Yeats, Heine |
| `annotation` | Markdown header split | Academic notes |

The `text_type` column in the parquet indicates which strategy was used for each chunk.

## License

MIT -- see the [Athena repo](https://github.com/tsangha/athena) for details. All source texts are public domain.
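
## Example: brute-force lookup over the embeddings (sketch)

Because the vectors are L2-normalized and aligned by row with `chunks.parquet`, a plain dot product gives cosine similarity, and the index of a high-scoring row identifies the matching chunk. The sketch below illustrates that idea on small synthetic data so it runs without the download. The function name `top_k`, the `batch_rows` parameter, and the synthetic arrays are assumptions made for illustration and are not part of Athena; a real query vector would need to come from the same embedding model with the `"Query: "` prefix.

```python
import numpy as np


def top_k(query_vec, embeddings, k=5, batch_rows=65536):
    """Return (indices, scores) of the k rows most similar to query_vec.

    Hypothetical helper: assumes `embeddings` rows are L2-normalized,
    so a dot product with a unit-length query is the cosine similarity.
    """
    q = np.array(query_vec, dtype=np.float32)  # copy, so the caller's array is untouched
    q /= np.linalg.norm(q)

    n = embeddings.shape[0]
    scores = np.empty(n, dtype=np.float32)
    # Upcast float16 rows to float32 one batch at a time, which keeps
    # memory bounded when `embeddings` is a large memory-mapped file.
    for start in range(0, n, batch_rows):
        block = np.asarray(embeddings[start:start + batch_rows], dtype=np.float32)
        scores[start:start + block.shape[0]] = block @ q

    k = min(k, n)
    idx = np.argpartition(-scores, k - 1)[:k]  # unordered top k
    idx = idx[np.argsort(-scores[idx])]        # order by descending score
    return idx, scores[idx]


# Synthetic stand-in for embeddings_bf16.npy: 1000 unit vectors, float16.
rng = np.random.default_rng(0)
emb = rng.standard_normal((1000, 768)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
emb = emb.astype(np.float16)

# Stand-in query: a slightly perturbed copy of row 42.
query = emb[42].astype(np.float32) + 0.01 * rng.standard_normal(768).astype(np.float32)

idx, scores = top_k(query, emb, k=3)
for i, s in zip(idx, scores):
    # With the real data, row i of the parquet holds this chunk's text
    # and metadata, e.g. df.row(int(i), named=True) in polars.
    print(int(i), float(s))
```

With the real files, `emb` would be replaced by the memory-mapped array loaded in the standalone snippet. An exhaustive scan like this is only a baseline; the Athena server may use a different retrieval strategy.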
Provided by:
tsanghasona