tsanghasona/athena-corpus
Source: Hugging Face · Updated 2026-02-24 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/tsanghasona/athena-corpus
Description:
---
license: mit
task_categories:
- text-retrieval
- feature-extraction
language:
- en
tags:
- philosophy
- literature
- religion
- semantic-search
- embeddings
- public-domain
size_categories:
- 100K<n<1M
---
# Athena Corpus
Pre-built data for [Athena](https://github.com/tsangha/athena), a semantic search engine for philosophy, literature, religion, and intellectual history.
## Contents
| File | Size | Description |
|------|------|-------------|
| `chunks.parquet` | 610 MB | 727k passages with text + metadata (author, work, tradition, era, genre, chapter, etc.) |
| `embeddings_bf16.npy` | 1.1 GB | 727k x 768 float16 vectors, L2-normalized ([jina-embeddings-v5-text-nano](https://huggingface.co/jinaai/jina-embeddings-v5-text-nano)) |
| `text/` | 1.6 GB | 5,012 cleaned source texts (plain text, one file per work) |
Rows in `chunks.parquet` and `embeddings_bf16.npy` are aligned by index -- row 0 in the parquet corresponds to row 0 in the embedding matrix.
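Because the two files are aligned only by position, a quick shape check before indexing into both can catch a truncated or mismatched download. The sketch below is an illustration, not part of Athena: `rows_aligned` is a hypothetical helper name, and the 768-column width is taken from the table above. It runs on synthetic stand-ins so that it does not depend on the real files.

```python
import numpy as np


def rows_aligned(n_chunks: int, embeddings: np.ndarray, dim: int = 768) -> bool:
    """Return True if `embeddings` has one `dim`-wide row per chunk.

    `rows_aligned` is a hypothetical helper for illustration only.
    """
    # The matrix must be 2-D: one row per chunk, one column per dimension.
    if embeddings.ndim != 2:
        return False
    n_rows, n_cols = embeddings.shape
    # Row i of the matrix is only meaningful if row i of the table exists.
    return n_rows == n_chunks and n_cols == dim


# Synthetic stand-in: 5 chunks and a 5 x 768 float16 matrix.
demo_embeddings = np.zeros((5, 768), dtype=np.float16)
print(rows_aligned(5, demo_embeddings))
```

With the real files loaded as in the Usage section, the equivalent call would be `rows_aligned(df.height, embeddings)`.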
## Corpus scale
- **4,980 works** from **2,310 authors** across **336 intellectual traditions**
- **727k chunks** (~640-768 tokens each)
- Spans ancient philosophy, sacred texts, literature, poetry, political theory, science, and more
- All texts are public domain
## Usage
### With Athena (full search server)
```bash
# Clone Athena and install it into a fresh virtual environment
git clone https://github.com/tsangha/athena.git && cd athena
uv venv --python 3.11 && source .venv/bin/activate && uv pip install -e .

# Download this dataset into data/
# (uv pip targets the active venv; venvs created by uv do not include pip)
uv pip install huggingface_hub
hf download tsanghasona/athena-corpus --repo-type dataset --local-dir data/

# Export the query encoder model to ONNX and simplify it
uv pip install torch transformers onnxsim
python embedder-rs/scripts/export_onnx.py --output-dir model/
python -m onnxsim model/model.onnx model/model_simplified.onnx

# Start the server on port 3003
uv run python -m uvicorn server.main:app --port 3003
```
### Standalone (just the data)
```python
import polars as pl
import numpy as np
# Load chunks with metadata and text
df = pl.read_parquet("chunks.parquet")
print(df.columns)
# ['text', 'chunk_id', 'author', 'work', 'text_type', 'tradition',
# 'era', 'genre', 'chapter', 'poem_title', 'chunk_index', ...]
# Load embeddings (memory-mapped for large files)
embeddings = np.load("embeddings_bf16.npy", mmap_mode="r")
print(embeddings.shape) # (726986, 768)
print(embeddings.dtype) # float16
```
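Since the vectors are L2-normalized, cosine similarity reduces to a dot product, so a brute-force search needs only NumPy. The following is a minimal sketch, assuming the query vector has already been produced by the same embedding model and normalized. `top_k` and `batch_rows` are hypothetical names, not part of Athena, and the demo uses a tiny synthetic matrix in place of the real file.

```python
import numpy as np


def top_k(query_vec, embeddings, k: int = 5, batch_rows: int = 65536):
    """Return (indices, scores) of the k most similar rows, best first.

    Hypothetical helper: assumes every row of `embeddings` and the query
    vector are L2-normalized, so the dot product is the cosine similarity.
    """
    q = np.asarray(query_vec, dtype=np.float32)
    n = embeddings.shape[0]
    scores = np.empty(n, dtype=np.float32)
    # Score in batches so a memory-mapped float16 matrix is never fully
    # copied into float32 at once.
    for start in range(0, n, batch_rows):
        block = np.asarray(embeddings[start:start + batch_rows], dtype=np.float32)
        scores[start:start + block.shape[0]] = block @ q
    k = min(k, n)
    # argpartition isolates the k best rows; only those k are then sorted.
    idx = np.argpartition(-scores, k - 1)[:k]
    idx = idx[np.argsort(-scores[idx])]
    return idx, scores[idx]


# Synthetic stand-in: four unit vectors in place of the real embeddings.
demo = np.eye(4, dtype=np.float16)
indices, sims = top_k(np.array([0.0, 0.0, 1.0, 0.0]), demo, k=2)
print(indices, sims)
```

Because of the index alignment, the returned indices also select the matching passages in `chunks.parquet`.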
## Embedding model
Embeddings were generated with [jina-embeddings-v5-text-nano](https://huggingface.co/jinaai/jina-embeddings-v5-text-nano):
- 768 dimensions, float16 (the array loads as NumPy `float16` despite the `bf16` in the file name), L2-normalized
- Asymmetric: document chunks use the default prefix; queries should use the `"Query: "` prefix
- Generated with Athena's Rust embedder (see repo for source)
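The asymmetric setup means a query string needs the prefix before it is encoded, and the resulting vector should be normalized to match the stored ones. The sketch below shows only those two steps. `format_query` and `l2_normalize` are hypothetical helper names, and the encoding call itself is left out because this card does not specify the encoder's API.

```python
import numpy as np

QUERY_PREFIX = "Query: "  # prefix stated in this card for query-side text


def format_query(text: str) -> str:
    """Prepend the query prefix unless it is already present (hypothetical helper)."""
    text = text.strip()
    return text if text.startswith(QUERY_PREFIX) else QUERY_PREFIX + text


def l2_normalize(vec) -> np.ndarray:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    v = np.asarray(vec, dtype=np.float32)
    norm = np.linalg.norm(v)
    # Leave an all-zero vector unchanged rather than dividing by zero.
    return v if norm == 0.0 else v / norm


print(format_query("what is the good life?"))
print(l2_normalize([3.0, 4.0]))
```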
## Chunking
Texts are split into ~640-768 token passages, with the chunking strategy chosen to suit each kind of text:
| Strategy | Used for | Examples |
|----------|----------|----------|
| `structured` | Numbered sections | Spinoza's Ethics, Aquinas's Summa |
| `discursive` | Flowing prose | Nietzsche, Plato's dialogues |
| `literary` | Chapter-aware paragraph merge | Dostoevsky, Homer, Tolstoy |
| `poetic` | Poem-boundary detection | Browning, Yeats, Heine |
| `annotation` | Markdown header split | Academic notes |
The `text_type` column in the parquet indicates which strategy was used for each chunk.
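Combined with the index alignment, `text_type` can restrict a search to one kind of text. The sketch below assumes the column holds the strategy names exactly as written in the table (for example `poetic`), which this card implies but does not state. `rows_for_strategy` is a hypothetical helper name, and the labels here are synthetic.

```python
import numpy as np


def rows_for_strategy(text_types, strategy: str) -> np.ndarray:
    """Return the row indices whose text_type equals `strategy` (hypothetical helper)."""
    # dtype=object keeps the comparison element-wise even for empty input or None values.
    labels = np.asarray(text_types, dtype=object)
    return np.flatnonzero(labels == strategy)


# Synthetic stand-in for the text_type column of chunks.parquet.
demo_types = ["poetic", "literary", "poetic", "structured"]
rows = rows_for_strategy(demo_types, "poetic")
print(rows)
```

With the real data, `rows_for_strategy(df["text_type"].to_list(), "poetic")` would give indices that can be used to slice the embedding matrix, as in `embeddings[rows]`.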
## License
MIT -- see the [Athena repo](https://github.com/tsangha/athena) for details.
All source texts are public domain.
Provider:
tsanghasona



