RockyCo/jude-judaic-data
Source: Hugging Face. Updated 2026-03-07; indexed 2026-03-29.
Download link: https://hf-mirror.com/datasets/RockyCo/jude-judaic-data
Resource description:
---
license: mit
language:
- en
- he
pretty_name: Jude Judaic Data
task_categories:
- text-generation
- text-retrieval
tags:
- agent
- rag
- judaism
- religion
size_categories:
- 100K<n<1M
---
# Dataset Card for Jude Judaic Data
<!-- Provide a quick summary of the dataset. -->
This dataset contains a cleaned Markdown version of the extensive Jewish text library sourced from the Sefaria project. It is preprocessed for use in Retrieval-Augmented Generation (RAG) pipelines and offline local AI assistants such as Jude.
### Dataset Description
- **Curated by:** Gil Caplan
- **Language(s) (NLP):** English (`en`), Hebrew (`he`)
- **License:** MIT (inherited from Sefaria's open data permissions)
- **Repository:** Sub-corpus generated from https://github.com/Sefaria/Sefaria-Export
## Dataset Structure
<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->
The dataset is organized as a collection of Markdown files explicitly divided into English (`sefaria_rag_markdown/en/`) and Hebrew (`sefaria_rag_markdown/he/`) directories.
Within each language directory, texts are sorted into categories (e.g., `Tanakh`, `Talmud`, `Halakhah`, `Midrash`, `Kabbalah`, `Liturgy`, `Jewish Thought`). Each specific book (e.g., `Genesis.md`, `Berakhot.md`) is a standalone Markdown file.
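The layout above can be walked with a few lines of standard-library Python. This is a minimal sketch, assuming the dataset has been downloaded locally and that the first directory level under each language folder is the category; `list_books` is a hypothetical helper, not something shipped with the dataset.

```python
from pathlib import Path


def list_books(root, lang):
    """Map category name -> sorted book names found under root/<lang>/.

    Assumption: the first path component below the language directory is
    the category. Deeper sub-folders, if any, are folded into it.
    """
    lang_dir = Path(root) / lang
    books = {}
    for md_file in sorted(lang_dir.rglob("*.md")):
        category = md_file.relative_to(lang_dir).parts[0]
        books.setdefault(category, []).append(md_file.stem)
    return books
```

For example, `list_books("sefaria_rag_markdown", "en")` would return a dictionary keyed by category names such as `Tanakh` or `Talmud`, provided the local copy follows the structure described here.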
**Format per line:**
`[BookName ReferenceCoordinate] Text content`
**Example:**
`[Genesis 1:1] In the beginning God created the heaven and the earth.`
This line-by-line structure allows a ChromaDB indexer or standard RAG retriever to chunk the text efficiently, retaining the exact canonical citation alongside the passage. The entire corpus consists of 288,737 individual chunked passages.
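A line in this format can be split back into its citation and passage before indexing. The sketch below is illustrative only: `Passage` and `parse_line` are hypothetical names, and it assumes the reference coordinate is the last space-separated token inside the brackets, so that multi-word book names stay intact.

```python
import re
from typing import NamedTuple, Optional

# "[<citation>] <text>", where the citation is everything inside the brackets.
LINE_RE = re.compile(r"^\[(?P<citation>[^\]]+)\]\s*(?P<text>.*)$")


class Passage(NamedTuple):
    book: str
    coordinate: str
    text: str


def parse_line(line: str) -> Optional[Passage]:
    """Parse one dataset line; return None if it lacks a bracketed citation."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    # Assumption: the coordinate contains no spaces, so the last space
    # separates the book name from the coordinate.
    book, _, coordinate = match.group("citation").rpartition(" ")
    return Passage(book, coordinate, match.group("text"))
```

Keeping `book` and `coordinate` as separate fields makes it straightforward to store them as metadata next to each embedded passage, whichever vector store is used.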
### Source Data
<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->
The raw data was originally sourced from the official Sefaria JSON export repository, which contains the digitized and translated canon of Jewish literature.
#### Data Collection and Processing
<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->
The JSON data was pushed through a custom preprocessing script (`preprocess_sefaria_to_markdown.py`) that performed the following sanitizations before exporting to this dataset:
1. **Footnote Filtering:** Sefaria HTML footnote markers (`<sup class="footnote-marker">`) and their corresponding inline bodies were aggressively stripped out to prevent them from interrupting verse sentences during LLM synthesis.
2. **HTML Stripping:** All remaining HTML stylistic tags were removed.
3. **Entity Decoding:** HTML entities (e.g., `&amp;`, `&nbsp;`) were converted to standard Unicode text.
4. **Coordinate Flattening:** The hierarchical, N-dimensional JSON arrays used by Sefaria were flattened recursively into linear coordinate paths (e.g., node `[1, 2]` becomes `1:2`).
5. **Sorting:** Texts were grouped into their respective English and Hebrew files, sorted numerically by their reference coordinates rather than alphabetically.
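The steps above can be approximated as follows. This is not the actual `preprocess_sefaria_to_markdown.py`, which is not included in this card; it is a sketch under stated assumptions: footnote bodies are taken to be `<i class="footnote">` elements directly after the marker, coordinates are taken to be 1-based, and the names `clean_text`, `flatten`, and `coordinate_key` are hypothetical.

```python
import html
import re

# Assumed footnote shape: marker followed by an <i class="footnote"> body.
# The card names only the marker tag; nested <i> tags are not handled here.
FOOTNOTE_RE = re.compile(
    r'<sup class="footnote-marker">.*?</sup>\s*<i class="footnote">.*?</i>',
    re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")


def clean_text(raw):
    text = FOOTNOTE_RE.sub("", raw)   # step 1: drop footnote markers and bodies
    text = TAG_RE.sub("", text)       # step 2: strip remaining HTML tags
    text = html.unescape(text)        # step 3: decode entities such as &amp;
    return " ".join(text.split())     # extra step (assumption): collapse whitespace


def flatten(node, path=()):
    """Step 4: yield (coordinate, text) pairs from nested lists of strings."""
    if isinstance(node, str):
        if node.strip():              # skip empty placeholder segments
            yield ":".join(map(str, path)), node
        return
    for index, child in enumerate(node, start=1):  # 1-based is an assumption
        yield from flatten(child, path + (index,))


def coordinate_key(coordinate):
    """Step 5: sort key that orders "1:10" after "1:2" (numeric, not alphabetical)."""
    key = []
    for part in coordinate.split(":"):
        digits = re.match(r"\d+", part)
        key.append((int(digits.group()) if digits else 0, part))
    return key
```

Tags are stripped before entities are decoded so that an encoded `&lt;` in the source text is not later mistaken for the start of a tag.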
### Annotations [optional]
<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->
No external manual annotations were added. The dataset relies entirely on the canonical textual numbering and categorization system natively provided by Sefaria (e.g., Chapter:Verse, Folio, or Halacha numbering).
#### Annotation process
N/A
#### Personal and Sensitive Information
This dataset consists entirely of ancient and classical religious texts, legal codes, commentaries, and historical philosophical writings. It does not contain any modern personal, sensitive, or private information.
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
The dataset depends on which translations are available in the Sefaria project. While the Hebrew corpus is comprehensive across centuries of Jewish thought, the available English translations vary in vocabulary and era, ranging from archaic public domain translations to contemporary ones.
Some esoteric or highly specialized commentaries have no English translation at all in the base Sefaria dataset, so English-only retrievers cannot reach the full depth of the Hebrew corpus.
### Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
Users building AI agents on top of this dataset should either add a translation step to their pipeline, so that questions asked in another language are run against the English text, or embed and retrieve the Hebrew Markdown files directly if they have a strong multilingual embedding model trained on Rabbinic Hebrew.
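One way to express this routing decision in code is sketched below. It is deliberately narrow: it only detects Hebrew script through the Unicode Hebrew block and treats every other query as English, so queries in other languages would need a real language detector. `contains_hebrew` and `plan_retrieval` are hypothetical names, and the translation step itself is left to the caller.

```python
def contains_hebrew(text):
    """True if any character falls in the Unicode Hebrew block (U+0590 to U+05FF)."""
    return any("\u0590" <= ch <= "\u05ff" for ch in text)


def plan_retrieval(query, has_hebrew_embedder):
    """Return (language directory to search, whether to translate the query first)."""
    if not contains_hebrew(query):
        # Assumption: non-Hebrew queries are English.
        return "en", False
    if has_hebrew_embedder:
        # Embed the query as-is and search the Hebrew Markdown files.
        return "he", False
    # No suitable Hebrew embedder: translate to English, then search "en".
    return "en", True
```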
**BibTeX:**
```bibtex
@misc{sefaria,
title = {Sefaria: A Living Library of Jewish Texts},
author = {{Sefaria Project}},
year = {2026},
url = {https://www.sefaria.org/}
}
```
Provided by: RockyCo



