RockyCo/jude-judaic-data

Name: RockyCo/jude-judaic-data
Creator: RockyCo
Published: 2026-03-07 19:05:40
License: No description available

Hugging Face · Updated 2026-03-07 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/RockyCo/jude-judaic-data


Description:

---
license: mit
language:
- en
- he
pretty_name: Jude Judaic Data
task_categories:
- text-generation
- text-retrieval
tags:
- agent
- rag
- judaism
- religion
size_categories:
- 100K<n<1M
---

# Dataset Card for Jude Judaic Data

This dataset contains a cleaned Markdown version of the Jewish text library sourced from the Sefaria project. It is preprocessed for use in Retrieval-Augmented Generation (RAG) pipelines and offline local AI assistants such as Jude.

### Dataset Description

- **Curated by:** Gil Caplan
- **Language(s) (NLP):** English (`en`), Hebrew (`he`)
- **License:** MIT (inherited from Sefaria's open data permissions)
- **Repository:** Sub-corpus generated from https://github.com/Sefaria/Sefaria-Export

The dataset is organized as a collection of Markdown files divided into English (`sefaria_rag_markdown/en/`) and Hebrew (`sefaria_rag_markdown/he/`) directories. Within each language directory, texts are sorted into categories (e.g., `Tanakh`, `Talmud`, `Halakhah`, `Midrash`, `Kabbalah`, `Liturgy`, `Jewish Thought`). Each book (e.g., `Genesis.md`, `Berakhot.md`) is a standalone Markdown file.

**Format per line:** `[BookName ReferenceCoordinate] Text content`

**Example:** `[Genesis 1:1] In the beginning God created the heaven and the earth.`

This line-by-line structure lets a ChromaDB indexer or a standard RAG retriever chunk the text efficiently while keeping the exact canonical citation alongside each passage. The entire corpus consists of 288,737 individual passages.
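As a rough illustration of how the per-line format can be consumed, the sketch below splits one line into its citation and text. The names `Passage`, `LINE_RE`, and `parse_line` are hypothetical and not part of the dataset, and the code assumes that the reference coordinate is the last space-separated token inside the brackets, which may not hold for every book.

```python
import re
from typing import NamedTuple, Optional

# Matches "[Reference] text"; the reference is everything inside the first bracket pair.
LINE_RE = re.compile(r"^\[(?P<ref>[^\]]+)\]\s*(?P<text>.*)$")


class Passage(NamedTuple):
    book: str        # e.g. "Genesis"
    coordinate: str  # e.g. "1:1"
    text: str        # the passage itself


def parse_line(line: str) -> Optional[Passage]:
    """Split one dataset line into book, coordinate, and text.

    Returns None for lines that do not start with a bracketed citation.
    """
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    ref = match.group("ref")
    # Assumption: the coordinate is the final space-separated token of the reference.
    book, _, coordinate = ref.rpartition(" ")
    if not book:
        # No space in the reference, so treat the whole reference as the book name.
        book, coordinate = ref, ""
    return Passage(book, coordinate, match.group("text"))
```

Each `Passage` can then be handed to a vector store, with `book` and `coordinate` stored as metadata so that a retrieved chunk still carries its citation.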
### Source Data

The raw data was sourced from the official Sefaria JSON export repository, which contains the digitized and translated canon of Jewish literature.

#### Data Collection and Processing

The JSON data was run through a custom preprocessing script (`preprocess_sefaria_to_markdown.py`) that applied the following sanitization steps before export:

1. **Footnote Filtering:** Sefaria HTML footnote markers (`<sup class="footnote-marker">`) and their corresponding inline bodies were stripped so that they do not interrupt verse sentences during LLM synthesis.
2. **HTML Stripping:** All remaining HTML styling tags were removed.
3. **Entity Decoding:** HTML entities (e.g., `&amp;`, `&nbsp;`) were converted to standard Unicode text.
4. **Coordinate Flattening:** The hierarchical, N-dimensional JSON arrays used by Sefaria were flattened recursively into linear coordinate paths (e.g., node `[1, 2]` becomes `1:2`).
5. **Sorting:** Texts were grouped into their respective English and Hebrew files and sorted numerically by reference coordinate rather than alphabetically.

### Annotations

No external manual annotations were added. The dataset relies entirely on the canonical numbering and categorization system provided by Sefaria (e.g., Chapter:Verse, Folio, or Halacha numbering).

#### Annotation process

N/A

#### Personal and Sensitive Information

This dataset consists entirely of ancient and classical religious texts, legal codes, commentaries, and historical philosophical writings. It does not contain any modern personal, sensitive, or private information.

## Bias, Risks, and Limitations

The dataset depends on which translations are available in the Sefaria project.
While the Hebrew corpus is comprehensive across centuries of Jewish thought, the available English translations vary in vocabulary and era, ranging from archaic public-domain translations to contemporary ones. Some esoteric or highly specialized commentaries have no English translation in the base Sefaria dataset, so English-only retrievers will not reach the full depth of the Hebrew corpus.

### Recommendations

Users building AI agents on top of this dataset should add a translation step to their pipeline so that questions asked in other languages are run against the English text. Alternatively, they can embed and retrieve the Hebrew Markdown files directly if they have a strong multilingual embedding model trained on Rabbinic Hebrew.

**BibTeX:**

```bibtex
@misc{sefaria,
  title = {Sefaria: A Living Library of Jewish Texts},
  author = {{Sefaria Project}},
  year = {2026},
  url = {https://www.sefaria.org/}
}
```
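The sanitization steps listed under Data Collection and Processing can be sketched roughly as follows. This is not the actual `preprocess_sefaria_to_markdown.py`, whose source is not shown here. The function names are hypothetical, and the code assumes that a footnote body is an `<i class="footnote">` element placed directly after its marker, that footnotes contain no nested `<i>` tags, and that coordinates are 1-indexed.

```python
import html
import re
from typing import Iterator, List, Tuple, Union

Nested = Union[str, List["Nested"]]

# Assumption: a footnote is a marker followed directly by an <i class="footnote"> body.
FOOTNOTE_RE = re.compile(
    r'<sup class="footnote-marker">.*?</sup>\s*<i class="footnote">.*?</i>',
    re.DOTALL,
)
# Catches markers that have no inline body after them.
MARKER_RE = re.compile(r'<sup class="footnote-marker">.*?</sup>', re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def clean_text(raw: str) -> str:
    """Steps 1-3: drop footnotes, strip remaining tags, decode entities."""
    text = FOOTNOTE_RE.sub("", raw)
    text = MARKER_RE.sub("", text)
    text = TAG_RE.sub("", text)
    text = html.unescape(text)
    # \s also matches the non-breaking space produced by decoding &nbsp;.
    return re.sub(r"\s+", " ", text).strip()


def flatten(node: Nested, path: Tuple[int, ...] = ()) -> Iterator[Tuple[str, str]]:
    """Step 4: walk nested arrays and yield (coordinate, raw text) pairs."""
    if isinstance(node, str):
        if node.strip():
            yield ":".join(str(i) for i in path), node
        return
    for index, child in enumerate(node, start=1):
        yield from flatten(child, path + (index,))


def to_markdown_lines(book: str, nested: Nested) -> Iterator[str]:
    """Emit '[Book Coordinate] text' lines in traversal order."""
    for coordinate, raw in flatten(nested):
        cleaned = clean_text(raw)
        if cleaned:
            yield f"[{book} {coordinate}] {cleaned}"
```

Because the arrays are walked in index order, the output is already in numeric coordinate order (`1:2` before `1:10`), which is the ordering that step 5 describes.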

Provider:

RockyCo
