OzLabs/hebrew-wiktionary-articles

Name: OzLabs/hebrew-wiktionary-articles
Creator: OzLabs
Published: 2026-03-14 19:37:50
License: No description available

Hugging Face · Updated 2026-03-14 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/OzLabs/hebrew-wiktionary-articles


Resource description:

---
language:
- he
license: cc-by-sa-3.0
task_categories:
- text-generation
- fill-mask
tags:
- wiktionary
- hebrew
- dictionary
- text
- dump
size_categories:
- 10K<n<100K
pretty_name: Hebrew Wiktionary Articles
dataset_info:
  features:
  - name: id
    dtype: int64
    description: MediaWiki page ID
  - name: title
    dtype: string
    description: Lemma (page title)
  - name: section
    dtype: string
    description: Sense/variant header (e.g. מָלוֹן, מֵלוֹן); empty for intro
  - name: text
    dtype: string
    description: Cleaned entry text (templates removed, links simplified)
  splits:
  - name: train
    num_bytes: null
    num_examples: null
---

# Hebrew Wiktionary Articles

Hebrew Wiktionary (ויקימילון) entries: one row per **sense/variant** (split by `==...==`), with cleaned text. Exported 2024-09-01.

## Data

- **Source**: [hewiktionary-20240901-pages-articles-multistream](https://dumps.wikimedia.org/hewiktionary/)
- **Schema**: `id` (page id), `title` (lemma), `section` (sense header, e.g. מָלוֹן), `text` (cleaned: templates removed, `[[x|y]]` → y)
- **License**: CC BY-SA 3.0 (Wiktionary)

## Usage

```python
from datasets import load_dataset

# After uploading to Hub (replace ORG/REPO with your repo id):
ds = load_dataset("parquet", data_files="https://huggingface.co/datasets/ORG/REPO/resolve/main/data/train.parquet", split="train")
# or
ds = load_dataset("ORG/REPO", trust_remote_code=True)
```

## Notes

- One row per sense (each `==...==` section on a page). Multiple rows per lemma when a page has multiple variants (e.g. מלון → מָלוֹן, מֵלוֹן, מִלּוֹן).
- Templates `{{...}}` removed; `[[link|display]]` replaced with display text.
- Redirects and main-namespace-only.
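
The cleaning rules in the notes above (split on `==...==` headers, drop `{{...}}` templates, reduce `[[link|display]]` to its display text) can be illustrated with a minimal sketch. This is not the pipeline that produced the dataset, which the card does not publish; the function names `clean_wikitext` and `split_sections` and the regular expressions are assumptions chosen for illustration, and real wikitext has many cases (tables, HTML tags, file links) that this sketch ignores.

```python
import re

# Hypothetical patterns, assumed for illustration only.
TEMPLATE_RE = re.compile(r"\{\{[^{}]*\}\}")                        # innermost {{...}}
LINK_RE = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]")       # [[x|y]] or [[y]]
HEADER_RE = re.compile(r"^==([^=].*?)==[ \t]*$", re.MULTILINE)     # level-2 headers only


def clean_wikitext(text: str) -> str:
    """Remove templates and replace wiki links with their display text."""
    previous = None
    # Strip innermost templates repeatedly so nested ones disappear too.
    while previous != text:
        previous = text
        text = TEMPLATE_RE.sub("", text)
    text = LINK_RE.sub(r"\1", text)
    return text.strip()


def split_sections(wikitext: str) -> list[tuple[str, str]]:
    """Return (section, text) pairs; the intro gets an empty section name."""
    # With one capture group, re.split yields [intro, header1, body1, ...].
    parts = HEADER_RE.split(wikitext)
    rows = [("", clean_wikitext(parts[0]))]
    for header, body in zip(parts[1::2], parts[2::2]):
        rows.append((header.strip(), clean_wikitext(body)))
    # Drop an empty intro, but keep named sections even if their text is empty.
    return [(section, text) for section, text in rows if section or text]
```

Each returned pair corresponds to one row's `section` and `text` fields; a page with several `==...==` headers therefore yields several rows sharing the same `title`.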

Provider:

OzLabs
