OzLabs/hebrew-wiktionary-articles
Hugging Face · Updated 2026-03-14 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/OzLabs/hebrew-wiktionary-articles
Description:
---
language:
- he
license: cc-by-sa-3.0
task_categories:
- text-generation
- fill-mask
tags:
- wiktionary
- hebrew
- dictionary
- text
- dump
size_categories:
- 10K<n<100K
pretty_name: Hebrew Wiktionary Articles
dataset_info:
  features:
  - name: id
    dtype: int64
    description: MediaWiki page ID
  - name: title
    dtype: string
    description: Lemma (page title)
  - name: section
    dtype: string
    description: Sense/variant header (e.g. מָלוֹן, מֵלוֹן); empty for intro
  - name: text
    dtype: string
    description: Cleaned entry text (templates removed, links simplified)
  splits:
  - name: train
    num_bytes: null
    num_examples: null
---
# Hebrew Wiktionary Articles
Hebrew Wiktionary (ויקימילון) entries: one row per **sense/variant** (split by `==...==`), with cleaned text. Exported 2024-09-01.
## Data
- **Source**: [hewiktionary-20240901-pages-articles-multistream](https://dumps.wikimedia.org/hewiktionary/)
- **Schema**: `id` (page id), `title` (lemma), `section` (sense header, e.g. מָלוֹן), `text` (cleaned: templates removed, `[[x|y]]` → y)
- **License**: CC BY-SA 3.0 (Wiktionary)
## Usage
```python
from datasets import load_dataset
# After uploading to the Hub (replace ORG/REPO with your repo id):
ds = load_dataset(
    "parquet",
    data_files="https://huggingface.co/datasets/ORG/REPO/resolve/main/data/train.parquet",
    split="train",
)
# or load by repo id; a Parquet-only repo needs no trust_remote_code:
ds = load_dataset("ORG/REPO", split="train")
```
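Because a lemma can span several rows, it is often handy to collect the section headers per title. The following is a minimal sketch, not part of the dataset tooling: `senses_by_lemma` and `sample_rows` are hypothetical names, and the `id` and `text` values in the sample are placeholders rather than real dataset content. Only the column names and the מלון example come from this card.

```python
from collections import defaultdict


def senses_by_lemma(rows):
    """Map each title to its non-empty section headers, in row order."""
    grouped = defaultdict(list)
    for row in rows:
        # The intro row of a page has an empty section; skip it here.
        if row["section"]:
            grouped[row["title"]].append(row["section"])
    return dict(grouped)


# Hypothetical rows shaped like the schema above (id/text are placeholders).
sample_rows = [
    {"id": 1, "title": "מלון", "section": "", "text": "placeholder intro"},
    {"id": 1, "title": "מלון", "section": "מָלוֹן", "text": "placeholder"},
    {"id": 1, "title": "מלון", "section": "מֵלוֹן", "text": "placeholder"},
    {"id": 1, "title": "מלון", "section": "מִלּוֹן", "text": "placeholder"},
]

variants = senses_by_lemma(sample_rows)
```

Iterating over a loaded `Dataset` yields one dict per row, so `senses_by_lemma(ds)` should work the same way on the real data.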
## Notes
- One row per sense (each `==...==` section on a page). Multiple rows per lemma when a page has multiple variants (e.g. מלון → מָלוֹן, מֵלוֹן, מִלּוֹן).
- Templates `{{...}}` removed; `[[link|display]]` replaced with display text.
- Redirects and main-namespace-only.
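
The export script itself is not included in this card. As a rough illustration of the steps listed above, the sketch below splits wikitext on level-2 `==...==` headers, strips `{{...}}` templates, and keeps only the display text of links. The regular expressions and the names `clean_wikitext` and `split_sections` are assumptions made for this example and may differ from the actual pipeline.

```python
import re

# Assumed patterns; the real export may handle more wikitext edge cases.
SECTION_RE = re.compile(r"^==[ \t]*([^=\n][^\n]*?)[ \t]*==[ \t]*$", re.MULTILINE)
TEMPLATE_RE = re.compile(r"\{\{[^{}]*\}\}")
LINK_RE = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]")


def clean_wikitext(text: str) -> str:
    """Remove {{...}} templates and reduce [[link|display]] to display."""
    # Repeat until stable so nested templates are removed inside-out.
    previous = None
    while previous != text:
        previous = text
        text = TEMPLATE_RE.sub("", text)
    # [[target|display]] -> display, [[target]] -> target.
    text = LINK_RE.sub(r"\1", text)
    return text.strip()


def split_sections(wikitext: str) -> list[tuple[str, str]]:
    """Return (section, text) pairs; the intro has an empty section name."""
    # With one capture group, re.split yields
    # [intro, header1, body1, header2, body2, ...].
    parts = SECTION_RE.split(wikitext)
    rows = [("", clean_wikitext(parts[0]))]
    for header, body in zip(parts[1::2], parts[2::2]):
        rows.append((header, clean_wikitext(body)))
    # Drop rows whose text is empty after cleaning.
    return [(section, text) for section, text in rows if text]
```

Each returned pair corresponds to the `section` and `text` columns of one row; `id` and `title` would come from the page metadata in the dump.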
Provider:
OzLabs



