OzLabs/hebrew-wiktionary-articles

Source: Hugging Face. Updated 2026-03-14. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/OzLabs/hebrew-wiktionary-articles
Description:
---
language:
- he
license: cc-by-sa-3.0
task_categories:
- text-generation
- fill-mask
tags:
- wiktionary
- hebrew
- dictionary
- text
- dump
size_categories:
- 10K<n<100K
pretty_name: Hebrew Wiktionary Articles
dataset_info:
  features:
  - name: id
    dtype: int64
    description: MediaWiki page ID
  - name: title
    dtype: string
    description: Lemma (page title)
  - name: section
    dtype: string
    description: Sense/variant header (e.g. מָלוֹן, מֵלוֹן); empty for intro
  - name: text
    dtype: string
    description: Cleaned entry text (templates removed, links simplified)
  splits:
  - name: train
    num_bytes: null
    num_examples: null
---

# Hebrew Wiktionary Articles

Hebrew Wiktionary (ויקימילון) entries: one row per **sense/variant** (split by `==...==`), with cleaned text. Exported 2024-09-01.

## Data

- **Source**: [hewiktionary-20240901-pages-articles-multistream](https://dumps.wikimedia.org/hewiktionary/)
- **Schema**: `id` (page id), `title` (lemma), `section` (sense header, e.g. מָלוֹן), `text` (cleaned: templates removed, `[[x|y]]` → y)
- **License**: CC BY-SA 3.0 (Wiktionary)

## Usage

```python
from datasets import load_dataset

# Load the Parquet file directly from the Hub.
# Replace ORG/REPO with your repo id (this listing's repo id is
# OzLabs/hebrew-wiktionary-articles).
ds = load_dataset(
    "parquet",
    data_files="https://huggingface.co/datasets/ORG/REPO/resolve/main/data/train.parquet",
    split="train",
)

# Or load by repo id. A Parquet-only repo needs no custom loading script,
# so trust_remote_code is not required.
ds = load_dataset("ORG/REPO")
```

## Notes

- One row per sense (each `==...==` section on a page). Multiple rows per lemma when a page has multiple variants (e.g. מלון → מָלוֹן, מֵלוֹן, מִלּוֹן).
- Templates `{{...}}` removed; `[[link|display]]` replaced with display text.
- Redirects and main-namespace-only.
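
The splitting and cleaning rules in the notes can be illustrated with a minimal sketch. This is not the export script used to build the dataset; the helper names `clean_wikitext` and `split_sections` and the regular expressions are assumptions chosen only to mirror the described behavior (level-2 `==...==` headers start a new row, `{{...}}` templates are dropped, `[[link|display]]` becomes the display text).

```python
import re

# Hypothetical helpers that mirror the rules described in the notes.
# They are an illustration, not the dataset's actual export code.
TEMPLATE_RE = re.compile(r"\{\{[^{}]*\}\}")                       # innermost {{...}}
LINK_RE = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]")      # [[x|y]] or [[y]]
HEADER_RE = re.compile(r"^==([^=].*?)==[ \t]*$", re.MULTILINE)    # level-2 headers only


def clean_wikitext(text):
    """Remove templates (including nested ones) and simplify links."""
    previous = None
    while previous != text:          # repeat until no innermost template is left
        previous = text
        text = TEMPLATE_RE.sub("", text)
    text = LINK_RE.sub(r"\1", text)  # keep only the display text of each link
    return text.strip()


def split_sections(title, wikitext):
    """Return one row per ==...== section; the intro gets an empty section."""
    parts = HEADER_RE.split(wikitext)  # [intro, header1, body1, header2, body2, ...]
    rows = []
    intro = clean_wikitext(parts[0])
    if intro:
        rows.append({"title": title, "section": "", "text": intro})
    for header, body in zip(parts[1::2], parts[2::2]):
        rows.append(
            {"title": title, "section": header.strip(), "text": clean_wikitext(body)}
        )
    return rows


page = "==מָלוֹן==\n{{ניתוח}}[[בית מלון|hotel]]\n==מֵלוֹן==\n[[melon]]\n"
for row in split_sections("מלון", page):
    print(row["section"], "->", row["text"])
```

Because one lemma can yield several rows, grouping on `title` (or on `id`) recovers all variants of a page.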
Provider:
OzLabs