tomron87/hebrew-wikipedia-sentences-corpus

Name: tomron87/hebrew-wikipedia-sentences-corpus
Creator: tomron87
Published: 2026-02-14 12:27:13
License: No description available

Source: Hugging Face. Updated 2026-02-14. Indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/tomron87/hebrew-wikipedia-sentences-corpus


Resource description:

---
language:
- he
license: cc-by-sa-3.0
tags:
- hebrew
- wikipedia
- sentence-corpus
- nlp
- monolingual
size_categories:
- 10M<n<100M
task_categories:
- text-classification
- token-classification
- sentence-similarity
- feature-extraction
dataset_info:
  features:
  - name: sentence_id
    dtype: string
  - name: sentence
    dtype: string
  - name: article_id
    dtype: int64
  - name: article_title
    dtype: string
  - name: categories
    dtype: string
  - name: sentence_position
    dtype: int64
  - name: word_count
    dtype: int64
  - name: hebrew_ratio
    dtype: float64
  splits:
  - name: train
    num_examples: 10999257
configs:
- config_name: default
  data_files:
  - split: train
    path: sentences.parquet
---

# Hebrew Wikipedia Sentences Corpus

A corpus of **10,999,257** cleaned, deduplicated Hebrew sentences extracted from **366,610** Hebrew Wikipedia articles.

## Dataset Description

This dataset contains Hebrew sentences extracted from Hebrew Wikipedia (crawled 2026-02). Each sentence has been cleaned, filtered for quality, and deduplicated. The dataset is intended for Hebrew NLP tasks including language modeling, text classification, NER, sentence similarity, and more.

### Source

[Hebrew Wikipedia](https://he.wikipedia.org/) via the MediaWiki API.

### Processing Pipeline

1. **Crawl**: fetched all Hebrew Wikipedia articles via `generator=allpages` + `prop=revisions`
2. **Extract**: converted wikitext to plain text, split into sentences using rule-based tokenization, filtered by length (5–50 words), Hebrew ratio (≥50%), and content quality
3. **Deduplicate**: removed exact duplicate sentences via SHA-256 hashing

## Schema

| Column | Type | Description |
|--------|------|-------------|
| `sentence_id` | string | Unique ID (`wiki_{article_id}_{sentence_idx}`) |
| `sentence` | string | Clean Hebrew sentence |
| `article_id` | int64 | Wikipedia article page ID |
| `article_title` | string | Article title |
| `categories` | string | Pipe-separated Wikipedia categories |
| `sentence_position` | int64 | Position of sentence within the article (0-indexed) |
| `word_count` | int64 | Number of whitespace-delimited tokens |
| `hebrew_ratio` | float64 | Ratio of Hebrew characters to total alphabetic characters |

## Statistics

| Metric | Value |
|--------|-------|
| Total sentences | 10,999,257 |
| Unique articles | 366,610 |
| Word count (mean) | 16.6 |
| Word count (median) | 15 |
| Word count (range) | 5–50 |
| Hebrew ratio (mean) | 0.982 |
| Hebrew ratio (median) | 1.000 |

## Usage

```python
from datasets import load_dataset

# Download the single "train" split and print its first record.
ds = load_dataset("tomron87/hebrew-wikipedia-sentences-corpus")
print(ds["train"][0])
```

## Intended Uses

- Hebrew language modeling and pretraining
- Text classification and NER
- Sentence similarity and semantic search
- Hebrew NLP research and benchmarking

## Limitations

- **Register**: Wikipedia text is encyclopedic and formal; it does not represent spoken Hebrew, social media, or informal writing.
- **Temporal**: Content reflects Hebrew Wikipedia as of 2026-02. Articles added or modified after this date are not included.
- **Bias**: Wikipedia's coverage is uneven across topics and may reflect systemic biases in editor demographics.

## License

CC BY-SA 3.0, inherited from Wikipedia content. See [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/).
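As an illustration of the Extract and Deduplicate steps in the processing pipeline, the following is a minimal sketch of the length filter, the Hebrew-ratio filter, and SHA-256 deduplication. It is not the author's code: the function names (`hebrew_ratio`, `keep`, `deduplicate`) are hypothetical, and counting only the letters U+05D0 to U+05EA as Hebrew is an assumption about how the ratio was computed. The thresholds (5–50 words, ratio ≥ 0.5) come from the dataset card.

```python
import hashlib
import re

# Assumption: "Hebrew characters" means the letters alef..tav (U+05D0..U+05EA).
HEBREW_LETTER = re.compile(r"[\u05D0-\u05EA]")


def hebrew_ratio(sentence: str) -> float:
    """Share of alphabetic characters that are Hebrew letters (0.0 if none)."""
    alphabetic = [ch for ch in sentence if ch.isalpha()]
    if not alphabetic:
        return 0.0
    hebrew = sum(1 for ch in alphabetic if HEBREW_LETTER.match(ch))
    return hebrew / len(alphabetic)


def keep(sentence: str, min_words: int = 5, max_words: int = 50,
         min_ratio: float = 0.5) -> bool:
    """Apply the length and Hebrew-ratio thresholds stated in the card."""
    word_count = len(sentence.split())  # whitespace-delimited tokens
    return (min_words <= word_count <= max_words
            and hebrew_ratio(sentence) >= min_ratio)


def deduplicate(sentences):
    """Drop exact duplicates, keeping the first occurrence of each sentence."""
    seen = set()
    unique = []
    for sentence in sentences:
        digest = hashlib.sha256(sentence.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sentence)
    return unique
```

Storing digests rather than the sentences themselves keeps the size of each entry in the seen-set fixed, which may matter at the scale of roughly eleven million sentences. The card's separate "content quality" filter is not specified, so it is left out of this sketch.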
## Citation

```bibtex
@dataset{hebrew_wikipedia_sentences,
  title = {Hebrew Wikipedia Sentences},
  author = {Tom Ron},
  year = {2026},
  url = {https://huggingface.co/datasets/tomron/hebrew-wikipedia-sentences},
  license = {CC BY-SA 3.0},
  note = {Generated on 2026-02-14}
}
```
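Two schema columns pack structure into strings: `sentence_id` follows the pattern `wiki_{article_id}_{sentence_idx}`, and `categories` is pipe-separated. A minimal sketch of unpacking them is shown below. The helper names are hypothetical, and the handling of an empty `categories` string is an assumption, since the card does not say how articles without categories are encoded.

```python
def parse_sentence_id(sentence_id: str) -> tuple[int, int]:
    """Split 'wiki_{article_id}_{sentence_idx}' into its two integers."""
    prefix, article_id, sentence_idx = sentence_id.rsplit("_", 2)
    if prefix != "wiki":
        raise ValueError(f"unexpected sentence_id format: {sentence_id!r}")
    return int(article_id), int(sentence_idx)


def split_categories(categories: str) -> list[str]:
    """Turn the pipe-separated categories string into a list of names."""
    # Assumption: an empty string means the article has no categories.
    return [name.strip() for name in categories.split("|") if name.strip()]
```

For example, `parse_sentence_id("wiki_12345_7")` yields the article ID and sentence index as integers, which can then be compared against the `article_id` column of the same row as a sanity check.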

Provider:

tomron87
