tomron87/hebrew-wikipedia-sentences-corpus
Hugging Face. Updated 2026-02-14; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/tomron87/hebrew-wikipedia-sentences-corpus
Resource description:
---
language:
- he
license: cc-by-sa-3.0
tags:
- hebrew
- wikipedia
- sentence-corpus
- nlp
- monolingual
size_categories:
- 10M<n<100M
task_categories:
- text-classification
- token-classification
- sentence-similarity
- feature-extraction
dataset_info:
  features:
  - name: sentence_id
    dtype: string
  - name: sentence
    dtype: string
  - name: article_id
    dtype: int64
  - name: article_title
    dtype: string
  - name: categories
    dtype: string
  - name: sentence_position
    dtype: int64
  - name: word_count
    dtype: int64
  - name: hebrew_ratio
    dtype: float64
  splits:
  - name: train
    num_examples: 10999257
configs:
- config_name: default
  data_files:
  - split: train
    path: sentences.parquet
---
# Hebrew Wikipedia Sentences Corpus
A corpus of **10,999,257** cleaned, deduplicated Hebrew sentences extracted from **366,610** Hebrew Wikipedia articles.
## Dataset Description
This dataset contains Hebrew sentences extracted from Hebrew Wikipedia (crawled in February 2026). Each sentence has been cleaned, filtered for quality, and deduplicated. The dataset is intended for Hebrew NLP tasks such as language modeling, text classification, NER, and sentence similarity.
### Source
[Hebrew Wikipedia](https://he.wikipedia.org/) via the MediaWiki API.
### Processing Pipeline
1. **Crawl** — fetched all Hebrew Wikipedia articles via `generator=allpages` + `prop=revisions`
2. **Extract** — converted wikitext to plain text, split into sentences using rule-based tokenization, filtered by length (5–50 words), Hebrew ratio (≥50%), and content quality
3. **Deduplicate** — removed exact duplicate sentences via SHA-256 hashing
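The filtering and deduplication steps above can be approximated with a short sketch. This is not the pipeline's actual code: the function names are hypothetical, the Hebrew letter range (U+05D0 to U+05EA) is an assumption about how "Hebrew characters" are counted, and the unspecified "content quality" filter is left out. It also assumes sentences are hashed after stripping surrounding whitespace.

```python
import hashlib


def hebrew_ratio(text: str) -> float:
    """Share of Hebrew letters among all alphabetic characters.

    Assumption: "Hebrew" means the letters U+05D0..U+05EA (alef to tav).
    """
    alpha = [ch for ch in text if ch.isalpha()]
    if not alpha:
        return 0.0
    hebrew = sum(1 for ch in alpha if "\u05D0" <= ch <= "\u05EA")
    return hebrew / len(alpha)


def keep_sentence(text: str, min_words: int = 5, max_words: int = 50,
                  min_ratio: float = 0.5) -> bool:
    """Apply the length (5-50 words) and Hebrew-ratio (>= 50%) filters."""
    n_words = len(text.split())  # whitespace-delimited tokens
    return min_words <= n_words <= max_words and hebrew_ratio(text) >= min_ratio


def filter_and_dedupe(sentences):
    """Keep sentences that pass the filters, dropping exact duplicates."""
    seen = set()
    kept = []
    for sentence in sentences:
        sentence = sentence.strip()
        if not keep_sentence(sentence):
            continue
        # Exact-duplicate detection via the SHA-256 digest of the UTF-8 bytes.
        digest = hashlib.sha256(sentence.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(sentence)
    return kept
```

Storing digests rather than full sentences keeps the `seen` set at a fixed size per entry, which matters at the scale of millions of sentences.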
## Schema
| Column | Type | Description |
|--------|------|-------------|
| `sentence_id` | string | Unique ID (`wiki_{article_id}_{sentence_idx}`) |
| `sentence` | string | Clean Hebrew sentence |
| `article_id` | int64 | Wikipedia article page ID |
| `article_title` | string | Article title |
| `categories` | string | Pipe-separated Wikipedia categories |
| `sentence_position` | int64 | Position of sentence within the article (0-indexed) |
| `word_count` | int64 | Number of whitespace-delimited tokens |
| `hebrew_ratio` | float64 | Ratio of Hebrew characters to total alphabetic characters |
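Two of the columns pack structure into a string: `sentence_id` follows `wiki_{article_id}_{sentence_idx}` and `categories` is pipe-separated. A minimal sketch of unpacking them is shown here; the helper names are hypothetical, and treating an empty or missing `categories` value as "no categories" is an assumption, since the card does not say how uncategorized articles are stored.

```python
def parse_sentence_id(sentence_id: str) -> tuple[int, int]:
    """Split 'wiki_{article_id}_{sentence_idx}' into (article_id, sentence_idx)."""
    prefix, article_id, sentence_idx = sentence_id.rsplit("_", 2)
    if prefix != "wiki":
        raise ValueError(f"unexpected sentence_id format: {sentence_id!r}")
    return int(article_id), int(sentence_idx)


def split_categories(categories) -> list[str]:
    """Turn the pipe-separated categories string into a list of names."""
    if not categories:  # assumption: empty string or None means no categories
        return []
    return [name.strip() for name in categories.split("|") if name.strip()]
```

For example, `parse_sentence_id("wiki_123_4")` yields `(123, 4)`.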
## Statistics
| Metric | Value |
|--------|-------|
| Total sentences | 10,999,257 |
| Unique articles | 366,610 |
| Word count (mean) | 16.6 |
| Word count (median) | 15 |
| Word count (range) | 5–50 |
| Hebrew ratio (mean) | 0.982 |
| Hebrew ratio (median) | 1.000 |
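Figures of this kind can be recomputed from the `word_count`, `hebrew_ratio`, and `article_id` columns. The sketch below works on any iterable of row dictionaries; the `summarize` helper is hypothetical and materializes all rows in memory, so for the full corpus a columnar tool would be a better fit.

```python
from statistics import mean, median


def summarize(rows):
    """Compute summary statistics over rows shaped like the schema above."""
    rows = list(rows)
    word_counts = [row["word_count"] for row in rows]
    ratios = [row["hebrew_ratio"] for row in rows]
    return {
        "total_sentences": len(rows),
        "unique_articles": len({row["article_id"] for row in rows}),
        "word_count_mean": mean(word_counts),
        "word_count_median": median(word_counts),
        "word_count_range": (min(word_counts), max(word_counts)),
        "hebrew_ratio_mean": mean(ratios),
        "hebrew_ratio_median": median(ratios),
    }
```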
## Usage
```python
from datasets import load_dataset
ds = load_dataset("tomron87/hebrew-wikipedia-sentences-corpus")
print(ds["train"][0])
```
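Because each row carries `article_id` and `sentence_position`, sentences can be regrouped by article in reading order. The sketch below is one way to do that over any iterable of rows, for instance a sample taken from the dataset loaded with `streaming=True`. The helper name is hypothetical. Sentences dropped by the filters are absent from the corpus, so the result should not be assumed to be the complete article text.

```python
from collections import defaultdict


def sentences_by_article(rows):
    """Group sentences by article_id, ordered by sentence_position."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["article_id"]].append(
            (row["sentence_position"], row["sentence"])
        )
    return {
        article_id: [text for _, text in sorted(items, key=lambda item: item[0])]
        for article_id, items in grouped.items()
    }
```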
## Intended Uses
- Hebrew language modeling and pretraining
- Text classification and NER
- Sentence similarity and semantic search
- Hebrew NLP research and benchmarking
## Limitations
- **Register**: Wikipedia text is encyclopedic and formal; it does not represent spoken Hebrew, social media, or informal writing.
- **Temporal**: Content reflects Hebrew Wikipedia as of 2026-02. Articles added or modified after this date are not included.
- **Bias**: Wikipedia's coverage is uneven across topics and may reflect systemic biases in editor demographics.
## License
CC BY-SA 3.0, inherited from Wikipedia content. See [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/).
## Citation
```bibtex
@dataset{hebrew_wikipedia_sentences,
title = {Hebrew Wikipedia Sentences},
author = {Tom Ron},
year = {2026},
url = {https://huggingface.co/datasets/tomron87/hebrew-wikipedia-sentences-corpus},
license = {CC BY-SA 3.0},
note = {Generated on 2026-02-14}
}
```
Provider:
tomron87



