Reza2kn/RaahNaameh-1-textual-corpus
Hugging Face. Updated 2026-03-16; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Reza2kn/RaahNaameh-1-textual-corpus
Resource description:
---
language:
- fa
license: cc-by-4.0
task_categories:
- sentence-similarity
- text-retrieval
- text-classification
tags:
- persian
- farsi
- embeddings
- raahnaameh
size_categories:
- 10M<n<100M
---
# RaahNaameh-1 Textual Corpus
A large-scale Persian text corpus assembled for training the RaahNaameh-1 embedding model.
## Sources
| Source | Sentences | Description |
|--------|-----------|-------------|
| Jomleh | 1,002,221 | Formal Persian web text |
| LSCP | 10,257,866 | Iranian tweets — colloquial, slang, emoji |
| Persian Wikipedia | 1,107,618 | Encyclopedic articles |
| **Total** | **12,367,705** | |
## Processing
- Light normalization only: Arabic→Persian character mapping, zero-width space removal
- Emojis, Finglish, code-switching, informal spelling are all preserved
- MD5-based deduplication across all sources
- Min length: 5 chars, Max length: 2000 chars
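The steps listed above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the author's actual pipeline: the card does not say which Arabic characters are mapped, so the two-entry table below (Arabic yeh and kaf) is a guess at the most common cases; the helper names `normalize` and `clean_corpus` are hypothetical; the length bounds are treated as inclusive; and hashing is assumed to happen after normalization.

```python
import hashlib

# Assumed mapping: the card does not list the exact characters it converts.
ARABIC_TO_PERSIAN = str.maketrans({
    "\u064A": "\u06CC",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06A9",  # Arabic kaf -> Persian keheh
})
# Only U+200B is removed here; U+200C (ZWNJ) is kept because Persian spelling uses it.
ZERO_WIDTH_SPACE = "\u200B"

MIN_CHARS, MAX_CHARS = 5, 2000  # bounds from the card, assumed inclusive


def normalize(text):
    """Apply light normalization; emoji, Finglish, and informal spelling are untouched."""
    return text.translate(ARABIC_TO_PERSIAN).replace(ZERO_WIDTH_SPACE, "")


def clean_corpus(sentences):
    """Yield normalized sentences that pass the length filter, dropping exact duplicates."""
    seen = set()
    for raw in sentences:
        text = normalize(raw)
        if not (MIN_CHARS <= len(text) <= MAX_CHARS):
            continue
        # MD5 digest of the UTF-8 bytes serves as a compact deduplication key.
        digest = hashlib.md5(text.encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text
```

Feeding sentences from all sources through a single `clean_corpus` call shares one `seen` set, which is what makes the deduplication work across sources rather than within each one.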
## Purpose
This corpus is the training data for RaahNaameh-1, an open Persian embedding model
created by distilling Gemini Embedding 2's knowledge into a compact student model.
## Usage
```python
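from datasets import load_dataset

# Stream the corpus instead of downloading every row up front.
ds = load_dataset("Reza2kn/RaahNaameh-1-textual-corpus", split="train", streaming=True)
for row in ds:
    # Each row exposes the sentence text and its source label.
    print(row["text"], row["source"])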
```
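Because the corpus is streamed, a quick way to inspect it is to tally the `source` field over a bounded number of rows. The helper below is a hypothetical sketch (`count_sources` is not part of the dataset or of the `datasets` library); it accepts any iterable of row dictionaries, so it works on the streamed `ds` from the snippet above as well as on plain lists.

```python
from collections import Counter
from itertools import islice


def count_sources(rows, limit=10_000):
    """Tally the `source` field over at most the first `limit` rows."""
    # islice stops after `limit` rows, so a streamed dataset is never fully consumed.
    return Counter(row["source"] for row in islice(rows, limit))
```

Calling `count_sources(ds, limit=10_000)` returns a `Counter` keyed by whatever source labels the dataset uses. The card does not document those label strings, and the stream is not necessarily shuffled, so the first rows may not reflect the overall proportions in the Sources table.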
Provider:
Reza2kn



