Reza2kn/RaahNaameh-1-textual-corpus

Name: Reza2kn/RaahNaameh-1-textual-corpus
Creator: Reza2kn
Published: 2026-03-16 22:37:43
License: not described on the listing page (the dataset card below declares cc-by-4.0)

Source: Hugging Face. Updated 2026-03-16. Indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/Reza2kn/RaahNaameh-1-textual-corpus

Resource description:

---
language:
- fa
license: cc-by-4.0
task_categories:
- sentence-similarity
- text-retrieval
- text-classification
tags:
- persian
- farsi
- embeddings
- raahnaameh
size_categories:
- 10M<n<100M
---

# RaahNaameh-1 Textual Corpus

A large-scale Persian text corpus assembled for training the RaahNaameh-1 embedding model.

## Sources

| Source | Sentences | Description |
|--------|-----------|-------------|
| Jomleh | 1,002,221 | Formal Persian web text |
| LSCP | 10,257,866 | Iranian tweets: colloquial, slang, emoji |
| Persian Wikipedia | 1,107,618 | Encyclopedic articles |
| **Total** | **12,367,705** | |

## Processing

- Light normalization only: Arabic→Persian character mapping, zero-width space removal
- Emojis, Finglish, code-switching, informal spelling are all preserved
- MD5-based deduplication across all sources
- Min length: 5 chars, Max length: 2000 chars

## Purpose

This corpus is the training data for RaahNaameh-1, an open Persian embedding model created by distilling Gemini Embedding 2's knowledge into a compact student model.

## Usage

```python
from datasets import load_dataset

ds = load_dataset("Reza2kn/RaahNaameh-1-textual-corpus", split="train", streaming=True)
for row in ds:
    print(row["text"], row["source"])
```
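The Processing steps listed in the dataset card (light normalization, MD5-based deduplication, and the 5 to 2000 character length filter) could be sketched roughly as follows. This is an illustration only, not the author's pipeline: the function names `normalize` and `clean_and_dedupe` are hypothetical, the character mapping covers just two common Arabic→Persian substitutions because the card does not list the full table, and the order of the steps (normalize, then filter, then deduplicate) is an assumption.

```python
import hashlib

# Assumed mapping: the card only says "Arabic→Persian character mapping",
# so these two common substitutions are illustrative, not the full table.
ARABIC_TO_PERSIAN = str.maketrans({
    "\u064a": "\u06cc",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06a9",  # Arabic kaf -> Persian keheh
})
ZERO_WIDTH_SPACE = "\u200b"  # U+200C (ZWNJ) is deliberately left untouched
MIN_CHARS, MAX_CHARS = 5, 2000  # length bounds stated in the card


def normalize(text: str) -> str:
    """Apply the light normalization: character mapping, zero-width space removal."""
    return text.translate(ARABIC_TO_PERSIAN).replace(ZERO_WIDTH_SPACE, "")


def clean_and_dedupe(sentences):
    """Yield normalized sentences that pass the length filter, skipping duplicates.

    Duplicates are detected by the MD5 digest of the UTF-8 encoded text.
    Assumption: one shared `seen` set is used across all sources.
    """
    seen = set()
    for raw in sentences:
        text = normalize(raw)
        if not (MIN_CHARS <= len(text) <= MAX_CHARS):
            continue
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text
```

Because the function is a generator, it can be chained onto any iterable of raw sentences, for example `clean_and_dedupe(row["text"] for row in ds)`. Storing hex digests rather than full sentences keeps the `seen` set at a fixed size per entry, which matters at the roughly 12 million sentence scale reported in the Sources table.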

Provider:

Reza2kn
