
Reza2kn/RaahNaameh-1-textual-corpus

Source: Hugging Face. Updated 2026-03-16. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Reza2kn/RaahNaameh-1-textual-corpus
Description:
---
language:
- fa
license: cc-by-4.0
task_categories:
- sentence-similarity
- text-retrieval
- text-classification
tags:
- persian
- farsi
- embeddings
- raahnaameh
size_categories:
- 10M<n<100M
---

# RaahNaameh-1 Textual Corpus

A large-scale Persian text corpus assembled for training the RaahNaameh-1 embedding model.

## Sources

| Source | Sentences | Description |
|--------|-----------|-------------|
| Jomleh | 1,002,221 | Formal Persian web text |
| LSCP | 10,257,866 | Iranian tweets: colloquial, slang, emoji |
| Persian Wikipedia | 1,107,618 | Encyclopedic articles |
| **Total** | **12,367,705** | |

## Processing

- Light normalization only: Arabic→Persian character mapping, zero-width space removal
- Emojis, Finglish, code-switching, and informal spelling are all preserved
- MD5-based deduplication across all sources
- Min length: 5 chars, Max length: 2000 chars

## Purpose

This corpus is the training data for RaahNaameh-1, an open Persian embedding model created by distilling Gemini Embedding 2's knowledge into a compact student model.

## Usage

```python
from datasets import load_dataset

ds = load_dataset("Reza2kn/RaahNaameh-1-textual-corpus", split="train", streaming=True)
for row in ds:
    print(row["text"], row["source"])
```
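
The processing steps listed in the card (light normalization, MD5-based deduplication, length filtering) could be sketched roughly as follows. This is a minimal illustration and not the author's actual pipeline. The card does not list the exact character mapping, so the two-entry `ARABIC_TO_PERSIAN` table is an assumption, as are the helper names `normalize` and `clean_corpus`, the choice to hash after normalization, and treating the length bounds as inclusive.

```python
import hashlib

# Assumed mapping: the card only says "Arabic→Persian character mapping".
ARABIC_TO_PERSIAN = str.maketrans({
    "\u064a": "\u06cc",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06a9",  # Arabic kaf -> Persian keheh
})
# Only U+200B is stripped here; U+200C (zero-width non-joiner) is left alone
# because Persian orthography uses it legitimately.
ZERO_WIDTH_SPACE = "\u200b"

# Bounds from the card; inclusive comparison is an assumption.
MIN_CHARS, MAX_CHARS = 5, 2000


def normalize(text: str) -> str:
    """Apply light normalization; emojis and informal spelling pass through."""
    return text.translate(ARABIC_TO_PERSIAN).replace(ZERO_WIDTH_SPACE, "")


def clean_corpus(sentences):
    """Yield normalized, length-filtered sentences, dropping MD5 duplicates."""
    seen = set()
    for raw in sentences:
        text = normalize(raw)
        if not (MIN_CHARS <= len(text) <= MAX_CHARS):
            continue
        digest = hashlib.md5(text.encode("utf-8")).digest()
        if digest in seen:
            continue  # an identical normalized sentence was already emitted
        seen.add(digest)
        yield text
```

Because `clean_corpus` is a generator that keeps only 16-byte digests in memory, a sketch like this could be chained over several source iterables in sequence to deduplicate across them without holding the sentences themselves.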
Provided by:
Reza2kn