
aligh4699/persian-spell-correction-dataset

Hugging Face · Updated 2025-11-15 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/aligh4699/persian-spell-correction-dataset
Resource description:
---
dataset_info:
  features:
  - name: original_id
    dtype: int64
  - name: noisy_text
    dtype: string
  - name: corrected_text
    dtype: string
  splits:
  - name: CorpusToCorrect
    num_bytes: 106937316
    num_examples: 99899
  - name: CorrectToCorrect
    num_bytes: 107668886
    num_examples: 99899
  - name: Augm05ToCorrect
    num_bytes: 106925362
    num_examples: 99899
  - name: Augm10ToCorrect
    num_bytes: 106954013
    num_examples: 99899
  - name: Augm15ToCorrect
    num_bytes: 106979581
    num_examples: 99899
  - name: Augm20ToCorrect
    num_bytes: 107008008
    num_examples: 99899
  - name: Augm25ToCorrect
    num_bytes: 107036623
    num_examples: 99899
  - name: Augm30ToCorrect
    num_bytes: 107063497
    num_examples: 99899
  - name: Augm35ToCorrect
    num_bytes: 107088354
    num_examples: 99899
  - name: Augm40ToCorrect
    num_bytes: 107116062
    num_examples: 99899
  - name: Augm45ToCorrect
    num_bytes: 107143838
    num_examples: 99899
  - name: Augm50ToCorrect
    num_bytes: 107172142
    num_examples: 99899
  download_size: 674076108
  dataset_size: 1285093682
configs:
- config_name: default
  data_files:
  - split: CorpusToCorrect
    path: data/CorpusToCorrect-*
  - split: CorrectToCorrect
    path: data/CorrectToCorrect-*
  - split: Augm05ToCorrect
    path: data/Augm05ToCorrect-*
  - split: Augm10ToCorrect
    path: data/Augm10ToCorrect-*
  - split: Augm15ToCorrect
    path: data/Augm15ToCorrect-*
  - split: Augm20ToCorrect
    path: data/Augm20ToCorrect-*
  - split: Augm25ToCorrect
    path: data/Augm25ToCorrect-*
  - split: Augm30ToCorrect
    path: data/Augm30ToCorrect-*
  - split: Augm35ToCorrect
    path: data/Augm35ToCorrect-*
  - split: Augm40ToCorrect
    path: data/Augm40ToCorrect-*
  - split: Augm45ToCorrect
    path: data/Augm45ToCorrect-*
  - split: Augm50ToCorrect
    path: data/Augm50ToCorrect-*
---

# Persian Spell Correction & Augmentation Dataset

This is a large-scale parallel dataset for Persian spell correction, text normalization, and augmentation. It is designed for training and evaluating models that correct a wide variety of common and synthetic errors in Persian text.

The dataset is built from two main components:

1. **Natural Data:** Text from diverse Persian corpora, paired with a clean, corrected version (`corrected_text`) generated by an LLM.
2. **Augmented Data:** The source text (`original_text`, exposed as `noisy_text` in the `CorpusToCorrect` split) synthetically "noised" by a Persian augmentation pipeline (`Ashoob`) at 10 different noise levels.

This multi-split structure allows models to be trained on specific noise levels, from naturally occurring errors to highly degraded text.

## How to Use

You can load the entire dataset (all 12 splits) at once using the `datasets` library:

```python
from datasets import load_dataset

# Load the entire DatasetDict (all 12 splits of the default config)
all_splits = load_dataset("aligh4699/persian-spell-correction-dataset")

# You can then access any split by its key
print(all_splits["Augm25ToCorrect"][0])
```

Alternatively, you can load a single, specific split (e.g., only the 25% augmented data):

```python
from datasets import load_dataset

# Load just one split
aug_25_data = load_dataset(
    "aligh4699/persian-spell-correction-dataset",
    split="Augm25ToCorrect",
)

print(aug_25_data[0])
```

## Dataset Structure

The dataset is a `DatasetDict` containing 12 splits under a single `default` configuration. Each split contains the same 99,899 rows, keyed by `original_id`, so entries map one-to-one across all splits.
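Because every split shares the same `original_id` values, rows from different splits can be paired on that key. The following is a minimal sketch rather than an official utility: the helper name `pair_by_id` and the sample rows are hypothetical, and plain lists of dictionaries stand in for split rows so the example runs without downloading anything (iterating a `datasets.Dataset` yields dictionaries of the same shape).

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, object]


def pair_by_id(left: Iterable[Row], right: Iterable[Row]) -> List[Tuple[Row, Row]]:
    """Pair rows from two splits that share the same original_id.

    Hypothetical helper: it relies only on the documented `original_id` column.
    """
    # Index the right-hand split once so each lookup is O(1).
    right_index = {row["original_id"]: row for row in right}
    pairs = []
    for row in left:
        match = right_index.get(row["original_id"])
        if match is not None:
            pairs.append((row, match))
    return pairs


# Made-up stand-in rows (not taken from the dataset).
corpus_rows = [
    {"original_id": 0, "noisy_text": "raw a", "corrected_text": "clean a"},
    {"original_id": 1, "noisy_text": "raw b", "corrected_text": "clean b"},
]
augmented_rows = [
    {"original_id": 1, "noisy_text": "rwa b", "corrected_text": "clean b"},
    {"original_id": 0, "noisy_text": "raw aa", "corrected_text": "clean a"},
]

pairs = pair_by_id(corpus_rows, augmented_rows)
for natural, augmented in pairs:
    print(natural["original_id"], natural["noisy_text"], "->", augmented["noisy_text"])
```

Pairing on the key instead of on row position keeps the alignment correct even if a split has been shuffled or filtered beforehand.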
### Data Splits

The 12 splits are designed for different training strategies:

| Split Name | `noisy_text` Source | `corrected_text` Source | Noise Parameters |
| :--- | :--- | :--- | :--- |
| **CorpusToCorrect** | `original_text` (from corpora) | LLM-Corrected | Natural errors |
| **CorrectToCorrect** | LLM-Corrected Text | LLM-Corrected Text | None (Clean-to-Clean) |
| **Augm05ToCorrect** | `original_text` + Augmentation | LLM-Corrected | 5% density, min\_dist=3 |
| **Augm10ToCorrect** | `original_text` + Augmentation | LLM-Corrected | 10% density, min\_dist=3 |
| **Augm15ToCorrect** | `original_text` + Augmentation | LLM-Corrected | 15% density, min\_dist=3 |
| **Augm20ToCorrect** | `original_text` + Augmentation | LLM-Corrected | 20% density, min\_dist=2 |
| **Augm25ToCorrect** | `original_text` + Augmentation | LLM-Corrected | 25% density, min\_dist=2 |
| **Augm30ToCorrect** | `original_text` + Augmentation | LLM-Corrected | 30% density, min\_dist=2 |
| **Augm35ToCorrect** | `original_text` + Augmentation | LLM-Corrected | 35% density, min\_dist=2 |
| **Augm40ToCorrect** | `original_text` + Augmentation | LLM-Corrected | 40% density, min\_dist=1 |
| **Augm45ToCorrect** | `original_text` + Augmentation | LLM-Corrected | 45% density, min\_dist=1 |
| **Augm50ToCorrect** | `original_text` + Augmentation | LLM-Corrected | 50% density, min\_dist=1 |

### Data Fields

Each split shares the same structure with three columns:

* **original\_id**: An integer ID that maps rows across all splits.
* **noisy\_text**: The input text, either taken from the original corpus or a synthetically noised version of it.
* **corrected\_text**: The target clean/corrected version of the text.

### Data Instances

Here is an example from the **CorpusToCorrect** split, showing a typical correction:

```json
{
  "original_id": 0,
  "noisy_text": "مواد نانو متخلخل زیر مجموعه ای از مواد نانو ساختار است که حفراتی در مقیاس نانومتر دارد. این نوع مواد با مساحت سطح درونی بسیار بالا، قابلیت بسیار زیادی در جذب و برهمکنش با اتم ها، مولکول ها و یون ها داشته و اهمیت زیادی از لحاظ تئوریک و کاربردی پیدا کرده اند.",
  "corrected_text": "مواد نانو متخلخل زیرمجموعه‌ای از مواد نانوساختار است که حفراتی در مقیاس نانومتر دارد. این نوع مواد با مساحت سطح درونی بسیار بالا، قابلیت بسیار زیادی در جذب و برهمکنش با اتم‌ها، مولکول‌ها و یون‌ها داشته و اهمیت زیادی از لحاظ تئوریک و کاربردی پیدا کرده‌اند."
}
```

-----

## Dataset Creation

### Source Data

The `original_text` was aggregated from a diverse set of large-scale Persian corpora, including:

* [Hamshahri Corpus](https://www.kaggle.com/datasets/ehsankhani/hamshahri-corpus/)
* [Tasnim News Dataset](https://www.kaggle.com/datasets/amirpourmand/tasnimdataset/)
* [Persian Wikipedia Dataset](https://www.kaggle.com/datasets/miladfa7/persian-wikipedia-dataset/)
* [Ensani Abstracts Dataset](https://www.kaggle.com/datasets/amirpourmand/ensani-abstracts/)

A single file of 100,000+ samples was curated from these sources to form the basis of this dataset; each released split contains 99,899 rows.

### Correction & Annotation

The `corrected_text` column was generated by processing the `original_text` through a fine-tuned LLM (based on `llama4-maverick`). The model was instructed to fix spelling, grammar, punctuation, and spacing errors (e.g., around the "می" and "ها" affixes) to produce a clean, standardized version of the text.

### Data Augmentation (Ashoob Pipeline)

The augmented splits (`Augm05ToCorrect` through `Augm50ToCorrect`) were created using a custom Persian noise pipeline. This `Ashoob` (آشوب) pipeline applies a variety of realistic errors to the `original_text`.

The noise generation is based on 13 different "AshoobSaz" (آشوب‌ساز) modules:

* **Word-Level Noise:**
  * `PatternColloquial`: Converts formal text to colloquial (spoken) forms.
  * `CommonColloquial`: Replaces words with common misspellings or slang.
* **Character-Level Noise:**
  * `FaKeyboardTouch`: Simulates typos based on adjacent keys on a Persian keyboard.
  * `CharacterDeletion`: Randomly deletes characters.
  * `CharacterInsertion`: Randomly inserts characters.
  * `CharacterTansposition`: Swaps adjacent characters.
  * `CharacterRepetition`: Randomly repeats a character.
  * `CharacterVisual`: Replaces characters with visually similar ones (e.g., "ی" vs "ی").
  * `CharacterPhonetic`: Replaces characters with phonetically similar ones (e.g., "ذ" vs "ز").
* **Structural Noise:**
  * `MiPrefixSpacing`: Creates spacing errors for the "می" (mi-) prefix in verbs.
  * `HaSuffixSpacingNoise`: Creates spacing errors for the "ها" (-ha) plural suffix.
  * `GeneralSpacingNoise`: Introduces other random spacing errors (deletions/insertions).
  * `PunctuationNoise`: Randomly deletes or adds punctuation.
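To make the "density" and "min\_dist" noise parameters more concrete, here is a toy character-deletion noiser. It is only a sketch and is not the Ashoob implementation, which is not shown in this card. It assumes that density is the target fraction of characters to corrupt and that min\_dist is the minimum gap between two corrupted positions; both interpretations, along with the function names `pick_noise_positions` and `delete_characters` and the sample sentence, are assumptions made for illustration.

```python
import random
from typing import List


def pick_noise_positions(length: int, density: float, min_dist: int,
                         rng: random.Random) -> List[int]:
    """Choose character indices to corrupt.

    Assumed semantics (not confirmed by the dataset card): `density` is the
    target fraction of characters to corrupt and `min_dist` is the minimum
    gap between any two corrupted indices.
    """
    target = int(length * density)
    candidates = list(range(length))
    rng.shuffle(candidates)
    chosen: List[int] = []
    for idx in candidates:
        if len(chosen) >= target:
            break
        # Accept the index only if it is far enough from every earlier pick.
        if all(abs(idx - other) >= min_dist for other in chosen):
            chosen.append(idx)
    return sorted(chosen)


def delete_characters(text: str, density: float, min_dist: int, seed: int = 0) -> str:
    """Toy analogue of a character-deletion noiser: drop the chosen indices."""
    rng = random.Random(seed)
    drop = set(pick_noise_positions(len(text), density, min_dist, rng))
    return "".join(ch for i, ch in enumerate(text) if i not in drop)


# Illustrative sentence (not taken from the dataset).
sample = "این یک جمله آزمایشی برای نمایش نویز است"
positions = pick_noise_positions(len(sample), density=0.10, min_dist=3,
                                 rng=random.Random(0))
noisy = delete_characters(sample, density=0.10, min_dist=3, seed=0)
print(positions)
print(noisy)
```

Under these assumptions, a larger min\_dist spreads errors out across the text while a smaller one lets them cluster, which would be consistent with the table pairing higher densities with smaller min\_dist values.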
Provided by:
aligh4699