aligh4699/persian-spell-correction-dataset
Source: Hugging Face. Updated 2025-11-15; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/aligh4699/persian-spell-correction-dataset
Description:
---
dataset_info:
  features:
  - name: original_id
    dtype: int64
  - name: noisy_text
    dtype: string
  - name: corrected_text
    dtype: string
  splits:
  - name: CorpusToCorrect
    num_bytes: 106937316
    num_examples: 99899
  - name: CorrectToCorrect
    num_bytes: 107668886
    num_examples: 99899
  - name: Augm05ToCorrect
    num_bytes: 106925362
    num_examples: 99899
  - name: Augm10ToCorrect
    num_bytes: 106954013
    num_examples: 99899
  - name: Augm15ToCorrect
    num_bytes: 106979581
    num_examples: 99899
  - name: Augm20ToCorrect
    num_bytes: 107008008
    num_examples: 99899
  - name: Augm25ToCorrect
    num_bytes: 107036623
    num_examples: 99899
  - name: Augm30ToCorrect
    num_bytes: 107063497
    num_examples: 99899
  - name: Augm35ToCorrect
    num_bytes: 107088354
    num_examples: 99899
  - name: Augm40ToCorrect
    num_bytes: 107116062
    num_examples: 99899
  - name: Augm45ToCorrect
    num_bytes: 107143838
    num_examples: 99899
  - name: Augm50ToCorrect
    num_bytes: 107172142
    num_examples: 99899
  download_size: 674076108
  dataset_size: 1285093682
configs:
- config_name: default
  data_files:
  - split: CorpusToCorrect
    path: data/CorpusToCorrect-*
  - split: CorrectToCorrect
    path: data/CorrectToCorrect-*
  - split: Augm05ToCorrect
    path: data/Augm05ToCorrect-*
  - split: Augm10ToCorrect
    path: data/Augm10ToCorrect-*
  - split: Augm15ToCorrect
    path: data/Augm15ToCorrect-*
  - split: Augm20ToCorrect
    path: data/Augm20ToCorrect-*
  - split: Augm25ToCorrect
    path: data/Augm25ToCorrect-*
  - split: Augm30ToCorrect
    path: data/Augm30ToCorrect-*
  - split: Augm35ToCorrect
    path: data/Augm35ToCorrect-*
  - split: Augm40ToCorrect
    path: data/Augm40ToCorrect-*
  - split: Augm45ToCorrect
    path: data/Augm45ToCorrect-*
  - split: Augm50ToCorrect
    path: data/Augm50ToCorrect-*
---
# Persian Spell Correction & Augmentation Dataset
This is a large-scale, parallel dataset for Persian spell correction, text normalization, and augmentation. It is designed to train and evaluate models for correcting a wide variety of common and synthetic errors in Persian text.
The dataset is built from two main components:
1. **Natural Data:** Raw text from diverse Persian corpora (`original_text`), paired with a clean, corrected version (`corrected_text`) generated by an LLM.
2. **Augmented Data:** The same `original_text`, synthetically "noised" by a Persian augmentation pipeline (`Ashoob`) at 10 different noise levels.
This multi-split structure allows for training models on specific noise conditions, from naturally occurring errors to highly degraded text.
## How to Use
You can load the entire dataset (all 12 splits) at once using the `datasets` library:
```python
from datasets import load_dataset

# Load the entire DatasetDict (all 12 splits)
all_splits = load_dataset("aligh4699/persian-spell-correction-dataset")

# You can then access any split by its key
print(all_splits["Augm25ToCorrect"][0])
```
Alternatively, you can load a single, specific split (e.g., only the 25% augmented data):
```python
from datasets import load_dataset

# Load just one split
aug_25_data = load_dataset(
    "aligh4699/persian-spell-correction-dataset",
    split="Augm25ToCorrect",
)
print(aug_25_data[0])
```
## Dataset Structure
The dataset is a `DatasetDict` containing 12 splits. Each split contains the same 99,899 rows, keyed by `original_id`, so entries map one-to-one across all splits.
### Data Splits
The 12 splits are designed for different training strategies:
| Split Name | `noisy_text` Source | `corrected_text` Source | Noise Parameters |
| :--- | :--- | :--- | :--- |
| **CorpusToCorrect** | `original_text` (from corpora) | LLM-Corrected | Natural errors |
| **CorrectToCorrect** | LLM-Corrected Text | LLM-Corrected Text | None (Clean-to-Clean) |
| **Augm05ToCorrect** | `original_text` + Augmentation | LLM-Corrected | 5% density, min\_dist=3 |
| **Augm10ToCorrect** | `original_text` + Augmentation | LLM-Corrected | 10% density, min\_dist=3 |
| **Augm15ToCorrect** | `original_text` + Augmentation | LLM-Corrected | 15% density, min\_dist=3 |
| **Augm20ToCorrect** | `original_text` + Augmentation | LLM-Corrected | 20% density, min\_dist=2 |
| **Augm25ToCorrect** | `original_text` + Augmentation | LLM-Corrected | 25% density, min\_dist=2 |
| **Augm30ToCorrect** | `original_text` + Augmentation | LLM-Corrected | 30% density, min\_dist=2 |
| **Augm35ToCorrect** | `original_text` + Augmentation | LLM-Corrected | 35% density, min\_dist=2 |
| **Augm40ToCorrect** | `original_text` + Augmentation | LLM-Corrected | 40% density, min\_dist=1 |
| **Augm45ToCorrect** | `original_text` + Augmentation | LLM-Corrected | 45% density, min\_dist=1 |
| **Augm50ToCorrect** | `original_text` + Augmentation | LLM-Corrected | 50% density, min\_dist=1 |
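
The noise parameters in the table follow directly from the split names, so they can be recovered in code when sweeping over noise levels. The sketch below is illustrative only: `AUGMENTED_SPLITS` and `noise_params` are hypothetical helper names, not part of the dataset or of the `datasets` library, and the `min_dist` thresholds are simply copied from the table.

```python
import re

# The ten augmented split names: Augm05ToCorrect ... Augm50ToCorrect.
AUGMENTED_SPLITS = [f"Augm{pct:02d}ToCorrect" for pct in range(5, 55, 5)]


def noise_params(split_name):
    """Return the table's noise parameters for an augmented split.

    Returns None for the two non-augmented splits
    (CorpusToCorrect and CorrectToCorrect).
    """
    match = re.fullmatch(r"Augm(\d{2})ToCorrect", split_name)
    if match is None:
        return None
    pct = int(match.group(1))
    # min_dist values as listed in the table: 3 up to 15%, 2 up to 35%, then 1.
    if pct <= 15:
        min_dist = 3
    elif pct <= 35:
        min_dist = 2
    else:
        min_dist = 1
    return {"density": pct / 100, "min_dist": min_dist}


for name in AUGMENTED_SPLITS:
    print(name, noise_params(name))
```

Ordering the splits this way is one possible basis for a curriculum that moves from light to heavy noise.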
### Data Fields
Each split shares the same structure with three columns:
* **original\_id**: An integer ID to map rows across all splits.
* **noisy\_text**: The input text, which is either from the original corpus or a synthetically noised version.
* **corrected\_text**: The target clean/corrected version of the text.
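
Because `original_id` is shared across splits, the same sentence can be compared at different noise levels. The following is a minimal sketch using toy in-memory rows with placeholder strings rather than real dataset content; `align_by_id` is a hypothetical helper. It should also work on splits loaded with `datasets`, assuming iteration yields dict-like rows.

```python
def align_by_id(split_a, split_b):
    """Pair the noisy_text of rows that share an original_id in both splits."""
    b_by_id = {row["original_id"]: row for row in split_b}
    pairs = []
    for row in split_a:
        other = b_by_id.get(row["original_id"])
        if other is not None:
            pairs.append((row["original_id"], row["noisy_text"], other["noisy_text"]))
    return pairs


# Toy stand-ins for two splits (placeholder strings, not real data).
light_noise = [
    {"original_id": 0, "noisy_text": "light 0", "corrected_text": "clean 0"},
    {"original_id": 1, "noisy_text": "light 1", "corrected_text": "clean 1"},
]
heavy_noise = [
    {"original_id": 1, "noisy_text": "heavy 1", "corrected_text": "clean 1"},
    {"original_id": 0, "noisy_text": "heavy 0", "corrected_text": "clean 0"},
]

pairs = align_by_id(light_noise, heavy_noise)
print(pairs)
```

Keying on `original_id` instead of row position keeps the pairing correct even if a split has been shuffled or filtered.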
### Data Instances
Here is an example from the **CorpusToCorrect** split, showing a typical correction:
```json
{
  "original_id": 0,
  "noisy_text": "مواد نانو متخلخل زیر مجموعه ای از مواد نانو ساختار است که حفراتی در مقیاس نانومتر دارد. این نوع مواد با مساحت سطح درونی بسیار بالا، قابلیت بسیار زیادی در جذب و برهمکنش با اتم ها، مولکول ها و یون ها داشته و اهمیت زیادی از لحاظ تئوریک و کاربردی پیدا کرده اند.",
  "corrected_text": "مواد نانو متخلخل زیرمجموعه‌ای از مواد نانوساختار است که حفراتی در مقیاس نانومتر دارد. این نوع مواد با مساحت سطح درونی بسیار بالا، قابلیت بسیار زیادی در جذب و برهمکنش با اتم‌ها، مولکول‌ها و یون‌ها داشته و اهمیت زیادی از لحاظ تئوریک و کاربردی پیدا کرده‌اند."
}
```
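
To see what a correction changes, the two fields can be compared token by token. The sketch below uses Python's standard `difflib` on a short fragment of the example above; `token_edits` and `strip_spacing` are hypothetical helper names, and the zero-width non-joiner (ZWNJ, U+200C) is written as an explicit escape so it stays visible in the code.

```python
import difflib

ZWNJ = "\u200c"  # zero-width non-joiner, used to attach Persian affixes

noisy = "اتم ها، مولکول ها و یون ها"
corrected = f"اتم{ZWNJ}ها، مولکول{ZWNJ}ها و یون{ZWNJ}ها"


def token_edits(source, target):
    """List the non-matching token spans between two whitespace-tokenized texts."""
    src, tgt = source.split(), target.split()
    matcher = difflib.SequenceMatcher(a=src, b=tgt)
    return [
        (op, src[i1:i2], tgt[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


def strip_spacing(text):
    """Remove spaces and ZWNJ so that only the letters and punctuation remain."""
    return text.replace(" ", "").replace(ZWNJ, "")


edits = token_edits(noisy, corrected)
for op, before, after in edits:
    print(op, before, "->", after)

# In this fragment the correction only changes spacing, not letters.
print(strip_spacing(noisy) == strip_spacing(corrected))
```

A token-level diff like this can also serve as a rough measure of how much a given row was edited.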
-----
## Dataset Creation
### Source Data
The `original_text` was aggregated from a diverse set of large-scale Persian corpora, including:
* [Hamshahri Corpus](https://www.kaggle.com/datasets/ehsankhani/hamshahri-corpus/)
* [Tasnim News Dataset](https://www.kaggle.com/datasets/amirpourmand/tasnimdataset/)
* [Persian Wikipedia Dataset](https://www.kaggle.com/datasets/miladfa7/persian-wikipedia-dataset/)
* [Ensani Abstracts Dataset](https://www.kaggle.com/datasets/amirpourmand/ensani-abstracts/)
A single file of roughly 100,000 samples (99,899 rows in each published split) was curated from these sources to form the basis of this dataset.
### Correction & Annotation
The `corrected_text` column was generated by processing the `original_text` through a fine-tuned LLM (based on `llama4-maverick`). This model was instructed to fix spelling, grammar, punctuation, and spacing errors (e.g., "می" and "ها" affixes) to produce a clean, standardized version of the text.
### Data Augmentation (Ashoob Pipeline)
The augmented splits (`Augm05ToCorrect` through `Augm50ToCorrect`) were created using a custom Persian noise pipeline. This `Ashoob` (آشوب) pipeline applies a variety of realistic errors to the `original_text`.
The noise generation is based on 13 different "AshoobSaz" (آشوب‌ساز) modules:
* **Word-Level Noise:**
    * `PatternColloquial`: Converts formal text to colloquial (spoken) forms.
    * `CommonColloquial`: Replaces words with common misspellings or slang.
* **Character-Level Noise:**
    * `FaKeyboardTouch`: Simulates typos based on adjacent keys on a Persian keyboard.
    * `CharacterDeletion`: Randomly deletes characters.
    * `CharacterInsertion`: Randomly inserts characters.
    * `CharacterTansposition`: Swaps adjacent characters.
    * `CharacterRepetition`: Randomly repeats a character.
    * `CharacterVisual`: Replaces characters with visually similar ones (e.g., Arabic "ي" vs. Persian "ی").
    * `CharacterPhonetic`: Replaces characters with phonetically similar ones (e.g., "ذ" vs "ز").
* **Structural Noise:**
    * `MiPrefixSpacing`: Creates spacing errors for the "می" (mi-) prefix in verbs.
    * `HaSuffixSpacingNoise`: Creates spacing errors for the "ها" (-ha) plural suffix.
    * `GeneralSpacingNoise`: Introduces other random spacing errors (deletions/insertions).
    * `PunctuationNoise`: Randomly deletes or adds punctuation.
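
The following is a minimal sketch of how character-level noise of this kind could be applied. It is not the Ashoob implementation. All function names are hypothetical, and the reading of the two table parameters is an assumption: `density` is taken as the fraction of tokens to corrupt and `min_dist` as the minimum gap, in tokens, between two corrupted positions. Only three of the character-level operations are imitated.

```python
import random


def transpose(word, rng):
    """Swap one randomly chosen pair of adjacent characters."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def delete_char(word, rng):
    """Delete one random character (words of length 1 are left alone)."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def repeat_char(word, rng):
    """Duplicate one random character."""
    if not word:
        return word
    i = rng.randrange(len(word))
    return word[:i + 1] + word[i] + word[i + 1:]


def pick_positions(n_tokens, density, min_dist, rng):
    """Choose up to round(n_tokens * density) positions at least min_dist apart."""
    target = round(n_tokens * density)
    candidates = list(range(n_tokens))
    rng.shuffle(candidates)
    chosen = []
    for pos in candidates:
        if len(chosen) >= target:
            break
        if all(abs(pos - other) >= min_dist for other in chosen):
            chosen.append(pos)
    return sorted(chosen)


def add_noise(text, density, min_dist, seed=0):
    """Corrupt a share of the whitespace-separated tokens in text."""
    rng = random.Random(seed)
    tokens = text.split()
    for pos in pick_positions(len(tokens), density, min_dist, rng):
        operation = rng.choice([transpose, delete_char, repeat_char])
        tokens[pos] = operation(tokens[pos], rng)
    return " ".join(tokens)


sample = "این نوع مواد با مساحت سطح درونی بسیار بالا"
print(add_noise(sample, density=0.25, min_dist=2))
```

Under this reading, a smaller `min_dist` at higher densities lets errors cluster, which would be consistent with the table pairing 40% to 50% density with `min_dist=1`.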
Provider: aligh4699



