Syrinesmati/tunisian-dialect-corpus
Source: Hugging Face. Updated 2026-04-15; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/Syrinesmati/tunisian-dialect-corpus
---
license: apache-2.0
---
# Dataset Card for Tunisian Dialect Corpus (Cleaned Arabic-Only)
## 1. Dataset Overview
### 1.1 Description
This dataset is a cleaned corpus of Tunisian Arabic dialect text, aggregated from multiple public sources on Hugging Face. It is designed for **Continual Pretraining (CPT)** and general NLP research.
A dedicated preprocessing pipeline was applied to:
- Normalize text
- Remove noise and artifacts
- Filter non-Arabic content
- Ensure higher overall data quality
The final dataset focuses on **Arabic-script Tunisian dialect (ar-TN)** with reduced noise and improved consistency.
---
### 1.2 Key Information
| Field | Value |
|---|---|
| **Curated by** | Syrinesmati |
| **Language(s)** | Arabic (Tunisian dialect, ar-TN) |
| **License** | Apache-2.0 |
| **Format** | Parquet |
| **Primary Field** | `text` |
---
## 2. Dataset Sources
This dataset is a merged and cleaned derivative of the following public resources:
| # | Source | Notes |
|---|---|---|
| 1 | `linagora/Tunisian_Derja_Dataset` | Transcripts kept as-is |
| 2 | `atakaboudi/Dialect_of_Tunisia-Work_Collection` | — |
| 3 | `tunis-ai/tunisian-msa-parallel-corpus` | — |
| 4 | `Arbi-Houssem/Tunisian_dataset_STT-TTS15s_filtred_organiser_Mixed` | STT/TTS dataset; target sentences and transcripts kept as-is |
| 5 | `linagora/linto-asr-ar-tn-0.1` | ASR dataset from multiple sources (YouTube_TNScrapped, TunswitchTO, TunswitchCS, ApprendreLeTunisien, Taric, OneStory); sentences kept as-is |
| 6 | Tunisian Arabic Dialects Identification — TADI | Binary classification dataset; only rows identified as Tunisian dialect extracted |
| 7 | Tunisian Algerian Dialect — TAD ([instadeepai/tunbert](https://github.com/instadeepai/tunbert)) | Binary classification dataset; only rows identified as Tunisian dialect extracted |
| 8 | `khaled123/tunninjaar` | Extracted from [derja.ninja](https://derja.ninja/) |
| 9 | Hala-Mulki / T-HSAB — Tunisian Hate Speech and Abusive Dataset | Only rows marked as non-hate-speech retained |
| 10 | Tunisian Reading Comprehension Dataset | QA dataset based on the Tunisian Constitution; 144 documents × 3 paragraphs × 3 QA pairs |
| 11 | Naim Mhedhbi — Tunisian Dialect Corpus v0 | ~40,000 Facebook comments/posts; only positive and neutral rows retained |
| 12 | TSAC — Tunisian Sentiment Analysis Corpus ([paperswithcode](https://paperswithcode.com/dataset/tsac)) | ~17,000 Facebook comments in Tunisian dialect |
| 13 | `khaled123/Testtun` | — |
| 14 | `khaled123/Tuniset` | — |
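
Several notes in the table describe label-based selection: Tunisian-only rows from TADI and TAD, non-hate-speech rows from T-HSAB, and positive or neutral rows from the Naim Mhedhbi corpus. The sketch below shows one way such a selection could look. It assumes each row is a dict with `text` and `label` fields; those field names, the source keys, and the label values are hypothetical, because this card does not document the upstream schemas.

```python
from typing import Iterable, Iterator

# Hypothetical keep-sets: the real upstream label names are not given in
# this card, so every key and value here is a placeholder.
KEEP_LABELS = {
    "tadi": {"tunisian"},
    "tad": {"tunisian"},
    "t_hsab": {"normal"},
    "mhedhbi_v0": {"positive", "neutral"},
}


def select_rows(rows: Iterable[dict], source: str) -> Iterator[str]:
    """Yield the text of rows whose label is in the keep-set for `source`."""
    keep = KEEP_LABELS[source]
    for row in rows:
        # Compare labels case-insensitively and ignore surrounding spaces.
        label = str(row.get("label", "")).strip().lower()
        text = row.get("text")
        if label in keep and text:
            yield text
```

Keeping the keep-sets in one mapping makes the per-source filtering rule easy to audit and adjust once the actual label names are known.
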
---
## 3. Intended Uses
### 3.1 Direct Use
This dataset is suitable for:
- Continual pretraining of Arabic or multilingual LLMs
- Domain adaptation for Tunisian dialect
- Text generation and understanding in Tunisian Arabic
- Corpus and linguistic analysis
- Data preparation for downstream NLP tasks
### 3.2 Out-of-Scope Use
This dataset is **not intended for**:
- High-stakes decision-making systems
- Surveillance or identity profiling
- Medical, legal, or financial applications without safeguards
- Supporting claims that it fully represents all Tunisian dialects
---
## 4. Dataset Statistics
| Metric | Value |
|---|---|
| **Total Tokens (GPT tokenization)** | 168,371,728 |
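
The card does not state which GPT tokenizer produced this figure. The sketch below shows how a comparable count could be computed, with the tokenizer passed in as a callable. The use of `tiktoken` and the `cl100k_base` encoding are assumptions, so the result may not match the number above.

```python
from functools import partial
from typing import Callable, Iterable, Sequence


def count_tokens(texts: Iterable[str], encode: Callable[[str], Sequence]) -> int:
    """Sum the number of tokens that `encode` produces over all texts."""
    return sum(len(encode(text)) for text in texts)


def gpt_encoder(name: str = "cl100k_base") -> Callable[[str], Sequence]:
    """Return a tiktoken encode function (the encoding name is an assumption)."""
    import tiktoken  # third-party package, imported lazily

    encoding = tiktoken.get_encoding(name)
    # disallowed_special=() encodes strings such as "<|endoftext|>" as
    # ordinary text instead of raising an error.
    return partial(encoding.encode, disallowed_special=())
```

With `tiktoken` installed, `count_tokens(texts, gpt_encoder())` gives a total for any iterable of strings. Different encodings will give different totals for the same text.
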
---
## 5. Dataset Structure
The dataset is distributed in **Parquet format** and contains:
- `text`: Cleaned Tunisian Arabic text
Optional metadata fields may be included depending on preprocessing stages.
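
A minimal sketch of reading the `text` field while ignoring any optional metadata columns. It assumes the `datasets` library is installed and that the repository exposes a `train` split; the split name is an assumption and should be checked against the repository files.

```python
from typing import Iterable, Iterator


def iter_texts(rows: Iterable[dict], field: str = "text") -> Iterator[str]:
    """Yield non-empty strings from `field`, ignoring any other columns."""
    for row in rows:
        value = row.get(field)
        if isinstance(value, str) and value.strip():
            yield value


def stream_corpus(
    repo_id: str = "Syrinesmati/tunisian-dialect-corpus",
    split: str = "train",  # assumed split name
) -> Iterator[str]:
    """Stream the corpus from the Hub (needs `datasets` and network access)."""
    from datasets import load_dataset  # third-party, imported lazily

    return iter_texts(load_dataset(repo_id, split=split, streaming=True))
```

Streaming avoids downloading every Parquet shard before the first row is available, which helps when only a sample is needed.
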
---
## 6. Dataset Creation
### 6.1 Motivation
Tunisian Arabic is significantly underrepresented in open NLP resources. This dataset aims to:
- Provide a **large-scale, clean corpus**
- Enable **reproducible research**
- Improve **dialectal Arabic model performance**
### 6.2 Data Collection
The dataset is built by:
- Aggregating multiple public datasets
- Merging and standardizing formats
- Removing redundancy and inconsistencies
### 6.3 Processing Pipeline
The cleaning pipeline includes:
- Text normalization (whitespace, formatting)
- Removal of noisy artifacts (HTML tags, social media patterns)
- Emoji and symbol cleanup
- Filtering non-Arabic or mixed-script text
- Optional removal of digits and hashtags
- Duplicate and near-duplicate removal
- Minimum length and word count filtering
- Arabic-only filtering (final dataset)
- Data shuffling before export
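
The steps above can be sketched with the Python standard library. This is an illustration under stated assumptions, not the author's actual pipeline: the regular expressions, the thresholds (`min_chars`, `min_words`), and the letters-only key used as a crude stand-in for near-duplicate detection are all hypothetical choices.

```python
import html
import random
import re
import unicodedata

HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+|www\.\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
DIGITS = re.compile(r"\d+")
WHITESPACE = re.compile(r"\s+")
# Arabic, Arabic Supplement, and Arabic Presentation Forms blocks.
ARABIC_CHAR = re.compile(r"[\u0600-\u06FF\u0750-\u077F\uFB50-\uFDFF\uFE70-\uFEFF]")


def clean(text, drop_digits=False, drop_hashtags=False):
    """Strip markup, URLs, mentions, symbols, and extra whitespace."""
    text = html.unescape(text)
    for pattern in (HTML_TAG, URL, MENTION):
        text = pattern.sub(" ", text)
    if drop_hashtags:
        text = HASHTAG.sub(" ", text)
    if drop_digits:
        text = DIGITS.sub(" ", text)
    # Unicode categories S* (symbols, including most emoji) and C*
    # (control, format, unassigned) are replaced by spaces.
    text = "".join(
        " " if unicodedata.category(ch)[0] in "SC" else ch for ch in text
    )
    return WHITESPACE.sub(" ", text).strip()


def is_arabic_only(text):
    """True if the text has letters and every letter is Arabic script."""
    letters = [ch for ch in text if ch.isalpha()]
    return bool(letters) and all(ARABIC_CHAR.match(ch) for ch in letters)


def dedup_key(text):
    """Letters only, so punctuation and spacing variants collide."""
    return "".join(ch for ch in text if ch.isalpha())


def build_corpus(raw_texts, min_chars=10, min_words=3, seed=0):
    """Clean, filter, deduplicate, and shuffle an iterable of strings."""
    seen, kept = set(), []
    for raw in raw_texts:
        text = clean(raw)
        if len(text) < min_chars or len(text.split()) < min_words:
            continue
        if not is_arabic_only(text):
            continue
        key = dedup_key(text)
        if key in seen:
            continue
        seen.add(key)
        kept.append(text)
    random.Random(seed).shuffle(kept)  # fixed seed keeps exports reproducible
    return kept
```

A production pipeline would more likely use MinHash or a similar similarity measure for near-duplicates; the letters-only key only catches variants that differ in punctuation, spacing, or diacritics.
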
### 6.4 Source Data Producers
Original data originates from contributors and maintainers of the upstream datasets. This dataset is a **cleaned and merged derivative version**.
### 6.5 Personal and Sensitive Information
As with many web-based corpora:
- Some personal or sensitive information may remain
- No guarantee of full anonymization is provided
Users should apply additional filtering if required for sensitive applications.
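
One possible form of such additional filtering is masking e-mail addresses and phone-like digit runs. The sketch below is a rough heuristic and not part of the published pipeline; both patterns and the placeholder tokens are assumptions, and they will miss names, addresses, and other identifiers.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Loose phone pattern: optional "+", then at least 8 digits that may be
# separated by single spaces, dots, or hyphens.
PHONE = re.compile(r"\+?\d(?:[\s.-]?\d){7,}")


def scrub_pii(text, email_token="[EMAIL]", phone_token="[PHONE]"):
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub(email_token, text)
    return PHONE.sub(phone_token, text)
```

Masking with a placeholder instead of deleting keeps sentence structure intact, which matters when the text is used for pretraining.
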
---
## 7. Bias, Risks, and Limitations
- **Source Bias:** Reflects biases of original datasets and platforms
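- **Language Bias:** Restricting the corpus to Arabic script excludes Latin-script (Arabizi) and code-switched text, so those common forms of written Tunisian are underrepresented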
- **Coverage Limitation:** Does not fully represent all Tunisian regions or sociolects
---
## 8. Citation
### 8.1 This Dataset
**BibTeX:**
```bibtex
@dataset{tunisian_dialect_corpus_cleaned_2026,
title = {Tunisian Dialect Corpus (Cleaned Arabic-Only)},
author = {Syrinesmati},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/Syrinesmati/tunisian-dialect-corpus}
}
```
**APA:**
Syrinesmati. (2026). *Tunisian Dialect Corpus (Cleaned Arabic-Only)* [Dataset]. Hugging Face. https://huggingface.co/datasets/Syrinesmati/tunisian-dialect-corpus
---
### 8.2 Citations for Included Datasets
**Tunisian Derja Dataset**
```bibtex
@dataset{linagora2025LLM-tn,
author = {Wajdi Ghezaiel and Jean-Pierre Lorré},
title = {Tunisian Derja Dataset},
year = {2025},
month = {January},
url = {https://huggingface.co/datasets/linagora/Tunisian_Derja_Dataset}
}
```
**Tunisian-English Dialectic Derja**
```bibtex
@dataset{Tunisian_English_dialectic_Derja,
author = {Khaled Bouzaiene},
title = {Tunisian-English Dialectic Derja Dataset},
year = {2024},
url = {https://huggingface.co/datasets/khaled123/Tunisian_English_dialectic_Derja}
}
```
**TunSwitch**
```bibtex
@misc{abdallah2023leveraging,
title = {Leveraging Data Collection and Unsupervised Learning for Code-switched Tunisian Arabic Automatic Speech Recognition},
author = {Ahmed Amine Ben Abdallah and Ata Kabboudi and Amir Kanoun and Salah Zaiem},
year = {2023},
eprint = {2309.11327},
archivePrefix = {arXiv},
primaryClass = {eess.AS}
}
```
**LinTO Textual Dataset (Tunisian Arabic)**
```bibtex
@misc{linagora2024Linto-tn,
author = {Hedi Naouara and Jérôme Louradour and Jean-Pierre Lorré},
title = {LinTO Audio and Textual Datasets to Train and Evaluate Automatic Speech Recognition in Tunisian Arabic Dialect},
year = {2024},
month = {October},
note = {Good Data Workshop, AAAI 2025},
howpublished = {\url{https://huggingface.co/linagora/linto-asr-ar-tn-0.1}}
}
```
**Arbi-Houssem STT/TTS Dataset**
```bibtex
@dataset{arbi_houssem_stt_tts,
author = {Arbi Houssem},
title = {Tunisian Dataset STT-TTS 15s Filtered Organised Mixed},
url = {https://huggingface.co/datasets/Arbi-Houssem/Tunisian_dataset_STT-TTS15s_filtred_organiser_Mixed}
}
```
**LinTO ASR — Tunisian Arabic (linto-asr-ar-tn-0.1)**
```bibtex
@misc{linagora2024linto_asr,
author = {Hedi Naouara and Jérôme Louradour and Jean-Pierre Lorré},
title = {LinTO ASR Tunisian Arabic Dialect Dataset},
year = {2024},
howpublished = {\url{https://huggingface.co/datasets/linagora/linto-asr-ar-tn-0.1}}
}
```
**Tunisian Arabic Dialects Identification (TADI)**
```bibtex
@dataset{tadi,
title = {Tunisian Arabic Dialects Identification (TADI)},
note = {Binary classification dataset for Tunisian vs. non-Tunisian Arabic dialect identification}
}
```
**Tunisian Algerian Dialect (TAD) — TunBERT**
```bibtex
@misc{tunbert_tad,
author = {InstaDeep},
title = {Tunisian Algerian Dialect Dataset},
howpublished = {\url{https://github.com/instadeepai/tunbert}}
}
```
**Tunninjaar — derja.ninja**
```bibtex
@dataset{tunninjaar,
author = {Khaled Bouzaiene},
title = {Tunninjaar},
url = {https://huggingface.co/datasets/khaled123/tunninjaar}
}
```
**T-HSAB — Tunisian Hate Speech and Abusive Dataset**
```bibtex
@inproceedings{mulki2019tsab,
author = {Hala Mulki and Hatem Haddad and Chedi Bechikh Ali and Halima Alshabani},
title = {T-HSAB: A Tunisian Hate Speech and Abusive Language Dataset},
booktitle = {Proceedings of the 7th International Conference on Arabic Language Processing},
year = {2019}
}
```
**Tunisian Reading Comprehension Dataset**
```bibtex
@dataset{tunisian_rc,
title = {Tunisian Reading Comprehension Dataset},
note = {Question-Answering dataset based on the Tunisian constitution; 144 documents, 3 paragraphs each, 3 QA pairs per paragraph}
}
```
**Naim Mhedhbi — Tunisian Dialect Corpus v0**
```bibtex
@dataset{mhedhbi_tunisian_v0,
author = {Naim Mhedhbi},
title = {Tunisian Dialect Corpus v0},
note = {~40,000 Facebook comments and posts labeled for sentiment}
}
```
**TSAC — Tunisian Sentiment Analysis Corpus**
```bibtex
@dataset{tsac,
title = {Tunisian Sentiment Analysis Corpus (TSAC)},
note = {~17,000 Facebook comments in Tunisian dialect},
howpublished = {\url{https://paperswithcode.com/dataset/tsac}}
}
```
**khaled123/Testtun**
```bibtex
@dataset{testtun,
author = {Khaled Bouzaiene},
title = {Testtun},
url = {https://huggingface.co/datasets/khaled123/Testtun}
}
```
**khaled123/Tuniset**
```bibtex
@dataset{tuniset,
author = {Khaled Bouzaiene},
title = {Tuniset},
url = {https://huggingface.co/datasets/khaled123/Tuniset}
}
```
---
## 9. Dataset Card Authors
[Syrinesmati](https://huggingface.co/Syrinesmati)
---
## 10. Contact
For questions, issues, or updates, please use the **Hugging Face dataset repository discussion/issues page**.
Provided by: Syrinesmati



