Syrinesmati/tunisian-dialect-corpus

Name: Syrinesmati/tunisian-dialect-corpus
Creator: Syrinesmati
Published: 2026-04-15 10:51:23
License: Apache-2.0 (per the dataset card)

Source: Hugging Face. Updated 2026-04-15; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/Syrinesmati/tunisian-dialect-corpus

Resource description:

---
license: apache-2.0
---

# Dataset Card for Tunisian Dialect Corpus (Cleaned Arabic-Only)

## 1. Dataset Overview

### 1.1 Description

This dataset is a cleaned corpus of Tunisian Arabic dialect text, aggregated from multiple public sources on Hugging Face. It is designed for **Continual Pretraining (CPT)** and general NLP research.

A dedicated preprocessing pipeline was applied to:

- Normalize text
- Remove noise and artifacts
- Filter non-Arabic content
- Ensure higher overall data quality

The final dataset focuses on **Arabic-script Tunisian dialect (ar-TN)** with reduced noise and improved consistency.

---

### 1.2 Key Information

| Field | Value |
|---|---|
| **Curated by** | Syrinesmati |
| **Language(s)** | Arabic (Tunisian dialect, ar-TN) |
| **License** | Apache-2.0 |
| **Format** | Parquet |
| **Primary Field** | `text` |

---

## 2. Dataset Sources

This dataset is a merged and cleaned derivative of the following public resources:

| # | Source | Notes |
|---|---|---|
| 1 | `linagora/Tunisian_Derja_Dataset` | Transcripts kept as-is |
| 2 | `atakaboudi/Dialect_of_Tunisia-Work_Collection` | - |
| 3 | `tunis-ai/tunisian-msa-parallel-corpus` | - |
| 4 | `Arbi-Houssem/Tunisian_dataset_STT-TTS15s_filtred_organiser_Mixed` | STT/TTS dataset; target sentences and transcripts kept as-is |
| 5 | `linagora/linto-asr-ar-tn-0.1` | ASR dataset from multiple sources (YouTube_TNScrapped, TunswitchTO, TunswitchCS, ApprendreLeTunisien, Taric, OneStory); sentences kept as-is |
| 6 | Tunisian Arabic Dialects Identification (TADI) | Binary classification dataset; only rows identified as Tunisian dialect extracted |
| 7 | Tunisian Algerian Dialect (TAD) ([instadeepai/tunbert](https://github.com/instadeepai/tunbert)) | Binary classification dataset; only rows identified as Tunisian dialect extracted |
| 8 | `khaled123/tunninjaar` | Extracted from [derja.ninja](https://derja.ninja/) |
| 9 | Hala-Mulki / T-HSAB: Tunisian Hate Speech and Abusive Dataset | Only rows marked as non-hate-speech retained |
| 10 | Tunisian Reading Comprehension Dataset | QA dataset based on the Tunisian Constitution; 144 documents × 3 paragraphs × 3 QA pairs |
| 11 | Naim Mhedhbi: Tunisian Dialect Corpus v0 | ~40,000 Facebook comments/posts; only positive and neutral rows retained |
| 12 | TSAC: Tunisian Sentiment Analysis Corpus ([paperswithcode](https://paperswithcode.com/dataset/tsac)) | ~17,000 Facebook comments in Tunisian dialect |
| 13 | `khaled123/Testtun` | - |
| 14 | `khaled123/Tuniset` | - |

---

## 3. Intended Uses

### 3.1 Direct Use

This dataset is suitable for:

- Continual pretraining of Arabic or multilingual LLMs
- Domain adaptation for Tunisian dialect
- Text generation and understanding in Tunisian Arabic
- Corpus and linguistic analysis
- Data preparation for downstream NLP tasks

### 3.2 Out-of-Scope Use

This dataset is **not intended for**:

- High-stakes decision-making systems
- Surveillance or identity profiling
- Medical, legal, or financial applications without safeguards
- Claims of full representativeness of Tunisian dialects

---

## 4. Dataset Statistics

| Metric | Value |
|---|---|
| **Total Tokens (GPT tokenization)** | 168,371,728 |

---

## 5. Dataset Structure

The dataset is distributed in **Parquet format** and contains:

- `text`: Cleaned Tunisian Arabic text

Optional metadata fields may be included depending on preprocessing stages.

---

## 6. Dataset Creation

### 6.1 Motivation

Tunisian Arabic is significantly underrepresented in open NLP resources. This dataset aims to:

- Provide a **large-scale, clean corpus**
- Enable **reproducible research**
- Improve **dialectal Arabic model performance**

### 6.2 Data Collection

The dataset is built by:

- Aggregating multiple public datasets
- Merging and standardizing formats
- Removing redundancy and inconsistencies

### 6.3 Processing Pipeline

The cleaning pipeline includes:

- Text normalization (whitespace, formatting)
- Removal of noisy artifacts (HTML tags, social media patterns)
- Emoji and symbol cleanup
- Filtering non-Arabic or mixed-script text
- Optional removal of digits and hashtags
- Duplicate and near-duplicate removal
- Minimum length and word count filtering
- Arabic-only filtering (final dataset)
- Data shuffling before export

### 6.4 Source Data Producers

Original data originates from contributors and maintainers of the upstream datasets. This dataset is a **cleaned and merged derivative version**.

### 6.5 Personal and Sensitive Information

As with many web-based corpora:

- Some personal or sensitive information may remain
- No guarantee of full anonymization is provided

Users should apply additional filtering if required for sensitive applications.

---

## 7. Bias, Risks, and Limitations

- **Source Bias:** Reflects biases of original datasets and platforms
- **Language Bias:** Focus on Arabic script reduces code-switching diversity
- **Coverage Limitation:** Does not fully represent all Tunisian regions or sociolects

---

## 8. Citation

### 8.1 This Dataset

**BibTeX:**

```bibtex
@dataset{tunisian_dialect_corpus_cleaned_2026,
  title     = {Tunisian Dialect Corpus (Cleaned Arabic-Only)},
  author    = {Syrinesmati},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/Syrinesmati/tunisian-dialect-corpus}
}
```

**APA:**

Syrinesmati. (2026). *Tunisian Dialect Corpus (Cleaned Arabic-Only)* [Dataset]. Hugging Face. https://huggingface.co/datasets/Syrinesmati/tunisian-dialect-corpus

---

### 8.2 Citations for Included Datasets

**Tunisian Derja Dataset**

```bibtex
@dataset{linagora2025LLM-tn,
  author = {Wajdi Ghezaiel and Jean-Pierre Lorré},
  title  = {Tunisian Derja Dataset},
  year   = {2025},
  month  = {January},
  url    = {https://huggingface.co/datasets/linagora/Tunisian_Derja_Dataset}
}
```

**Tunisian-English Dialectic Derja**

```bibtex
@dataset{Tunisian_English_dialectic_Derja,
  author = {Khaled Bouzaiene},
  title  = {Tunisian-English Dialectic Derja Dataset},
  year   = {2024},
  url    = {https://huggingface.co/datasets/khaled123/Tunisian_English_dialectic_Derja}
}
```

**TunSwitch**

```bibtex
@misc{abdallah2023leveraging,
  title         = {Leveraging Data Collection and Unsupervised Learning for Code-switched Tunisian Arabic Automatic Speech Recognition},
  author        = {Ahmed Amine Ben Abdallah and Ata Kabboudi and Amir Kanoun and Salah Zaiem},
  year          = {2023},
  eprint        = {2309.11327},
  archivePrefix = {arXiv},
  primaryClass  = {eess.AS}
}
```

**LinTO Textual Dataset (Tunisian Arabic)**

```bibtex
@misc{linagora2024Linto-tn,
  author       = {Hedi Naouara and Jérôme Louradour and Jean-Pierre Lorré},
  title        = {LinTO Audio and Textual Datasets to Train and Evaluate Automatic Speech Recognition in Tunisian Arabic Dialect},
  year         = {2024},
  month        = {October},
  note         = {Good Data Workshop, AAAI 2025},
  howpublished = {\url{https://huggingface.co/linagora/linto-asr-ar-tn-0.1}}
}
```

**Arbi-Houssem STT/TTS Dataset**

```bibtex
@dataset{arbi_houssem_stt_tts,
  author = {Arbi Houssem},
  title  = {Tunisian Dataset STT-TTS 15s Filtered Organised Mixed},
  url    = {https://huggingface.co/datasets/Arbi-Houssem/Tunisian_dataset_STT-TTS15s_filtred_organiser_Mixed}
}
```

**LinTO ASR: Tunisian Arabic (linto-asr-ar-tn-0.1)**

```bibtex
@misc{linagora2024linto_asr,
  author       = {Hedi Naouara and Jérôme Louradour and Jean-Pierre Lorré},
  title        = {LinTO ASR Tunisian Arabic Dialect Dataset},
  year         = {2024},
  howpublished = {\url{https://huggingface.co/datasets/linagora/linto-asr-ar-tn-0.1}}
}
```

**Tunisian Arabic Dialects Identification (TADI)**

```bibtex
@dataset{tadi,
  title = {Tunisian Arabic Dialects Identification (TADI)},
  note  = {Binary classification dataset for Tunisian vs. non-Tunisian Arabic dialect identification}
}
```

**Tunisian Algerian Dialect (TAD), TunBERT**

```bibtex
@misc{tunbert_tad,
  author       = {InstaDeep},
  title        = {Tunisian Algerian Dialect Dataset},
  howpublished = {\url{https://github.com/instadeepai/tunbert}}
}
```

**Tunninjaar (derja.ninja)**

```bibtex
@dataset{tunninjaar,
  author = {Khaled Bouzaiene},
  title  = {Tunninjaar},
  url    = {https://huggingface.co/datasets/khaled123/tunninjaar}
}
```

**T-HSAB: Tunisian Hate Speech and Abusive Dataset**

```bibtex
@inproceedings{mulki2019tsab,
  author    = {Hala Mulki and Hatem Haddad and Chedi Bechikh Ali and Halima Alshabani},
  title     = {T-HSAB: A Tunisian Hate Speech and Abusive Language Dataset},
  booktitle = {Proceedings of the 7th International Conference on Arabic Language Processing},
  year      = {2019}
}
```

**Tunisian Reading Comprehension Dataset**

```bibtex
@dataset{tunisian_rc,
  title = {Tunisian Reading Comprehension Dataset},
  note  = {Question-Answering dataset based on the Tunisian constitution; 144 documents, 3 paragraphs each, 3 QA pairs per paragraph}
}
```

**Naim Mhedhbi: Tunisian Dialect Corpus v0**

```bibtex
@dataset{mhedhbi_tunisian_v0,
  author = {Naim Mhedhbi},
  title  = {Tunisian Dialect Corpus v0},
  note   = {~40,000 Facebook comments and posts labeled for sentiment}
}
```

**TSAC: Tunisian Sentiment Analysis Corpus**

```bibtex
@dataset{tsac,
  title        = {Tunisian Sentiment Analysis Corpus (TSAC)},
  note         = {~17,000 Facebook comments in Tunisian dialect},
  howpublished = {\url{https://paperswithcode.com/dataset/tsac}}
}
```

**khaled123/Testtun**

```bibtex
@dataset{testtun,
  author = {Khaled Bouzaiene},
  title  = {Testtun},
  url    = {https://huggingface.co/datasets/khaled123/Testtun}
}
```

**khaled123/Tuniset**

```bibtex
@dataset{tuniset,
  author = {Khaled Bouzaiene},
  title  = {Tuniset},
  url    = {https://huggingface.co/datasets/khaled123/Tuniset}
}
```

---

## 9. Dataset Card Authors

[Syrinesmati](https://huggingface.co/Syrinesmati)

---

## 10. Contact

For questions, issues, or updates, please use the **Hugging Face dataset repository discussion/issues page**.
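
The card lists the cleaning steps of its processing pipeline (section 6.3) but does not publish the code or its thresholds. The following is a minimal sketch of how a few of those steps (artifact removal, whitespace normalization, minimum word count, Arabic-only filtering, exact deduplication) might be combined. It is not the author's implementation: the function names, the regular expressions, `MIN_WORDS`, and `ARABIC_RATIO` are all assumptions made for illustration, and near-duplicate removal and shuffling are left out.

```python
import re
import unicodedata

# Hypothetical thresholds: the dataset card does not state the real values.
MIN_WORDS = 3
ARABIC_RATIO = 0.9

HTML_TAG = re.compile(r"<[^>]+>")
URL_OR_MENTION = re.compile(r"https?://\S+|www\.\S+|@\w+")
ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")  # basic Arabic letters
WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Strip HTML tags, URLs, mentions and symbol characters, then collapse whitespace."""
    text = HTML_TAG.sub(" ", text)
    text = URL_OR_MENTION.sub(" ", text)
    # Emoji and similar pictographs fall in Unicode category "So"; "Sk" covers modifier symbols.
    text = "".join(ch for ch in text if unicodedata.category(ch) not in {"So", "Sk"})
    return WHITESPACE.sub(" ", text).strip()


def is_arabic_only(text: str, min_ratio: float = ARABIC_RATIO) -> bool:
    """Return True if at least min_ratio of the alphabetic characters are Arabic letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    arabic = sum(1 for ch in letters if ARABIC_LETTER.match(ch))
    return arabic / len(letters) >= min_ratio


def build_corpus(rows):
    """Clean each raw string, apply the length and script filters, and drop exact duplicates."""
    seen = set()
    kept = []
    for raw in rows:
        text = clean_text(raw)
        if len(text.split()) < MIN_WORDS:
            continue  # too short
        if not is_arabic_only(text):
            continue  # non-Arabic or mixed-script
        if text in seen:
            continue  # exact duplicate after cleaning
        seen.add(text)
        kept.append(text)
    return kept


# Made-up sample rows, not taken from the dataset.
sample = [
    "<p>برشا ناس تحب تونس</p>",
    "برشا ناس   تحب تونس",
    "hello world this is english",
    "يا حسرة",
    "شنوة أحوالك اليوم 😀 https://example.com",
]
cleaned = build_corpus(sample)
for line in cleaned:
    print(line)
```

A ratio-based script check is used here so that stray Latin characters do not discard an otherwise Arabic sentence; a stricter pipeline could set the ratio to 1.0 or remove the non-Arabic characters instead.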

Provided by:

Syrinesmati
